References
-
[1]
Fault-tolerance - an overview | ScienceDirect TopicsFault-tolerance is defined as the property by which a system continues to operate properly in the event of the failure of (or one or more faults within) some of ...
-
[2]
[PDF] Software Fault Tolerance: A TutorialFor some applications software safety is more important than reliability, and fault tolerance techniques used in those applications are aimed at preventing.
-
[3]
[PDF] Fundamental Concepts of DependabilityIn 1967, A. Avizienis integrated masking with the practical techniques of error detection, fault diagnosis, and recovery into the concept of fault-tolerant.
-
[4]
Software Fault Tolerance - Carnegie Mellon UniversitySoftware fault tolerance is the ability for software to detect and recover from a fault that is happening or has already happened in either the software or ...
-
[5]
[PDF] The Byzantine Generals Problem - Leslie LamportThe problem of coping with this type of failure is expressed abstractly as the Byzantine Generals Problem. We devote the major part of the paper to a.
-
[6]
[PDF] Practical Byzantine Fault ToleranceThis paper describes a new replication algorithm that is able to tolerate Byzantine faults. We believe that Byzantine-fault-tolerant algorithms will be ...
- [7]
-
[8]
[PDF] Von Neumann's Self-Reproducing AutomataABSTRACT. John von Neumann's kinematic and cellular automaton systems are described. A complete informal description of the cellular system is presented ...
-
[9]
[PDF] Computers in Spaceflight - NASA Technical Reports Server (NTRS)NASA's use of computer technology has encompassed a long period starting in 1958. During this period, hardware and software developments in the computer field.
-
[10]
A Brief History of the Internet - Internet Society... distributed automated algorithms, and better tools were devised to isolate faults. ... ARPANET was somehow related to building a network resistant to nuclear war.
-
[11]
The history of virtualization and its mark on data center managementOct 24, 2019 · The early 1990s saw the onset of several virtualization companies touting services and software to help admins better virtualize their workloads ...
-
[12]
What is fault-tolerant quantum computing? - IBMMay 30, 2025 · A fault-tolerant quantum computer is a quantum computer designed to operate correctly even in the presence of errors.
-
[13]
(PDF) AI-ENHANCED FAULT TOLERANCE IN MICROSERVICESSep 24, 2025 · This paper presents a systematic review of how artificial intelligence is integrated to improve fault tolerance in microservices architectures, ...
-
[14]
Simulating fail-stop in asynchronous distributed systemsThe fail-stop model makes two assumptions about the failure behavior of processes: that processes fail only by permanently crashing, and that when a process ...
-
[15]
From crash-stop to permanent omission - ACM Digital LibraryThis paper studies the impact of omission failures on asynchronous distributed systems with crash-stop failures. We provide two different transformations ...
-
[16]
The Byzantine Generals Problem - Leslie LamportThe problem of coping with this type of failure is expressed abstractly as the Byzantine Generals Problem. We devote the major part of the paper to a.
-
[17]
[PDF] Reliability Analysis of Fault Tolerant Memory Systems - arXivNov 23, 2023 · This paper analyzes fault-tolerant memory systems using Markov chains, scrubbing methods, and SEC-DED codes, exploring three models and ...
-
[18]
[PDF] A Mission Profile Based Reliability Modeling Framework for Fault ...system has failed (failure rate) is given by: $F(t) = 1 - e^{-\lambda t}$, and the probability that the system is operational (reliability rate) is given by: $R(t) = e^{-\lambda t}$.
-
[19]
Consensus in the presence of partial synchrony - ACM Digital LibraryIn an asynchronous system no fixed upper bounds Δ and Φ exist. In one version of partial synchrony, fixed bounds Δ and Φ exist, but they are not known a priori.
-
[20]
[PDF] Mixed Fault Tolerance Protocols with Trusted Execution EnvironmentAug 3, 2022 · Crash fault tolerance (CFT) protocols assume faulty nodes fail only by crashing, whereas Byzantine fault tolerance (BFT) protocols deal with ...
-
[21]
[PDF] FAULT MANAGEMENT HANDBOOK - NASAApr 2, 2012 · This Handbook is published by the National Aeronautics and Space Administration (NASA) as a guidance document to provide guidelines and ...
-
[22]
In-depth analysis of fault tolerant approaches integrated with load ...Oct 17, 2024 · Parameters: The parameters manipulated during fault tolerance are MTTF (Mean Time to Failure), MTBF (Mean Time Between Failure), MTTR (Mean ...
-
[23]
Disaster Recovery (DR) objectives - Reliability PillarRecovery Time Objective (RTO) Defined by the organization. RTO is the maximum acceptable delay between the interruption of service and restoration of service.
-
[24]
Formal analysis of feature degradation in fault-tolerant automotive ...Mar 1, 2018 · Graceful degradation can be applied when system resources become insufficient, reducing the set of provided functional features. In this paper, ...
-
[25]
Functional Safety FAQ - IECIEC 61508 relates the safety integrity level of a safety function to: the average probability of a dangerous failure on demand (in the case of low demand mode ...
-
[26]
[PDF] Effective Fault Management Guidelines - The Aerospace CorporationJun 5, 2009 · Fault Tolerance—The number of faults that the system must tolerate to meet its specifications. That is, a single fault tolerant space vehicle ...
- [27]
-
[28]
[PDF] Fault-Tolerant Computer StudyFeb 1, 1981 · of failed parts is not available, and the system is certain to fail after ... Redundant buses are required with no common failure mechanism ...
-
[29]
[PDF] Fault Tolerance in Tandem Computer Systems - cs.wisc.eduMay 5, 1990 · Fail-fast logic is required to prevent corruption of data in the event of a failure. Hardware checks (including parity, coding, and selfchecking) ...
-
[30]
[PDF] Fault Tolerance in Distributed Systems - UC Berkeley EECSMay 9, 2022 · Replicated State Machines typically rely on consensus protocols to provide availability and consistency. These applications also require high ...
-
[31]
Idempotence & Idempotent Design in IT/Tech Systems | SplunkJan 28, 2025 · Idempotent design ensures that the outcome of an operation is the same whether it is executed once or multiple times.
-
[32]
[PDF] The N-Version Approach to Fault-Tolerant SoftwareThe N-version approach to fault-tolerant software uses N-fold replications in time, space, and information to tolerate design faults.
-
[33]
Evaluating Fault Tolerance and Scalability in Distributed File SystemsFeb 4, 2025 · A distributed file system should be scalable to account for maintaining replicas and increasing fault tolerance as the number of files, size of ...
-
[34]
Fault tolerance in big data storage and processing systemsThis study aims to provide a consistent understanding of fault tolerance in big data systems and highlights common challenges that hinder the improvement in ...
-
[35]
[PDF] Final Report for Software Service History and Airborne Electronic ...Nov 1, 2016 · RTCA document DO-178C is the reference standard document used to discuss aircraft software safety assurance processes. This document ...
-
[36]
[PDF] FAULT-TOLERANT COMPUTING: AN OVERVIEW - COREdesign errors and hardware faults. The development of highly reliable ... Some examples are component failure rates, coverages and the relative frequency of ...
-
[37]
[PDF] Fault-Tolerant Computing: An Overview - DTICHybrid hardware redundancy combines the attractive features of both the active and passive approaches. Fault masking is used to prevent the system from producing ...
-
[38]
[PDF] Systolic Array Fault Tolerance Performance Analysis. - DTICApr 5, 1988 · Spatial redundancy and temporal redundancy are two generic approaches for fault tolerance. Spatial redundancy capitalizes on additional ...
-
[39]
[PDF] Reliability Analysis of k-out-of-n: G SystemThe k-out-of-n system structure is a very popular type of redundancy in fault tolerant systems with wide applications both in industrial and military systems.
-
[40]
[PDF] An Empirical Evaluation of Consensus Voting and Consensus ...In this paper we discuss system reliability performance offered by more advanced fault-tolerance mechanisms under more severe conditions. The primary goal of ...
-
[41]
Dependability in Embedded Systems: A Survey of Fault Tolerance ...Apr 16, 2024 · This paper presents a comprehensive survey of fault tolerance methods and software-based mitigation techniques in embedded systems.
-
[42]
[PDF] Implementing Fault-Tolerant Services Using the State Machine ...This paper reviews the approach and describes protocols for two different failure models-Byzantine and fail stop. System reconfiguration techniques for removing ...
-
[43]
[PDF] Vertical Paxos and Primary-Backup Replication - Leslie LamportWe focus on primary-backup replication, a class of replication protocols that has been widely used in practical distributed systems. We develop two new ...
-
[44]
[PDF] A Case for Redundant Arrays of Inexpensive Disks (RAID)RAID, based on magnetic disk tech, offers improvements in performance, reliability, power, and scalability, as an alternative to SLED.
-
[45]
[PDF] A Quorum-Consensus Replication Method for Abstract Data TypesThis paper introduces general quorum consensus, a new method for managing replicated data. A novel aspect of this method is that it systematically exploits type ...
-
[46]
[PDF] Paxos Made Simple - Leslie LamportNov 1, 2001 · We let the three roles in the consensus algorithm be performed by three classes of agents: proposers, acceptors, and learners. In an ...
-
[47]
[PDF] Brewer's Conjecture and the Feasibility of Consistent, Available ...In this note, we will first discuss what Brewer meant by the conjecture; next we will formalize these concepts and prove the conjecture;. *Laboratory for ...
-
[48]
[PDF] Fault-Tolerant Replication with Pull-Based Consensus in MongoDBThus, it does not tolerate faults like network partitions and could suffer from a "split-brain" if such faults happen. The main advantage of ...
-
[49]
[PDF] In Search of an Understandable Consensus AlgorithmMay 20, 2014 · The remainder of the paper introduces the replicated state machine problem (Section 2), discusses the strengths and weaknesses of Paxos (Section ...
-
[50]
[PDF] Heartbeat: A Timeout-Free Failure Detector for Quiescent Reliable ...This paper introduces heartbeat, a failure detector that can be implemented without timeouts, and shows how it can be used to solve the problem of quiescent ...
-
[51]
A Study of Fault Coverage of Standard and Windowed Watchdog ...Abstract: Both standard and windowed watchdog timers were designed to detect flow faults and ensure the safe operation of the systems they supervise.
- [52]
-
[53]
[PDF] The Recovery Manager of the System R Database Manager - McJonesThe Recovery Manager of the System R Database Manager. TRANSACTION LOG. 231 ... Jim Gray et al. ments which stress tested the recovery system. Jim. Mehl and ...
-
[54]
[PDF] Adapting Software Fault Isolation to Contemporary CPU ArchitecturesSoftware Fault Isolation (SFI) is an effective approach to sandboxing binary code of questionable provenance, an interesting use case for native plugins in a ...
- [55]
-
[56]
[PDF] Enhancing Server Availability and Security Through Failure ...Abstract. We present a new technique, failure-oblivious computing, that enables servers to execute through memory errors without memory corruption.
-
[57]
[PDF] Automatic Runtime Error Repair and ContainmentRCV implements recovery shepherding, which attaches to the application process when an error occurs, repairs the execution, tracks the repair effects as the ...
-
[58]
Circuit Breaker in Microservices: State of the Art and Future ProspectsApr 18, 2021 · This article provides an overview of recent research in circuit breaker, maps the research subject, and finds opportunities for future research.
-
[59]
[PDF] Large-scale cluster management at Google with BorgApr 23, 2015 · We present a summary of the Borg system architecture and features, important design decisions, a quantitative analysis of some of its policy ...
-
[60]
Quantum error correction below the surface code threshold - NatureDec 9, 2024 · Equipped with below-threshold logical qubits, we can now probe the sensitivity of logical error to various error mechanisms in this new regime.
-
[61]
Summary of Redundancy Management and Fault Tolerance in Space Shuttle Avionics
-
[62]
Tesla Autopilot Nine Times Safer than Human Driving - Applying AIOct 27, 2025 · Sensor Suite & Fusion: Eight surround cameras (250–850m range), twelve ultrasonic sensors (up to 8m), and forward-facing millimeter-wave radar ...
-
[63]
[PDF] TESLA'S AUTOPILOT: OVERCOMING AI AND HARDWARE ...Apr 7, 2024 · The power delivery system incorporates triple-redundant voltage regulators with real-time monitoring and fault detection capabilities ...
-
[64]
Power system security concepts and principles - IEAAn N-1 secure state is achieved when system conditions are such that a subsequent N-1 event could be absorbed without threatening stable system operation. See ...
-
[65]
[PDF] Self-Diagnostics Digitally Controlled Pacemaker/Defibrillators - DTIC3. The battery must last for approximately 10 years or greater. 4. The system must have a fault-tolerant mechanism.
- [66]
-
[67]
Fault-Tolerant Scheduling Mechanism for Dynamic Edge Computing ...Oct 30, 2024 · In this paper, we propose an innovative fault-tolerant scheduling model based on asynchronous graph reinforcement learning.
- [68]
-
[69]
Building an Adaptive and Resilient Multi-Communication Network ...Jan 13, 2023 · Abstract: Edge computing has gained attention in recent years due to the adoption of many Internet of Things (IoT) applications in domestic, ...
-
[70]
Knight Shows How to Lose $440 Million in 30 Minutes - BloombergAug 2, 2012 · In the mother of all computer glitches, market-making firm Knight Capital Group lost $440 million in 30 minutes on Aug. 1 when its trading ...
-
[71]
[PDF] therac.pdf - Nancy LevesonBetween June 1985 and January 1987, a computer-controlled radiation therapy machine, called the Therac-25, massively overdosed six people. These accidents ...
-
[72]
[PDF] An Investigation of the Therac-25 Accidents - Columbia CSSome of the most widely cited software-related accidents in safety-critical systems involved a computerized radiation therapy machine called the Therac-25.
-
[73]
AWS US-EAST-1 Outage: Postmortem and Lessons Learned - InfoQDec 18, 2021 · On December 7th AWS experienced an hours-long outage that affected many services in its most popular region, Northern Virginia.
-
[74]
[PDF] A Peer-to-Peer Electronic Cash System - Bitcoin.orgIn this paper, we propose a solution to the double-spending problem using a peer-to-peer distributed timestamp server to generate computational proof of the ...
-
[75]
[PDF] On the Formalization of Nakamoto ConsensusSep 26, 2017 · Nakamoto provides an informal claim that Bitcoin's fundamental mechanism provides a solution to the Byzantine generals problem in the ...
-
[76]
[PDF] Spanner: Google's Globally-Distributed DatabaseSpanner is a scalable, globally-distributed database designed, built, and deployed at Google. At the highest level of abstraction, it is a database that ...
-
[77]
Dark Side of Distributed Systems: Latency and Partition ToleranceMar 6, 2025 · Coordinating multiple nodes over unreliable networks introduces challenges around data consistency, system synchronization, and partial failures ...
-
[78]
Horizontal Pod Autoscaling - KubernetesMay 26, 2025 · In Kubernetes, a HorizontalPodAutoscaler automatically updates a workload resource (such as a Deployment or StatefulSet), with the aim of ...
-
[79]
AI augmented Edge and Fog computing: Trends and challengesEdge and Fog nodes are prone to different types of failures, including hardware failures, software failures, network failures and resource overflow (Bagchi et ...
-
[80]
DynamoDB read consistency - AWS DocumentationEventually consistent is the default read consistent model for all read operations. When issuing eventually consistent reads to a DynamoDB table or an index ...
-
[81]
Resilience and disaster recovery in Amazon DynamoDBResilient Amazon DocumentDB clusters leverage AWS Regions, Availability Zones, and fault-tolerant storage for high availability and data durability. August 3, ...
- [82]
- [83]
-
[84]
Fault Tolerance In Data Centers: Maximizing Reliability ... - DataBankJul 16, 2024 · To address scalability, organizations should design fault-tolerant systems with modular components that can be easily scaled horizontally. ...
- [85]
- [86]
- [87]
-
[88]
A Survey of Fault-Tolerance Techniques for Embedded Systems ...Jan 16, 2022 · This paper provides an in-depth survey of the emerging research efforts that exploit fault-tolerance techniques while considering timing, power/energy, and ...
-
[89]
The Downside of a Fault Tolerant System - Accendo ReliabilityThe Downside of a Fault Tolerant System · Masking or obscuring low-level failures · Increase in testing challenges · Increase in cost, weight, and complexity.
-
[90]
2.2: Faults, Failures, and Fault-Tolerant DesignSep 25, 2021 · A fault is an underlying defect, imperfection, or flaw that has the potential to cause problems, whether it actually has, has not, or ever will.
- [91]
-
[92]
Cost modelling of fault-tolerant software - ScienceDirect.comCosts of a simplex or single-version system are compared with the following three-version fault-tolerant software systems: N-version programming (NVP), ...
-
[93]
High availability versus fault tolerance - IBMA fault tolerant environment has no service interruption but a significantly higher cost, while a highly available environment has a minimal service ...
-
[94]
High Availability vs Fault Tolerance | Overview - NinjaOneJul 18, 2025 · Fault tolerant systems are much more costly and complex to implement and maintain than systems designed only for high availability. This is ...
-
[95]
Reliability design principles - Microsoft Azure Well-Architected ...Sep 30, 2025 · Simplicity reduces the surface area for control, minimizing inefficiencies and potential misconfigurations or unexpected interactions. On the ...
-
[96]
[PDF] THE PATH TO LOWEST TOTAL COST OF OWNERSHIP WITH ...High availability and fault-tolerant solutions not only produce a higher return by significantly reducing the cost of downtime, they also have a lower ...
-
[97]
The True Costs of Downtime in 2025: A Deep Dive by Business Size ...Jun 16, 2025 · Gartner (2024) highlights that retail e-commerce platforms lose $1 million to $2 million per hour during peak seasons, while manufacturing ...
-
[98]
ROI of Reducing MTTR: Real-World Benefits and Savings - SquadcastAug 8, 2024 · The ROI of reducing MTTR is reflected in enhanced productivity, significant cost savings, improved customer satisfaction, better employee morale, competitive ...
-
[99]
[PDF] Top Tech Trends of 2025: AI-powered everything - CapgeminiAs organizations face significant cost pressures, using smaller models, as well as running them closer to the edge will be key. • Inadequate technology/tooling ...
-
[100]
Top 10 software development trends in 2025 - NiotechoneAug 6, 2025 · Discover 2025's top software development trends: AI, low-code, DevOps, and automation driving the future of coding and innovation.
-
[101]
20 Test Automation Trends in 2025 - BrowserStackSome benefits of Scriptless Automation Testing include: Significant reduction in the cost of automation, hence, a good ROI; Requires little effort in setting ...
- [102]
- [103]
- [104]
- [105]
- [106]
- [107]
- [108]
- [109]
- [110]