
Failover

Failover is a critical mechanism in computing and networking designed to enhance reliability and availability by automatically or manually transferring operations from a primary system or component to a redundant backup when the primary fails due to hardware malfunction, software errors, network issues, or other disruptions, thereby minimizing downtime and data loss. This process ensures that services, such as web applications, databases, or network connections, remain accessible to users without significant interruption, often achieving recovery times measured in seconds or minutes.

In practice, failover operates within architectures like high-availability clusters, where multiple nodes or servers are interconnected and configured to monitor each other's health using heartbeat signals or monitoring software. When a failure is detected, such as a server crash or overload, the system redirects workloads, including active processes, data replication, and traffic routing, to the standby component, which may be a hot standby (fully synchronized and ready), a warm standby (partially synchronized), or a cold standby (requiring initialization). This redundancy is commonly implemented in cloud environments, data centers, and enterprise networks through tools like load balancers, virtual IP addressing, and clustering software from vendors such as Microsoft, Oracle, and Cisco.

The importance of failover lies in its role in business continuity and disaster recovery strategies, particularly for mission-critical applications in sectors like finance, healthcare, and telecommunications, where even brief outages can result in substantial financial losses or safety risks. It differs from related concepts like fault tolerance, which prevents interruptions entirely through continuous parallel operation, whereas failover accepts minimal disruption during the switch. Modern implementations often incorporate AI-driven monitoring to predict and preempt failures, further reducing recovery time objectives (RTO) and recovery point objectives (RPO).

Core Concepts

Definition and Purpose

Failover is the process of automatically or manually switching operations from a primary system or component to a redundant backup upon detection of a failure, thereby maintaining continuous service delivery and minimizing disruptions. This mechanism ensures that critical workloads, such as databases, applications, or network services, transfer seamlessly to the standby without significant data loss or user impact. Early commercial implementations of failover appeared in the 1970s within mainframe and minicomputer environments, where companies like Tandem Computers introduced fault-tolerant systems with hardware redundancy to handle failures in mission-critical applications like banking and stock exchanges.

The primary purpose of failover is to enable high availability (HA), targeting uptime levels such as 99.99% or higher, which translates to no more than about 52 minutes of annual downtime. By rapidly redirecting traffic or processing to backups, failover reduces service interruptions to near zero, supporting business continuity in environments where even brief outages can lead to substantial losses. Key benefits include enhanced system reliability through proactive failure handling, preservation of data integrity by synchronizing primary and backup states, and significant cost savings; for instance, a 2024 ITIC report indicates that over 90% of enterprises face hourly downtime costs exceeding $300,000 (approximately $5,000 per minute) due to lost revenue and productivity.

Effective failover requires built-in redundancy across hardware (e.g., duplicate servers or storage arrays), software (e.g., clustered applications), or networks (e.g., multiple paths for connectivity), ensuring the backup can assume operations without reconfiguration delays. This foundational capability forms the basis for HA architectures, distinguishing failover from mere backups by emphasizing real-time operational continuity rather than post-failure recovery.
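
The availability and downtime-cost figures above follow from simple arithmetic. The minimal Python sketch below (purely illustrative; the helper names and the flat $5,000-per-minute rate are assumptions based on the figures quoted above) converts an availability target into an annual downtime budget and a rough outage cost:

```python
# Minimal sketch: translate an availability target into permitted downtime
# and estimate the cost of using up that budget at a flat per-minute rate.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Annual downtime budget implied by an availability target (e.g. 99.99)."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

def outage_cost(duration_minutes: float, cost_per_minute: float = 5_000) -> float:
    """Rough outage cost at an assumed per-minute rate (not a universal figure)."""
    return duration_minutes * cost_per_minute

if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% availability -> {budget:.1f} min/year downtime budget "
              f"(~${outage_cost(budget):,.0f} at $5,000/min)")
```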

Key Components

Failover systems rely on redundant components to ensure continuity during failures. On the hardware side, these include sets of matching servers designed for seamless substitution, where identical or similar hardware across nodes minimizes compatibility issues during switches. Storage arrays, often configured with RAID levels such as RAID 1 for mirroring or RAID 5 for parity-based striping, provide fault-tolerant data access by distributing information across multiple disks. Additionally, redundant power supplies paired with uninterruptible power supplies (UPS) protect against outages, allowing systems to maintain operations long enough for failover to occur without interruption.

Software components form the core orchestration layer for failover. Clustering software, such as Pacemaker, manages resource allocation and automatic relocation of services to healthy nodes upon detecting issues. Heartbeat protocols, implemented via tools like Corosync, facilitate node communication and membership verification to coordinate synchronized state across the cluster. Load balancers, exemplified by HAProxy, distribute traffic while supporting failover at layers 4 and 7 to redirect flows dynamically. Virtual IP (VIP) addressing, managed as cluster resources like IPaddr2, enables transparent client redirection by floating the IP to the active node.

Architectural elements underpin the failover infrastructure. Shared storage solutions, including Storage Area Networks (SAN) for block-level access and Network-Attached Storage (NAS) for file-level sharing, allow multiple nodes to access the same data pool without conflicts. Replication mechanisms ensure data consistency, with synchronous replication mirroring writes in real time for zero data loss in low-latency environments, while asynchronous replication defers acknowledgment over longer distances at the cost of potential loss. Network redundancy, achieved through multiple Network Interface Cards (NICs) configured in bonding modes, provides failover paths by aggregating links for increased throughput and automatic switchover if a link fails.

Monitoring tools integrate as agents within the cluster to track system health. These agents collect metrics such as CPU load, memory usage, and I/O latency, enabling proactive alerts and triggering failover thresholds when anomalies exceed predefined limits. In Pacemaker clusters, for instance, resource agents perform periodic probes to verify service viability based on these indicators.

Integration challenges arise from ensuring component compatibility to eliminate single points of failure. Mismatched hardware or software versions can hinder seamless operation, requiring validation of identical configurations across nodes and thorough testing to confirm no bottlenecks exist in shared resources like storage or networking.
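
To illustrate the monitoring-agent role described above, the sketch below shows a simplified, resource-agent-style probe in Python: it periodically attempts a TCP connection to a managed service and reports the resource as failed after several consecutive misses. The host, port, interval, and threshold are illustrative assumptions, not the behavior of any particular clustering product:

```python
# Minimal sketch of a resource-agent-style "monitor" probe, loosely modeled on
# how cluster resource agents periodically verify that a managed service is
# still viable. Real agents would report status back to the cluster manager.
import socket
import time

def probe_tcp_service(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if the service accepts a TCP connection within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def monitor(host: str, port: int, interval: float = 10.0, max_failures: int = 3) -> None:
    """Probe repeatedly and report when consecutive failures cross a threshold."""
    failures = 0
    while True:
        if probe_tcp_service(host, port):
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                print(f"{host}:{port} considered failed; a real cluster manager "
                      f"would now relocate the resource to a healthy node")
                return
        time.sleep(interval)

# Example (hypothetical service on the local node):
# monitor("127.0.0.1", 5432)
```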

Implementation Mechanisms

Detection and Monitoring

Detection and monitoring in failover systems involve identifying failures that necessitate switching to redundant resources, ensuring minimal disruption to service availability. These processes rely on continuous surveillance of system health to detect anomalies promptly, triggering failover only when a failure is confirmed. Common detectable failure types include hardware faults such as disk or memory failures, which can render primary components inoperable; software crashes like application hangs or process terminations; network outages manifested as packet loss or connectivity disruptions; and overload conditions where resource utilization exceeds sustainable levels, potentially leading to degraded performance.

Monitoring techniques encompass heartbeat signals, where primary and secondary systems exchange periodic pings, typically every 10 seconds, to verify operational status, with failure declared after a configurable number of signals is missed; polling-based checks using protocols like SNMP to query metrics such as CPU load or interface status at regular intervals; and event logging via syslog to analyze logs for indicators of issues like error spikes or unexpected shutdowns. Threshold-based detection defines failure criteria through predefined limits, such as response times exceeding 5 seconds or error rates surpassing 10%, enabling proactive identification of degrading conditions before total collapse. Tools like Nagios employ plugin-based polling to monitor cluster states and alert on threshold breaches, while Prometheus scrapes metrics from endpoints to detect anomalies via rules evaluating latency or error ratios in real time.

To mitigate false positives, which could induce unnecessary failovers and introduce instability, systems implement multi-factor confirmation, combining heartbeat loss with supplementary checks like resource utilization queries or connection validations to ensure accurate verification. Detection windows typically range from 1 to 30 seconds, balancing rapid response against the risk of erroneous triggers; for instance, Oracle WebLogic declares a server failed after 30 seconds of missed heartbeats, while tuned configurations in systems like Oracle Data Guard achieve detection in 6 to 15 seconds on reliable networks.
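
As a rough illustration of heartbeat timeouts combined with multi-factor confirmation, the Python sketch below suspects a failure only after several heartbeat intervals pass without a signal, and then requires independent secondary checks to fail before recommending failover. The interval, threshold, and stubbed checks are assumptions rather than any vendor's defaults:

```python
# Minimal sketch of heartbeat-based failure detection with multi-factor
# confirmation, following the ideas described above.
import time
from typing import Callable

HEARTBEAT_INTERVAL = 10.0   # seconds between expected heartbeats (assumed)
MISSED_LIMIT = 3            # heartbeats that may be missed before suspicion

def suspect_failure(last_heartbeat: float, now: float) -> bool:
    """Primary is suspected if too many heartbeat intervals have elapsed."""
    return (now - last_heartbeat) > HEARTBEAT_INTERVAL * MISSED_LIMIT

def confirm_failure(checks: list[Callable[[], bool]]) -> bool:
    """Secondary checks (e.g. TCP connect, SNMP query) must all fail too."""
    return all(not check() for check in checks)

def should_fail_over(last_heartbeat: float,
                     checks: list[Callable[[], bool]],
                     now: float | None = None) -> bool:
    now = time.time() if now is None else now
    return suspect_failure(last_heartbeat, now) and confirm_failure(checks)

# Example with stubbed secondary checks (both report the primary unreachable):
if __name__ == "__main__":
    stale = time.time() - 60            # last heartbeat seen a minute ago
    checks = [lambda: False, lambda: False]
    print(should_fail_over(stale, checks))   # True -> recommend failover
```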

Switching and Recovery

The switching process in failover begins with quiescing the primary system to halt new incoming requests and ensure no ongoing transactions are interrupted mid-execution, preventing data inconsistencies during the transition. This step typically involves stopping application updates or services on the primary node. Next, control is transferred by reassigning the virtual IP (VIP) address from the primary to the standby node, allowing clients to seamlessly connect to the new primary without reconfiguration. The standby resources are then activated, starting services and mounting necessary storage or databases on the secondary node. Finally, operations resume on the secondary, with the system processing new requests as the new primary.

Recovery techniques during failover emphasize maintaining system integrity through synchronization, often achieved using database replication logs to apply pending transactions from the primary to the standby before promotion. Data consistency is verified via checksums, which compare computed values of data blocks across nodes to detect corruption or desynchronization. If the failover encounters issues, such as incomplete synchronization, rollback options allow reverting to the previous state by aborting the transition and restoring the original primary, minimizing further disruption.

Automation levels in failover range from scripted manual interventions, where administrators execute predefined commands, to fully automated systems like Pacemaker in high-availability clusters, which detect failures and orchestrate the entire process without human input. Fully automated setups achieve failover times typically between 5 and 60 seconds, depending on detection settings, network latency, and resource complexity. Potential disruptions include split-brain scenarios, where multiple nodes simultaneously attempt failover due to communication failures, risking data corruption from concurrent writes. These are prevented through fencing mechanisms, such as STONITH (Shoot The Other Node In The Head), which isolates the suspected failed node by powering it off or disconnecting it from shared resources. Success in switching and recovery is measured by mean time to recovery (MTTR), the average duration from failure detection to full service restoration, serving as a key performance indicator for high-availability systems.
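
The ordering of these steps can be summarized in a hedged Python sketch. Every function below is a hypothetical, logging-only stand-in for cluster-, storage-, or vendor-specific tooling; the point is only to show the sequence of quiescing, fencing an unreachable node, draining replication, moving the VIP, and starting services on the standby:

```python
# Minimal sketch of the switchover sequence described above. All helpers are
# hypothetical stubs; a real implementation would call cluster APIs, database
# clients, and power-management (fencing) interfaces.
import logging

log = logging.getLogger("failover")

def fail_over(primary: str, standby: str, vip: str) -> None:
    try:
        quiesce(primary)                   # stop accepting new work on the primary
    except ConnectionError:
        fence(primary)                     # STONITH-style isolation if it cannot be reached
    apply_pending_replication(standby)     # drain replication logs before promotion
    release_vip(primary, vip)              # no-op if the primary is already down or fenced
    assign_vip(standby, vip)               # clients follow the floating IP
    start_services(standby)                # mount storage, start databases/applications
    log.info("failover complete; %s is now primary", standby)

# Logging stubs so the sketch runs end to end.
def quiesce(node): log.info("quiescing %s", node)
def fence(node): log.info("fencing %s", node)
def apply_pending_replication(node): log.info("applying replication backlog on %s", node)
def release_vip(node, vip): log.info("releasing %s from %s", vip, node)
def assign_vip(node, vip): log.info("assigning %s to %s", vip, node)
def start_services(node): log.info("starting services on %s", node)

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    fail_over("db-primary", "db-standby", "10.0.0.10")
```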

Types of Failover

Active-Passive Configurations

In active-passive failover configurations, the primary node handles all incoming requests and operational workloads, while the secondary remains in an idle or standby state, continuously receiving data from the primary through replication mechanisms but not processing requests until a failure triggers activation. This setup ensures redundancy without concurrent resource utilization on the secondary, often employing shared storage or asynchronous replication to maintain data consistency. For instance, in cold standby scenarios, the secondary relies on periodic snapshots to capture the primary's state, allowing it to initialize quickly upon failover without requiring real-time synchronization.

These configurations offer several advantages, including lower overall resource overhead since the secondary does not engage in processing or load handling, which reduces hardware and licensing costs. Configuration is simpler, as there are no conflicts from simultaneous updates, making it easier to implement and maintain compared to more complex topologies. They are particularly suitable for non-real-time applications where brief interruptions are tolerable, providing a straightforward path to high availability without the need for advanced load balancing. However, active-passive setups have notable disadvantages, such as potential data lag from asynchronous replication, which can result in a recovery point objective (RPO) of several minutes depending on replication intervals and network conditions. Recovery times are often longer due to startup delays on the passive node, including mounting shared storage, initializing services, and processing any queued updates, potentially taking 5-10 minutes even in automated clusters.

Common examples include traditional database clustering, such as MySQL's master-slave replication, where the master handles all writes and reads while slaves asynchronously replicate data in a passive role, ready for promotion via manual or scripted failover if the master fails. In cloud environments, active-passive failover can be achieved through DNS-based switching, where monitoring tools track primary health and rely on short time-to-live (TTL) values to propagate traffic to a secondary only upon detection of failure. Implementation considerations emphasize the choice between manual intervention, such as administrator-promoted failover in smaller setups, and automatic activation using cluster management software like IBM HACMP or Pacemaker, which detect failures and orchestrate the switch. These configurations are especially cost-effective for small-scale deployments, as they minimize the cost of idle resources while providing reliable redundancy without the overhead of always-on active-active systems.
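
A minimal sketch of the DNS-based active-passive pattern might look like the Python watcher below, which probes the primary and repoints a record at the standby after repeated failures. The addresses, record name, TTL, and the update_dns_record helper are hypothetical placeholders for a real health probe and a DNS provider's API:

```python
# Minimal sketch of DNS-based active-passive failover (illustrative only).
import socket
import time

PRIMARY = "203.0.113.10"      # placeholder addresses from the documentation range
STANDBY = "203.0.113.20"
RECORD = "app.example.com"    # hypothetical record name
TTL = 30                      # short TTL so clients pick up the change quickly

def check_health(address: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Basic reachability probe; a real setup would check the application itself."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False

def update_dns_record(name: str, address: str, ttl: int) -> None:
    """Placeholder for a DNS provider's record-update API call."""
    print(f"pointing {name} (TTL {ttl}s) at {address}")

def watch(poll_interval: float = 10.0, failure_threshold: int = 3) -> None:
    failures = 0
    while True:
        failures = 0 if check_health(PRIMARY) else failures + 1
        if failures >= failure_threshold:
            update_dns_record(RECORD, STANDBY, TTL)   # failback left as a manual step
            return
        time.sleep(poll_interval)

# Example: watch()
```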

Active-Active Configurations

In active-active configurations, both primary and secondary systems operate concurrently, processing traffic and workloads simultaneously to provide redundancy and load distribution. A load balancer or routing mechanism, such as DNS-based policies or application delivery controllers, directs incoming requests across the active nodes using algorithms like round-robin or least-connections to ensure even distribution. Upon detecting a failure through health checks or probes, the system automatically redistributes the affected load to the remaining healthy nodes without interrupting service.

These setups offer several key advantages, including the potential for zero-downtime failover since no startup or initialization is required for the surviving nodes, which continue handling requests seamlessly. Resource utilization is maximized as both nodes contribute to processing, potentially doubling the overall throughput compared to single-node operations. Additionally, they enhance scalability by allowing horizontal expansion through additional active nodes, supporting higher traffic volumes in demanding environments. However, active-active configurations introduce complexities in data synchronization to prevent conflicts in shared resources, such as databases or caches, where concurrent writes from multiple nodes can lead to inconsistencies if not properly managed. They also incur higher operational costs due to the need for duplicate infrastructure and ongoing replication overhead. Furthermore, there is an elevated risk of cascading failures if the load balancer itself fails or if a widespread issue affects multiple nodes simultaneously.

Practical examples include load-balanced web server farms using NGINX, where multiple instances handle HTTP requests with integrated health checks to trigger failover and redistribute traffic dynamically. Another is Redis Enterprise's Active-Active databases, which enable multi-region replication for distributed caching, ensuring data availability across geographically dispersed nodes. Key implementation considerations involve maintaining session persistence through sticky sessions, where load balancers route subsequent requests from the same client to the original node using cookies or IP affinity to preserve stateful application data. For conflict resolution in shared data scenarios, protocols like quorum voting ensure agreement among nodes, requiring a majority consensus on updates to avoid race conditions and maintain consistency.
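
The load-distribution and redistribution behavior can be sketched with a simple round-robin pool that skips nodes failing their health checks. The node addresses and the in-memory health state below are illustrative assumptions, not a specific load balancer's implementation:

```python
# Minimal sketch of active-active request distribution with health checks:
# requests are spread round-robin across nodes, and a node marked unhealthy is
# skipped (its share redistributed) until it recovers.
import itertools

class ActiveActivePool:
    def __init__(self, nodes: list[str]):
        self.nodes = nodes
        self.healthy = set(nodes)
        self._cycle = itertools.cycle(nodes)

    def mark_unhealthy(self, node: str) -> None:
        self.healthy.discard(node)      # traffic is redistributed to the rest

    def mark_healthy(self, node: str) -> None:
        self.healthy.add(node)

    def next_node(self) -> str:
        if not self.healthy:
            raise RuntimeError("no healthy nodes available")
        while True:
            node = next(self._cycle)    # round-robin over the configured set
            if node in self.healthy:
                return node

if __name__ == "__main__":
    pool = ActiveActivePool(["10.0.0.1", "10.0.0.2"])
    print([pool.next_node() for _ in range(4)])   # alternates between both nodes
    pool.mark_unhealthy("10.0.0.2")               # simulated failure
    print([pool.next_node() for _ in range(4)])   # all traffic to the survivor
```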

Failback Procedures

Failback refers to the process of restoring operations to the original primary system after a failover event has been resolved, ensuring minimal disruption and data consistency. This reversion is critical in disaster recovery setups to return to the preferred primary configuration without introducing new points of failure. In cloud environments like AWS and Azure, failback involves reinstating the primary instance and redirecting traffic back to it once stability is confirmed. Recent advancements include AI-driven orchestration for predictive failback and reconciliation in multi-cloud setups.

The failback procedure typically follows a structured sequence of steps to mitigate risks. First, the health of the primary is verified by confirming its hardware, software, and operational readiness through health checks and diagnostic tools. Second, data is synchronized from the secondary back to the primary, often using replication mechanisms to apply any changes that occurred during the failover period. Third, the primary is tested under load to simulate production conditions and validate performance, stability, and error handling. Finally, traffic is switched back to the primary, updating routing configurations such as DNS records or load balancers to complete the reversion. In AWS Elastic Disaster Recovery, this process includes running a Failback Client on the source environment to facilitate replication back to the primary and to track progress via the console.

Failback can be executed manually or automatically, each with distinct advantages and drawbacks. Manual failback provides greater control, allowing administrators to perform thorough validations and intervene if issues arise, which is recommended in scenarios requiring custom assessments; however, it increases recovery time due to human involvement. Automatic failback, often scripted for immediate reversion, accelerates the process and reduces operational overhead, but carries risks such as unnecessary switches if the primary experiences intermittent issues, potentially leading to thrashing or oscillating failovers. Azure best practices advocate automatic failover paired with manual failback to balance speed and caution.

Data reconciliation is a core aspect of failback, addressing divergent states between primary and secondary systems that arise from operations during failover. Techniques include leveraging transaction logs to replay committed changes and applying delta synchronization to identify and merge only the discrepancies, thereby minimizing data loss or inconsistencies. In multi-region setups, this ensures transactional consistency across data stores, often requiring automated tools to compare datasets and resolve conflicts. AWS guidance emphasizes replicating data written to recovery instances back to the primary before final switchover.

Best practices for failback emphasize timing and oversight to optimize reliability. Procedures should be scheduled during low-traffic periods to limit user impact, with comprehensive monitoring for post-reversion stability, including metrics on latency, error rates, and resource utilization. Regular testing of failback runbooks in non-production environments helps refine these processes, and avoiding fully automatic failback prevents rapid oscillations. In multi-region deployments, business readiness confirmation, including communication plans, is essential alongside automated tools.

Challenges in failback include achieving zero-downtime transitions and managing complexity in state synchronization. Analogous to blue-green deployments, where traffic shifts seamlessly between environments, failback requires parallel validation of the primary to avoid interruptions, but this can introduce coordination overhead and potential for incomplete reconciliation if not orchestrated properly. Risks of data divergence or configuration mismatches further complicate reversion, particularly in distributed systems.
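
The four-step procedure above maps naturally onto a runbook script. The Python sketch below is a hypothetical outline only: each step is a printing stub standing in for real health-check, replication, load-testing, and traffic-routing tooling:

```python
# Minimal sketch of a failback runbook following the verify -> resynchronize ->
# test -> switch sequence described above. All helpers are illustrative stubs.
def verify_primary_health(primary: str) -> bool:
    print(f"step 1: running diagnostics on {primary}")
    return True

def resynchronize(primary: str, secondary: str) -> None:
    print(f"step 2: replaying changes from {secondary} back to {primary}")

def test_under_load(primary: str) -> bool:
    print(f"step 3: load-testing {primary} before cutover")
    return True

def switch_traffic(primary: str) -> None:
    print(f"step 4: repointing DNS / load balancer at {primary}")

def fail_back(primary: str, secondary: str) -> None:
    if not verify_primary_health(primary):
        raise RuntimeError("primary not ready; keep serving from secondary")
    resynchronize(primary, secondary)
    if not test_under_load(primary):
        raise RuntimeError("primary failed validation; aborting failback")
    switch_traffic(primary)

if __name__ == "__main__":
    fail_back("dc1-app", "dc2-app")
```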

Testing and Validation

Testing failover mechanisms is essential to ensure reliability in high-availability systems without disrupting production environments. Common approaches include chaos engineering, which involves deliberately injecting faults to observe system resilience; scripted simulations that automate failure scenarios in isolated test beds; and dry-run modes that preview failover actions without executing them. For instance, chaos engineering tools like Netflix's Chaos Monkey randomly terminate instances to validate automatic failover and recovery processes. Recent developments incorporate machine learning to automate and predict failure scenarios in chaos experiments, enhancing efficiency as of 2025. Scripted simulations, such as those using sandboxed test environments, allow teams to replicate outages and measure response without impacting live data. Dry-run modes enable pre-execution previews of failover commands, helping identify configuration issues in advance.

Validation of failover effectiveness relies on key metrics that quantify performance and risk tolerance. The failover success rate, typically targeted above 99%, measures the percentage of simulated failures that result in seamless switching to backup resources. Recovery Time Objective (RTO) assesses the duration to restore operations, often aiming for under 1 minute in modern setups to minimize downtime. Recovery Point Objective (RPO) evaluates acceptable data loss, with goals like less than 5 seconds for near-real-time replication systems. These metrics guide iterative refinements, ensuring systems meet service-level agreements.

Specialized tools and frameworks facilitate structured testing. Veritas Cluster Server includes a Cluster Simulator for modeling failover scenarios in a controlled environment, allowing administrators to test resource switching without hardware risks. The AWS Fault Injection Simulator (FIS) supports chaos experiments by simulating faults like instance failures or network disruptions, enabling validation of failover in cloud infrastructures. These tools integrate with monitoring systems to capture real-time data during tests.

Effective planning involves conducting regular drills, such as quarterly simulations, to maintain preparedness and comply with applicable standards. Each test should be documented, detailing outcomes, deviations from expectations, and action items for improvements, fostering a cycle of continuous enhancement. Tests may also briefly exercise failback procedures to verify full recovery cycles.

A frequent pitfall in failover testing is neglecting edge cases, such as multi-node failures where multiple components fail simultaneously, leading to cascading issues not captured in single-point simulations. Similarly, overlooking network partitions, where communication between nodes is severed, can result in undetected inconsistencies or split-brain scenarios during recovery. Addressing these requires comprehensive test coverage to build robust resilience.
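
Drill results are typically rolled up into the metrics described above. The following Python sketch (with invented drill records and illustrative targets) shows one way to summarize success rate, observed RTO, and observed RPO against thresholds:

```python
# Minimal sketch: summarizing failover drill results against success-rate,
# RTO, and RPO targets. The records and targets here are illustrative only.
from dataclasses import dataclass
from statistics import mean

@dataclass
class DrillResult:
    succeeded: bool           # did the workload switch over cleanly?
    recovery_seconds: float   # observed time to restore service (RTO sample)
    data_loss_seconds: float  # observed replication gap at switchover (RPO sample)

def summarize(results: list[DrillResult],
              rto_target: float = 60.0, rpo_target: float = 5.0) -> dict:
    ok = [r for r in results if r.succeeded]
    return {
        "success_rate_pct": 100 * len(ok) / len(results),
        "mean_rto_s": mean(r.recovery_seconds for r in ok),
        "mean_rpo_s": mean(r.data_loss_seconds for r in ok),
        "rto_target_met": all(r.recovery_seconds <= rto_target for r in ok),
        "rpo_target_met": all(r.data_loss_seconds <= rpo_target for r in ok),
    }

if __name__ == "__main__":
    drills = [DrillResult(True, 42.0, 1.2), DrillResult(True, 55.5, 3.8),
              DrillResult(False, 310.0, 9.0)]
    print(summarize(drills))
```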

Applications and Use Cases

In Database Systems

In database systems, failover presents unique challenges centered on maintaining ACID (atomicity, consistency, isolation, durability) properties during the transition from a primary to a standby instance. Ensuring atomicity requires careful handling of ongoing transactions, often involving rollbacks of partially committed operations to prevent inconsistencies across replicas. For instance, in systems like SQL Server, the write-ahead logging architecture guarantees atomicity and durability by logging changes before commit, allowing rollbacks during failover if the primary fails mid-transaction. Isolation is preserved through mechanisms like locking and row versioning, which prevent dirty reads during the switch, though read/write splits, where reads are offloaded to replicas, must be reconfigured to avoid stale data exposure. Oracle databases address these concerns via redo log validation on standbys, ensuring no corrupted blocks propagate during failover.

Key techniques for database failover include log shipping, block-level replication, and multi-master setups. Log shipping, as implemented in SQL Server's Always On Availability Groups, involves the primary replica transmitting transaction log records to secondaries, enabling synchronous or asynchronous commit modes to minimize data loss during failover. In Oracle Data Guard, replication primarily uses redo log shipping for physical standbys, maintaining block-for-block identical copies, with additional block-level corruption detection to repair issues automatically during recovery. Multi-master replication, such as Oracle GoldenGate's active-active configuration, allows updates at multiple sites using asynchronous or synchronous propagation, resolving conflicts via methods like timestamp priority to ensure eventual convergence post-failover.

Representative examples illustrate these approaches in practice. PostgreSQL employs streaming replication to send write-ahead log (WAL) records from the primary to standbys, supporting promotion of a standby via pg_ctl promote (or automation built on top of it) during failover, replaying any remaining WAL to preserve consistency. In MongoDB, replica sets within sharded clusters handle failover through elections, where a secondary is promoted to primary upon detecting primary unavailability, pausing writes briefly (typically around 12 seconds) while maintaining read availability on secondaries.

Performance impacts of failover techniques often involve added latency in synchronous replication to uphold consistency, with commit times ranging from 50-150 ms in distributed systems like Google's Spanner due to cross-replica acknowledgments. Trade-offs arise with eventual consistency models, which prioritize availability during partitions by allowing temporary divergences, resolvable later via reconciliation, but at the risk of brief inconsistencies, as formalized in the PACELC framework for balancing latency and consistency in normal and failure scenarios. Asynchronous modes reduce this overhead but may introduce lag exceeding 1 second if the primary fails before shipping logs.

Post-failover recovery strategies frequently leverage point-in-time recovery (PITR) to align datasets precisely. In PostgreSQL, PITR combines a base backup with replaying archived WAL files to any point since the backup, enabling reconstruction of the new primary to match the failed one's state without full resynchronization. This approach ensures minimal data loss and supports rollback to a consistent point, though it requires uninterrupted WAL archiving for effectiveness.
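
As a simplified illustration of standby promotion in PostgreSQL, the sketch below (assuming PostgreSQL 12 or later and the psycopg2 driver; the connection string and lag threshold are placeholders) checks how far WAL replay trails received WAL on a standby and promotes it with pg_promote() only if the gap is acceptably small. Production deployments would normally rely on dedicated automation such as Patroni or repmgr rather than a hand-rolled script:

```python
# Minimal sketch (PostgreSQL 12+, psycopg2): gate standby promotion on the
# standby's WAL replay lag. Connection details and threshold are illustrative.
import psycopg2

STANDBY_DSN = "host=standby.example.internal dbname=app user=monitor"
MAX_LAG_BYTES = 16 * 1024 * 1024   # refuse promotion if replay is too far behind

def replay_lag_bytes(conn) -> int:
    """Bytes of WAL received by the standby but not yet replayed."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), "
            "pg_last_wal_replay_lsn())"
        )
        return int(cur.fetchone()[0] or 0)

def promote_standby(conn) -> None:
    with conn.cursor() as cur:
        cur.execute("SELECT pg_promote()")   # server-side promotion, PostgreSQL 12+

def fail_over_if_safe() -> None:
    conn = psycopg2.connect(STANDBY_DSN)
    conn.autocommit = True
    lag = replay_lag_bytes(conn)
    if lag > MAX_LAG_BYTES:
        raise RuntimeError(f"standby lagging by {lag} bytes; promotion unsafe")
    promote_standby(conn)
    print("standby promoted to primary")
```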

In Cloud and Distributed Environments

In cloud and distributed environments, failover mechanisms are integral to maintaining availability through elastic scaling and automated recovery. Auto-scaling groups, such as those in Amazon EC2, monitor instance health and automatically terminate unhealthy instances while launching replacements to sustain application capacity during failures. Similarly, Azure Traffic Manager facilitates region-level failover by using health probes to detect outages and redirect traffic to secondary endpoints, ensuring minimal disruption in multi-region deployments. Container orchestration platforms like Kubernetes further enhance resilience via rolling updates, which progressively replace pods while guaranteeing that at least 75% of desired replicas remain operational by default, thereby supporting seamless failover without downtime.

Distributed systems introduce unique challenges in failover, particularly in managing partial failures and ensuring data consistency across nodes. In service meshes like Istio, circuit breakers isolate faulty services by opening after consecutive errors, preventing widespread outages and enabling traffic rerouting to healthy instances. For NoSQL databases such as Apache Cassandra, the ring topology partitions data via consistent hashing, with replication factors ensuring multiple copies; failover relies on hinted handoffs to temporarily store writes during node unavailability, while eventual consistency is maintained through tunable consistency levels such as QUORUM and anti-entropy repairs using Merkle trees. These approaches balance availability and partition tolerance but require careful tuning to mitigate risks like load imbalance from multiple virtual nodes.

Representative examples illustrate practical implementations in cloud ecosystems. AWS RDS Multi-AZ deployments synchronously replicate data to a standby instance in a different availability zone, automatically failing over in cases of primary instance impairment, with recovery typically completing in 60-120 seconds. For web services, Google Cloud's global load balancing uses anycast IP addressing and health checks to route traffic away from failed backends, supporting failover across regions for low-latency resilience.

Cloud environments provide distinct advantages for failover, including pay-per-use redundancy that aligns costs with actual resource consumption, avoiding the expenses of idle on-premises hardware. Provisioning redundant capacity occurs in seconds via APIs, contrasting with the hours or days required for physical hardware setups, enabling rapid elasticity during incidents.

Emerging trends in failover emphasize serverless and edge paradigms for greater automation and decentralization. In serverless computing, AWS Lambda integrates dead-letter queues to capture asynchronous invocation failures, routing events to Amazon SQS or Amazon SNS for debugging and reprocessing without losing data. Edge computing bolsters resilience through techniques like failover replication and reconfiguration, allowing distributed ML inference to recover from node crashes by redundantly deploying models across nearby devices.
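
The circuit-breaker behavior described for service meshes can also be approximated in application code. The Python sketch below opens after a configurable run of consecutive failures, serves a fallback (for example, another replica or a cached response) while open, and allows a retry after a cool-down period; the thresholds are illustrative and do not reflect Istio's defaults:

```python
# Minimal sketch of the circuit-breaker pattern used for failover between a
# primary call path and a fallback. Thresholds and timings are assumptions.
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    def _is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at, self.failures = None, 0   # half-open: allow a retry
            return False
        return True

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._is_open():
            return fallback()                 # reroute while the circuit is open
        try:
            result = primary()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()     # open the circuit
            return fallback()

# Usage (hypothetical callables):
# breaker = CircuitBreaker()
# response = breaker.call(call_primary_replica, call_backup_replica)
```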

Historical Development

Early Innovations

The origins of failover mechanisms trace back to mid-20th-century mainframe computing, where basic redundancy was introduced to mitigate hardware failures without full system halts. In the IBM System/360 family, announced in 1964, architects designed for redundant channels, storage units, and central processing units, enabling continued operation after the failure of individual components through manual or semi-automated reconfiguration. This approach represented an early precursor to automated failover, emphasizing modular design to isolate faults in large-scale business environments.

The 1970s marked a pivotal shift toward commercial fault-tolerant systems with automatic failover for critical applications. Tandem Computers, founded in 1974, released its NonStop system in 1976 as the first commercial fault-tolerant computer, featuring multiple processors running as process pairs to detect and recover from faults via immediate failover to redundant hardware during operation. Designed specifically for high-availability environments like banking, the NonStop architecture used hardware redundancy and software checkpoints to ensure no single component failure interrupted ongoing operations, achieving uptime levels unprecedented for minicomputers at the time.

By the 1980s, failover concepts extended to clustered environments in Unix and proprietary systems, enabling shared resource access and dynamic recovery. Digital Equipment Corporation's VAXcluster, introduced in the mid-1980s, provided shared-disk access across multiple VAX processors, allowing automatic failover of processes to surviving nodes upon hardware failure through distributed lock management and resource migration. Stratus Technologies also contributed with its fault-tolerant systems based on the VOS operating system, introduced in 1983, featuring hardware redundancy and automatic failover for continuous operation in transaction-heavy applications.

Key innovations during this era included the introduction of hot-swappable components, first realized in fault-tolerant systems like Tandem's NonStop expansions, where processors and I/O modules could be replaced without system downtime by leveraging redundant paths and isolation circuits. Basic heartbeat monitoring also emerged, with periodic status signals exchanged between clustered nodes to detect failures promptly, as implemented in VAXcluster interconnects for timely failover initiation. These advancements influenced early standards for fault tolerance, notably through ARPANET's distributed network models in the late 1960s and 1970s, which emphasized redundancy of connectivity to route around node or link failures, laying groundwork for resilient communication protocols.

Modern Advancements

The evolution of failover mechanisms in the 2000s was significantly advanced by virtualization technologies, which shifted the focus from hardware dependencies to software-orchestrated recovery. VMware High Availability (HA), introduced in 2006 with ESX Server 3.0 and VirtualCenter, enabled automated VM restart, building on vMotion (introduced in 2003) for live migration and allowing failover without physical intervention while reducing downtime to minutes. This innovation laid the groundwork for cluster-based resilience, where host failures trigger rapid resource reallocation across virtualized environments.

The following decade saw the rise of cloud-native failover driven by scalable, distributed architectures. Amazon Web Services launched Elastic Load Balancing in May 2009, providing automatic traffic distribution and health-check-based failover across EC2 instances to maintain application availability during instance or Availability Zone failures. Kubernetes, first released on June 6, 2014, further revolutionized orchestration with built-in failover capabilities, such as pod replication and node affinity rules, enabling self-healing clusters that automatically reschedule workloads on healthy nodes. Complementing these, serverless platforms like AWS Lambda, introduced in 2014, incorporated inherent fault tolerance through multi-AZ replication and automatic scaling, eliminating manual server management while ensuring sub-second failover for event-driven functions.

Recent innovations up to 2025 have integrated machine learning for proactive failover, enhancing reactive models with predictive capabilities. In Google Cloud, machine learning-based anomaly detection in the Operations Suite analyzes metrics and logs to forecast potential failures, triggering preemptive migrations or scaling to avert outages. Zero-trust security models have also emerged for secure failover handovers, enforcing continuous verification of identities and micro-segmentation during resource shifts in multi-cloud environments to mitigate lateral movement risks.

Standardization efforts have bolstered interoperability in modern failover. The IEEE 802.1D Spanning Tree Protocol, refined in subsequent updates, prevents network loops while facilitating rapid path reconvergence for failover in Ethernet bridges, influencing resilient network designs. Open-source contributions like Pacemaker and Corosync provide a modular cluster resource manager, supporting policy-driven failover across Linux distributions and integrating with cloud providers for hybrid setups.

Looking ahead, future trends emphasize quantum-resistant failover and edge computing for IoT ecosystems. Post-quantum cryptography integration aims to secure key exchanges during failover against quantum threats, with NIST-standardized algorithms like CRYSTALS-Kyber being adopted in high-availability protocols by 2025. At the edge, platforms enable localized health monitoring and automated failover for distributed IoT devices, as demonstrated in StarlingX, where health checks monitor workloads to recover from node failures without central cloud dependency.
