
Linux-HA

Linux-HA, also known as the High-Availability Linux project, was an open-source initiative that developed clustering software to provide high availability for applications and services on Linux and other operating systems, including FreeBSD, OpenBSD, Solaris, and OS X. The project focused on creating resilient systems that minimize downtime by detecting failures and automatically transferring workloads to healthy nodes in a cluster. Originating in the late 1990s, Linux-HA introduced key components such as the Heartbeat subsystem, which served as the core engine for cluster membership, inter-node communication, and resource failover. Heartbeat enabled active-passive and active-active configurations, supporting scalable clusters without a fixed maximum number of nodes, and integrated with tools like STONITH for fencing failed nodes to prevent data corruption. Over time, elements of Linux-HA evolved into independent projects under the ClusterLabs umbrella, including Pacemaker as the cluster resource manager and Corosync for reliable messaging and quorum management. As of 2025, these successor projects are actively maintained, with Pacemaker at version 3.0.1, released in August 2025. Tools from the Linux-HA lineage are widely adopted in enterprise environments for critical workloads such as databases, web services, and clustered file systems, offering features like policy-driven resource placement, support for multi-site replication, and integration with shared storage solutions like DRBD. By providing a modular, extensible architecture, Linux-HA and its successors enable continuous operation with near-100% uptime, making them foundational to open-source high-availability strategies in production systems.

Introduction

Overview

Linux-HA is an open-source project that provides high-availability solutions for Linux environments, encompassing clustering, failover, and monitoring mechanisms to maintain service continuity. As the oldest community-driven high-availability initiative for Linux, it enables the creation of resilient clusters that detect and respond to failures, ensuring minimal downtime through automated processes. The primary purpose of Linux-HA is to achieve near-continuous availability of critical services by implementing redundancy across multiple nodes, allowing for rapid failover in response to events such as node crashes, network partitions, or application faults. This approach minimizes outages to seconds or minutes, elevating system reliability from baseline levels like 99% to higher thresholds such as 99.99% or better. By supporting n-node clusters of up to around 32 nodes, it facilitates no-single-point-of-failure architectures suitable for enterprise-scale deployments. In terms of technical scope, Linux-HA accommodates both active/passive and active/active configurations, enabling the management of diverse services including databases, web servers, file systems, ERP systems, firewalls, and load balancers. Resource management adheres to standards like OCF and LSB, with built-in fault tolerance features such as fencing (STONITH) and quorum to prevent issues like split-brain scenarios. Originating in the late 1990s as a volunteer-led effort, with initial code developed in 1998, the project has influenced subsequent tools through the evolution of components like Heartbeat into modern frameworks such as Pacemaker and Corosync under the ClusterLabs umbrella.

Goals and Principles

High availability solutions from the Linux-HA lineage commonly aim for uptime levels such as 99.999%, often referred to as "five nines," which equates to no more than about 5.26 minutes of downtime per year. This is achieved through automated failover mechanisms that minimize service interruption, typically configurable to occur within seconds to under a minute depending on cluster parameters like monitoring intervals and failure timeouts. The project supports scalable clusters ranging from 2 nodes up to around 32 nodes using underlying layers like Corosync. Guiding principles of Linux-HA emphasize open-source development, fostering community-driven collaboration and extensibility for integrating with diverse applications and environments. Portability is central, allowing seamless operation across various distributions and support for heterogeneous environments without requiring identical setups. Monitoring is designed to be non-intrusive, relying on resource agents that probe services periodically without significant overhead. At its core, Linux-HA employs a conceptual framework to prevent split-brain scenarios—where multiple nodes independently assume control of shared resources—through quorum mechanisms that require a majority of nodes to agree on cluster state before actions proceed. Resource fencing ensures data integrity by isolating failed nodes, such as powering them off, to avoid concurrent access that could lead to corruption. In contrast to proprietary high-availability solutions, Linux-HA prioritizes standards-based interoperability via Open Cluster Framework (OCF) resource agents, enabling plug-and-play management of services from different vendors. This approach, combined with community extensibility, allows users to customize and expand functionality without vendor lock-in.

History

Early Development

The Linux-HA project was founded in 1998 by Alan Robertson, along with early contributors, as a volunteer project to provide high-availability clustering capabilities for Linux systems, which at the time lacked native support for such features. The initiative began with the first working code assembled on March 18, 1998, following Robertson's earlier discussions on Linux mailing lists about the need for reliable failover mechanisms. Initially, the project concentrated on simple failover and resource monitoring, achieved through periodic heartbeat messages sent approximately once per second over serial ports or Ethernet links to detect node failures or recoveries. This design leveraged Linux networking features such as IP aliasing to enable seamless transfer of virtual IP addresses between nodes during failover without requiring complex reconfiguration. A key early milestone was the release of Heartbeat 0.4 in 1999, which introduced basic clustering functionality and marked the project's first stable version capable of supporting active-passive configurations limited to two nodes. The project's community grew rapidly once the code was publicly hosted, fostering open collaboration and attracting contributions that expanded its scope beyond Linux. Ports to other Unix-like systems, including FreeBSD and Solaris, were subsequently developed, broadening its applicability in heterogeneous environments. These efforts addressed critical challenges of the era, including the relative instability of early Linux kernels for production use and the scarcity of affordable commercial high-availability tools, which were primarily available for proprietary Unix systems such as AIX.

Key Milestones and Evolution

The project advanced significantly in the mid-2000s, with the release of Heartbeat 2.0 in 2005 introducing comprehensive STONITH (Shoot The Other Node In The Head) support to enable reliable node fencing and prevent split-brain scenarios in clusters. This version enhanced resource agent metadata and cluster management capabilities, building on the foundational Heartbeat software developed in the late 1990s for basic failure detection. In 2008, the project underwent a restructuring that separated its components, resulting in Pacemaker as the policy-driven cluster resource manager and Corosync as the underlying cluster communication engine derived from the OpenAIS project. This split allowed for greater flexibility, enabling Pacemaker to operate independently of specific communication layers and supporting advanced features like active/active clustering configurations. During the 2010s, Linux-HA gained widespread enterprise adoption, integrating into major distributions such as Red Hat Enterprise Linux 6 in 2010, where Pacemaker became the core of the High Availability Add-On for managing clustered services. Similarly, SUSE incorporated Pacemaker into its Linux Enterprise High Availability Extension starting around the same period, providing scalable clustering for business-critical applications. Pacemaker 1.1 releases in the early 2010s introduced native support for multi-site (geo) clusters to coordinate resources across geographically dispersed sites for disaster recovery. Recent developments through the 2020s have focused on modernization and expanded integration. Pacemaker 2.1.2, released in 2021, included various improvements such as better fencing delay handling and other enhancements. Support for geo clustering, including ticket-based failover via the Booth ticket manager, was introduced in earlier versions around 2015. Integration with container platforms such as Kubernetes through resource agents and operators has enabled traditional HA clusters to manage stateful workloads in hybrid environments. In 2025, the major Pacemaker 3.0.0 release on January 8 introduced significant updates, including changes to upgrade compatibility and enhanced features for high-availability management, followed by 3.0.1 on August 7. The project has long utilized GitHub for primary source hosting to improve collaboration and accessibility. Overall governance has transitioned to the ClusterLabs community, an open-source collective that oversees development, maintenance, and contributions for Pacemaker, Corosync, and related tools.

Core Components

Pacemaker

Pacemaker serves as the central cluster resource manager in Linux-HA clusters, responsible for starting, stopping, monitoring, and migrating resources to maintain availability based on the cluster's state and user-defined constraints. It processes events such as node failures or service disruptions, deciding actions to ensure resources remain active and data integrity is preserved through mechanisms like fencing. This allows for flexible configurations, including active/passive and active/active setups, across multiple nodes. Key features of Pacemaker include support for various resource agent standards, such as Open Cluster Framework (OCF), Linux Standard Base (LSB), and systemd units, which enable the management of diverse services like databases, web servers, and file systems through standardized scripts. It enforces constraints to dictate resource dependencies and placement—for instance, ensuring a database starts before an associated application, or colocating related services on the same node. Additional capabilities encompass failure thresholds for automatic migration and live resource relocation without downtime for compatible agents. Architecturally, Pacemaker relies on the cluster resource management daemon (historically crmd, now implemented as pacemaker-controld), which coordinates decisions and actions across the cluster. It integrates with the Cluster Information Base (CIB), an XML-based repository that stores and synchronizes configuration, resource, and status data among nodes, allowing for real-time updates and queries via tools like crm_mon. The controller communicates with the underlying cluster membership layer, such as Corosync, to receive node status updates. Pacemaker originated as a spin-off from the Heartbeat project within the Linux-HA initiative around 2007-2008, evolving into an independent resource manager to enhance flexibility beyond Heartbeat's integrated approach. As of 2025, the latest release is 3.0.1 (released August 2025). These developments build on earlier releases by improving scalability and integration with modern environments, including support for promotable clones and multi-tenant deployments. Pacemaker powers high-availability setups in major distributions and platforms, including Red Hat Enterprise Linux (RHEL) for enterprise clustering and SUSE Linux Enterprise High Availability Extension for robust service management, while Proxmox Virtual Environment builds its virtual machine failover on the related Corosync layer. Its widespread adoption stems from its policy-driven automation, which minimizes manual intervention in production environments handling critical workloads.
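To illustrate this policy-driven model, the following sketch uses the pcs command-line interface (covered later in the Implementation section) to bind a web server to a virtual IP with colocation and ordering constraints; the resource names and IP address are illustrative assumptions rather than values from a particular deployment:
bash
# Hedged example: "web-ip", "web-srv", and the address are placeholders
pcs resource create web-ip ocf:heartbeat:IPaddr2 ip=192.168.122.120 cidr_netmask=24 \
    op monitor interval=30s
pcs resource create web-srv ocf:heartbeat:apache op monitor interval=30s
pcs constraint colocation add web-srv with web-ip INFINITY   # keep both on the same node
pcs constraint order web-ip then web-srv                     # bring the IP up first
crm_mon -1                                                   # one-shot view of the resulting state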

Corosync

Corosync serves as the foundational communication and membership layer in Linux-HA clusters, providing reliable multicast messaging, node heartbeat detection, and quorum management through the Totem protocol. This open-source cluster engine implements the Totem Single Ring Ordering and Membership protocol, ensuring ordered and reliable delivery of messages among cluster nodes while detecting failures via periodic token passing. Heartbeat detection occurs through configurable timeouts, with defaults such as a 1-second token interval and a roughly 10-second failure detection window, allowing administrators to adjust parameters like token, token_retransmits_before_loss_const, and join timeouts in the configuration to suit network conditions. The protocol relies on UDP-based transport (multicast or unicast) for intra-cluster communication, enabling efficient group messaging without requiring a central coordinator. For redundancy, Corosync supports multiple communication links since the introduction of the Kronosnet (KNET) library in version 3.0 in 2018, which facilitates link switching, automatic failover, and multipathing across network interfaces. KNET enhances resilience by allowing up to eight redundant links, ensuring connectivity even if individual paths fail, and integrates seamlessly with the transport layer for packet fragmentation and reassembly. Corosync manages quorum to prevent split-brain scenarios, using the votequorum service where each node typically holds one vote, requiring a majority (e.g., 50% + 1) for cluster operations to proceed. Upon failure detection, it recalculates quorum, configurable via options like expected_votes and auto_tie_breaker for even-sized clusters such as two-node setups. Configuration is handled through the corosync.conf file, located at /etc/corosync/corosync.conf, which defines ring interfaces (e.g., rrp_mode: active for redundant rings in legacy setups), transport settings (e.g., transport: knet), and quorum parameters. Evolving from the OpenAIS project in 2008, Corosync was refactored to focus on core cluster infrastructure, separating messaging and membership from higher-level resource management. The project has since advanced, with version 3.1.9 (as of mid-2025) providing maintenance updates. These improvements, such as support for multiple cryptographic libraries (e.g., NSS and OpenSSL), ensure secure communication in production environments. Corosync integrates with Pacemaker by delivering membership and cluster state events for resource management.
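The state described above can be inspected with the utilities that ship with Corosync; a minimal verification sketch follows (the command names are real Corosync tools, while the grep patterns are only illustrative):
bash
corosync-cfgtool -s                               # per-link (ring) status for the local node
corosync-quorumtool -s                            # votes, expected_votes, and quorate state
corosync-cmapctl | grep totem.token               # effective token timeout in the runtime database
grep -A3 'quorum {' /etc/corosync/corosync.conf   # configured quorum provider and options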

Supporting Tools

Linux-HA relies on several supporting tools that extend its core functionality, providing legacy compatibility, resource agent scripts, monitoring capabilities, fencing mechanisms, and configuration interfaces. These tools integrate with the primary components to enable flexible high-availability setups across diverse environments. Heartbeat, the original clustering subsystem developed for the Linux-HA project, provided basic high-availability features such as failover and monitoring through heartbeat messaging prior to 2008. Although deprecated in favor of more robust alternatives like Corosync, it remains available for simple, low-complexity setups where minimal configuration is preferred. Resource agents in Linux-HA adhere to the Open Cluster Framework (OCF) standard, consisting of standardized scripts that define start, stop, monitor, and status operations for cluster resources. For instance, the ocf:heartbeat:IPaddr agent manages virtual IP addresses, while others handle services like web servers (ocf:heartbeat:apache) and databases (ocf:heartbeat:mysql), with over 100 such agents available in the official repository to support a wide range of applications. These agents allow the cluster manager to abstract and orchestrate third-party services without custom coding. Monitoring integrations enhance Linux-HA by incorporating external tools for proactive health checks that inform failover decisions. Nagios plugins, for example, can be deployed as monitoring resources within Pacemaker to check remote services and trigger failovers based on detected issues, creating feedback loops between monitoring and recovery. Similarly, lightweight process-supervision tools can feed status updates into the stack for automated responses. Fencing agents, essential for STONITH (Shoot The Other Node In The Head) operations, ensure safe node isolation during failures by interfacing with power-management or network devices. Common implementations include the fence_ipmilan agent for IPMI-based power control on servers and fence_apc for APC Power Distribution Units (PDUs), which allow the cluster to remotely power off malfunctioning nodes to prevent data corruption. These agents are configured as dedicated resources and support a variety of hardware vendors for reliable fencing enforcement. Additional utilities streamline cluster administration: crmsh offers a command-line shell for configuring and querying Pacemaker resources in a structured, scriptable manner, supporting complex operations like resource migration and constraint definition. Hawk, a web-based GUI primarily associated with SUSE distributions, provides visual tools for real-time monitoring, resource editing, and status visualization, making it accessible for administrators managing Pacemaker-based clusters. These tools work in tandem with Pacemaker and Corosync to simplify deployment and maintenance without altering core cluster behaviors.
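Before wiring agents into a cluster, their parameters can be discovered from the command line; a brief sketch using pcs (the agent names shown are the examples from the text):
bash
pcs resource agents ocf:heartbeat | head      # list OCF agents from the heartbeat provider
pcs resource describe ocf:heartbeat:IPaddr2   # required and optional parameters of one agent
pcs stonith list | head                       # installed fencing agents (fence_ipmilan, fence_apc, ...)
pcs stonith describe fence_ipmilan            # fencing-agent parameters such as ip, username, password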

Architecture

Cluster Communication Layer

The cluster communication layer in Linux-HA serves as the foundational infrastructure for reliable and ordered message delivery among nodes, ensuring state synchronization and membership awareness even in the presence of failures. This layer handles the exchange of heartbeat signals, membership updates, and state notifications, allowing nodes to maintain a consistent view of the cluster and preventing desynchronization during transient issues or node departures. At its core, the layer employs the Totem protocol, which operates via a single-ring ordering mechanism for group communication, guaranteeing that messages are delivered in the same sequence to all nodes. In configurations supporting redundancy, this extends to multiple-ring setups, where independent communication paths carry messages concurrently to enhance fault tolerance. Should multicast delivery fail due to network constraints or partitions, the protocol can fall back to unicast transmission to specific nodes, maintaining delivery where possible without compromising ordering. To mitigate split-brain scenarios, where partitioned subsets of nodes might independently assume cluster control, the layer implements a quorum model based on dynamic majority voting. This model calculates the required votes for quorate status as half of the expected_votes parameter plus one, with expected_votes typically auto-derived from the node count but manually configurable for scenarios like maintenance or uneven node weights. For example, a five-node cluster with expected_votes of 5 requires three votes to remain quorate. Only quorate partitions proceed with operations, ensuring that minority partitions remain passive until reconciliation. Redundancy is achieved through support for dual or multiple communication rings, each operating as an independent path, allowing the cluster to survive the failure of an entire ring without message loss. Token timeouts, configurable in milliseconds (e.g., 3000 ms in some distribution defaults), govern failure detection by triggering reconfiguration if a token is not received within the interval, balancing responsiveness against false positives in variable networks. Security features include built-in authentication using symmetric keys generated via tools like corosync-keygen, which verifies message origins and prevents unauthorized participation. Optional encryption, leveraging algorithms such as AES-256 alongside HMAC-SHA-256 for message authentication, protects cluster traffic over untrusted networks. Corosync provides the primary implementation of this layer in modern Linux-HA setups.
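Key generation and encryption setup follow the pattern sketched below; the second node's hostname is a placeholder, and the cipher and hash values are examples of supported options rather than mandated defaults:
bash
corosync-keygen                                      # writes /etc/corosync/authkey from the entropy pool
scp /etc/corosync/authkey node2:/etc/corosync/       # distribute the same key to every node
# Then, in the totem section of corosync.conf on all nodes:
#   crypto_cipher: aes256
#   crypto_hash: sha256
systemctl restart corosync                           # restart after keys and settings match cluster-wide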

Resource Management Layer

The Resource Management Layer in Linux-HA, primarily implemented by the Pacemaker cluster resource manager, oversees the allocation, monitoring, and recovery of resources across cluster nodes to ensure availability and service continuity. This layer abstracts resource lifecycle management from underlying node operations, using policy-driven decisions to handle placement, state transitions, and recovery. It operates on top of reliable cluster communication, coordinating actions that maintain desired service states even during node failures or maintenance. Central to this layer are key components that facilitate dynamic resource handling. The Cluster Information Base (CIB) serves as an XML-based, synchronized repository storing the live cluster configuration, resource definitions, node attributes, and current status, enabling all nodes to maintain a consistent view managed by the designated coordinator. The Policy Engine (PE), implemented as the pacemaker-schedulerd daemon, acts as the decision-making core, incorporating a transition engine to orchestrate state changes—such as starting, stopping, or promoting resources—and a constraint solver to evaluate placement rules including location preferences, colocation requirements (e.g., ensuring dependent resources run together), and ordering constraints (e.g., starting a database before the application that depends on it). These components process inputs to generate actionable graphs of operations, ensuring resources align with administrative policies. Failover logic within the layer emphasizes proactive monitoring and prioritized recovery to minimize downtime. Resources are periodically probed through monitor operations executed by resource agents, which assess health at configurable intervals (e.g., every 10 seconds for critical services); failures increment a failure counter, and after reaching a configurable migration-threshold, the PE initiates relocation to a suitable node. Placement decisions incorporate stickiness scores, ranging from -∞ (strong avoidance) to +∞ (mandatory placement), with the default resource-stickiness value of 1 for cloned resources encouraging resources to remain on their current node unless overridden by constraints or failures, thus balancing stability and load distribution. For scalability, the layer supports multi-tenancy through resource sets and templated configurations that isolate workloads, allowing multiple services to share cluster infrastructure without interference. It also accommodates multi-site clusters by leveraging location constraints and node attributes (e.g., site identifiers) to distribute resources across geographically dispersed sites, enabling site-level failover with minimal data loss in redundancy models such as N+1. Integration with the Local Resource Manager (LRM), via the pacemaker-execd daemon (formerly pacemaker-lrmd), ensures node-local execution of PE-directed actions—such as invoking OCF-compliant resource agents for start, stop, and monitor operations—while relaying results back to the CIB for cluster-wide consistency.
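The stickiness and threshold behavior described above can be tuned with pcs; a hedged sketch, where the resource name web-ip, the node name node1, and the chosen values are illustrative only:
bash
pcs resource defaults resource-stickiness=100                 # prefer keeping resources where they run
pcs resource update web-ip meta migration-threshold=3 failure-timeout=120s
pcs constraint location web-ip prefers node1=50               # soft location preference
crm_simulate --live-check --show-scores                       # inspect the scores the scheduler computes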

Fencing Mechanisms

Fencing mechanisms in Linux-HA clusters are essential for maintaining data integrity by isolating failed or unresponsive nodes, thereby preventing scenarios such as split-brain, in which multiple nodes simultaneously access shared resources like storage and risk corruption from dual writes. These mechanisms ensure that a node is definitively offline before resources are reassigned to another node, avoiding interference from corrupted or rogue processes. The primary fencing method in Linux-HA is STONITH, an acronym for "Shoot The Other Node In The Head," which employs external agents to forcibly power off or reset a failed node. STONITH devices, configured as cluster resources, include hardware interfaces such as IPMI for remote power control or SSH for scripted shutdowns, ensuring the action occurs outside the cluster's internal communication to avoid reliance on potentially compromised paths. Fencing types in Linux-HA are categorized as soft or hard, allowing flexibility based on the environment. Soft fencing attempts less destructive isolation, such as cutting a node's storage or network access or requesting a graceful shutdown, before escalating. Hard fencing, exemplified by agents for power distribution units (PDUs) such as fence_apc, directly cuts power to ensure immediate and irreversible termination. Configurable delays, such as a 60-second postponement after failure detection or randomized intervals via parameters like pcmk_delay_max, help avoid fence races in two-node setups and prevent premature fencing during transient issues. STONITH integrates with quorum policies so that fencing proceeds only from the quorate partition upon confirmed loss of communication, ensuring decisions reflect cluster consensus and avoiding unnecessary actions in partitioned clusters. For even-numbered node counts, quorum devices (external arbitrators such as corosync-qdevice) provide an additional vote to resolve ties and allow fencing to be initiated reliably. Best practices for Linux-HA fencing emphasize redundancy and validation to enhance reliability. Deploying multiple fencing devices, such as combining IPMI with PDUs, mitigates single points of failure in the fencing topology. Testing configurations, and using parameters like pcmk_host_map to accurately map node hostnames to device ports, ensures precise targeting during fencing operations. Fencing actions are typically triggered by failures detected through Pacemaker's monitoring and membership layers.
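A fencing configuration along these lines might look as follows; the BMC addresses, credentials, and node names are placeholders, and a live fencing test should only be run on a non-production cluster:
bash
pcs stonith create fence-node1 fence_ipmilan ip=10.0.0.11 username=admin password=secret \
    lanplus=1 pcmk_host_list=node1 pcmk_delay_max=15
pcs stonith create fence-node2 fence_ipmilan ip=10.0.0.12 username=admin password=secret \
    lanplus=1 pcmk_host_list=node2
pcs property set stonith-enabled=true
pcs stonith fence node2          # deliberately fence one node to validate the configuration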

Implementation

Cluster Setup Process

Setting up a Linux-HA cluster using Pacemaker and Corosync requires careful attention to prerequisites to ensure compatibility and reliable communication. Nodes should run a homogeneous operating system, such as Red Hat Enterprise Linux 8 or later (or compatible derivatives), to avoid version mismatches in cluster software and kernel features. Shared storage, such as DRBD or a SAN, is optional for basic setups but necessary for stateful resources; network isolation via a dedicated private interface is recommended to separate cluster traffic from public networks, using static IP addresses for stability. Installation begins with enabling the high-availability repository on each node, for example, using dnf config-manager --set-enabled highavailability on RHEL-compatible systems. Install the required packages via the package manager, such as dnf install pacemaker pcs corosync fence-agents-all (or yum on older versions), which provides Pacemaker for resource management and Corosync for communication. Configure the firewall to allow high-availability services, e.g., firewall-cmd --permanent --add-service=high-availability followed by firewall-cmd --reload. For basic cluster administration, start and enable the PCS daemon with systemctl enable --now pcsd.service, which facilitates cluster management. Set a common password for the hacluster user on all nodes using passwd hacluster, then authenticate nodes with pcs host auth <node1> <node2>. Generate the cluster configuration, which creates and synchronizes corosync.conf across nodes, using pcs cluster setup <clustername> <node1> <node2>; this command also handles authentication keys internally, equivalent to manual use of corosync-keygen for generating a shared secret in /dev/random-based setups. Define the cluster name and node IDs (e.g., 1 for node1, 2 for node2) within this step, ensuring hostname resolution via /etc/hosts or DNS. In manual configurations without PCS, use corosync-cfgtool to query or adjust ring status post-setup, though generation of corosync.conf typically involves editing sections like totem, quorum, and nodelist directly. Start the cluster services with pcs cluster start --all and enable them for boot using systemctl enable corosync pacemaker. For basic testing, disable fencing (STONITH) temporarily with pcs property set stonith-enabled=false, noting this is not recommended for production. Verification involves checking cluster status with crm_mon (or pcs status) to confirm all nodes are online and no resources are failing. Check Corosync ring status using corosync-cfgtool -s, which should show faultless links. To test, create a simple resource like a virtual IP with pcs resource create test-ip ocf:heartbeat:IPaddr2 ip=192.168.122.150 cidr_netmask=24, then verify its placement and movement using pcs status or crm_resource --resource test-ip --locate.
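A condensed sketch of the whole sequence for a two-node cluster, assuming the hostnames node1 and node2 and a throwaway password; the commands mirror those described above and should be adapted before production use:
bash
dnf config-manager --set-enabled highavailability
dnf install -y pacemaker pcs corosync fence-agents-all
firewall-cmd --permanent --add-service=high-availability && firewall-cmd --reload
systemctl enable --now pcsd.service
echo 'hacluster:ExamplePass' | chpasswd              # run on every node with the same password
pcs host auth node1 node2 -u hacluster -p ExamplePass
pcs cluster setup mycluster node1 node2
pcs cluster start --all
pcs property set stonith-enabled=false               # testing only; never leave disabled in production
pcs resource create test-ip ocf:heartbeat:IPaddr2 ip=192.168.122.150 cidr_netmask=24
pcs status
corosync-cfgtool -s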

Configuration and Management

Linux-HA clusters, managed primarily through Pacemaker, rely on the Cluster Information Base (CIB) for defining resources in XML form. Resources are specified as primitives, groups, or clones within the <resources> section of the CIB, with each entry identifying its class, type, and provider. For example, a virtual IP resource using the IPaddr2 agent is defined as follows:
xml
<primitive id="ClusterIP" class="ocf" type="IPaddr2" provider="heartbeat">
  <instance_attributes id="ClusterIP-params">
    <nvpair id="ClusterIP-ip" name="ip" value="192.168.122.120"/>
    <nvpair id="ClusterIP-cidr_netmask" name="cidr_netmask" value="24"/>
  </instance_attributes>
  <operations>
    <op id="ClusterIP-monitor" name="monitor" interval="30s" timeout="20s"/>
  </operations>
</primitive>
This configuration ensures the IP address is managed and monitored appropriately. Constraints, such as colocation, are added under the <constraints> section to enforce resource placement rules, using scores to indicate preference or requirement. A mandatory colocation constraint, for instance, ties two resources to the same node with an infinite score:
xml
<rsc_colocation id="colocate-ip-web" score="INFINITY" rsc="ClusterIP" with-rsc="Webserver"/>
Here, INFINITY (equivalent to 1,000,000) makes the colocation mandatory, preventing the resources from running separately. Configuration and management are facilitated by command-line tools like pcs and cibadmin. The pcs tool provides a user-friendly interface for creating and modifying resources, such as pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=192.168.122.120 cidr_netmask=24 op monitor interval=30s. For direct XML edits to the live CIB, cibadmin is used, for example, cibadmin --create --xml-file resource.xml --scope resources to add a new resource definition. Constraints can similarly be managed via pcs, like pcs constraint colocation add ClusterIP with Webserver INFINITY. Monitoring involves real-time status viewing and logging mechanisms to track cluster health. The crm_mon utility offers a dynamic display of cluster status, resources, and nodes, invoked with crm_mon for continuous output or crm_mon -1 for a one-time snapshot. Logging is handled through the system logger, with Pacemaker-specific entries in /var/log/pacemaker/pacemaker.log or integrated into /var/log/messages, and logs rotate automatically at 100MB or weekly intervals. Alerts for failures are configured in the CIB under <alerts>, such as defining an alert agent with <alert id="snmp_alert" path="/path/to/alert_snmp.sh"/>, which triggers on events like resource failures. Maintenance tasks include performing rolling upgrades and backing up configurations to ensure operational continuity. Rolling upgrades proceed node-by-node, draining resources from one node before upgrading it, provided version compatibility is maintained (e.g., Pacemaker 2.x requires Corosync 2.3 or later). The CIB can be exported and restored with cibadmin, for example, cibadmin --query > /path/to/backup.xml to save the current configuration and cibadmin --replace --xml-file /path/to/backup.xml to restore it; the backup file is plain XML. Troubleshooting focuses on log analysis and simulating scenarios to diagnose issues. Logs are primarily located in /var/log/pacemaker/, where errors can be filtered with commands like grep 'pacemaker.*error' /var/log/pacemaker/pacemaker.log. Common problems include network partitions, which Pacemaker mitigates through fencing mechanisms to isolate faulty nodes and maintain quorum. The crm_simulate tool aids diagnosis by replaying cluster transitions from saved inputs, such as crm_simulate --simulate --xml-file transition.xml.
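The alert, backup, and simulation tasks described above translate into commands such as the following sketch; the alert script path is carried over from the example above, while the recipient address and file paths are placeholders:
bash
pcs alert create path=/path/to/alert_snmp.sh id=snmp_alert   # register the alert agent in the CIB
pcs alert recipient add snmp_alert value=192.0.2.10          # placeholder SNMP manager address
cibadmin --query > /root/cib-backup.xml                      # export the live CIB as XML
cibadmin --replace --xml-file /root/cib-backup.xml           # restore it later if needed
crm_simulate --live-check --simulate                         # dry-run the scheduler against live state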

Applications

Common Use Cases

Linux-HA clusters, leveraging Pacemaker as the resource manager, are commonly deployed for database high availability to ensure minimal downtime during failures. In such setups, tools like DRBD provide synchronous block-level replication for shared data, enabling active/passive failover configurations for databases such as MySQL. For instance, Pacemaker monitors the primary database instance and, upon detecting a failure, promotes the standby by mounting the replicated DRBD device and starting the database service. Automatic migration of a virtual IP (VIP) address facilitates seamless client reconnection without manual intervention. Web services represent another key application, where Linux-HA enables load-balanced clusters to maintain availability under high load or node failures. Pacemaker coordinates active/passive or active/active configurations, often integrating with HAProxy for traffic distribution across instances while ensuring session persistence through shared storage or sticky sessions. This setup allows for automatic failover of the load balancer itself, preventing single points of failure in web infrastructures. For file services, Linux-HA supports active/active access to shared storage using GFS2, a cluster file system that allows multiple nodes to read and write concurrently. Configurations with Samba or NFS over GFS2, managed by Pacemaker, provide high-availability file sharing in environments requiring scalable storage. Pacemaker handles resource coordination and locking via the Distributed Lock Manager (DLM) to prevent corruption during concurrent operations. Virtualization platforms also benefit from the Linux-HA stack through high-availability setups for KVM/QEMU virtual machines, particularly in Proxmox VE environments. The cluster stack enables restart or migration of VMs to healthy nodes upon host failure, using shared storage like Ceph or GFS2 for data persistence. This integration supports seamless workload relocation, minimizing disruption in virtualized data centers. In these deployments, Linux-HA typically achieves a Recovery Time Objective (RTO) of under one minute, as failover detection and resource promotion occur in seconds to tens of seconds depending on cluster size and configuration. Real-world examples include CERN's use of Pacemaker for high-availability load balancing in database middleware, ensuring continuous operation of critical services.
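For the database scenario, a hedged pcs sketch of an active/passive MySQL-on-DRBD stack follows; the DRBD resource r0, device path, mount directory, address, and all resource names are placeholders, and the Promoted role keyword may appear as Master on older Pacemaker releases:
bash
pcs resource create drbd-mysql ocf:linbit:drbd drbd_resource=r0 \
    promotable promoted-max=1 promoted-node-max=1 clone-max=2 notify=true
pcs resource create mysql-fs ocf:heartbeat:Filesystem \
    device=/dev/drbd0 directory=/var/lib/mysql fstype=ext4
pcs resource create mysql-vip ocf:heartbeat:IPaddr2 ip=192.168.122.200 cidr_netmask=24
pcs resource create mysql-db ocf:heartbeat:mysql op monitor interval=30s
pcs resource group add mysql-group mysql-fs mysql-vip mysql-db
# Run the group only where DRBD is promoted, and only after promotion completes
pcs constraint colocation add mysql-group with Promoted drbd-mysql-clone INFINITY
pcs constraint order promote drbd-mysql-clone then start mysql-group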

Integration in Distributions

Linux-HA components, particularly Pacemaker and Corosync, are integrated into major Linux distributions through dedicated high availability packages and extensions that facilitate cluster management and failover capabilities. In Red Hat Enterprise Linux (RHEL) and its community counterpart CentOS, the High Availability Add-On has been available since RHEL 6, released in 2010, providing enterprise-grade clustering tools built on Linux-HA foundations. This add-on includes the pcs command-line interface for cluster configuration and management, as well as fence-agents for node fencing to ensure clean failovers. Additionally, the Resilient Storage Add-On complements these features by enabling concurrent access to shared storage in highly available clusters, supporting technologies like GFS2 filesystems for data integrity. As of 2025, RHEL 10 enhancements extend HA capabilities to edge computing environments, incorporating optimized image-based deployments and live kernel patching to minimize downtime in distributed setups. SUSE Linux Enterprise (SLE) offers the High Availability Extension, which integrates Linux-HA tools like Pacemaker for resource management and supports advanced clustering features. Its components include the web-based Hawk console for intuitive configuration and monitoring of clusters, allowing administrators to visualize resource states and dependencies. The extension also provides geo clustering capabilities, enabling coordinated failover across geographically dispersed sites for disaster recovery scenarios. For Ubuntu and Debian, Pacemaker and related Linux-HA packages are readily available through official repositories, allowing straightforward installation via package managers like apt, as shown below. These distributions integrate Pacemaker with cloud orchestration tools such as Juju, where subordinate charms like HAcluster and pacemaker-remote enable automated deployment of high availability setups for virtual IPs and services in cloud environments. Other distributions and platforms extend Linux-HA functionality in specialized ways; for instance, Proxmox Virtual Environment (VE) has incorporated the ha-manager since version 4.2 in 2016, leveraging Corosync for automated VM failover and recovery in hyper-converged clusters.
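On Debian or Ubuntu, installation from the official archives reduces to a short sequence; the package names below are the standard ones in those repositories:
bash
sudo apt update
sudo apt install pacemaker corosync pcs crmsh fence-agents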
