Linux-HA
Linux-HA, also known as the High-Availability Linux project, was an open-source initiative that developed clustering software to provide high availability for applications and services on Linux and other Unix-like operating systems, including FreeBSD, OpenBSD, Solaris, and Mac OS X.[1] The project focused on creating resilient systems that minimize downtime by detecting failures and automatically transferring workloads to healthy nodes in a cluster.[2]
Originating in the late 1990s, Linux-HA introduced key components such as the Heartbeat subsystem, which served as the core engine for cluster membership, inter-node communication, and resource failover.[1] Heartbeat enabled active-passive and active-active configurations, supporting scalable clusters without a fixed maximum number of nodes, and integrated with tools like STONITH for fencing failed nodes to prevent data corruption.[3] Over time, elements of Linux-HA evolved into independent projects under the ClusterLabs umbrella, including Pacemaker as the cluster resource manager and Corosync for reliable messaging and quorum management.[4] As of 2025, these successor projects are actively maintained, with Pacemaker at version 3.0.1 released in August 2025.[5]
These tools from the Linux-HA lineage are widely adopted in enterprise environments for critical workloads, such as databases, web services, and file systems, offering features like policy-driven resource placement, support for multi-site replication, and integration with replicated block storage such as DRBD.[6] By providing a modular, extensible framework, Linux-HA and its successors enable near-continuous operation, making them foundational to open-source high-availability strategies in production systems.[7]
Introduction
Overview
Linux-HA is an open-source project that provides high-availability solutions for Linux environments, encompassing failover clustering, resource management, and fault tolerance mechanisms to maintain service continuity.[8] As the oldest community-driven high-availability initiative, it enables the creation of resilient clusters that detect and respond to failures, ensuring minimal downtime through automated recovery processes.[3]
The primary purpose of Linux-HA is to achieve near-continuous availability of critical services by implementing redundancy across multiple nodes, allowing for rapid failover in response to events such as hardware crashes, network partitions, or application faults.[8] This approach minimizes outages to seconds or minutes, elevating system reliability from baseline levels like 99% to higher thresholds such as 99.9%.[8] By supporting n-node clusters up to around 32 nodes, it facilitates no-single-point-of-failure architectures suitable for enterprise-scale deployments.[9]
In terms of technical scope, Linux-HA accommodates both active/passive and active/active configurations, enabling the management of diverse services including databases, web servers, file systems, ERP systems, firewalls, and load balancers.[8] Resource management adheres to standards like OCF and LSB, with built-in fault tolerance features such as fencing (STONITH) and quorum to prevent issues like split-brain scenarios.[8] Originating in the late 1990s as a volunteer-led effort, with initial code developed in 1998, the project has influenced subsequent tools through the evolution of components like Heartbeat into modern frameworks such as Pacemaker and Corosync under the ClusterLabs umbrella.[8][4]
Goals and Principles
High availability solutions from the Linux-HA lineage commonly aim for uptime levels such as 99.999%, often referred to as "five nines," which equates to no more than about 5.26 minutes of downtime per year. This is achieved through mechanisms that minimize downtime via rapid failover, typically configurable to occur within seconds to under a minute depending on cluster parameters like monitoring intervals and failure timeouts.[10] The project supports scalable clusters ranging from 2 nodes up to around 32 nodes using underlying layers like Corosync.[11][9]
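The "five nines" figure follows directly from the arithmetic:
(1 − 0.99999) × 365.25 days × 24 hours × 60 minutes ≈ 5.26 minutes of permissible downtime per year.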
Guiding principles of Linux-HA emphasize open-source collaboration, fostering community-driven development and extensibility for integrating with diverse applications and environments.[4] Modularity is central, allowing seamless integration across various Linux distributions and support for heterogeneous hardware without requiring uniform setups.[12] Monitoring is designed to be non-intrusive, relying on lightweight resource agents that probe service health periodically without significant performance overhead.[13]
At its core, Linux-HA employs a conceptual framework to prevent split-brain scenarios—where multiple nodes independently assume control of shared resources—through quorum mechanisms that require a majority of nodes to agree on cluster state before actions proceed.[4] Resource fencing ensures data integrity by isolating failed nodes, such as powering them off, to avoid concurrent access that could lead to corruption.[4]
In contrast to proprietary high-availability solutions, Linux-HA prioritizes standards-based interoperability via Open Cluster Framework (OCF) resource agents, enabling plug-and-play management of services from different vendors.[14] This approach, combined with community extensibility, allows users to customize and expand functionality without vendor lock-in.[4]
History
Early Development
The Linux-HA project was founded in 1998 by Alan Robertson, then at Bell Labs, along with early contributors, as the Heartbeat project to provide high-availability clustering capabilities for Linux systems, which at the time lacked native support for such features.[8][7] The initiative began with the first working code assembled on March 18, 1998, following Robertson's earlier discussions on Linux mailing lists about the need for reliable failover mechanisms.[8]
Initially, the project concentrated on simple IP failover and resource monitoring, achieved through periodic heartbeat messages sent approximately once per second over serial ports or Ethernet links to detect node failures or recoveries.[15] This design leveraged Linux kernel features, such as IP aliasing, to enable seamless transfer of virtual IP addresses between nodes during failover without requiring complex reconfiguration.[15][7] A key early milestone was the release of Heartbeat 0.4 in 1999, which introduced basic clustering functionality and marked the project's first stable version capable of supporting active-passive configurations limited to two nodes.[7][8]
The project's community grew rapidly after being hosted on SourceForge, fostering open development and attracting contributions that expanded its scope beyond Linux.[7] By 2001, ports to FreeBSD and Solaris had been developed, broadening its applicability in heterogeneous environments.[7] These efforts addressed critical challenges of the era, including the relative instability of early Linux kernels for production use and the scarcity of affordable commercial high-availability tools, which were primarily available for proprietary Unix systems like Solaris or AIX.[8][7]
Key Milestones and Evolution
The Linux-HA project advanced significantly between 2004 and 2008, with the release of Heartbeat 2.0 in 2006 introducing comprehensive STONITH (Shoot The Other Node In The Head) support to enable reliable node fencing and prevent split-brain scenarios in clusters.[16] This version enhanced resource agent metadata and cluster management capabilities, building on the foundational Heartbeat software developed in the late 1990s for basic failover detection.
In 2008, the Heartbeat project underwent a major restructuring through a fork that separated its components, resulting in Pacemaker as the policy-driven resource manager and Corosync as the underlying cluster communication engine derived from the OpenAIS project.[17] This split allowed for greater flexibility, enabling Pacemaker to operate independently of specific communication layers and supporting advanced features like active/active clustering configurations.[6]
During the 2010s, Linux-HA gained widespread enterprise adoption, integrating into major distributions such as Red Hat Enterprise Linux 6 in 2010, where Pacemaker became the core of the High Availability Add-On for managing clustered services. Similarly, SUSE incorporated Pacemaker into its Linux Enterprise High Availability Extension starting around the same period, providing scalable clustering for business-critical applications. Releases in the Pacemaker 1.1 series during the early 2010s added native support for multi-site (Geo) clusters, coordinating resources across geographically dispersed sites for disaster recovery.[18]
Recent developments through 2025 have focused on modernization and expanded interoperability. Pacemaker 2.1.2, released in 2021, included various improvements such as better fencing delay handling and tool enhancements.[19] Ticket-based arbitration for geo clusters via the Booth cluster ticket manager had been added in earlier releases during the mid-2010s. Integration with container orchestration platforms like Kubernetes through resource agents and operators has enabled traditional HA clusters to manage stateful workloads in hybrid environments.[20] In 2025, the major Pacemaker 3.0.0 release on January 8 introduced significant changes, including to upgrade compatibility, followed by 3.0.1 on August 7.[21] The project has long utilized GitHub for primary source code hosting to improve collaboration and version control.[22] Overall governance has transitioned to the ClusterLabs community, an open-source collective that oversees development, maintenance, and contributions for Pacemaker, Corosync, and related tools.[23]
Core Components
Pacemaker
Pacemaker serves as the central policy engine in Linux-HA clusters, responsible for starting, stopping, monitoring, and migrating resources to maintain high availability based on the cluster's state and user-defined constraints.[24] It processes events such as node failures or service disruptions, deciding actions to ensure resources remain active and data integrity is preserved through mechanisms like fencing integration. This resource orchestration allows for flexible configurations, including active/passive and active/active setups, across multiple nodes.
Key features of Pacemaker include support for various resource agent standards, such as Open Cluster Framework (OCF), Linux Standards Base (LSB), and systemd, which enable the management of diverse services like databases, web servers, and virtual machines through standardized scripts.[24] It enforces colocation and ordering constraints to dictate resource dependencies—for instance, ensuring a database starts before an associated web server or colocating related services on the same node to optimize performance and reliability.[24] Additional capabilities encompass failure thresholds for automatic migration, live resource relocation without downtime for compatible agents, and advanced monitoring intervals to detect issues promptly.[25]
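As an illustration of how such constraints are typically expressed, the following pcs commands sketch an ordered, colocated database/web-server pair; the resource names are illustrative, and real deployments would supply agent-specific parameters.

```bash
# Hypothetical resources; agent defaults are assumed for brevity.
pcs resource create WebDB ocf:heartbeat:mysql op monitor interval=30s
pcs resource create WebServer ocf:heartbeat:apache op monitor interval=30s

# Start the database before the web server, and keep both on the same node.
pcs constraint order start WebDB then start WebServer
pcs constraint colocation add WebServer with WebDB INFINITY
```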
Architecturally, Pacemaker relies on the Cluster Resource Manager (CRM) daemon, now implemented as pacemaker-controld, which coordinates decisions and actions across the cluster.[24] It integrates with the Cluster Information Base (CIB), an XML-based repository that stores and synchronizes configuration, status, and history data among nodes, allowing for real-time updates and queries via tools like crm_mon.[24] The system communicates with the underlying cluster membership layer, such as Corosync, to receive node status updates.[24]
Pacemaker originated as a spin-off from the Heartbeat project within the Linux-HA initiative around 2007-2008, evolving into an independent resource manager to enhance flexibility beyond Heartbeat's integrated approach.[26] As of November 2025, the current stable release is version 3.0.1 (released August 2025).[19] These developments build on earlier versions by improving scalability and integration with modern environments, including support for promotable clones and multi-tenant fencing.[24]
Pacemaker powers high-availability setups in major distributions, including Red Hat Enterprise Linux (RHEL), where it underpins the High Availability Add-On, and SUSE Linux Enterprise, where it forms the core of the High Availability Extension; virtualization platforms such as Proxmox Virtual Environment rely on the related Corosync layer for cluster membership while providing their own failover manager for virtual machines. Its widespread adoption stems from its policy-driven automation, which minimizes manual intervention in production environments handling critical workloads.
Corosync
Corosync serves as the foundational communication and membership layer in Linux-HA clusters, providing reliable multicast messaging, node heartbeat detection, and quorum management through the Totem protocol.[27][28] This open-source cluster engine implements the Totem Single Ring Ordering and Membership protocol, ensuring ordered and reliable delivery of messages among cluster nodes while detecting failures via periodic token passing.[17] Heartbeat detection occurs through configurable timeouts, with defaults such as a 1-second token interval and 10-second failure detection window, allowing administrators to adjust totem parameters such as token, consensus, and token_retransmits_before_loss_const in the configuration to suit network conditions.[29]
The protocol relies on UDP-based multicast for intra-cluster communication, enabling efficient group messaging without requiring a central coordinator.[30] For redundancy, Corosync supports multiple communication rings since the introduction of the Kronosnet (KNET) library in version 3.0 in 2018, which facilitates link aggregation, automatic failover, and multipathing across network interfaces.[31] KNET enhances fault tolerance by allowing up to eight redundant links, ensuring message delivery even if individual paths fail, and integrates seamlessly with the Totem layer for fragmentation and reassembly.[32]
Corosync manages quorum to prevent split-brain scenarios, using the votequorum service where each node typically holds one vote, requiring a majority (e.g., 50% + 1) for cluster operations to proceed.[33] Upon failure detection, it can trigger automatic node isolation, with behavior tuned through quorum options such as expected_votes, auto_tie_breaker for even-sized clusters, and two_node for two-node setups. Configuration is handled through the corosync.conf file, located at /etc/corosync/corosync.conf, which defines the transport (e.g., transport: knet, or rrp_mode: active for redundant rings on the legacy UDP transport), the node list with ring addresses, and quorum parameters.[34][35]
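A minimal corosync.conf along these lines might look as follows; the cluster name, addresses, and node names are illustrative.

```
totem {
    version: 2
    cluster_name: demo-cluster
    transport: knet
    token: 3000            # token timeout in milliseconds
}

nodelist {
    node {
        ring0_addr: 192.168.122.11
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.122.12
        name: node2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1            # relax the majority requirement for a 2-node cluster
}
```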
Evolving from the OpenAIS project in 2008, Corosync was refactored to focus on core infrastructure primitives, separating messaging from higher-level APIs.[17] The project has since advanced, with version 3.1.9 (as of mid-2025) providing maintenance updates.[36] These improvements, such as support for multiple cryptographic libraries (e.g., NSS, OpenSSL), ensure secure communication in production environments.[37] Corosync integrates with Pacemaker by delivering cluster state events for resource management.[12]
Supporting Tools
Linux-HA relies on several supporting tools that extend its core functionality, providing legacy compatibility, resource management scripts, monitoring capabilities, fencing mechanisms, and configuration interfaces. These tools integrate with the primary components to enable flexible high-availability setups across diverse environments.[4]
Heartbeat, the original clustering subsystem developed for the Linux-HA project, facilitated basic high-availability features such as IP address failover and node monitoring through heartbeat messaging prior to 2008. Although deprecated in favor of more robust alternatives like Corosync, it remains available for simple, low-complexity setups where minimal configuration is preferred.[38][39]
Resource agents in Linux-HA adhere to the Open Cluster Framework (OCF) standard, consisting of standardized scripts that define start, stop, monitor, and status operations for cluster resources. For instance, the ocf:heartbeat:IPaddr agent manages virtual IP addresses, while others handle services like Apache web servers (ocf:heartbeat:apache) and MySQL databases (ocf:heartbeat:mysql), with over 100 such agents available in the official repository to support a wide range of applications. These agents allow Pacemaker to abstract and orchestrate third-party services without custom coding.[40][14][41]
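To illustrate the contract these agents follow, the sketch below shows a minimal OCF-style shell agent for a hypothetical daemon ("mydaemon"); paths are assumptions, and real agents additionally emit full XML meta-data and implement actions such as validate-all.

```bash
#!/bin/sh
# Minimal OCF-style agent sketch; "mydaemon" and its paths are hypothetical.
: "${OCF_ROOT:=/usr/lib/ocf}"
. "${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs"     # defines OCF_SUCCESS, OCF_NOT_RUNNING, ...

PIDFILE=/var/run/mydaemon.pid

daemon_monitor() {
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null \
        && return "$OCF_SUCCESS"
    return "$OCF_NOT_RUNNING"
}

daemon_start() {
    daemon_monitor && return "$OCF_SUCCESS"          # already running
    /usr/sbin/mydaemon --pidfile "$PIDFILE" || return "$OCF_ERR_GENERIC"
    return "$OCF_SUCCESS"
}

daemon_stop() {
    daemon_monitor || return "$OCF_SUCCESS"          # already stopped
    kill "$(cat "$PIDFILE")" && return "$OCF_SUCCESS"
    return "$OCF_ERR_GENERIC"
}

case "$1" in
    start)     daemon_start ;;
    stop)      daemon_stop ;;
    monitor)   daemon_monitor ;;
    meta-data) echo "<!-- a real agent prints its full OCF meta-data XML here -->" ;;
    *)         exit "$OCF_ERR_UNIMPLEMENTED" ;;
esac
```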
Monitoring integrations enhance Linux-HA by incorporating external tools for proactive health checks that inform cluster decisions. Nagios plugins, for example, can be deployed as OCF-compliant resources within Pacemaker to monitor remote services and trigger failovers based on detected issues, enabling seamless feedback loops between monitoring and resource management. Similarly, tools like Monit provide lightweight process supervision that can feed status updates into the cluster stack for automated responses.[42][43]
Fencing agents, essential for STONITH (Shoot The Other Node In The Head) operations, ensure safe node isolation during failures by interfacing with hardware devices. Common implementations include the fence_ipmilan agent for IPMI-based power control on servers and fence_apc or fence_apc_snmp for APC Power Distribution Units (PDUs), which allow the cluster to remotely power off malfunctioning nodes to prevent data corruption. These agents are configured as dedicated resources and support a variety of hardware vendors for reliable enforcement.[44][45][46]
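A typical IPMI fencing resource can be declared with pcs roughly as follows; the device address, credentials, and node name are placeholders, and parameter names follow current fence-agents releases.

```bash
pcs stonith create fence-node1 fence_ipmilan \
    ip=10.0.0.101 username=admin password=secret lanplus=1 \
    pcmk_host_list=node1 \
    op monitor interval=60s
```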
Additional utilities streamline cluster administration: CRMsh offers a command-line shell for configuring and querying Pacemaker resources in a structured, scriptable manner, supporting complex operations like resource migration and constraint definition. Hawk, a web-based graphical user interface primarily associated with SUSE distributions, provides visual tools for real-time monitoring, resource editing, and status visualization, making it accessible for administrators managing Pacemaker-based clusters. These tools work in tandem with Pacemaker and Corosync to simplify deployment and maintenance without altering core behaviors.[47][48][49]
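For comparison with the pcs examples elsewhere in this article, a rough crmsh equivalent for a virtual IP resource looks like this (the address is illustrative):

```bash
crm configure primitive vip ocf:heartbeat:IPaddr2 \
    params ip=192.168.122.120 cidr_netmask=24 \
    op monitor interval=30s timeout=20s
crm configure show vip     # display the generated configuration
crm status                 # overall cluster view, comparable to crm_mon
```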
Architecture
Cluster Communication Layer
The cluster communication layer in Linux-HA serves as the foundational infrastructure for enabling reliable and ordered message delivery among cluster nodes, ensuring state synchronization and membership awareness even in the presence of failures.[17] This layer handles the dissemination of heartbeat signals, configuration updates, and status notifications, allowing nodes to maintain a consistent view of the cluster topology and preventing desynchronization during transient network issues or node departures.[50]
At its core, the layer employs the Totem protocol, which operates via a single-ring ordering mechanism for multicast communication, guaranteeing that messages are delivered in the same sequence to all nodes.[50] In configurations supporting redundancy, Totem extends to multiring setups, where multiple independent communication paths distribute messages concurrently to enhance fault tolerance.[29] Where multicast is unavailable or filtered by the network, the transport can instead be configured for unicast delivery to specific peers, maintaining connectivity where possible without compromising message ordering.[17]
To mitigate split-brain scenarios, where partitioned subsets of nodes might independently assume cluster control, the layer implements a quorum model based on dynamic majority voting.[51] This model calculates the required votes for quorate status as half of the expected_votes parameter plus one, with expected_votes typically auto-derived from the node count but manually configurable for scenarios like maintenance or uneven node weights.[52] Only quorate partitions proceed with operations, ensuring that minority partitions remain passive until reconciliation.[51]
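For example, in a five-node cluster with one vote per node, quorum requires ⌊5/2⌋ + 1 = 3 votes, so a partition containing only two nodes stays passive while the three-node partition continues to run resources.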
Redundancy is achieved through support for dual or multiple communication rings, each operating as an independent Totem instance, allowing the cluster to survive the failure of an entire ring without message loss.[29] Token timeouts, configurable in milliseconds (e.g., defaulting to 3000 ms), govern failure detection by triggering reconfiguration if a token is not received within the interval, balancing responsiveness against false positives in variable networks.[34][29]
Security features include built-in authentication using symmetric keys generated via tools like corosync-keygen, which verifies message origins and prevents unauthorized node participation.[53] Optional encryption, leveraging algorithms such as AES-256 alongside HMAC-SHA-256 for integrity, protects message confidentiality over untrusted networks.[54][29] Corosync provides the primary implementation of this layer in modern Linux-HA setups.[17]
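Enabling these protections typically amounts to distributing a shared key and selecting ciphers in the totem section; a minimal sketch:

```bash
corosync-keygen       # writes /etc/corosync/authkey; copy the key to every node
```

```
totem {
    crypto_cipher: aes256       # message encryption
    crypto_hash: sha256         # HMAC integrity checking
}
```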
Resource Management Layer
The Resource Management Layer in Linux-HA, primarily implemented by the Pacemaker cluster resource manager, oversees the allocation, monitoring, and migration of resources across cluster nodes to ensure high availability and service continuity. This layer abstracts resource lifecycle management from underlying node operations, using policy-driven decisions to handle placement, state transitions, and recovery. It operates on top of reliable cluster communication, coordinating actions that maintain desired service states even during node failures or maintenance.[12][55]
Central to this layer are key components that facilitate dynamic resource handling. The Cluster Information Base (CIB) serves as an XML-based, synchronized repository storing the live cluster configuration, resource definitions, node attributes, and current status, enabling all nodes to maintain a consistent view managed by the designated coordinator. The Policy Engine (PE), implemented as the pacemaker-schedulerd daemon, acts as the decision-making core, incorporating a transition engine to orchestrate state changes—such as starting, stopping, or promoting resources—and a constraint solver to evaluate placement rules including location preferences, colocation requirements (e.g., ensuring dependent resources run together), and ordering constraints (e.g., starting a database before its controller). These components process inputs to generate actionable graphs of operations, ensuring resources align with administrative policies.[12][56][55]
Failover logic within the layer emphasizes proactive monitoring and prioritized recovery to minimize downtime. Resources are periodically probed through monitor operations executed by resource agents, which assess health at configurable intervals (e.g., every 10 seconds for critical services); each failure increments a per-resource failure counter, and once the counter reaches the resource's migration-threshold the PE relocates the resource to a suitable node. Placement decisions combine scores ranging from -∞ (strong avoidance) to +∞ (mandatory placement) with resource-stickiness, which by default is 1 for clone instances, encouraging resources to remain on their current node unless overridden by constraints or failures and thus balancing stability against load distribution.[12][56][55]
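In practice these policies are tuned through resource defaults and meta-attributes; a sketch using pcs, with an illustrative resource name and syntax per recent pcs releases:

```bash
pcs resource defaults update resource-stickiness=100   # prefer staying put after a failover
pcs resource meta WebServer migration-threshold=3 failure-timeout=120s
pcs resource update WebServer op monitor interval=10s timeout=20s
```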
For scalability, the layer supports multi-tenancy through resource sets and templated configurations that isolate workloads, allowing multiple independent services to share cluster infrastructure without interference. It also accommodates Geo clusters by leveraging location constraints and node attributes (e.g., site identifiers) to distribute resources across geographically dispersed sites, enabling site failover with minimal data loss in setups like N+M redundancy models supporting up to 32 nodes. Integration with the Local Resource Manager (LRM), via the pacemaker-execd or pacemaker-lrmd daemon, ensures node-local execution of PE-directed actions—such as invoking OCF-compliant resource agents for start/stop/monitor—while relaying status back to the CIB for cluster-wide consistency.[12][56][55]
Fencing Mechanisms
Fencing mechanisms in Linux-HA clusters are essential for maintaining data integrity by isolating failed or unresponsive nodes, thereby preventing scenarios such as split-brain where multiple nodes simultaneously access shared resources like storage, leading to potential corruption from dual writes.[12][57] These mechanisms ensure that a node is definitively offline before resources are reassigned to another node, avoiding interference from corrupted or rogue processes.[45]
The primary fencing method in Linux-HA is STONITH, an acronym for "Shoot The Other Node In The Head," which employs external agents to forcibly power off or reset a failed node.[12][57] STONITH devices, configured as cluster resources, include hardware interfaces such as IPMI for remote power control or SSH for scripted shutdowns, ensuring the action occurs outside the cluster's internal communication to avoid reliance on potentially compromised paths.[12][45]
Fencing types in Linux-HA are categorized as soft or hard, allowing flexibility based on the environment. Soft fencing, such as with the fence_vmware agent for virtual machines, attempts non-destructive isolation like network disconnection or graceful shutdown before escalating.[12][57] Hard fencing, exemplified by the fence_apc agent for power distribution units (PDUs), directly cuts power to ensure immediate and irreversible node termination.[12][57] Configurable delays, such as a 60-second postponement after failure detection or randomized intervals via parameters like pcmk_delay_max, help coordinate actions in multi-node setups and prevent premature fencing during transient issues.[12][57]
STONITH integrates with quorum policy so that only a quorate partition initiates fencing, ensuring decisions reflect majority consensus and avoiding duelling fencing actions in partitioned clusters.[12][45] For even-numbered node counts, an external quorum device acting as a witness (such as corosync-qdevice) supplies a tie-breaking vote so that quorum can be established and fencing can proceed reliably.[12][45]
Best practices for Linux-HA fencing emphasize redundancy and validation to enhance reliability. Deploying multiple fencing devices, such as combining IPMI with PDUs, mitigates single points of failure in the fencing topology.[12][45] Testing configurations, including parameters such as pcmk_host_map that map node hostnames to device ports, ensures precise targeting during fencing operations.[12] These mechanisms are typically triggered by failure events detected through Pacemaker's monitoring.[12]
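A sketch of such a configuration, with a PDU-based device mapping node names to outlet numbers and a randomized fencing delay (address, credentials, and outlet numbers are hypothetical):

```bash
pcs stonith create fence-pdu fence_apc \
    ip=10.0.0.200 username=apc password=secret \
    pcmk_host_map="node1:1;node2:2" \
    pcmk_delay_max=10s \
    op monitor interval=60s
```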
Implementation
Cluster Setup Process
Setting up a Linux-HA cluster using Pacemaker and Corosync requires careful attention to prerequisites to ensure compatibility and reliable communication. Nodes should run a homogeneous operating system, such as Red Hat Enterprise Linux 8 or later (or equivalents like AlmaLinux 9), to avoid version mismatches in cluster software and kernel features.[55][58] Shared or replicated storage (for example GFS2 on shared disks, or DRBD) is optional for basic setups but necessary for stateful resources; network isolation via a dedicated private interface is recommended to separate cluster traffic from public networks, using static IP addresses for stability.[55][58]
Installation begins with enabling the High Availability repository on each node, for example, using dnf config-manager --set-enabled highavailability on RHEL-compatible systems.[58] Install the required packages via the package manager, such as dnf install pacemaker pcs corosync fence-agents-all (or yum on older versions), which includes Pacemaker for resource management and Corosync for communication.[55][58] Configure the firewall to allow high-availability services, e.g., firewall-cmd --permanent --add-service=high-availability followed by firewall-cmd --reload.[55][58]
For basic configuration, start and enable the PCS daemon with systemctl enable --now pcsd.service, which facilitates cluster management.[55] Set a common password for the hacluster user on all nodes using passwd hacluster, then authenticate nodes with pcs host auth <node1> <node2>.[55][58] Generate the cluster configuration, which creates and synchronizes corosync.conf across nodes, using pcs cluster setup <clustername> <node1> <node2>; this command also handles authentication keys internally, equivalent to manually generating a shared secret with corosync-keygen.[55][58] The cluster name and node IDs (e.g., 1 for node1, 2 for node2) are defined within this step, and hostname resolution must work via /etc/hosts or DNS.[55] In manual configurations without pcs, corosync.conf is edited directly (the totem, quorum, and nodelist sections), and corosync-cfgtool can be used to verify ring status once the cluster is running.[58]
Start the cluster services with pcs cluster start --all and enable them for boot using systemctl enable corosync pacemaker.[55][58] For basic testing, disable fencing (STONITH) temporarily with pcs property set stonith-enabled=false, noting this is not recommended for production.[55]
Verification involves monitoring cluster status with crm_mon (or pcs status) to confirm all nodes are online and no resources are failing.[55][58] Check Corosync ring status using corosync-cfgtool -s, which should show faultless links.[58] To test, create a simple resource like a virtual IP with pcs resource create test-ip ocf:heartbeat:IPaddr2 ip=192.168.122.150 cidr_netmask=24, then verify its status and placement with pcs status or crm_resource --resource test-ip --locate.[55]
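Condensing the steps above, a two-node setup and smoke test on a RHEL-compatible system might look like the following (node names, cluster name, and addresses are examples):

```bash
dnf config-manager --set-enabled highavailability
dnf install -y pacemaker pcs corosync fence-agents-all
firewall-cmd --permanent --add-service=high-availability && firewall-cmd --reload

systemctl enable --now pcsd.service
passwd hacluster                              # same password on every node
pcs host auth node1 node2
pcs cluster setup demo-cluster node1 node2
pcs cluster start --all
systemctl enable corosync pacemaker

pcs property set stonith-enabled=false        # testing only; keep fencing enabled in production
pcs resource create test-ip ocf:heartbeat:IPaddr2 \
    ip=192.168.122.150 cidr_netmask=24 op monitor interval=30s

pcs status                                    # nodes online, test-ip started
corosync-cfgtool -s                           # link status with no faults
```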
Configuration and Management
Linux-HA clusters, managed primarily through Pacemaker, rely on the Cluster Information Base (CIB) for defining resources in XML format.[12] Resources are specified as primitives, groups, or clones within the <resources> section of the CIB, with each primitive identifying its class, type, and provider.[12] For example, a virtual IP resource using the IPaddr2 agent is defined as follows:
```xml
<primitive id="ClusterIP" class="ocf" type="IPaddr2" provider="heartbeat">
  <instance_attributes id="ClusterIP-params">
    <nvpair id="ClusterIP-ip" name="ip" value="192.168.122.120"/>
    <nvpair id="ClusterIP-cidr_netmask" name="cidr_netmask" value="24"/>
  </instance_attributes>
  <operations>
    <op id="ClusterIP-monitor" name="monitor" interval="30s" timeout="20s"/>
  </operations>
</primitive>
```
<primitive id="ClusterIP" class="ocf" type="IPaddr2" provider="heartbeat">
<instance_attributes id="ClusterIP-params">
<nvpair id="ClusterIP-ip" name="ip" value="192.168.122.120"/>
<nvpair id="ClusterIP-cidr_netmask" name="cidr_netmask" value="24"/>
</instance_attributes>
<operations>
<op id="ClusterIP-monitor" name="monitor" interval="30s" timeout="20s"/>
</operations>
</primitive>
This configuration ensures the IP address is managed and monitored appropriately.[12] Constraints, such as colocation, are added under the <constraints> section to enforce resource placement rules, using scores to indicate preference or requirement.[12] A mandatory colocation constraint, for instance, ties two resources to the same node with an infinite score:
```xml
<rsc_colocation id="colocate-ip-web" score="INFINITY" rsc="ClusterIP" with-rsc="Webserver"/>
```
Here, INFINITY (equivalent to 1,000,000) makes the colocation mandatory, preventing the resources from running separately.[12]
Configuration and management are facilitated by command-line tools like pcs and cibadmin.[59] The pcs tool provides a user-friendly interface for creating and modifying resources, such as pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=192.168.122.120 cidr_netmask=24 op monitor interval=30s.[59] For direct XML edits to the live CIB, cibadmin is used, for example, cibadmin --create --xml-file resource.xml --obj_type resources to add a new resource definition.[59] Constraints can similarly be managed via pcs, like pcs constraint colocation add ClusterIP with Webserver INFINITY.[59]
Monitoring involves real-time status viewing and logging mechanisms to track cluster health.[59] The crm_mon utility offers a dynamic display of cluster state, resources, and nodes, invoked with crm_mon for continuous output or crm_mon -1 for a one-time snapshot.[59] Logging is handled through syslog, with Pacemaker-specific entries in /var/log/pacemaker/pacemaker.log or integrated into /var/log/messages, and logs rotate automatically at 100MB or weekly intervals.[59] Alerts for failures are configured in the CIB under <alerts>, such as defining an SNMP alert script: <alert id="snmp_alert" path="/path/to/alert_snmp.sh"/>, which triggers on events like resource failures.[59]
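Alerts can also be managed without hand-editing the CIB; a brief sketch using pcs, where the script path and recipient address are placeholders:

```bash
pcs alert create path=/usr/local/bin/alert_snmp.sh id=snmp_alert
pcs alert recipient add snmp_alert value=192.168.122.200
pcs alert config          # list configured alerts and recipients
```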
Maintenance tasks include performing rolling upgrades and backing up configurations to ensure operational continuity.[59] Rolling upgrades proceed node-by-node, draining resources from one node before upgrading it, provided version compatibility is maintained (e.g., Pacemaker 2.x requires Corosync 2.3+).[59] The CIB can be saved and restored with cibadmin, for example cibadmin --query > /path/to/backup.xml to dump the current configuration and cibadmin --replace --xml-file /path/to/backup.xml to restore it; the higher-level pcs config backup command archives the full cluster configuration. The resulting dump is plain XML, so it can also be versioned or edited offline.[59]
Troubleshooting focuses on log analysis and simulating scenarios to diagnose issues.[59] Logs are primarily located in /var/log/pacemaker/, where errors can be filtered with commands like grep 'pacemaker.*error' /var/log/pacemaker/pacemaker.log.[59] Common problems include network partitions, which Pacemaker mitigates through fencing mechanisms to isolate faulty nodes and maintain quorum.[59] The crm_simulate tool aids diagnosis by replaying cluster transitions from saved CIB snapshots or the scheduler input files kept under /var/lib/pacemaker/pengine/, for example crm_simulate --simulate --xml-file transition.xml.[59]
Applications
Common Use Cases
Linux-HA clusters, leveraging Pacemaker as the resource manager, are commonly deployed for database high availability to ensure minimal downtime during failures. In such setups, tools like DRBD provide synchronous block-level replication for shared storage, enabling failover configurations for databases including PostgreSQL and MySQL.[60] For instance, Pacemaker monitors the primary database instance and, upon detecting a failure, promotes the standby node by mounting the replicated DRBD resource and starting the database service. Automatic migration of a virtual IP (VIP) address facilitates seamless client reconnection without manual intervention.[61]
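A simplified pcs sketch of such a stack, assuming a DRBD resource named r0, an ext4 filesystem for the database, and the distribution's PostgreSQL unit; names, device paths, and the address are illustrative, the syntax follows recent pcs releases, and production setups add role-specific monitor operations:

```bash
pcs resource create drbd-data ocf:linbit:drbd drbd_resource=r0 \
    promotable promoted-max=1 clone-max=2

pcs resource create db-fs ocf:heartbeat:Filesystem \
    device=/dev/drbd0 directory=/var/lib/pgsql fstype=ext4
pcs resource create db-vip ocf:heartbeat:IPaddr2 ip=192.168.122.200 cidr_netmask=24
pcs resource create db-server systemd:postgresql
pcs resource group add db-group db-fs db-vip db-server

# Run the group only where DRBD is promoted, and only after promotion completes.
# (Older tool and Pacemaker releases call the promoted role "Master".)
pcs constraint colocation add db-group with Promoted drbd-data-clone INFINITY
pcs constraint order promote drbd-data-clone then start db-group
```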
Web services represent another key application, where Linux-HA enables load-balanced Apache clusters to maintain availability under high traffic or node failures. Pacemaker coordinates active/passive or active/active configurations, often integrating with HAProxy for traffic distribution across Apache instances while ensuring session persistence through shared storage or sticky sessions.[62] This setup allows for automatic failover of the load balancer itself, preventing single points of failure in web infrastructures.[63]
For file services, Linux-HA supports active/active access to shared storage using GFS2, a clustered file system that allows multiple nodes to read and write concurrently. Configurations with Samba or NFS over GFS2, managed by Pacemaker, provide high-availability file sharing in environments requiring scalable storage, such as enterprise networks.[64] Pacemaker handles resource fencing and lock management via the Distributed Lock Manager (DLM) to prevent data corruption during concurrent operations.[65]
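The RHEL-style pattern for this is to clone a DLM control daemon and the GFS2 filesystem across all nodes; a condensed sketch in which the device path and mount point are illustrative:

```bash
pcs resource create dlm ocf:pacemaker:controld \
    op monitor interval=30s on-fail=fence \
    clone interleave=true ordered=true

pcs resource create clusterfs ocf:heartbeat:Filesystem \
    device=/dev/vg_cluster/lv_gfs2 directory=/mnt/gfs2 fstype=gfs2 \
    op monitor interval=10s on-fail=fence \
    clone interleave=true

# Mount GFS2 only where (and after) the lock manager is running.
pcs constraint order start dlm-clone then clusterfs-clone
pcs constraint colocation add clusterfs-clone with dlm-clone
```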
Virtualization platforms benefit from Linux-HA through high-availability setups for KVM/QEMU virtual machines. Pacemaker can manage guests as cluster resources (for example via the ocf:heartbeat:VirtualDomain agent), automatically restarting or migrating them to healthy nodes upon host failure, with shared storage such as Ceph or GFS2 for data persistence; Proxmox VE provides comparable failover through its own ha-manager built on the Corosync membership layer.[66] This integration supports seamless workload relocation, minimizing disruption in virtualized data centers.[67]
In these deployments, Linux-HA typically achieves a Recovery Time Objective (RTO) of under one minute, as failover detection and resource promotion occur in seconds to tens of seconds depending on cluster size and configuration.[56] Real-world examples include CERN's use of Pacemaker for high-availability load balancing in database middleware, ensuring continuous operation of critical services.[63]
Integration in Distributions
Linux-HA components, particularly Pacemaker and Corosync, are integrated into major Linux distributions through dedicated high availability packages and extensions that facilitate cluster management and failover capabilities.[12]
In Red Hat Enterprise Linux (RHEL) and its community counterpart CentOS, the High Availability Add-On has been available since RHEL 6, released in 2010, providing enterprise-grade clustering tools built on Linux-HA foundations.[68] This add-on includes the pcs command-line interface for cluster configuration and management, as well as fence-agents for node fencing to ensure clean failovers.[69] Additionally, the Resilient Storage Add-On complements these features by enabling concurrent access to shared storage in highly available clusters, supporting technologies like GFS2 filesystems for data integrity.[70] As of 2025, RHEL 10 enhancements extend HA capabilities to edge computing environments, incorporating optimized image-based deployments and live kernel patching to minimize downtime in distributed setups.[71]
SUSE Linux Enterprise (SLE) offers the High Availability Extension, which integrates Linux-HA tools like Pacemaker for resource management and supports advanced clustering features.[72] Key components include the Hawk web-based user interface for intuitive monitoring and administration of clusters, allowing administrators to visualize resource states and dependencies.[73] The extension also provides Geo clustering support, enabling coordinated failover across geographically dispersed sites for disaster recovery scenarios.[74]
For Ubuntu and Debian, Pacemaker and related Linux-HA packages are readily available through official repositories, allowing straightforward installation via package managers like apt.[75][76] These distributions integrate Pacemaker with cloud orchestration tools such as Juju, where subordinate charms like HAcluster and pacemaker-remote enable automated deployment of high availability setups for virtual IPs and services in cloud environments.[77][78]
Other distributions and platforms extend Linux-HA functionality in specialized ways; for instance, Proxmox Virtual Environment (VE) builds its ha-manager on the Corosync membership layer rather than on Pacemaker, providing automated VM and container recovery in hyper-converged clusters.[66]