Linux-HA
Linux-HA, also known as the High-Availability Linux project, was an open-source initiative that developed clustering software to provide high availability for applications and services on Linux and other Unix-like operating systems, including FreeBSD, OpenBSD, Solaris, and Mac OS X.[1] The project focused on creating resilient systems that minimize downtime by detecting failures and automatically transferring workloads to healthy nodes in a cluster.[2]
Originating in the late 1990s, Linux-HA introduced key components such as the Heartbeat subsystem, which served as the core engine for cluster membership, inter-node communication, and resource failover.[1] Heartbeat enabled active-passive and active-active configurations, supporting scalable clusters without a fixed maximum number of nodes, and integrated with tools like STONITH for fencing failed nodes to prevent data corruption.[3] Over time, elements of Linux-HA evolved into independent projects under the ClusterLabs umbrella, including Pacemaker as the cluster resource manager and Corosync for reliable messaging and quorum management.[4] As of 2025, these successor projects are actively maintained, with Pacemaker at version 3.0.1 released in August 2025.[5]
These tools from the Linux-HA lineage are widely adopted in enterprise environments for critical workloads, such as databases, web services, and file systems, offering features like policy-driven resource placement, support for multi-site replication, and integration with replicated block storage such as DRBD.[6] By providing a modular, extensible framework, Linux-HA and its successors enable near-continuous operation, making them foundational to open-source high-availability strategies in production systems.[7]
Introduction
Overview
Linux-HA is an open-source project that provides high-availability solutions for Linux environments, encompassing failover clustering, resource management, and fault tolerance mechanisms to maintain service continuity.[8] As the oldest community-driven high-availability initiative, it enables the creation of resilient clusters that detect and respond to failures, ensuring minimal downtime through automated recovery processes.[3]
The primary purpose of Linux-HA is to achieve near-continuous availability of critical services by implementing redundancy across multiple nodes, allowing for rapid failover in response to events such as hardware crashes, network partitions, or application faults.[8] This approach minimizes outages to seconds or minutes, elevating system reliability from baseline levels like 99% to higher thresholds such as 99.9%.[8] By supporting n-node clusters up to around 32 nodes, it facilitates no-single-point-of-failure architectures suitable for enterprise-scale deployments.[9]
In terms of technical scope, Linux-HA accommodates both active/passive and active/active configurations, enabling the management of diverse services including databases, web servers, file systems, ERP systems, firewalls, and load balancers.[8] Resource management adheres to standards like OCF and LSB, with built-in fault tolerance features such as fencing (STONITH) and quorum to prevent issues like split-brain scenarios.[8] Originating in the late 1990s as a volunteer-led effort, with initial code developed in 1998, the project has influenced subsequent tools through the evolution of components like Heartbeat into modern frameworks such as Pacemaker and Corosync under the ClusterLabs umbrella.[8][4]
Goals and Principles
High availability solutions from the Linux-HA lineage commonly aim for uptime levels such as 99.999%, often referred to as "five nines," which equates to no more than about 5.26 minutes of downtime per year. This is achieved through mechanisms that minimize downtime via rapid failover, typically configurable to occur within seconds to under a minute depending on cluster parameters like monitoring intervals and failure timeouts.[10] The project supports scalable clusters ranging from 2 nodes up to around 32 nodes using underlying layers like Corosync.[11][9]
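The "five nines" figure follows directly from the arithmetic:
(1 − 0.99999) × 365.25 days × 24 hours × 60 minutes ≈ 5.26 minutes of permissible downtime per year.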
Guiding principles of Linux-HA emphasize open-source collaboration, fostering community-driven development and extensibility for integrating with diverse applications and environments.[4] Modularity is central, allowing seamless integration across various Linux distributions and support for heterogeneous hardware without requiring uniform setups.[12] Monitoring is designed to be non-intrusive, relying on lightweight resource agents that probe service health periodically without significant performance overhead.[13]
At its core, Linux-HA employs a conceptual framework to prevent split-brain scenarios—where multiple nodes independently assume control of shared resources—through quorum mechanisms that require a majority of nodes to agree on cluster state before actions proceed.[4] Resource fencing ensures data integrity by isolating failed nodes, such as powering them off, to avoid concurrent access that could lead to corruption.[4]
In contrast to proprietary high-availability solutions, Linux-HA prioritizes standards-based interoperability via Open Cluster Framework (OCF) resource agents, enabling plug-and-play management of services from different vendors.[14] This approach, combined with community extensibility, allows users to customize and expand functionality without vendor lock-in.[4]
History
Early Development
The Linux-HA project was founded in 1998 by Alan Robertson, then at Bell Labs, along with early contributors, as the Heartbeat project to provide high-availability clustering capabilities for Linux systems, which at the time lacked native support for such features.[8][7] The initiative began with the first working code assembled on March 18, 1998, following Robertson's earlier discussions on Linux mailing lists about the need for reliable failover mechanisms.[8]
Initially, the project concentrated on simple IP failover and resource monitoring, achieved through periodic heartbeat messages sent approximately once per second over serial ports or Ethernet links to detect node failures or recoveries.[15] This design leveraged Linux kernel features, such as IP aliasing, to enable seamless transfer of virtual IP addresses between nodes during failover without requiring complex reconfiguration.[15][7] A key early milestone was the release of Heartbeat 0.4 in 1999, which introduced basic clustering functionality and marked the project's first stable version capable of supporting active-passive configurations limited to two nodes.[7][8]
The project's community grew rapidly after being hosted on SourceForge, fostering open development and attracting contributions that expanded its scope beyond Linux.[7] By 2001, ports to FreeBSD and Solaris had been developed, broadening its applicability in heterogeneous environments.[7] These efforts addressed critical challenges of the era, including the relative instability of early Linux kernels for production use and the scarcity of affordable commercial high-availability tools, which were primarily available for proprietary Unix systems like Solaris or AIX.[8][7]
Key Milestones and Evolution
The Linux-HA project advanced significantly between 2004 and 2008, with the release of Heartbeat 2.0 in 2006 introducing comprehensive STONITH (Shoot The Other Node In The Head) support to enable reliable node fencing and prevent split-brain scenarios in clusters.[16] This version enhanced resource agent metadata and cluster management capabilities, building on the foundational Heartbeat software developed in the late 1990s for basic failover detection.
In 2008, the Heartbeat project underwent a major restructuring through a fork that separated its components, resulting in Pacemaker as the policy-driven resource manager and Corosync as the underlying cluster communication engine derived from the OpenAIS project.[17] This split allowed for greater flexibility, enabling Pacemaker to operate independently of specific communication layers and supporting advanced features like active/active clustering configurations.[6]
During the 2010s, Linux-HA gained widespread enterprise adoption, integrating into major distributions such as Red Hat Enterprise Linux 6 in 2010, where Pacemaker became the core of the High Availability Add-On for managing clustered services. Similarly, SUSE incorporated Pacemaker into its Linux Enterprise High Availability Extension starting around the same period, providing scalable clustering for business-critical applications. Releases in the Pacemaker 1.1 series during the early 2010s added native support for multi-site (Geo) clusters, coordinating resources across geographically dispersed sites for disaster recovery.[18]
Recent developments through 2025 have focused on modernization and expanded interoperability. Pacemaker 2.1.2, released in 2021, included various improvements such as better fencing delay handling and tool enhancements.[19] Ticket-based arbitration for geo clusters via the Booth cluster ticket manager had been added in earlier releases during the mid-2010s. Integration with container orchestration platforms like Kubernetes through resource agents and operators has enabled traditional HA clusters to manage stateful workloads in hybrid environments.[20] In 2025, the major Pacemaker 3.0.0 release on January 8 introduced significant changes, including to upgrade compatibility, followed by 3.0.1 on August 7.[21] The project has long utilized GitHub for primary source code hosting to improve collaboration and version control.[22] Overall governance has transitioned to the ClusterLabs community, an open-source collective that oversees development, maintenance, and contributions for Pacemaker, Corosync, and related tools.[23]
Core Components
Pacemaker
Pacemaker serves as the central policy engine in Linux-HA clusters, responsible for starting, stopping, monitoring, and migrating resources to maintain high availability based on the cluster's state and user-defined constraints.[24] It processes events such as node failures or service disruptions, deciding actions to ensure resources remain active and data integrity is preserved through mechanisms like fencing integration. This resource orchestration allows for flexible configurations, including active/passive and active/active setups, across multiple nodes.
Key features of Pacemaker include support for various resource agent standards, such as Open Cluster Framework (OCF), Linux Standards Base (LSB), and systemd, which enable the management of diverse services like databases, web servers, and virtual machines through standardized scripts.[24] It enforces colocation and ordering constraints to dictate resource dependencies—for instance, ensuring a database starts before an associated web server or colocating related services on the same node to optimize performance and reliability.[24] Additional capabilities encompass failure thresholds for automatic migration, live resource relocation without downtime for compatible agents, and advanced monitoring intervals to detect issues promptly.[25]
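As an illustration of how such constraints are typically expressed, the following pcs commands sketch an ordered, colocated database/web-server pair; the resource names are illustrative, and real deployments would supply agent-specific parameters.

```bash
# Hypothetical resources; agent defaults are assumed for brevity.
pcs resource create WebDB ocf:heartbeat:mysql op monitor interval=30s
pcs resource create WebServer ocf:heartbeat:apache op monitor interval=30s

# Start the database before the web server, and keep both on the same node.
pcs constraint order start WebDB then start WebServer
pcs constraint colocation add WebServer with WebDB INFINITY
```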
Architecturally, Pacemaker relies on the Cluster Resource Manager (CRM) daemon, now implemented as pacemaker-controld, which coordinates decisions and actions across the cluster.[24] It integrates with the Cluster Information Base (CIB), an XML-based repository that stores and synchronizes configuration, status, and history data among nodes, allowing for real-time updates and queries via tools like crm_mon.[24] The system communicates with the underlying cluster membership layer, such as Corosync, to receive node status updates.[24]
Pacemaker originated as a spin-off from the Heartbeat project within the Linux-HA initiative around 2007-2008, evolving into an independent resource manager to enhance flexibility beyond Heartbeat's integrated approach.[26] As of November 2025, the current stable release is version 3.0.1 (released August 2025).[19] These developments build on earlier versions by improving scalability and integration with modern environments, including support for promotable clones and multi-tenant fencing.[24]
Pacemaker powers high-availability setups in major distributions, including Red Hat Enterprise Linux (RHEL), where it underpins the High Availability Add-On, and SUSE Linux Enterprise, where it forms the core of the High Availability Extension; virtualization platforms such as Proxmox Virtual Environment rely on the related Corosync layer for cluster membership while providing their own failover manager for virtual machines. Its widespread adoption stems from its policy-driven automation, which minimizes manual intervention in production environments handling critical workloads.
Corosync
Corosync serves as the foundational communication and membership layer in Linux-HA clusters, providing reliable multicast messaging, node heartbeat detection, and quorum management through the Totem protocol.[27][28] This open-source cluster engine implements the Totem Single Ring Ordering and Membership protocol, ensuring ordered and reliable delivery of messages among cluster nodes while detecting failures via periodic token passing.[17] Heartbeat detection occurs through configurable timeouts, with defaults such as a 1-second token interval and 10-second failure detection window, allowing administrators to adjust totem parameters such as token, consensus, and token_retransmits_before_loss_const in the configuration to suit network conditions.[29]
The protocol relies on UDP-based multicast for intra-cluster communication, enabling efficient group messaging without requiring a central coordinator.[30] For redundancy, Corosync supports multiple communication rings since the introduction of the Kronosnet (KNET) library in version 3.0 in 2018, which facilitates link aggregation, automatic failover, and multipathing across network interfaces.[31] KNET enhances fault tolerance by allowing up to eight redundant links, ensuring message delivery even if individual paths fail, and integrates seamlessly with the Totem layer for fragmentation and reassembly.[32]
Corosync manages quorum to prevent split-brain scenarios, using the votequorum service where each node typically holds one vote, requiring a majority (e.g., 50% + 1) for cluster operations to proceed.[33] Upon failure detection, it can trigger automatic node isolation, with behavior tuned through quorum options such as expected_votes, auto_tie_breaker for even-sized clusters, and two_node for two-node setups. Configuration is handled through the corosync.conf file, located at /etc/corosync/corosync.conf, which defines the transport (e.g., transport: knet, or rrp_mode: active for redundant rings on the legacy UDP transport), the node list with ring addresses, and quorum parameters.[34][35]
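A minimal corosync.conf along these lines might look as follows; the cluster name, addresses, and node names are illustrative.

```
totem {
    version: 2
    cluster_name: demo-cluster
    transport: knet
    token: 3000            # token timeout in milliseconds
}

nodelist {
    node {
        ring0_addr: 192.168.122.11
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.122.12
        name: node2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1            # relax the majority requirement for a 2-node cluster
}
```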
Evolving from the OpenAIS project in 2008, Corosync was refactored to focus on core infrastructure primitives, separating messaging from higher-level APIs.[17] The project has since advanced, with version 3.1.9 (as of mid-2025) providing maintenance updates.[36] These improvements, such as support for multiple cryptographic libraries (e.g., NSS, OpenSSL), ensure secure communication in production environments.[37] Corosync integrates with Pacemaker by delivering cluster state events for resource management.[12]
Supporting Tools
Linux-HA relies on several supporting tools that extend its core functionality, providing legacy compatibility, resource management scripts, monitoring capabilities, fencing mechanisms, and configuration interfaces. These tools integrate with the primary components to enable flexible high-availability setups across diverse environments.[4]
Heartbeat, the original clustering subsystem developed for the Linux-HA project, facilitated basic high-availability features such as IP address failover and node monitoring through heartbeat messaging prior to 2008. Although deprecated in favor of more robust alternatives like Corosync, it remains available for simple, low-complexity setups where minimal configuration is preferred.[38][39]
Resource agents in Linux-HA adhere to the Open Cluster Framework (OCF) standard, consisting of standardized scripts that define start, stop, monitor, and status operations for cluster resources. For instance, the ocf:heartbeat:IPaddr agent manages virtual IP addresses, while others handle services like Apache web servers (ocf:heartbeat:apache) and MySQL databases (ocf:heartbeat:mysql), with over 100 such agents available in the official repository to support a wide range of applications. These agents allow Pacemaker to abstract and orchestrate third-party services without custom coding.[40][14][41]
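To illustrate the contract these agents follow, the sketch below shows a minimal OCF-style shell agent for a hypothetical daemon ("mydaemon"); paths are assumptions, and real agents additionally emit full XML meta-data and implement actions such as validate-all.

```bash
#!/bin/sh
# Minimal OCF-style agent sketch; "mydaemon" and its paths are hypothetical.
: "${OCF_ROOT:=/usr/lib/ocf}"
. "${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs"     # defines OCF_SUCCESS, OCF_NOT_RUNNING, ...

PIDFILE=/var/run/mydaemon.pid

daemon_monitor() {
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null \
        && return "$OCF_SUCCESS"
    return "$OCF_NOT_RUNNING"
}

daemon_start() {
    daemon_monitor && return "$OCF_SUCCESS"          # already running
    /usr/sbin/mydaemon --pidfile "$PIDFILE" || return "$OCF_ERR_GENERIC"
    return "$OCF_SUCCESS"
}

daemon_stop() {
    daemon_monitor || return "$OCF_SUCCESS"          # already stopped
    kill "$(cat "$PIDFILE")" && return "$OCF_SUCCESS"
    return "$OCF_ERR_GENERIC"
}

case "$1" in
    start)     daemon_start ;;
    stop)      daemon_stop ;;
    monitor)   daemon_monitor ;;
    meta-data) echo "<!-- a real agent prints its full OCF meta-data XML here -->" ;;
    *)         exit "$OCF_ERR_UNIMPLEMENTED" ;;
esac
```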
Monitoring integrations enhance Linux-HA by incorporating external tools for proactive health checks that inform cluster decisions. Nagios plugins, for example, can be deployed as OCF-compliant resources within Pacemaker to monitor remote services and trigger failovers based on detected issues, enabling seamless feedback loops between monitoring and resource management. Similarly, tools like Monit provide lightweight process supervision that can feed status updates into the cluster stack for automated responses.[42][43]
Fencing agents, essential for STONITH (Shoot The Other Node In The Head) operations, ensure safe node isolation during failures by interfacing with hardware devices. Common implementations include the fence_ipmilan agent for IPMI-based power control on servers and fence_apc or fence_apc_snmp for APC Power Distribution Units (PDUs), which allow the cluster to remotely power off malfunctioning nodes to prevent data corruption. These agents are configured as dedicated resources and support a variety of hardware vendors for reliable enforcement.[44][45][46]
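A typical IPMI fencing resource can be declared with pcs roughly as follows; the device address, credentials, and node name are placeholders, and parameter names follow current fence-agents releases.

```bash
pcs stonith create fence-node1 fence_ipmilan \
    ip=10.0.0.101 username=admin password=secret lanplus=1 \
    pcmk_host_list=node1 \
    op monitor interval=60s
```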
Additional utilities streamline cluster administration: CRMsh offers a command-line shell for configuring and querying Pacemaker resources in a structured, scriptable manner, supporting complex operations like resource migration and constraint definition. Hawk, a web-based graphical user interface primarily associated with SUSE distributions, provides visual tools for real-time monitoring, resource editing, and status visualization, making it accessible for administrators managing Pacemaker-based clusters. These tools work in tandem with Pacemaker and Corosync to simplify deployment and maintenance without altering core behaviors.[47][48][49]
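For comparison with the pcs examples elsewhere in this article, a rough crmsh equivalent for a virtual IP resource looks like this (the address is illustrative):

```bash
crm configure primitive vip ocf:heartbeat:IPaddr2 \
    params ip=192.168.122.120 cidr_netmask=24 \
    op monitor interval=30s timeout=20s
crm configure show vip     # display the generated configuration
crm status                 # overall cluster view, comparable to crm_mon
```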
Architecture
Cluster Communication Layer
The cluster communication layer in Linux-HA serves as the foundational infrastructure for enabling reliable and ordered message delivery among cluster nodes, ensuring state synchronization and membership awareness even in the presence of failures.[17] This layer handles the dissemination of heartbeat signals, configuration updates, and status notifications, allowing nodes to maintain a consistent view of the cluster topology and preventing desynchronization during transient network issues or node departures.[50]
At its core, the layer employs the Totem protocol, which operates via a single-ring ordering mechanism for multicast communication, guaranteeing that messages are delivered in the same sequence to all nodes.[50] In configurations supporting redundancy, Totem extends to multiring setups, where multiple independent communication paths distribute messages concurrently to enhance fault tolerance.[29] Where multicast is unavailable or filtered by the network, the transport can instead be configured for unicast delivery to specific peers, maintaining connectivity where possible without compromising message ordering.[17]
To mitigate split-brain scenarios, where partitioned subsets of nodes might independently assume cluster control, the layer implements a quorum model based on dynamic majority voting.[51] This model calculates the required votes for quorate status as half of the expected_votes parameter plus one, with expected_votes typically auto-derived from the node count but manually configurable for scenarios like maintenance or uneven node weights.[52] Only quorate partitions proceed with operations, ensuring that minority partitions remain passive until reconciliation.[51]
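For example, in a five-node cluster with one vote per node, quorum requires ⌊5/2⌋ + 1 = 3 votes, so a partition containing only two nodes stays passive while the three-node partition continues to run resources.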
Redundancy is achieved through support for dual or multiple communication rings, each operating as an independent Totem instance, allowing the cluster to survive the failure of an entire ring without message loss.[29] Token timeouts, configurable in milliseconds (e.g., defaulting to 3000 ms), govern failure detection by triggering reconfiguration if a token is not received within the interval, balancing responsiveness against false positives in variable networks.[34][29]
Security features include built-in authentication using symmetric keys generated via tools like corosync-keygen, which verifies message origins and prevents unauthorized node participation.[53] Optional encryption, leveraging algorithms such as AES-256 alongside HMAC-SHA-256 for integrity, protects message confidentiality over untrusted networks.[54][29] Corosync provides the primary implementation of this layer in modern Linux-HA setups.[17]
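Enabling these protections typically amounts to distributing a shared key and selecting ciphers in the totem section; a minimal sketch:

```bash
corosync-keygen       # writes /etc/corosync/authkey; copy the key to every node
```

```
totem {
    crypto_cipher: aes256       # message encryption
    crypto_hash: sha256         # HMAC integrity checking
}
```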
Resource Management Layer
The Resource Management Layer in Linux-HA, primarily implemented by the Pacemaker cluster resource manager, oversees the allocation, monitoring, and migration of resources across cluster nodes to ensure high availability and service continuity. This layer abstracts resource lifecycle management from underlying node operations, using policy-driven decisions to handle placement, state transitions, and recovery. It operates on top of reliable cluster communication, coordinating actions that maintain desired service states even during node failures or maintenance.[12][55]
Central to this layer are key components that facilitate dynamic resource handling. The Cluster Information Base (CIB) serves as an XML-based, synchronized repository storing the live cluster configuration, resource definitions, node attributes, and current status, enabling all nodes to maintain a consistent view managed by the designated coordinator. The Policy Engine (PE), implemented as the pacemaker-schedulerd daemon, acts as the decision-making core, incorporating a transition engine to orchestrate state changes—such as starting, stopping, or promoting resources—and a constraint solver to evaluate placement rules including location preferences, colocation requirements (e.g., ensuring dependent resources run together), and ordering constraints (e.g., starting a database before its controller). These components process inputs to generate actionable graphs of operations, ensuring resources align with administrative policies.[12][56][55]
Failover logic within the layer emphasizes proactive monitoring and prioritized recovery to minimize downtime. Resources are periodically probed through monitor operations executed by resource agents, which assess health at configurable intervals (e.g., every 10 seconds for critical services); each failure increments a per-resource failure counter, and once the counter reaches the resource's migration-threshold the PE relocates the resource to a suitable node. Placement decisions combine scores ranging from -∞ (strong avoidance) to +∞ (mandatory placement) with resource-stickiness, which by default is 1 for clone instances, encouraging resources to remain on their current node unless overridden by constraints or failures and thus balancing stability against load distribution.[12][56][55]
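In practice these policies are tuned through resource defaults and meta-attributes; a sketch using pcs, with an illustrative resource name and syntax per recent pcs releases:

```bash
pcs resource defaults update resource-stickiness=100   # prefer staying put after a failover
pcs resource meta WebServer migration-threshold=3 failure-timeout=120s
pcs resource update WebServer op monitor interval=10s timeout=20s
```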
For scalability, the layer supports multi-tenancy through resource sets and templated configurations that isolate workloads, allowing multiple independent services to share cluster infrastructure without interference. It also accommodates Geo clusters by leveraging location constraints and node attributes (e.g., site identifiers) to distribute resources across geographically dispersed sites, enabling site failover with minimal data loss in setups like N+M redundancy models supporting up to 32 nodes. Integration with the Local Resource Manager (LRM), via the pacemaker-execd or pacemaker-lrmd daemon, ensures node-local execution of PE-directed actions—such as invoking OCF-compliant resource agents for start/stop/monitor—while relaying status back to the CIB for cluster-wide consistency.[12][56][55]
Fencing Mechanisms
Fencing mechanisms in Linux-HA clusters are essential for maintaining data integrity by isolating failed or unresponsive nodes, thereby preventing scenarios such as split-brain where multiple nodes simultaneously access shared resources like storage, leading to potential corruption from dual writes.[12][57] These mechanisms ensure that a node is definitively offline before resources are reassigned to another node, avoiding interference from corrupted or rogue processes.[45]
The primary fencing method in Linux-HA is STONITH, an acronym for "Shoot The Other Node In The Head," which employs external agents to forcibly power off or reset a failed node.[12][57] STONITH devices, configured as cluster resources, include hardware interfaces such as IPMI for remote power control or SSH for scripted shutdowns, ensuring the action occurs outside the cluster's internal communication to avoid reliance on potentially compromised paths.[12][45]
Fencing types in Linux-HA are categorized as soft or hard, allowing flexibility based on the environment. Soft fencing, such as with the fence_vmware agent for virtual machines, attempts non-destructive isolation like network disconnection or graceful shutdown before escalating.[12][57] Hard fencing, exemplified by the fence_apc agent for power distribution units (PDUs), directly cuts power to ensure immediate and irreversible node termination.[12][57] Configurable delays, such as a 60-second postponement after failure detection or randomized intervals via parameters like pcmk_delay_max, help coordinate actions in multi-node setups and prevent premature fencing during transient issues.[12][57]
STONITH integrates with quorum policy so that only a quorate partition initiates fencing, ensuring decisions reflect majority consensus and avoiding duelling fencing actions in partitioned clusters.[12][45] For even-numbered node counts, an external quorum device acting as a witness (such as corosync-qdevice) supplies a tie-breaking vote so that quorum can be established and fencing can proceed reliably.[12][45]
Best practices for Linux-HA fencing emphasize redundancy and validation to enhance reliability. Deploying multiple fencing devices, such as combining IPMI with PDUs, mitigates single points of failure in the fencing topology.[12][45] Testing configurations, including parameters such as pcmk_host_map that map node hostnames to device ports, ensures precise targeting during fencing operations.[12] These mechanisms are typically triggered by failure events detected through Pacemaker's monitoring.[12]
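A sketch of such a configuration, with a PDU-based device mapping node names to outlet numbers and a randomized fencing delay (address, credentials, and outlet numbers are hypothetical):

```bash
pcs stonith create fence-pdu fence_apc \
    ip=10.0.0.200 username=apc password=secret \
    pcmk_host_map="node1:1;node2:2" \
    pcmk_delay_max=10s \
    op monitor interval=60s
```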
Implementation
Cluster Setup Process
Setting up a Linux-HA cluster using Pacemaker and Corosync requires careful attention to prerequisites to ensure compatibility and reliable communication. Nodes should run a homogeneous operating system, such as Red Hat Enterprise Linux 8 or later (or equivalents like AlmaLinux 9), to avoid version mismatches in cluster software and kernel features.[55][58] Shared or replicated storage (for example GFS2 on shared disks, or DRBD) is optional for basic setups but necessary for stateful resources; network isolation via a dedicated private interface is recommended to separate cluster traffic from public networks, using static IP addresses for stability.[55][58]
Installation begins with enabling the High Availability repository on each node, for example, using dnf config-manager --set-enabled highavailability on RHEL-compatible systems.[58] Install the required packages via the package manager, such as dnf install pacemaker pcs corosync fence-agents-all (or yum on older versions), which includes Pacemaker for resource management and Corosync for communication.[55][58] Configure the firewall to allow high-availability services, e.g., firewall-cmd --permanent --add-service=high-availability followed by firewall-cmd --reload.[55][58]
For basic configuration, start and enable the PCS daemon with systemctl enable --now pcsd.service, which facilitates cluster management.[55] Set a common password for the hacluster user on all nodes using passwd hacluster, then authenticate nodes with pcs host auth <node1> <node2>.[55][58] Generate the cluster configuration, which creates and synchronizes corosync.conf across nodes, using pcs cluster setup <clustername> <node1> <node2>; this command also handles authentication keys internally, equivalent to manually generating a shared secret with corosync-keygen.[55][58] The cluster name and node IDs (e.g., 1 for node1, 2 for node2) are defined within this step, and hostname resolution must work via /etc/hosts or DNS.[55] In manual configurations without pcs, corosync.conf is edited directly (the totem, quorum, and nodelist sections), and corosync-cfgtool can be used to verify ring status once the cluster is running.[58]
Start the cluster services with pcs cluster start --all and enable them for boot using systemctl enable corosync pacemaker.[55][58] For basic testing, disable fencing (STONITH) temporarily with pcs property set stonith-enabled=false, noting this is not recommended for production.[55]
Verification involves monitoring cluster status with crm_mon (or pcs status) to confirm all nodes are online and no resources are failing.[55][58] Check Corosync ring status using corosync-cfgtool -s, which should show faultless links.[58] To test, create a simple resource like a virtual IP with pcs resource create test-ip ocf:heartbeat:IPaddr2 ip=192.168.122.150 cidr_netmask=24, then verify its status and placement with pcs status or crm_resource --resource test-ip --locate.[55]
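Condensing the steps above, a two-node setup and smoke test on a RHEL-compatible system might look like the following (node names, cluster name, and addresses are examples):

```bash
dnf config-manager --set-enabled highavailability
dnf install -y pacemaker pcs corosync fence-agents-all
firewall-cmd --permanent --add-service=high-availability && firewall-cmd --reload

systemctl enable --now pcsd.service
passwd hacluster                              # same password on every node
pcs host auth node1 node2
pcs cluster setup demo-cluster node1 node2
pcs cluster start --all
systemctl enable corosync pacemaker

pcs property set stonith-enabled=false        # testing only; keep fencing enabled in production
pcs resource create test-ip ocf:heartbeat:IPaddr2 \
    ip=192.168.122.150 cidr_netmask=24 op monitor interval=30s

pcs status                                    # nodes online, test-ip started
corosync-cfgtool -s                           # link status with no faults
```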
Configuration and Management
Linux-HA clusters, managed primarily through Pacemaker, rely on the Cluster Information Base (CIB) for defining resources in XML format.[12] Resources are specified as primitives, groups, or clones within the <resources> section of the CIB, with each primitive identifying its class, type, and provider.[12] For example, a virtual IP resource using the IPaddr2 agent is defined as follows:
```xml
<primitive id="ClusterIP" class="ocf" type="IPaddr2" provider="heartbeat">
  <instance_attributes id="ClusterIP-params">
    <nvpair id="ClusterIP-ip" name="ip" value="192.168.122.120"/>
    <nvpair id="ClusterIP-cidr_netmask" name="cidr_netmask" value="24"/>
  </instance_attributes>
  <operations>
    <op id="ClusterIP-monitor" name="monitor" interval="30s" timeout="20s"/>
  </operations>
</primitive>
```
<primitive id="ClusterIP" class="ocf" type="IPaddr2" provider="heartbeat">
<instance_attributes id="ClusterIP-params">
<nvpair id="ClusterIP-ip" name="ip" value="192.168.122.120"/>
<nvpair id="ClusterIP-cidr_netmask" name="cidr_netmask" value="24"/>
</instance_attributes>
<operations>
<op id="ClusterIP-monitor" name="monitor" interval="30s" timeout="20s"/>
</operations>
</primitive>
This configuration ensures the IP address is managed and monitored appropriately.[12] Constraints, such as colocation, are added under the <constraints> section to enforce resource placement rules, using scores to indicate preference or requirement.[12] A mandatory colocation constraint, for instance, ties two resources to the same node with an infinite score:
```xml
<rsc_colocation id="colocate-ip-web" score="INFINITY" rsc="ClusterIP" with-rsc="Webserver"/>
```
Here, INFINITY (equivalent to 1,000,000) makes the colocation mandatory, preventing the resources from running separately.[12]
Configuration and management are facilitated by command-line tools like pcs and cibadmin.[59] The pcs tool provides a user-friendly interface for creating and modifying resources, such as pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=192.168.122.120 cidr_netmask=24 op monitor interval=30s.[59] For direct XML edits to the live CIB, cibadmin is used, for example, cibadmin --create --xml-file resource.xml --obj_type resources to add a new resource definition.[59] Constraints can similarly be managed via pcs, like pcs constraint colocation add ClusterIP with Webserver INFINITY.[59]
Monitoring involves real-time status viewing and logging mechanisms to track cluster health.[59] The crm_mon utility offers a dynamic display of cluster state, resources, and nodes, invoked with crm_mon for continuous output or crm_mon -1 for a one-time snapshot.[59] Logging is handled through syslog, with Pacemaker-specific entries in /var/log/pacemaker/pacemaker.log or integrated into /var/log/messages, and logs rotate automatically at 100MB or weekly intervals.[59] Alerts for failures are configured in the CIB under <alerts>, such as defining an SNMP alert script: <alert id="snmp_alert" path="/path/to/alert_snmp.sh"/>, which triggers on events like resource failures.[59]
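Alerts can also be managed without hand-editing the CIB; a brief sketch using pcs, where the script path and recipient address are placeholders:

```bash
pcs alert create path=/usr/local/bin/alert_snmp.sh id=snmp_alert
pcs alert recipient add snmp_alert value=192.168.122.200
pcs alert config          # list configured alerts and recipients
```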
Maintenance tasks include performing rolling upgrades and backing up configurations to ensure operational continuity.[59] Rolling upgrades proceed node-by-node, draining resources from one node before upgrading it, provided version compatibility is maintained (e.g., Pacemaker 2.x requires Corosync 2.3+).[59] The CIB can be saved and restored with cibadmin, for example cibadmin --query > /path/to/backup.xml to dump the current configuration and cibadmin --replace --xml-file /path/to/backup.xml to restore it; the higher-level pcs config backup command archives the full cluster configuration. The resulting dump is plain XML, so it can also be versioned or edited offline.[59]
Troubleshooting focuses on log analysis and simulating scenarios to diagnose issues.[59] Logs are primarily located in /var/log/pacemaker/, where errors can be filtered with commands like grep 'pacemaker.*error' /var/log/pacemaker/pacemaker.log.[59] Common problems include network partitions, which Pacemaker mitigates through fencing mechanisms to isolate faulty nodes and maintain quorum.[59] The crm_simulate tool aids diagnosis by replaying cluster transitions from saved CIB snapshots or the scheduler input files kept under /var/lib/pacemaker/pengine/, for example crm_simulate --simulate --xml-file transition.xml.[59]
Applications
Common Use Cases
Linux-HA clusters, leveraging Pacemaker as the resource manager, are commonly deployed for database high availability to ensure minimal downtime during failures. In such setups, tools like DRBD provide synchronous block-level replication for shared storage, enabling failover configurations for databases including PostgreSQL and MySQL.[60] For instance, Pacemaker monitors the primary database instance and, upon detecting a failure, promotes the standby node by mounting the replicated DRBD resource and starting the database service. Automatic migration of a virtual IP (VIP) address facilitates seamless client reconnection without manual intervention.[61]
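A simplified pcs sketch of such a stack, assuming a DRBD resource named r0, an ext4 filesystem for the database, and the distribution's PostgreSQL unit; names, device paths, and the address are illustrative, the syntax follows recent pcs releases, and production setups add role-specific monitor operations:

```bash
pcs resource create drbd-data ocf:linbit:drbd drbd_resource=r0 \
    promotable promoted-max=1 clone-max=2

pcs resource create db-fs ocf:heartbeat:Filesystem \
    device=/dev/drbd0 directory=/var/lib/pgsql fstype=ext4
pcs resource create db-vip ocf:heartbeat:IPaddr2 ip=192.168.122.200 cidr_netmask=24
pcs resource create db-server systemd:postgresql
pcs resource group add db-group db-fs db-vip db-server

# Run the group only where DRBD is promoted, and only after promotion completes.
# (Older tool and Pacemaker releases call the promoted role "Master".)
pcs constraint colocation add db-group with Promoted drbd-data-clone INFINITY
pcs constraint order promote drbd-data-clone then start db-group
```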
Web services represent another key application, where Linux-HA enables load-balanced Apache clusters to maintain availability under high traffic or node failures. Pacemaker coordinates active/passive or active/active configurations, often integrating with HAProxy for traffic distribution across Apache instances while ensuring session persistence through shared storage or sticky sessions.[62] This setup allows for automatic failover of the load balancer itself, preventing single points of failure in web infrastructures.[63]
For file services, Linux-HA supports active/active access to shared storage using GFS2, a clustered file system that allows multiple nodes to read and write concurrently. Configurations with Samba or NFS over GFS2, managed by Pacemaker, provide high-availability file sharing in environments requiring scalable storage, such as enterprise networks.[64] Pacemaker handles resource fencing and lock management via the Distributed Lock Manager (DLM) to prevent data corruption during concurrent operations.[65]
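The RHEL-style pattern for this is to clone a DLM control daemon and the GFS2 filesystem across all nodes; a condensed sketch in which the device path and mount point are illustrative:

```bash
pcs resource create dlm ocf:pacemaker:controld \
    op monitor interval=30s on-fail=fence \
    clone interleave=true ordered=true

pcs resource create clusterfs ocf:heartbeat:Filesystem \
    device=/dev/vg_cluster/lv_gfs2 directory=/mnt/gfs2 fstype=gfs2 \
    op monitor interval=10s on-fail=fence \
    clone interleave=true

# Mount GFS2 only where (and after) the lock manager is running.
pcs constraint order start dlm-clone then clusterfs-clone
pcs constraint colocation add clusterfs-clone with dlm-clone
```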
Virtualization platforms benefit from Linux-HA through high-availability setups for KVM/QEMU virtual machines. Pacemaker can manage guests as cluster resources (for example via the ocf:heartbeat:VirtualDomain agent), automatically restarting or migrating them to healthy nodes upon host failure, with shared storage such as Ceph or GFS2 for data persistence; Proxmox VE provides comparable failover through its own ha-manager built on the Corosync membership layer.[66] This integration supports seamless workload relocation, minimizing disruption in virtualized data centers.[67]
In these deployments, Linux-HA typically achieves a Recovery Time Objective (RTO) of under one minute, as failover detection and resource promotion occur in seconds to tens of seconds depending on cluster size and configuration.[56] Real-world examples include CERN's use of Pacemaker for high-availability load balancing in database middleware, ensuring continuous operation of critical services.[63]
Integration in Distributions
Linux-HA components, particularly Pacemaker and Corosync, are integrated into major Linux distributions through dedicated high availability packages and extensions that facilitate cluster management and failover capabilities.[12]
In Red Hat Enterprise Linux (RHEL) and its community counterpart CentOS, the High Availability Add-On has been available since RHEL 6, released in 2010, providing enterprise-grade clustering tools built on Linux-HA foundations.[68] This add-on includes the pcs command-line interface for cluster configuration and management, as well as fence-agents for node fencing to ensure clean failovers.[69] Additionally, the Resilient Storage Add-On complements these features by enabling concurrent access to shared storage in highly available clusters, supporting technologies like GFS2 filesystems for data integrity.[70] As of 2025, RHEL 10 enhancements extend HA capabilities to edge computing environments, incorporating optimized image-based deployments and live kernel patching to minimize downtime in distributed setups.[71]
SUSE Linux Enterprise (SLE) offers the High Availability Extension, which integrates Linux-HA tools like Pacemaker for resource management and supports advanced clustering features.[72] Key components include the Hawk web-based user interface for intuitive monitoring and administration of clusters, allowing administrators to visualize resource states and dependencies.[73] The extension also provides Geo clustering support, enabling coordinated failover across geographically dispersed sites for disaster recovery scenarios.[74]
For Ubuntu and Debian, Pacemaker and related Linux-HA packages are readily available through official repositories, allowing straightforward installation via package managers like apt.[75][76] These distributions integrate Pacemaker with cloud orchestration tools such as Juju, where subordinate charms like HAcluster and pacemaker-remote enable automated deployment of high availability setups for virtual IPs and services in cloud environments.[77][78]
Other distributions and platforms extend Linux-HA functionality in specialized ways; for instance, Proxmox Virtual Environment (VE) builds its ha-manager on the Corosync membership layer rather than on Pacemaker, providing automated VM and container recovery in hyper-converged clusters.[66]