Veritas Cluster Server
Veritas Cluster Server is a high-availability clustering software solution that monitors systems and applications in real time, automatically detecting faults and initiating failover to ensure continuous operation of critical business services with minimal downtime.[1] Originally developed by Veritas Software in 1998, later acquired by Symantec in 2005 and spun off as part of Veritas Technologies in 2016, it is now maintained by Arctera under the InfoScale brand as of 2025. It organizes multiple servers into clusters to provide resiliency against hardware failures, software issues, and site disasters.[2][3][4] Key components include the High Availability Daemon (HAD) for resource management, the Low Latency Transport (LLT) for low-latency inter-node communication, and the Group Membership Services/Atomic Broadcast (GAB) for maintaining cluster membership and coordinating actions.[1] VCS supports up to 32 nodes per cluster and operates on platforms such as Linux, Windows, AIX, HP-UX, and Solaris, with extensions for virtualized environments like VMware and Microsoft Hyper-V.[5] It features service groups that bundle related resources (e.g., IP addresses, mount points, database processes), along with bundled and custom agents for application-specific monitoring and control.[6] As part of the broader Arctera InfoScale Availability product suite in versions up to 9.0 (released April 2025), VCS enables automated recovery across physical, virtual, hybrid, and multi-cloud deployments, including support for AWS, Azure, and Google Cloud.[7][8] This framework uses the Intelligent Monitoring Framework (IMF) for rapid fault detection and supports topologies like N-to-1 failover, parallel, and hybrid configurations to optimize resource utilization and meet recovery time objectives (RTOs).[1][2]
Introduction
Overview
Veritas Cluster Server (VCS) is a high-availability clustering software solution that connects multiple independent systems, or nodes, into a unified management framework to enhance application uptime across Unix, Linux, and Windows operating systems.[9] By leveraging redundant hardware, VCS enables seamless application failover, eliminating single points of failure and ensuring continuous operation when a node or service encounters issues.[10] Its core purpose is to monitor and control critical business services—such as databases, file sharing systems, and e-commerce platforms—allowing them to switch or fail over to healthy nodes in the cluster with minimal disruption.[11] Originally developed by Veritas Software, the technology was acquired by Symantec in 2005 as part of a $13.5 billion deal that integrated it into broader storage and availability offerings.[12] In 2016, Symantec spun off its enterprise security and information management businesses, including Veritas assets, to The Carlyle Group and GIC, reestablishing Veritas Technologies as an independent entity.[13] From version 7.0 onward, VCS has been rebranded and integrated as a key component of Veritas InfoScale Availability, a software-defined solution for high availability and disaster recovery across physical, virtual, and cloud environments.[14] VCS delivers key benefits by significantly reducing unplanned and planned downtime through automated failover mechanisms that restore services in seconds.[15] It supports proactive service group management, where applications are grouped and monitored as cohesive units for efficient control and migration across nodes.[11] Additionally, its architecture-independent design ensures availability without reliance on specific hardware configurations, facilitating server consolidation and scalability for diverse enterprise workloads.[2]
Development History
Veritas Software Corporation, established in 1989 as a spin-off from Tolerant Systems, initially focused on Unix storage management solutions before expanding into high availability clustering in the late 1990s.[16] The company developed Veritas Cluster Server (VCS) as a high availability solution for Unix systems, designed to integrate closely with its Storage Foundation platform to enable failover and resource management in shared storage environments.[16] Introduced in September 1998 under the code name Thor, version 1.0 supported clustering of up to 32 servers on Solaris and Windows NT, targeting storage area networks (SANs) for mission-critical applications.[16] In the early 2000s, VCS expanded to additional Unix platforms, aided by the 1999 acquisition of NuView's ClusterX technology, which improved Windows NT management and integration with Microsoft Cluster Server.[16] By 2000, Veritas released an upgraded Cluster Server supporting up to 32 servers on Windows NT and introduced the Global Cluster Manager for Solaris, capable of overseeing up to 256 clusters via a Java-based console.[16] Symantec's $13.5 billion acquisition of Veritas in mid-2005 marked a significant corporate shift, integrating VCS into a broader enterprise portfolio and accelerating Windows support to enhance cross-platform high availability.[17] During the Symantec era (2005–2016), development emphasized seamless enterprise integrations, such as advanced monitoring and failover for virtualized environments.[18] Key advancements included the introduction of global clustering in VCS 5.0, released in 2006, which enabled multi-site disaster recovery by coordinating failover across geographically dispersed clusters to mitigate large-scale outages.[19] Version 6.0, launched in late 2011, adopted enhanced open-source compatibility, particularly for Linux distributions, broadening VCS's applicability in heterogeneous environments.[20] In 2016, Symantec spun off its information 
management business, including VCS, to form independent Veritas Technologies.[13] With the release of version 7.0 in July 2015, VCS was rebranded as InfoScale Availability, aligning it within Veritas's expanded portfolio for storage, availability, and resilience solutions.[21] In December 2024, following the sale of Veritas Technologies' data protection business to Cohesity, the remaining assets—including InfoScale Availability—were separated into an independent company named Arctera. As of 2025, InfoScale Availability, encompassing the VCS technology, continues to be developed and supported by Arctera.[3][22]
Architecture
Core Components
Veritas Cluster Server (VCS) relies on several fundamental software modules to enable reliable clustering and high availability. These components form the foundational architecture, handling communication, membership management, resource control, and data protection across cluster nodes.[23] The Low-Latency Transport (LLT) is a kernel-level module that provides high-speed, low-latency communication between cluster nodes over private networks. It replaces the standard IP stack for all inter-node traffic, enabling efficient heartbeat transmission and data exchange necessary for cluster coordination. LLT supports up to eight private network links for load balancing and redundancy, automatically redirecting traffic if a link fails, which ensures resilient connectivity and rapid failure detection.[24] Group Membership and Atomic Broadcast (GAB) is another kernel component that maintains a consistent view of cluster membership across all nodes. It uses an atomic broadcast protocol to deliver messages reliably and in the same order to every node, while monitoring heartbeats via LLT to detect node failures or network partitions. GAB tracks system states—such as regular (multiple links), jeopardy (single link), or visible (GAB running but unregistered)—and facilitates cluster seeding for initial formation, ensuring stable membership arbitration.[25] The High Availability Daemon (HAD) serves as the central user-space process, acting as the VCS engine to manage overall cluster operations. It builds and distributes the cluster configuration from main.cf files, monitors resources through agents, and coordinates actions like failover based on state changes reported via GAB. 
HAD operates as a replicated state machine, synchronizing resource status across nodes, and is automatically restarted by the hashadow process if it fails, maintaining continuous availability.[23] Cluster agents are multi-threaded processes that monitor and control specific resources, such as applications or hardware, within service groups. Each agent type handles start, stop, monitor, and other actions tailored to its resource—for example, the Oracle agent manages database instances by importing disk groups and starting services. Agents are categorized as bundled (standard VCS inclusions like IP or Mount), enterprise (for applications like SQL Server), or custom (user-developed scripts or binaries), and they support resource dependency trees to define ordered operations. Many incorporate the Intelligent Monitoring Framework (IMF) for asynchronous, efficient monitoring to reduce overhead.[26] Fencing mechanisms, primarily through the I/O fencing module, protect shared data integrity by preventing split-brain scenarios where multiple nodes attempt concurrent access. This module uses SCSI-3 Persistent Reservations on coordination points—such as shared disks or server-based nodes—to isolate failed or partitioned nodes, ensuring only the surviving cluster partition retains ownership. Server-based fencing employs external Coordination Point servers for arbitration in diskless environments, while majority-based fencing relies on node quorum; both integrate with GAB to enforce membership decisions and support preferred node selection for prioritized recovery.[27]
Clustering and Failover Mechanisms
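The state of the LLT and GAB layers described above can be inspected directly on a cluster node. A minimal sketch of the usual checks (command paths and output formats vary by version and platform):

```sh
# Inspect the private interconnects and cluster membership
# (run as root on any cluster node; typically found under /sbin
# or /opt/VRTSvcs/bin depending on platform)
lltstat -nvv      # per-node, per-link state of the LLT private network links
gabconfig -a      # GAB port memberships: port a = GAB membership,
                  # port b = I/O fencing, port h = HAD
```

A node missing from a port membership, or an LLT link reported as DOWN, is the first symptom administrators look for when diagnosing jeopardy states or partitions.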
Veritas Cluster Server (VCS) assembles nodes into a cluster through a process that relies on the Group Membership Services and Atomic Broadcast (GAB) protocol to establish and maintain membership. During initialization, nodes join the cluster by seeding, either manually or automatically, where GAB port a facilitates heartbeat communication to monitor node liveness across the cluster, while port b handles I/O fencing operations to coordinate access to shared resources. This setup ensures all nodes develop a consistent, shared view of the cluster state, enabling synchronized operations and rapid detection of membership changes.[25] Service groups in VCS represent logical collections of resources, such as IP addresses, volumes, and applications, organized with defined dependencies to reflect real-world relationships, like an application depending on underlying storage. The High Availability Daemon (HAD) oversees the online and offline states of these groups, monitoring resources through specialized agents and enforcing policies for activation or deactivation based on cluster conditions. Dependencies within service groups dictate the order of resource startup or shutdown, ensuring orderly management across nodes.[23] The failover process begins with resource agents detecting faults, such as application crashes or hardware issues, which notify HAD to initiate corrective actions. Upon fault confirmation, HAD evaluates failover policies and transfers the affected service group to a healthy node in the system's list, stopping resources on the faulty node and starting them on the target while respecting dependencies. 
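Such a service group and its dependency tree are declared in the cluster's main.cf configuration. A minimal sketch, with hypothetical node, group, and resource names (attribute sets are abbreviated; real configurations carry more attributes):

```
group websg (
        SystemList = { node1 = 0, node2 = 1 }
        AutoStartList = { node1 }
        )

        DiskGroup webdg (
                DiskGroup = webdg
                )

        Mount webmnt (
                MountPoint = "/web"
                BlockDevice = "/dev/vx/dsk/webdg/webvol"
                FSType = vxfs
                FsckOpt = "-y"
                )

        IP webip (
                Device = eth0
                Address = "192.168.10.10"
                NetMask = "255.255.255.0"
                )

        Application webapp (
                StartProgram = "/opt/web/bin/start"
                StopProgram = "/opt/web/bin/stop"
                MonitorProcesses = { "httpd" }
                )

        // dependency tree: storage before mount, mount and IP before the app
        webmnt requires webdg
        webapp requires webmnt
        webapp requires webip
```

The SystemList values express failover priority (lower numbers are preferred), and the `requires` statements define the ordered online/offline sequence HAD enforces during failover.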
Pre-failover and post-failover scripts allow for custom actions, like data synchronization or notification, to facilitate clean transitions and minimize disruption.[28] Switchover provides a controlled mechanism for relocating service groups between nodes without simulating a failure, supporting manual intervention via commands or scheduled operations for maintenance and load balancing. HAD coordinates the switch by gracefully shutting down resources on the source node and bringing them online on the destination, leveraging the same dependency rules as failover to maintain service integrity. This approach enables proactive resource distribution, such as moving workloads to underutilized nodes to optimize performance. To prevent split-brain scenarios where partitioned nodes could concurrently access shared storage and cause data corruption, VCS employs I/O fencing mechanisms that isolate non-surviving partitions. Fencing uses coordination points, such as external CP servers or shared disks, to arbitrate membership during network partitions; only the partition holding a majority or quorum of these points retains access to storage via SCSI-3 Persistent Reservations, while others are evicted. This ensures data consistency by guaranteeing that only one cluster portion can modify shared resources at a time.[29]
Features
High Availability Capabilities
Veritas InfoScale Availability, formerly known as Veritas Cluster Server, delivers high availability through a robust framework that monitors and manages applications across diverse environments, ensuring minimal downtime during failures.[2] Central to its capabilities is the ability to detect faults in real time using the Intelligent Monitoring Framework (IMF), which provides instant notifications and triggers automated responses, such as restarting services on the same node before escalating to failover.[2] This approach supports resilience by integrating with underlying clustering mechanisms to maintain service continuity, often achieving recovery times in seconds for critical workloads.[30] In version 9.0, InfoScale Availability introduces real-time cyber resiliency features, enabling application-aware recovery and protection against ransomware and other threats with near-zero downtime restoration.[31] A key feature is its application-agnostic clustering, which allows for the high availability of virtually any application through customizable agents and service groups.[6] Users can develop or leverage pre-built agents to cluster legacy systems, open-source software like Apache HTTP Server or MySQL databases, and multitier applications without requiring application-specific modifications.[6] This flexibility enables dynamic resource management, where service groups containing interdependent components—such as databases, web servers, and middleware—are monitored holistically to ensure coordinated failover or restart, promoting scalability in heterogeneous setups.[2] For broader resilience, global clustering supports multi-site configurations tailored for disaster recovery, facilitating wide-area failover across geographically dispersed data centers.[32] Integrated with replication technologies like Veritas Volume Replicator (VVR), it synchronizes data between sites while coordinating application failover with a single command, ensuring zero data loss in 
synchronous modes and non-disruptive testing via FireDrills to validate recovery plans without impacting production.[33] This capability extends high availability beyond local clusters, supporting metro and global distances to mitigate site-wide outages. InfoScale Availability accommodates scalable topologies such as N-to-1 and N-to-N designs, where multiple nodes (N) can fail over to one or more redundant systems, optimizing resource utilization.[34] In N-to-1 configurations, a cluster of active nodes shares a single failover target, while N-to-N symmetric setups allow balanced load sharing and mutual redundancy among all nodes; dynamic probing assesses node capacity—factoring in CPU, memory, and storage—before directing resources to the most suitable target.[35] These topologies enhance resilience by eliminating single points of failure and enabling efficient scaling for demanding environments, such as those handling high-volume transactions. Auto-restart policies further minimize downtime by attempting to recover resources locally before invoking full failover.[36] Configurable via attributes like AutoRestartLimit, the system automatically probes and restarts failed components on the originating node, escalating only if thresholds are exceeded, which typically limits outages to seconds rather than minutes.[36] Combined with quick recovery mechanisms, this ensures rapid restoration of service groups, supporting high resilience for mission-critical applications. 
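Restart-before-failover behavior of this kind is governed by agent attributes that can be tuned per resource. A hedged sketch using a hypothetical resource name (exact attribute names and defaults vary by agent type and product version):

```sh
# Allow the agent to restart a faulted resource in place before
# declaring it faulted and escalating to service group failover.
haconf -makerw                         # make the cluster configuration writable
hares -override webapp RestartLimit    # promote the static type attribute to per-resource
hares -modify webapp RestartLimit 3    # attempt up to 3 local restarts first
haconf -dump -makero                   # persist the change and return to read-only
```

Raising the restart limit trades slightly longer local recovery attempts against the cost of a full failover, which is why it is typically tuned per workload rather than cluster-wide.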
Native integration with Veritas storage solutions, including Veritas File System (VxFS) and Veritas Volume Manager (VxVM), bolsters availability in both shared-nothing and shared-disk cluster models.[37] VxFS provides journaling and intent logging for fast crash recovery, while VxVM enables dynamic volume management and mirroring across nodes, ensuring data integrity during failovers without shared storage dependencies in replicated setups.[38] This synergy allows seamless operation in clustered file systems, where storage resources are treated as cluster-aware entities, enhancing overall system resilience against hardware faults.[2]
Management and Monitoring Tools
Veritas Cluster Server (VCS) provides a suite of command-line and graphical tools for configuring, monitoring, and administering clusters, enabling administrators to manage resources, detect faults, and ensure high availability.[1] The VCS command-line interface, installed with the VCS packages, offers utilities for core operations such as starting and stopping clusters with hastart and hastop, managing individual resources via hares to modify attributes or probe status, and handling service groups through hagrp to enable, disable, or switch groups between nodes.[39] These commands allow querying cluster status in real time, such as displaying resource dependencies or faulted components, facilitating quick troubleshooting without graphical interfaces.[39]
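A few representative invocations of these utilities, with hypothetical group and node names (option syntax can vary slightly across versions):

```sh
hastatus -sum                      # summarize cluster, system, group, and resource states
hagrp -online websg -sys node1     # bring a service group online on a specific node
hagrp -switch websg -to node2      # controlled switchover of a group to another node
hares -state                       # list the current state of every resource
hastop -all -force                 # stop VCS on all nodes while leaving applications running
```

Because these commands operate on the replicated cluster configuration, they can be issued from any node and take effect cluster-wide.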
For visual management, the Cluster Manager Java Console serves as a desktop GUI application that connects to clusters over secure channels, providing topology views of nodes, resources, and dependencies.[40] Administrators can edit main.cf configuration files, simulate failover scenarios, and monitor events through its interface, which supports both local and remote cluster access.[40]
The VCS Notification Framework, powered by the notifier process, integrates with external systems for event alerting, including SNMP traps for network management stations and SMTP for email notifications on triggers like node failures or resource faults.[41][42] Custom scripts can also be configured to execute actions in response to these events, enhancing proactive administration.[41]
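Notification targets are commonly declared through the NotifierMngr bundled agent in the cluster configuration. A sketch with hypothetical hosts and recipients (severity levels include Information, Warning, Error, and SevereError; consult the bundled agents reference for the full attribute list):

```
NotifierMngr ntfr (
        SnmpConsoles = { "nms.example.com" = Error }
        SmtpServer = "smtp.example.com"
        SmtpRecipients = { "admin@example.com" = Warning }
        )
```

Each recipient is paired with a minimum severity, so routine informational events can be kept out of paging channels while faults still trigger alerts.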
Performance monitoring in VCS relies on real-time metrics collected by the High Availability Daemon (HAD), which logs heartbeat latency, I/O operations, and system utilization in engine_A.log files for analysis of cluster health.[1] Agent probes periodically check resource states, reporting metrics like CPU usage or disk availability to detect performance degradation before faults occur.[1]
Multi-cluster management is supported through InfoScale Operations Manager, a centralized web-based tool that oversees multiple global clusters, displaying replication status, cross-site dependencies, and aggregated health metrics from a single dashboard.[43] This console enables coordinated actions across sites, such as synchronized failovers, without requiring individual cluster logins.[43]
Supported Platforms
Operating Systems
Veritas Cluster Server, now part of Arctera InfoScale Availability, supports a range of Unix, Linux, and Windows operating systems, with compatibility varying by architecture and clustering requirements. It also extends to virtualized environments including VMware and Microsoft Hyper-V.[44][45][46][5]
Unix Support
On Unix platforms, Arctera InfoScale Availability supports Solaris 11 Update 4 on both SPARC and x86-64 architectures, enabling shared-disk clustering through the Veritas Cluster File System (CFS) for up to 64 nodes in certain configurations.[46] IBM AIX 7.2 Technology Level 5 and AIX 7.3 Technology Levels 2 and 3 are supported on Power7, Power8, Power9, and Power10 processors, with clustering limited to 8 nodes for standard setups and 2 nodes for some shared configurations using CFS.[45]
Linux Support
Linux distributions supported include Red Hat Enterprise Linux 8.10 and 9.4/9.6, Oracle Linux 8.10 and 9.4/9.6, Rocky Linux 8.10 and 9.4/9.6, and SUSE Linux Enterprise Server 15 SP5/SP6, all on x86-64 architecture with kernel modules for the Low Latency Transport (LLT) protocol and I/O fencing to ensure cluster integrity.[44] These platforms emphasize shared-nothing clustering, though CFS enables shared-disk modes for select enterprise use cases.[44]
Windows Support
Windows Server support encompasses 2019 through 2025 editions (Standard and Datacenter), integrating with Windows Failover Cluster Manager via platform-specific agents for high availability.[47] Clustering on Windows utilizes hybrid modes, often incorporating SMB shares for shared storage alongside native Windows features.[48] Platform-specific modes include shared-nothing architectures for most Linux and Unix environments to isolate node failures, shared-disk configurations via CFS on enterprise Unix systems like Solaris and AIX for concurrent data access, and hybrid approaches on Windows leveraging SMB for flexible storage sharing.[44][46][45] Deployment requires OS kernel patches to accommodate VCS drivers such as LLT and fencing modules, with end-of-support alignment ensuring compatibility until the respective OS reaches its maturity phase.[49]
Compatible Applications
Veritas Cluster Server (VCS), part of Arctera InfoScale Availability, provides dedicated agents to ensure high availability for a range of database applications through automated monitoring, failover, and resource management of database instances and associated storage. The Oracle agent supports Oracle Database versions including 12c, 19c, and 21c, handling tasks such as starting listeners, verifying database connectivity, and coordinating failover for single-instance and Real Application Clusters (RAC) setups. Similarly, the Microsoft SQL Server agent manages SQL Server Database Engine, Analysis Services, and Integration Services, enabling seamless failover for clustered instances on Windows platforms. For IBM DB2, VCS includes an enterprise agent that monitors database instances, handles failover of shared storage, and integrates with DB2's high availability features like HADR. Agents for open-source databases such as MySQL and PostgreSQL facilitate failover of database servers, replication setups, and monitoring of processes, supporting configurations like active-passive clusters with shared or replicated storage.[50][51][52][53] For file and storage services, VCS offers agents that manage shared or replicated data environments, ensuring continuous access during node failures. The NFS agent controls NFS exports and mounts, supporting high availability for Network File System shares across Linux and UNIX clusters. The SambaShare agent handles Samba-based file sharing, enabling failover of CIFS/SMB services for cross-platform access. Additionally, the Mount agent supports Veritas File System (VxFS) mounts, while the DiskGroup agent manages Veritas Volume Manager (VxVM) volume groups, allowing dynamic import/export and failover of storage resources for shared data volumes. 
These agents collectively ensure that file systems and volumes remain online, with options for replication via Veritas Volume Replicator.[54][55][56] Enterprise applications benefit from VCS agents tailored for business-critical workloads, particularly in e-commerce and middleware environments. The SAP agent, including support for SAP NetWeaver, monitors application servers, enqueues, and gateways, facilitating failover for SAP systems integrated with databases. For Oracle E-Business Suite, dedicated agents manage concurrent managers and forms servers, ensuring rapid recovery of ERP processes. Web server support includes the Apache agent for HTTP services on UNIX/Linux and the IIS agent for Microsoft Internet Information Services on Windows, both handling virtual IP failover and site monitoring to maintain web application availability. These configurations allow clustering of middleware components without disrupting user sessions.[57][58][59] VCS extends compatibility to open-source and custom workloads through specialized and flexible agents. The Kubernetes agent integrates with container orchestration, monitoring pods and enabling failover of stateful applications within Kubernetes clusters by leveraging InfoScale's storage and fencing capabilities. For big data environments, Hadoop support is available via custom configurations using the generic Application agent to manage NameNode, DataNode, and JobTracker processes, ensuring cluster-wide failover. Custom applications, including DevOps tools and messaging systems like RabbitMQ, can utilize user-defined scripts with the Process or Application agents for monitoring and failover, allowing tailored resource dependencies and actions.[60][61] Integration examples highlight VCS's role in Arctera ecosystems for comprehensive data protection. 
Bundled agents for NetBackup enable clustering of backup masters and media servers, supporting failover of backup operations and integration with application-specific backups like those for databases. Similarly, agents within InfoScale Storage, such as those for VxVM and VxFS, provide end-to-end storage management alongside VCS, ensuring replicated or shared storage availability for protected workloads.[62][63]
Release History
Major Versions
Veritas Cluster Server (VCS), now known as InfoScale Availability, has evolved through several major versions since its early development, with each release introducing key enhancements to support broader environments, improved reliability, and integration with emerging technologies.[64] Version 4.0, released in 2005, marked a significant expansion by enhancing global clustering capabilities specifically for disaster recovery scenarios, allowing seamless replication and failover across geographically dispersed sites via the VCS Global Cluster Option. Additionally, it improved the agent framework, providing more robust and extensible agents for monitoring and managing a wider range of applications without requiring custom scripting. In 2007, Version 5.0 deepened native support for Microsoft Windows operating systems alongside Unix and Linux platforms, enhancing high-availability clustering for Windows-based applications. It also added the multi-cluster management console, which allowed centralized administration and monitoring of multiple VCS clusters from a single interface, simplifying operations in large-scale deployments. It introduced support for virtualization environments, including VMware ESX servers, enabling high availability for virtual machines through features like automated failover and integration with VMware's VMotion. Furthermore, this release strengthened Linux integration with enhanced compatibility for major distributions, improving resource management and performance in Linux-based clusters.[65][66] Version 5.1, launched in 2009, advanced fencing mechanisms by incorporating coordination points, including server-based options like Coordination Point Servers (CP servers), to better prevent split-brain scenarios and ensure data integrity during node failures. It introduced auto-probe functionality for resources, automating the detection and validation of cluster resources during startup or reconfiguration to reduce manual intervention. 
The version also added IPv6 support, facilitating deployment in modern IPv6-enabled networks without compatibility issues.[67] Released in 2011, Version 6.0 integrated VCS more tightly with the emerging InfoScale suite, laying the groundwork for unified storage and availability management across Veritas products. It expanded platform support to include Red Hat Enterprise Linux (RHEL) 6 and SUSE Linux Enterprise Server (SLES) 11, broadening its applicability in enterprise Linux environments. A notable addition was dynamic reconfiguration, which allowed online adjustments to cluster configurations, such as adding or modifying resources, without requiring full cluster shutdowns.[68] Version 7.0 in 2015 represented a major rebranding to InfoScale Availability, aligning VCS with Veritas's broader InfoScale portfolio for simplified licensing and deployment. This release introduced cloud bursting capabilities, enabling seamless extension of on-premises clusters to public clouds like AWS and Azure for hybrid disaster recovery and workload mobility. It also incorporated kernel-independent drivers, such as for dynamic multi-pathing (VxDMP), to reduce dependencies on specific OS kernels and enhance portability across environments.[69] Subsequent versions from 8.0 (2021) to 9.0 (2025) have emphasized modern infrastructure trends, including native support for containers via Docker and Kubernetes integrations to provide high availability for containerized applications in orchestrated environments. Version 9.0 specifically introduced AI-powered anomaly detection for predictive analytics and real-time threat monitoring in cluster health, improving proactive issue resolution. Compatibility was extended to Windows Server 2025, ensuring continued support for the latest Microsoft ecosystems in mixed-platform clusters.[70][7][48][71]
End-of-Life Timeline
Veritas Cluster Server (VCS), now integrated into Veritas InfoScale Availability, follows a structured support lifecycle that includes standard support, extended support for critical issues, and sustaining support for limited guidance, after which no further updates or patches are provided. This timeline details the end-of-support phases for major versions, along with migration recommendations to maintain high availability and security in clustered environments.[72] Version 4.0 reached its end of support on July 31, 2011, after which users were advised to migrate to version 5.x to access ongoing security updates and compatibility improvements.[72] Version 5.0 concluded standard support on August 31, 2014, with extended support provided until 2017 specifically for critical patches in existing deployments.[73] For version 5.1, support ended on December 31, 2016, with a focus on supporting legacy Unix environments during the transition period.[74] Version 6.0's standard support terminated on June 30, 2020, followed by sustaining support until 2023 to aid migrations to InfoScale Availability for enhanced scalability.[75] Version 7.0 saw premier support end in 2022, with extensions available through 2025; migrations to version 8.0 or later were recommended to leverage cloud integration features.[76] The current version, 9.0 released in 2025, maintains full standard support through 2028, with extended support until 2029, including quarterly updates to address emerging threats and ensure compatibility with modern platforms.[7][77]

| Version | End of Standard Support | Extended/Sustaining Support | Migration Recommendations |
|---|---|---|---|
| 4.0 | July 31, 2011 | N/A | Upgrade to 5.x for security updates |
| 5.0 | August 31, 2014 | Until 2017 for critical patches | Transition to 5.1 or later |
| 5.1 | December 31, 2016 | N/A | Move to 6.0+ for broader OS support |
| 6.0 | June 30, 2020 | Until 2023 for InfoScale transitions | Adopt InfoScale Availability 7.0+ |
| 7.0 | 2022 | Extended to 2025 | Migrate to 8.0+ for cloud capabilities |
| 9.0 | 2028 | Until 2029 | N/A (current version) |