OCFS2
OCFS2, or the Oracle Cluster File System version 2, is a general-purpose, extent-based, shared-disk cluster file system for Linux that supports concurrent read and write access to the same files from multiple nodes in a cluster, using a distributed lock manager to ensure data consistency and integrity.[1][2][3] It is designed for high-performance and high-availability environments, including both clustered and standalone systems, and features journaling for reliability, POSIX compliance, and support for quotas, access control lists (ACLs), and extended attributes.[4][2]

Development of OCFS2 began in 2003 at Oracle Corporation as a successor to the original OCFS, which was tailored specifically for Oracle Real Application Clusters (RAC) database storage.[5] The project aimed to create a more versatile, POSIX-compliant file system with raw-like I/O throughput and efficient metadata operations, addressing the limitations of OCFS by incorporating elements from ext3 and other Linux file systems.[5][2] The first stable release, version 1.0, arrived in August 2005, followed by integration into the mainline Linux kernel with version 2.6.16 in early 2006, under the GNU General Public License (GPL).[5] OCFS2 has been maintained as part of the Linux kernel ever since (as of November 2025) and is available in distributions such as Oracle Linux and SUSE Linux Enterprise, and on Red Hat Enterprise Linux with Oracle-provided support.[6][2][7]

Key architectural components of OCFS2 include an in-kernel cluster stack (O2CB) for communication, a global heartbeat mechanism to detect node failures, and fencing policies to handle unresponsive nodes, enabling scalability across heterogeneous clusters (e.g., mixing 32-bit and 64-bit nodes or different endianness).[4][5] It supports block sizes from 512 bytes to 4 KB, cluster sizes up to 1 MB, and volume sizes ranging from 16 TB (with 4 KB clusters) to potentially 4 PB, with optimizations for large files such as sparse file handling, unwritten extents, and directory indexing for millions of entries.[4][2] Additional capabilities include reflink for copy-on-write clones, metadata checksums for integrity, and compatibility with SELinux security policies.[4]

OCFS2 is primarily used in enterprise environments requiring shared storage, such as Oracle RAC for database clustering, Oracle VM for virtual machine images, and Oracle E-Business Suite for middleware load balancing.[1][4] It also serves general clustered applications, including Samba and NFS exports for file sharing, and is deployable on cloud platforms like Oracle Cloud Infrastructure with shareable block volumes.[1][5] The file system requires dedicated tools, packaged as ocfs2-tools, for formatting, mounting, and management, and is configured via a cluster configuration file for node coordination.[2][6]
Introduction
Overview
OCFS2 (Oracle Cluster File System version 2) is a shared-disk, journaling, extent-based cluster file system designed for the Linux kernel, allowing multiple nodes in a cluster to concurrently read from and write to the same shared block storage devices, such as storage area networks (SANs) accessed via iSCSI or Fibre Channel protocols.[4] It provides a general-purpose solution for clustered environments, supporting parallel I/O operations while maintaining data consistency across nodes, and can also be used on standalone systems for local file system needs.[4][8]

The primary use cases for OCFS2 include high-availability setups, such as Oracle Real Application Clusters (RAC) for shared database storage, Oracle E-Business Suite in middleware clusters, and general clustered storage for applications like web servers, virtual machine images in Oracle VM, and other scenarios requiring simultaneous multi-node access to files.[9][4][8] Developed and maintained by Oracle Corporation, OCFS2 is released under the GNU General Public License (GPL) as an open-source project and has been integrated into the mainline Linux kernel since version 2.6.16.[8] Its key benefits include full POSIX compliance for standard file system semantics, high performance optimized for metadata operations through extent-based allocation, and scalability to clusters with up to 255 nodes via configurable slot mechanisms (1-255 slots), though practical limits depend on hardware and configuration.[4][10]
History
The Oracle Cluster File System (OCFS) was initially developed by Oracle Corporation in 2002 as a proprietary clustered file system designed exclusively for Oracle Real Application Clusters (RAC), providing shared storage access for database operations as an alternative to raw devices.[5] This first-generation system focused on fast I/O performance for Oracle workloads but lacked broader POSIX compliance and general-purpose capabilities, limiting its use to Oracle-specific environments.[6]

Development of OCFS2 began in 2003 as a complete redesign of OCFS, motivated by the need for a more versatile, POSIX-compliant clustered file system suitable for general-purpose Linux applications while retaining high performance in shared-disk cluster setups.[5] The initial version, OCFS2 v1.0, was released in August 2005, introducing features like extent-based allocation and improved scalability.[5] In January 2006, OCFS2 was merged into the mainline Linux kernel, with its full integration appearing in kernel version 2.6.16, released in March 2006, marking its availability as fully open-source under the GPL and enabling widespread adoption beyond Oracle ecosystems.[5][11]

Subsequent releases enhanced OCFS2's functionality for diverse workloads. OCFS2 Release 1.4, launched in July 2008, added support for sparse files, unwritten extents, inline data, and shared writable mmap, improving storage efficiency and I/O handling for clustered environments.[12] Release 1.6 followed in November 2010, incorporating advancements such as user and group quotas for resource management and further optimizations to mmap operations for better memory-mapped file performance in clusters.[12] Later milestones included the addition of reflinks (a copy-on-write mechanism for efficient file cloning) in Linux kernel 2.6.32 in 2009, and online defragmentation capabilities via tools like defragfs.ocfs2, introduced in subsequent ocfs2-tools releases to address fragmentation without downtime.[13][14]

OCFS2 has been maintained primarily by Oracle's Open Source Software team, with ongoing contributions from the Linux kernel community integrated through mainline development.[15] No major forks have emerged, though distributions such as Red Hat Enterprise Linux and SUSE Linux Enterprise have adapted and certified OCFS2 for their enterprise clustering stacks, ensuring compatibility and support in production environments.[6][16] As of 2025, OCFS2 continues to be maintained by Oracle and the Linux kernel community, with support in recent kernels and updates addressing security vulnerabilities.[17]
Design and Architecture
Core Components
OCFS2's core architecture revolves around several key components that facilitate shared access to storage across multiple nodes in a cluster, ensuring data consistency, fault tolerance, and scalability. These elements work together to coordinate operations among nodes, preventing data corruption from concurrent modifications while supporting high availability. The system employs a distributed approach where each node maintains local state synchronized via network communication and shared disk mechanisms.[2]

The Distributed Lock Manager (DLM), specifically O2DLM in OCFS2, is central to coordinating access to shared resources such as inodes, file data regions, and metadata. It distributes lock resources across nodes within a domain, allowing each node to hold only a subset of the overall lock state for improved scalability. Upon a node failure, the DLM enables rapid recovery by redistributing locks to surviving nodes, ensuring continued cluster operation without data loss. This domain-based locking model supports fine-grained concurrency, such as PR (protected read) and EX (exclusive) modes, to serialize writes while permitting parallel reads.[18][19][12]

The heartbeat protocol, implemented as O2HB, provides liveness detection through both disk-based and network-based monitoring to identify node failures swiftly, typically within seconds. Disk heartbeats involve periodic writes (every 2 seconds by default) to reserved regions on shared storage, updating timestamps that other nodes poll to confirm activity. Network heartbeats complement this by exchanging keep-alive packets over the interconnect, helping prevent split-brain scenarios where disconnected nodes might independently claim resources. If a node stops heartbeating, the protocol triggers eviction via fencing, notifying the DLM and other services to initiate recovery. This dual mechanism enhances reliability in environments with potential network partitions.[8][20][21]

The network interconnect, managed by O2NET, handles communication for cluster events, lock messaging, and heartbeats, using TCP/IP by default for general-purpose clusters. For high-performance setups, it supports RDMA via the o2ib module over InfiniBand or RoCE-enabled Ethernet, reducing latency for lock traffic and enabling faster coordination in large clusters. Connection parameters include configurable idle timeouts (default 30 seconds) and reconnect attempts (default 2 seconds delay), ensuring resilience to transient network issues while maintaining low overhead. This interconnect forms the backbone for all inter-node signaling in OCFS2.[12][22]

OCFS2 offers flexible cluster stack options, with the in-kernel O2CB (OCFS2 Cluster Base) as the primary choice for straightforward deployments, integrating node management, heartbeats, and DLM directly into the kernel for simplicity and performance. For advanced high-availability scenarios, it can integrate with external stacks like Corosync and Pacemaker, using resources such as ocf:heartbeat:o2cb to manage OCFS2 services alongside other cluster resources. Configuration occurs via the /etc/ocfs2/cluster.conf file, specifying nodes, timeouts, and domains to align with the chosen stack. This modularity allows OCFS2 to adapt to diverse clustering needs without requiring a full replacement of the file system.[23][24]

Node slot management allocates dedicated resources, such as journals and system files, for each participating node via slots defined in the file system's superblock. During formatting with mkfs.ocfs2, administrators specify the maximum number of slots (default 8, tunable up to 255), with a slot map tracking active assignments to prevent conflicts. This per-node allocation ensures isolated journaling and metadata operations, supporting cluster expansion by adding slots post-formatting using tunefs.ocfs2, though reductions are not possible. The mechanism scales to hundreds of nodes while maintaining efficient resource isolation.[4][8][25]
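As an illustration of how these components surface at runtime, the sketch below inspects the in-kernel O2CB state exposed through configfs; it assumes the default O2CB stack, a cluster named mycluster (a placeholder), and the usual configfs mount point at /sys/kernel/config:

```
ls /sys/kernel/config/cluster/mycluster/node/                      # one directory per configured node
cat /sys/kernel/config/cluster/mycluster/idle_timeout_ms           # O2NET idle timeout
cat /sys/kernel/config/cluster/mycluster/keepalive_delay_ms        # O2NET keepalive interval
cat /sys/kernel/config/cluster/mycluster/reconnect_delay_ms        # O2NET reconnect delay
cat /sys/kernel/config/cluster/mycluster/heartbeat/dead_threshold  # O2HB iterations before a node is declared dead
```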
On-Disk Format
The on-disk format of OCFS2 is designed to support shared access across multiple nodes in a cluster while maintaining compatibility with local file system semantics. It organizes data into blocks for metadata and larger clusters for file data, enabling efficient allocation and extent-based storage. The format is extent-oriented, drawing inspiration from ext3 but extended for clustering, with all structures stored in little-endian byte order to ensure portability.[26]

The superblock, located at block number 2 (offset 8192 bytes assuming a 4KB block size), serves as the primary metadata header for the file system. It includes a 16-byte UUID for unique identification, the block size in bits (supporting 512 bytes to 4KB, with 4KB as the default), and the cluster size in bits (ranging from 4KB to 1MB, default 4KB but often 128KB for database workloads). The superblock also specifies the maximum number of node slots (up to 255), feature flags for compatible and incompatible features (such as support for extended attributes and unwritten extents), and the block offset of the root inode. Additionally, it contains pointers to the system directory and first cluster group, along with revision levels (major 0, minor typically 90 or higher) and mount counts for maintenance. This structure fits within 512 bytes to accommodate the smallest block size, with reserved padding for future use.[27][26][28]

Inodes in OCFS2 use a 64-bit numbering scheme to support large file systems, with each inode stored in a fixed-size dinode structure (typically 512 bytes or more, depending on block size). The dinode includes up to 60 extent records in a leaf list, each describing a contiguous range of clusters with fields for logical offset (32-bit), cluster count (32-bit), and physical block number (64-bit); these extents enable efficient representation of large files without fragmentation. Support for unwritten extents allows allocation without immediate data writing, optimizing performance for sparse or growing files. Inline data up to 2KB can be stored directly within the inode for small files, reducing seek overhead, while extended attributes (xattrs) are accommodated via dedicated slots or external blocks. Inode allocation is dynamic, managed through a global inode allocation bitmap (system inode 1) that tracks free inodes across the disk, allowing nodes to allocate from shared pools without contention beyond locking.[27][26][28]

Each node in the cluster has a dedicated journal, implemented as a system inode (one per slot, up to 255) using the kernel's JBD2 journaling layer for metadata logging. These journals record changes to ensure crash recovery, with replay occurring automatically on mount to restore consistency across nodes. The format supports both ordered and writeback data modes: ordered mode guarantees data is flushed before metadata commit for stricter consistency, while writeback mode allows metadata commits without immediate data sync for better performance. Journal size is configurable during formatting (default 32MB, scalable to 1GB or more), and each is accessed exclusively by its owning node during operations.[26][28][21]

Directories use a B-tree structure (hashed with DX seeds in the superblock) for efficient lookups and insertions, storing entries with 64-bit inode numbers and names up to 255 bytes. File data allocation relies on extent trees, where leaf extents map to clusters, and internal tree nodes handle indirection for files exceeding the 60-extent limit (up to 4TB with 4KB blocks and 1MB clusters). The disk is divided into allocation groups (linked lists of bitmap-managed regions) for parallel allocation across nodes, with each group containing bitmaps for free clusters and blocks. Local allocation windows (default 8MB per node) cache bits from the main bitmap to reduce global contention, sliding dynamically as space exhausts.[26][28][21]

OCFS2 maintains backward compatibility by preserving the core on-disk format across kernel versions, with changes gated by feature bits in the superblock (e.g., OCFS2_FEATURE_INCOMPAT_SPARSE_ALLOC for sparse files or OCFS2_FEATURE_RO_COMPAT_UNWRITTEN for unwritten extents). Incompatible features prevent mounting on older kernels, while compatible ones allow seamless upgrades; for instance, the format remains readable by tools like debugfs.ocfs2 even if advanced features are disabled via mkfs.ocfs2's --max-compat option. Backup superblocks can be enabled for recovery from corruption.[27][21][28]
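Several of these on-disk structures can be examined with debugfs.ocfs2 (described under User-Space Tools); a brief sketch, with the device path as a placeholder:

```
# Print the superblock: block/cluster sizes, slot count, and feature flags
debugfs.ocfs2 -R "stats" /dev/sdX

# List the system directory ("//"), which holds the per-slot journals,
# allocators, and the global bitmap described above
debugfs.ocfs2 -R "ls //" /dev/sdX
```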
Features
Journaling and Consistency
OCFS2 employs journaling to maintain data integrity in a clustered environment, primarily through metadata journaling, which records all structural changes such as inode modifications and directory updates before they are committed to the file system.[2] This default mode ensures that the file system remains consistent even after a crash or power failure by allowing recovery through replay of the journal.[4] For data blocks, OCFS2 supports optional data journaling in two modes: ordered mode, which writes data to disk before committing the associated metadata for enhanced safety against inconsistencies, and writeback mode, which defers data writes for better performance but risks potential data loss if a failure occurs before data is flushed.[2] Each node maintains its own journal file, sized typically from 64 MB to 256 MB depending on the use case, to handle local operations efficiently.[4]

In the event of a node failure, surviving nodes initiate recovery by replaying the failed node's journal to restore the file system's state, ensuring that pending transactions are either committed or aborted cluster-wide.[26] This process is coordinated through the Distributed Lock Manager (DLM), which detects the failure via heartbeat mechanisms and clears the dead node's locks before allowing journal replay under exclusive mode.[26] Barrier I/O operations, enabled by default in modern configurations, further guarantee write ordering on the shared storage device by forcing flushes to stable storage, preventing out-of-order commits that could lead to inconsistencies.[2] During recovery, resources like truncate logs and local allocation files are processed, and orphaned inodes are reclaimed to maintain overall cluster integrity.[26]

Cross-node consistency is enforced by the DLM, which manages distributed locks across the cluster and invalidates caches on other nodes whenever a lock is granted in exclusive or shared modes, preventing stale data access.[26] This lock-based approach, combined with Lock Value Blocks (LVBs) that store recent inode metadata, ensures that all nodes see a coherent view of the file system, including support for coherent memory-mapped I/O (mmap) operations across the cluster.[26] For error handling, OCFS2 uses metadata checksums to detect corruption during operations, with the file system remounting read-only on errors by default; online repair is available via tools like tunefs.ocfs2 for certain metadata issues, while full offline checks use e2fsck-compatible utilities for comprehensive verification.[4] Quota enforcement for user and group limits is journaled as part of metadata operations, ensuring atomic updates without requiring separate recovery.[2]

OCFS2 adheres to POSIX semantics for key file operations, providing atomic unlink, rename, and mkdir actions that are visible and consistent across all nodes in the cluster.[26] These operations are protected by DLM locks, such as cluster-wide rename locks to avoid deadlocks and delete votes for unlink-while-open scenarios, where files are moved to an orphan directory until all references close.[26] This design guarantees that directory modifications appear atomic from any node's perspective, maintaining the expected behavior of a local file system in a shared environment.[4]
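The journaling behavior described above is selected with standard mount options; a brief sketch, with device and mount point as placeholders:

```
# Default ordered mode: data is flushed before the associated metadata commits
mount -t ocfs2 /dev/sdX /srv/shared

# Writeback mode with a longer commit interval trades some safety for throughput;
# barriers remain enabled unless the storage guarantees write ordering
mount -t ocfs2 -o data=writeback,commit=30 /dev/sdX /srv/shared
```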
Advanced Capabilities
OCFS2 provides several advanced features that enhance its utility in clustered environments, extending beyond core file system operations to support resource management, security, efficiency, and flexibility. These capabilities include support for disk quotas, access control lists (ACLs), and extended attributes, which enable fine-grained control over storage usage and permissions in multi-node setups.

OCFS2 supports per-user and per-group disk quotas, which enforce limits on storage allocation and are journaled to maintain consistency across cluster nodes even in the event of failures; this ensures that quota information remains synchronized without requiring offline recovery. Quotas can be enabled during file system creation with mkfs.ocfs2 or at mount time using options such as usrquota and grpquota. For security and metadata management, OCFS2 implements POSIX.1e-compliant ACLs and extended attributes (xattrs), stored directly within inodes to allow attachment of an unlimited number of name-value pairs to files, directories, and symbolic links. These features facilitate advanced access control and user-defined metadata, such as SELinux labels, while maintaining compatibility with POSIX standards.

To optimize space efficiency, OCFS2 introduced reflinks with copy-on-write (COW) semantics in Linux kernel version 2.6.32, enabling efficient file cloning and deduplication through the reflink ioctl or related system calls. This allows multiple files to share the same data blocks initially, with writes triggering COW to create independent copies, reducing storage overhead for snapshots and duplicates in virtualized or database environments; on volumes created without it, the underlying feature can be enabled afterward via tunefs.ocfs2. Additionally, OCFS2 supports sparse files via unwritten extents, which allocate space only for actual data, minimizing waste for files with large gaps, and preallocation through the fallocate system call to reserve disk space in advance for performance-critical workloads.

For maintenance without downtime, OCFS2 offers online defragmentation using the defragfs.ocfs2 tool, which reorganizes fragmented extents within files or the entire volume while the file system remains mounted and accessible across the cluster. Resize operations are also possible online, primarily for growth, using tunefs.ocfs2 to dynamically expand the volume size to utilize additional underlying storage, with the tool acquiring the necessary cluster locks to ensure safety. Further enhancing adaptability, OCFS2 allows multiple cluster sizes ranging from 4 KB to 1 MB (in powers of 2) to tune for specific workloads, such as smaller sizes for metadata-heavy applications or larger for bulk data. The file system is fully endian-neutral, supporting heterogeneous clusters with mixed 32-bit/64-bit architectures and both little-endian (x86, x86_64, ia64) and big-endian (ppc64) nodes, promoting cross-platform compatibility.
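A hedged sketch of exercising these capabilities on a mounted volume (device, mount point, and file names are placeholders; the reflink utility ships alongside the OCFS2 tooling on typical installations):

```
# Enable the copy-on-write (refcount) feature on an existing volume, then clone a file
tunefs.ocfs2 --fs-features=refcount /dev/sdX
reflink /srv/shared/vm-base.img /srv/shared/vm-clone.img

# Preallocate space for a growing file and defragment it online
fallocate -l 10G /srv/shared/scratch.img
defragfs.ocfs2 /srv/shared/scratch.img
```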
Implementation
Kernel Integration
OCFS2 has been integrated into the mainline Linux kernel since version 2.6.16, released in early 2006, marking its transition from an Oracle-specific development to a broadly available clustered file system. This inclusion provided the foundation for ongoing maintenance and enhancements, with the codebase residing in the fs/ocfs2 directory and receiving regular updates through kernel development cycles. The integration ensures that OCFS2 operates as a native file system within the Linux environment, supporting shared-disk clustering without requiring proprietary extensions.
The core kernel components of OCFS2 consist of several loadable modules that handle file system operations and clustering. The primary module, ocfs2, implements the file system logic, including extent-based allocation and journaling. Clustering is managed by ocfs2_dlm, the distributed lock manager, which coordinates access across nodes, and ocfs2_dlmfs, a specialized file system for exposing DLM resources via the VFS layer. Additional modules include ocfs2_stackglue and ocfs2_stack_o2cb, which provide the in-kernel O2CB cluster stack, and ocfs2_stack_user for configurations that delegate cluster membership to a user-space stack, enabling flexible deployment options.
OCFS2 depends on specific kernel configuration options for compilation and runtime support. The CONFIG_OCFS2_FS option must be enabled (as built-in or module) during kernel build to include the file system support, while clustering requires CONFIG_CONFIGFS_FS and related options for the O2CB stack. It integrates seamlessly with the Linux Virtual File System (VFS) layer, providing POSIX-compliant operations such as read, write, and directory traversal, while extending them with cluster-aware locking to maintain consistency across nodes.
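A quick way to confirm this support on a running system is sketched below (option names are taken from fs/ocfs2/Kconfig; the /boot/config path assumes a distribution that installs the kernel configuration there):

```
# Check that the kernel was built with OCFS2 and its cluster-stack options
grep -E 'CONFIG_(OCFS2_FS|OCFS2_FS_O2CB|OCFS2_FS_USERSPACE_CLUSTER|CONFIGFS_FS)=' \
    /boot/config-"$(uname -r)"

# Load the file system module and verify that it registered with the VFS
modprobe ocfs2
grep ocfs2 /proc/filesystems
```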
To ensure interoperability, OCFS2 maintains backward compatibility in its on-disk format and cluster protocol, allowing newer kernels to mount and operate on volumes created by older versions without data loss. This is enforced through feature flags categorized as compatible (features ignorable by older kernels), incompatible (preventing mounts if unsupported), and read-only compatible (allowing read-only access). These flags, stored in the superblock, detect mismatches and avoid corruption during mixed-version cluster operations.
OCFS2 is enabled by default in kernel configurations for major Linux distributions, including Oracle Linux (via the Unbreakable Enterprise Kernel) and SUSE Linux Enterprise (with the High Availability Extension), and is available in Red Hat Enterprise Linux (though not officially supported by Red Hat); it is compiled as a module or built in depending on the distribution's packaging.[7] Recent enhancements in Linux kernel 6.10, released in July 2024, include optimizations for write I/O performance, reducing unnecessary extent searches in fragmented scenarios by orders of magnitude, and fixes for random read issues identified through file system testing suites.
User-Space Tools
The ocfs2-tools package contains a suite of command-line utilities for formatting, tuning, checking, and managing OCFS2 file systems in user space.[4] It is typically installed via package managers, for example with yum install ocfs2-tools on Oracle Linux distributions, and requires version 1.8.0 or later for full feature support, including global heartbeat.[4] These tools operate externally to the kernel, enabling administrators to prepare and maintain shared cluster volumes without direct kernel intervention.
The mkfs.ocfs2 utility formats block devices into OCFS2 file systems, specifying parameters like block size (from 512 bytes to 4 KB), cluster size (from 4 KB to 1 MB), and the number of node slots (up to 255 for cluster mode).[4] For example, the command mkfs.ocfs2 -L label -N 4 -C 1M /dev/sdb1 creates a volume labeled "label" with 4 node slots and a 1 MB cluster size; volumes of up to 16 TB are supported with 4 KB clusters, and considerably larger with bigger cluster sizes.[4] This tool initializes the on-disk layout essential for cluster-wide access.
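A slightly fuller formatting sketch, with a placeholder device path and label, and feature names that assume ocfs2-tools 1.8 or later:

```
# Format a shared LUN for a four-node cluster, tuned for large files
mkfs.ocfs2 -b 4K -C 64K -N 4 -L "shared_vol" \
    --fs-features=sparse,unwritten,inline-data /dev/mapper/shared_lun
```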
tunefs.ocfs2 tunes existing OCFS2 file systems by modifying parameters such as node slots or UUID without reformatting.[4] It can, for instance, convert a local file system to cluster mode using tunefs.ocfs2 -M cluster -N 8 /dev/sdb1 or enable certain on-disk features post-formatting.[4] Queries for current settings are available via the -Q option.
For status monitoring, mounted.ocfs2 detects and lists all OCFS2 volumes on a system by scanning the devices listed in /proc/partitions, and in full-detect mode it also reports which cluster nodes have each volume mounted.[29] File system integrity is maintained with fsck.ocfs2, which performs consistency checks and repairs on unmounted volumes.[4]
Low-level inspection is provided by debugfs.ocfs2, which accesses OCFS2's in-kernel state through the mounted debugfs file system (typically at /sys/kernel/debug).[4] Commands like debugfs.ocfs2 -R 'fs_locks' /dev/sdb1 examine file locks, while trace bits can be set for event logging to aid debugging.[4]
Cluster management tools include o2cb, which handles the O2CB stack for initializing clusters, adding or removing nodes, and configuring heartbeat modes (local or global) in /etc/ocfs2/cluster.conf.[4] For example, o2cb add-cluster mycluster followed by o2cb add-node --ip 192.168.1.1 mycluster node1 sets up a basic cluster, with heartbeat regions defined for disk-based node monitoring.[4]
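A hedged sketch of registering a two-node cluster with a global heartbeat device, following the o2cb(8) subcommands shipped with ocfs2-tools 1.8 (cluster name, node names, addresses, and the device path are placeholders):

```
o2cb add-cluster mycluster
o2cb add-node --ip 192.168.1.1 mycluster node1
o2cb add-node --ip 192.168.1.2 mycluster node2
o2cb add-heartbeat mycluster /dev/mapper/hb_vol   # register a disk heartbeat region
o2cb heartbeat-mode mycluster global              # switch the cluster to global heartbeat
```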
The ocfs2console graphical interface, once part of ocfs2-tools for visual cluster and file system management, has been deprecated and obsoleted in favor of command-line alternatives since version 1.8.[21] Quota verification uses quotacheck.ocfs2 to scan and ensure consistency of user and group quotas stored as internal file system metadata.[4] This integrates with standard quota tools like quotaon for enabling quotas on mounted volumes.[30]
Configuration and Usage
Installation
Installing OCFS2 requires specific prerequisites to ensure compatibility in a clustered environment. OCFS2 is fully supported on Oracle Linux via the Unbreakable Enterprise Kernel (UEK); on RHEL, the module is available in standard kernels but not supported by Red Hat for production clustering; on SUSE Linux Enterprise, support is limited to use with Pacemaker in SLE 15 and was removed in SLE HA 16 (2025), so alternatives should be considered for new deployments.[7][31][17] All nodes must have access to a shared block device, such as one configured via iSCSI initiator for concurrent read-write access across the cluster.[1] A compatible Linux kernel with OCFS2 support is essential; for Oracle Linux, this is the UEK, while other distributions like RHEL or SUSE include the OCFS2 module in their standard kernels.[32] Additionally, a reliable network connection between nodes is necessary for cluster communication, typically using TCP/UDP port 7777.[1]

The installation process begins with installing the required packages on each node. On RHEL-based systems like Oracle Linux or RHEL, use the package manager to install the tools: sudo dnf install ocfs2-tools (or sudo yum install ocfs2-tools on older versions).[32] For SUSE Linux Enterprise 15, install via sudo zypper install ocfs2-tools, which also pulls in the matching kernel module packages (note: deprecated; use with Pacemaker).[33] It is critical to use the same OCFS2 and kernel versions across all nodes to avoid compatibility issues.[32] The OCFS2 kernel module is built into supported distributions like Oracle Linux; for RHEL, while included, it is unsupported, so third-party repositories like ELRepo should be avoided for production.[7]
After package installation, prepare the shared storage for heartbeat functionality, which monitors node liveness in the cluster. For global disk heartbeat mode—recommended for environments with multiple volumes—dedicate small whole disk devices (e.g., 1 GB each, not partitions) on the shared storage and format them as OCFS2 volumes using tools like mkfs.ocfs2 (detailed in the User-Space Tools section).[34] At least three such devices provide redundancy, as fencing occurs if more than 50% fail to respond.[34]
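Formatting one of these heartbeat devices might look like the following sketch (the cluster name and device path are placeholders; the --cluster-name, --cluster-stack, and --global-heartbeat options assume ocfs2-tools 1.8 or later):

```
# Format a small dedicated volume as a global heartbeat region for "mycluster"
mkfs.ocfs2 -b 4K -C 4K -N 16 -L hb_vol1 \
    --cluster-name=mycluster --cluster-stack=o2cb --global-heartbeat /dev/sdc
```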
Finally, verify the installation by loading the kernel module with sudo modprobe ocfs2 and confirming it with lsmod | grep ocfs2.[1] Ensure no conflicts with other filesystem drivers by checking that the shared devices are not claimed by local filesystems like ext4, which could prevent cluster access.[34]
Cluster Setup
Configuring an OCFS2 cluster begins with setting up the O2CB cluster stack (for supported distributions like Oracle Linux), which manages node communication and heartbeat mechanisms. On each node, run /sbin/o2cb.init configure to initialize the stack, which prompts for options such as loading the O2CB driver on boot (typically set to yes) and the cluster name, which must match the one defined in /etc/ocfs2/cluster.conf.[23] The /etc/ocfs2/cluster.conf file must be manually edited or generated to define the cluster section with parameters such as the name and node count, followed by node sections specifying each node's name, number (a unique integer), IP address for the private heartbeat network, and stack type (o2cb).[35] This configuration file must be identical and present on all nodes in the cluster before starting the stack with /sbin/o2cb.init online.[23] Note: on SUSE, O2CB is not supported; use Pacemaker for cluster management.[33]
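For illustration, a minimal /etc/ocfs2/cluster.conf for a two-node cluster might look like the sketch below (cluster name, node names, and IP addresses are placeholders; entries are indented key = value pairs under cluster: and node: stanzas):

```
cluster:
        node_count = 2
        name = mycluster

node:
        ip_port = 7777
        ip_address = 192.168.1.1
        number = 1
        name = node1
        cluster = mycluster

node:
        ip_port = 7777
        ip_address = 192.168.1.2
        number = 2
        name = node2
        cluster = mycluster
```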
After configuring the cluster, format the shared block device using mkfs.ocfs2. For a four-node cluster, execute mkfs.ocfs2 -L "volumelabel" -C 4K -N 4 /dev/sdX, where -L sets the volume label, -C specifies the cluster size (e.g., 4K for typical workloads), and -N defines the maximum number of nodes.[36] Additional features like quotas can be enabled using the --fs-features=usrquota option during formatting if required for the deployment.[10] This command creates the OCFS2 on-disk structure, including journals for each node, ensuring concurrent access across the cluster.
To mount the OCFS2 volume, add an entry to /etc/fstab on every node, such as UUID=xxxx-xxxx /mnt ocfs2 _netdev 0 0, where the _netdev option delays mounting until the network is available, preventing boot failures in networked environments.[37] Manually mount with mount.ocfs2 /dev/sdX /mnt or use mount /mnt to leverage fstab; for automatic mounting on all nodes, enable the ocfs2 service with systemctl enable ocfs2.[38] The volume becomes accessible cluster-wide once mounted on all participating nodes.
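Tying this together, the per-node configuration might resemble the following sketch (the UUID and mount point are placeholders, and the service names follow the distribution's ocfs2-tools packaging):

```
# /etc/fstab entry; _netdev defers mounting until networking and the cluster
# stack are available
# UUID=<volume-uuid>  /srv/shared  ocfs2  _netdev,noatime  0 0

systemctl enable o2cb ocfs2   # start the cluster stack and mount OCFS2 volumes at boot
mount /srv/shared             # mount immediately using the fstab entry
```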
Node management in OCFS2 involves adding or removing nodes dynamically using the o2cb utility while the cluster is online, provided global heartbeat mode is active (for supported setups). To add a node, run o2cb add-node with the cluster name, node name, and IP address, updating /etc/ocfs2/cluster.conf accordingly and propagating changes to all nodes; similarly, use o2cb remove-node for removal.[39] For failure handling, OCFS2 employs integrated fencing where a node evicts itself upon detecting heartbeat loss, configurable via cluster parameters like the dead threshold in /etc/ocfs2/cluster.conf.[40] Integration with high-availability stacks such as Pacemaker allows automated failover by combining O2CB (on Oracle Linux) with resource agents for mounting and fencing coordination, or by using Pacemaker directly on SUSE.[33]
Best practices for OCFS2 cluster setup include using UUIDs in /etc/fstab entries instead of device names to ensure stability across reboots and device changes, and adding the nointr mount option to improve performance by disabling interruptible operations during I/O.[8] Additionally, configure consistent node numbering in /etc/ocfs2/cluster.conf for all nodes, enable global heartbeat mode for scalability, and thoroughly test failover scenarios by simulating node failures to verify fencing and remounting behaviors.[8]
Performance and Limitations
Optimization Techniques
Optimizing OCFS2 performance involves selecting appropriate file system parameters, tuning journaling and I/O behaviors, configuring the network interconnect, and employing monitoring tools to identify bottlenecks, all tailored to specific workloads such as general-purpose storage or database operations.[40][8] Block and cluster size selection is a foundational optimization step. For general-purpose file systems, a 4 KB block size and cluster size are recommended as defaults, providing a balance between metadata efficiency and compatibility with most workloads.[40] For workloads involving large files, such as database datafiles or virtual machine images, larger cluster sizes ranging from 64 KB to 1 MB reduce metadata overhead by allocating extents more efficiently; this can be set during file system creation with the mkfs.ocfs2 -C option, ensuring the cluster size matches or exceeds the application's block size (e.g., 8 KB minimum for Oracle databases).[40][8]
Journaling modes and related parameters further enhance performance while balancing data integrity. The ordered mode, which is the default, ensures file data is written to disk before its associated metadata is committed to the journal, providing strong consistency guarantees suitable for database workloads. The writeback mode can be specified at mount time with -o data=writeback to prioritize performance by allowing data writes after metadata journaling, but with potential for data loss on sudden power failure.[40] To fine-tune journal commit frequency, the commit=N mount option sets the interval in seconds (e.g., mount -o commit=30 for less frequent syncing in low-contention environments), reducing I/O overhead while maintaining reasonable durability.[41]
I/O scheduling and barrier handling are critical for shared storage environments. The deadline scheduler is recommended for latency-sensitive workloads on shared disks, as it prioritizes read requests and enforces deadlines to prevent starvation; it can be set via echo deadline > /sys/block/<device>/queue/scheduler.[40] The noop scheduler suits simpler, sequential I/O patterns in clustered setups by minimizing overhead.[40] Barriers, enabled by default, enforce write ordering for consistency and should remain active unless the underlying storage (e.g., with battery-backed caches) guarantees it, in which case disabling via mount -o barrier=0 may yield minor gains.[40]
Network tuning optimizes the cluster interconnect for low latency and high bandwidth. Enabling jumbo frames (MTU up to 9000 bytes) on TCP-based networks reduces packet overhead for large transfers; this requires configuration on both hosts (e.g., ifconfig eth1 mtu 9000) and switches.[8] For low-latency cluster communication, high-speed Ethernet or InfiniBand via IPoIB can be used as the interconnect to support efficient node coordination.[8] Additionally, limiting the DLM domain size benefits small clusters by reducing lock management overhead; set the number of node slots during formatting with mkfs.ocfs2 -N to approximately twice the expected node count (e.g., 8 slots for a 4-node cluster).[40][8]
Effective monitoring helps pinpoint performance issues. Lock resources and metadata usage can be inspected with debugfs.ocfs2 (e.g., debugfs.ocfs2 -R "fs_locks" /dev/sdX to list the lock resources held for a volume), while iostat -x 1 tracks I/O metrics like throughput and wait times to identify bottlenecks.[40][8] Oracle best practices include avoiding concentrations of small files in high-contention directories to minimize lock contention and metadata I/O, as well as using mount options like noatime to reduce access time updates.[8]
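A small combined monitoring sketch (the device path is a placeholder; the o2dlm debugfs path can vary by kernel version and assumes debugfs is mounted at /sys/kernel/debug):

```
debugfs.ocfs2 -R "fs_locks" /dev/sdX        # DLM lock resources held for this volume
iostat -x 1 5                               # extended per-device I/O statistics, 5 samples
cat /sys/kernel/debug/o2dlm/*/dlm_state     # in-kernel DLM domain state
```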