
Disjoint-set data structure

The disjoint-set data structure, also known as the union-find data structure, is a data structure that maintains a collection of disjoint (non-overlapping) subsets partitioning a set of elements, enabling efficient operations to merge subsets and determine the subset containing a given element. It supports three primary operations: MakeSet(x), which creates a new set containing x; Union(x, y), which merges the sets containing x and y into a single set; and Find(x), which returns a representative (typically the root) of the set containing x. The structure is commonly implemented as a forest of rooted trees, where each element points to its parent and the root of each tree acts as the set's representative; initially, each element forms its own tree with itself as parent. To optimize performance and keep tree heights low, union by rank (or union by size) attaches the shorter (or smaller) tree to the root of the taller (or larger) one during merges, while path compression flattens the tree by making nodes point directly to the root during finds. With both optimizations, the amortized time complexity for a sequence of m operations on n elements is O(m α(n)), where α(n) is the extremely slow-growing inverse Ackermann function, which grows more slowly than the iterated logarithm and is at most 5 for all practical values of n. Originally introduced in 1964 by Bernard A. Galler and Michael J. Fischer as an improved algorithm for handling equivalence relations in storage allocation, the structure has since become fundamental in algorithm design. Notable applications include detecting connected components in undirected graphs, implementing Kruskal's algorithm for minimum spanning trees, and managing equivalence classes in various computational problems.

History

Origins and early work

The disjoint-set data structure originated in the 1950s and 1960s as a mechanism to manage dynamic equivalence relations arising in computational problems, particularly those involving grouping elements based on evolving connections. These early efforts addressed the need to efficiently track and update partitions of sets where elements could be merged over time, reflecting relations such as equivalence or connectivity that changed during computation. The structure's development was driven by the growing complexity of programming languages and algorithms requiring fast queries for set membership and merging operations. Early implementations focused on solving equivalence class problems in symbolic computation, such as handling EQUIVALENCE declarations that specified shared storage for variables in compilers. A seminal contribution came from Bernard A. Galler and Michael J. Fischer, who in 1964 introduced an improved algorithm for processing EQUIVALENCE statements in languages such as FORTRAN, alongside DIMENSION and COMMON declarations, to allocate memory efficiently while respecting equivalence constraints. Their approach used a tree-based representation of sets, with find and union operations to resolve equivalences dynamically during compilation, reducing storage overhead and computation time for large programs. This work built on prior simple linking methods but introduced optimizations for path traversal, laying the groundwork for handling disjoint partitions in software systems. The structure also found early use in graph connectivity problems, where it modeled connected components by treating vertices as initial sets and edges as operations to merge components. In this setting, it enabled efficient incremental computation of connected components in undirected graphs, avoiding full recomputation after each edge addition. The motivation extended to nascent database systems, where maintaining relations between records—such as linking entities through shared attributes—required similar merging of disjoint groups to support membership and equivalence queries without redundant storage. These applications highlighted the structure's utility in problems demanding both dynamic merging of sets and rapid determination of representative elements for equivalence checks.

Key advancements and publications

The development of efficient heuristics for the disjoint-set data structure began in the 1960s and early 1970s with foundational contributions that addressed the performance limitations of naive implementations. M. D. McIlroy proposed early union heuristics, including a weighting scheme akin to union by rank, in unpublished work that aimed to balance tree heights during set merging to prevent degenerate cases. In 1973, John E. Hopcroft and Jeffrey D. Ullman analyzed path compression in their seminal paper "Set Merging Algorithms," demonstrating how iteratively updating parent pointers along a find path could significantly reduce the depth of search trees over repeated operations. Building on these ideas, Robert E. Tarjan's 1975 paper "Efficiency of a Good but Not Linear Set Union Algorithm" provided the first rigorous amortized analysis of the structure combining union by rank and path compression, establishing an upper bound of O(m \alpha(n)) time for m operations on n elements, where \alpha is the inverse Ackermann function—a bound so slow-growing that it is effectively constant for all practical values of n. This work marked a pivotal advancement, shifting focus from worst-case to amortized performance and highlighting the near-linear overall efficiency despite occasional costly operations. Tarjan further refined the theoretical foundations in his 1979 paper "A Class of Algorithms Which Require Nonlinear Time to Maintain Disjoint Sets," proving a matching lower bound of \Omega(m \alpha(n)) for a broad class of algorithms on pointer machines, confirming the tightness of the upper bound and solidifying the structure's optimal complexity. In 1985, Tarjan's "Amortized Computational Complexity" offered a general framework for potential-based amortized analysis, applying it to union-find to underscore the inverse Ackermann bound's implications for self-adjusting data structures. Refinements continued through the late 20th century, with surveys like Galil and Italiano's 1991 overview cataloging variants and applications while affirming the core heuristics' dominance. Into the 2000s, efforts focused on practical implementations and extensions, but the \alpha(n) amortized cost per operation, proven tight by Tarjan's earlier analyses, remained the unchallenged benchmark for the standard disjoint-set structure.

Representation

Core data structures

The disjoint-set data structure maintains a collection of disjoint sets through a simple array-based representation known as the parent array. Each element in the universe of discourse is assigned an index, and the parent array stores, for each element i, the index of its parent in the hierarchical structure of the set; if i is the representative (root) of its set, then parent[i] = i. This pointer-based approach forms the basis of the data structure, allowing sets to be implicitly organized as trees where links follow parent pointers upward to the root. Initially, when creating the structure for n elements labeled from 0 to n-1, each element forms its own singleton set, so the parent array is initialized with parent[i] = i for all i. This setup ensures that every element is trivially its own representative at the start. For example, with n=5:
Initial parent array: [0, 1, 2, 3, 4]
After performing a union operation that merges the sets containing elements 1 and 2 (making 1 the parent of 2, for instance), the parent array might become:
Updated parent array: [0, 1, 1, 3, 4]
Here, elements 1 and 2 now belong to the same set represented by 1, while the others remain singletons. Overall, the parent array induces a forest—a collection of rooted trees—where each tree corresponds to one disjoint set, and the root of each tree serves as the canonical representative for all elements in that set. This forest representation naturally supports the partitioning of elements into disjoint subsets without storing explicit set boundaries. Auxiliary fields can be incorporated alongside the parent array to further optimize operations, as explored in subsequent sections on efficiency enhancements.
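As a concrete illustration, the following Python sketch reproduces this example with a plain list as the parent array (the helper name root is ours, not from the sources above):
# Parent array for n = 5 elements, each initially its own root
parent = list(range(5))            # [0, 1, 2, 3, 4]

def root(x):
    # Follow parent pointers until reaching a self-parented root
    while parent[x] != x:
        x = parent[x]
    return x

# Merge the sets containing 1 and 2 by making 1 the parent of 2
parent[2] = 1                      # parent array is now [0, 1, 1, 3, 4]
assert root(2) == root(1) == 1     # 1 represents the merged set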

Auxiliary fields for efficiency

To enhance the efficiency of the disjoint-set data structure without modifying its core parent array representation, auxiliary fields such as rank and size arrays are introduced to guide merging decisions in union operations. These fields are stored alongside the parent array and are maintained exclusively at root nodes to approximate tree properties and prevent excessive height growth. The rank array supports the union by rank heuristic by storing, at each root, an integer value that upper-bounds the height of the corresponding tree. For an initial collection of n singleton sets, the rank of every element i (which is its own root) is initialized to 0, reflecting a single-node tree of height 0. When merging two trees, the rank of the new root is set to the maximum of the two original ranks; if the ranks are equal, it is incremented by 1 to account for the increased height. This ensures ranks grow logarithmically with the number of unions, providing a simple proxy for height without requiring full tree traversals for updates. The size array, used in the union by size heuristic, records the cardinality of each set at its root to facilitate size-based merging. Upon initialization for n elements, each size[i] is set to 1, as each element forms a set of one. During a union, the size of the chosen root is updated to the sum of the sizes of both sets, while the other root's entry is not adjusted, since it becomes a non-root. This field remains valid only at roots, as path compression may alter non-root pointers but does not affect the values stored at roots. Both auxiliary fields are updated solely during union operations at the designated root, ensuring O(1) maintenance cost per update while enabling heuristics that keep tree heights bounded. For instance, the following pseudocode illustrates their joint initialization alongside the parent array:
function Initialize(n):
    parent ← array of size n, where parent[i] = i for i = 0 to n-1
    rank ← array of size n, initialized to 0
    size ← array of size n, initialized to 1
    return parent, rank, size
These fields are referenced in the optimization heuristics to decide which root absorbs the other during unions.

Operations

Creating new sets

The creation of new sets in the disjoint-set data structure involves initializing individual elements as singleton sets, forming the foundational step before any merging occurs. The core operation for this purpose is MAKE-SET(x), which establishes a new set containing solely the element x by configuring x as its own parent, designating it as the set's representative. To support later efficiency heuristics, this operation also sets the rank of x to 0 and its size to 1, where rank approximates tree height for balanced unions and size tracks the number of elements in the set. The procedure assumes x is a distinct element not already part of any existing set, ensuring no overlaps or prior associations during initialization. This maintains the disjoint property from the outset, with each new set isolated until explicitly merged. Here is the standard pseudocode for MAKE-SET(x), as described in foundational algorithmic literature:
MAKE-SET(x)
    parent[x] ← x
    rank[x] ← 0
    size[x] ← 1
When initializing multiple elements—such as an array of n distinct items—a batch process applies MAKE-SET iteratively to each one, creating n independent singleton sets in linear time overall. This approach is efficient for preparing the structure in scenarios like graph connectivity problems, where all vertices start unconnected.
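A minimal Python sketch of this batch initialization, mirroring the pseudocode above (array names follow the text; the loop is the batch step):
def make_set(x, parent, rank, size):
    # Configure x as its own parent, with rank 0 and size 1
    parent[x] = x
    rank[x] = 0
    size[x] = 1

n = 8
parent, rank, size = [0] * n, [0] * n, [1] * n
for x in range(n):
    make_set(x, parent, rank, size)   # n singleton sets in O(n) total time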

Finding set representatives

The find operation in a disjoint-set data structure identifies the representative (root) of the set containing a given element x by traversing the parent array from x until reaching a node whose parent pointer refers to itself. This representative serves as a canonical identifier for the entire set. The basic recursive implementation follows the parent pointers until the root is found:
function find(x):
    if parent[x] == x:
        return x
    return find(parent[x])
This version is simple but can cause stack overflow in languages with limited recursion depth for deeply nested trees. An iterative alternative avoids recursion by looping through the parent pointers:
function find(x):
    while parent[x] != x:
        x = parent[x]
    return x
Both versions return the index of the root as the set representative. For example, consider a parent array where parent = [0, 1, 1, 2, 3] for elements 0 through 4 (with 0 as a root and a chain 4 → 3 → 2 → 1). Calling find(4) traverses 4 to 3, then 3 to 2, then 2 to 1 (where parent[1] == 1), returning 1 as the representative. Without optimizations, the naive find operation exhibits worst-case O(n) time, as in a linear chain of n elements where traversal visits every node. The basic find can be enhanced with path compression for better performance in subsequent operations.
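The worked example above can be run directly as a Python sketch of the iterative variant:
parent = [0, 1, 1, 2, 3]   # 0 is a root; 4 -> 3 -> 2 -> 1 forms a chain

def find(x):
    # Iterative find: loop until reaching a self-parented root
    while parent[x] != x:
        x = parent[x]
    return x

print(find(4))   # 1, after visiting 4, 3, and 2
print(find(0))   # 0, a singleton root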

Basic union operation

The basic union operation merges two disjoint sets in the disjoint-set data structure by linking the root of one set's tree to the root of the other's in the underlying forest representation. To perform the union of elements x and y, the algorithm first determines the roots of their respective sets using the find operation. If the roots differ, it arbitrarily sets the parent of one root (say, the root of x's set) to the other root, thereby combining the trees without altering the internal structure of either set. This approach, introduced as part of an improved algorithm for handling equivalence declarations in programming languages, relies on a simple tree-linking mechanism to maintain the partition of elements into disjoint sets. The pseudocode for the basic union operation is straightforward and assumes the existence of a parent array representing the forest, where each non-root node points to its parent and roots point to themselves or a special value:
function find(x):
    if parent[x] != x:
        // In basic version, no path compression; traverse to root
        return find(parent[x])
    return x

function union(x, y):
    rx = find(x)
    ry = find(y)
    if rx != ry:
        parent[rx] = ry  // Arbitrary link: rx's root to ry's root
This implementation performs the merge in constant time after the finds, but the overall efficiency depends on the find operation's performance. A key limitation of this naive linking strategy is its potential to produce unbalanced trees, particularly long chains, which degrade the efficiency of future find operations. For instance, if unions are always performed by linking a new singleton set to the same existing root, the resulting structure becomes a linear chain where each find from the tail requires traversing all preceding nodes, leading to quadratic time in the worst case for a sequence of operations. To illustrate with elements 1 through 4 as initial singletons (each with parent[i] = i), union(1, 2) sets parent[1] = 2; union(2, 3) sets parent[2] = 3, yielding 1 → 2 → 3 with 4 separate; then union(3, 4) sets parent[3] = 4, resulting in 1 → 2 → 3 → 4. A subsequent find(1) now traverses the full chain of four nodes. Such degenerate cases highlight why the basic union, while simple, requires enhancements for practical use, though those heuristics are addressed elsewhere.
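The chain-building example can be reproduced with the following Python sketch (1-indexed for readability, with index 0 unused):
parent = list(range(5))   # elements 1..4 as singletons; index 0 unused

def find(x):
    while parent[x] != x:
        x = parent[x]
    return x

def union(x, y):
    # Naive arbitrary link: attach x's root under y's root
    rx, ry = find(x), find(y)
    if rx != ry:
        parent[rx] = ry

union(1, 2); union(2, 3); union(3, 4)
print(parent[1:])   # [2, 3, 4, 4]: the chain 1 -> 2 -> 3 -> 4
print(find(1))      # 4, reached only after traversing the whole chain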

Optimization Heuristics

Path compression

Path compression is a key optimization technique integrated into the find operation of the disjoint-set data structure, designed to reduce the height of trees by shortening the paths from nodes to their roots during searches. By updating parent pointers along the traversal path, it ensures that subsequent find operations on those nodes traverse fewer edges, thereby amortizing the cost over multiple operations. This method preserves the correctness of the structure, as it only modifies pointers to point toward the actual root without altering set memberships. The technique was originally proposed by McIlroy and Morris in their algorithm for finding spanning trees, and later analyzed in detail for union-find applications. There are two primary variants of path compression: full compression, which sets every node on the path directly to the root, and path halving, which updates every other node to point to its grandparent for a one-pass approximation. Full compression achieves greater flattening but requires a two-pass traversal, while path halving is simpler and performs updates in a single pass, though it results in less aggressive shortening. These variants were systematically compared and analyzed for their worst-case performance in the context of set union algorithms. The recursive implementation of find with full path compression is elegant and widely used, as it naturally compresses the path during the recursive calls. In pseudocode, it appears as follows:
function find(x):
    if parent[x] ≠ x:
        parent[x] = find(parent[x])
    return parent[x]
This recursion first finds the root of the parent and then updates the current node's parent to that root, effectively compressing the entire path on the unwind. For iterative full compression, a two-pass approach is employed: first, traverse from the node to the root while recording the path (or using a temporary root holder), then traverse back to set each node's parent to the root. This avoids recursion and is suitable for deep trees or environments with stack limitations. To illustrate, consider a chain of four nodes A → B → C → D, where D is the root (i.e., parent[A] = B, parent[B] = C, parent[C] = D, parent[D] = D), resulting in a height of 3. After performing find(A) with full compression, the structure flattens to A → D, B → D, C → D, reducing the height to 1 and making future finds on A, B, or C constant-time in terms of path length. With path halving, the traversal would set parent[A] = C (the grandparent of A) and then continue from C, which already points to the root D, leaving B unchanged; this yields A → C → D with B still pointing to C, for a partially reduced height of 2. The primary benefit of path compression is its ability to drastically reduce the amortized cost of find operations, approaching constant time in practice when used alone or in combination with union heuristics like union by rank, without introducing any overhead to the union step itself.
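Both variants can be sketched in Python over a shared parent list; this is an illustrative comparison, not tuned library code:
def find_compress(parent, x):
    # Full compression: recurse to the root, then rewrite every node on the path
    if parent[x] != x:
        parent[x] = find_compress(parent, parent[x])
    return parent[x]

def find_halving(parent, x):
    # Path halving: every other node on the path is pointed at its grandparent
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

# Chain A=0 -> B=1 -> C=2 -> D=3 (root), as in the example above
p1 = [1, 2, 3, 3]
find_compress(p1, 0)
print(p1)   # [3, 3, 3, 3]: fully flattened

p2 = [1, 2, 3, 3]
find_halving(p2, 0)
print(p2)   # [2, 2, 3, 3]: A now points to C; B is untouched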

Union by rank

The union by rank heuristic enhances the disjoint-set data structure by linking trees during unions in a way that maintains balance, using approximate tree heights to prevent excessive growth. Each root maintains a rank, a non-decreasing integer starting at 0 for new sets and serving as an upper bound on the tree's height. To perform a union, the roots r_x and r_y of the two sets are first identified using the find operation. The root with the smaller rank becomes a child of the root with the larger rank. If the ranks are equal, one root (arbitrarily chosen, say r_y) is made a child of the other (r_x), and the rank of r_x is incremented by 1. This procedure is illustrated in the following pseudocode:
function union_by_rank(x, y):
    r_x = find(x)
    r_y = find(y)
    if r_x != r_y:
        if rank[r_x] > rank[r_y]:
            parent[r_y] = r_x
        elif rank[r_x] < rank[r_y]:
            parent[r_x] = r_y
        else:
            parent[r_y] = r_x
            rank[r_x] = rank[r_x] + 1
Consider an example involving two trees: one whose root A has rank 1 (for instance, a two-element chain) and another whose root B has rank 2 (height at most 2). Upon union, A is linked directly to B, preserving the rank of B at 2 and ensuring the new tree's height remains bounded by the larger original rank without increase. This approach guarantees that the rank increases only when merging equal-rank trees, limiting the maximum rank to O(\log n) where n is the number of elements, as each rank increment at least doubles the minimum number of nodes in the tree.
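A Python sketch of union by rank, paired with a plain iterative find (a minimal illustration under the conventions above):
def find(parent, x):
    while parent[x] != x:
        x = parent[x]
    return x

def union_by_rank(parent, rank, x, y):
    rx, ry = find(parent, x), find(parent, y)
    if rx == ry:
        return
    if rank[rx] < rank[ry]:
        rx, ry = ry, rx          # ensure rx has the larger (or equal) rank
    parent[ry] = rx              # attach the lower-rank root under rx
    if rank[rx] == rank[ry]:
        rank[rx] += 1            # only equal-rank merges grow the rank

n = 4
parent, rank = list(range(n)), [0] * n
union_by_rank(parent, rank, 0, 1)   # equal ranks: new root 0 gets rank 1
union_by_rank(parent, rank, 2, 3)   # new root 2 gets rank 1
union_by_rank(parent, rank, 0, 2)   # equal ranks again: root rank becomes 2
print(max(rank))                    # 2 = log2(4): rank stays logarithmic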

Union by size

The union by size heuristic enhances the efficiency of the disjoint-set data structure by maintaining a size field for each node, representing the number of nodes in its subtree, including itself. This field is initialized to 1 for each singleton set containing a single element. By always merging the smaller tree into the larger one, the heuristic promotes balanced tree growth, thereby limiting the increase in tree height during unions. In the union procedure, the roots of the two sets to be merged are first identified using the find operation. Let r_x and r_y denote these roots. The root of the smaller tree (based on the size field) is then set as a child of the root of the larger tree, and the size of the new parent root is updated by adding the size of the attached subtree. If the sizes are equal, an arbitrary choice can be made, with the size of the chosen parent increased accordingly. This approach ensures that the depth of nodes in the smaller tree increases by at most 1, while larger trees absorb smaller ones without significant height penalty. The following pseudocode illustrates the union by size operation, assuming parent[] and size[] arrays:
function union(x, y):
    rx = find(x)
    ry = find(y)
    if rx != ry:
        if size[rx] < size[ry]:
            parent[rx] = ry
            size[ry] += size[rx]
        else:
            parent[ry] = rx
            size[rx] += size[ry]
This implementation requires constant-time access to the size fields, which are stored alongside the parent pointers in the auxiliary structure. Consider an example where two disjoint sets are merged: one set with root r_x and size 3 (containing elements a, b, c), and another with root r_y and size 5 (containing elements d, e, f, g, h). Upon union, since size[r_x] < size[r_y], r_x becomes a child of r_y, and size[r_y] is updated to 8. The resulting tree has r_y as the new root, with the subtree rooted at r_x attached, maintaining the overall structure's balance. The primary advantage of union by size over naive linking is its direct control over tree balance through cardinality, which keeps the maximum tree height logarithmic in the number of elements, providing a practical alternative to other heuristics for maintaining efficiency in dynamic set operations. It is often combined with path compression for near-optimal performance in practice.
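The same merge expressed as a Python sketch, with elements a through h mapped to indices 0 through 7 for illustration:
def find(parent, x):
    while parent[x] != x:
        x = parent[x]
    return x

def union_by_size(parent, size, x, y):
    rx, ry = find(parent, x), find(parent, y)
    if rx == ry:
        return
    if size[rx] < size[ry]:
        rx, ry = ry, rx          # rx becomes the root of the larger set
    parent[ry] = rx              # the smaller tree hangs below the larger
    size[rx] += size[ry]

n = 8
parent, size = list(range(n)), [1] * n
for a, b in [(0, 1), (0, 2)]:                  # set {a, b, c}: root 0, size 3
    union_by_size(parent, size, a, b)
for a, b in [(3, 4), (3, 5), (3, 6), (3, 7)]:  # set {d..h}: root 3, size 5
    union_by_size(parent, size, a, b)
union_by_size(parent, size, 0, 3)   # size 3 < size 5: root 0 attaches under 3
print(find(parent, 0), size[3])     # 3 8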

Time Complexity

Unoptimized performance

The make-set operation, which initializes a new set for an element, requires constant time, O(1), as it simply sets the parent pointer of the element to itself. In the unoptimized implementation, the find operation naively traverses the parent pointers from a given node until reaching the root representative of its set. Trees can degenerate into linear chains, leading to a worst-case time of O(n) per find, where n is the number of elements in the set. The basic union operation, which links the roots of two sets, takes O(1) time once the finds have identified the roots; however, the overall cost of a union can reach O(n) when accounting for the finds in chain-like structures. For a sequence of m intermixed make-set, find, and union operations on n elements, the unoptimized disjoint-set structure exhibits a worst-case total time of O(m n). This bound arises because repeated unions can form long chains without balancing, forcing subsequent finds to traverse up to n nodes each time. For instance, consider an adversarial sequence starting with make-set(1), followed by make-set(i) and union(1, i) for i from 2 to n (assuming unions link the root of the first set to the root of the second); this constructs a linear chain where each subsequent find on element 1 traverses the entire chain. Such performance motivates the use of heuristics to achieve better bounds in practice.
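A Python sketch of the adversarial sequence, counting parent-pointer traversals to make the quadratic growth visible (the step counter is our own instrumentation):
def chain_cost(n):
    parent = list(range(n + 1))   # elements 1..n; index 0 unused
    steps = 0

    def find(x):
        nonlocal steps
        while parent[x] != x:
            steps += 1            # count each pointer traversal
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[rx] = ry       # link first root under second

    for i in range(2, n + 1):
        union(1, i)               # each call re-walks the growing chain
    return steps

print(chain_cost(100))   # 4851 = 99*98/2: roughly n^2/2 traversals
print(chain_cost(200))   # 19701: about 4x the work when n doubles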

Amortized bounds with heuristics

When path compression is applied alone, typically with arbitrary linking for unions, the amortized time for a sequence of m find and union operations on n elements is O(m \log n). Similarly, employing union by rank or union by size without path compression yields a bound of O(m \log n), as these methods limit the tree height to O(\log n). Combining path compression with union by rank or size achieves the near-optimal amortized bound of O(m \alpha(n)), where \alpha(n) denotes the inverse Ackermann function; this function increases so gradually that \alpha(n) \leq 5 for all n relevant to modern computing. In this fully optimized disjoint-set structure, the total time required to perform m operations starting from n singleton sets is thus O(m \alpha(n)). These heuristics maintain exceptionally low tree heights overall—union by rank or size limits growth logarithmically, while path compression dynamically flattens search paths to the root, ensuring each operation costs nearly constant time in aggregate.

Analysis of the inverse Ackermann function

The Ackermann function, denoted A(m, n), is a recursively defined fast-growing function that serves as a benchmark for extremely rapid growth rates in computable functions. It is defined for nonnegative integers m and n as follows: \begin{align*} A(0, n) &= n + 1, \\ A(m + 1, 0) &= A(m, 1), \\ A(m + 1, n + 1) &= A(m, A(m + 1, n)). \end{align*} This yields values such as A(1, n) = n + 2, A(2, n) = 2n + 3, A(3, n) = 2^{n+3} - 3, and A(4, n) = 2 \uparrow\uparrow (n + 3) - 3 (using Knuth's up-arrow notation for tetration), where A(4, n) exceeds any primitive recursive function for sufficiently large n. The inverse Ackermann function, \alpha(n), is defined as the smallest nonnegative k such that A(k, k) \geq \log n (with adjustments in some variants to align with the analysis), capturing the slow-growing inverse needed for the union-find bound. For example, \alpha(2^{65536}) \leq 5, and \alpha(n) \leq 4 for all n up to values vastly larger than the number of atoms in the observable universe. Tarjan's proof of the amortized bound employs a potential function \Phi that measures the "disorder" in the structure, defined as \Phi = \sum_v c(v), where c(v) counts the number of nodes whose path to the root passes through v at certain iterative levels of compression. Ranks bound tree heights logarithmically, ensuring unions increase ranks sparingly (at most \log n times per tree). During a find operation with path compression, the potential decrease bounds the path length traversed, as each step flattens the tree and reduces the number of nodes at higher levels. Levels are iterated up to \alpha(n) times, where each level corresponds to a recursive application akin to the Ackermann growth, limiting the total changes. The amortized cost per find operation is thus at most \alpha(n) + O(1), derived from the fact that the actual cost is charged to the initial potential plus a constant, while compressions yield a potential drop of at least the path length minus \alpha(n). Unions contribute O(1) amortized cost via rank maintenance. This bound arises from analyzing the forest over m operations on n elements, where the total time is O(m \alpha(n)). In practice, \alpha(n) < 5 for all feasible n, rendering the time complexity effectively constant and superior to polylogarithmic bounds for large-scale applications.
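A small Python sketch, assuming the two-argument definition above and the simplified variant alpha(n) = min{k : A(k, k) >= n}, shows how explosively A(k, k) grows and hence how slowly alpha rises:
from functools import lru_cache

@lru_cache(maxsize=None)
def A(m, n):
    # Textbook two-argument Ackermann recurrence
    if m == 0:
        return n + 1
    if n == 0:
        return A(m - 1, 1)
    return A(m - 1, A(m, n - 1))

print([A(k, k) for k in range(4)])   # [1, 3, 7, 61]

def alpha(n):
    # Smallest k with A(k, k) >= n; only call for n <= A(3, 3) = 61, since
    # evaluating A(4, 4) = 2^^7 - 3 (a power tower of seven 2s) is hopeless
    k = 0
    while A(k, k) < n:
        k += 1
    return k

print(alpha(61))   # 3; for every larger representable n, alpha(n) = 4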

Applications

Graph connectivity problems

The disjoint-set data structure is widely used to solve connectivity problems by efficiently tracking whether vertices belong to the same connected component through union and find operations. In undirected graphs, it enables the detection of connectivity, identification of cycles, and construction of spanning structures like minimum spanning trees, all while maintaining near-linear performance across multiple operations. One prominent application is Kruskal's algorithm for finding the minimum spanning tree (MST) of a connected, undirected graph with weighted edges. The algorithm sorts all edges in non-decreasing order of weight and processes them sequentially: for each edge, it performs a find operation on its endpoints to check if they are in different sets; if so, it unions the sets and adds the edge to the MST. This greedy approach ensures the MST is constructed without cycles, as unions only occur between distinct components. The following pseudocode illustrates the core logic using a disjoint-set structure:
function KruskalMST(G, w):
    T = empty set  // MST edges
    for each vertex v in G:
        make_set(v)  // Initialize each vertex as singleton
    sort edges of G by weight w in increasing order
    for each edge (u, v) in sorted edges:
        if find(u) != find(v):
            union(u, v)
            add (u, v) to T
    return T
The disjoint-set operations in Kruskal's algorithm contribute O(m α(n)) time for m edges and n vertices, where α is the inverse Ackermann function, and are dominated by the initial O(m log m) sorting step. To determine the connected components of an undirected graph, initialize each vertex as its own singleton set and union the sets for every pair of adjacent vertices. After processing all edges, the number of distinct roots (via find operations) equals the number of components, and vertices sharing the same root belong to the same component. This approach runs in O(m α(n)) time overall. Cycle detection in an undirected graph leverages the same structure: for each edge (u, v), if find(u) and find(v) return the same root before the union, a cycle exists, as the edge connects vertices already in the same component. Otherwise, perform the union. This detects cycles in O(m α(n)) time without explicitly traversing paths.
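A compact Python sketch of Kruskal's algorithm with a path-halving, size-balanced disjoint set (the (weight, u, v) edge-list format is an assumption for illustration):
def kruskal(n, edges):
    # edges: list of (weight, u, v) tuples over vertices 0..n-1
    parent, size = list(range(n)), [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(u, v):
        ru, rv = find(u), find(v)
        if ru == rv:
            return False                    # edge would close a cycle
        if size[ru] < size[rv]:
            ru, rv = rv, ru
        parent[rv] = ru
        size[ru] += size[rv]
        return True

    mst = []
    for w, u, v in sorted(edges):           # O(m log m) sort dominates
        if union(u, v):                     # O(alpha(n)) amortized per edge
            mst.append((u, v, w))
    return mst

edges = [(4, 0, 1), (1, 1, 2), (3, 0, 2), (2, 2, 3)]
print(kruskal(4, edges))   # [(1, 2, 1), (2, 3, 2), (0, 2, 3)], total weight 6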

Other algorithmic uses

The disjoint-set data structure finds application in offline equivalence queries, where a sequence of union operations merges elements into equivalence classes, followed by find operations to determine class membership. This is particularly useful in symbolic computation and compiler optimizations, such as alias analysis, where variables or pointers are grouped based on equivalence relations derived from program statements. For instance, in points-to analysis, the structure efficiently tracks sets of memory locations that a pointer may reference by unioning nodes in a constraint graph representing equality relations, achieving near-linear time performance through path compression and weighted union. In image processing, the disjoint-set structure supports segmentation by labeling connected components, where adjacent pixels with similar properties (e.g., color or intensity) are unioned into the same set, enabling efficient partitioning of large-scale images without repeated traversals. An optimized variant processes unions during a single pass over the image, resolving equivalences in a second pass to assign unique labels to each component, reducing memory usage and achieving practical runtimes on high-resolution data. For network connectivity in social networks, the structure facilitates online friend groupings or clustering by dynamically merging user profiles into partitions based on shared connections or attributes, such as mutual interests, allowing incremental updates to equivalence classes without recomputing the entire partition. This approach underpins clustering algorithms that anonymize or analyze communities by iteratively unioning nodes with high similarity, preserving privacy while identifying dense subgroups. A representative example is handling EQUIVALENCE statements in a compiler, where declarations like EQUIVALENCE (a, b) or pointer assignments imply shared storage or aliasing. The following pseudocode illustrates offline handling:
initialize disjoint sets for all variables
for each equivalence statement in program:
    union(find(a), find(b))  // merge equivalence classes
for each query (e.g., alias check between x and y):
    if find(x) == find(y):
        report equivalent (potential alias)
This method efficiently resolves all equivalences in amortized nearly constant time per operation, avoiding full traversals or rebuilds of dependency graphs. The advantage lies in its ability to handle dynamic merges scalably, supporting thousands of updates in practice while maintaining integrity for subsequent analyses like optimization passes.
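A Python sketch of this offline pattern, keyed by variable names rather than integer indices (the sample statements are hypothetical):
parent = {}

def find(x):
    parent.setdefault(x, x)             # lazily create a singleton for new names
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(x, y):
    rx, ry = find(x), find(y)
    if rx != ry:
        parent[rx] = ry

# Hypothetical equivalence statements gathered from a program
for a, b in [("a", "b"), ("c", "d"), ("b", "c")]:
    union(a, b)

print(find("a") == find("d"))   # True: a, b, c, d form one class
print(find("a") == find("e"))   # False: e was never merged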

Extensions and Variants

Support for deletions

The standard disjoint-set data structure does not natively support deletions, as removing an element from a set can require splitting the set or restructuring the underlying representation, while path compression flattens trees in a way that obscures subtree relationships and makes removal challenging without additional bookkeeping. One foundational approach to incorporating deletions is the scheme proposed by Kaplan, Shafrir, and Tarjan, which extends the classical link-by-size heuristic to handle element removals while preserving near-optimal performance. In their method, deletions are processed by marking the element and adjusting set sizes, with unions and finds modified to account for marked (deleted) elements during path traversal. To prevent degradation from accumulated marks, sets are periodically rebuilt by creating fresh trees containing only active elements, linking subtrees as needed. This lazy deletion strategy incurs amortized O(α(n)) time per union and find operation, where α(n) is the inverse Ackermann function, and O(log n / log log n) time per deletion. Ben-Amram and Yoffe (2011) presented a simple and efficient Union-Find-Delete algorithm achieving amortized O(1) time for unions and deletions, and O(α(n)) for finds, preserving the efficiency of the standard structure. Alstrup, Gørtz, Rauhe, Thorup, and Zwick developed a deterministic structure supporting O(1) worst-case time for deletions, makesets, and unions, with amortized O(α(n)) time for finds. Their construction modifies the trees in the union-find forest to accommodate deletions, using a binomial-tree-like representation for sets that facilitates quick removal of deleted nodes without frequent rebuilds. This balances the need for efficient merging with the overhead of removals, though it introduces slightly higher constants and memory usage compared to non-deletion variants. For an example of a basic deletion operation in a tree-based representation supporting splits, consider the following pseudocode for deleting x, assuming a parent array and size tracking (with lazy marking omitted for simplicity; in practice, full implementations handle path compression adjustments):
function delete(x):
    if x is root:  # Singleton set
        mark x as deleted
        return
    parent_x = parent[x]
    # Isolate x by detaching its children
    for each child y of x:  # Traverse children if stored, or recompute via reverse links
        parent[y] = parent_x
    # Attach x's former siblings or adjust as needed; if x had no children, simply mark
    mark x as deleted
    # Update sizes along the path to the root
    current = parent_x
    while true:
        size[current] -= 1
        if size[current] < threshold:  # Trigger rebuild if needed
            rebuild_set(current)
        if parent[current] == current:  # Reached the root
            break
        current = parent[current]
This isolates x and potentially splits its subtree by reattaching children to the grandparent, but requires reverse links or child lists (increasing memory) to avoid O(n) scans; without them, splits may merge unintended subtrees. These deletion-supporting extensions generally trade off the near-constant amortized time of the standard structure for logarithmic factors, with complexities rising to O(log n) or higher per operation in some cases, and increased implementation complexity due to marking, rebuilding, or level-based decompositions. Periodic full rebuilds in lazy approaches can also introduce bursts of higher cost, though amortized bounds remain efficient for dynamic applications like incremental connectivity.
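A minimal Python sketch of the lazy-marking idea (tombstones plus a rebuild when deleted elements dominate; the class name and the 50% threshold are our assumptions, not details of the cited schemes):
class LazyDeleteDSU:
    def __init__(self, n):
        self.parent = list(range(n))
        self.deleted = [False] * n
        self.active = n                    # live elements across all sets

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x                           # a root may be a tombstone; sets stay valid

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[rx] = ry

    def delete(self, x):
        # Mark x deleted; rebuild once tombstones outnumber live elements
        if not self.deleted[x]:
            self.deleted[x] = True
            self.active -= 1
            if self.active * 2 < len(self.parent):
                self.rebuild()

    def rebuild(self):
        # Resolve all roots first, then re-link every live element directly
        # under one live member of its set (two passes keep sets intact)
        live = [v for v in range(len(self.parent)) if not self.deleted[v]]
        roots = [self.find(v) for v in live]
        rep = {}                           # old root -> chosen live representative
        for v, r in zip(live, roots):
            rep.setdefault(r, v)
            self.parent[v] = rep[r]

dsu = LazyDeleteDSU(4)
dsu.union(0, 1); dsu.union(1, 2)
dsu.delete(1)                              # 1 is gone; 0 and 2 stay connected
print(dsu.find(0) == dsu.find(2))          # True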

Backtracking capabilities

Backtracking capabilities in the disjoint-set data structure enable the reversal of union operations, which is crucial for algorithms that involve exploratory trial-and-error processes, such as constraint satisfaction problems where temporary merges of equivalence classes must be undone upon encountering inconsistencies. A straightforward method to support backtracking is the stack-based undo technique, which records the modifications made during each union—such as changes to the parent pointers, ranks, or sizes of roots—on a stack before applying them. When backtracking is required, the stack is popped in reverse order to restore the prior state, effectively undoing the most recent unions. This approach preserves the efficiency of the standard disjoint-set operations, adding only constant-time overhead per union and backtrack in the amortized sense. The following pseudocode illustrates a union operation with stack-based undo using union by rank (without path compression to simplify reversibility):
function union(x, y, stack):
    rx = find(x)
    ry = find(y)
    if rx == ry:
        return
    # Assume union by rank
    if rank[rx] < rank[ry]:
        swap(rx, ry)
    # Record changes for undo
    old_parent = parent[ry]
    old_rank = rank[rx]
    stack.push((ry, old_parent, rx, old_rank))  # Tuple: (attached_root, old_parent, link_root, old_link_rank)
    parent[ry] = rx
    if rank[rx] == rank[ry]:
        rank[rx] += 1

function backtrack(stack, num_undos):
    for i in 1 to num_undos:
        if stack.empty():
            break
        (ry, old_parent, rx, old_rank) = stack.pop()
        parent[ry] = old_parent
        rank[rx] = old_rank
In this implementation, each union pushes a constant number of values onto the stack, and each backtrack restores them in O(1) time per undone union, yielding O(1) amortized overhead when integrated with the inverse Ackermann function complexity of finds. For scenarios requiring multiple versions or immutable histories, a persistent variant of the disjoint-set structure uses copy-on-write mechanisms to create new copies of only the modified nodes (e.g., roots and their paths) upon each union, allowing access to previous snapshots without altering the current state. This facilitates efficient backtracking by switching to an earlier version pointer. Seminal work on persistent data structures demonstrates that such a persistent union-find achieves update and find times of O(log n / log log n) amortized, with space proportional to the number of operations. A more recent implementation matches the non-persistent amortized inverse Ackermann complexity while ensuring full persistence.
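A Python sketch of the stack-based pattern above (union by rank, no path compression, so each undo is an exact reversal; the class name is ours):
class RollbackDSU:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n
        self.history = []                 # one entry per union, even no-ops

    def find(self, x):
        # No path compression: keeps every union reversible in O(1)
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            self.history.append(None)    # record a no-op for uniform undo
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        bumped = self.rank[rx] == self.rank[ry]
        self.parent[ry] = rx
        if bumped:
            self.rank[rx] += 1
        self.history.append((ry, rx, bumped))

    def undo(self):
        entry = self.history.pop()
        if entry is None:
            return
        ry, rx, bumped = entry
        self.parent[ry] = ry             # detach the attached root
        if bumped:
            self.rank[rx] -= 1           # roll back the rank increment

dsu = RollbackDSU(3)
dsu.union(0, 1)
print(dsu.find(0) == dsu.find(1))   # True: tentatively merged
dsu.undo()
print(dsu.find(0) == dsu.find(1))   # False: the trial union is reversed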

Concurrent and parallel versions

In concurrent settings, the disjoint-set data structure encounters significant challenges due to race conditions arising from multiple threads accessing and modifying the shared parent array simultaneously during find and union operations, potentially leading to incorrect path compression or linking that violates the disjoint-set invariants. To address these issues, lock-free implementations leverage atomic operations, such as compare-and-swap (CAS), to ensure thread-safe updates without traditional locks, enabling non-blocking progress for at least one thread. These approaches often employ optimistic techniques where operations proceed assuming low contention, with validation steps to detect and retry on conflicts. A seminal randomized concurrent algorithm uses CAS to implement wait-free find and union operations, achieving near-constant expected amortized time per operation. For example, an optimistic concurrent find might first traverse the path to the root while recording nodes, then validate the path integrity before applying compression via CAS operations on each node to point directly to the root, retrying if any CAS fails due to concurrent modification:
function concurrent_find(x):
    if atomic_load(parent[x]) == x:
        return x
    path = []
    curr = x
    while atomic_load(parent[curr]) != curr:
        path.append(curr)
        curr = atomic_load(parent[curr])
    root = curr
    # Validate and compress
    for node in reversed(path):
        expected = atomic_load(parent[node])
        if expected != the recorded parent for node:  # Simplified validation
            retry entire find  # Or handle locally
        else:
            CAS(&parent[node], expected, root)  # Atomic set to root
    return root
This illustrates a basic optimistic strategy, where traversal uses atomic loads and compression relies on CAS for safe updates; actual implementations may incorporate randomization or versioning for stronger guarantees. Recent advancements include a 2024 machine-verified multicore disjoint-set union structure that maintains the near-constant O(α(n)) amortized cost per operation, even under high contention, using fine-grained atomic updates verified for linearizability. This design supports scalability across p processors, delivering near-linear speedup for batches of m operations in parallel algorithms like connected components, as evidenced by its deployment in production multicore systems.

References

  1. Disjoint Set Data Structure (course notes on implementing Union/Find).
  2. Z. Galil and G. F. Italiano. Data Structures and Algorithms for Disjoint Set Union Problems.
  3. J. E. Hopcroft and J. D. Ullman. Set Merging Algorithms. SIAM Journal on Computing.
  4. R. E. Tarjan. Efficiency of a Good But Not Linear Set Union Algorithm.
  5. R. E. Tarjan. A Class of Algorithms Which Require Nonlinear Time to Maintain Disjoint Sets.
  6. R. E. Tarjan. Amortized Computational Complexity.
  7. B. A. Galler and M. J. Fischer. An Improved Equivalence Algorithm. Communications of the ACM, 7 (1964), 301-303.
  8. R. Sedgewick and K. Wayne. 1.5 Case Study: Union-Find. Algorithms, 4th Edition.
  9. Naive Linking, Link-by-Size, Link-by-Rank, Path Compression (lecture slides).
  10. Disjoint Set Union (Princeton University lecture notes).
  11. Data Structures for Disjoint Sets (lecture notes).
  12. B. Steensgaard. Points-to Analysis in Almost Linear Time.
  13. The Union-Split Algorithm and Cluster-Based Anonymization of Social Networks.
  14. S. Alstrup, I. L. Gørtz, T. Rauhe, M. Thorup, and U. Zwick. Union-Find with Constant Time Deletions.
  15. A. M. Ben-Amram and S. Yoffe. A Simple and Efficient Union-Find-Delete Algorithm.
  16. J. Westbrook and R. E. Tarjan. Amortized Analysis of Algorithms for Set Union with Backtracking.
  17. S. Conchon and J.-C. Filliâtre. A Persistent Union-Find Data Structure.
  18. Concurrent Disjoint Set Union. arXiv:2003.01203 (2020).
  19. A Randomized Concurrent Algorithm for Disjoint Set Union.