Disjoint-set data structure
The disjoint-set data structure, also known as the union-find data structure, is a data structure that maintains a collection of disjoint (non-overlapping) subsets partitioning a finite set of elements, enabling efficient operations to merge subsets and determine the subset containing a given element.[1]
It supports three primary operations: MakeSet(x), which creates a new singleton set containing element x; Union(x, y), which merges the sets containing elements x and y into a single set; and Find(x), which returns a representative element (typically the root) of the set containing x.[1]
The structure is commonly implemented as a forest of rooted trees, where each element points to its parent, and the root of each tree acts as the set's representative; initially, each element forms its own singleton tree with itself as parent.[1]
To optimize performance and keep tree heights low, union by rank (or union by size) attaches the shorter (or smaller) tree to the root of the taller (or larger) one during merges, while path compression flattens the tree by making nodes point directly to the root during finds.[1]
With both optimizations, the amortized time complexity for a sequence of m operations on n elements is O(m α(n)), where α(n) is the extremely slow-growing inverse Ackermann function, which grows slower than any iterated logarithm and is at most 5 for all practical values of n.
Originally introduced in 1964 by Bernard A. Galler and Michael J. Fischer as an algorithm for handling equivalence relations in storage allocation, the structure has since become fundamental in algorithm design.
Notable applications include detecting connected components in undirected graphs, implementing Kruskal's algorithm for minimum spanning trees, and managing equivalence classes in various computational problems.[1]
History
Origins and early work
The disjoint-set data structure originated in the 1950s and 1960s as a mechanism to manage dynamic equivalence relations arising in computational problems, particularly those involving grouping elements based on evolving connections. These early efforts addressed the need to efficiently track and update partitions of sets where elements could be merged over time, reflecting relations like equality or connectivity that changed during computation. The structure's development was driven by the growing complexity of programming languages and algorithms requiring fast queries for set membership and merging operations.
Early implementations focused on solving equivalence class problems in symbolic computation, such as handling declarations that specified shared storage for variables in compilers. A seminal contribution came from Bernard A. Galler and Michael J. Fischer, who in 1964 introduced an improved algorithm for processing EQUIVALENCE statements in languages like FORTRAN and MAD, alongside DIMENSION and COMMON declarations to allocate memory efficiently while respecting equivalence constraints.[2] Their approach used a tree-based representation to represent sets, with find and union operations to resolve equivalences dynamically during compilation, reducing storage overhead and computation time for large programs. This work built on prior simple linking methods but introduced optimizations for path traversal, laying the groundwork for handling disjoint partitions in software systems.
The structure also found early use in graph connectivity problems, where it modeled connected components by treating vertices as initial singleton sets and edges as union operations to merge components. In the 1960s, this enabled efficient incremental computation of reachability in undirected graphs, avoiding full recomputation after each addition. The motivation extended to nascent database systems, where maintaining relations between records—such as linking entities through shared attributes—required similar merging of disjoint groups to support queries on equivalence and hierarchy without redundant storage. These applications highlighted the structure's utility in problems demanding both union of sets and rapid determination of representative elements for equivalence checks.
Key advancements and publications
The development of efficient heuristics for the disjoint-set data structure began in the early 1970s with foundational contributions that addressed the performance limitations of naive implementations. In the early 1970s, M. D. McIlroy proposed early union heuristics, including a weighting scheme akin to union by rank, in unpublished work that aimed to balance tree heights during set merging to prevent degenerate cases.[3] Concurrently, John E. Hopcroft and Jeffrey D. Ullman introduced path compression in their seminal paper "Set Merging Algorithms," demonstrating how iteratively updating parent pointers along a find path could significantly reduce the depth of search trees over repeated operations.[3]
Building on these ideas, Robert E. Tarjan's 1975 paper "Efficiency of a Good but Not Linear Set Union Algorithm" provided the first rigorous amortized analysis of the structure combining union by rank and path compression, establishing an upper bound of O(m \alpha(n)) time for m operations on n elements, where \alpha is the inverse Ackermann function—a bound so slow-growing that it is effectively constant for all practical values of n.[4] This work marked a pivotal advancement, shifting focus from worst-case to amortized performance and highlighting the near-linear overall efficiency despite occasional costly operations.
Tarjan further refined the theoretical foundations in his 1979 paper "A Class of Algorithms Which Require Nonlinear Time to Maintain Disjoint Sets," proving a matching lower bound of \Omega(m \alpha(n)) for a broad class of algorithms on pointer machines, confirming the tightness of the upper bound and solidifying the structure's optimal complexity.[5] In 1985, Tarjan's "Amortized Computational Complexity" offered a general framework for potential-based amortized analysis, applying it to disjoint sets to underscore the inverse Ackermann bound's implications for self-adjusting data structures.[6]
Refinements continued through the late 20th century, with surveys like Galil and Italiano's 1991 overview cataloging variants and applications while affirming the core heuristics' dominance.[7] Into the 2000s, efforts focused on practical implementations and extensions, but the \alpha(n) amortized cost per operation, proven tight by Tarjan's earlier analyses, remained the unchallenged benchmark for the standard disjoint-set structure.
Representation
Core data structures
The disjoint-set data structure maintains a collection of disjoint sets through a simple array-based representation known as the parent array. Each element in the universe of discourse is assigned an index, and the parent array stores, for each element i, the index of its parent in the hierarchical structure of the set; if i is the representative (root) of its set, then parent[i] = i. This pointer-based approach forms the basis of the data structure, allowing sets to be implicitly organized as trees where links follow parent pointers upward to the root.
Initially, when creating the structure for n elements labeled from 0 to n-1, each element forms its own singleton set, so the parent array is initialized with parent[i] = i for all i. This setup ensures that every element is trivially its own representative at the start. For example, with n=5:
Initial parent array: [0, 1, 2, 3, 4]
After performing a union operation that merges the sets containing elements 1 and 2 (making 1 the parent of 2, for instance), the array might become:
Updated parent array: [0, 1, 1, 3, 4]
Here, elements 1 and 2 now belong to the same set represented by 1, while the others remain singletons.
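This small example can be reproduced directly in code. The following Python sketch (illustrative only; it mirrors the n = 5 example above rather than providing a complete implementation) initializes the parent array and performs the single merge by hand:
# Parent-array representation for n = 5, following the example above.
n = 5
parent = list(range(n))      # [0, 1, 2, 3, 4]: every element is its own root

parent[2] = 1                # merge the sets of 1 and 2 by making 1 the parent of 2
print(parent)                # [0, 1, 1, 3, 4]

def root(i):
    while parent[i] != i:    # follow parent pointers up to the self-parented root
        i = parent[i]
    return i

print(root(2) == root(1))    # True: 1 and 2 now share the representative 1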
Overall, the parent array induces a forest—a collection of rooted trees—where each tree corresponds to one disjoint set, and the root of each tree serves as the canonical representative for all elements in that set. This forest representation naturally supports the partitioning of elements into disjoint subsets without storing explicit set boundaries. Auxiliary fields can be incorporated alongside the parent array to further optimize operations, as explored in subsequent sections on efficiency enhancements.
Auxiliary fields for efficiency
To enhance the efficiency of the disjoint-set data structure without modifying its core parent array representation, auxiliary fields such as rank and size arrays are introduced to guide merging decisions in union operations. These fields are stored alongside the parent array and are maintained exclusively at root nodes to approximate tree properties and prevent excessive height growth.
The rank array supports the union by rank heuristic by storing, at each root, an integer value that upper-bounds the height of the corresponding tree. For an initial collection of n singleton sets, the rank of every element i (which is its own root) is initialized to 0, reflecting a single-node tree of height 0. When merging two trees, the rank of the new root is set to the maximum of the two original ranks; if the ranks are equal, it is incremented by 1 to account for the possible increase in height. This keeps each rank bounded by the logarithm of its set's size, providing a simple proxy for height without requiring full tree traversals for updates.
The size array, used in the union by size heuristic, records the cardinality of each set at its root to facilitate size-based merging. Upon initialization for n elements, each size[i] is set to 1, as each singleton forms a set of one element. During a union, the size of the chosen root is updated to the sum of the sizes from both sets, while the other root's size is not adjusted since it becomes a non-root. This field remains valid only at roots, as path compression may alter non-root pointers but does not affect size values.
Both auxiliary fields are updated solely during union operations at the designated root, ensuring O(1) maintenance cost per update while enabling heuristics that keep tree heights bounded. For instance, the following pseudocode illustrates their joint initialization alongside the parent array:
function Initialize(n):
parent ← array of size n, where parent[i] = i for i = 0 to n-1
rank ← array of size n, initialized to 0
size ← array of size n, initialized to 1
return parent, rank, size
These fields are referenced in the optimization heuristics to decide which root absorbs the other during unions.
Operations
Creating new sets
The creation of new sets in the disjoint-set data structure involves initializing individual elements as singleton sets, forming the foundational step before any merging occurs. The core operation for this purpose is MAKE-SET(x), which establishes a new set containing solely the element x by configuring x as its own parent, designating it as the set's representative. To support later efficiency heuristics, this operation also sets the rank of x to 0 and its size to 1, where rank approximates tree height for balanced unions and size tracks the number of elements in the set.
The procedure assumes x is a distinct element not already part of any existing set, ensuring no overlaps or prior associations during initialization. This maintains the disjoint property from the outset, with each new set isolated until explicitly merged.
Here is the standard pseudocode for MAKE-SET(x), as described in foundational algorithmic literature:
MAKE-SET(x)
parent[x] ← x
rank[x] ← 0
size[x] ← 1
When initializing multiple elements—such as an array of n distinct items—a batch process applies MAKE-SET iteratively to each one, creating n independent singleton sets in linear time overall. This approach is efficient for preparing the structure in scenarios like graph connectivity problems, where all vertices start unconnected.
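A batch initialization of this kind might look as follows in Python (a minimal sketch; the universe size n and the array-based storage are illustrative assumptions):
# Hypothetical sketch: MAKE-SET applied to every element of a universe of size n.
def make_set(x, parent, rank, size):
    parent[x] = x        # x is its own representative
    rank[x] = 0          # a singleton tree has rank 0
    size[x] = 1          # the set contains only x

n = 8                    # example universe size
parent, rank, size = [0] * n, [0] * n, [0] * n
for v in range(n):       # n independent singleton sets, O(n) total time
    make_set(v, parent, rank, size)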
Finding set representatives
The find operation in a disjoint-set data structure identifies the representative (root) of the set containing a given element x by traversing the parent array from x until reaching a node whose parent points to itself.[8] This representative serves as a canonical identifier for the entire set.[8]
The basic recursive implementation follows the parent pointers until the root is found:
function find(x):
if parent[x] == x:
return x
return find(parent[x])
This version is simple, but for deep trees it can exhaust the call stack in languages with limited recursion depth.[7]
An iterative alternative avoids recursion by looping through the parent pointers:
function find(x):
while parent[x] != x:
x = parent[x]
return x
Both versions return the index of the root as the set representative.[8]
For example, consider a parent array where parent = [0, 1, 1, 2, 3] for elements 0 through 4 (with 0 as a singleton root and a chain 4 → 3 → 2 → 1). Calling find(4) traverses 4 to 3, then 3 to 2, then 2 to 1 (where parent[1] == 1), returning 1 as the representative.
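The following Python sketch reproduces this example with the iterative version of find (the parent array is the one given above; no optimizations are applied):
# Naive iterative find, run on the example parent array above.
parent = [0, 1, 1, 2, 3]         # chain 4 -> 3 -> 2 -> 1, with 0 a singleton root

def find(x):
    while parent[x] != x:        # walk upward until a self-parented root is reached
        x = parent[x]
    return x

print(find(4))                   # 1, the representative of {1, 2, 3, 4}
print(find(0))                   # 0, a singleton is its own representative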
Without optimizations, the naive find operation exhibits O(n) worst-case time complexity, as in a linear chain of n elements where traversal visits every node.[4] The basic find can be enhanced with path compression for better performance in subsequent operations.
Basic union operation
The basic union operation merges two disjoint sets in the disjoint-set data structure by linking the root of one set's tree to the root of the other's in the underlying forest representation. To perform the union of elements x and y, the algorithm first determines the roots of their respective sets using the find operation. If the roots differ, it arbitrarily sets the parent of one root (say, the root of x's set) to the other root, thereby combining the trees without altering the internal structure of either set. This approach, introduced as part of an improved algorithm for handling equivalence declarations in programming languages, relies on a simple tree-linking mechanism to maintain the partition of elements into disjoint sets.
The pseudocode for the basic union operation is straightforward and assumes the existence of a parent array to represent the tree structure, where each non-root node points to its parent, and roots point to themselves or a special value:
function find(x):
if parent[x] != x:
// In basic version, no path compression; traverse to root
return find(parent[x])
return x
function union(x, y):
rx = find(x)
ry = find(y)
if rx != ry:
parent[rx] = ry // Arbitrary link: rx's root to ry's root
This implementation performs the merge in constant time after the finds, but the overall efficiency depends on the find operation's performance.[9]
A key limitation of this naive linking strategy is its potential to produce unbalanced trees, particularly long chains, which degrade the efficiency of future find operations. For instance, if each union links the root of an existing chain under a newly added singleton, the resulting structure becomes a linear chain where each find from the tail requires traversing all preceding nodes, leading to quadratic time in the worst case for a sequence of operations. To illustrate with elements 1 through 4 as initial singletons (each with parent = i), union(1, 2) sets parent[1] = 2; union(2, 3) sets parent[2] = 3, yielding 1 → 2 → 3 with 4 separate; then union(3, 4) sets parent[3] = 4, resulting in 1 → 2 → 3 → 4. A subsequent find(1) now traverses the full chain of four nodes. Such degenerate cases highlight why the basic union, while simple, requires enhancements for practical use, though those heuristics are addressed elsewhere.[9]
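The degeneration can be observed directly with a few lines of Python (a sketch of the naive scheme only; the element labels follow the example above):
# Naive linking: the root of the first argument is always attached under the second.
parent = {i: i for i in range(1, 5)}     # singletons 1..4

def find(x):
    while parent[x] != x:
        x = parent[x]
    return x

def union(x, y):
    rx, ry = find(x), find(y)
    if rx != ry:
        parent[rx] = ry                  # arbitrary link: rx's root under ry's root

union(1, 2); union(2, 3); union(3, 4)    # builds the chain 1 -> 2 -> 3 -> 4
print(parent)                            # {1: 2, 2: 3, 3: 4, 4: 4}
print(find(1))                           # 4, reached only after walking the whole chain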
Optimization Heuristics
Path compression
Path compression is a key optimization technique integrated into the find operation of the disjoint-set data structure, designed to reduce the height of trees by shortening the paths from nodes to their roots during searches. By updating parent pointers along the traversal path, it ensures that subsequent find operations on those nodes traverse fewer edges, thereby amortizing the cost over multiple operations. This method preserves the correctness of the structure, as it only modifies pointers to point toward the actual root without altering set memberships. The technique was originally proposed by McIlroy and Morris in their algorithm for finding spanning trees, and later analyzed in detail for union-find applications.[13]
There are two primary variants of path compression: full compression, which sets every node on the path directly to the root, and path halving, which updates every other node to point to its grandparent for a one-pass approximation. Full compression achieves greater flattening but requires a two-pass traversal, while path halving is simpler and performs updates in a single pass, though it results in less aggressive shortening. These variants were systematically compared and analyzed for their worst-case performance in the context of set union algorithms.
The recursive implementation of find with full path compression is elegant and widely used, as it naturally compresses the path during the recursive calls. In pseudocode, it appears as follows:
function find(x):
if parent[x] ≠ x:
parent[x] = find(parent[x])
return parent[x]
This recursion first finds the root of the parent and then updates the current node's parent to that root, effectively compressing the entire path on the unwind. For iterative full compression, a two-pass approach is employed: first, traverse from the node to the root while recording the path (or using a temporary root holder), then traverse back to set each node's parent to the root. This avoids recursion and is suitable for deep trees or environments with stack limitations.
To illustrate, consider a chain of four nodes A → B → C → D, where D is the root (i.e., parent[A] = B, parent[B] = C, parent[C] = D, parent[D] = D), giving a height of 3. After performing find(A) with full compression, the structure flattens to A → D, B → D, C → D, reducing the height to 1 and making future finds on A, B, or C constant-time in terms of path length. With path halving, the traversal sets parent[A] = C (the grandparent of A) and then continues from C, whose parent is already the root; B is skipped and keeps pointing to C, yielding A → C → D and B → C → D, for a partially reduced height of 2.
The primary benefit of path compression is its ability to drastically reduce the amortized time complexity of find operations, approaching constant time in practice when used alone or in combination with union heuristics like union by rank, without introducing any overhead to the union step itself.
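Both variants can be sketched in Python as follows (the mapping of the nodes A, B, C, D to the indices 0–3 is introduced here purely for illustration):
# Full path compression versus path halving on the chain A -> B -> C -> D (A=0, B=1, C=2, D=3).
def find(parent, x):
    if parent[x] != x:
        parent[x] = find(parent, parent[x])   # full compression: point x at the root on unwind
    return parent[x]

def find_halving(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]         # path halving: point to the grandparent
        x = parent[x]
    return x

chain = [1, 2, 3, 3]                 # parent[A]=B, parent[B]=C, parent[C]=D, parent[D]=D
p_full = list(chain)
find(p_full, 0)
print(p_full)                        # [3, 3, 3, 3]: fully flattened

p_half = list(chain)
find_halving(p_half, 0)
print(p_half)                        # [2, 2, 3, 3]: A points to C; B is skipped and still points to C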
Union by rank
The union by rank heuristic enhances the disjoint-set data structure by linking trees during unions in a way that maintains balance, approximating tree heights to prevent excessive growth. Each root maintains a rank value, a non-decreasing integer starting at 0 for new singleton sets that serves as an upper bound on the tree's height (and is therefore at most logarithmic in the set's size).[4]
To perform a union, the roots r_x and r_y of the two sets are first identified using the find operation. The root with the smaller rank becomes a child of the root with the larger rank. If the ranks are equal, one root (arbitrarily chosen, say r_y) is made a child of the other (r_x), and the rank of r_x is incremented by 1. This procedure is illustrated in the following pseudocode:
function union_by_rank(x, y):
r_x = find(x)
r_y = find(y)
if r_x != r_y:
if rank[r_x] > rank[r_y]:
parent[r_y] = r_x
elif rank[r_x] < rank[r_y]:
parent[r_x] = r_y
else:
parent[r_y] = r_x
rank[r_x] = rank[r_x] + 1
[4]
Consider an example involving two trees: one with root A of rank 1 (height at most 1) and another with root B of rank 2 (height at most 2). Upon union, root A is linked directly to root B, preserving the rank of B at 2 and ensuring the new tree's height remains bounded by the larger original rank without increase.[4]
This approach guarantees that tree height increases only when merging equal-rank trees, limiting the maximum height to O(\log n) where n is the number of elements, as each rank increment at least doubles the minimum number of nodes in the tree.[4]
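The logarithmic bound can be checked empirically with a small Python sketch (the element count, the random unions, and the absence of path compression are assumptions of the illustration):
# Union by rank alone: after arbitrary unions of n singletons, no rank exceeds log2(n).
import math
import random

n = 1024
parent, rank = list(range(n)), [0] * n

def find(x):
    while parent[x] != x:
        x = parent[x]
    return x

def union_by_rank(x, y):
    rx, ry = find(x), find(y)
    if rx == ry:
        return
    if rank[rx] < rank[ry]:
        rx, ry = ry, rx               # rx is the root with the larger (or equal) rank
    parent[ry] = rx                   # lower-ranked root becomes a child of the higher-ranked one
    if rank[rx] == rank[ry]:
        rank[rx] += 1

random.seed(0)
for _ in range(5 * n):
    union_by_rank(random.randrange(n), random.randrange(n))
print(max(rank) <= int(math.log2(n)))   # True: the maximum rank is at most 10 for n = 1024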
Union by size
The union by size heuristic enhances the efficiency of the disjoint-set data structure by maintaining a size field for each node, representing the number of nodes in its subtree, including itself. This field is initialized to 1 for each singleton set containing a single element. By always merging the smaller tree into the larger one, the heuristic promotes balanced tree growth, thereby limiting the increase in tree height during unions.
In the union procedure, the roots of the two sets to be merged are first identified using the find operation. Let r_x and r_y denote these roots. The root of the smaller tree (based on the size field) is then set as a child of the root of the larger tree, and the size of the new parent root is updated by adding the size of the attached subtree. If the sizes are equal, an arbitrary choice can be made, with the size of the chosen parent increased accordingly. This approach ensures that the depth of nodes in the smaller tree increases by at most 1, while larger trees absorb smaller ones without significant height penalty.
The following pseudocode illustrates the union by size operation, assuming parent[] and size[] arrays:
function union(x, y):
    rx = find(x)
    ry = find(y)
    if rx != ry:
        if size[rx] < size[ry]:
            parent[rx] = ry
            size[ry] += size[rx]
        else:
            parent[ry] = rx
            size[rx] += size[ry]
This implementation requires constant-time access to the size fields, which are stored alongside the parent pointers in the auxiliary structure.[14]
Consider an example where two disjoint sets are merged: one set with root r_x and size 3 (containing elements a, b, c), and another with root r_y and size 5 (containing elements d, e, f, g, h). Upon union, since size[r_x] < size[r_y], r_x becomes a child of r_y, and size[r_y] is updated to 8. The resulting tree has r_y as the new root, with the subtree rooted at r_x attached, maintaining the overall structure's balance.
The primary advantage of union by size over naive linking is its direct control over tree balance through cardinality, which keeps the maximum tree height logarithmic in the number of elements, providing a practical alternative to other heuristics for maintaining efficiency in dynamic set operations. It is often combined with path compression for near-optimal performance in practice.[14]
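A combined sketch in Python of union by size with path compression (the class name and the small usage at the end are illustrative, not part of any standard library) might look as follows:
# Union by size combined with full path compression.
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])   # path compression
        return self.parent[x]

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False                                 # already in the same set
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx                              # rx is the root of the larger set
        self.parent[ry] = rx                             # smaller tree joins the larger one
        self.size[rx] += self.size[ry]
        return True

ds = DisjointSet(8)
ds.union(0, 1); ds.union(2, 3); ds.union(0, 2)
print(ds.find(3) == ds.find(1))      # True: 0, 1, 2, 3 now form one set
print(ds.size[ds.find(0)])           # 4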
Time Complexity
The make-set operation, which initializes a new singleton set for an element, requires constant time, O(1), as it simply sets the parent pointer of the element to itself.[15]
In the unoptimized implementation, the find operation naively traverses the parent pointers from a given element until reaching the root representative of its set. This can degenerate into linear chains, leading to a worst-case time of O(n) per find, where n is the number of elements in the set.[15] The basic union operation, which links the roots of two sets, takes O(1) time once the finds have identified the roots; however, the overall cost of a union can reach O(n) when accounting for the preceding finds in chain-like structures.[15]
For a sequence of m intermixed make-set, find, and union operations on n elements, the unoptimized disjoint-set structure exhibits a worst-case time complexity of O(m n). This bound arises because repeated unions can form long chains without balancing, forcing subsequent finds to traverse up to n nodes each time. For instance, consider an adversarial sequence starting with make-set(1), followed by make-set(i) and union(1, i) for i from 2 to n (assuming unions link the root of the first set to the root of the second); this constructs a linear chain where each subsequent find on element 1 traverses all n nodes.[15] Such performance motivates the use of heuristics to achieve better bounds in practice.[4]
Amortized bounds with heuristics
When path compression is applied alone, typically with arbitrary linking for unions, the amortized time for a sequence of m find and union operations on n elements is O(m \log n).[16]
Similarly, employing union by rank or union by size without path compression yields an amortized time complexity of O(m \log n), as these methods bound the tree height to O(\log n).[16]
Combining path compression with union by rank or size achieves the near-optimal amortized bound of O(m \alpha(n)), where \alpha(n) denotes the inverse Ackermann function; this function increases so gradually that \alpha(n) \leq 5 for all n relevant to modern computing.[4]
In this fully optimized disjoint-set structure, the total time required to perform m operations starting from n singleton sets is thus O(m \alpha(n)).[4]
These heuristics maintain exceptionally low tree heights overall—union by rank or size limits growth logarithmically, while path compression dynamically flattens search paths to the root, ensuring each operation costs nearly constant time in aggregate.[4]
Analysis of the inverse Ackermann function
The Ackermann function, denoted A(m, n), is a recursively defined fast-growing function that serves as a benchmark for extremely rapid growth rates in computable functions. It is defined for nonnegative integers m and n as follows:
\begin{align*}
A(0, n) &= n + 1, \\
A(m + 1, 0) &= A(m, 1), \\
A(m + 1, n + 1) &= A(m, A(m + 1, n)).
\end{align*}
This yields values such as A(1, n) = n + 2, A(2, n) = 2n + 3, A(3, n) = 2^{n+3} - 3, and A(4, n) = 2 \uparrow\uparrow (n + 3) - 3 (using Knuth's up-arrow notation for tetration), where A(4, n) exceeds any primitive recursive function for sufficiently large n.
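For small arguments the recurrence can be evaluated directly; the following Python sketch (feasible only for m ≤ 3, since A(4, n) is already a tower of exponentials) reproduces the closed forms quoted above:
# The Ackermann recurrence for small arguments, matching the closed forms above.
from functools import lru_cache

@lru_cache(maxsize=None)
def A(m, n):
    if m == 0:
        return n + 1
    if n == 0:
        return A(m - 1, 1)
    return A(m - 1, A(m, n - 1))

print([A(1, n) for n in range(4)])   # [2, 3, 4, 5]      = n + 2
print([A(2, n) for n in range(4)])   # [3, 5, 7, 9]      = 2n + 3
print([A(3, n) for n in range(4)])   # [5, 13, 29, 61]   = 2^(n+3) - 3
# A(4, n) grows as a tower of exponentials and cannot be evaluated directly.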
The inverse Ackermann function, \alpha(n), is defined as the smallest nonnegative integer k such that A(k, k) \geq \log n (with adjustments in some variants to align with the analysis), capturing the slow-growing inverse needed for the union-find bound. For example, \alpha(2^{65536}) \leq 5, and \alpha(n) \leq 4 for all n up to values vastly larger than the number of atoms in the observable universe.
Tarjan's proof of the amortized time complexity employs a potential function \Phi that measures the "disorder" in the forest structure, defined as \Phi = \sum_v c(v), where c(v) counts the number of nodes whose path to the root passes through v at certain iterative levels of compression. Ranks bound tree heights logarithmically, ensuring unions increase ranks sparingly (at most \log n times per tree). During a find operation with path compression, the potential decrease bounds the path length traversed, as each compression step flattens the tree and reduces the number of nodes at higher levels. Levels are iterated up to \alpha(n) times, where each level corresponds to a recursive application akin to the Ackermann growth, limiting the total changes.
The amortized cost per find operation is thus at most \alpha(n) + O(1), derived from the fact that the actual cost is charged to the initial potential plus a constant, while compressions yield a potential drop of at least the path length minus \alpha(n). Unions contribute O(1) amortized cost via rank maintenance. This bound arises from analyzing the forest over m operations on n elements, where the total time is O(m \alpha(n)).
In practice, \alpha(n) < 5 for all feasible n, rendering the time complexity effectively constant and superior to polylogarithmic bounds for large-scale applications.
Applications
Graph connectivity problems
The disjoint-set data structure is widely used to solve graph connectivity problems by efficiently tracking whether vertices belong to the same connected component through union and find operations. In undirected graphs, it enables the detection of connectivity, identification of cycles, and construction of spanning structures like minimum spanning trees, all while maintaining near-linear performance across multiple operations.[4]
One prominent application is Kruskal's algorithm for finding the minimum spanning tree (MST) of a connected, undirected graph with weighted edges. The algorithm sorts all edges in non-decreasing order of weight and processes them sequentially: for each edge, it performs a find operation on its endpoints to check if they are in different sets; if so, it unions the sets and adds the edge to the MST. This greedy approach ensures the MST is constructed without cycles, as unions only occur between distinct components. The following pseudocode illustrates the core logic using a disjoint-set structure:
function KruskalMST(G, w):
T = empty set // MST edges
for each vertex v in G:
make_set(v) // Initialize each vertex as singleton
sort edges of G by weight w in increasing order
for each edge (u, v) in sorted edges:
if find(u) != find(v):
union(u, v)
add (u, v) to T
return T
The disjoint-set operations in Kruskal's algorithm contribute O(m α(n)) time for m edges and n vertices, where α is the inverse Ackermann function; this cost is dominated by the initial O(m log m) edge-sorting step.[4]
To determine the connected components of an undirected graph, initialize each vertex as its own singleton set and union the sets for every pair of adjacent vertices. After processing all edges, the number of distinct roots (via find operations) equals the number of components, and vertices sharing the same root belong to the same component. This approach runs in O(m α(n)) time overall.[4]
Cycle detection in an undirected graph leverages the same structure: for each edge (u, v), if find(u) and find(v) return the same root before union, a cycle exists, as the edge connects vertices already in the same component. Otherwise, perform the union. This detects cycles in O(m α(n)) time without explicitly traversing paths.[4]
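Both of these uses can be sketched together in Python (the six-vertex edge list is an arbitrary example; path halving and union by size stand in for the heuristics described earlier):
# Connected components and cycle detection via union-find.
def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]        # path halving keeps trees shallow
        x = parent[x]
    return x

def union(parent, size, x, y):
    rx, ry = find(parent, x), find(parent, y)
    if rx == ry:
        return False                         # x and y were already connected
    if size[rx] < size[ry]:
        rx, ry = ry, rx
    parent[ry] = rx
    size[rx] += size[ry]
    return True

n = 6
edges = [(0, 1), (1, 2), (3, 4), (2, 0)]     # example undirected graph
parent, size = list(range(n)), [1] * n

cycle_found = False
for u, v in edges:
    if not union(parent, size, u, v):        # same root before the union: this edge closes a cycle
        cycle_found = True

components = len({find(parent, v) for v in range(n)})
print(components)                            # 3: {0, 1, 2}, {3, 4}, {5}
print(cycle_found)                           # True: the edge (2, 0) closes a cycle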
Other algorithmic uses
The disjoint-set data structure finds application in processing offline equivalence queries, where a sequence of union operations merges elements into equivalence classes, followed by find operations to determine class membership. This is particularly useful in symbolic computation and compiler optimizations, such as alias analysis, where variables or pointers are grouped based on equivalence relations derived from program statements. For instance, in points-to analysis, the structure efficiently tracks sets of locations that a pointer may reference by unioning nodes in a constraint graph representing equality relations, achieving near-linear time performance through path compression and union by rank.[17]
In image processing, the disjoint-set structure supports segmentation by labeling connected components, where adjacent pixels with similar properties (e.g., color or intensity) are unioned into the same set, enabling efficient partitioning of large-scale images without repeated traversals. An optimized variant processes unions during a raster scan of the image, resolving equivalences in a second pass to assign unique labels to each component, reducing memory usage and achieving practical runtimes on high-resolution data.
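A simplified two-pass labeling of a binary image along these lines might look as follows in Python (the tiny image, 4-connectivity, and pixel-index labels are assumptions of the illustration rather than a production segmentation routine):
# Two-pass connected-component labeling of a binary image with 4-connectivity.
def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, x, y):
    rx, ry = find(parent, x), find(parent, y)
    if rx != ry:
        parent[ry] = rx

image = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
rows, cols = len(image), len(image[0])
parent = list(range(rows * cols))            # one provisional label per pixel

# First pass: union each foreground pixel with its left and upper foreground neighbours.
for r in range(rows):
    for c in range(cols):
        if image[r][c]:
            if c > 0 and image[r][c - 1]:
                union(parent, r * cols + c, r * cols + c - 1)
            if r > 0 and image[r - 1][c]:
                union(parent, r * cols + c, (r - 1) * cols + c)

# Second pass: replace each provisional label with its set representative.
labels = [[find(parent, r * cols + c) if image[r][c] else None for c in range(cols)]
          for r in range(rows)]
print(len({lab for row in labels for lab in row if lab is not None}))   # 2 connected components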
For network connectivity in social networks, the structure facilitates online friend groupings or clustering by dynamically merging user profiles into partitions based on shared connections or attributes, such as mutual interests, allowing incremental updates to equivalence classes without recomputing the entire network. This approach underpins clustering algorithms that anonymize or analyze user communities by iteratively unioning nodes with high similarity, preserving privacy while identifying dense subgroups.[18]
A representative example is processing equivalence statements in a compiler, where statements like a EQUIVALENCE b or pointer assignments imply unions. The following pseudocode illustrates offline handling:
initialize disjoint sets for all variables
for each equivalence statement in program:
    union(find(a), find(b)) // merge equivalence classes
for each query (e.g., alias check between x and y):
    if find(x) == find(y):
        report equivalent (potential alias)
This method efficiently resolves all equivalences in amortized nearly constant time per operation, avoiding full symbolic execution or rebuilds of dependency graphs. The advantage lies in its ability to handle dynamic merges scalably, supporting thousands of updates in practice while maintaining partition integrity for subsequent analyses like optimization passes.[17]
Extensions and Variants
Support for deletions
The standard disjoint-set data structure does not natively support deletions, as removing an element from a set can require splitting the set or restructuring the underlying forest representation, while path compression flattens trees in a way that obscures subtree relationships and makes removal challenging without additional bookkeeping.
One foundational approach to incorporating deletions is the scheme proposed by Kaplan, Shafrir, and Tarjan, which extends the classical link-by-size heuristic to handle element removals while preserving near-optimal performance. In their method, deletions are processed by marking the element and adjusting set sizes, with unions and finds modified to account for marked (deleted) elements during path traversal. To prevent degradation from accumulated marks, sets are periodically rebuilt by creating fresh trees containing only active elements, linking subtrees as needed. This lazy deletion strategy incurs amortized O(α(n)) time per union and find operation, where α(n) is the inverse Ackermann function, and O(log n / log log n) time per deletion. Ben-Amram and Yoffe (2011) presented a simple and efficient Union-Find-Delete algorithm achieving amortized O(1) time for unions and deletions, and O(α(n)) for finds, preserving the efficiency of the standard structure.[19]
Alstrup, Gørtz, Rauhe, Thorup, and Zwick developed a deterministic data structure supporting O(1) worst-case time for deletions, makesets, and unions, with amortized O(α(n)) time for finds. Their construction modifies the potential function in the amortized analysis to accommodate deletions, using a binomial-tree-like representation for sets that facilitates quick isolation of deleted nodes without frequent rebuilds. This balances the need for efficient merging with the overhead of removals, though it introduces slightly higher constants and space usage compared to non-deletion variants.[20]
For an example of a basic deletion operation in a tree-based representation supporting splits, consider the following pseudocode for deleting element x, assuming a parent array and size tracking (with lazy marking omitted for simplicity; in practice, full implementations handle path compression adjustments):
function delete(x):
    if x is root:  # Singleton set
        mark x as deleted
        return
    parent_x = parent[x]
    # Isolate x by detaching its children
    for each child y of x:  # Traverse children if stored, or recompute via reverse links
        parent[y] = parent_x
    # Attach x's former siblings or adjust as needed; if x had no children, simply mark
    mark x as deleted
    # Update sizes along the path to the root
    current = parent_x
    while true:
        size[current] -= 1
        if size[current] < threshold:  # Trigger rebuild if needed
            rebuild_set(current)
        if current == parent[current]:  # Reached the root
            break
        current = parent[current]
This isolates x and potentially splits its subtree by reattaching children to the grandparent, but requires reverse links or child lists (increasing space) to avoid O(n) scans; without them, splits may merge unintended subtrees.
These deletion-supporting extensions generally trade off the near-constant amortized time of the standard structure for logarithmic factors, with complexities rising to O(log n) or higher per operation in some cases, and increased implementation complexity due to marking, rebuilding, or level-based decompositions. Periodic full rebuilds in lazy approaches can also introduce bursts of higher cost, though amortized bounds remain efficient for dynamic applications like incremental graph connectivity.[19]
Backtracking capabilities
Backtracking capabilities in the disjoint-set data structure enable the reversal of union operations, which is crucial for algorithms that involve exploratory trial-and-error processes, such as constraint satisfaction problems where temporary merges of equivalence classes must be undone upon encountering inconsistencies.[21]
A straightforward method to support backtracking is the stack-based undo technique, which records the modifications made during each union—such as changes to the parent pointers, ranks, or sizes of roots—on a stack before applying them. When backtracking is required, the stack is popped in reverse order to restore the prior state, effectively undoing the most recent unions. This approach preserves the efficiency of the standard disjoint-set operations, adding only constant-time overhead per union and backtrack in the amortized sense.[21]
The following pseudocode illustrates a union operation with stack-based undo using union by rank (without path compression to simplify reversibility):
function union(x, y, stack):
rx = find(x)
ry = find(y)
if rx == ry:
return
# Assume union by rank
if rank[rx] < rank[ry]:
swap(rx, ry)
# Record changes for undo
old_parent = parent[ry]
old_rank = rank[rx]
stack.push((ry, old_parent, rx, old_rank)) # Tuple: (attached_root, old_parent, link_root, old_link_rank)
parent[ry] = rx
if rank[rx] == rank[ry]:
rank[rx] += 1
function backtrack(stack, num_undos):
for i in 1 to num_undos:
if stack.empty():
break
(ry, old_parent, rx, old_rank) = stack.pop()
parent[ry] = old_parent
rank[rx] = old_rank
In this implementation, each union pushes a constant number of values onto the stack, and each backtrack restores them in O(1) time per undone union, yielding O(1) amortized overhead when integrated with the inverse Ackermann function complexity of finds.[21]
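A Python rendering of this scheme, together with a small usage example in which a tentative union is rolled back, might look like the following sketch (the element count and the sequence of unions are arbitrary):
# Stack-based undo for union by rank (no path compression, so undo stays O(1) per union).
n = 4
parent, rank, stack = list(range(n)), [0] * n, []

def find(x):
    while parent[x] != x:
        x = parent[x]
    return x

def union(x, y):
    rx, ry = find(x), find(y)
    if rx == ry:
        return
    if rank[rx] < rank[ry]:
        rx, ry = ry, rx
    stack.append((ry, parent[ry], rx, rank[rx]))   # record the state needed to undo
    parent[ry] = rx
    if rank[rx] == rank[ry]:
        rank[rx] += 1

def backtrack(num_undos):
    for _ in range(num_undos):
        if not stack:
            break
        ry, old_parent, rx, old_rank = stack.pop()
        parent[ry] = old_parent
        rank[rx] = old_rank

union(0, 1); union(2, 3); union(0, 2)        # tentatively merge everything
print(find(3) == find(0))                    # True
backtrack(1)                                 # undo the most recent union
print(find(3) == find(0))                    # False: {0, 1} and {2, 3} are separate again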
For scenarios requiring multiple versions or immutable histories, a persistent variant of the disjoint-set structure uses copy-on-write mechanisms to create new copies of only the modified nodes (e.g., roots and their paths) upon each union, allowing access to previous snapshots without altering the current state. This facilitates efficient backtracking by switching to an earlier version pointer. Seminal work on persistent data structures demonstrates that such a union-find achieves update and find times of O(log n / log log n) amortized, with space proportional to the number of operations. A more recent implementation matches the non-persistent amortized inverse Ackermann complexity while ensuring full persistence.[22]
Concurrent and parallel versions
In concurrent settings, the disjoint-set data structure encounters significant challenges due to race conditions arising from multiple threads accessing and modifying the shared parent array simultaneously during find and union operations, potentially leading to incorrect path compression or linking that violates the disjoint-set invariant.[23]
To address these issues, lock-free implementations leverage atomic operations, such as compare-and-swap (CAS), to ensure thread-safe updates without traditional locks, enabling non-blocking progress for at least one thread.[24] These approaches often employ optimistic techniques where operations proceed assuming low contention, with validation steps to detect and retry on conflicts.[24]
A seminal randomized concurrent algorithm uses CAS to implement wait-free find and union operations, achieving expected O(1) time per operation with high probability.[24] For example, an optimistic concurrent find might first traverse the path to the root while recording nodes, then validate the path integrity atomically before applying compression via sequential CAS operations on each node to point directly to the root, retrying if any CAS fails due to concurrent modification:
function concurrent_find(x):
if atomic_load(parent[x]) == x:
return x
path = []
curr = x
while atomic_load(parent[curr]) != curr:
path.append(curr)
curr = atomic_load(parent[curr])
root = curr
# Validate and compress
for node in reversed(path):
expected = atomic_load(parent[node])
if expected != the recorded parent for node: # Simplified validation
retry entire find # Or handle locally
else:
CAS(&parent[node], expected, root) # Atomic set to root
return root
This pseudocode illustrates a basic optimistic strategy, where traversal uses atomic loads and compression relies on CAS for safe updates; actual implementations may incorporate randomization or versioning for stronger guarantees.[24]
Recent advancements include a 2024 machine-verified multicore disjoint-set union structure that maintains the near-constant O(α(n)) amortized time complexity per operation, even under high contention, using fine-grained atomic updates verified for linearizability. This design supports scalability across p processors, delivering near-linear speedup for batches of m operations in parallel algorithms like connected components, as evidenced by its deployment in production multicore systems.[25]