Search tree
A search tree is a hierarchical data structure in computer science designed to store and organize a collection of keys in sorted order, enabling efficient operations such as searching, insertion, and deletion, typically with an average time complexity of O(log n) for n elements.[1] These structures leverage the tree's branching to mimic binary or multi-way search processes, reducing the number of comparisons needed compared to linear search methods.[2]
The foundational example of a search tree is the binary search tree (BST), a binary tree where each node contains a key, and for any node, all keys in the left subtree are strictly less than the node's key, while all keys in the right subtree are strictly greater.[1] This ordering property allows searches to proceed by recursively navigating left or right from the root, potentially halving the search space at each step in a balanced tree.[3] However, unbalanced BSTs can degrade to O(n) performance in the worst case, resembling a linked list, which motivates balanced variants.[1]
To ensure consistent logarithmic performance, several balanced search tree implementations exist, including AVL trees, which maintain balance by enforcing a height difference of at most one between subtrees through rotations after insertions or deletions.[1] Red-black trees achieve balance using node colors (red or black) to guarantee that no path from root to leaf is more than twice as long as another, with self-balancing adjustments via rotations and recoloring.[1] These structures are widely used in programming language libraries, such as Java's TreeMap and C++'s std::map.[4]
For applications involving large datasets or disk-based storage, multi-way search trees like B-trees extend the concept by allowing nodes to have multiple children (typically between ⌈m/2⌉ and m for order m), minimizing disk I/O by keeping the tree height low and nodes bushy.[5] B-trees, invented by Rudolf Bayer and Edward M. McCreight in 1970,[6] are fundamental in database indexing systems like those in MySQL and PostgreSQL,[7][8] where they support range queries and handle variable-sized blocks efficiently. Variants such as B+-trees further optimize for sequential access by storing data only in leaves and using internal nodes solely for indexing.[9]
Search trees are essential in numerous domains, including file systems, compilers for symbol tables, and real-time systems requiring fast lookups, underscoring their role as a cornerstone of efficient algorithm design.[1]
Definition and Properties
Core Concept
A search tree is a tree data structure used to store a collection of keys in a manner that facilitates efficient searching, insertion, and deletion operations on dynamic sets. It typically organizes keys hierarchically while maintaining an ordering invariant: for any node, all keys in its left subtree are less than the node's key, and all keys in its right subtree are greater than the node's key.[10] This structure extends general tree concepts—such as nodes connected by edges, with a root and leaves—to support ordered data management, assuming familiarity with basic tree terminology like parent-child relationships and subtrees.[10]
The concept of search trees emerged in the early 1960s as part of efforts to optimize data organization in computing systems. One of the earliest descriptions appeared in P. F. Windley's 1960 paper "Trees, Forests and Rearranging" on tree-based rearranging methods for efficient storage and retrieval.[11] The term and its formal properties were further systematized by Donald Knuth in his seminal work The Art of Computer Programming, Volume 3: Sorting and Searching, published in 1973, where search trees are analyzed as a foundation for balanced searching algorithms.[12]
For illustration, consider a simple search tree storing integer keys: the root node holds 5, its left child is 3, and its right child is 7. This satisfies the ordering invariant, as 3 < 5 < 7, enabling quick location of any key by traversing left or right based on comparisons. The binary search tree represents the simplest form of this structure.[12]
Fundamental Properties
Search trees are characterized by a fundamental ordering invariant that facilitates efficient key-based searching. In binary variants, such as the binary search tree, this invariant stipulates that for any node containing key k, all keys in the left subtree are strictly less than k, while all keys in the right subtree are strictly greater than k. This property generalizes to multi-way search trees, where keys within each node are maintained in sorted order, and the subtrees between consecutive keys in a node contain exclusively smaller or larger keys relative to those boundaries.[13]
A key structural assumption in search trees is the absence of duplicate keys, ensuring uniqueness within the structure; duplicates, if permitted, are typically managed through multiplicity counts or auxiliary storage rather than violating the ordering. The height h of the tree plays a central role in performance, with balanced trees guaranteeing a height of O(log n) for n nodes, in contrast to the worst-case O(n) height in skewed configurations, which underscores the distinction between average-case and worst-case analyses. For binary search trees, the maximum number of nodes is bounded by the full binary tree formula:
n ≤ 2^(h+1) − 1
This bound highlights how balance minimizes height to logarithmic scale.[14][15]
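The bound can be checked numerically. The sketch below (illustrative function names of our own choosing) computes the maximum node count for a given height and, by inverting the formula, the minimum height needed to hold n keys:

```python
import math

def max_nodes(h: int) -> int:
    """Maximum node count of a binary tree of height h (root at height 0)."""
    return 2 ** (h + 1) - 1

def min_height(n: int) -> int:
    """Smallest height that can hold n nodes, from inverting the bound."""
    return math.ceil(math.log2(n + 1)) - 1

print(max_nodes(3))    # a tree of height 3 holds at most 15 nodes
print(min_height(15))  # 15 nodes need height at least 3
print(min_height(16))  # one more node forces height 4
```

This is the sense in which balance keeps the height logarithmic: a perfectly packed tree of height h absorbs exponentially many keys.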
Core operations—search, insertion, and deletion—each require time proportional to the tree's height, yielding O(h) complexity; in balanced search trees, this simplifies to O(log n), enabling their utility in dynamic data management. As exemplified in the binary search tree, adherence to the ordering invariant directly supports these bounds by allowing traversal decisions based on key comparisons alone.[16]
Types of Search Trees
Binary Search Tree
A binary search tree (BST) is a type of binary tree where each node contains a key and has at most two children, designated as the left child and the right child. The structure adheres to the binary-search-tree property: for any node, all keys in the left subtree are strictly less than the node's key, and all keys in the right subtree are strictly greater than the node's key. This invariant ensures that the tree maintains a sorted order among its elements.[17]
BSTs are constructed through sequential insertion of keys, starting from an empty tree. During insertion, the new key is compared to the keys of existing nodes, traversing left if smaller or right if larger until an appropriate leaf position is found, where the new node is attached while preserving the BST property.[17][1]
One key advantage of BSTs is their simplicity in implementation, requiring only basic pointer operations for node management. Additionally, they facilitate ordered traversals; an in-order traversal (visiting left subtree, node, then right subtree) produces the keys in non-decreasing sorted order without additional sorting effort.[17] However, a significant limitation arises from potential imbalance: if keys are inserted in monotonically increasing or decreasing order, the tree degenerates into a linear chain resembling a linked list, where search, insertion, and deletion operations degrade to O(n) time complexity in the worst case.[17] In a randomly constructed BST, the expected height is O(log n), supporting average-case O(log n) performance for these operations.[17]
To illustrate construction, consider inserting the sequence of keys {3, 1, 4, 2, 5} into an empty BST:
- Insert 3: The tree consists of a single root node with key 3.
- Insert 1: Compare with 3 (1 < 3), attach as left child of 3.
- Insert 4: Compare with 3 (4 > 3), attach as right child of 3.
- Insert 2: Compare with 3 (2 < 3), then with 1 (2 > 1), attach as right child of 1.
- Insert 5: Compare with 3 (5 > 3), then with 4 (5 > 4), attach as right child of 4.
The resulting tree structure is:
    3
   / \
  1   4
   \   \
    2   5
This example demonstrates how comparisons guide placement while enforcing the BST property.[17]
The basic node structure for a BST in pseudocode is typically represented as follows:
class Node:
    key
    left = null
    right = null
This minimal design supports the core tree operations through recursive or iterative traversal using the left and right pointers.[17][18]
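The node structure and insertion procedure above can be made concrete in a short runnable sketch (Python chosen for illustration; the `Node` class and function names are our own):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Attach key as a new leaf, preserving the BST ordering invariant."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)  # duplicates silently ignored
    return root

def inorder(root):
    """Left subtree, node, right subtree: yields keys in sorted order."""
    if root is not None:
        yield from inorder(root.left)
        yield root.key
        yield from inorder(root.right)

root = None
for k in [3, 1, 4, 2, 5]:       # the insertion sequence from the example
    root = insert(root, k)
print(list(inorder(root)))       # -> [1, 2, 3, 4, 5]
```

Running the example sequence reproduces the tree pictured above, and the in-order traversal emits the keys in sorted order without any extra sorting step.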
B-Tree
A B-tree is a self-balancing multiway search tree that generalizes the binary search tree, allowing each internal node to have multiple children—specifically, between ⌈m/2⌉ and m children for a tree of order m—to efficiently handle large datasets stored on secondary storage like disks.[19] The keys within each node are kept in sorted order and partition the key space of its children: each subtree contains only keys that fall between the keys flanking its pointer, preserving the search tree invariant.[19] Invented by Rudolf Bayer and Edward M. McCreight in 1970 for organizing large ordered indexes, B-trees minimize disk I/O by keeping the tree height low and nodes bushy, as each node access typically requires a full block read.[19]
B-trees maintain balance such that all leaves reside at the same level, achieved through dynamic node splitting and redistribution: during insertion, if a node fills to capacity, it splits into two nodes with the median key promoted to the parent; during deletion, underfull nodes borrow keys from siblings or merge with them to avoid violating minimum occupancy.[20] This parameterization often uses a minimum degree t ≥ 2, where non-root nodes hold at least t − 1 keys (and t children) and at most 2t − 1 keys (and 2t children), while the root holds at least 1 key unless the tree is empty.[20] The resulting height is h = O(log_t n), where n is the number of keys, ensuring logarithmic-time operations even for millions of entries: with a fanout of several hundred to a thousand children per disk block, an index spanning terabytes needs only 3–4 levels.[19]
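A quick back-of-envelope check of the fanout claim (a sketch; the block size of 500 children per node is an assumed, typical value):

```python
import math

def btree_levels(n_keys: int, fanout: int) -> int:
    """Approximate number of levels needed to index n_keys at a given fanout."""
    return math.ceil(math.log(n_keys, fanout))

print(btree_levels(10**9, 500))    # a billion keys -> about 4 levels
print(btree_levels(10**12, 500))   # a trillion keys -> about 5 levels
```

Each extra level multiplies capacity by the fanout, which is why even enormous indexes stay only a handful of disk reads deep.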
B-trees are foundational in database indexing for file systems and relational databases, where they support efficient range queries and ordered access; for instance, MySQL's InnoDB engine uses a B+ tree variant (a B-tree where leaves link sequentially and internal nodes store only keys) for clustered primary indexes and secondary indexes.[21]
Example: Insertion in a B-Tree of Minimum Degree t=2
Consider building a B-tree with minimum degree t=2 (non-root nodes hold 1–3 keys, 2–4 children for internal nodes) by inserting keys 1, 2, 3, 4, 5, 6 in sequence, highlighting a split.[20]
- Insert 1: Root becomes [1] (a leaf).
- Insert 2: Root becomes [1, 2].
- Insert 3: Root becomes [1, 2, 3] (full).
- Insert 4: Root is full; split it into [1] and [3] and promote the median 2 to a new root [2], with left child [1] and right child [3]; then insert 4 into the right child, which becomes [3, 4].
- Insert 5: Traverse to the right leaf [3, 4] (not full), insert to yield [3, 4, 5] (now full, but no split occurs until the next insertion reaches it).
- Insert 6: Traverse toward the right leaf [3, 4, 5] (full); split it, promoting the median 4 to the parent (root becomes [2, 4]), with [3] to the left of 4 and [5] to its right; then insert 6 into [5], which becomes [5, 6].
The tree after inserting 6 has root [2, 4], with children [1], [3], [5, 6].
This demonstrates the split: when an insertion reaches a full leaf, split it, promote the median, and adjust the parent. Such operations ensure the tree remains balanced, with search/insert/delete costing O(h) node accesses, or O(log_t n) time.[19]
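The insertion sequence above can be reproduced with a compact sketch of CLRS-style B-tree insertion for minimum degree t = 2 (class and function names are our own; a real implementation would also handle deletion and disk paging):

```python
T = 2  # minimum degree: nodes hold at most 2*T - 1 = 3 keys

class BNode:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []
        self.leaf = leaf

def split_child(parent, i):
    """Split the full node parent.children[i], promoting its median key."""
    full = parent.children[i]
    mid = T - 1                       # index of the median key
    right = BNode(leaf=full.leaf)
    right.keys = full.keys[mid + 1:]
    if not full.leaf:
        right.children = full.children[mid + 1:]
        full.children = full.children[:mid + 1]
    parent.keys.insert(i, full.keys[mid])
    full.keys = full.keys[:mid]
    parent.children.insert(i + 1, right)

def insert(root, key):
    """Insert key; if the root is full, split it first (height grows by 1)."""
    if len(root.keys) == 2 * T - 1:
        new_root = BNode(leaf=False)
        new_root.children.append(root)
        split_child(new_root, 0)
        root = new_root
    _insert_nonfull(root, key)
    return root

def _insert_nonfull(node, key):
    if node.leaf:
        node.keys.append(key)
        node.keys.sort()              # fine for a sketch; CLRS shifts in place
    else:
        i = sum(k < key for k in node.keys)        # child index to descend into
        if len(node.children[i].keys) == 2 * T - 1:
            split_child(node, i)                   # pre-split full children
            if key > node.keys[i]:
                i += 1
        _insert_nonfull(node.children[i], key)

root = BNode()
for k in [1, 2, 3, 4, 5, 6]:
    root = insert(root, k)
print(root.keys)                          # -> [2, 4]
print([c.keys for c in root.children])    # -> [[1], [3], [5, 6]]
```

The pre-emptive splitting on the way down guarantees that the target leaf is never full on arrival, so splits never need to propagate back upward.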
Other Variants
(a,b)-trees generalize B-trees by allowing internal nodes to have between a and b children, where a ≥ 2 and b ≥ 2a − 1, providing flexibility in fanout without the strict splitting rules of standard B-trees. This structure enables tunable balance and is particularly useful in scenarios requiring variable node capacities to optimize storage and access patterns. Introduced by Huddleston and Mehlhorn in 1982 as a new data structure for representing sorted lists, (a,b)-trees support the same logarithmic-time operations as B-trees while offering greater adaptability.[25] The B-tree of order m is the special case with (a, b) = (⌈m/2⌉, m).
A 2-3 tree is a specific instance of a B-tree of order 3, where each internal node has either 2 or 3 children, ensuring height balance and efficient search, insertion, and deletion in O(log n) time. Invented by John Hopcroft in 1970, 2-3 trees serve as a foundational balanced multiway search tree, predating broader B-tree variants.
Ternary search trees extend binary search trees to three children per node—typically low, equal, and high—optimized for prefix-based string searches by combining trie-like compression with binary search ordering. This hybrid structure excels in lexicographic operations, such as autocomplete or dictionary lookups, achieving space efficiency comparable to tries while maintaining fast query times. Popularized by Bentley and Sedgewick in 1998, ternary search trees are particularly effective for datasets involving variable-length strings.[26]
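The low/equal/high branching described above can be sketched in a few lines (a minimal illustration; class and function names are our own, and empty strings are assumed not to be inserted):

```python
class TSTNode:
    def __init__(self, ch):
        self.ch = ch                      # the character stored at this node
        self.lo = self.eq = self.hi = None
        self.is_word = False              # marks the end of a stored string

def tst_insert(node, word, i=0):
    """Insert word[i:] below node, branching low/equal/high on word[i]."""
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.is_word = True
    return node

def tst_contains(node, word, i=0):
    if node is None:
        return False
    ch = word[i]
    if ch < node.ch:
        return tst_contains(node.lo, word, i)
    if ch > node.ch:
        return tst_contains(node.hi, word, i)
    if i + 1 < len(word):
        return tst_contains(node.eq, word, i + 1)
    return node.is_word

root = None
for w in ["cat", "car", "cart", "dog"]:
    root = tst_insert(root, w)
print(tst_contains(root, "car"))   # -> True
print(tst_contains(root, "ca"))    # -> False (a prefix, not a stored word)
```

Because strings sharing a prefix share the equal-child path, the structure stores "cat", "car", and "cart" far more compactly than three independent BST keys would.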
Scapegoat trees are binary search trees that maintain approximate balance through lazy rebuilding of unbalanced subtrees, identifying a "scapegoat" node with excessive height to trigger reconstruction into a perfectly balanced tree. This approach amortizes costs over insertions, ensuring O(log n) time per operation with high probability, without the overhead of constant rotations. Developed by Galperin and Rivest in 1993, scapegoat trees provide a simple alternative to strictly balanced structures like AVL trees.[27]
Among these variants, ternary search trees are suited for lexicographic string ordering due to their prefix-handling capabilities, while (a,b)-trees offer tunable balance for applications needing adjustable node sizes, such as database indexing.[25][26]
Operations on Search Trees
Search Procedures
Search procedures in search trees rely on the ordered structure to efficiently locate keys through directional traversal from the root node. To find a specific key, the algorithm begins at the root and compares the target key with the current node's key: if the target is less than the current key, it traverses to the left subtree; if greater, to the right subtree. This process continues until the key is found or a null pointer (indicating absence) is reached.[28]
The search can be implemented recursively or iteratively. In the recursive version, a function is called on the appropriate subtree based on the comparison, with base cases for null nodes (returning failure) or exact matches (returning the node). The following pseudocode illustrates this for a binary search tree:
TREE-SEARCH(x, k)
    if x = NIL or k = x.key
        return x
    if k < x.key
        return TREE-SEARCH(x.left, k)
    else
        return TREE-SEARCH(x.right, k)
This approach mirrors the iterative method, which uses a loop to update the current node pointer without recursion stack overhead. Iterative pseudocode is:
procedure find(key)
    p ← root
    while p ≠ NIL and key ≠ p.key
        if key < p.key
            p ← p.left
        else
            p ← p.right
    return p
Both variants take time proportional to the tree's height h, yielding O(h) complexity; in balanced trees where h = O(log n), search is O(log n).[29][28][30]
Finding the minimum or maximum key is a specialized traversal that follows the ordering invariant. The minimum is the leftmost node, obtained by repeatedly moving left from the root until a null left child is encountered. Pseudocode for minimum is:
TREE-MINIMUM(x)
    while x.left ≠ NIL
        x ← x.left
    return x
The maximum is symmetrically found by traversing rightmost. These operations also run in O(h) time, or O(log n) in balanced trees.[30]
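The iterative procedures above translate directly into a runnable sketch (Python for illustration; the `Node` class and the small hand-built tree are our own scaffolding):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def find(root, key):
    """Iterative search: returns the matching node, or None if absent."""
    p = root
    while p is not None and key != p.key:
        p = p.left if key < p.key else p.right
    return p

def tree_minimum(x):
    """The minimum key lives at the leftmost node."""
    while x.left is not None:
        x = x.left
    return x

#        5
#       / \
#      3   7
root = Node(5, Node(3), Node(7))
print(find(root, 7).key)        # -> 7
print(find(root, 4))            # -> None
print(tree_minimum(root).key)   # -> 3
```

Each loop iteration descends one level, so both functions touch at most h + 1 nodes.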
Search trees may handle duplicates differently depending on the implementation: some policies disallow them entirely to maintain unique keys, while others permit multiples by placing equals in the left or right subtree during insertion, with search returning any matching node (or all via extended traversal if required). In standard binary search trees without duplicates, search terminates on the first match.[31][32]
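One common duplicate policy mentioned above, a per-node multiplicity count, can be sketched as follows (names are our own):

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.count = 1          # multiplicity of this key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key; repeated keys bump the count instead of adding nodes."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    else:
        root.count += 1         # duplicate: record the multiplicity
    return root

root = None
for k in [5, 3, 5, 7, 5]:
    root = insert(root, k)
print(root.key, root.count)     # -> 5 3
```

This keeps the tree's node set free of duplicates, so the ordering invariant and all traversal logic are unchanged.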
Insertion and Deletion
Insertion in search trees follows a procedure analogous to searching for the key, traversing from the root to a null position where the new key would be located if it existed, and then adding a new leaf node at that spot to preserve the ordering invariant.[33] In binary search trees (BSTs), this process is straightforward: the new node becomes the child of the last non-null node encountered during the traversal, with its left and right pointers initialized to null.[33] The time complexity is O(h), where h is the tree height, which remains O(log n) in balanced trees.[33]
The following pseudocode illustrates iterative insertion in a BST, assuming a tree T with root T.root and a new node z containing the key to insert:
TREE-INSERT(T, z)
    y ← NIL
    x ← T.root
    while x ≠ NIL
        y ← x
        if z.key < x.key
            x ← x.left
        else
            x ← x.right
    z.p ← y
    if y = NIL
        T.root ← z
    elseif z.key < y.key
        y.left ← z
    else
        y.right ← z
This algorithm ensures the BST property by placing z as a leaf after following the search path.[33] In B-trees, insertion begins similarly by locating the appropriate leaf via search, but if the leaf overflows its maximum capacity (typically 2t − 1 keys for minimum degree t), it is split into two nodes, with the median key promoted to the parent to maintain balance.[33] This splitting may propagate upward if the parent also overflows, potentially creating a new root and increasing the height by one, ensuring all leaves remain at the same level.[34]
Deletion in search trees removes a key while adjusting the structure to uphold the ordering invariant, handling three primary cases based on the number of children of the node containing the key.[33] For a leaf node (zero children), it is simply removed by setting its parent's pointer to null.[33] In the case of one child, the node is bypassed by linking the parent directly to that child.[33] For two children, the node's key is replaced with its inorder successor (the minimum key in the right subtree), and then the successor—now a node with at most one child—is deleted as above.[33] This maintains O(h) time complexity, or O(\log n) in balanced trees.[33]
Consider a BST example for deleting a node with two children: suppose node z has key 13, left child 8, and right child 18 (which has left child 15). The successor is 15, the minimum of 18's subtree. Copy 15 into z in place of 13, then delete the original node 15 (a leaf) from under 18.[33] The following pseudocode for BST deletion uses a helper to find the successor:
TREE-DELETE(T, z)
    if z.left = NIL or z.right = NIL
        y ← z
    else
        y ← TREE-MINIMUM(z.right)
    if y.left ≠ NIL
        x ← y.left
    else
        x ← y.right
    if x ≠ NIL
        x.p ← y.p
    if y.p = NIL
        T.root ← x
    elseif y = y.p.left
        y.p.left ← x
    else
        y.p.right ← x
    if y ≠ z
        z.key ← y.key
    return y
In B-trees, deletion starts by removing the key from a leaf; if underflow occurs (fewer than t − 1 keys), the node borrows a key from a sibling via the parent if the sibling has at least t keys, or merges with the sibling otherwise, pulling a separator key from the parent into the merged node.[33] For internal nodes, the key is first replaced by its successor or predecessor from a leaf, then deleted from that leaf with underflow handling.[33] Merges may propagate upward, and if the root underflows to empty, the height decreases.[33] These operations preserve the balanced structure, yielding O(log_t n) node accesses, i.e., O(log n) time.[34]
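The three BST deletion cases described above can be sketched recursively, without parent pointers (names are our own; this variant returns the new subtree root rather than mutating links via y.p):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def delete(root, key):
    """Remove key from the subtree rooted at root; returns the new root."""
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    else:
        # Cases 1 and 2: zero or one child -- bypass the node.
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        # Case 3: two children -- copy in the inorder successor
        # (minimum of the right subtree), then delete that node.
        succ = root.right
        while succ.left is not None:
            succ = succ.left
        root.key = succ.key
        root.right = delete(root.right, succ.key)
    return root

def inorder(n):
    return inorder(n.left) + [n.key] + inorder(n.right) if n else []

# The two-children example from the text: delete 13; its successor is 15.
root = Node(13, Node(8), Node(18, Node(15)))
root = delete(root, 13)
print(root.key)        # -> 15
print(inorder(root))   # -> [8, 15, 18]
```

Because the successor of a two-child node has no left child by construction, the recursive second deletion always lands in case 1 or 2.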
Advanced Topics
Balancing Mechanisms
Balancing mechanisms in search trees are techniques designed to maintain a logarithmic height, preventing degeneration into linear structures during insertions and deletions. These methods ensure efficient search, insertion, and deletion operations with O(log n) time complexity by enforcing structural invariants through local adjustments like rotations or node recoloring. Common approaches include height-based balancing in AVL trees, color-based balancing in red-black trees, and self-adjusting heuristics in splay trees.[35][36][37]
AVL trees are height-balanced binary search trees where the height difference between the left and right subtrees of any node, known as the balance factor, is at most 1 in absolute value. This invariant is maintained after insertions and deletions through a series of single or double rotations to restore balance. Single rotations handle cases where the imbalance occurs in the child subtree aligned with the insertion site, while double rotations address zig-zag imbalances involving both child and grandchild subtrees. The original AVL tree algorithm was introduced by Adelson-Velsky and Landis in 1962.[35]
Rotations in AVL trees preserve the in-order traversal and binary search tree properties while adjusting heights. For a right rotation on a node y with left child x, the transformation sets x as the new root, attaches y as the right child of x, and updates y's left child to x's original right child:
Before:    y              After:    x
          / \                      / \
         x   C                    A   y
        / \                          / \
       A   B                        B   C
This operation reduces the height difference at y by rebalancing the subtrees. Similar left rotations and double rotations (left-right or right-left) are used for other cases, ensuring the balance factor remains within [-1, 1] along the affected path.[35]
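The right rotation pictured above is a constant-time pointer rearrangement, sketched here without the height bookkeeping a full AVL implementation would add (names are our own):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def rotate_right(y):
    """Right-rotate the subtree rooted at y; its left child x becomes root."""
    x = y.left
    y.left = x.right   # subtree B moves under y
    x.right = y        # y becomes x's right child
    return x

def inorder(n):
    return inorder(n.left) + [n.key] + inorder(n.right) if n else []

# Before:   y=3            After:   x=2
#          /  \                    /  \
#        x=2   C=4               A=1   y=3
#       /  \                          /  \
#     A=1   B=2.5                  B=2.5  C=4
y = Node(3, Node(2, Node(1), Node(2.5)), Node(4))
x = rotate_right(y)
print(x.key)        # -> 2
print(inorder(x))   # -> [1, 2, 2.5, 3, 4]  (in-order unchanged)
```

Only three pointers change, which is why rotations preserve both the BST property and the in-order key sequence while shifting height from one side to the other.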
A concrete example illustrates AVL balancing during insertion. Starting with an empty tree, insert 3 to create a root node. Inserting 1 makes it the left child, yielding a balanced tree with balance factor 1 at the root. Inserting 2 then attaches as the right child of 1, violating the balance at the root (left height 2, right height 0). This requires a left-right double rotation: first perform a left rotation on the left child of 3 (node 1, promoting 2 as the new left child of 3), then perform a right rotation on 3 (promoting 2 as root with 1 left and 3 right), restoring balance with all factors at 0.[35]
Red-black trees achieve balance using node colors (red or black) rather than explicit heights, with five key properties: every node is colored, the root is black, all leaves (NIL nodes) are black, no two red nodes are adjacent (red nodes have black children), and every path from a node to its descendant leaves contains the same number of black nodes (equal black-height). Violations after insertion or deletion are fixed through recoloring and at most three rotations, prioritizing local fixes from the bottom up. This framework was developed by Guibas and Sedgewick in 1978 as a binary representation of 2-3-4 trees.[36]
The color properties ensure that the height h of a red-black tree with n internal nodes satisfies h ≤ 2 log₂(n+1), a looser constant than the ≈1.44 log₂ n bound of AVL trees, but red-black trees are simpler to maintain because they require fewer rotations on average. The bound follows from the fact that a subtree of black-height bh contains at least 2^bh − 1 internal nodes, so bh ≤ log₂(n+1); and since red nodes cannot have red children, at least half the nodes on any root-to-leaf path are black, giving h ≤ 2 bh.[36]
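The argument can be written out as a short chain of inequalities (a sketch of the standard derivation, with bh denoting the black-height of the root):

```latex
\text{Induction: a subtree rooted at } x \text{ has } \ge 2^{\mathrm{bh}(x)} - 1 \text{ internal nodes.} \\
\text{At the root: } \quad n \ge 2^{\mathrm{bh}} - 1 \;\Longrightarrow\; \mathrm{bh} \le \log_2(n+1). \\
\text{No red node has a red child} \;\Longrightarrow\; \mathrm{bh} \ge h/2. \\
\text{Combining: } \quad h \le 2\,\mathrm{bh} \le 2\log_2(n+1).
```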
Splay trees are self-adjusting binary search trees that do not maintain a strict balance invariant but instead perform amortized O(log n) operations by "splaying" recently accessed nodes to the root through a sequence of zig, zig-zig, or zig-zag rotations. These rotations recenter the accessed node without relying on colors or heights, promoting frequently accessed elements for faster future operations under the static optimality property. Splay trees were introduced by Sleator and Tarjan in 1985, offering adaptive performance superior to fixed-balance trees for non-uniform access patterns.[37]
Applications and Comparisons
Search trees find widespread application in software systems requiring efficient ordered data management. Binary search trees (BSTs) and their self-balancing variants, such as red-black trees, are commonly used for in-memory data structures like sorted sets and maps. For instance, Java's TreeMap class implements a red-black tree to provide logarithmic-time operations for insertion, deletion, and lookup of key-value pairs, enabling efficient handling of ordered collections in applications ranging from caching to configuration management.[38] B-trees, on the other hand, are integral to disk-based storage systems due to their ability to minimize input/output operations. In file systems like NTFS, B-trees index directory entries and file attributes, allowing for scalable organization of large volumes of data on secondary storage.[39] Similarly, databases leverage B-trees for indexing to support fast queries over persistent data. Ternary search trees excel in string-heavy scenarios, such as autocomplete features in search engines and text editors, where they efficiently store and retrieve words with common prefixes by branching on character comparisons.[40]
When comparing search tree variants, their design choices reflect trade-offs in balance, complexity, and performance suited to specific environments. Plain BSTs offer simplicity in implementation but risk degeneration into linear chains under adversarial insertions, leading to O(n) worst-case time for operations. AVL trees enforce strict balance by limiting height differences to one, ensuring O(log n) guarantees but requiring more frequent rotations during insertions and deletions, which can increase overhead in dynamic workloads. Red-black trees, by contrast, allow slightly looser balance through color rules, resulting in fewer rotations and better practical performance for insertions, making them a staple in systems like the Linux kernel for scheduling and memory management tasks. B-trees prioritize high fanout to reduce tree height in external memory contexts, where each node access incurs costly disk I/O, though they demand more space per node compared to binary variants.[41][42]
Key trade-offs across these structures involve space efficiency, time complexity, and key type suitability. B-trees achieve denser packing with variable-order nodes, using less overall space for large datasets than binary trees, but balanced binary trees like AVL and red-black provide tighter O(log n) bounds in main memory without the fanout overhead. Unbalanced BSTs can degrade to O(n) time, while balanced variants maintain consistency at the cost of rotational maintenance. For numeric keys, binary search trees suffice, but string keys benefit from ternary structures, which compress paths and support prefix-based retrieval more effectively than binary alternatives.[43][44]
As of 2025, B+ tree variants—where leaf nodes link sequentially for range scans—remain dominant in database indexing, including NoSQL systems like MongoDB, which employs B-trees for efficient query optimization over distributed data. Emerging adaptations focus on resilience in constrained environments, though quantum-resistant modifications to core search tree structures are still in early research phases, primarily influencing related data integrity mechanisms rather than operational search algorithms.[45]
| Tree Type | Height Guarantee | Fanout | Primary Use Case |
|---|---|---|---|
| BST | O(log n) average; O(n) worst | 2 (binary) | Simple in-memory ordered storage |
| AVL | O(log n) strict | 2 (binary) | Search-intensive applications |
| Red-Black | O(log n) approximate | 2 (binary) | Dynamic in-memory operations (e.g., kernels) |
| B-Tree | O(log n) with high order | Variable (high) | External storage, databases/filesystems |