
Open addressing

Open addressing, also known as closed hashing, is a method for resolving collisions in hash tables by storing all elements directly within the hash table array itself, rather than using external structures like linked lists. When a collision occurs—meaning the computed hash index is already occupied—the algorithm probes sequentially or according to a predefined sequence to locate the next available slot in the array. This approach contrasts with separate chaining, where collided elements are linked together in buckets, and requires the load factor (the ratio of elements to table size) to remain below 1 to ensure operations terminate. The core of open addressing lies in the probing strategy used to handle collisions, with three primary techniques: linear probing, quadratic probing, and double hashing. Linear probing advances to the next slot in a straightforward sequential manner, using the formula (h(key) + i) mod m, where h(key) is the initial hash value, i is the probe number starting from 0, and m is the table size; while simple and cache-efficient due to contiguous memory access, it suffers from primary clustering, where consecutive occupied slots form long chains that degrade performance. Quadratic probing mitigates this by using a quadratic offset, such as (h(key) + i²) mod m, which spreads probes more evenly and reduces primary clustering, though it can lead to secondary clustering (identical probe sequences for keys with the same initial hash value) and, in its standard form using i², only probes roughly half the table's slots even when the table size is prime. Double hashing employs a second hash function to determine the step size, probing as (h₁(key) + i · h₂(key)) mod m, which minimizes both primary and secondary clustering by producing distinct sequences for each key, but requires careful selection of hash functions to avoid short cycles and offers poorer cache locality. Open addressing offers several advantages, including simplicity of implementation, excellent spatial locality for cache utilization in main-memory environments, and no need for dynamic memory allocation or pointers, making it suitable for systems with limited resources.
However, it is sensitive to high load factors, where search and insertion times can approach O(n) in the worst case due to clustering, and deletions require "lazy deletion" (marking slots as deleted rather than freeing them) to preserve probe sequences for future searches. Performance analysis under uniform hashing assumptions shows average O(1) operations for low load factors, with unsuccessful searches costing approximately 1/(1 − λ) probes, where λ is the load factor.

Fundamentals

Definition and basic principles

Open addressing is a method for resolving collisions in hash tables by storing all elements directly within the table's array, rather than using external data structures like linked lists. In this approach, when a collision occurs at the initial hash-computed index, the algorithm probes subsequent slots in a deterministic sequence until an empty slot is found or the key is located during search. This keeps the entire structure contained in a fixed-size array of m slots, promoting cache efficiency due to spatial locality but requiring careful management to avoid clustering. The core principle involves applying a hash function h(k) to a key k, yielding an initial position h(k) mod m. If that slot is occupied, a probe sequence—such as linear increments or more complex permutations—searches for the next viable slot, ensuring all insertions, searches, and deletions operate solely on the array. Unlike separate chaining, which appends colliding keys to lists at each slot, open addressing relies on empty slots for resolution, making it suitable for scenarios with predictable key distributions and limited memory overhead. A fundamental requirement is that the table cannot be completely full, as this would lead to infinite probing loops during insertion; thus, the load factor α = n/m, where n is the number of elements, must satisfy α < 1. Practical implementations often target α ≤ 0.7 to balance space utilization and performance, as established in seminal analyses of open addressing. To illustrate, consider an array of size 7 (indices 0–6) with a simple hash function h(k) = k mod 7. Inserting key 10 yields index 3 (empty, placed there). Inserting key 17 also yields 3 (occupied), so probing finds index 4 (empty, placed there). Inserting key 24 yields 3 (occupied), probes past 4 (occupied), and places at 5 (empty). The array now appears as:
Index:  0   1   2   3   4   5   6
Key:    -   -   -   10  17  24  -
This demonstrates collision resolution via probing without external storage.
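The worked example above can be reproduced in a few lines; the following is a minimal sketch of linear-probing insertion, assuming `None` marks an empty slot:

```python
def insert_linear(table, key):
    """Insert key into a fixed-size table using linear probing with h(k) = k mod m."""
    m = len(table)
    for i in range(m):
        j = (key % m + i) % m          # probe sequence: h(k), h(k)+1, ... (mod m)
        if table[j] is None:           # first empty slot ends the probe
            table[j] = key
            return j
    raise OverflowError("table is full")

table = [None] * 7
for k in (10, 17, 24):                 # all three keys hash to index 3 (k mod 7)
    insert_linear(table, k)

# 10 lands at 3; 17 collides and probes to 4; 24 probes past 3 and 4 to 5
assert table == [None, None, None, 10, 17, 24, None]
```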

Comparison to separate chaining

Open addressing and separate chaining represent two fundamental approaches to collision resolution in hash tables, differing primarily in their structural design. In open addressing, all elements are stored directly in a single contiguous array, with collisions resolved by applying a probing sequence to locate an unoccupied slot within the same array. By contrast, separate chaining employs an array of linked lists, where the hash function determines the initial index, and any colliding keys are simply appended to the list at that index, allowing multiple elements per slot without probing. Space efficiency varies between the methods due to their handling of collided elements. Open addressing avoids the overhead of pointers required for linked lists, enabling potentially denser packing of elements in the array and reducing memory usage per entry; however, it often leads to wasted slots from clustering during insertions, necessitating a load factor typically below 0.7 to maintain performance. Separate chaining, while consuming additional space for pointers (approximately 8 bytes per node on 64-bit systems), fills slots completely and supports load factors greater than 1, making it less prone to internal fragmentation but overall more memory-intensive for sparse tables. Cache performance also highlights key trade-offs, influenced by memory access patterns. Open addressing generally offers superior locality in implementations like linear probing, where sequential probes access contiguous elements, resulting in fewer cache misses than the pointer traversals in separate chaining's linked lists. However, probing strategies that scatter searches (such as double hashing) can degrade this benefit by jumping across non-adjacent memory locations. Separate chaining clusters collided elements logically within lists for potentially faster sequential scans but suffers from poor spatial locality, as list nodes are often allocated non-contiguously, exacerbating cache misses during traversals. The choice between open addressing and separate chaining depends on specific use cases and constraints.
Open addressing is advantageous in memory-constrained settings, such as embedded systems, where eliminating pointer overhead is critical, and when deletions are rare, as deletions complicate probing sequences. Separate chaining excels in environments with high or unpredictable load factors, such as workloads with variable insertion rates, as it gracefully accommodates overflows without frequent table resizing and simplifies operations like deletion.

Probing strategies

Linear probing

Linear probing is the simplest collision resolution strategy within open addressing hash tables, introduced as a technique for addressing records in random-access storage systems. Upon computing the initial index h(k) for a key k, if the slot is occupied, the algorithm probes sequentially forward in the table, wrapping around if necessary, until an empty slot is found for insertion or the key is located during search. The probe sequence is formally defined as (h(k) + i) \mod m for i = 0, 1, 2, \dots, where m is the table size and probing continues up to m steps in the worst case. To illustrate, consider a hash table of size m = 10 with h(k) = 5. The probe sequence begins at slot 5 and proceeds to slots 6, 7, 8, 9, 0, 1, 2, 3, and 4 if needed, ensuring exhaustive coverage of the table before failure. This linear increment of 1 per probe distinguishes it from more complex strategies, maintaining a fixed step size throughout. One key advantage of linear probing is its simplicity, as it requires only basic arithmetic for implementation without additional data structures or secondary hash functions. Furthermore, the sequential nature of probes enhances performance by exploiting spatial locality: consecutive memory accesses align well with cache line fetches, reducing cache misses compared to non-sequential patterns in other schemes. Despite these benefits, linear probing is prone to primary clustering, where occupied slots form long contiguous runs, particularly when multiple keys hash to nearby positions. This clustering extends probe lengths for new insertions that collide with the cluster's edge, degrading performance as the load factor increases and exacerbating the tendency for further clustering.
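The wraparound example above can be checked with a one-line helper (illustrative code, not a full table implementation):

```python
def linear_probes(h, m):
    """Return the linear-probing sequence (h + i) mod m for i = 0..m-1."""
    return [(h + i) % m for i in range(m)]

# For m = 10 and initial index 5, the probes wrap past the end of the table
# and visit every slot exactly once before the search would fail.
seq = linear_probes(5, 10)
assert seq == [5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
assert len(set(seq)) == 10             # exhaustive coverage of the table
```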

Quadratic probing

Quadratic probing is a method for resolving collisions in open addressing tables by generating probe sequences using quadratically increasing offsets from the initial position. Formulated as h(k, i) = (h(k) + c_1 i + c_2 i^2) \mod m for probe index i = 0, 1, 2, \dots, where h(k) is the primary hash value, m is the table size, and c_1, c_2 are constants, this approach was introduced by W. D. Maurer to improve upon linear probing's tendency toward primary clustering. A common simplification sets c_1 = 0 and c_2 = 1, yielding h(k, i) = (h(k) + i^2) \mod m, which produces successively larger jumps in the probe path. The quadratic progression of offsets in the probe sequence avoids the contiguous linear scans that exacerbate primary clustering in open addressing schemes like linear probing, as the steps increase nonlinearly to distribute probes more evenly across the table. For the sequence to potentially visit all table slots without premature repetition, m must be a prime number, ideally of the form 4k + 3, ensuring that the first (m+1)/2 probes are distinct and cover half the table before any overlap occurs. This property, analyzed in detail by Knuth, guarantees better coverage than non-prime sizes, where the sequence might cycle through fewer than m positions. Despite mitigating primary clustering, quadratic probing introduces secondary clustering, where keys sharing the same initial hash value h(k) generate identical probe sequences, leading to overlapping paths and potential inefficiencies for those keys. The selection of c_1 and c_2 influences the uniformity of the probe distribution; while c_1 = 0, c_2 = 1 is widely used for its simplicity, other values can adjust the spread, though inappropriate choices combined with a non-prime m may limit the probed slots to a small subset of the table.
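The half-table coverage property is easy to verify numerically; this sketch uses the common c_1 = 0, c_2 = 1 form with m = 11, a prime of the form 4k + 3:

```python
def quadratic_probes(h, m, count):
    """First `count` slots of the probe sequence (h + i^2) mod m (c1 = 0, c2 = 1)."""
    return [(h + i * i) % m for i in range(count)]

m = 11                                         # prime of the form 4k + 3 (11 = 4*2 + 3)
first_half = quadratic_probes(3, m, (m + 1) // 2)
assert len(set(first_half)) == (m + 1) // 2    # first (m+1)/2 probes are all distinct

# Beyond that the sequence repeats: i^2 and (m - i)^2 coincide mod m, so even a
# full m probes only ever reach (m+1)/2 distinct slots.
full = quadratic_probes(3, m, m)
assert len(set(full)) == (m + 1) // 2
```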

Double hashing

Double hashing is a collision resolution technique in open addressing hash tables that employs two independent hash functions to generate probe sequences, aiming to distribute keys more uniformly across the table. The primary hash function h_1(k) determines the initial probe position for a key k, while the secondary hash function h_2(k) provides the step size for subsequent probes. The probe sequence is defined as h(k, i) = (h_1(k) + i \cdot h_2(k)) \mod m for i = 0, 1, 2, \dots, where m is the table size. This method ensures that the probes form a permutation of the table indices if h_2(k) and m are coprime, guaranteeing that all slots will be examined before repeating. To ensure full traversal of the table and avoid cycles shorter than m, the secondary hash function h_2(k) must produce values relatively prime to m. Common choices include making m a prime number and designing h_2(k) to yield values in the range 1 to m-1, or setting m as a power of 2 and ensuring h_2(k) returns an odd integer. A frequently used form is h_2(k) = 1 + (h'(k) \mod (m-1)), where h' is another auxiliary hash function, which helps maintain coprimality while simplifying computation. If h_1 and h_2 are chosen independently and uniformly, the resulting probe sequences approximate random permutations, enhancing distribution quality. One key advantage of double hashing is its ability to mitigate both primary clustering—where keys hashing to nearby slots form long contiguous runs due to fixed step sizes—and secondary clustering—where keys with the same initial hash value follow identical probe paths. By varying the step size based on the key itself, double hashing produces unique probe sequences for each key, even those sharing the same h_1(k), leading to more uniform probing across the table. This independence reduces the tendency for clusters to form and improves overall access efficiency compared to simpler methods like linear probing.
Although double hashing requires computing two hash values per operation, introducing a modest increase in computational overhead, this cost is offset by reduced probe lengths and fewer collisions in practice, resulting in better average-case performance for insertions, searches, and deletions. The method's effectiveness relies heavily on the quality of the hash functions, as poor choices can degrade uniformity and reintroduce clustering effects.
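Both properties above—full-permutation probing and distinct sequences for keys sharing h_1(k)—can be demonstrated with simple modular hashes (chosen purely for illustration):

```python
def double_hash_probes(k, m):
    """Probe sequence (h1 + i*h2) mod m, with h2 = 1 + (k mod (m-1)) in 1..m-1."""
    h1 = k % m
    h2 = 1 + (k % (m - 1))               # the common form h2 = 1 + (h'(k) mod (m-1))
    return [(h1 + i * h2) % m for i in range(m)]

m = 13                                    # prime table size keeps h2 coprime with m
seq = double_hash_probes(31, m)
assert sorted(seq) == list(range(m))      # a full permutation of the table's slots

# Keys 5 and 18 share the same initial slot (both 5 mod 13) but have different
# step sizes, so their probe sequences diverge: no secondary clustering.
a, b = double_hash_probes(5, m), double_hash_probes(18, m)
assert a[0] == b[0] and a != b
```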

Operations

Insertion process

In open addressing hash tables, the insertion process begins by computing the initial hash value for the key k using a hash function h(k), which maps k to an index in the table array of size m. To correctly handle duplicates and tombstones from prior deletions (see Deletion handling), first perform a search (as described in the Search and retrieval subsection) along the probing sequence to verify whether the key already exists; if found, skip insertion to avoid duplicates. If the search confirms the key is absent, follow the probing sequence using the generic probe function h(k, i) (with i \geq 0) to locate the first available slot—either empty (NIL) or marked as deleted (tombstone)—and insert the key there. Specific probing strategies, such as linear probing where h(k, i) = (h(k) + i) \mod m, are used within this framework. Prior to insertion, verify the load factor \alpha = n/m (where n is the number of elements) and resize if it exceeds a threshold like 0.7 to maintain performance. If no available slot is found, the table is full, raising an error or triggering a resize. Here is the standard pseudocode for insertion in an open addressing hash table (assuming a prior search confirmed the key is not present):
HASH-INSERT(T, k)
i ← 0
repeat
    j ← h(k, i)
    if T[j] == NIL or T[j] == DELETED
        then T[j] ← k
             return j
    i ← i + 1
until i == m
error "table overflow"  // or trigger resize
This procedure places the key in the first available slot along the probe sequence.
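The load-factor check mentioned above can be layered on top of this procedure; the following is a small sketch (illustrative class and names, linear probing, no tombstones) that doubles the array and rehashes every element once α would exceed 0.7:

```python
class OpenTable:
    """Minimal linear-probing table that resizes to keep the load factor <= 0.7."""

    def __init__(self, m=8):
        self.slots = [None] * m
        self.n = 0

    def _insert_raw(self, slots, k):
        m = len(slots)
        for i in range(m):
            j = (hash(k) + i) % m          # linear probing: h(k, i) = (h(k) + i) mod m
            if slots[j] is None:
                slots[j] = k
                return
        raise OverflowError("table overflow")

    def insert(self, k):
        if (self.n + 1) / len(self.slots) > 0.7:
            bigger = [None] * (2 * len(self.slots))
            for key in self.slots:          # rehash every existing element
                if key is not None:
                    self._insert_raw(bigger, key)
            self.slots = bigger
        self._insert_raw(self.slots, k)
        self.n += 1

t = OpenTable(m=4)
for k in range(10):
    t.insert(k)
assert len(t.slots) == 16                  # doubled twice: 4 -> 8 -> 16
assert t.n / len(t.slots) <= 0.7           # load factor stays below the threshold
```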

Search and retrieval

In open addressing hash tables, the search operation for a key k follows the identical probe sequence used during insertion to ensure consistency in locating elements. The algorithm begins by computing the initial index h(k, 0) and examines the slot at that position. If the key in the slot matches k, the associated value is returned. Otherwise, it proceeds to the next position h(k, i) for i = 1, 2, \dots, continuing this process until a match is found, an empty slot (NIL) is encountered, or the entire table has been probed (i.e., i = m, where m is the table size), in which case the key is not present. For an unsuccessful search, the process terminates upon reaching the first empty slot in the probe sequence. This early termination is valid because, during insertion, any key following the same probe path would have occupied that empty slot if it existed in the table, preventing further probes beyond it. Probes continue past deleted (tombstone) slots, treating them as occupied for search purposes. The following pseudocode illustrates the search algorithm, which mirrors the insertion loop but returns immediately on a match or empty slot without performing any modification:
SEARCH(T, k)
    i ← 0
    repeat
        j ← h(k, i)
        if T[j] == k
            return T[j]  // Found: return the associated value
        if T[j] == NIL
            return NIL  // Not found: empty slot encountered
        i ← i + 1
    until i == m
    return NIL  // Table full and key not found
This shared probe sequence with insertion enables efficient lookups for existing keys, as the search retraces the exact path taken during their placement, resulting in constant-time average performance under low load factors and minimal clustering.
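The pseudocode translates directly to Python; this sketch (with `None` standing in for NIL and a sentinel object for tombstones, linear probing assumed) also shows the probe continuing past deleted slots:

```python
DELETED = object()                     # tombstone sentinel (see Deletion handling)

def hash_search(T, k, h):
    """Return the slot index of k in table T, or None if absent.

    Probing stops at the first truly empty (None) slot, but continues
    past tombstones, mirroring the path the key took at insertion."""
    m = len(T)
    for i in range(m):
        j = h(k, i)
        if T[j] is None:               # empty slot: k cannot lie further along
            return None
        if T[j] is not DELETED and T[j] == k:
            return j
    return None                        # probed every slot without a match

# Table from the Fundamentals example: 10, 17, 24 all hash to slot 3.
T = [None, None, None, 10, 17, 24, None]
h = lambda k, i: (k + i) % len(T)
assert hash_search(T, 24, h) == 5      # retraces the insertion path 3 -> 4 -> 5
assert hash_search(T, 31, h) is None   # stops early at the empty slot 6
```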

Deletion handling

Deletion in open-address hashing presents a significant challenge because simply removing an entry and marking its slot as empty can disrupt the probe sequences used for subsequent searches and insertions. For instance, if a key is deleted from a slot and that slot is part of the probe path for another key inserted later, a search for the remaining key would terminate prematurely upon encountering the now-empty slot, leading to incorrect non-retrieval. This issue arises due to the linear or patterned probing strategies that rely on consecutive slots to resolve collisions, making direct deletion incompatible with the table's integrity. To address this, the standard approach employs tombstone markers, where a deleted slot is marked with a special indicator—often denoted as "DELETED" or a tombstone value—rather than being cleared to empty. This marker signals that the slot is no longer occupied by the original key but must still be treated as part of existing probe chains to avoid breaking searches. During search operations, a tombstone is considered occupied, allowing the probe to continue until an empty slot or a matching key is found. In contrast, for insertions, tombstones are treated as available slots, enabling reuse without altering the probing behavior for other entries. This dual treatment preserves the correctness of all operations while facilitating space reclamation. The deletion algorithm proceeds by first locating the target key via its standard probe sequence (using the SEARCH procedure, which will find it if present). Upon finding the matching entry, the slot's content is replaced with the DELETED marker instead of NIL. For the modified search process, probing advances past tombstones as if they were occupied until reaching an empty slot (indicating the key is absent) or the exact match. 
Insertion follows the probe sequence (after confirming absence via search), placing the new key into the first tombstone or empty slot encountered, thereby integrating it seamlessly into the chain. These steps ensure that post-deletion searches for unaffected keys remain successful by simulating the original occupancy. Here is the standard pseudocode for deletion:
HASH-DELETE(T, k)
i ← 0
repeat
    j ← h(k, i)
    if T[j] == k
        then T[j] ← DELETED
             return j
    if T[j] == NIL
        then return  // key not found
    i ← i + 1
until i == m
return  // key not found
However, tombstones introduce drawbacks, as they accumulate over multiple deletions without removal, effectively increasing the load factor and exacerbating clustering by forcing longer probes around these markers. This degradation in performance can be mitigated through periodic rehashing, where the entire table is rebuilt into a new , eliminating all tombstones and restoring efficiency. Such maintenance is recommended when the density of tombstones reaches a , typically tied to the overall load factor.
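The interplay of the three operations can be exercised end to end. In this self-contained sketch (linear probing with h(k) = k mod m), deleting key 17 leaves a tombstone at slot 4; the later key 24 on the same probe chain is still found, and a fresh insertion reuses the tombstoned slot:

```python
DELETED = object()                                # tombstone sentinel

def probe(k, i, m):
    return (k + i) % m                            # linear probing, h(k) = k mod m

def insert(T, k):
    for i in range(len(T)):
        j = probe(k, i, len(T))
        if T[j] is None or T[j] is DELETED:       # tombstones are reusable on insert
            T[j] = k
            return j
    raise OverflowError("table overflow")

def search(T, k):
    for i in range(len(T)):
        j = probe(k, i, len(T))
        if T[j] is None:                          # stop only at a truly empty slot
            return None
        if T[j] == k:                             # a tombstone never compares equal
            return j
    return None

def delete(T, k):
    j = search(T, k)
    if j is not None:
        T[j] = DELETED                            # mark, don't clear
    return j

T = [None] * 7
for k in (10, 17, 24):                            # all hash to slot 3; land at 3, 4, 5
    insert(T, k)
delete(T, 17)                                     # slot 4 becomes a tombstone

# The probe chain for 24 passes the tombstone, so it is still found; naively
# clearing slot 4 to None would have cut the chain short and "lost" key 24.
assert search(T, 24) == 5
assert insert(T, 38) == 4                         # new key reuses the tombstoned slot
```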

Performance characteristics

Load factor and efficiency

In open addressing, the load factor is defined as \alpha = \frac{n}{m}, where n is the number of elements stored in the hash table and m is the total number of slots available. Since all elements must fit within the table's slots through probing, \alpha must remain strictly less than 1; exceeding this renders insertions impossible due to a full table. To ensure efficient performance, practitioners typically maintain \alpha \leq 0.7, as higher values lead to significant increases in probe lengths and operational slowdowns. Under the simple uniform hashing assumption, the average number of probes for insertions and unsuccessful searches is at most \frac{1}{1 - \alpha}, demonstrating how efficiency degrades nonlinearly as \alpha rises—doubling the probes when \alpha = 0.5 and quadrupling them at \alpha = 0.75. Successful searches follow a similar but slightly lower bound, at most \frac{1}{\alpha} \ln \frac{1}{1 - \alpha}. These metrics highlight the importance of load factor management for keeping average-case operations close to constant time. To prevent performance degradation, hash tables using open addressing implement resizing: when \alpha surpasses a threshold like 0.7, the table size m is typically doubled, and all n elements are rehashed into the enlarged array. This amortized process ensures the load factor drops below the threshold post-resizing. Implementations often initialize with a large m to minimize early resizes, and under uniform hashing, the expected cost remains efficient. Deletions in open addressing complicate load factor control, as removed slots are marked with tombstones rather than cleared, preserving probe paths but not reducing effective occupancy. When deletions accumulate and tombstones inflate the perceived load, periodic rehashing into a fresh table without markers is performed to restore efficiency.
Relative to separate chaining, open addressing degrades more sharply as \alpha nears 1, with probe counts rising hyperbolically per \frac{1}{1 - \alpha}, whereas chaining's chain traversals scale more linearly at roughly 1 + \alpha.
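The uniform-hashing bounds above are easy to tabulate; a small sketch evaluating both expressions at a few load factors:

```python
import math

def expected_probes(alpha):
    """Uniform-hashing bounds: (unsuccessful, successful) expected probe counts."""
    unsuccessful = 1 / (1 - alpha)                       # at most 1/(1 - alpha)
    successful = (1 / alpha) * math.log(1 / (1 - alpha)) # at most (1/a) ln(1/(1-a))
    return unsuccessful, successful

# Efficiency degrades nonlinearly: unsuccessful searches need ~2 probes at
# alpha = 0.5 but ~10 at alpha = 0.9, while successful searches grow slowly.
for alpha in (0.5, 0.7, 0.9):
    u, s = expected_probes(alpha)
    print(f"alpha={alpha:.1f}: unsuccessful<={u:.2f}, successful<={s:.2f}")
```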

Clustering effects

In open addressing hash tables, clustering refers to the tendency of keys to form groups of occupied slots that degrade search efficiency by increasing probe lengths. Primary clustering occurs specifically in linear probing, where keys hashing to nearby positions create contiguous runs of occupied slots. This phenomenon arises because the linear probe sequence increments sequentially from the initial position, causing subsequent insertions or searches for keys in the vicinity to traverse the entire cluster before finding an empty slot or the target key. As a result, probe sequences for nearby hashes become correlated, leading to longer average and worst-case probe counts than uniform hashing would predict. Secondary clustering affects quadratic probing and, to a lesser extent, double hashing. It manifests when multiple keys share the same initial hash value, forcing them to follow identical probe sequences regardless of their distinct secondary computations. In quadratic probing, the offsets are determined by squares of integers, which, while avoiding the contiguous buildup of primary clustering, still results in overlapping paths for keys with congruent initial hashes. Double hashing, which uses a second hash function to compute a key-dependent step size, reduces secondary clustering more effectively by generating unique sequences for each key, though some correlation persists if the secondary hash function is not well-chosen. The overall impact of clustering is a distortion in probe length distribution, elevating both average and worst-case costs across insertion, search, and deletion. Primary clustering in linear probing amplifies this distortion most severely, as clusters grow contiguously and absorb insertions at adjacent hash values, while secondary clustering confines the effect to subsets of keys but still inflates probes within those groups. These effects intensify with higher load factors α, where the table approaches capacity, reducing available slots and forcing probes to navigate denser clusters.
Double hashing mitigates clustering impacts better than the alternatives, approaching ideal uniform probing behavior under suitable hash functions. To counteract clustering, practitioners select probing strategies that promote even distribution, such as double hashing with independent, high-quality hash functions to minimize sequence overlaps. Performance can be monitored through simulations that model insertions under varying load factors and key distributions, revealing cluster formation patterns and guiding table sizing or rehashing thresholds.
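Cluster formation is straightforward to observe empirically. This simulation sketch (illustrative table size and seed) counts the average probes per insertion when filling a table to high load, once with a fixed step of 1 (linear probing) and once with a key-dependent step (double hashing):

```python
import random

def avg_probes(m, n, step_fn):
    """Average probes per insertion for n random keys into a size-m table.

    step_fn(k) gives the probe step: 1 reproduces linear probing, while a
    key-dependent step coprime with m reproduces double hashing."""
    table = [None] * m
    total = 0
    for _ in range(n):
        k = random.randrange(10**9)
        step = step_fn(k)
        for i in range(m):
            j = (k + i * step) % m
            total += 1
            if table[j] is None:
                table[j] = k
                break
    return total / n

random.seed(1)
m, n = 1009, 900                                   # prime table, final load ~0.89
linear = avg_probes(m, n, lambda k: 1)
double = avg_probes(m, n, lambda k: 1 + (k % (m - 1)))

# At high load, primary clustering makes linear probing pay markedly more
# probes per insertion than double hashing's scattered sequences.
assert double < linear
```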

Theoretical bounds

Under the assumption of simple uniform hashing, where hash values are uniformly distributed and independent, the theoretical analysis of open addressing focuses on the expected number of probes required for search and insertion operations. For linear probing specifically, the expected number of probes in an unsuccessful search is \frac{1}{2} \left(1 + \frac{1}{(1 - \alpha)^2}\right), while for a successful search it is \frac{1}{2} \left(1 + \frac{1}{1 - \alpha}\right). These formulas, derived under the condition that the load factor \alpha < 1, highlight how clustering in linear probing increases the probe count for unsuccessful searches compared to more uniform probing schemes. In contrast, quadratic probing and double hashing exhibit expected probe lengths closer to the general bounds for open addressing, benefiting from reduced clustering effects that lower the constants in the formulas. For these methods, the expected number of probes in an unsuccessful search is at most \frac{1}{1 - \alpha}, and for a successful search, at most \frac{1}{\alpha} \ln \frac{1}{1 - \alpha}. Double hashing, in particular, approximates uniform probing, achieving these bounds with high probability when the secondary hash function is chosen appropriately. In the worst case, the number of probes can reach m, the size of the table, particularly if the table becomes full. However, by resizing the table to maintain \alpha < 1, all operations achieve amortized O(1) expected time. Overall, under uniform hashing, insertion, search, and deletion in open addressing have expected O(1) time complexity, though the worst case without resizing is O(n), where n is the number of elements. Clustering can increase the variance in these bounds, leading to occasionally longer probe sequences even in average-case scenarios.
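The gap between the linear-probing formula and the general uniform-hashing bound can be checked numerically with a direct transcription of the two expressions:

```python
def linear_unsuccessful(alpha):
    """Knuth's estimate for linear probing: (1/2)(1 + 1/(1 - alpha)^2)."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)

def uniform_unsuccessful(alpha):
    """General open-addressing bound under uniform hashing: 1/(1 - alpha)."""
    return 1 / (1 - alpha)

# Clustering's penalty grows sharply with load: at alpha = 0.5 the estimates
# are 2.5 vs 2.0 probes, but at alpha = 0.9 they diverge to ~50.5 vs ~10.
for alpha in (0.5, 0.75, 0.9):
    print(alpha, linear_unsuccessful(alpha), uniform_unsuccessful(alpha))
```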

Advantages and limitations

Benefits over other methods

Open addressing offers significant space efficiency compared to separate chaining, as it eliminates the overhead of pointers required for linked lists in chaining implementations. In open addressing, the table consists of a contiguous array where each entry stores only the key and value directly, typically requiring around 8 bytes per entry for keys and values on 32-bit systems, whereas separate chaining incurs an additional 8–16 bytes or more per entry for pointers to next nodes, increasing overall memory usage by up to 50% or more depending on the load factor. Another key benefit is improved cache performance, stemming from the sequential nature of probe sequences in methods like linear probing, which access memory locations that are spatially close and thus more likely to be cached. This contrasts with separate chaining, where traversing linked lists can lead to scattered memory accesses and more cache misses, particularly as chain lengths grow.

Drawbacks and mitigation strategies

Open addressing exhibits significant sensitivity to the quality of the underlying hash function, as suboptimal distribution of keys can intensify clustering, resulting in longer probe sequences and reduced efficiency during insertions and searches. Poor hash functions lead to uneven key placement, exacerbating primary clustering in schemes like linear probing, where contiguous blocks of occupied slots form, increasing the average number of probes and causing thread divergence in parallel environments such as GPUs. This drawback is mitigated by selecting high-quality, non-cryptographic hash functions with uniform distribution and strong avalanche effects, which minimize initial collisions and help maintain predictable performance even under moderate load factors. Such functions are computationally efficient while providing near-ideal uniformity, thereby reducing the onset and severity of clustering without introducing excessive overhead. Deletions in open addressing pose another challenge, as they require marking slots with tombstones to preserve probe sequences for subsequent searches; however, accumulated tombstones bloat the table, artificially inflating the effective load factor and further aggravating clustering by blocking potential insertion paths. This leads to performance degradation over time, with probe lengths increasing as tombstones fill available space, potentially causing the table to approach saturation even below the nominal load threshold. Common mitigations include lazy deletion, where tombstones remain in place to avoid immediate rearrangement costs, paired with periodic compaction or rehashing to eliminate them—such as rebuilding the entire table after a fixed number of operations (e.g., every \Omega(n/x) insertions, where n is the table size and x controls the load).
More advanced tombstone-management strategies address these issues by incrementally redistributing tombstones within small probe windows during ongoing operations, deamortizing costs and eliminating downtime while achieving high throughput (e.g., up to 6.91 million operations per second at 95% load) and space efficiency (around 70–92%). Open addressing is notably intolerant of high load factors (\alpha), where performance deteriorates rapidly as \alpha nears 1, due to heightened collision rates that extend probe chains and elevate operation times—often reaching O(1/\epsilon) for queries and O(1/\epsilon^2) for updates at load factor 1 - \epsilon. Theoretical analyses confirm lower bounds of \Omega(\log \log \epsilon^{-1}) for amortized times in classical schemes under such conditions, making full utilization impractical without specialized variants. Proactive resizing serves as the primary mitigation, involving table expansion (typically doubling the size) and rehashing all elements when \alpha exceeds a conservative threshold, such as 0.7 for linear probing, to restore constant expected-time operations and prevent severe slowdowns. Dynamic resizing strategies keep the load below 1 - \epsilon, balancing space usage with time guarantees while amortizing rehashing costs across insertions. To overcome inherent limitations, hybrid variants integrate open addressing with chaining techniques; for instance, computed chaining links overflow items using probe numbers derived from predecessors rather than full pointers, combining the cache locality of linear probing with chaining's robustness to deletions and high loads, achieving mean probe counts near 1.45 at 99% utilization while saving space (e.g., 8 bits per link versus 10).
Cuckoo hashing represents an advanced open addressing alternative, employing multiple hash functions to enable key relocation during insertions, which disperses clusters and supports load factors up to 0.95 with a constant number (typically 2–3) of probes per lookup; however, it risks insertion loops or failures from cycles. These are mitigated by adding a small stash (3–4 slots) for unresolved keys, reducing failure probability to negligible levels (e.g., 10^{-9}) without impacting average performance, making it suitable for high-throughput applications.
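The relocation mechanism of cuckoo hashing can be sketched compactly. This illustrative version uses two tables and simple modular hashes (chosen purely for demonstration); a production implementation would stash or rehash any key returned after the kick limit:

```python
def cuckoo_insert(t1, t2, key, max_kicks=32):
    """Two-table cuckoo insertion; returns None on success, or the key
    evicted past the kick limit (caller would stash it or rehash)."""
    m = len(t1)
    for _ in range(max_kicks):
        h1 = key % m                       # first hash (illustrative)
        if t1[h1] is None:
            t1[h1] = key
            return None
        t1[h1], key = key, t1[h1]          # evict the occupant of table 1
        h2 = (key // m) % m                # second hash (illustrative)
        if t2[h2] is None:
            t2[h2] = key
            return None
        t2[h2], key = key, t2[h2]          # evict the occupant of table 2
    return key                             # cycle suspected: give up on this key

def cuckoo_lookup(t1, t2, key):
    m = len(t1)
    # Every key sits in one of exactly two slots, so lookups take <= 2 probes.
    return t1[key % m] == key or t2[(key // m) % m] == key

t1, t2 = [None] * 11, [None] * 11
for k in (3, 14, 25, 7, 18):               # several keys collide in table 1
    assert cuckoo_insert(t1, t2, k) is None
assert all(cuckoo_lookup(t1, t2, k) for k in (3, 14, 25, 7, 18))
```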