Quadratic probing
Quadratic probing is a collision resolution technique used in open-addressing hash tables to handle situations where multiple keys hash to the same index. When a collision occurs at the initial hash position h(k) for a key k in a table of size m, subsequent probes follow the sequence (h(k) + i^2) \mod m for i = 0, 1, 2, \dots, where the quadratic offset i^2 determines the displacement from the original slot. This method was introduced by W. D. Maurer in 1968 as an improvement over linear probing to reduce clustering effects in scatter storage.[1]
Unlike linear probing, which increments by a constant step and suffers from primary clustering, in which consecutive occupied slots form long chains, quadratic probing disperses probes more widely, mitigating this issue by using nonlinear offsets that spread insertions across the table. However, it introduces secondary clustering, where keys sharing the same initial hash position follow identical probe sequences, potentially leading to inefficiencies in search and insertion times. For reliable operation, the table size m is typically chosen as a prime number, which guarantees that the first (m+1)/2 probes visit distinct slots, or as a power of 2 paired with adjusted coefficients such as triangular-number offsets, which then visit every slot; with the basic i^2 sequence, full coverage of all slots is not guaranteed.[2]
The expected number of probes for unsuccessful searches and insertions in quadratic probing is O\left(\frac{1}{1 - \alpha}\right), and for successful searches is approximately \frac{1}{\alpha} \ln \frac{1}{1 - \alpha}, where \alpha is the load factor (ratio of elements to table size), assuming uniform hashing and \alpha < 0.5. Analysis by Donald Knuth in The Art of Computer Programming further established that quadratic probing achieves these bounds under certain conditions on m, making it suitable for applications requiring fast average-case access with moderate load factors.[3]
Variations of quadratic probing include using alternating signs in the quadratic term (e.g., (h(k) + i^2) \mod m and (h(k) - i^2) \mod m) to improve distribution and reduce secondary clustering, as proposed in subsequent works during the 1970s. Despite these enhancements, modern hash table implementations often favor alternatives like double hashing or chaining for higher load factors or to avoid the permutation limitations of probing methods.[4][5]
Fundamentals
Definition and Purpose
Quadratic probing is a collision resolution technique employed in open-addressing hash tables, where all elements are stored directly in an array of slots without using auxiliary data structures like linked lists.[6] In open addressing, a hash function computes an initial index for a key, but collisions arise when multiple keys map to the same slot, necessitating a systematic search for the next available empty slot within the array.[7] Quadratic probing specifically addresses this by advancing from the initial hash position using offsets derived from a quadratic function, thereby exploring alternative slots in a non-linear manner.[6]
The primary purpose of quadratic probing is to achieve a more uniform distribution of probe sequences compared to simpler methods like linear probing, thereby mitigating the formation of clusters where consecutive slots become occupied and degrade performance.[7] This approach helps reduce the likelihood of long probe chains during insertions and searches, while maintaining computational simplicity since the offsets remain integers and the process cycles within the fixed table size when properly configured.[6] By diverging probe paths for keys with nearby initial hashes, it promotes better space utilization in the hash table without requiring complex rehashing operations.[7]
For illustration, consider a small open-addressing hash table of size 7, initially empty except for a key occupying slot 2. If a new key hashes to slot 2, causing a collision, quadratic probing would first attempt the adjacent slot 3 as the next candidate. If slot 3 is also full, it would then probe slot 6, skipping over intermediate positions to avoid immediate clustering around the original slot.[6] This sequence continues until an empty slot is found or the table is determined to be full.
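The probe order in this example can be reproduced with a short sketch. The following Python snippet (with an illustrative probe_sequence helper, not part of any standard library) lists the slots examined for an initial hash of 2 in a table of size 7.
def probe_sequence(initial_slot, table_size, max_probes=None):
    # Yield the slots visited by quadratic probing: (h + i^2) mod m.
    if max_probes is None:
        max_probes = table_size
    for i in range(max_probes):
        yield (initial_slot + i * i) % table_size

print(list(probe_sequence(2, 7)))  # [2, 3, 6, 4, 4, 6, 3]: slot 3, then slot 6, are tried first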
Historical Development
Quadratic probing emerged during the early development of hash tables in the mid-20th century, as researchers sought efficient methods for collision resolution in open-addressing schemes amid the rise of random-access storage systems. The foundational concepts of open addressing, including probing techniques for handling collisions, were explored in the context of magnetic tape and drum storage, where direct access to records required minimizing search times. W. W. Peterson's 1957 paper provided one of the earliest systematic analyses of such methods, estimating search costs for record location in large files and introducing key ideas that influenced later hashing strategies, though focused primarily on linear variants.[8]
The specific technique of quadratic probing was first introduced by Ward Douglas Maurer in 1968, who proposed it as an improvement over linear probing to reduce clustering effects in scatter storage, using a quadratic offset in the probe sequence to better distribute collisions across the table. This innovation addressed limitations in earlier open-addressing approaches by leveraging polynomial increments, allowing for more flexible table sizes and potentially fuller utilization without excessive secondary clustering. Maurer's work built on the growing body of hashing literature from the 1950s and 1960s, adapting ideas from storage systems to general-purpose computing.
Quadratic probing gained formal analysis and wider recognition through Donald E. Knuth's seminal 1973 volume of The Art of Computer Programming, where it was integrated into the theoretical framework of hashing algorithms, including probabilistic performance bounds under uniform assumptions. In the 1970s, as dynamic memory allocation and hash-based structures became prevalent in programming languages and systems, several variations were proposed to optimize quadratic sequences for specific table sizes, such as powers of two, enhancing its practicality for implementation. By the 1980s, quadratic probing had become a standard topic in algorithms textbooks, reflecting its adoption in database indexing and file organization for efficient key-based retrieval.
Implementation
The probe sequence in quadratic probing is defined mathematically to resolve hash collisions by generating a series of offsets from the initial hash value in a nonlinear manner. For a key k and table size m, the position of the i-th probe (i = 0, 1, 2, \dots) is given by the general formula
p(i) = \left( h(k) + c_1 i + c_2 i^2 \right) \mod m,
where h(k) is the initial hash function mapping k to \{0, 1, \dots, m-1\}, and c_1, c_2 are constants chosen to optimize distribution and ensure distinct probes.[9] This quadratic form introduces a parabolic progression in probe offsets, contrasting with linear probing's arithmetic sequence.[10]
The simplest and most common variant sets c_1 = 0 and c_2 = 1, yielding
p(i) = \left( h(k) + i^2 \right) \mod m.
This choice promotes even scattering because the quadratic term lengthens successive probe steps, reducing the tendency for probes to form contiguous clusters near the initial hash position, the phenomenon known as primary clustering in linear probing.[9] For tables where m is a power of 2, the choice c_1 = c_2 = \frac{1}{2} is often used instead; the resulting offsets are the triangular numbers (i^2 + i)/2 \mod m, which are always integers and cause the first m probes to visit every slot exactly once.[7]
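As a check of this property, the following Python sketch (assuming an illustrative table size of 16) computes the triangular-number probe positions and verifies that all slots are visited exactly once.
table_size = 16                       # power of 2 (illustrative)
initial_slot = 0
slots = [(initial_slot + (i * i + i) // 2) % table_size for i in range(table_size)]
print(slots)                          # 16 probe positions
print(len(set(slots)) == table_size)  # True: every slot is visited exactly once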
Key properties of the quadratic probe sequence depend on the table size m. If m is prime, the sequence with c_1=0, c_2=1 guarantees that the first (m+1)/2 probes are distinct and that exactly (m+1)/2 positions are visited in total, ensuring roughly half the table is searched before repetition, which bounds the search cost under low load factors.[10] These properties stem from number-theoretic results on quadratic residues, ensuring probes remain within table bounds and distribute uniformly without systematic bias.
To illustrate, consider m = 11 (a prime) and h(k) = 0. The probe positions for i = 0 to 5 are:
\begin{align*}
i=0: & \quad 0^2 \mod 11 = 0, \\
i=1: & \quad 1^2 \mod 11 = 1, \\
i=2: & \quad 4 \mod 11 = 4, \\
i=3: & \quad 9 \mod 11 = 9, \\
i=4: & \quad 16 \mod 11 = 5, \\
i=5: & \quad 25 \mod 11 = 3.
\end{align*}
These yield the distinct positions {0, 1, 4, 9, 5, 3}, demonstrating the sequence's non-clustered spread within the table.[10]
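The same positions can be computed directly; this minimal Python check reproduces the worked example and confirms that the first (m+1)/2 = 6 probes are distinct.
m = 11
positions = [(0 + i * i) % m for i in range(6)]
print(positions)                 # [0, 1, 4, 9, 5, 3]
print(len(set(positions)) == 6)  # True: no repeats among the first (m+1)/2 probes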
Insertion Algorithm
The insertion process in quadratic probing starts with computing the initial hash index h(k) for the key k in a hash table of size m. If the slot at position h(k) is empty, the key is placed there. Otherwise, the algorithm generates subsequent probe positions using a quadratic offset added to the initial hash, continuing until an empty slot is located or all positions have been probed, indicating a full table. This approach ensures that insertions resolve collisions by exploring positions in a non-linear sequence, as originally proposed for random-access storage systems.
To implement insertion, the algorithm iteratively computes probe indices until success or failure. A common form uses the probe sequence (h(k) + i^2) \mod m for i = 0, 1, 2, \dots, where the quadratic term i^2 determines the offset (referring to the standard quadratic probing formula). The loop terminates after at most m probes to avoid infinite execution on a full table.
Here is pseudocode for the insertion algorithm:
function INSERT(k, T)    // T is hash table of size m
    h ← HASH(k)
    for i ← 0 to m-1 do
        probe ← (h + i²) mod m
        if T[probe] is empty then
            T[probe] ← k
            return
        if T[probe] = k then
            return       // Key already present; insertion skipped for idempotency
    error "Hash table overflow"
This pseudocode assumes the simple quadratic form with coefficients c_1 = 0 and c_2 = 1; when the table size is prime, the first (m+1)/2 probes are guaranteed to be distinct. The check for an existing key prevents duplicate insertions, maintaining set-like behavior, though some implementations omit it to allow multiset semantics.
Consider an example with a hash table of size m = 7 and hash function h(k) = k \mod 7. Inserting key 14: initial probe i=0, position (14 \mod 7 + 0^2) \mod 7 = 0, which is empty, so insert at index 0 (probe path: 0). Next, key 23: i=0, position (23 \mod 7 + 0^2) \mod 7 = 2, empty, insert at index 2 (probe path: 2). Finally, key 3: i=0, position (3 \mod 7 + 0^2) \mod 7 = 3, empty, insert at index 3 (probe path: 3). No collisions occurred, resulting in direct insertions at initial positions.
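For concreteness, the pseudocode above can be rendered as a small runnable sketch. The Python class below (an illustrative QuadraticHashTable, not a standard library type) reproduces the example insertions of keys 14, 23, and 3 into a table of size 7.
class QuadraticHashTable:
    def __init__(self, size=7):
        self.m = size
        self.table = [None] * size   # None marks an empty slot

    def _hash(self, key):
        return key % self.m          # simple modular hash, as in the example

    def insert(self, key):
        h = self._hash(key)
        for i in range(self.m):
            probe = (h + i * i) % self.m
            if self.table[probe] is None:
                self.table[probe] = key
                return probe
            if self.table[probe] == key:
                return probe         # key already present; keep set semantics
        raise OverflowError("hash table overflow")

t = QuadraticHashTable(7)
for k in (14, 23, 3):
    print(k, "->", t.insert(k))      # lands at indices 0, 2, 3 with no collisions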
Search and Deletion Procedures
In quadratic probing hash tables, the search algorithm starts at the initial hash position h(k) for a given key k, then applies the quadratic probe sequence to examine subsequent positions until the key is located, an empty slot is reached (signaling the key's absence), or a full cycle through the table is completed to handle potential full tables.[11] This ensures that searches follow the same path used during insertion, preserving correctness in open addressing schemes.[12]
Pseudocode for the search procedure, adapted for quadratic probing, illustrates this process with an early stop on empty slots:
SEARCH(table, key, m):
    i ← 0
    repeat
        pos ← (h(key) + i²) mod m
        if table[pos] is empty:
            return not found
        if table[pos] = key:
            return pos
        i ← i + 1
    until i = m
    return not found
During probing, slots marked as "deleted" (tombstones) are treated as occupied but not matching the key, allowing the search to continue without interruption.[11]
Deletion in quadratic probing cannot simply nullify a slot, as this would break probe chains for subsequently inserted keys whose searches would prematurely terminate at the now-empty slot.[11] Instead, deletions use a tombstone marker to indicate the slot's availability for new insertions while permitting searches and insertions to probe beyond it.[12] The deletion process first performs a search to find the key; if located, the slot is replaced with the tombstone value (e.g., a special sentinel like "DELETED").[11]
Pseudocode for deletion integrates the search:
DELETE(table, key, m):
    pos ← SEARCH(table, key, m)
    if pos ≠ not found:
        table[pos] ← DELETED
    return pos
For insertion after deletion, the algorithm treats both empty slots and tombstones as candidates, preferring tombstones to reuse space efficiently.[11]
As an illustration, suppose a table of size 11 holds key 92 at index 1, key 101 at index 2, and key 50 at index 5, placed there by earlier insertions. Deleting 92 marks index 1 as DELETED. A subsequent search for a key such as 23 (with h(23) = 1) starts at index 1 (DELETED, so probing continues), then checks i=1: (1 + 1) mod 11 = 2 (101 ≠ 23), then i=2: (1 + 4) mod 11 = 5 (50 ≠ 23), and so on until an empty slot or a full cycle is reached, without stopping at the tombstone. This preserves the probe chain for any key that might pass through position 1.[11]
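A minimal Python sketch of tombstone handling, assuming None marks an empty slot and a dedicated DELETED sentinel marks a removed key (both names are illustrative), shows how searches probe past tombstones while deletions leave them behind.
DELETED = object()                        # tombstone marker

def search(table, key, h):
    m = len(table)
    for i in range(m):
        pos = (h(key) + i * i) % m
        if table[pos] is None:            # truly empty: the key cannot lie further along
            return None
        if table[pos] is not DELETED and table[pos] == key:
            return pos                    # tombstones are skipped, never matched
    return None

def delete(table, key, h):
    pos = search(table, key, h)
    if pos is not None:
        table[pos] = DELETED              # leave a tombstone so later probes continue
    return pos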
Load Factor Considerations
The load factor in quadratic probing is defined as \alpha = \frac{n}{m}, where n is the number of keys stored and m is the number of slots in the hash table.[13] This metric directly influences the efficiency of insertions, searches, and deletions, as higher values increase the likelihood of collisions and longer probe sequences. For quadratic probing to function effectively, the load factor must remain below 1, since open-addressing schemes like this one cannot accommodate more keys than slots without overflow. However, the optimal range is \alpha < 0.5, as this ensures that probe sequences can access sufficient distinct slots without premature cycling, thereby guaranteeing that an empty slot is found during insertion if space is available.[14]
Theoretical analysis of quadratic probing's performance under varying load factors shows that the average number of probes for an unsuccessful search is at most \frac{1}{1 - \alpha}, assuming uniform hashing.[13] For successful searches, the expected probe length is bounded by \frac{1}{\alpha} \ln \frac{1}{1 - \alpha} + 1, which stays small for moderate \alpha but grows rapidly as \alpha approaches 1; for quadratic probing specifically, behavior deteriorates further once \alpha exceeds 0.5.[13] Donald Knuth provides detailed bounds in his analysis, showing that probe lengths increase more sharply than in uniform open addressing when \alpha > 0.5, due to the limited number of distinct probes in quadratic sequences.[15] Specifically, when the table size m is prime, the probe sequence visits exactly \frac{m+1}{2} distinct slots before repeating, which limits reliable operation to load factors of approximately 50%, beyond which a probe sequence may cycle without reaching an empty slot.[16]
In practice, implementers are advised to select a prime table size m and maintain \alpha \leq 0.5 through dynamic resizing, typically roughly doubling the table size (to the next prime) and rehashing all elements when the load factor exceeds this threshold.[14] Resizing prevents performance degradation: once \alpha exceeds 0.5, probe sequences lengthen sharply and an insertion may fail to find an empty slot even though the table is not full.[15] This guideline ensures that quadratic probing achieves near-constant average time for core operations while minimizing the risk of failed insertions due to incomplete probe coverage.[16]
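The following Python sketch illustrates such a resize policy under the stated assumptions (grow to the next prime past double the old size once the load factor would exceed 0.5); the helper names next_prime, quadratic_insert, and insert_with_resize are illustrative.
def next_prime(n):
    def is_prime(x):
        return x > 1 and all(x % d for d in range(2, int(x ** 0.5) + 1))
    while not is_prime(n):
        n += 1
    return n

def quadratic_insert(table, key):
    m = len(table)
    h = key % m
    for i in range(m):
        pos = (h + i * i) % m
        if table[pos] is None:
            table[pos] = key
            return pos
    raise OverflowError("table full")

def insert_with_resize(table, key, count):
    if (count + 1) / len(table) > 0.5:            # keep the load factor at or below 0.5
        new_table = [None] * next_prime(2 * len(table) + 1)
        for old_key in table:                     # rehash every stored key
            if old_key is not None:
                quadratic_insert(new_table, old_key)
        table = new_table
    quadratic_insert(table, key)
    return table, count + 1

table, count = [None] * 7, 0
for key in (10, 24, 31, 17):
    table, count = insert_with_resize(table, key, count)
print(len(table), count)                          # 17 4: the table grew once alpha passed 0.5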
Clustering Effects
Quadratic probing avoids primary clustering, a phenomenon prevalent in linear probing where collisions tend to form contiguous blocks of occupied slots due to sequential probing increments. In quadratic probing, the nonlinear quadratic offsets cause probe sequences to jump to non-adjacent positions, preventing the formation of linear clusters and distributing collisions more evenly across the table.[17]
Instead, quadratic probing exhibits secondary clustering, which arises when multiple distinct keys produce the same initial hash value h(k). All such keys then traverse identical probe sequences—determined solely by the quadratic function—leading to overlapping collision resolution paths where they compete for the same alternative slots. This shared trajectory exacerbates contention among the affected keys, unlike in methods where probe sequences vary based on the key itself.[18][3]
The impact of secondary clustering is a degradation in performance for keys involved in such groups, as each additional key in the cluster must advance further along the common probe path to locate an empty slot, thereby increasing the average number of probes required for insertion, search, or deletion of those keys. For instance, in a simulation with a table size of 13 (a prime to ensure good coverage) and quadratic probing via (h(k) + i^2) \mod 13, inserting four keys all hashing to position 0 results in them occupying positions 0, 1, 4, 9 along the sequence, forcing the fourth key to perform four probes while earlier keys averaged fewer; in a uniform scenario without shared h(k), probes would distribute more broadly, averaging closer to 1-2 steps even at similar load factors. This clustering effect amplifies probe lengths nonlinearly as the number of colliding keys grows, contrasting with ideal uniform hashing where probes remain short and independent.[19][17]
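The simulation described above can be reproduced directly. In the Python sketch below, four placeholder keys are forced to share the initial hash 0 in a table of size 13, and the probe counts grow with each additional member of the secondary cluster.
m = 13
table = [None] * m
probe_counts = []

for key in ("a", "b", "c", "d"):          # placeholder keys, all with initial hash 0
    for i in range(m):
        pos = (0 + i * i) % m
        if table[pos] is None:
            table[pos] = key
            probe_counts.append(i + 1)
            break

print([pos for pos, slot in enumerate(table) if slot is not None])  # [0, 1, 4, 9]
print(probe_counts)                                                 # [1, 2, 3, 4]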
Secondary cluster sizes are measured in terms of the number of keys sharing an initial hash value, with the expected cluster size increasing linearly with the load factor \alpha (the ratio of inserted keys to table size), as higher \alpha raises the likelihood of multiple keys mapping to the same slot. The probability of a secondary cluster forming (i.e., at least two keys sharing a probe sequence) approximates \alpha^2 for low \alpha, capturing the quadratic scaling of pairwise collision risks along constrained paths; this underscores the need to maintain \alpha < 0.5 for efficient operation, as cluster growth degrades performance more sharply than in primary clustering scenarios.[3][18]
Comparisons and Alternatives
Versus Linear Probing
Linear probing is a collision resolution technique in open-addressing hash tables where, upon a collision at initial hash position h(k), the algorithm probes subsequent positions in an arithmetic sequence: h(k, i) = (h(k) + i) \mod m for i = 0, 1, 2, \dots, until an empty slot is found or the key is located.[9]
Quadratic probing differs by using a quadratic offset: h(k, i) = (h(k) + c_1 i + c_2 i^2) \mod m, often simplified to h(k) + i^2 \mod m, which generates probe positions that jump farther from the initial hash, reducing the tendency for primary clustering seen in linear probing where occupied slots form long contiguous runs that subsequent insertions must traverse.[20] However, quadratic probing introduces secondary clustering, where keys sharing the same initial hash h(k) follow identical probe sequences, potentially leading to denser local groups for those keys, though overall distribution is more even than linear probing's sequential scans in dense regions.[9]
In terms of performance, quadratic probing generally yields shorter average probe sequences than linear probing, and its advantage widens at higher load factors, where linear probing's primary clustering dominates even though secondary clustering also affects quadratic probing. For linear probing, the average number of probes for an unsuccessful search is \frac{1}{2}\left(1 + \frac{1}{(1 - \alpha)^2}\right), while for a successful search it is \frac{1}{2}\left(1 + \frac{1}{1 - \alpha}\right).[9] Quadratic probing lacks an equally simple closed-form analysis but empirically shows similar asymptotic behavior with better constants; one reported analysis finds expected insertion time remains O(1) up to \alpha \approx 0.089, in contrast to linear probing's \Theta\left(\left(\frac{1}{1-\alpha}\right)^2\right) growth in heavily clustered scenarios.[21]
| Load Factor \alpha | Linear Unsuccessful Probes (Approx.) | Quadratic Unsuccessful Probes (Approx.) |
|---|---|---|
| 0.5 | 2.5 | 2.2 |
| 0.8 | 13 | 5 |
| 0.9 | 51 | 12 |
These values illustrate quadratic probing's growing advantage as the load factor rises, derived from standard analyses and empirical simulations.[9][5]
Quadratic probing is preferable when even probe distribution is prioritized over simplicity and cache locality, such as in applications with expected low-to-moderate load factors where reducing primary clustering improves insertion efficiency. For example, consider a hash table of size 11 with keys hashing to position 2: in linear probing, inserting keys A (placed at 2), B (probes 3, placed at 3), C (probes 3–4, placed at 4), and D (probes 3–5, placed at 5) builds a contiguous cluster of slots 2–5, so a fifth key E hashing to 2 must scan the entire run before landing at slot 6. In quadratic probing (offsets i^2 \mod 11), A lands at 2, B at 3 (2+1), C at 6 (2+4), D at 0 (2+9 ≡ 0), and E at 7 (2+16 ≡ 7), so the occupied slots are scattered rather than forming one long run, and keys that hash near slot 2 are not absorbed into a growing cluster.[20]
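The contrast can be checked with a short Python sketch (illustrative helper names) that places five keys whose initial hash is slot 2 into a table of size 11 under each probing rule.
m, h = 11, 2                      # table size and shared initial hash (illustrative)

def place(offset, n_keys):
    occupied, placements = set(), []
    for _ in range(n_keys):
        for i in range(m):
            pos = (h + offset(i)) % m
            if pos not in occupied:
                occupied.add(pos)
                placements.append(pos)
                break
    return placements

print(place(lambda i: i, 5))      # [2, 3, 4, 5, 6]: linear probing builds one contiguous run
print(place(lambda i: i * i, 5))  # [2, 3, 6, 0, 7]: quadratic probing scatters the keys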
Versus Double Hashing
Double hashing is a collision resolution method in open-addressing hash tables that utilizes two independent hash functions to generate probe sequences. The position for the ith probe is computed as h(k, i) = (h_1(k) + i \cdot h_2(k)) \mod m, where h_1(k) provides the initial hash position, h_2(k) determines the step size, i is the probe number starting from 0, and m is the table size.
In comparison to quadratic probing, double hashing eliminates secondary clustering entirely because the step size h_2(k) is key-dependent, ensuring that distinct keys with the same initial hash h_1(k) follow unique probe paths rather than identical sequences. Quadratic probing, by contrast, suffers from secondary clustering as keys sharing the same initial hash position traverse the same quadratic offsets, leading to correlated collisions. Computationally, quadratic probing is more efficient, involving a single hash evaluation plus inexpensive quadratic calculations (e.g., h(k, i) = (h(k) + c_1 i + c_2 i^2) \mod m), whereas double hashing requires evaluating two full hash functions, increasing overhead especially for complex hash computations.[3][22]
Performance-wise, double hashing provides probe sequences with near-random uniformity, closely mimicking ideal uniform hashing and supporting load factors up to approximately 0.8 with average probe counts remaining low (around 2-3 for searches). This allows for more compact tables compared to quadratic probing, which is generally restricted to load factors below 0.5 to ensure that insertions and searches succeed within m/2 probes when m is prime and the quadratic coefficients are appropriately chosen (e.g., c_1 = 0, c_2 = 1). Quadratic probing's simplicity makes it preferable in resource-constrained environments despite the load factor limitation.[22]
A key trade-off in double hashing is the need for a carefully designed secondary hash function h_2(k) that yields step sizes coprime to m, preventing premature cycle repetition in the probe sequence; common choices include h_2(k) = 1 + (h'(k) \mod (m-1)) where h' is another hash. For illustration, consider a table of size m = 11 with initial hash h_1(k) = 0. In quadratic probing using (0 + i^2) \mod 11, the sequence is 0, 1, 4, 9, 5, 3, 3 (repeating at i=6, having visited only 6 distinct positions). In double hashing with h_2(k) = 5, the sequence is 0, 5, 10, 4, 9, 3, 8, 2, 7, 1, 6 (visiting all 11 positions before repeating, demonstrating fuller coverage). Such non-repeating paths in double hashing enhance distribution but demand robust hash functions to avoid degenerate cases.[23][3]
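The coverage difference in this example can be verified with a brief Python calculation, using the step value h_2(k) = 5 assumed above.
m = 11
quadratic = [(0 + i * i) % m for i in range(m)]
double_hash = [(0 + 5 * i) % m for i in range(m)]      # step h2(k) = 5

print(quadratic, len(set(quadratic)))      # [0, 1, 4, 9, 5, 3, 3, 5, 9, 4, 1] -> 6 distinct slots
print(double_hash, len(set(double_hash)))  # [0, 5, 10, 4, 9, 3, 8, 2, 7, 1, 6] -> all 11 slots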
Limitations and Mitigations
Secondary Clustering Issue
Secondary clustering represents a key limitation of quadratic probing, where keys that collide at the same initial hash position h(k) subsequently follow identical probe sequences determined solely by the quadratic offsets from that position. This results in the formation of secondary chains or clusters that depend only on the shared initial hash value, rather than the distinct characteristics of the keys themselves, leading to inefficient probing within those groups. Unlike primary clustering in linear probing, which propagates across adjacent slots based on proximity, secondary clustering isolates the effect to keys sharing the exact initial collision point but can still amplify search times within those localized chains.
The underlying cause is inherent to the probing formula, typically h(k, i) = (h(k) + i^2) \mod m for probe step i, where the offset i^2 is independent of the key k beyond its initial hash. Thus, any set of keys with the same h(k) will probe the exact same sequence of slots, creating deterministic chains that grow with the number of collisions at that hash value. This independence from key-specific data exacerbates the issue in hash functions with poor uniformity, as popular initial positions spawn longer shared paths.
A direct argument shows why the probe sequences coincide: if two keys k_1 and k_2 satisfy h(k_1) = h(k_2) = s, then for all i \geq 0, h(k_1, i) = (s + i^2) \mod m = h(k_2, i), so both keys follow identical paths, which is precisely secondary clustering. Knuth's analysis quantifies this effect, showing that while quadratic probing eliminates primary clustering, secondary clustering slightly increases expected probe lengths relative to ideal uniform hashing, though the penalty is far smaller than that incurred by linear probing.[14]
In real-world applications with uneven hash distributions—such as when input data exhibits patterns causing multiple keys to map to few initial slots—this degradation manifests as prolonged search times for elements in dense secondary clusters. For a concrete illustration, consider a hash table of size 13 where 10 keys initially hash to just three slots (say, slots 0, 1, and 2, with four keys at 0, three at 1, and three at 2); the four keys at slot 0 will all probe the shared offset sequence 0, 1, 4, 9, 3, 12, 10, \dots (taken mod 13), forming a chain of at least four probes for later insertions in that group, while unaffected slots remain underutilized. Unlike linear probing's broad propagation, secondary clustering confines the inefficiency to collided keys only, yet it can still lead to worst-case O(m) probe counts if a single initial hash value attracts enough collisions to exhaust its sequence. This targeted but potent effect underscores quadratic probing's vulnerability in non-uniform environments.
To mitigate secondary clustering, one common approach is to incorporate a key-dependent component into the probe offset, as in double hashing, where the i-th increment is i \cdot h_2(k) \mod m, so that keys sharing the same initial hash still follow distinct sequences. This eliminates shared paths at the cost of evaluating a second hash function.[3]
Alternating Signs Problem
A significant limitation of standard quadratic probing arises when the hash table size m is even, and especially when it is a power of 2: the probe sequence h(k, i) = (h(k) + i^2) \mod m then reaches only a small fraction of the table's slots, because squares modulo a power of 2 take on few distinct values. In particular, i^2 \mod 4 is always 0 (for even i) or 1 (for odd i), so probes are confined to a limited subset of positions and an insertion can fail even when empty slots remain. For example, with m = 8 and an initial position of 0, the sequence yields offsets 0, 1, 4, 1, 0, 1, 4, \dots, cycling through only slots 0, 1, and 4.[24]
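A two-line Python check makes the restriction visible for the m = 8 case above.
m = 8
visited = sorted({(0 + i * i) % m for i in range(m)})
print(visited)   # [0, 1, 4]: only three of the eight slots are ever probed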
This issue stems from the quadratic residue property and was analyzed in early works, including Hopgood and Davenport (1972), who examined quadratic hashing for power-of-2 table sizes and demonstrated incomplete traversal without modifications.[25]
One mitigation is the alternating-signs variant of quadratic probing, which alternates the sign of the quadratic offset: h(k, i) = (h(k) + (-1)^i i^2) \mod m. This approach improves distribution by roughly doubling the set of reachable offsets and yields a full permutation of the table under specific conditions, notably when m is a prime of the form 4k + 3. For such sizes, the sequence visits every slot before repeating, so an insertion succeeds whenever any empty slot remains, rather than only while the load factor stays at or below 0.5. For powers of 2, adjusted forms such as the triangular-number offsets (i^2 + i)/2 likewise achieve full coverage. Selecting m as a prime (ideally of the form 4k + 3) remains the standard recommendation for quadratic probing.[14]
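The full-coverage claim can be checked with a short Python sketch, assuming m = 11 (a prime with m mod 4 = 3) and an initial hash of 0.
m = 11                                           # prime, and 11 % 4 == 3
positions = [(0 + (-1) ** i * i * i) % m for i in range(m)]
print(sorted(set(positions)))                    # [0, 1, 2, ..., 10]: every slot is reachable
print(len(set(positions)) == m)                  # True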