Rolling hash
A rolling hash is a hash function designed to compute hash values for consecutive substrings or sliding windows of a string or data stream in constant time per update, by reusing previous computations as the window shifts, typically through polynomial or checksum-based methods.[1]
In the polynomial rolling hash approach, commonly used in string processing, a string s of length n is mapped to an integer via the formula H(s) = \sum_{i=0}^{n-1} s_i \cdot p^i \bmod m, where s_i represents the numeric value of the i-th character, p is a base (often a prime like 31), and m is a large prime modulus (e.g., 10^9 + 9) to reduce collisions.[1] Substring hashes are efficiently derived from precomputed prefix hashes using modular arithmetic and powers of p, enabling O(1) queries after O(n) preprocessing.[1] To minimize collision probability (approximately 1/m), multiple independent hash functions are often combined, such as using two different bases and moduli.[1]
Rolling hashes power key algorithms in pattern matching and data synchronization; for instance, the Rabin-Karp algorithm employs them to search for a pattern of length m in a text of length n in expected O(n + m) time by comparing rolling hashes of text windows against the pattern's hash, verifying exact matches only on hash collisions.[2] Beyond string search, they facilitate applications like plagiarism detection through substring similarity checks and counting distinct substrings in large texts.[1] In file synchronization, tools like rsync use a weak 32-bit rolling checksum alongside a strong hash (e.g., MD4) to identify matching blocks across files efficiently over networks, transmitting only differing portions to minimize bandwidth usage.[3] This technique, introduced in the context of efficient string matching by Michael O. Rabin and Richard M. Karp in 1987, remains foundational in data processing due to its balance of simplicity and performance.[2]
Fundamentals
Definition and Motivation
A rolling hash function is a specialized hashing technique designed to compute the hash value of a fixed-length substring or sliding window within a larger string or data stream by incrementally updating the previous hash value as the window advances by one position, thereby avoiding the computational overhead of recalculating the entire hash from scratch for each new window.[4] This approach leverages modular arithmetic to ensure efficient updates, typically in constant time per slide, making it suitable for processing sequential data where overlapping substrings need repeated hashing.[5]
The primary motivation for rolling hashes stems from the inefficiency of naive substring hashing methods in applications requiring multiple overlapping window comparisons, such as string pattern matching or data deduplication. In a naive implementation, hashing each of the (n - m + 1) possible windows of size m in an n-length string would incur O((n - m + 1) * m) time complexity due to the per-window recomputation cost.[4] By contrast, rolling hashes reduce this to O(n + m) overall time through incremental updates, enabling scalable processing of large datasets where frequent substring evaluations are essential.[4]
Rolling hashes originated in the 1980s amid developments in efficient string processing algorithms, with a seminal application in the Rabin-Karp string-searching algorithm introduced by Richard M. Karp and Michael O. Rabin in 1987. This work built on earlier linear-time matching techniques like Knuth-Morris-Pratt (1977) but innovated by using randomized fingerprinting—essentially an early form of rolling hashing—to achieve constant-space efficiency and near-real-time performance for pattern detection in texts and multidimensional arrays.
As a basic illustration, consider hashing consecutive k-mers in a string over an alphabet of size b by treating each k-mer as the evaluation of a polynomial of degree (k-1) with coefficients from the character values, computed modulo a large prime p; the hash for the next k-mer is then obtained by subtracting the outgoing character's contribution, multiplying by b, and adding the incoming character's value, all modulo p.[5]
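A one-step worked sketch in Python (with an arbitrary base, modulus, and test string, chosen here purely for illustration) shows a single rolling update agreeing with a direct recomputation of the next k-mer:

    # One-step illustration of the k-mer update described above (toy values; base,
    # modulus, and string are arbitrary choices).
    b, p, k = 256, 10**9 + 7, 3
    s = "ACGT"

    def direct(kmer):
        # Evaluate the degree-(k-1) polynomial with the first character as the leading coefficient.
        h = 0
        for c in kmer:
            h = (h * b + ord(c)) % p
        return h

    h = direct(s[0:3])                                              # hash of "ACG"
    h = ((h - ord(s[0]) * pow(b, k - 1, p)) * b + ord(s[3])) % p    # drop 'A', shift, add 'T'
    assert h == direct(s[1:4])                                      # equals the hash of "CGT"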
Advantages over Naive Methods
Rolling hash provides significant efficiency gains over naive methods for computing hashes of overlapping substrings, primarily through its incremental update mechanism. In naive approaches, hashing each possible substring of length k in a string of length n requires recomputing the hash from scratch for every position, resulting in a time complexity of O(nk). By contrast, rolling hash computes an initial hash in O(k) time and then updates the hash value in O(1) time per window slide by subtracting the contribution of the outgoing character and adding the incoming one, multiplied by appropriate powers of the base, yielding a total time complexity of O(n + k) or simply O(n) for large n. This linear-time performance makes rolling hash particularly suitable for large-scale string processing tasks where k is substantial relative to n.[6][7][8]
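The following Python toy comparison makes the contrast concrete: the naive routine redoes O(k) work per window, while the rolling routine performs one O(1) update per position and produces identical hash values (the constants are illustrative choices, not recommended parameters).

    # Toy comparison of naive O(n*k) window hashing vs. an O(n) rolling scan.
    B, P = 256, (1 << 61) - 1

    def naive_hashes(s, k):
        # Rehash every window from scratch: O(k) work per window.
        return [sum(ord(c) * pow(B, k - 1 - j, P) for j, c in enumerate(s[i:i + k])) % P
                for i in range(len(s) - k + 1)]

    def rolling_hashes(s, k):
        # Hash the first window, then do one O(1) update per shift.
        h = 0
        for c in s[:k]:
            h = (h * B + ord(c)) % P
        out = [h]
        bk = pow(B, k - 1, P)
        for i in range(k, len(s)):
            h = ((h - ord(s[i - k]) * bk) * B + ord(s[i])) % P
            out.append(h)
        return out

    assert naive_hashes("abracadabra", 4) == rolling_hashes("abracadabra", 4)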
In terms of space efficiency, rolling hash maintains only a single running hash value (plus a few precomputed constants such as b^{k-1} \bmod p) at any time, requiring constant additional memory beyond the input string itself. Naive methods need no more storage, but they repeatedly re-read and re-process the entire window at every position, so the rolling approach also reduces redundant memory traffic. This minimal footprint facilitates use in resource-constrained environments, such as embedded systems, or when processing massive datasets where storing intermediate results for all windows would be prohibitive.[9][10]
Practically, rolling hash enables real-time processing of streaming data by allowing continuous updates without restarting computations, which is infeasible with naive full rehashes at each step. For instance, in plagiarism detection, rolling hash can scan long documents to identify similar passages by fingerprinting overlapping windows efficiently, avoiding the exhaustive hashing of every possible substring that would render naive methods impractical for texts exceeding thousands of characters. Additionally, when implemented with modular arithmetic over large primes, rolling hash supports probabilistic matching with low collision rates, further enhancing its reliability in applications requiring quick filtering before exact verification.[11][12]
Core Techniques
Polynomial Rolling Hash
The polynomial rolling hash function computes a numeric value for a fixed-length substring by interpreting the characters as coefficients of a polynomial evaluated at a chosen base b, taken modulo a large prime p. For a string s = s_1 s_2 \dots s_k where each s_i is mapped to an integer between 0 and the alphabet size minus 1, the initial hash h for the window of length k is
h = s_1 \cdot b^{k-1} + s_2 \cdot b^{k-2} + \dots + s_k \cdot b^0 \pmod{p}.
This formulation allows the hash to represent the substring uniquely with high probability when p is sufficiently large, facilitating quick comparisons in applications like pattern matching. The initial computation can be performed efficiently in O(k) time using Horner's rule, which rewrites the expression as h = ((s_1 \cdot b + s_2) \cdot b + \dots + s_k) \pmod{p}.[6]
To slide the window by one position to the substring s_2 s_3 \dots s_{k+1}, the hash is updated in constant time without recomputing from scratch. The new hash h' is calculated as
h' = (h - s_1 \cdot b^{k-1}) \cdot b + s_{k+1} \pmod{p}.
If the subtraction yields a negative value, adding a multiple of p (for example, reducing s_1 \cdot b^{k-1} modulo p before subtracting, so that a single addition of p suffices) restores non-negativity before the multiplication and addition. This update leverages the polynomial structure, effectively "shifting" the coefficients by removing the leading term, multiplying by b, and appending the new character. Precomputing b^{k-1} \pmod{p} once allows all updates to run in O(1) time after the initial hash.[6]
The parameters b and p are selected to balance computational efficiency, hash range, and collision resistance. The base b is typically chosen larger than the alphabet size, often a prime such as 131 for general text processing, or simply 256 when bytes of an ASCII string are used directly, so that the polynomial evaluation spreads values evenly. The modulus p is chosen as a large prime, often 10^9 + 7, which confines the hash to about 30 bits while approximating a uniform distribution over [0, p-1]. For random strings, the probability of two distinct substrings having the same hash (a collision) is roughly 1/p; to improve safety in sensitive applications, double hashing with two independent moduli p_1 and p_2 reduces this to approximately 1/(p_1 p_2).[6][13]
The following pseudocode illustrates initialization and sliding for a string s (0-indexed) of length n with window size k:
function initialize_hash(s, k, b, p):
    h = 0
    for i = 0 to k-1:
        h = (h * b + ord(s[i])) % p
    return h

function slide_hash(h, s_old, s_new, b, bk_minus_1, p):
    h = (h - (ord(s_old) * bk_minus_1) % p + p) % p   // reduce the product first so a single +p restores non-negativity
    h = (h * b + ord(s_new)) % p
    return h

// Usage example:
bk_minus_1 = pow(b, k-1, p)          // Precompute b^{k-1} mod p
h = initialize_hash(s, k, b, p)
for i = k to n-1:
    h = slide_hash(h, s[i-k], s[i], b, bk_minus_1, p)
    // Use h for window s[i-k+1 .. i]
This approach enables linear-time processing of all n - k + 1 windows after O(k) setup. The method forms the basis for more advanced techniques like the Rabin fingerprint, which enhances error bounds for pattern matching.[6]
Rabin Fingerprint
The Rabin fingerprint extends the polynomial rolling hash by computing the hash value as a pair (h mod p₁, h mod p₂), where p₁ and p₂ are two large distinct primes, providing a compact representation for probabilistic string comparison.[14]
The update mechanism follows the same sliding window principle as the base polynomial hash but is performed independently modulo each prime: subtract the outgoing character's contribution scaled by b^{k-1} modulo p_i, multiply the result by b modulo p_i, add the incoming character modulo p_i, for i=1,2, enabling constant-time shifts across overlapping substrings.
In the Rabin-Karp algorithm for string searching, this fingerprint pair for the pattern is compared against pairs for each m-length substring of the text, where m is the pattern length, achieving O(n + m) average-case time complexity for text length n: equal fingerprints indicate potential matches, which are verified exactly and are true matches with overwhelming probability, while unequal pairs definitively rule out matches.
The collision probability, where distinct strings produce the same fingerprint pair, is roughly 1/(p₁ p₂); for instance, with p₁ ≈ p₂ ≈ 10⁹, this yields an error rate of approximately 10⁻¹⁸, sufficiently low for most applications without increasing computational overhead significantly.[14]
This method originated in Michael O. Rabin's 1981 work on fingerprinting via random polynomials and was formalized for pattern matching by Richard M. Karp and Michael O. Rabin in 1987.[14]
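A minimal Python sketch of such a paired fingerprint in a Rabin-Karp-style search follows; the base and the two primes are illustrative choices, not values prescribed by the cited sources.

    # Double-modulus rolling fingerprint used in a Rabin-Karp-style search (illustrative constants).
    B = 257
    P1, P2 = 1_000_000_007, 998_244_353        # two large distinct primes

    def fingerprint(s):
        h1 = h2 = 0
        for c in s:
            h1 = (h1 * B + ord(c)) % P1
            h2 = (h2 * B + ord(c)) % P2
        return h1, h2

    def find_pattern(text, pattern):
        m = len(pattern)
        if m == 0 or m > len(text):
            return -1
        target = fingerprint(pattern)
        h1, h2 = fingerprint(text[:m])
        bk1, bk2 = pow(B, m - 1, P1), pow(B, m - 1, P2)
        for i in range(m, len(text) + 1):
            # Current window is text[i-m:i]; verify exactly on a fingerprint match.
            if (h1, h2) == target and text[i - m:i] == pattern:
                return i - m
            if i == len(text):
                break
            # Slide both components independently, as described above.
            h1 = ((h1 - ord(text[i - m]) * bk1) * B + ord(text[i])) % P1
            h2 = ((h2 - ord(text[i - m]) * bk2) * B + ord(text[i])) % P2
        return -1

    assert find_pattern("the quick brown fox", "brown") == 10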
Variants and Extensions
Cyclic Polynomial Hash
The cyclic polynomial hash, also known as Buzhash, modifies the standard polynomial rolling hash by treating the sliding window as a circular buffer, employing bit rotations and XOR operations instead of multiplications to achieve efficient updates and improved handling of wrap-around effects. This approach draws from polynomial representations over finite fields, where cyclic shifts correspond to multiplication by primitive elements, ensuring the hash value remains consistent under rotations of the input window. Unlike linear polynomial methods, it avoids edge biases in periodic or repeating data structures by modeling the window as a cycle, which enhances uniformity in hash distributions for applications involving circular or modular data.[15][16]
The formulation computes the hash of a k-length window s_0, s_1, \dots, s_{k-1} as
H = \bigoplus_{i=0}^{k-1} \text{rol}_{k-1-i} \left( h(s_i) \right),
where h(\cdot) maps each symbol s_i to a random fixed-width integer (e.g., 64 bits), \text{rol}_j denotes a left rotation by j bits (taken modulo the word width), and \bigoplus is bitwise XOR, so the oldest symbol in the window is rotated the most. To update the hash for a sliding window—removing s_0 and adding s_k—the new value is obtained via
H' = \text{rol}_1(H) \oplus \text{rol}_k(h(s_0)) \oplus h(s_k),
which requires one rotation of the running hash and two XORs (the term \text{rol}_k(h(\cdot)) can be precomputed per symbol for a fixed window length k), maintaining constant-time complexity per update. This recursive structure leverages the cyclic nature to simulate polynomial evaluation under rotation, where the rotation operator acts analogously to multiplication by a primitive root in the underlying field.[16][15]
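A minimal Python sketch of this scheme follows, assuming 64-bit words and an illustrative seeded random byte table; the names and constants are not taken from any particular implementation.

    # Minimal Buzhash-style sketch (64-bit words; the random byte table is illustrative).
    import random

    W = 64
    MASK = (1 << W) - 1
    rng = random.Random(0)
    TABLE = [rng.getrandbits(W) for _ in range(256)]      # h(.): byte -> random 64-bit word

    def rol(x, r):
        r %= W
        return ((x << r) | (x >> (W - r))) & MASK

    def buzhash(window):
        k = len(window)
        h = 0
        for i, byte in enumerate(window):
            h ^= rol(TABLE[byte], k - 1 - i)              # oldest byte rotated the most
        return h

    def slide(h, k, out_byte, in_byte):
        # Rotate the running hash by one, cancel the outgoing byte, XOR in the new byte.
        return rol(h, 1) ^ rol(TABLE[out_byte], k) ^ TABLE[in_byte]

    data, k = bytes(range(32)), 8
    h = buzhash(data[0:k])
    h = slide(h, k, data[0], data[k])
    assert h == buzhash(data[1:k + 1])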
A key advantage of the cyclic polynomial hash lies in its avoidance of costly multiplications, relying instead on fast bitwise operations, which can yield up to twice the speed of irreducible polynomial alternatives on hardware with limited multiplication throughput, while achieving pairwise independence for low collision probabilities in n-gram hashing—which represents the best achievable level for recursive n-gram hashing functions like cyclic polynomial hashes.[17] It excels with periodic or rotational data, such as in cyclic redundancy checks adapted for probabilistic hashing, by reducing sensitivity to window boundaries and providing more uniform distributions compared to linear shifts in standard polynomial hashes. This makes it particularly suitable for scenarios where data exhibits repetitive patterns, minimizing biases that could cluster similar hashes.[15]
In practice, cyclic polynomial hashing finds application in content deduplication systems with variable chunk boundaries, such as backup tools like Borg, where it identifies duplicate blocks by computing rolling hashes over streams to trigger chunks at natural data seams, thereby reducing storage overhead without fixed-size partitioning. It also supports bioinformatics tasks, including k-mer counting and sequence alignment, by enabling efficient rolling computations over genomic data to filter duplicates and build probabilistic structures like Bloom filters with minimal collisions. These uses highlight its role in reducing hash distribution biases for irregular or cyclic inputs, distinct from standard polynomial rolling hashes that assume linear progression and may exhibit edge effects in sliding windows.[16][15]
Gear Fingerprint
The Gear fingerprint is a specialized rolling hash variant tailored for efficient variable-length chunking in data processing tasks, particularly in backup and cloud storage systems. It leverages a lightweight recursive update mechanism to compute hash values over sliding windows, facilitating the identification of content-defined boundaries without relying on fixed block sizes. First introduced in the Ddelta delta compression framework in 2014, the Gear fingerprint was refined in subsequent work on content-defined chunking algorithms during the mid-2010s to address performance bottlenecks in deduplication pipelines.[18]
At its core, the Gear fingerprint operates by maintaining a rolling hash computed over the data stream using the update formula fp = (fp \ll 1) + G(b), where fp is the current fingerprint, \ll 1 denotes a left bit-shift by one position, and G(b) is a precomputed 256-entry table assigning a random 64-bit integer to each possible byte value b. A chunk boundary is selected when the Gear hash satisfies a predefined mask condition, such as requiring that the bits selected by the mask are all zero (e.g., fp \& (2^{13} - 1) = 0 for an average 8 KB chunk size). This approach allows efficient boundary detection as the window slides byte-by-byte.[18]
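A compact Python sketch of this boundary test follows; the random table, the 13-bit mask, and the size limits are illustrative assumptions rather than the constants of any particular system.

    # Gear-style boundary detection sketch (illustrative table, mask, and size limits).
    import random

    rng = random.Random(42)
    GEAR = [rng.getrandbits(64) for _ in range(256)]   # G(b): one random 64-bit value per byte
    MASK64 = (1 << 64) - 1
    BOUNDARY_MASK = (1 << 13) - 1                      # 13 selected bits -> ~8 KB average chunks

    def find_chunk_boundary(data, min_size=2048, max_size=65536):
        # Returns the length of the next chunk starting at data[0].
        fp = 0
        end = min(len(data), max_size)
        for i in range(end):
            fp = ((fp << 1) + GEAR[data[i]]) & MASK64  # fp = (fp << 1) + G(b), kept to 64 bits
            if i + 1 >= min_size and (fp & BOUNDARY_MASK) == 0:
                return i + 1                           # content-defined cut after byte i
        return end                                     # forced cut at max_size or end of data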
The primary advantages of the Gear fingerprint lie in its computational efficiency and adaptability for content-aware partitioning, which enhances deduplication ratios by producing variable-length chunks aligned with natural shifts in the data stream, such as file format transitions or repetitive patterns. Unlike naive fixed-size chunking, it avoids the boundary misalignment that reduces storage savings, and chunkers built on it in the FastCDC framework achieve up to a 10× speedup over Rabin-based methods and about 3× over earlier Gear- and AE-based methods while preserving hash uniformity and low collision rates. In practical deployments, such as cloud backup systems, this results in significant overall chunking throughput improvements with negligible impact on deduplication effectiveness (e.g., less than 1% ratio loss compared to Rabin hashing).[18]
In a typical application to a data file, the algorithm slides a window byte-by-byte to generate successive Gear hash values, accumulating content influence through the shifting mechanism. At each position after a minimum chunk length, the Gear hash is evaluated against the threshold; a match signals a boundary, segmenting the file into content-coherent chunks suitable for storage optimization. This process ensures balanced chunk sizes (e.g., 2–64 KB) while minimizing overhead from excessive small or oversized fragments.[18]
Applications
Content-Based Chunking
Content-based chunking, also known as content-defined chunking (CDC), leverages rolling hashes to partition data streams into variable-sized chunks based on inherent content patterns rather than fixed byte offsets. This approach scans the input data sequentially using a sliding window of fixed length, typically 32 to 64 bytes, over which a rolling hash function—often a Rabin fingerprint—is computed incrementally as each new byte is incorporated and the oldest byte is removed. Chunk boundaries are dynamically selected when the rolling hash satisfies a predefined condition, such as the low-order bits of the hash value being zero, ensuring that similar content produces identical chunk divisions regardless of positional shifts. This method was pioneered in systems like the Low-Bandwidth File System (LBFS), where it enables efficient deduplication by aligning chunks to semantic boundaries in the data.[19]
The specific technique involves applying a mask to the rolling hash output to control chunk size distribution. For instance, if the hash value bitwise ANDed with a mask (e.g., 2^k - 1 for checking the lowest k bits) equals zero, the current window position marks the end of a chunk, keeping chunk sizes in a typical range of roughly 4-64 KB while allowing variation based on content. Minimum and maximum chunk size constraints are enforced to prevent overly small or large fragments, with the mask parameter tuned to achieve the desired average (e.g., k=13 for approximately 8 KB chunks). Variants like the gear fingerprint can further optimize this process by using multiple hash gears for faster computation without sacrificing boundary detection quality. In practice, the full chunk is then fingerprinted using a cryptographic hash like SHA-1 for deduplication lookup, but the rolling hash serves solely for boundary detection.
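The following Python sketch illustrates this mask-and-constraint rule, with a simple polynomial rolling hash standing in for the Rabin fingerprint; the window size, base, modulus, and size limits are illustrative assumptions.

    # LBFS-style content-defined chunking sketch with a fixed-width rolling window
    # (a polynomial rolling hash stands in for the Rabin fingerprint; constants are illustrative).
    WIN, B, P = 48, 257, (1 << 61) - 1
    K_BITS = 13                              # low 13 bits zero -> roughly 8 KB average chunks
    MIN_SIZE, MAX_SIZE = 2048, 65536

    def chunk_boundaries(data):
        bk = pow(B, WIN - 1, P)              # B^(WIN-1) mod P, precomputed once
        h, last_cut, cuts = 0, 0, []
        for i, byte in enumerate(data):
            if i < WIN:
                h = (h * B + byte) % P                         # fill the initial window
            else:
                h = ((h - data[i - WIN] * bk) * B + byte) % P  # slide the fixed-size window
            length = i + 1 - last_cut
            if (length >= MIN_SIZE and (h & ((1 << K_BITS) - 1)) == 0) or length >= MAX_SIZE:
                cuts.append(i + 1)                             # cut point after byte i
                last_cut = i + 1
        return cuts   # chunks are the byte ranges between consecutive cut points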
This content-based approach yields higher deduplication rates compared to fixed-size chunking.[20] By aligning boundaries to content features, it achieves better similarity detection in diverse workloads like email archives or file backups, where duplicates often span shifted regions.
In contrast to fixed chunking, which suffers from boundary-shift errors—where even a single-byte insertion causes all subsequent chunks to differ, triggering unnecessary rehashing and storage—content-based chunking localizes changes to affected regions only. A small edit early in a file might shift only one or two chunks, preserving the rest for reuse in deduplication. This resilience is particularly valuable in incremental backups or versioned storage, minimizing data transfer and recomputation.
Real-world implementations include backup tools like Restic, which employs 64-bit Rabin fingerprints over 64-byte windows with a mask checking the lowest 21 bits for boundaries, enabling efficient inter-file and inter-snapshot deduplication. Distributed storage systems such as GlusterFS also integrate Rabin-Karp rolling hashes for variable chunking in their deduplication translator, optimizing space in scale-out environments.[21] These applications underscore the technique's role in modern storage efficiency, from personal backups to enterprise file systems.[22]
FastCDC Algorithm
FastCDC is an efficient content-defined chunking (CDC) algorithm designed for high-performance data deduplication, leveraging a rolling hash mechanism to identify chunk boundaries without requiring full file hashing. Introduced by Xia et al. in 2016, it builds on prior Gear-based approaches by incorporating three key optimizations: simplified hash judgment, skipping of sub-minimum cut-points, and chunk-size normalization, all integrated with a fast rolling hash to achieve near-linear time complexity.[18] This enables FastCDC to process large datasets rapidly while maintaining deduplication ratios comparable to more computationally intensive methods.[18]
The algorithm proceeds in structured steps to detect boundaries efficiently. First, it normalizes chunk sizes by switching the hash mask around an expected average chunk size (e.g., 8KB): a stricter mask with more '1' bits is applied before this point, making boundaries harder to declare, and a looser mask with fewer '1' bits is applied afterward, making them easier; this concentrates chunk sizes around the target average instead of letting them spread across the full minimum-to-maximum range.[18] Second, it skips sub-minimum cut-points by beginning hash evaluation only after a minimum chunk size (e.g., 2KB), since no boundary may be placed in that region anyway.[18] Third, it simplifies the hash judgment itself: a boundary is selected where the bits of the rolling hash chosen by the current mask are all zero, evaluated as the hash slides incrementally over the data stream.[18]
A core innovation in FastCDC is its find_boundary function, which integrates the rolling hash with cut-point skipping by beginning hash evaluation at the minimum size threshold, avoiding redundant judgments of sub-minimum cut-points and enabling linear-time boundary detection across the data.[18] The rolling hash itself, known as the Gear hash, uses an array of 256 precomputed 64-bit random integers (Gear[b] for byte value b) and updates via a left shift and an addition, fp = (fp << 1) + Gear[src[i]], allowing constant-time incremental computation as the window slides byte by byte.[18] This combination yields up to 10× speedup over traditional Rabin-based chunkers and 3× over other Gear- or AE-based methods, with deduplication ratios remaining similar (e.g., 47.64% for FastCDC vs. 47.58% for Rabin on TAR datasets at 8KB average size).[18]
The following pseudocode illustrates the rolling hash integration in the boundary detection process (adapted from Algorithm 1 in Xia et al.):[18]
Input: data buffer src, length n
MaskS ← 0x0003590703530000LL   // 15 '1' bits: stricter mask used before NormalSize
MaskL ← 0x0000d90003530000LL   // 11 '1' bits: looser mask used after NormalSize
MinSize ← 2KB; NormalSize ← 8KB
fp ← 0
i ← MinSize                    // sub-minimum cut-points are skipped: hashing starts here
while i < n:
    fp ← ((fp << 1) + Gear[src[i]]) & 0xFFFFFFFFFFFFFFFFLL   // 64-bit Gear update
    if i < NormalSize:
        if !(fp & MaskS):
            return i           // boundary found under the stricter mask
    else:
        if !(fp & MaskL):
            return i           // boundary found under the looser mask
    i ← i + 1
return n                       // no boundary found; cut at end of buffer
// (A maximum chunk size cap is omitted here for brevity.)
Analysis
Computational Complexity
The computational complexity of rolling hash functions centers on their ability to update hash values efficiently as the window slides over the input stream. For core techniques like polynomial rolling hash and Rabin fingerprinting, preprocessing to compute the initial hash for a window of size k requires O(k) time, while each subsequent update for sliding the window by one position takes O(1) time, resulting in O(n) total time to process a stream of length n.[23][24]
Space requirements are minimal, using O(1) additional space beyond the input stream, as the algorithm maintains only the current hash value, the window size k, the base, and the modulus.[23] The core operations rely on modular arithmetic, where multiplication of two numbers modulo a prime p has bit complexity O((\log p)^2), though in practical implementations using 64-bit integers, these operations execute in constant time via optimized hardware instructions.[25]
Probabilistically, rolling hashes exhibit strong average-case performance due to low collision probabilities when using large primes for the modulus. In the worst case, a collision triggers verification via direct comparison, costing O(k) time, but with a suitably large prime p (e.g., around 2^{64}), the collision probability is approximately 1/p, ensuring expected O(1) time per update.[26]
Comparisons across variants reveal consistent efficiency: both polynomial rolling hash and Rabin fingerprinting achieve O(1) time per update. The gear fingerprint, employed in content-defined chunking, also supports O(1) updates through simple shift and addition operations.[18]
Limitations and Trade-offs
Despite their efficiency, rolling hashes face significant collision risks, even with large prime moduli. Adversarial inputs or low-entropy data can exploit uneven feature coverage in Rabin's randomized fingerprinting model, leading to high false positive rates that exceed 10% for entropy scores up to 180 in typical text corpora.[27] Mitigation strategies include employing multiple independent hash functions—such as double hashing with distinct bases and moduli—or performing exact verification on potential matches, which can reduce collision probabilities to approximately 10^{-18} using double hashing with two large prime moduli around 10^9.[1]
Parameter selection presents key trade-offs in rolling hash design. A larger modulus, such as 10^9 + 9, minimizes collision likelihood by expanding the hash space but necessitates careful handling of arithmetic operations to prevent performance degradation from big-integer simulations. Conversely, a small base (e.g., 31 for lowercase letters) risks patterned distributions and elevated collisions in repetitive or structured inputs, while larger bases (e.g., 53 for full ASCII) promote uniformity at the expense of higher multiplication costs during rolling updates.[1]
Rolling hashes lack cryptographic security and are unsuitable for adversarial environments like authentication or data integrity against tampering; for such purposes, secure alternatives like SHA-256 must be used instead. In certain applications, such as incremental message authentication, they remain vulnerable to length-extension attacks, where an attacker can predictably append data to a known hash value without access to the original input.[28]
Practical implementation introduces challenges like integer overflow in polynomial computations, which can be mitigated by using unsigned 64-bit integers and explicit modular reduction to maintain non-negative values. Additionally, processing non-ASCII data requires robust character mapping—such as full Unicode ordinal values—to avoid equivalence collisions, with the base selected to exceed the effective alphabet size for uniform distribution.[1]
Emerging applications in machine learning for preprocessing high-volume datasets highlight opportunities for improvement, where hybrid designs integrating rolling hashes with cryptographically secure primitives can balance speed and robustness.[29]