
Universal hashing

Universal hashing is a technique in algorithm design and cryptography for selecting a hash function at random from a family of such functions, where the family is designed to ensure that for any two distinct keys, the probability of them mapping to the same hash value is at most 1/m, with m denoting the number of possible hash outputs. This property, known as 2-universality, guarantees that collisions occur with probability no higher than that of a truly random function, making it robust against worst-case inputs. The concept was introduced by J. Lawrence Carter and Mark N. Wegman in their seminal 1977 paper, "Universal Classes of Hash Functions," presented at the 9th Annual ACM Symposium on Theory of Computing (STOC). In this work, they defined universal classes (such as H_1, H_2, and H_3) that are both theoretically sound and practically evaluable, enabling input-independent average-case linear time for storage and retrieval operations in hash tables. Their approach built on earlier ideas in randomized algorithms, including probabilistic selection methods by researchers like Michael Rabin, to provide performance bounds that hold regardless of data distribution. Universal hashing underpins efficient implementations of data structures like chained hash tables, where it ensures expected O(1) lookup and insertion times even under adversarial conditions. It has also influenced streaming algorithms for tasks such as frequency estimation and heavy-hitters detection, leveraging fast 4-universal variants for high-speed processing. In cryptography, universal hashing enables information-theoretically secure message authentication codes (e.g., the Wegman–Carter MAC) when combined with a secret key, while extensions like universal one-way hash functions provide computational security for applications such as digital signatures and compression.
Over time, refinements such as strongly universal hashing have emerged, offering tighter collision bounds (e.g., a collision probability of exactly 1/m) while maintaining computational efficiency, with modern implementations achieving rates of 0.2 CPU cycles per byte for string hashing.

Fundamentals

Definition and motivation

Universal hashing refers to a method of selecting a hash function randomly from a carefully designed family of functions to achieve desirable probabilistic properties in hash tables and related data structures. A universal hash family \mathcal{H} is defined as a set of functions from a key universe U to a slot universe M such that, for any two distinct keys x, y \in U, the probability that a randomly chosen function h \in \mathcal{H} maps them to the same slot satisfies \Pr_{h \in \mathcal{H}}[h(x) = h(y)] \leq 1/|M|. This bound ensures that collisions occur no more frequently than if the keys were mapped independently and uniformly at random to the slots. The primary motivation for universal hashing arises from the vulnerabilities of fixed hash functions, which can suffer from poor performance when the input distribution is unknown or adversarial. With a fixed hash function, an adversary aware of the function can deliberately choose keys that concentrate in few slots, leading to many collisions and degraded efficiency, such as \Theta(n) time per operation in the worst case for n keys. Universal hashing addresses this by randomizing the choice of function from the family at runtime (or during preprocessing), providing worst-case expected performance guarantees that hold independently of the input, thus mitigating adversarial exploitation and ensuring average-case linear-time operations without assuming benign key distributions. In hashing, collisions inevitably occur when multiple keys map to the same slot, but their probability determines the overall efficiency of structures like hash tables using chaining or open addressing. Classical analyses often assume uniform key distribution, but universal families bound the pairwise collision probability uniformly at 1/|M| for any distinct pair, yielding an expected number of collisions per slot proportional to the load factor, regardless of key correlations. This uniform bound simplifies performance proofs and enhances robustness in practice.
To illustrate, consider a simple scenario with a table of 2 slots (|M| = 2) and keys drawn from a small universe. A non-universal family might consist of two functions where a specific pair of keys always collides (e.g., both functions map the pair to slot 0), yielding \Pr[h(x)=h(y)] = 1 > 1/2 for that pair and allowing an adversary to exploit it. In contrast, a universal family ensures that for every distinct pair, at most half the functions cause a collision, bounding the probability at 1/2 and distributing keys more evenly in expectation.
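The definition can be checked exhaustively for a small family. The sketch below (the parameters p = 11 and m = 2 and all names are illustrative choices, not from the text) enumerates the classic multiply-add family and confirms that no distinct pair of keys collides under more than a 1/m fraction of the functions:

```python
# Exhaustive check of 2-universality for a toy family
# h_{a,b}(x) = ((a*x + b) % p) % m; p and m are illustrative.
from itertools import product

p, m = 11, 2                       # small prime modulus, 2 slots
family = [(a, b) for a in range(1, p) for b in range(p)]

def h(a, b, x):
    return ((a * x + b) % p) % m

def collision_prob(x, y):
    """Fraction of the family that collides on a fixed pair x != y."""
    hits = sum(1 for a, b in family if h(a, b, x) == h(a, b, y))
    return hits / len(family)

worst = max(collision_prob(x, y)
            for x, y in product(range(p), repeat=2) if x != y)
assert worst <= 1 / m              # the 2-universality bound holds
```

For these parameters the worst pairwise collision fraction works out to 5/11, comfortably below the 1/2 bound.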

Historical development

Universal hashing was introduced by J. Lawrence Carter and Mark N. Wegman in the late 1970s as a response to the limitations of traditional hashing methods, which relied on fixed function choices and assumptions about input distributions that often failed in practice. Their work aimed to provide rigorous performance guarantees for hashing in randomized algorithms, ensuring average-case efficiency independent of the input data by selecting a hash function randomly from a carefully designed class. An extended abstract appeared in 1977 at the Symposium on the Theory of Computing, followed by the seminal full paper "Universal Classes of Hash Functions" published in 1979, where they defined universal hash families and presented efficient constructions for hashing integers. In a follow-up publication in 1981, Wegman and Carter extended the framework by introducing strongly universal hash functions, which offer even tighter collision-probability bounds and enable applications beyond basic storage, such as authentication codes and set-equality testing. This paper, "New Hash Functions and Their Use in Authentication and Set Equality," demonstrated how these functions could achieve information-theoretically secure message authentication without relying on computational assumptions, marking a key milestone in connecting universal hashing to cryptography. The concept gained practical traction in the 1990s through refinements focused on efficiency and implementation, influencing the design of hash tables and parallel algorithms. For instance, Dietzfelbinger et al. proposed high-performance universal hashing schemes in 1992 that supported constant-time evaluation and fast construction, facilitating their use in shared-memory data structures. Post-2000 developments emphasized hardware and software optimizations, such as Rogaway's PolyR in 2001, which enabled fast hashing of short messages with minimal key sizes, paving the way for efficient deployments in modern systems.

Properties

Universal hash families

A universal hash family is a collection \mathcal{H} of hash functions from a universe U to a range M with |M| = m, such that for any distinct x, y \in U, the fraction of functions in \mathcal{H} that map x and y to the same value is at most 1/m. Formally, \left| \{ h \in \mathcal{H} : h(x) = h(y) \} \right| / |\mathcal{H}| \leq 1/m. Equivalently, if h is selected uniformly at random from \mathcal{H}, then \Pr[ h(x) = h(y) ] \leq 1/m. This property ensures that collisions are no more likely than under a perfectly random mapping, providing a probabilistic guarantee independent of the specific keys. The definition implies a bound on the expected collision probability when using a random hash from the family. For any distinct pair x, y, the probability of collision is directly upper-bounded by 1/m, matching the collision rate of an ideal random function. To see the broader implication, consider a fixed x and a set S \subseteq U \setminus \{x\} with |S| = k. By linearity of expectation, the expected number of collisions between x and elements of S is \sum_{y \in S} \Pr[ h(x) = h(y) ] \leq k / m. A proof sketch proceeds as follows: the indicator random variable I_y = 1 if h(x) = h(y) and 0 otherwise has \mathbb{E}[I_y] = \Pr[ h(x) = h(y) ] \leq 1/m, so the total satisfies \mathbb{E}[ \sum I_y ] \leq k/m. This average-case bound holds regardless of the distribution of keys in S. While stronger notions like pairwise independence require h(x) and h(y) to be uniformly distributed and independent (yielding exactly 1/m collision probability), universality only demands the upper bound and thus allows for more efficient constructions with smaller families. Universality suffices for many applications, as the average collision rate of 1/m ensures good performance in expectation without the overhead of full independence.
A simple example of a universal hash family for small universes is the shift-based family over U = M = \{0, 1, \dots, m-1\}, defined as \mathcal{H} = \{ h_a \mid a = 0, 1, \dots, m-1 \} where h_a(x) = (x + a) \mod m. For any distinct x, y \in U, h_a(x) = h_a(y) implies x + a \equiv y + a \pmod{m}, or x \equiv y \pmod{m}, which contradicts x \neq y since both are in \{0, \dots, m-1\}. Thus, no function in \mathcal{H} causes a collision for this pair, so |\{ h \in \mathcal{H} : h(x) = h(y) \}| = 0 \leq m / m = 1, and the collision probability is 0 \leq 1/m. This family achieves the universality bound (trivially, with zero collisions) and can be evaluated efficiently via modular arithmetic.
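The zero-collision property of the shift family is easy to verify in a few lines; the sketch below (m = 8 is an arbitrary choice) confirms that no function in the family maps any two distinct keys together:

```python
# The shift family h_a(x) = (x + a) % m over U = M = {0, ..., m-1}:
# distinct keys never collide, trivially satisfying the 1/m bound.
m = 8
family = [lambda x, a=a: (x + a) % m for a in range(m)]

def collisions(x, y):
    """Number of functions in the family mapping x and y together."""
    return sum(1 for h in family if h(x) == h(y))

assert all(collisions(x, y) == 0
           for x in range(m) for y in range(m) if x != y)
```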

Strong and k-wise universality

A strongly universal hash family, also known as a 2-universal family with exact pairwise independence, requires that for any two distinct keys x \neq y in the universe U and any two values a, b in the range M = \{0, 1, \dots, m-1\}, the probability that a randomly selected hash function h from the family satisfies h(x) = a and h(y) = b is exactly 1/m^2. This ensures that the hash values for distinct keys are perfectly independent and uniformly distributed, implying that the collision probability \Pr[h(x) = h(y)] = 1/m holds exactly. This concept generalizes to k-wise universal hash families, where for any k distinct keys x_1, \dots, x_k \in U and any k values v_1, \dots, v_k \in M, the probability \Pr[h(x_1) = v_1, \dots, h(x_k) = v_k] = 1/m^k exactly, mimicking the behavior of fully independent uniform random variables restricted to k samples. Such families provide progressively stronger guarantees of independence as k increases, enabling analysis of higher-order interactions in hashing schemes. The primary benefits of strong and k-wise universality lie in their ability to yield tighter probabilistic bounds on multi-key collisions and dependent events compared to basic universality; for instance, strong universality directly implies the standard universal property, as \Pr[h(x) = h(y)] = \sum_{a=0}^{m-1} \Pr[h(x) = a \land h(y) = a] = \sum_{a=0}^{m-1} (1/m)^2 = 1/m. These enhanced properties support more precise performance guarantees in structures sensitive to variance or tail probabilities, such as those involving multiple insertions or queries. However, achieving k-wise universality for k > 2 typically requires larger family sizes than 2-universal families, often scaling polynomially with k and the universe size to maintain the exact probabilities, which can increase storage and selection overhead.
A high-level outline for constructing a 3-universal family involves parameterizing hash functions as low-degree polynomials over a suitable finite field, with random coefficients chosen to enforce the triple-wise condition.
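Concretely, one standard instantiation draws a uniformly random polynomial of degree at most 2 over F_p (sketched here for a toy prime p, with range equal to Z_p; all names are illustrative). By Lagrange interpolation, exactly one such polynomial passes through any three prescribed points, which gives the exact 1/p^3 probability required for 3-wise universality:

```python
# 3-wise independence from random degree-<=2 polynomials over F_p.
# p = 5 is a toy prime; the family maps Z_p -> Z_p.
from itertools import product

p = 5
family = list(product(range(p), repeat=3))      # coefficients (c2, c1, c0)

def h(c, x):
    c2, c1, c0 = c
    return (c2 * x * x + c1 * x + c0) % p

# Fix three distinct keys; for every triple of target values, exactly
# |family| / p^3 = 1 polynomial interpolates them (Lagrange interpolation).
keys = (0, 1, 3)
for vals in product(range(p), repeat=3):
    count = sum(1 for c in family
                if all(h(c, k) == v for k, v in zip(keys, vals)))
    assert count == 1      # probability exactly 1/p^3 per value triple
```

The same idea extends to k-wise universality with polynomials of degree at most k - 1.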

Constructions

Hashing integers

One of the foundational constructions for universal hash families targeting integer keys is the Carter-Wegman method, which maps elements from a universe U of integers to a range M = \{0, 1, \dots, m-1\}. The hash function is defined as h_{a,b}(x) = ((a x + b) \mod p) \mod m, where p is a prime number greater than |U|, a \in \{1, 2, \dots, p-1\} (ensuring a \not\equiv 0 \mod p), and b \in \{0, 1, \dots, p-1\}, with a and b chosen uniformly at random. This family H = \{ h_{a,b} \} achieves 2-universality by bounding collisions to at most 1/m. The 2-universality of this construction is established in the original paper, showing that for any distinct integers x, y \in U with x \neq y, the collision probability \Pr[h_{a,b}(x) = h_{a,b}(y)] \leq \frac{1}{m}. For practical implementation with w-bit integers (where |U| \leq 2^w), a suitable prime p is chosen greater than 2^w, often a large prime to minimize computational overhead while ensuring p > |U|. This choice allows efficient modular arithmetic using integer operations, achieving O(1) time complexity per hash evaluation on standard hardware, as multiplications and reductions modulo p (such as a Mersenne prime like 2^{61} - 1 for larger keys) can leverage bit shifts and additions. As an illustrative example, consider hashing 32-bit integers (w = 32, |U| = 2^{32}) into m = 1024 buckets using p = 4294967311 (a prime). For distinct x and y, the collision probability is at most \frac{1}{1024} \approx 0.000976, meaning that in a set of n keys, the expected number of colliding pairs is at most \binom{n}{2} / 1024, which remains manageable for moderate n (e.g., for 10,000 keys, at most around 48,800).
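A minimal sketch of this scheme in code follows; the parameters (p = 4294967311, m = 1024) mirror the example above, while the function and variable names are illustrative:

```python
# Carter-Wegman hashing of 32-bit integers: h_{a,b}(x) = ((a*x + b) % p) % m.
import random

P = 4294967311        # prime just above 2^32
M = 1024              # number of buckets

def make_hash(rng):
    """Draw h_{a,b} uniformly at random from the 2-universal family."""
    a = rng.randrange(1, P)        # a != 0 (mod p)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % M

h = make_hash(random.Random(42))   # seeded only for reproducibility
assert all(0 <= h(x) < M for x in (0, 1, 2**32 - 1))
```

Redrawing (a, b) yields a fresh member of the family, which is exactly what a hash table does when it rehashes.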

Hashing strings and sequences

A prominent construction for universal hashing of strings and sequences interprets the input as coefficients of a polynomial over a finite field, extending the methods for integers discussed previously. Consider a string s = s_{l-1} \dots s_0 of length l over an alphabet \Sigma embedded into the field \mathbb{F}_p, where p is a large prime. The hash function evaluates the polynomial P_s(r) = \sum_{i=0}^{l-1} s_i r^i \mod p, with r chosen uniformly at random from \{1, 2, \dots, p-1\}, and then applies a final mod m operation to map into the range \{0, 1, \dots, m-1\}, where m is the desired output size. This approach, originally proposed in the context of universal hash families, achieves near-universality because for any two distinct strings s and t, the polynomials differ, so P_s - P_t is a nonzero polynomial of degree at most l - 1 with at most l - 1 roots; the probability that P_s(r) \equiv P_t(r) \pmod{p} is therefore at most (l-1)/(p-1), and for p \gg l m the overall collision probability is approximately 1/m. To enhance practicality, r is often selected as a primitive root modulo p to promote dispersion across the field, particularly when \Sigma is the ASCII alphabet with 128 symbols. The method draws inspiration from the Rabin–Karp algorithm for string matching, which employs a similar polynomial rolling hash but fixes parameters for deterministic use; randomizing r elevates it to a universal family suitable for hashing applications. For fixed-length strings, this directly yields the hash value; however, the construction preserves universality even without length padding, as distinct sequences produce distinct polynomials. Strings of variable lengths are accommodated by padding shorter ones with zero symbols to a maximum length or, more efficiently, via the iterative form of Horner's rule: start with h_0 = 0 and compute h_i = (r \cdot h_{i-1} + s_i) \mod p for i = 1 to l, followed by reduction mod m. This variant not only handles variable lengths but also enables O(1) updates for sliding windows over sequences, maintaining a collision probability of approximately 1/m for distinct keys under random r. Such flexibility is crucial for sequence data where lengths vary, ensuring the hash family remains universal across all possible input lengths.
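The Horner iteration above can be sketched as follows; the modulus 2^61 - 1, the output range, and the seeding are illustrative choices, and the attribute `h.r` is exposed only so the direct polynomial form can be checked against the iterative one:

```python
# Randomized polynomial string hashing via Horner's rule:
# h(s) = (sum_i s_i * r^i mod p) mod m, with r drawn at random.
import random

P = 2**61 - 1          # Mersenne prime modulus
M = 1 << 20            # output range m

def make_string_hash(rng):
    r = rng.randrange(1, P)
    def h(s):
        acc = 0
        for ch in s:                       # Horner step: acc = acc*r + symbol
            acc = (acc * r + ord(ch)) % P
        return acc % M
    h.r = r                                # exposed for the check below
    return h

h = make_string_hash(random.Random(7))
s = "key"
direct = sum(ord(c) * pow(h.r, len(s) - 1 - i, P)
             for i, c in enumerate(s)) % P % M
assert h(s) == direct                      # Horner == direct evaluation
```

Because each Horner step is O(1), appending a symbol to an already-hashed sequence costs constant time, which is what makes the sliding-window updates mentioned above possible.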
In practice, this polynomial hashing serves as a fingerprinting mechanism for applications like plagiarism detection, where ASCII text documents are processed to generate compact hashes of passages for similarity comparison; typical parameters include a 64-bit prime p \approx 2^{64} - 2^{32} + 1 and r = 31 or another small primitive root to balance speed and collision resistance.

Hashing vectors and tuples

One approach to hashing vectors involves applying universal hash functions component-wise. Consider a vector \mathbf{v} = (v_1, v_2, \dots, v_d) where each v_i belongs to a universe U of integers. Select d independent universal hash functions h_i: U \to \{0, 1, \dots, m-1\} from a universal family, and define the hash as h(\mathbf{v}) = \left( \sum_{i=1}^d h_i(v_i) \right) \mod m. This construction, known as vector multiply-shift when the h_i are linear, ensures efficient evaluation for moderate d, and can approximate low collision probabilities when vectors differ in few components. The universality may be preserved under stronger conditions on the h_i, such as pairwise independence. For guaranteed 2-universality, more robust methods are preferred. For fixed-size vectors over finite fields or rings, the Toeplitz matrix method provides a compact representation. A Toeplitz matrix T of size k \times d (with m = p^k for prime p) has constant values along each diagonal, allowing specification with O(k + d) random elements from \mathbb{Z}_m. The hash is computed as h(\mathbf{v}) = T \mathbf{v} \mod m, where the matrix-vector multiplication yields a result reduced modulo m. When T is chosen uniformly at random from Toeplitz matrices, this family achieves strong universality, with collision probability exactly 1/m for distinct \mathbf{v}, \mathbf{w}, due to the linearity properties over the field. These methods are particularly useful for hashing multi-attribute data, such as IP addresses represented as tuples of integer components (with each component hashed via multiply-shift mod m) or database records as fixed-length tuples of fields. In networking applications, this enables uniform distribution of 128-bit IPv6 addresses into smaller tables without pathological collisions.
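A sketch of the Toeplitz construction over a small prime field follows; the sizes p, k, d and all names are illustrative, and a random offset vector is added so the family is affine (the usual way to obtain strong universality rather than only a collision bound on differences):

```python
# Toeplitz-matrix vector hashing into Z_p^k: T is constant along each
# diagonal, so k + d - 1 random field elements fully specify it.
import random

p, k, d = 101, 3, 4                  # toy field and dimensions
rng = random.Random(1)
diag = [rng.randrange(p) for _ in range(k + d - 1)]   # T[i][j] = diag[i-j+d-1]
off = [rng.randrange(p) for _ in range(k)]            # random affine offset

def toeplitz_hash(v):
    """h(v) = T v + off (mod p), computed row by row."""
    assert len(v) == d
    return tuple((sum(diag[i - j + d - 1] * v[j] for j in range(d)) + off[i]) % p
                 for i in range(k))

out = toeplitz_hash((10, 20, 30, 40))
assert len(out) == k and all(0 <= c < p for c in out)
```

Compared with a fully random k x d matrix, this needs only O(k + d) rather than O(kd) random field elements, which is the compactness advantage mentioned above.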

Applications and extensions

Use in data structures

Universal hashing plays a crucial role in enhancing the performance and robustness of hash tables by mitigating the risks of poor hash-function choices that could lead to excessive collisions. In standard chained implementations, a random hash function is selected from a universal family at initialization. This approach ensures that for any two distinct keys, the probability of collision is at most 1/m, where m is the table size, leading to expected constant-time O(1) performance for insertions, deletions, and lookups when the load factor is kept below 1. To maintain this efficiency, tables often rehash all elements using a new random function from the family upon reaching a high load factor, with the overall amortized cost remaining O(1) per operation due to the bounded expected chain lengths. In open-addressing hash tables, universal hashing addresses the issue of clustering, where collisions cause long probe sequences. Double hashing serves as a practical variant, employing two independent universal hash functions h_1 and h_2 to generate probe sequences defined as h(k, i) = (h_1(k) + i \cdot h_2(k)) \mod m for the i-th probe. This method approximates the behavior of ideal random probing, distributing probes more uniformly and avoiding the primary clustering seen in linear probing with fixed hashes, thereby achieving expected probe counts of about 1/(1-\alpha), where \alpha is the load factor. The universality of h_1 and h_2 ensures that the probe sequences for different keys are sufficiently independent, preventing worst-case degenerations. Cuckoo hashing further leverages universal hashing by using multiple (typically two) hash functions from a universal family to assign each key to one of several possible slots across two or more tables. During insertion, if a slot is occupied, the existing key is evicted and "kicked" to an alternative slot determined by its own hash functions, potentially triggering a chain of reassignments.
This process succeeds with high probability, as the universal property guarantees that the graph of possible placements remains sparse, resulting in a failure probability that is polynomially small in the table size for load factors below 0.5, after which a rehash with fresh functions suffices. Empirically, universal hashing in open-addressing tables exhibits reduced clustering and shorter probe sequences compared to fixed hash functions, particularly under adversarial key distributions, leading to more predictable and efficient performance in real-world applications such as caches.
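The double-hashing probe sequence described above can be sketched as follows; the table size, the modulus, and the trick of forcing the step h_2(k) to be odd (so it is coprime to a power-of-two table size and the sequence visits every slot) are illustrative implementation choices, not prescribed by the text:

```python
# Double hashing with two independently drawn universal functions:
# probe i lands at (h1(key) + i * h2(key)) mod m.
import random

M = 16                       # table size, power of two
P = 2**61 - 1                # prime for the underlying universal family

def draw(rng):
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * x + b) % P) % M

rng = random.Random(3)
h1, h2_raw = draw(rng), draw(rng)

def h2(x):
    return h2_raw(x) | 1     # odd step is coprime to 2^t => full cycle

def probe_sequence(key):
    return [(h1(key) + i * h2(key)) % M for i in range(M)]

seq = probe_sequence(1234)
assert sorted(seq) == list(range(M))   # every slot probed exactly once
```

Because the step size varies per key, two keys that collide on their first probe almost surely diverge afterwards, which is what breaks the clustering seen with fixed linear probing.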

Role in cryptography and beyond

Universal hashing plays a pivotal role in cryptography, particularly in constructing message authentication codes (MACs) that provide information-theoretic security. The Carter-Wegman MAC, introduced in the late 1970s, leverages strongly universal hash functions combined with a pseudorandom function or one-time pad to authenticate messages, ensuring that an adversary cannot forge a valid tag with probability better than 1/2^k for a k-bit tag, even after seeing polynomially many message-tag pairs. This construction achieves security against existential forgery under chosen-message attacks, providing unconditional security even against computationally unbounded adversaries, making it a foundational primitive for unconditionally secure authentication. Universal hash functions also underpin practical keyed hashing schemes for data integrity in protocols like HMAC and various stream ciphers. While standard HMAC relies on collision-resistant cryptographic hashes, variants and related constructions such as UMAC employ universal hash families to enable fast, parallelizable authentication with provable security bounds against forgery, processing messages at rates exceeding gigabits per second on modern hardware. In stream cipher-based authenticated encryption, such as ChaCha20-Poly1305, a universal hash like Poly1305 computes tags over ciphertexts, providing integrity while resisting quantum attacks when paired with appropriate keys, as the hash's security holds information-theoretically. Beyond cryptography, universal hashing enhances probabilistic data structures for approximate membership queries, such as Bloom filters, where multiple universal hash functions map elements to bit positions, minimizing false positives through pairwise independence guarantees that bound collision probabilities to 1/m for an m-bit array.
In streaming algorithms, universal hashes facilitate efficient estimation of distinct elements (the F_0 moment) in massive data streams; for instance, the optimal algorithm uses O(1/ε^2 + log n) bits of space and pairwise independent hashing to sample and count unique items with a (1±ε) approximation, with the failure probability driven down to δ by O(log(1/δ)) independent repetitions, enabling real-time analysis without storing the entire stream. Despite these strengths, universal hashing faces limitations in quantum settings, where Grover's algorithm can accelerate collision searches, necessitating quantum-resistant variants. Recent work has developed post-quantum universal hash families, such as ε-almost universal constructions over finite fields, ensuring security against quantum adversaries while maintaining efficiency for MACs and sketches in hybrid classical-quantum protocols.
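A toy Wegman-Carter-style one-time MAC can be sketched as below; the field size, the key handling, and the use of a seeded RNG in place of a real pseudorandom function or one-time pad are all simplifying assumptions for illustration:

```python
# Wegman-Carter-style MAC sketch: the tag is a strongly universal hash
# of the message, masked with a fresh one-time pad value per message.
import random

P = 2**61 - 1       # prime field for the universal hash

def tag(hash_key, pad, msg):
    a, b = hash_key
    return ((a * msg + b) % P) ^ pad     # mask hides the hash output

rng = random.Random(9)                   # stand-in for real key material
hash_key = (rng.randrange(1, P), rng.randrange(P))   # reusable hash key
pad = rng.randrange(2**61)               # must be fresh for every message

t = tag(hash_key, pad, 123456789)
assert tag(hash_key, pad, 123456789) == t   # receiver recomputes the tag
```

The hash key can be reused across messages because each tag is masked by a fresh pad; forging a tag for a new message then requires guessing the universal hash output, which succeeds with probability at most roughly 1/P.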
