# Hash collision
A hash collision, also known as a hash clash, is a situation in which two distinct inputs to a hash function produce identical output values.[1] In computer science, hash functions map data of arbitrary size to fixed-size values, typically for efficient storage and retrieval in data structures such as hash tables; by the pigeonhole principle, collisions are inevitable whenever the input space exceeds the output space.[2] To manage collisions in hash tables, common resolution strategies include separate chaining, which links colliding elements in lists at the same index, and open addressing, which probes for the next available slot using methods such as linear probing or double hashing.[3][4]
In cryptography, hash collisions pose significant security risks, as they can undermine the integrity of digital signatures, certificates, and blockchain applications by allowing attackers to forge data with the same hash.[5] Cryptographic hash functions are thus designed for collision resistance, making it computationally infeasible for adversaries to find such pairs using probabilistic polynomial-time algorithms, often relying on properties like preimage and second-preimage resistance as well.[6][7] Notable real-world vulnerabilities include the 2004 practical collision attack on MD5, which demonstrated forging certificates, and the 2017 SHAttered attack on SHA-1, which produced colliding PDFs and accelerated its deprecation in favor of stronger standards like SHA-256 from the NIST SHA-2 family and the SHA-3 family.[8][9][10] These developments highlight the ongoing evolution of hash functions to counter advancing computational power and cryptanalytic techniques.[11]
## Fundamentals
### Hash functions
A hash function is a deterministic procedure that takes an input of arbitrary size and produces a fixed-size output value, often referred to as a hash code, hash value, or digest.[12] This mapping enables efficient data organization and retrieval in structures like hash tables, where the output serves as an index into a fixed array.[13]
Key properties of effective hash functions include determinism, which ensures that identical inputs always yield the same output; uniformity, which promotes an even distribution of hash values across the output space to minimize clustering; and computational efficiency, allowing rapid calculation even for large inputs.[14] These attributes are crucial for practical applications, as poor uniformity can lead to uneven load distribution, while inefficiency hampers performance in time-sensitive operations.[13]
Hash functions are broadly categorized into non-cryptographic and cryptographic types. Non-cryptographic hash functions, such as those used in hash tables, prioritize speed and uniformity over security; an example is multiplicative hashing, where the input key k is multiplied by a constant a with 0 < a < 1 and the fractional part of the product is scaled by the table size m, yielding h(k) = \lfloor m \cdot \{k a\} \rfloor, where \{ka\} denotes the fractional part of ka.[15] In contrast, cryptographic hash functions, like SHA-256, emphasize resistance to attacks such as finding collisions or preimages, making them suitable for security protocols but often slower for general-purpose use.[16]
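As an illustration, the multiplicative method can be sketched in a few lines of Python. The constant a = (√5 − 1)/2 is Knuth's commonly cited suggestion; the function name is purely illustrative:

```python
import math

def mult_hash(k: int, m: int, a: float = (math.sqrt(5) - 1) / 2) -> int:
    """Multiplicative hashing: h(k) = floor(m * frac(k * a))."""
    frac = (k * a) % 1.0   # fractional part {k*a}, in [0, 1)
    return int(m * frac)   # scale to a slot index in [0, m-1]
```

A virtue of this scheme is that, unlike the division method, the quality of the distribution is not sensitive to the choice of m.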
Mathematically, a simple hash function can be represented using the division method: h(k) = k \mod m, where k is the input key treated as an integer and m is the size of the hash table, producing an index in the range [0, m-1].[17] This approach is straightforward but requires careful selection of m (often a prime number) to achieve good distribution.[18]
Representative examples include the polynomial rolling hash, commonly used for string processing, defined as h(s) = \sum_{i=0}^{n-1} s_i \cdot b^{n-1-i} \mod p, where s_i is the i-th character of the string s, b is a base (e.g., 31), and p is a large prime modulus; this allows efficient incremental updates for substrings.[19] Another example is Java's hashCode() method in the Object class, which returns a consistent integer for the object to support hashing in collections like HashMap, with the contract requiring equal objects to produce equal hash codes for consistent bucketing.[20]
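A sketch of the polynomial rolling hash in Python, using Horner's rule for the initial window and the O(1) incremental update mentioned above (the base and modulus defaults are illustrative):

```python
def poly_hash(s: str, b: int = 31, p: int = 10**9 + 7) -> int:
    """h(s) = sum(s_i * b^(n-1-i)) mod p, computed via Horner's rule."""
    h = 0
    for ch in s:
        h = (h * b + ord(ch)) % p
    return h

def rolling_hashes(s: str, w: int, b: int = 31, p: int = 10**9 + 7) -> list[int]:
    """Hashes of every length-w substring of s, each updated in O(1)."""
    bw = pow(b, w - 1, p)                 # weight of the outgoing character
    h = poly_hash(s[:w], b, p)
    out = [h]
    for i in range(w, len(s)):
        h = (h - ord(s[i - w]) * bw) % p  # remove the leftmost character
        h = (h * b + ord(s[i])) % p       # shift and append the new character
        out.append(h)
    return out
```

Because each window hash is derived from the previous one, all length-w substring hashes of an n-character string are computed in O(n) rather than O(nw) time, which is the basis of Rabin-Karp string matching.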
### Hash tables
A hash table is an array-based data structure that implements an associative array, mapping keys to values by using a hash function to compute an index into an array of slots or buckets from which the corresponding value can be retrieved.[21] This structure enables efficient storage and access, supporting operations such as lookups, insertions, and deletions with average constant time complexity under suitable conditions.[22] The hash function serves as the primary indexing mechanism, transforming keys into array indices to facilitate rapid access.[23]
The core operations of a hash table are straightforward in their basic form. For insertion, the hash function is applied to the key to determine the target slot in the array, where the key-value pair is then stored.[23] Search involves computing the hash of the key to locate the slot and retrieve the associated value if present.[23] Deletion follows a similar process: the hash identifies the slot, and the key-value pair is removed from it.[23] In the ideal scenario with no collisions—where each key maps uniquely to a distinct slot—these operations achieve O(1) time complexity, akin to direct array access.[23]
Selecting an appropriate table size is crucial for effective performance. The array size is often chosen as a prime number to promote even distribution of keys when using modular hashing, or as a power of 2 to enable efficient bitwise operations in certain implementations.[22][24] The load factor, defined as the ratio of the number of stored elements to the total number of slots, is typically maintained below 0.7; exceeding this threshold increases the potential for performance degradation, prompting resizing such as doubling the table size and rehashing all elements.[25]
### Definition of collision
In hashing, a collision occurs when two distinct keys, denoted as k_1 and k_2 where k_1 \neq k_2, are mapped to the same hash value by the hash function, such that h(k_1) = h(k_2). This phenomenon arises in hash tables, where the hash function h transforms keys from a potentially large universe into a fixed number of slots in an array. The term "collision" specifically refers to this mapping overlap, distinguishing it from other hashing issues like poor distribution.[26]
Collisions are inevitable in practical hash tables due to the pigeonhole principle: if the number of keys n exceeds the number of available slots m (i.e., n > m), at least one slot must contain more than one key, guaranteeing a collision. Even when n \leq m, collisions can occur probabilistically because the hash function compresses a vast key space into a finite range, making perfect injectivity impossible for most inputs. In hashing terminology, colliding keys are sometimes called "synonyms," a term used to describe elements that share the same hash address, though "collision" is the more standard modern designation.[27]
To illustrate, consider a simple hash table with 5 slots (indices 0 to 4) and a hash function h(k) = k \mod 5. Inserting keys 3, 8, and 13 yields h(3) = 3, h(8) = 3, and h(13) = 3, causing all three to collide at slot 3:
| Slot | 0 | 1 | 2 | 3 | 4 |
|------|---|---|---|---|---|
| Keys |   |   |   | 3, 8, 13 |   |
This clustering at a single slot exemplifies how multiple distinct keys can target the same position.
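The slot-3 example above can be reproduced directly in Python (a minimal sketch using lists as buckets):

```python
def h(k: int, m: int = 5) -> int:
    return k % m

table = [[] for _ in range(5)]   # 5 empty slots, indices 0-4
for key in (3, 8, 13):
    table[h(key)].append(key)    # all three keys map to slot 3

print(table)  # slots 0-2 and 4 stay empty; slot 3 holds all three keys
```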
Without proper resolution, collisions degrade hash table performance significantly: ideal operations like search, insert, and delete achieve O(1) average time by direct slot access, but in the worst case—such as when all keys collide in one slot—they revert to O(n) time due to linear scanning of the chain or probes.[26] This underscores collisions as a fundamental failure mode in the otherwise efficient direct-addressing paradigm of hash tables.
## Causes and probability
### Deterministic factors
Deterministic factors contributing to hash collisions arise primarily from flaws in hash function design and predictable patterns in input data, leading to non-uniform distribution of keys across hash table buckets. A common issue is the selection of a poor hash function, such as one that uses division by a non-prime modulus m, which can result in systematic clustering where multiple keys map to the same subset of buckets. For instance, if the table size is a power of 2 or 10, keys sharing similar low-order bits—such as strings or numbers with repeating patterns—will frequently collide because the modulo operation preserves these correlations, exacerbating uneven load distribution.[28]
Input clustering further amplifies deterministic collisions when similar keys, like consecutive integers or strings with shared prefixes (e.g., "apple" and "apricot"), are processed. In such cases, a simplistic hash function that emphasizes only certain bits or characters may direct these keys to identical or adjacent buckets, creating localized overloads independent of table load factor. An illustrative example involves treating strings as base-256 integers and hashing modulo 255, where permutations like "mac" and "cam" yield the same value since 256 \equiv 1 \pmod{255}, causing unavoidable collisions for patterned textual data.[28]
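A short Python check of this effect: because 256 ≡ 1 (mod 255), the base-256 value of a string reduces to the sum of its byte values mod 255, so any permutation of the same characters collides (the function name is illustrative):

```python
def base256_mod255(s: str) -> int:
    """Treat s as a base-256 integer and reduce it modulo 255."""
    n = 0
    for ch in s:
        n = n * 256 + ord(ch)
    return n % 255

# Permutations of the same letters always collide under this hash,
# since the positional weights 256^i all collapse to 1 mod 255.
assert base256_mod255("mac") == base256_mod255("cam")
assert base256_mod255("mac") == sum(map(ord, "mac")) % 255
```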
Fixed table sizes also introduce deterministic effects during resizing operations, where rehashing all entries into a larger array can temporarily spike collisions if the new size poorly interacts with the existing key distribution. This process, typically triggered when the load factor exceeds a threshold like 0.75, requires recomputing hashes for every key, potentially leading to short-term performance degradation before the reduced load factor stabilizes distribution.[29]
Certain hash functions, such as linear congruential generators of the form h(k) = (a k + b) \mod m, are particularly prone to failures on patterned inputs; for example, when applied to strings with common prefixes or sequential data, poor choices of the parameters a, b, or m (e.g., m = 41 with a radix of 42) result in systematic mappings to few buckets, undermining uniformity. To mitigate these issues, selecting a prime modulus m promotes better spreading by minimizing arithmetic correlations, while employing families of universal hash functions, for which the probability of collision between any two distinct keys is at most 1/m, provides robustness against adversarial or patterned inputs without relying on fixed designs.[28][30]
### Probabilistic analysis
In probabilistic analysis of hash collisions, a key assumption is that of uniform random hashing, where each key is independently and uniformly mapped to any of the m slots in the hash table with equal probability 1/m. This model, introduced in the context of universal hash functions, ensures that for any two distinct keys x and y, the probability that they hash to the same slot is at most 1/m.[31]
Under this assumption, the probability that no collisions occur when inserting n keys into a table of size m can be bounded and approximated. Specifically, the probability p(n, m) that all keys hash to distinct slots satisfies p(n, m) \leq \exp\left(-\frac{n(n-1)}{2m}\right), derived from the union bound over all pairwise collisions, where each pair collides with probability at most 1/m. For large m and uniform hashing, this bound is asymptotically tight, yielding the approximation

p(n, m) \approx \exp\left(-\frac{n(n-1)}{2m}\right).[32]
The expected number of collisions can be derived using linearity of expectation on indicator random variables for each pair of keys. Let X_{ij} be the indicator that keys i and j (for $1 \leq i < j \leq n$) collide, so \Pr(X_{ij} = 1) \leq 1/m. The total number of collisions X = \sum_{1 \leq i < j \leq n} X_{ij} has expectation

\mathbb{E}[X] = \sum_{1 \leq i < j \leq n} \mathbb{E}[X_{ij}] \leq \binom{n}{2} \cdot \frac{1}{m} = \frac{n(n-1)}{2m}.

For uniform hashing, equality holds in the approximation.[](https://www.cs.princeton.edu/~hy2/teaching/fall22-cos521/notes/lec2.pdf)
The load factor $\alpha = n/m$ plays a central role, as the expected number of collisions grows quadratically with $\alpha$, approximately $\alpha^2 m / 2$. Collisions thus remain low for $\alpha \ll 1$ but escalate rapidly as $\alpha$ approaches 1, motivating table resizing in practice.[](https://courses.engr.illinois.edu/cs473/sp2016/notes/12-hashing.pdf)
To illustrate, consider varying $n$ and $m$: for $m = 1000$ and $n = 10$ ($\alpha = 0.01$), the probability of no collision is approximately $\exp(-0.045) \approx 0.956$, with expected collisions $\approx 0.045$. For $n = 100$ ($\alpha = 0.1$), this drops to $\approx \exp(-4.95) \approx 0.007$, with expected collisions $\approx 4.95$. For $n = 500$ ($\alpha = 0.5$), the probability is $\approx \exp(-124.75) \approx 10^{-54}$, with expected collisions $\approx 124.75$, showing near-certainty of multiple collisions. These values highlight how collision rates surge with increasing load.[](https://didawiki.cli.di.unipi.it/lib/exe/fetch.php/matematica/asd/asd_17/clrsperfecthash.pdf)
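These figures can be reproduced with a short script that compares the exact no-collision probability against the exponential approximation (a sketch under the uniform-hashing assumption):

```python
import math

def p_no_collision(n: int, m: int) -> float:
    """Exact probability that n uniformly hashed keys occupy distinct slots."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m   # i-th key must avoid the i slots already taken
    return p

def expected_collisions(n: int, m: int) -> float:
    """E[X] = n(n-1)/(2m) under uniform hashing."""
    return n * (n - 1) / (2 * m)

m = 1000
for n in (10, 100, 500):
    approx = math.exp(-n * (n - 1) / (2 * m))
    print(n, p_no_collision(n, m), approx, expected_collisions(n, m))
```

The exponential approximation tracks the exact product closely at low load and becomes a conservative overestimate of survival probability as n grows.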
### Birthday paradox
The birthday paradox illustrates a counterintuitive aspect of probability: in a group of just 23 randomly selected people, the probability that at least two share the same birthday exceeds 50%, assuming 365 equally likely birthdays and ignoring leap years. This result arises because the likelihood of a shared birthday grows quadratically with the group size, rather than linearly as intuition might suggest.[](http://belohlavek.inf.upol.cz/vyuka/Cormen-birthday-paradox-analysis.pdf)
In the context of hashing, the birthday paradox serves as a direct analogy for collisions in hash tables, where birthdays represent hash slots and people represent keys. For n keys hashed uniformly at random into m slots, the probability of at least one collision is approximately $1 - e^{-n^2 / (2m)}$, which reaches 50% when $n \approx \sqrt{2 m \ln 2}$. To sketch the derivation, consider the probability of no collisions: it equals the product $\prod_{i=1}^{n-1} \left(1 - \frac{i}{m}\right)$. For $n \ll m$, this approximates $e^{-\sum_{i=1}^{n-1} i/m} = e^{-n(n-1)/(2m)} \approx e^{-n^2 / (2m)}$, so the collision probability follows as its complement. This shows collisions become likely far sooner than the naive $n > m$ threshold.[](http://belohlavek.inf.upol.cz/vyuka/Cormen-birthday-paradox-analysis.pdf)
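Both the birthday figure and the hashing analogy follow from the same product formula; a minimal Python sketch:

```python
def p_collision(n: int, m: int) -> float:
    """Probability that at least two of n uniform keys share one of m slots."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (m - i) / m
    return 1.0 - p_distinct

# Classic birthday paradox: 23 people, 365 equally likely birthdays.
print(p_collision(23, 365))          # just over one half

# Same formula applied to a 32-bit hash space.
print(p_collision(77_164, 2**32))    # roughly one half
```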
Applying this to practical hashing, a 32-bit [hash function](/page/Hash_function) provides $m = 2^{32} \approx 4.3 \times 10^9$ slots, yet inserting about 77,000 keys yields a 50% chance of collision—demonstrating how even vast hash spaces fill unexpectedly quickly under random mapping. The [birthday problem](/page/Birthday_problem) itself originated in [probability theory](/page/Probability_theory) with an analysis by [Richard von Mises](/page/Richard_von_Mises) in 1939 and gained prominence in [computer science](/page/Computer_science) for [hash table](/page/Hash_table) analysis starting in the 1970s, notably in seminal texts on algorithms.[](http://belohlavek.inf.upol.cz/vyuka/Cormen-birthday-paradox-analysis.pdf)[](https://mathworld.wolfram.com/BirthdayProblem.html)
## Resolution strategies
### Separate chaining
Separate chaining is a collision resolution technique for hash tables in which each slot of the underlying [array](/page/Array) points to a [linked list](/page/Linked_list) that stores all keys hashing to that slot, thereby grouping colliding elements together without overwriting any. This approach transforms the hash table into an [array](/page/Array) of lists, where collisions—occurrences of multiple distinct keys mapping to the same slot—are handled by appending elements to the appropriate list.[](https://ocw.mit.edu/courses/6-006-introduction-to-algorithms-fall-2011/d3e4d64266d481c74c9e7c15e09999fe_MIT6_006F11_lec08.pdf)
Insertion involves computing the [hash](/page/Hash) value of the [key](/page/Key) to identify the target slot and then appending the key-value pair to the front of the [linked list](/page/Linked_list) at that slot, ensuring constant-time addition under simple uniform hashing assumptions. Search and deletion operations require traversing the [linked list](/page/Linked_list) at the computed slot linearly from the head until the matching [key](/page/Key) is located or the list ends, with deletion typically involving pointer adjustments to remove the [node](/page/Node). These processes leverage the auxiliary storage of the lists to maintain all relevant elements without probing other slots.[](https://ocw.mit.edu/courses/6-006-introduction-to-algorithms-fall-2011/d3e4d64266d481c74c9e7c15e09999fe_MIT6_006F11_lec08.pdf)[](https://www.cs.princeton.edu/courses/archive/fall05/cos226/lectures/hash.pdf)
The average-case time complexity for insertion, search, and deletion is $O(1 + \alpha)$, where $\alpha = n/m$ is the load [factor](/page/Factor) with $n$ elements and $m$ slots, assuming keys [hash](/page/Hash) uniformly and independently; this remains efficient even for $\alpha > 1$. In the worst case, performance degrades to $O(n)$ if all keys collide into a single list.[](https://ocw.mit.edu/courses/6-006-introduction-to-algorithms-fall-2011/d3e4d64266d481c74c9e7c15e09999fe_MIT6_006F11_lec08.pdf)[](https://www.cs.princeton.edu/courses/archive/fall05/cos226/lectures/hash.pdf)
Separate chaining offers simplicity in implementation and robust handling of high load factors without requiring table resizing during insertions, making it suitable for scenarios with unpredictable collision rates. A key drawback is the memory overhead from pointers in linked lists and potential inefficiency from pointer chasing, which can hinder performance on modern hardware. For practical implementations, especially with long chains, linked lists may be replaced by balanced [binary](/page/Binary) search trees to achieve $O(\log k)$ search time per chain, where $k$ is the chain length, though this adds complexity.[](https://ocw.mit.edu/courses/6-006-introduction-to-algorithms-fall-2011/d3e4d64266d481c74c9e7c15e09999fe_MIT6_006F11_lec08.pdf)[](https://www.cs.princeton.edu/courses/archive/fall05/cos226/lectures/hash.pdf)
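A minimal separate-chaining table in Python, sketching the insert, search, and delete operations described above (Python lists stand in for linked lists; class and method names are illustrative):

```python
class ChainedHashTable:
    """Hash table resolving collisions by chaining key-value pairs per slot."""

    def __init__(self, m: int = 8):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def _slot(self, key) -> int:
        return hash(key) % self.m

    def insert(self, key, value) -> None:
        bucket = self.buckets[self._slot(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key present: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))   # otherwise extend the chain

    def search(self, key):
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        return None                   # key absent

    def delete(self, key) -> bool:
        bucket = self.buckets[self._slot(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i)
                return True
        return False
```

Note that every operation touches only the one chain selected by the hash, which is why the expected cost is O(1 + α) regardless of how many other slots are loaded.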
### Open addressing
Open addressing, also known as closed hashing, is a collision resolution strategy in hash tables where all elements are stored directly within the fixed-size [array](/page/Array) of the table itself, without using external data structures. When a collision occurs—meaning the computed hash index for a key is already occupied—the method probes for the next available slot in the array according to a predefined sequence. This approach contrasts with separate chaining, which resolves collisions by linking multiple elements at the same index. The probe sequence is typically defined as $ h(k, i) = (h'(k) + f(i)) \mod m $, where $ h'(k) $ is the initial [hash function](/page/Hash_function), $ i $ is the probe number starting from 0, $ f(i) $ is the probing function, and $ m $ is the table size.[](https://users.ece.utexas.edu/~adnan/360C/hash.pdf)
Several common probing methods exist to determine the offset $ f(i) $. Linear probing uses $ f(i) = i $, sequentially checking the next slots, which is simple but prone to primary clustering where occupied slots form long contiguous blocks, increasing probe lengths. Quadratic probing employs $ f(i) = i^2 $ (or more generally $ c_1 i + c_2 i^2 $), which spreads probes more evenly and mitigates primary clustering, though it can still exhibit secondary clustering for keys sharing the same initial hash; it guarantees finding a slot within $ m/2 + 1 $ probes if $ m $ is prime and the load factor $ \alpha \leq 0.5 $. Double hashing uses $ f(i) = i \cdot h_2(k) $, where $ h_2(k) $ is a second independent hash function, providing better distribution and minimizing clustering, with experimental evidence showing strong performance when the table size is prime and $ \alpha < 0.5 $.[](https://users.ece.utexas.edu/~adnan/360C/hash.pdf)[](https://www.rose-hulman.edu/class/cs/csse230/202020/Slides/19-HashTableAnalysis.pdf)
For insertion, the process begins at the initial hash index and follows the probe sequence until an empty slot (NIL) is found, at which point the key is placed there; the average number of probes required is $ 1/(1 - \alpha) $, assuming uniform hashing and load factor $ \alpha = n/m < 1 $. Searching follows the same probe sequence from the initial index, continuing until the key is matched or an empty slot is encountered, indicating the key is absent; the expected probes for an unsuccessful search is at most $ 1/(1 - \alpha) $, while for a successful search it approximates $ (1 + 1/(1 - \alpha))/2 $. Deletion is more challenging, as simply emptying a slot would break probe sequences for subsequent searches; instead, a "tombstone" or deleted marker is used to indicate the slot is available for insertion but must be traversed during searches to preserve chain integrity.[](https://users.ece.utexas.edu/~adnan/360C/hash.pdf)
The time complexity for average-case insertion, search, and deletion in open addressing is $ O(1) $ when $ \alpha $ is kept below 0.75, but performance degrades sharply as $ \alpha $ approaches 1, potentially leading to $ O(n) $ worst-case probes without resizing. To maintain efficiency, tables are typically resized (e.g., doubled) and rehashed when $ \alpha > 0.5 $ or 0.75. Advantages include better [cache](/page/Cache) locality due to sequential [memory](/page/Memory) [access](/page/Access) and no overhead from pointers, making it space-efficient at low loads. However, it suffers from clustering issues in simpler probing schemes, complicates deletions due to tombstones (which effectively increase the load factor), and cannot handle $ \alpha > 1 $ without overflow. These trade-offs are analyzed in detail in foundational works on hashing.[](https://www.rose-hulman.edu/class/cs/csse230/202020/Slides/19-HashTableAnalysis.pdf)
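The probing, search, and tombstone-deletion logic above can be sketched as a small linear-probing table in Python. Sentinel objects mark empty and deleted slots; for brevity, insertion reuses the first free slot rather than first scanning the whole chain for an existing key, as a production table would:

```python
_EMPTY, _TOMBSTONE = object(), object()

class LinearProbingTable:
    """Open addressing with linear probing, f(i) = i, and tombstone deletion."""

    def __init__(self, m: int = 8):
        self.m = m
        self.slots = [_EMPTY] * m

    def _probe(self, key):
        i = hash(key) % self.m
        for _ in range(self.m):            # visit at most m slots
            yield i
            i = (i + 1) % self.m

    def insert(self, key, value) -> None:
        for i in self._probe(key):
            s = self.slots[i]
            if s is _EMPTY or s is _TOMBSTONE or s[0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table full; a real table would resize here")

    def search(self, key):
        for i in self._probe(key):
            s = self.slots[i]
            if s is _EMPTY:
                return None                # empty slot ends the probe chain
            if s is not _TOMBSTONE and s[0] == key:
                return s[1]
        return None

    def delete(self, key) -> bool:
        for i in self._probe(key):
            s = self.slots[i]
            if s is _EMPTY:
                return False
            if s is not _TOMBSTONE and s[0] == key:
                self.slots[i] = _TOMBSTONE  # keep later probe chains intact
                return True
        return False
```

The tombstone is what keeps search correct after deletion: emptying the slot outright would terminate probe sequences early and make later keys in the chain unreachable.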
### Coalesced hashing
Coalesced hashing is a collision resolution [strategy](/page/Strategy) for [hash tables](/page/Hash_table) that integrates aspects of [open addressing](/page/Open_addressing) and separate [chaining](/page/Chaining), utilizing a unified [overflow](/page/Overflow) structure to manage collisions more efficiently than pure [linear probing](/page/Linear_probing). In this method, the hash table consists of a primary address region followed by a single overflow area implemented as linked lists. When a collision occurs at the initial hash-computed slot, [linear probing](/page/Linear_probing) begins from that slot to locate the first available position in the overflow area, where the new entry is inserted and linked back to the original collision point, forming coalesced chains that tend to grow contiguously and reduce primary clustering.[](https://dl.acm.org/doi/10.1145/358728.358745)
The technique was originally proposed in [1959](/page/1959) as part of early efforts to handle identifiers in language processors, marking one of the initial practical implementations of hash tables during the [1950s](/page/1950s) and [1960s](/page/1960s). For insertion, a key is first hashed to its primary slot; if empty, it is placed directly there. On collision, probing proceeds linearly through occupied slots until an empty one is found in the overflow area, at which point the new entry is stored and a backward link is established from the overflow slot to the primary [hash](/page/Hash) slot, ensuring chains coalesce around collision origins.[](https://www2.math.uu.se/~svantejs/papers/sj177.pdf)
Search operations in coalesced hashing start at the primary hash slot for the key. If the slot is empty or contains a non-matching key, the process follows the forward link to traverse the associated chain in the overflow area until the key is found or the chain ends, indicating an unsuccessful search. This approach limits probe sequences primarily to chain traversals rather than extensive linear scans, enhancing locality compared to standard open addressing.[](https://dl.acm.org/doi/10.1145/358728.358745)[](https://epubs.siam.org/doi/10.1137/0212046)
Coalesced hashing mitigates the clustering issues inherent in pure [linear probing](/page/Linear_probing) by confining overflow chains to a dedicated area, leading to more predictable performance under high load factors. The expected number of probes for search is approximately $1 + \frac{\alpha}{2(1 - \alpha)}$, and for insertion and deletion approximately $1 + \frac{\alpha}{1 - \alpha}$, where $\alpha$ is the load factor. This provides better performance than linear probing's $\frac{1}{1 - \alpha}$ for unsuccessful searches while using fewer pointers than separate chaining.[](https://dl.acm.org/doi/10.1145/322374.322375)[](https://epubs.siam.org/doi/10.1137/0212046)
Variants of coalesced hashing include early-insertion schemes, where colliding entries are linked immediately after the primary slot rather than at the chain's end, and deletion methods that use deferred removal to maintain chain integrity by marking slots as logically deleted without immediate relocation. These adaptations, analyzed in subsequent works, further optimize probe counts and support dynamic operations in resource-constrained environments.[](https://dl.acm.org/doi/10.1145/358728.358745)[](https://epubs.siam.org/doi/10.1137/0212046)
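A sketch of the simplest (late-insertion, cellar-free) variant in Python: every slot carries a link to the next slot of its chain, and free slots for colliding keys are taken from the high end of the table. This is an illustrative reconstruction, not code from the cited papers:

```python
class CoalescedHashTable:
    """Late-insertion coalesced hashing without a separate cellar."""

    def __init__(self, m: int = 11):
        self.m = m
        self.keys = [None] * m
        self.link = [-1] * m     # index of next slot in the chain, -1 = end
        self.free = m - 1        # scan for free slots from the top down

    def insert(self, key) -> None:
        i = hash(key) % self.m
        if self.keys[i] is None:           # primary slot empty: store directly
            self.keys[i] = key
            return
        while self.keys[i] != key and self.link[i] != -1:
            i = self.link[i]               # walk to the end of the chain
        if self.keys[i] == key:
            return                         # key already present
        while self.free >= 0 and self.keys[self.free] is not None:
            self.free -= 1                 # locate the highest free slot
        if self.free < 0:
            raise RuntimeError("table full")
        self.keys[self.free] = key
        self.link[i] = self.free           # splice the new entry onto the chain

    def search(self, key) -> bool:
        i = hash(key) % self.m
        while i != -1:
            if self.keys[i] == key:
                return True
            i = self.link[i]
        return False
```

Chains "coalesce" in this variant because an overflow slot chosen for one chain may later be the primary slot of another key, whose chain then shares the same links.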
## Performance considerations
### Time and space complexity
The [time complexity](/page/Time_complexity) of hash table operations, including insertion, search, and deletion, is analyzed using amortized bounds, which account for the average cost over a sequence of operations assuming uniform hashing and low load factors. Under these conditions, all major collision resolution methods—separate [chaining](/page/Chaining), [open addressing](/page/Open_addressing), and coalesced hashing—achieve an average-case [time complexity](/page/Time_complexity) of O(1) per operation, as collisions are resolved efficiently without excessive probing or traversal. However, the worst-case [time complexity](/page/Time_complexity) can degrade to O(n) for separate [chaining](/page/Chaining) if all keys hash to the same bucket, or for [open addressing](/page/Open_addressing) and coalesced hashing in cases of degenerate probing sequences leading to primary clustering.[](https://algs4.cs.princeton.edu/lectures/keynote/34HashTables.pdf)[](https://www.cs.rice.edu/~as143/COMP480_580_Fall24/scribe/chaining_probing_Analysis.pdf)[](https://dl.acm.org/doi/pdf/10.1145/358728.358745)
Space complexity varies by resolution method. Separate chaining requires O(n + m) space, where n is the number of keys and m is the number of buckets, due to additional storage for pointers in linked lists at each bucket. In contrast, open addressing uses a fixed [array](/page/Array) of size m (with m > n) for O(m) [space](/page/Space_complexity), avoiding extra pointers but necessitating larger m to maintain performance. Coalesced hashing, a hybrid approach, achieves O(m) [space](/page/Space_complexity) similar to open addressing while incorporating limited linking for collisions, enabling higher load factors without overflow.[](https://graphics.stanford.edu/courses/cs161-18-winter/LectureSlides/ValiantCS161Lecture08.pdf)[](https://algs4.cs.princeton.edu/lectures/keynote/34HashTables.pdf)[](https://dl.acm.org/doi/pdf/10.1145/358728.358745)
The load factor α = n/m significantly influences performance, as it measures table utilization and collision likelihood. In separate chaining, the average chain length approximates α, leading to O(α) time for traversals in the average case. For [linear probing](/page/Linear_probing) in [open addressing](/page/Open_addressing), the average probe count for unsuccessful searches is approximately $\frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)$, which grows rapidly as α approaches 1, often prompting resizing at α ≈ 0.5 to keep costs near O(1). Coalesced hashing mitigates clustering, yielding average probe lengths around 1.3–1.4 at α ≈ 0.73, outperforming pure [linear probing](/page/Linear_probing) at higher loads.[](https://www.jsums.edu/nmeghanathan/files/2017/08/CSC323-Fall2017-Module-7-Hashing.pdf)[](https://www.cs.rice.edu/~as143/COMP480_580_Fall24/scribe/chaining_probing_Analysis.pdf)[](https://kuscholarworks.ku.edu/server/api/core/bitstreams/2c4b0386-f00d-4a2c-a004-28c0711c1f8f/content)
To manage load factor growth, hash tables employ rehashing: when α exceeds a [threshold](/page/Threshold) (typically 0.75), the table size doubles to 2m, and all n keys are reinserted into the new [array](/page/Array), incurring O(n) cost for that [operation](/page/Operation). Over a sequence of n insertions starting from an empty [table](/page/Table), the total rehashing cost is O(n), since the work done at successive doublings forms a geometric series dominated by the final rehash, yielding an amortized O(1) cost per insertion across all methods.[](https://www.cs.cornell.edu/courses/cs3110/2008fa/lectures/lec22_amort.html)
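The amortized O(1) claim can be checked empirically: the counter below tallies every key re-insertion performed during doublings, and over n inserts the total stays below 2n (a sketch; the 0.75 threshold and initial size of 8 are illustrative choices):

```python
class ResizingChainedTable:
    """Chained hash table that doubles when the load factor would exceed 0.75."""

    def __init__(self):
        self.m = 8
        self.n = 0
        self.buckets = [[] for _ in range(self.m)]
        self.rehash_ops = 0   # total key re-insertions across all doublings

    def insert(self, key) -> None:
        if (self.n + 1) / self.m > 0.75:
            self._grow()
        self.buckets[hash(key) % self.m].append(key)
        self.n += 1

    def _grow(self) -> None:
        self.m *= 2
        new_buckets = [[] for _ in range(self.m)]
        for bucket in self.buckets:           # rehash every stored key
            for key in bucket:
                new_buckets[hash(key) % self.m].append(key)
                self.rehash_ops += 1
        self.buckets = new_buckets

t = ResizingChainedTable()
for k in range(10_000):
    t.insert(k)
print(t.rehash_ops / 10_000)   # amortized rehash work per insertion, a small constant
```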
| Operation | Separate Chaining (Average/Worst) | Open Addressing (Average/Worst) | Coalesced Hashing (Average/Worst) |
|-----------------|-----------------------------------|---------------------------------|-----------------------------------|
| Insert | O(1) / O(n) | O(1) / O(n) | O(1) / O(n) |
| Search | O(1) / O(n) | O(1) / O(n) | O(1) / O(n) |
| Delete | O(1) / O(n) | O(1) / O(n) | O(1) / O(n) |
This table summarizes the big-O complexities under uniform hashing assumptions, with averages holding for α < 1 and worst cases arising from adversarial inputs or high loads.[](https://algs4.cs.princeton.edu/lectures/keynote/34HashTables.pdf)[](https://www.cs.rice.edu/~as143/COMP480_580_Fall24/scribe/chaining_probing_Analysis.pdf)[](https://dl.acm.org/doi/pdf/10.1145/358728.358745)
### Cache efficiency
In hash tables, collision resolution strategies significantly influence [CPU cache](/page/CPU_cache) performance due to how they access [memory](/page/Memory). Separate [chaining](/page/Chaining), which links colliding elements via pointers, often leads to random [memory](/page/Memory) jumps during traversal, resulting in poor spatial locality and increased cache misses as each pointer chase may fetch a new [cache](/page/Cache) line.[](https://arxiv.org/pdf/1807.04345) In contrast, open addressing methods like [linear probing](/page/Linear_probing) promote sequential [memory](/page/Memory) access, enhancing spatial locality by keeping related probes within the same or adjacent [cache](/page/Cache) lines, thereby reducing cache misses per operation.[](https://www.cs.cornell.edu/courses/JavaAndDS/files/CachingAffectsHashing.pdf)
To further optimize cache efficiency, cache-conscious variants of [open addressing](/page/Open_addressing) have been developed. [Cuckoo hashing](/page/Cuckoo_hashing), which uses multiple hash functions and displaces elements between tables or positions, can suffer from scattered accesses but benefits from compact storage that minimizes overall footprint. Hopscotch hashing improves upon this by restricting displacements to a small "neighborhood" (typically 32-64 slots) around the ideal position, ensuring related elements remain within a few cache lines and limiting lookups to at most two cache loads even at high load factors.
Key metrics for cache efficiency include cache misses per lookup or insertion. Linear probing generally incurs fewer misses than double hashing, as the latter's non-sequential jumps disrupt locality, though double hashing may perform comparably or better with larger records where probe length savings offset the cache penalty.[](https://www.cs.cornell.edu/courses/JavaAndDS/files/CachingAffectsHashing.pdf) For instance, at 70% load factors with uniform data distributions, linear probing can achieve up to 20-30% fewer cache misses than double hashing for small records (e.g., 8 bytes).[](https://www.cs.cornell.edu/courses/JavaAndDS/files/CachingAffectsHashing.pdf)
Additional optimizations target [hardware](/page/Hardware) specifics. Using power-of-two table sizes enables modulo-free indexing via bit masking, avoiding expensive division operations and improving instruction-level [cache](/page/Cache) utilization during resizing and probing.[](https://www.usenix.org/system/files/conference/atc16/atc16_paper-breslow.pdf) SIMD instructions further enhance efficiency by vectorizing probe sequences, allowing multiple comparisons per [cache](/page/Cache) line load; for example, [SSE](/page/SSE)/AVX extensions can process 4-8 probes simultaneously, reducing effective misses in [linear probing](/page/Linear_probing) schemes.[](https://www.usenix.org/system/files/conference/atc16/atc16_paper-breslow.pdf)
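The modulo-free indexing trick is easy to demonstrate: when the capacity is a power of two, reducing a hash modulo the capacity is equivalent to masking off the low bits with a single bitwise AND (variable names here are illustrative).

```python
# Power-of-two capacities turn the modulo reduction into one bitwise AND,
# avoiding an integer division on every probe.

capacity = 1 << 10        # 1024 slots: a power of two
mask = capacity - 1       # 0b1111111111

for key in ("alpha", "beta", "gamma", 12345, -7):
    h = hash(key)
    assert h % capacity == h & mask  # identical index, division-free
```

In C or C++ the same masking is what makes resizing-by-doubling cheap: the new index differs from the old one in at most one bit.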
Empirical studies from the [2000s](/page/2000s) highlight the impact of these designs. Benchmarks on early multi-core systems showed hopscotch hashing delivering up to 2x the throughput of traditional concurrent hash maps (e.g., Java's ConcurrentHashMap) at 40% load, sustaining [performance](/page/Performance) up to 90% table density thanks to minimized cache misses, whereas [cuckoo hashing](/page/Cuckoo_hashing) suffered 2-3x slowdowns from poor locality. Overall, cache-friendly collision resolution yielded 2-5x speedups over naive [chaining](/page/Chaining) in memory-bound workloads on 2000s-era processors such as the [Intel](/page/Intel) [Xeon](/page/Xeon).[](http://jakubiuk.net/stuff/hash_tables_cache_performance.pdf)
## Cryptographic aspects
### Collision resistance
Collision resistance is a core security property of cryptographic hash functions, requiring that it is computationally infeasible for an adversary to find two distinct inputs that produce the same output hash value.[](https://eprint.iacr.org/2004/035.pdf) This property ensures the hash function behaves as a one-way shrinking map while resisting deliberate attempts to generate collisions, distinguishing it from preimage resistance (recovering any input from a given hash is hard) and second-preimage resistance (finding a different input matching the hash of a known input is infeasible).[](https://eprint.iacr.org/2004/035.pdf) Unlike these, collision resistance targets the discovery of any colliding pair, with the attacker free to choose both inputs.[](https://csrc.nist.gov/glossary/term/cryptographic_hash_function)
For an ideal n-bit [cryptographic hash function](/page/Cryptographic_hash_function), finding a collision requires approximately $2^{n/2}$ operations: this is the cost of the generic birthday attack, which sets the ceiling on the collision security any n-bit output can provide.[](https://eprint.iacr.org/2004/035.pdf) This square-root reduction in effort compared to the full $2^n$ output space underscores the need for sufficiently large output sizes to maintain practical security; for instance, the SHA-256 [hash function](/page/Hash_function), with a 256-bit output, provides 128 bits of [collision resistance](/page/Collision_resistance), meaning an attacker would require about $2^{128}$ operations for a successful generic collision attack.[](https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-107r1.pdf)
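The birthday bound is easy to observe on a deliberately weakened hash: truncating SHA-256 to n bits yields a toy function whose collisions a generic search finds in roughly $2^{n/2}$ attempts (the function and message scheme below are ours, chosen purely for illustration).

```python
# Birthday-bound demonstration: SHA-256 truncated to 24 bits is weak on
# purpose, so a generic search collides after roughly 2^12 attempts --
# far fewer than the 2^24 possible outputs.

import hashlib
from itertools import count

def weak_hash(data: bytes, bits: int = 24) -> int:
    """SHA-256 truncated to `bits` bits -- deliberately weak."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)

def find_collision(bits: int = 24):
    seen = {}
    for i in count():
        msg = str(i).encode()
        h = weak_hash(msg, bits)
        if h in seen and seen[h] != msg:
            return seen[h], msg  # two distinct inputs, same truncated hash
        seen[h] = msg

a, b = find_collision(24)
assert a != b and weak_hash(a) == weak_hash(b)
```

Doubling the truncation width quadruples the expected search cost, which is why full-width outputs of 256 bits or more remain out of reach for generic attacks.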
The importance of collision resistance manifests in critical applications where hash integrity prevents tampering or forgery. In digital signatures, it guarantees that an attacker cannot substitute a signed message with a colliding one that verifies under the same signature, preserving authenticity and [non-repudiation](/page/Non-repudiation).[](https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-107r1.pdf) [Blockchain](/page/Blockchain) systems rely on it to secure hash-based structures like chains and Merkle trees, ensuring that altering transaction data without detection would require finding an infeasible collision.[](https://eprint.iacr.org/2019/315.pdf) Even in password hashing, collision resistance bolsters overall resilience by ruling out colliding credentials that could indirectly aid dictionary or brute-force attacks on stored password hashes.[](https://auth0.com/blog/birthday-attacks-collisions-and-password-strength/)
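The Merkle-tree case can be sketched concretely (a toy construction with hypothetical transaction strings, not any particular blockchain's format): every leaf hash feeds into the root, so a forger who wants to swap a transaction while keeping the root unchanged must produce a hash collision.

```python
# Toy Merkle root: each leaf is hashed, then pairs of nodes are hashed
# together level by level. Any change to a leaf propagates to the root
# unless the attacker can find a collision.

import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:            # duplicate the last node on odd levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

txs = [b"alice->bob:10", b"bob->carol:4", b"carol->dave:1"]
root = merkle_root(txs)
tampered = merkle_root([b"alice->bob:9999", b"bob->carol:4", b"carol->dave:1"])
assert root != tampered  # the change is visible at the root
```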
### Known attacks and vulnerabilities
MD5, published in 1992 by Ronald Rivest as a 128-bit [cryptographic hash function](/page/Cryptographic_hash_function), was first shown vulnerable to collision attacks in 2004, when researchers led by Xiaoyun Wang published a differential-cryptanalysis method for finding collisions in fewer than $2^{39}$ MD5 compression function calls. The breakthrough relied on identifying exploitable differential paths in MD5's structure, allowing the construction of two distinct two-block messages with the same hash value. By 2007, Marc Stevens and colleagues extended this to chosen-prefix collisions, enabling attackers to take two arbitrary differing prefixes and append crafted suffixes that make the full messages collide, at a cost of approximately $2^{50}$ MD5 evaluations.[](https://marc-stevens.nl/research/papers/EC07-SLdW.pdf) Practical exploitation materialized in 2008, when researchers including Stevens and Alexander Sotirov used colliding MD5-based [X.509](/page/X.509) certificates to forge a rogue certificate authority that browsers would accept as valid.[](https://marc-stevens.nl/research/hashclash/rogue-ca/)
SHA-1, a 160-bit [hash function](/page/Hash_function) standardized by NIST in 1995, faced increasing scrutiny leading to its deprecation for most uses by NIST in 2011 due to theoretical [collision attack](/page/Collision_attack)s well below the birthday bound.[](https://csrc.nist.gov/projects/hash-functions) A landmark practical collision was achieved in 2017 by a team from [Google](/page/Google) and CWI [Amsterdam](/page/Amsterdam), who generated two distinct PDF files with identical [SHA-1](/page/SHA-1) hashes using an identical-prefix collision attack requiring about $2^{63}$ [SHA-1](/page/SHA-1) computations on a large CPU and GPU cluster.[](https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html) The attack built on earlier differential cryptanalysis, chaining near-collision blocks to bridge differential paths across [SHA-1](/page/SHA-1)'s 80 rounds; a full chosen-prefix collision for [SHA-1](/page/SHA-1) followed in 2020. Chosen-prefix variants of these [MD5](/page/MD5) and [SHA-1](/page/SHA-1) attacks combine a birthday search over intermediate chaining values with near-collision blocks, letting attackers collide messages with arbitrary differing prefixes and amplifying the feasibility of real-world forgery.
These vulnerabilities have had significant real-world impacts. In 2012, the Flame malware exploited an MD5 chosen-prefix collision to forge a Microsoft code-signing certificate, allowing it to masquerade as a legitimate Windows update and infect systems in targeted espionage campaigns.[](https://www.microsoft.com/en-us/msrc/blog/2012/06/flame-malware-collision-attack-explained) For SHA-1, the 2017 collision raised alarms for systems like Git, which uses SHA-1 to address and verify objects; attackers could in principle introduce colliding objects to subvert version history, prompting Git to adopt a hardened, collision-detecting SHA-1 implementation (SHA-1DC).[](https://lwn.net/Articles/715716/) Similarly, SHA-1's use in SSL/TLS certificates led major browsers to phase it out in 2017, since collisions could enable man-in-the-middle attacks via forged certificates. These incidents underscore the failure of collision resistance in legacy hashes and prompted urgent mitigations.[](https://csrc.nist.gov/projects/hash-functions)
In response, the cryptographic community has accelerated transitions to more secure alternatives. NIST finalized [SHA-3](/page/SHA-3) in 2015 as a sponge-based hash family designed to resist known attack vectors, recommending its adoption alongside [SHA-2](/page/SHA-2) for applications requiring high [collision resistance](/page/Collision_resistance).[](https://csrc.nist.gov/pubs/fips/202/final) BLAKE2, an optimized variant of the SHA-3 finalist BLAKE, runs faster than [MD5](/page/MD5) or [SHA-1](/page/SHA-1) while offering digests of up to 512 bits (up to 256 bits of collision resistance), and has been standardized in [RFC](/page/RFC) 7693 for broad use. NIST guidance, including SP 800-131A Revision 2 (2019), disallowed [SHA-1](/page/SHA-1) for generating new digital signatures after 2013, and a 2022 announcement set a deadline of 2030 for retiring [SHA-1](/page/SHA-1) entirely, urging prompt migration to [SHA-2](/page/SHA-2) or [SHA-3](/page/SHA-3) for all protections.[](https://csrc.nist.gov/news/2022/nist-transitioning-away-from-sha-1-for-all-apps)
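In practice, migrating off deprecated hashes is often a small code change. Python's standard hashlib, for example, exposes SHA-2, SHA-3, and BLAKE2 through the same interface, so a legacy SHA-1 call site can typically be swapped in place (the message below is an illustrative placeholder):

```python
# Replacing deprecated hashes with modern ones via Python's hashlib.
# All four calls share the same interface, so migration is a one-line edit
# per call site.

import hashlib

data = b"message to protect"

legacy = hashlib.sha1(data).hexdigest()       # deprecated: collisions are practical
sha2 = hashlib.sha256(data).hexdigest()       # SHA-2 family, 128-bit collision resistance
sha3 = hashlib.sha3_256(data).hexdigest()     # SHA-3 (sponge construction)
blake2 = hashlib.blake2b(data, digest_size=32).hexdigest()  # BLAKE2, RFC 7693

for name, digest in [("sha256", sha2), ("sha3_256", sha3), ("blake2b-256", blake2)]:
    print(name, digest)
```

Protocols that embed digest lengths or algorithm identifiers (certificates, signatures, content-addressed stores) additionally need a negotiated rollover, which is why the NIST transition timelines above span years rather than a single release.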