Checksum
A checksum is a value computed on a block of digital data for the purpose of detecting errors that may have been introduced during its storage or transmission.[1] It is typically generated by an algorithm that produces a fixed-size datum, often a 16-, 32-, or 64-bit value, from a variable-size input; the result is appended to or stored alongside the original data.[1] Upon receipt or retrieval, the receiving system recomputes the checksum from the data and compares it to the provided value; a match indicates the data is likely intact, while a mismatch signals potential corruption or manipulation.[1] Checksums serve as a fundamental error-detection mechanism in computing, widely employed in data transmission protocols, file integrity verification, and storage systems to ensure reliability without the overhead of more robust cryptographic methods.[1]
In networking, they are integral to protocols such as the Internet Protocol (IP), User Datagram Protocol (UDP), and Transmission Control Protocol (TCP), where the Internet checksum, a 16-bit one's complement sum, helps identify transmission errors in packet headers and payloads.[2] Beyond networking, checksums support digital preservation by acting as digital fingerprints to monitor file fixity, detecting even minor alterations that could compromise data integrity.[3] Their efficiency makes them suitable for real-time applications, though they offer no protection against intentional tampering by adversaries who could recompute matching values.[1]
Various checksum algorithms exist, ranging from simple summations to more sophisticated techniques, each balancing computational cost and error-detection strength.[4] Basic types include the longitudinal redundancy check (LRC), which XORs all bytes of the data to produce a parity-like value, and the Internet checksum, which folds 16-bit words using one's complement arithmetic to yield a 16-bit result.[4] Advanced variants, such as Fletcher's checksum, use two running totals modulo 255 to improve burst-error detection over simple parity, while Adler-32 employs a similar modular approach for compression libraries like zlib.[5] Cyclic redundancy checks (CRCs), often classified alongside checksums, apply polynomial division for higher reliability in detecting multi-bit errors, and are commonly used in storage devices and Ethernet frames.[6] Selection of a checksum type depends on the error patterns expected, with simpler methods sufficing for random single-bit errors and stronger ones needed for burst errors in noisy channels.[4]
Fundamentals
Definition
A checksum is a fixed-size datum computed from a larger block of digital data, typically to detect errors that may occur during storage or transmission.[1] This value acts as a compact representation of the original data, enabling quick verification of integrity without retransmitting or re-storing the entire block.[1] The general computation process involves applying a checksum function to the data block; the function processes the contents, often by aggregating or transforming the data elements according to a predefined rule, to generate the fixed-size value.[1] The resulting checksum is appended to or associated with the data, and upon receipt or retrieval the same function is reapplied to check for discrepancies indicative of alteration.[7]
In contrast to cryptographic hash functions, which emphasize security properties such as collision resistance against adversaries, checksums prioritize computational efficiency and simplicity to facilitate rapid error detection in non-adversarial contexts.[8] They are not designed to withstand deliberate manipulation, as their algorithms are often linear or easily invertible, making them unsuitable for authentication or for protecting data against intentional attacks.[8] A basic illustrative example of checksum computation is to sum the integer values of all bytes in the data block and then take the total modulo a fixed value, such as 256, yielding a single-byte result that summarizes the data.[9] This approach highlights the foundational mechanics without requiring complex operations, though practical implementations vary in sophistication.[9]
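A minimal sketch of this byte-summing scheme, using an arbitrary message and a single corrupted byte purely for illustration:

```python
def byte_sum_checksum(data: bytes) -> int:
    """Minimal illustration: sum all byte values and reduce modulo 256."""
    return sum(data) % 256

message = b"checksum"
stored = byte_sum_checksum(message)            # computed once and kept with the data
corrupted = b"checksun"                        # a single altered byte
print(stored == byte_sum_checksum(message))    # -> True  (data intact)
print(stored == byte_sum_checksum(corrupted))  # -> False (mismatch flags corruption)
```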
Purpose
Checksums serve primarily as a mechanism for detecting accidental errors in data, such as bit flips caused by electrical noise, hardware malfunctions, or issues arising during transmission and storage.[10][11] These errors are typically random and unintentional, distinguishing checksums from cryptographic methods designed to counter deliberate tampering.[12] By appending or embedding a value computed from the original data, checksums enable the receiver to verify integrity under the assumption that any corruption is accidental rather than malicious.[13]
A key advantage of checksums is their low computational overhead, which allows efficient error checking even in resource-constrained environments such as embedded systems or high-speed networks.[14] This efficiency supports quick verification, where only the checksum needs to be recalculated and compared, avoiding the need for full data retransmission and thereby enhancing overall system reliability.[15] In protocols such as IP, UDP, and TCP, this enables robust data handling with minimal impact on performance.[11]
Unlike error-correcting codes, checksums are intended solely for detection, flagging discrepancies to prompt actions such as retransmission but not locating or repairing the errors themselves.[16] This contrasts with certain parity-bit schemes, which can support single-error correction when combined with additional redundancy, though basic parity is limited to detection in the same way as checksums.[17] The effectiveness of a checksum against random errors is statistical, with the probability of an undetected error approximately 1 in 2^k for a k-bit checksum, assuming errors are independently distributed.[18] This provides a high likelihood of catching typical transmission faults, making checksums a practical choice for maintaining data reliability in diverse computing applications.[19]
History
Early Developments
The origins of checksum concepts trace back to the challenges of reliable data transmission in 19th-century telegraphy, where ensuring message accuracy over long distances was critical.[20] The transition to digital computing in the mid-20th century introduced automated checksum techniques, particularly for data storage media. In 1951, the UNIVAC I system's Uniservo tape drive marked the first commercial use of magnetic tape for computer data storage, incorporating eight tracks consisting of six data tracks, one parity track, and one timing track to detect single-bit errors by ensuring an even or odd number of 1s across each character.[21] Similarly, IBM's 726 magnetic tape unit, announced in 1952 for the IBM 701 computer, employed parity checks on its seven-track format (six data tracks plus one parity track) to validate data read from or written to tape during early batch processing tasks. Punched-card and paper-tape systems, in use since the 1890s and absorbed into electronic computing in the 1950s, also adopted parity bits; for instance, paper tape readers in computing setups added an even-parity bit in an extra track to confirm data integrity before processing.[22]
Key innovations during this period included contributions from IBM engineer Hans Peter Luhn, whose 1953 internal memorandum outlined hashing as a method to distribute data into "buckets" for faster retrieval, effectively serving as an early checksum-like mechanism for detecting alterations in records stored on punched cards and tapes.[23] This work culminated in the Luhn algorithm, for which a patent was filed in 1954 and granted in 1960; it used a modulus-10 calculation to generate check digits for validating numerical identifiers and was widely applied in early IBM data processing for error-prone input such as account numbers.[24] Although Claude Shannon's 1948 formulation of information theory provided a theoretical basis for quantifying noise and error rates in channels, influencing subsequent error-correcting codes, practical checks such as parity had already emerged as simple, hardware-efficient solutions largely independent of this formal framework.
By the 1960s, checksums became standard in IBM's computing ecosystems for ensuring batch processing integrity. The IBM System/360, launched in 1964 as the first family of compatible computers, integrated parity and summation-based checks in its input/output channels and storage devices to verify data blocks during sequential batch jobs, reducing errors in high-volume commercial applications such as payroll and inventory management. These early implementations prioritized detection over correction, setting the stage for more sophisticated algorithms in networking and storage.
Evolution in Computing
In the 1970s, checksums were integrated into ARPANET protocols to detect errors in packet transmission, with a checksum appended to every packet of information for verification upon receipt.[25] This practice carried over into early editions of the ARPA Internet Protocol and laid the groundwork for broader network standardization, culminating in the definition of the 16-bit IP header checksum in RFC 791 in 1981, which ensured header integrity during processing and fragmentation across interconnected systems including ARPANET.[11]
During the 1980s and 1990s, the proliferation of personal computing drove increased adoption of checksums in file systems and storage devices. For instance, the FAT file system, originating in the early 1980s for MS-DOS, incorporated checksums in its long filename extensions by the mid-1990s to verify the association between long and short file name entries, preventing mismatches due to data corruption or reorganization.[26] Concurrently, cyclic redundancy checks (CRCs) became standard in storage devices such as hard disk drives, enabling robust error detection as capacities grew and personal computers became ubiquitous, with implementations in interfaces such as SCSI and ATA supporting reliable data retrieval.
From the 2000s onward, the demand for high-speed networks and cloud storage prompted a shift toward lightweight checksum algorithms that minimize computational overhead while maintaining error-detection efficacy. Algorithms such as one's complement sums and Fletcher checksums were favored for their efficiency in embedded and high-throughput environments, as demonstrated in evaluations showing their balance of speed and undetected error rates below 10^{-5} for typical control network payloads.[14] In cloud storage systems, checksums evolved to support scalable integrity verification, with providers integrating methods such as CRC32C for end-to-end validation during data transfers and replication, addressing the challenges of distributed architectures that emerged with services like Amazon S3 in the early 2000s.[27]
The specification of IPv6 in RFC 2460 (1998) marked a notable evolution by omitting a header checksum, unlike IPv4, to reduce processing latency in high-speed environments, relying instead on transport-layer checksums for protection.[28] Subsequent updates, such as RFC 6935 (2013), refined UDP checksum handling for tunneled packets over IPv6, allowing zero-checksum modes on low-error links to further optimize performance without compromising integrity.[29] By the 2020s, non-cryptographic checksums continued to play a role in emerging domains, including blockchain transaction handling, where address checksums such as Bitcoin's Base58Check encoding catch typographical errors, and AI data pipelines, where periodic hashing monitors integrity during model training and inference to ensure uncorrupted inputs.[30][31]
Algorithms
Parity Methods
Parity methods represent one of the simplest forms of checksums used for error detection, relying on the addition of a single parity bit to data to ensure a consistent count of 1s across the transmitted or stored unit.[16] In even parity, the parity bit is chosen so that the total number of 1s in the data plus the parity bit is even; conversely, in odd parity, the total is made odd.[32] This allows the receiver to verify integrity by recounting the 1s and checking against the expected parity, flagging any discrepancy as a potential error.[33]
These methods extend beyond single bits to larger units such as bytes or words by applying parity across multiple positions, forming structures such as the longitudinal redundancy check (LRC). In an LRC, parity is computed column-wise across a block of bytes, where each parity bit represents the even or odd count for that bit position across all bytes.[34] Similarly, in storage systems such as RAID levels 3, 4, and 5, parity is applied longitudinally across disk blocks using bitwise operations to enable reconstruction of lost data after a single failure.[35] This extension improves detection of errors within the block while maintaining computational simplicity.[36]
The parity bit is mathematically derived as the exclusive OR (XOR, denoted \oplus) of all data bits for even parity, ensuring the overall sum modulo 2 equals zero: p = \bigoplus_{i=1}^{n} d_i where d_i are the data bits and p is the parity bit; for odd parity, the bit is inverted.[37] This XOR operation effectively counts the 1s modulo 2, making it efficient for hardware implementation.[38]
Consider an 8-bit data word 10110001, which has four 1s (an even count). For even parity, the parity bit p = 0 (the XOR of the bits yields 0), so the word is transmitted as 101100010. If a single bit flips during transmission, producing 101100000 (now three 1s plus p = 0, an odd total), the receiver's recount detects the mismatch.[16] This example illustrates single-error detection; the method changes the observed parity only when an odd number of bits flip.
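These operations can be sketched briefly in Python; the data word matches the example above, while the three-byte block for the LRC is an arbitrary illustration:

```python
def even_parity_bit(data: int, width: int = 8) -> int:
    """Even parity: XOR of all data bits, so data plus parity holds an even number of 1s."""
    p = 0
    for i in range(width):
        p ^= (data >> i) & 1
    return p

def lrc(block: bytes) -> int:
    """Longitudinal redundancy check: column-wise XOR across every byte in the block."""
    out = 0
    for byte in block:
        out ^= byte
    return out

word = 0b10110001                       # four 1s, as in the example above
print(even_parity_bit(word))            # -> 0, so 101100010 is transmitted
print(even_parity_bit(0b10110000))      # -> 1: the flipped bit changes the observed parity
print(hex(lrc(b"\x12\x34\x56")))        # -> 0x70 (XOR of the three bytes)
```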
The primary strengths of parity methods lie in their extreme simplicity and low overhead, requiring just one additional bit per unit and minimal computation via XOR, which suits resource-constrained environments.[39] They reliably detect all single-bit errors and any odd-numbered bit errors, providing a foundational layer of integrity checking without the complexity of advanced algorithms.[34]
Summation-Based Methods
Summation-based checksums compute an error-detecting value by arithmetically summing the data in fixed-size words, typically 16 bits, and applying operations such as carry folding and one's complement to produce the final checksum.[40] This approach treats all positions equally, relying on modular arithmetic to detect multi-bit errors more effectively than simple parity bits, which only catch odd-numbered bit flips.[40] A 16-bit sum-of-words checksum detects all single-bit errors and all bursts up to 16 bits long, along with approximately 99.998% of longer bursts.[40]
The Internet checksum, described in RFC 1071, exemplifies this method and is widely used in the IP, TCP, and UDP protocols.[41] It processes the data as a sequence of 16-bit words, summing them using one's complement arithmetic: any carry out of the most significant bit is folded back by adding it to the least significant bit position.[41] The checksum is then the one's complement of this sum, so that summing the entire message (including the checksum field) at the receiver yields 0xFFFF when no error has occurred.[41] Formally, for data words w_1, w_2, \dots, w_n, the checksum is C = \overline{w_1 +' w_2 +' \cdots +' w_n}, where +' denotes one's complement addition (addition modulo 2^{16} with any carry folded back into the low-order bit) and \overline{x} denotes the one's complement (bitwise NOT) of x.[41] For example, consider two 16-bit words 0x1234 and 0x5678. Their sum is 0x68AC (no carry to fold), and its one's complement, 0x9753, serves as the checksum.[41] At verification, summing the original words with 0x9753 yields 0xFFFF, confirming integrity.[41]
A notable variant is Fletcher's checksum, introduced to improve the distribution of error detection over simple sums by using two running accumulators.[42] Proposed by J. G. Fletcher in 1982, it processes data bytes sequentially: initialize two 8-bit accumulators A and B to zero; for each byte, update A as A = (A + byte) mod 255, then B = (B + A) mod 255.[42] The 16-bit checksum concatenates the final A and B values (often with a one's complement step for compatibility).[42] This dual-accumulator design makes the result sensitive to byte order as well as byte values, reducing the likelihood of undetected errors compared to single-sum methods while remaining computationally efficient for serial transmissions.[42]
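A short Python sketch of both computations, reproducing the 0x1234/0x5678 example above; the helper names and the "abcde" test string are illustrative choices rather than part of any standard API:

```python
def fold16(total: int) -> int:
    """End-around-carry fold of an integer sum into 16 bits (one's complement sum)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def internet_checksum(words: list[int]) -> int:
    """RFC 1071 style: one's complement of the one's complement sum of 16-bit words."""
    return ~fold16(sum(words)) & 0xFFFF

def fletcher16(data: bytes) -> int:
    """Fletcher's checksum: two running accumulators modulo 255, concatenated."""
    a = b = 0
    for byte in data:
        a = (a + byte) % 255
        b = (b + a) % 255
    return (b << 8) | a

# Worked example from the text: 0x1234 + 0x5678 = 0x68AC, complemented to 0x9753.
c = internet_checksum([0x1234, 0x5678])
print(hex(c))                               # -> 0x9753
print(hex(fold16(0x1234 + 0x5678 + c)))     # -> 0xffff (receiver-side verification)
print(hex(fletcher16(b"abcde")))            # -> 0xc8f0
```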
Position-Dependent Methods
Position-dependent checksum methods, also referred to as weighted checksums, enhance error detection by assigning unique weights to data elements based on their positions before performing the summation. This weighting ensures that errors involving positional changes, such as swaps or shifts, produce a distinct alteration in the checksum value compared to uniform summation approaches. By incorporating factors such as powers of a base number, consecutive integers, or primes, these methods achieve higher sensitivity to certain error patterns without significantly increasing computational complexity.[43]
The core computation follows the formula \text{Checksum} = \left( \sum_{i=1}^{n} d_i \cdot w_i \right) \mod m, where d_i represents the data element at position i, w_i is the position-dependent weight (for example, w_i = 10^{i-1} or decreasing integers such as 10 down to 1), and m is the modulus, typically 10 or 11. This structure allows the checksum to validate data integrity by checking whether the computed value matches an appended check digit. Weights are chosen to maximize the discrepancy caused by common errors such as adjacent transpositions, where swapping elements d_i and d_{i+1} generally changes the sum when w_i \neq w_{i+1}.[43]
A widely adopted example is the Luhn algorithm, employed for credit card and identification number validation. Developed by IBM researcher Hans Peter Luhn and patented in 1960, it processes digits from right to left, doubling every second digit (effectively weighting alternate positions by 2 while the others remain weighted by 1), summing the results (splitting doubled values exceeding 9 into their digits), and checking whether the total modulo 10 equals zero. This method reliably detects all single-digit errors and approximately 98% of adjacent digit transpositions, making it suitable for manual entry scenarios prone to such mistakes.[44][45]
In the ISBN-10 standard, position-dependent weighting is used to compute the final check digit for book identification numbers. The first nine digits are multiplied by weights decreasing from 10 to 2, the products are summed, and the check digit is the value that makes the total divisible by 11 (with 'X' standing for 10). Established under ISO 2108, this approach leverages consecutive integer weights and a prime modulus to detect nearly all single errors, transpositions, and even some multiple errors, outperforming unweighted sums in bibliographic data handling.[43]
The primary advantages of position-dependent methods lie in their improved resilience to systematic errors such as transpositions or shifts, as the varying weights amplify positional discrepancies in the sum. For a transposition, the error term (d_i - d_{i+1})(w_i - w_{i+1}) is unlikely to be zero modulo m when the weights differ, enabling detection rates far superior to non-weighted alternatives for human-entered data. These techniques balance simplicity with effectiveness, finding application where error patterns are predictable but computation must remain lightweight.[45][43]
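The following sketch implements both schemes as described above; the test inputs, a commonly cited Luhn-valid number and ISBN 0-306-40615-2, are illustrative examples rather than normative vectors:

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right, sum digit-wise, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:          # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9          # equivalent to adding the two digits of the product
        total += d
    return total % 10 == 0

def isbn10_check_digit(first_nine: str) -> str:
    """ISBN-10: weights 10..2 on the first nine digits; the check digit makes the
    weighted total divisible by 11 ('X' stands for 10)."""
    s = sum((10 - i) * int(d) for i, d in enumerate(first_nine))
    check = (11 - s % 11) % 11
    return "X" if check == 10 else str(check)

print(luhn_valid("79927398713"))        # -> True (a commonly used Luhn test number)
print(isbn10_check_digit("030640615"))  # -> 2 (yielding ISBN 0-306-40615-2)
```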
Fuzzy Methods
Fuzzy methods in checksums enable the generation of similar hash values for data that are nearly identical, facilitating approximate comparisons tolerant of minor alterations such as small edits or noise. Unlike exact checksums, these approaches draw on locality-sensitive hashing principles to produce outputs that remain close when inputs differ slightly, making them suitable for tasks such as plagiarism detection and content deduplication without requiring cryptographic strength. This non-cryptographic focus prioritizes efficiency and probabilistic similarity estimation over collision resistance.[46]
A foundational technique in fuzzy checksums is Rabin fingerprinting, introduced by Michael O. Rabin in 1981, which models data as a polynomial over the finite field GF(2) and computes a compact representation modulo a randomly selected irreducible polynomial. The data string M = m_0 m_1 \dots m_{n-1}, where each m_i \in \{0,1\}, is interpreted as the polynomial M(x) = m_0 + m_1 x + m_2 x^2 + \dots + m_{n-1} x^{n-1}. An irreducible polynomial P(x) of degree k over GF(2) is chosen at random, and the fingerprint is the remainder of the division: f(M) = M(x) \mod P(x). This yields a polynomial of degree less than k, typically represented as a k-bit integer. The method ensures a low collision probability for distinct data while allowing Hamming-distance checks: for two fingerprints f(M) and f(N), their bitwise Hamming distance reflects the bit-level differences in the original data with high probability, enabling detection of small variations.[46]
In practice, Rabin fingerprinting supports fuzzy matching by applying rolling hashes over overlapping windows (shingles) of the data, generating sets of fingerprints that can be compared for similarity. For instance, in duplicate file detection, systems tolerate differences of a byte or two by chunking files based on rolling Rabin hashes and verifying similarity through low Hamming distances or Jaccard indices on the fingerprint sets, avoiding full recomputation for near-matches. This approach efficiently identifies modified versions without requiring exact equality.
Applications extend to search engines, where Rabin-based fingerprints detect near-duplicate web pages by extracting shingle fingerprints from documents and clustering those with small Hamming distances, reducing indexing redundancy while preserving diverse content. For example, Google's crawling systems have used variants to filter boilerplate or slightly altered pages, improving crawl efficiency and result quality.[47]
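A minimal sketch of shingle fingerprinting in this spirit, with each window's fingerprint recomputed rather than rolled for brevity; the degree-8 AES field polynomial x^8 + x^4 + x^3 + x + 1 serves as the irreducible modulus purely for illustration, whereas a practical implementation would choose a random irreducible polynomial of much higher degree:

```python
POLY, DEGREE = 0x11B, 8    # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def rabin_fingerprint(data: bytes) -> int:
    """Reduce the bit string of `data` modulo POLY over GF(2)."""
    fp = 0
    for byte in data:
        for i in range(7, -1, -1):
            fp = (fp << 1) | ((byte >> i) & 1)
            if fp >> DEGREE:        # degree reached deg(POLY): subtract (XOR) the modulus
                fp ^= POLY
    return fp

def shingle_fingerprints(data: bytes, w: int = 4) -> set[int]:
    """Fingerprints of all w-byte windows (shingles) of the data."""
    return {rabin_fingerprint(data[i:i + w]) for i in range(len(data) - w + 1)}

def jaccard(a: set[int], b: set[int]) -> float:
    return len(a & b) / len(a | b) if a or b else 1.0

doc1 = b"the quick brown fox jumps over the lazy dog"
doc2 = b"the quick brown fox jumps over the lazy cat"   # small edit
print(jaccard(shingle_fingerprints(doc1), shingle_fingerprints(doc2)))
# -> high similarity despite the small edit
```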
Polynomial-Based Methods
Polynomial-based methods employ mathematical operations over polynomials in the Galois field GF(2) to compute checksums, with the cyclic redundancy check (CRC) serving as the most prominent example owing to its robustness in detecting burst errors in data transmission and storage.[48] In the CRC algorithm, the sender represents the data message as a polynomial M(x) of degree n-1, where n is the message length in bits. To generate the checksum, the message is augmented by appending k zero bits, equivalent to computing M(x) \cdot x^k, with k being the degree of the chosen generator polynomial G(x). This augmented polynomial is divided by G(x) using polynomial long division in GF(2), where addition and subtraction are performed via XOR operations. The remainder R(x), a polynomial of degree less than k, becomes the CRC checksum. The transmitted codeword is then M(x) \cdot x^k + R(x), which is divisible by G(x) (yielding a zero remainder upon division). At the receiver, the received codeword is divided by G(x); a non-zero remainder indicates an error.[48][49]
Widely adopted generator polynomials include the CRC-16 variant with hexadecimal value 0x8005 (x^{16} + x^{15} + x^2 + 1), commonly used in serial communication protocols such as Modbus, and CRC-32 with 0x04C11DB7 (x^{32} + x^{26} + x^{23} + \dots + 1), standard in Ethernet frames per IEEE 802.3 and in ZIP archives for file integrity verification.[50]
To illustrate the computation, consider an 8-bit message 11010000 (polynomial x^7 + x^6 + x^4) and generator G(x) = x^3 + x + 1 (binary 1011). Augment the data by appending three zeros, giving 11010000000, and perform modulo-2 long division, XORing 1011 under each leading 1 in turn:
- Bits 1-4: 1101 \oplus 1011 = 0110, leaving 01100000000.
- Bits 2-5: 1100 \oplus 1011 = 0111, leaving 00111000000.
- Bits 3-6: 1110 \oplus 1011 = 0101, leaving 00010100000.
- Bits 4-7: 1010 \oplus 1011 = 0001, leaving 00000010000.
- Bits 5 and 6 are 0 (quotient bits of 0, no XOR); bits 7-10: 1000 \oplus 1011 = 0011, leaving 00000000110.
- No further leading 1s remain, so the final three bits, 110, form the remainder.
The remainder 110 is appended to the original message, and the transmitted codeword 11010000110 is exactly divisible by G(x); a non-zero remainder at the receiver indicates corruption.
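The worked division can be reproduced with a short bit-level sketch; the function name and string-based bit representation are illustrative conveniences rather than a standard interface:

```python
def mod2_remainder(bits: str, generator: str) -> str:
    """Remainder of GF(2) polynomial long division of `bits` by `generator`."""
    k = len(generator) - 1                 # degree of G(x)
    work = [int(b) for b in bits]
    gen = [int(b) for b in generator]
    for i in range(len(work) - k):         # stop once fewer than deg(G)+1 bits remain
        if work[i]:                        # leading coefficient 1: XOR in the generator
            for j, g in enumerate(gen):
                work[i + j] ^= g
    return "".join(map(str, work[-k:]))

# Sender: append k zero bits (multiply M(x) by x^k), divide, keep the remainder.
message, generator = "11010000", "1011"            # G(x) = x^3 + x + 1
remainder = mod2_remainder(message + "0" * (len(generator) - 1), generator)
print(remainder)                                   # -> 110
codeword = message + remainder                     # transmitted: 11010000110

# Receiver: a zero remainder means no error was detected.
print(mod2_remainder(codeword, generator))         # -> 000
```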
Applications
Data Transmission
In data transmission, checksums play a critical role in communication protocols by detecting errors introduced during packet transit, helping ensure reliable delivery over potentially noisy channels. In the TCP/IP suite, the IPv4 header includes a mandatory 16-bit checksum computed as the one's complement of the one's complement sum of all 16-bit words in the header, which verifies the integrity of the header fields against transmission errors.[41] For the User Datagram Protocol (UDP), the 16-bit checksum is optional in IPv4 but recommended; it protects the header and data payload and covers a pseudo-header derived from the IP header along with the UDP header and data.[54]
At the data link layer, Ethernet frames employ a 32-bit cyclic redundancy check (CRC-32) as the frame check sequence (FCS), providing link-level integrity by detecting bit errors in the frame payload and headers as defined in the IEEE 802.3 standard. In wireless networking, the IEEE 802.11 (Wi-Fi) standard similarly uses a 32-bit CRC for frame checks, enabling detection of errors caused by interference, fading, or collisions in the radio environment.
Upon receipt, the receiver recomputes the checksum over the relevant packet or frame fields; a mismatch indicates corruption, prompting the protocol to trigger retransmission through mechanisms such as negative acknowledgments (NAK) in automatic repeat request (ARQ) schemes or the absence of positive acknowledgments (ACK) in protocols such as TCP.[55] This process ensures that erroneous packets are discarded and resent, maintaining end-to-end reliability without assuming error-free channels. The computational and bandwidth overhead of checksums remains minimal, with the CRC-32 adding only 4 bytes to Ethernet and Wi-Fi frames, making them especially valuable in high-latency or bandwidth-constrained networks where undetected errors could otherwise propagate widely.
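As a receiver-side illustration of the IPv4 header checksum, the following sketch assembles a hypothetical 20-byte header (all field values and addresses are arbitrary), fills in its checksum, and verifies that the one's complement sum over the completed header is 0xFFFF:

```python
import struct

def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum with end-around carry over the given bytes."""
    if len(data) % 2:
        data = bytes(data) + b"\x00"            # pad odd-length input
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

# A hypothetical IPv4 header; the checksum field (bytes 10-11) starts out as zero.
header = bytearray(struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0, 40,            # version/IHL, DSCP/ECN, total length
    0x1C46, 0x4000,         # identification, flags/fragment offset
    64, 6, 0,               # TTL, protocol (TCP), checksum placeholder
    bytes([192, 168, 0, 1]), bytes([192, 168, 0, 199])))
checksum = ~ones_complement_sum(header) & 0xFFFF
struct.pack_into("!H", header, 10, checksum)     # write the checksum into the header

# Receiver-side check: summing the whole header, checksum included, gives 0xFFFF.
print(hex(ones_complement_sum(header)))          # -> 0xffff
```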
Data Storage and Integrity
In data storage systems, checksums play a crucial role in maintaining integrity by verifying that stored data remains uncorrupted over time. File systems such as ZFS employ checksums at the block level to detect and mitigate errors. Specifically, ZFS uses the Fletcher-4 algorithm by default to compute checksums, stored in 256-bit fields, for all data and metadata blocks, keeping these alongside the data in the storage pool. This enables self-healing in redundant configurations such as mirrors or RAID-Z: if a read operation detects a checksum mismatch indicating corruption, ZFS automatically retrieves and repairs the affected block from a healthy copy.[56][57]
Backup tools leverage checksums to ensure accurate incremental transfers and to verify data fidelity during synchronization. For instance, rsync uses an Adler-32-based rolling checksum as the weak check in its delta-transfer algorithm, dividing files into blocks and computing checksums to identify unchanged portions, thereby minimizing data movement while confirming integrity before applying updates. This allows rsync to handle large backups efficiently by transferring only the differing blocks, with a subsequent strong checksum (such as MD4 or MD5) validating the reconstructed file.[58]
In array-based storage such as RAID, parity checksums provide error detection and correction for disk failures. RAID-5 distributes parity information across all drives in a striped configuration, using exclusive-OR (XOR) operations to compute parity blocks that enable reconstruction of data from a single failed drive. When reading data, the system can verify parity to detect inconsistencies, such as bit errors, and correct them by recalculating from the remaining drives and parity. This method enhances reliability in multi-disk setups without dedicating an entire drive to redundancy.[59]
Modern cloud storage services integrate checksums for upload validation to prevent corruption during ingestion. Amazon S3, for example, supports multiple checksum algorithms (e.g., CRC32, SHA-256) that users can specify when uploading objects; S3 computes and stores the provided checksum in object metadata, then verifies it upon receipt to ensure the data matches exactly. This process catches transmission or storage errors immediately, with S3 rejecting invalid uploads and allowing retries.[60]
To proactively detect silent data corruption, errors that go unnoticed until data is accessed, storage systems perform periodic scrubbing. During these scans, typically scheduled for idle periods, the system systematically reads all blocks, recomputes checksums, and compares them against the stored values; mismatches trigger repairs using redundancy if available. In ZFS, the zpool scrub command automates this traversal, scanning terabytes of data to identify and heal bit rot or media faults without user intervention. Similar mechanisms in other systems ensure long-term data durability by addressing degradation that might otherwise propagate undetected.
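To make the RAID-style XOR parity scheme described above concrete, here is a minimal sketch; the block contents and the three-data-drive layout are arbitrary examples:

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks, as used for RAID-style parity."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"      # data blocks striped across three drives
parity = xor_blocks(d1, d2, d3)             # parity block stored on a fourth drive

# If the drive holding d2 fails, its contents can be rebuilt from the survivors:
rebuilt = xor_blocks(d1, d3, parity)
print(rebuilt == d2)                        # -> True
```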