
Erasure code

An erasure code is a forward error correction (FEC) scheme that encodes a message of k symbols into a longer codeword of n symbols (n > k), such that the original message can be recovered from any k of the n symbols even if the remaining n − k symbols are lost or erased. These codes provide fault tolerance in storage and communication systems by distributing redundant information across multiple components, enabling reconstruction of lost fragments as long as enough of the remaining fragments survive. Unlike general error-correcting codes that handle arbitrary errors at unknown positions, erasure codes specifically address erasures at known positions, making them efficient for scenarios like disk failures or packet loss in networks. The foundational erasure code, the Reed–Solomon code, was invented in 1960 by Irving S. Reed and Gustave Solomon as a subclass of Bose–Chaudhuri–Hocquenghem (BCH) codes, using polynomial evaluations over finite fields to achieve maximum distance separable (MDS) properties that optimize redundancy. Early applications emerged in the 1980s in consumer technologies such as compact discs (CDs), where interleaved Reed–Solomon codes corrected burst errors in audio data. Over time, erasure codes evolved beyond Reed–Solomon variants to include regenerating codes, which minimize repair bandwidth during node recovery, and locally recoverable codes (LRCs), which reduce the number of helper nodes needed for repairs.

In modern distributed storage systems, erasure codes offer significant advantages over traditional replication by achieving higher storage efficiency: a (14, 10) Reed–Solomon code, for instance, requires only about 1.4× the original data size, compared to 3× for triple replication, while maintaining comparable reliability against failures. They are deployed in large-scale cloud environments, where LRCs such as the (18, 14, 7) code provide approximately 1.29× overhead and support rapid local repairs from fewer nodes, and in open-source systems like Ceph and Hadoop for handling petabyte-scale data in data centers where more than 50 nodes may fail per day. Key performance metrics include encoding/decoding complexity, repair bandwidth (the amount of data downloaded for recovery), and repair degree (the number of contacted nodes), which continue to drive research into hybrid and optimized constructions.

Fundamentals

Definition and Principles

Erasure codes are a subclass of error-correcting codes designed specifically to protect against data erasures, where the positions of lost data are known, enabling recovery of the original data from a sufficient subset of the remaining encoded fragments. In this framework, k original data symbols are encoded into n total symbols, with n > k, such that any k out of the n encoded symbols suffice to reconstruct the original data exactly. The redundancy is provided by m = n - k additional symbols, which allow the system to tolerate up to m erasures without data loss. Unlike general error-correcting codes, which must detect and correct arbitrary errors at unknown positions within noisy channels, erasure codes exploit the knowledge of erasure locations to simplify decoding and achieve higher efficiency in recovery. This distinction makes erasure decoding computationally easier, as it involves solving a system of linear equations from the known surviving symbols rather than searching for error patterns.

A fundamental limit on erasure code performance is given by the Singleton bound, which states that the minimum distance d of an (n, k) code satisfies d \leq n - k + 1, with codes achieving equality known as maximum distance separable (MDS) codes. Erasure codes offer significant storage savings compared to replication methods, requiring less overhead while providing comparable fault tolerance; for instance, a (6, 5) code incurs only a 1.2× storage multiplier versus 2× for simple mirroring. This stems from distributing redundancy across multiple fragments, allowing reconstruction from any k survivors even after m failures. A simple illustrative example is a (3, 2) parity code over GF(2), where two data blocks D_1 and D_2 are encoded into three blocks with the third being the parity P = D_1 \oplus D_2; the original data can then be recovered from any two of the three blocks, since D_1 = P \oplus D_2 and D_2 = P \oplus D_1.
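A minimal sketch of this (3, 2) parity scheme in Python, using byte strings to stand in for the data blocks (block contents and function names are illustrative):

```python
# Minimal sketch of a (3, 2) XOR parity erasure code over GF(2).
# Any two of the three blocks suffice to recover the original data.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Symbol-wise XOR of two equal-length blocks (addition in GF(2))."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(d1: bytes, d2: bytes):
    """Encode two data blocks into three: D1, D2, and parity P = D1 xor D2."""
    return d1, d2, xor_blocks(d1, d2)

def recover(blocks: dict) -> tuple:
    """Recover (D1, D2) from any two surviving blocks, indexed 0 (D1), 1 (D2), 2 (P)."""
    if 0 in blocks and 1 in blocks:
        return blocks[0], blocks[1]
    if 0 in blocks and 2 in blocks:          # D2 = P xor D1
        return blocks[0], xor_blocks(blocks[2], blocks[0])
    if 1 in blocks and 2 in blocks:          # D1 = P xor D2
        return xor_blocks(blocks[2], blocks[1]), blocks[1]
    raise ValueError("need at least two of the three blocks")

d1, d2, p = encode(b"HELLO", b"WORLD")
assert recover({1: d2, 2: p}) == (b"HELLO", b"WORLD")  # survives loss of D1
```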

Mathematical Foundations

Erasure codes are constructed as linear codes over finite fields, also known as Galois fields GF(q), where q is typically a power of 2 for computational efficiency in practice. In such fields, addition is performed using bitwise XOR operations, which correspond to vector addition modulo 2. Multiplication is more involved and is defined relative to an irreducible polynomial of degree b for GF(2^b); for byte-level coding, GF(256) = GF(2^8) is commonly used, with multiplication implemented via precomputed logarithm and antilogarithm lookup tables to enable fast exponentiation-based computation.

Linear erasure codes are defined by a generator matrix G, a k × n matrix over GF(q), where k is the dimension (number of data symbols) and n is the block length (total symbols including redundancy). A codeword is formed by multiplying a k-dimensional message vector m by G, yielding the n-dimensional codeword c = m G. In systematic form, G is structured as [I_k | P], where I_k is the k × k identity matrix and P is a k × (n - k) submatrix; this allows the encoded block to be written E = [D | P_D], with D the k data symbols and P_D = D P the (n - k) parity symbols. The parity-check matrix H is an (n - k) × n matrix satisfying G H^T = 0, used in decoding to compute the syndrome s = r H^T for a received vector r, where a non-zero syndrome indicates erasures or errors.

A fundamental limit on the performance of block codes is the Singleton bound, which states that for a code of block length n, dimension k, and minimum distance d over an alphabet of size q, the distance satisfies d ≤ n - k + 1. Codes achieving equality in this bound are known as maximum distance separable (MDS) codes, which provide the optimal trade-off between redundancy and erasure/error correction capability for a given rate. The efficiency of an erasure code is characterized by its rate R = k/n, the fraction of information symbols in the encoded block, and the corresponding overhead factor 1/R = n/k, quantifying the redundancy introduced. Decoding erasure codes involves solving a linear system derived from the known symbol positions using the generator or parity-check matrix. For up to (n - k) erasures in a systematic code, this reduces to inverting a k × k submatrix via Gaussian elimination over the field, with O(k^3) complexity in the number of arithmetic operations.
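The finite-field arithmetic can be illustrated with a short Python sketch of GF(2^8) under the reduction polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d), a common but not unique choice; the table-building step mirrors the log/antilog technique described above:

```python
# GF(2^8) arithmetic sketch: addition is XOR; multiplication uses
# shift-and-reduce modulo the polynomial 0x11d, with log/antilog
# tables built from the generator element 2 for fast multiplication.

PRIM = 0x11d  # x^8 + x^4 + x^3 + x^2 + 1

def gf_mul_slow(a: int, b: int) -> int:
    """Carry-less multiply of a and b, reduced modulo PRIM."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:          # degree-8 term appeared: reduce
            a ^= PRIM
    return result

# Build log/antilog tables: EXP[i] = 2^i, LOG[EXP[i]] = i.
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x = gf_mul_slow(x, 2)
for i in range(255, 512):     # duplicate so exponent sums need no modulo
    EXP[i] = EXP[i - 255]

def gf_mul(a: int, b: int) -> int:
    """Table-based multiplication: a * b = exp(log a + log b)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

assert gf_mul(0x57, 0x13) == gf_mul_slow(0x57, 0x13)
```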

Historical Development

Early Innovations

The concept of parity checks emerged in the 1950s as a fundamental mechanism for achieving simple redundancy in early computing systems, enabling basic error detection and correction. These checks involved adding redundant bits to data to verify integrity, laying groundwork for erasure coding principles by allowing recovery from single failures through parity computations. In 1950, Richard Hamming formalized error-correcting codes in his seminal work on parity-check-based schemes, initially focused on detecting and correcting transmission errors in computing machines, which influenced later erasure-tolerant designs.

In 1960, Irving S. Reed and Gustave Solomon developed Reed–Solomon codes, a class of non-binary cyclic error-correcting codes based on polynomial evaluations over finite fields, capable of correcting multiple symbol erasures or errors. These codes found early real-world use in deep-space communication, notably in NASA's Voyager missions launched in 1977, where a (255, 223) Reed–Solomon code over GF(256), used in concatenation with a convolutional code, protected telemetry data against channel noise and erasures during transmission from billions of kilometers away.

A significant practical application arrived in 1988 with the introduction of RAID (redundant arrays of inexpensive disks) by David A. Patterson, Garth Gibson, and Randy H. Katz, who proposed using XOR-based parity schemes for disk arrays to tolerate disk failures. RAID levels 5 and 6 specifically employed these erasure codes, distributing parity information across drives to recover from one or two simultaneous failures, respectively, thereby enhancing reliability in storage systems without excessive redundancy.

During the 1990s, erasure codes began appearing in academic prototypes for distributed systems, exploring redundancy schemes beyond traditional replication. The OceanStore project, initiated in 2000 by John Kubiatowicz and colleagues, integrated Byzantine fault-tolerant erasure coding into its global-scale persistent storage architecture, using techniques like Reed–Solomon variants to ensure data durability across untrusted, wide-area networks while minimizing storage overhead.

Modern Advancements

In the 2000s, academic research increasingly applied erasure codes to distributed storage systems, emphasizing scalability for wide-area networks. The Total Recall system, introduced in 2004, used erasure codes to store large files across nodes, automating availability management by detecting and repairing failures through redundant fragments. Similarly, DHash++, developed in 2006, integrated erasure coding into a distributed hash table architecture to enhance lookup efficiency and durability in wide-area environments, reducing the overhead of traditional replication.

Production adoption accelerated in the 2010s as large-scale systems sought cost savings over replication. Google's transition from the Google File System (GFS) to Colossus around 2010 incorporated Reed-Solomon codes, such as the RS(6,3) configuration, to achieve efficient redundancy with a storage overhead of 1.5× while supporting availability across clusters. In parallel, Apache Hadoop's HDFS-RAID module, released in 2012, applied erasure codes to convert replicated cold data into parity-protected stripes, reducing storage requirements by up to 50% without compromising durability.

Efficiency breakthroughs addressed repair bottlenecks in erasure-coded systems. Minimum Storage Regenerating (MSR) codes, proposed by Dimakis et al. in 2007, minimized both per-node storage and repair bandwidth by leveraging network coding principles, enabling optimal trade-offs for distributed repair scenarios. Building on this, Locally Repairable Codes (LRCs), introduced in 2012 for Windows Azure Storage, localized single-failure repairs to a small subset of nodes, reducing repair traffic by up to 75% compared to standard Reed-Solomon codes while maintaining low storage overhead.

By the 2020s, erasure codes had become integral to cloud infrastructure. Microsoft continued deploying LRCs for blob storage, achieving multi-rack fault tolerance with repair times under 10% of global reconstruction costs. Recent advancements include maximally recoverable LRCs, which enhance resilience in multi-failure scenarios and have seen further production deployment as of 2022. A survey of 285 papers from 2002 to 2023 highlights the field's maturity, with 76.1% addressing code constructions, algorithmic optimizations, and adaptations for emerging storage architectures.

Standardization efforts facilitated broader deployment. The IETF's Forward Error Correction (FEC) Framework (RFC 6363, 2011) and Reed-Solomon specifications (RFC 5510, 2009) provided protocols for erasure codes in IP networks, influencing storage implementations. OpenStack Swift integrated erasure code support from its 2014 release onward, enabling policy-based use of codes like Reed-Solomon for object storage durability.

Optimal Erasure Codes

Parity Check Codes

Parity check codes represent a foundational class of optimal erasure codes constructed using linear operations over the binary field GF(2), where addition corresponds to the exclusive-OR (XOR) function. These codes generate m parity symbols from k data symbols to form an n = k + m symbol codeword, achieving maximum distance separable (MDS) properties for small m by ensuring that any k symbols suffice for reconstruction. The encoding process relies on a parity-check matrix H of dimensions m × n, where the code satisfies H \mathbf{c} = \mathbf{0} over GF(2), with \mathbf{c} the systematic codeword containing data symbols followed by parities. In simple constructions, each row of H places a 1 in the corresponding parity column and 1's across specific data columns to define the linear dependencies, enabling efficient XOR-based encoding while maintaining invertibility of the relevant submatrices for MDS guarantees.

For the basic case of m = 1, the construction yields a single even-parity symbol P computed as the XOR of all k data symbols d_1, \dots, d_k:

P = \bigoplus_{i=1}^{k} d_i

This corresponds to the parity-check matrix row [1, 1, \dots, 1 \mid 1], where the final 1 enforces the even parity condition. Such codes, exemplified by RAID-5 arrays, tolerate one erasure by recomputing the lost symbol as the XOR of the remaining n - 1 symbols, making them computationally lightweight with O(n) operations for encoding and decoding.

A prominent extension to m = 2 appears in the EVENODD code, which provides an MDS construction tolerating two erasures with minimal redundancy. Designed for k data columns where k is an odd prime, EVENODD arranges the information symbols into a (k-1) × k array and computes two parity columns: the first (horizontal parity) as the row-wise XOR of the k data symbols in each row, and the second (diagonal parity) along wrapped diagonals of the array, with an adjustment for even/odd parity on a special diagonal to ensure systematic form and MDS distance 3. Encoding requires O(k^2) XORs but can leverage RAID-5 hardware for efficiency. Decoding for single or double erasures uses syndrome equations solved via recursive XORs on the available symbols, recovering lost data in O(k) to O(k^2) time depending on the failure pattern. This achieves optimality per the Singleton bound, using only two parities for distance 3, and outperforms general Reed-Solomon codes in XOR count for small configurations (e.g., roughly 50% fewer operations for n = 15).

These codes exhibit MDS properties for small m, such as m = 1 in RAID-5 or m = 2 in EVENODD, allowing recovery from any m erasures through simple XOR combinations of surviving symbols without general finite-field multiplication. However, a key limitation is high repair bandwidth: repairing one erased symbol requires downloading approximately k symbols from other nodes, as the linear dependencies necessitate accessing nearly the full stripe. Additionally, the quadratic complexity in k for encoding, decoding, and updates renders them non-scalable for large n in distributed systems, where prime constraints on k further restrict flexibility. In practice, parity check codes underpin RAID-6 deployments, which employ two parities for double-failure tolerance and optimize computations using Cauchy matrices to minimize XOR operations via sparse, invertible structures over GF(2^b).
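A short Python sketch of the m = 1 case across a stripe of k blocks (a RAID-5-style layout; the block contents and stripe layout are illustrative), showing both encoding and single-erasure repair:

```python
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode_stripe(data_blocks: list) -> list:
    """Append a single even-parity block P = d_1 xor ... xor d_k to the stripe."""
    return data_blocks + [reduce(xor, data_blocks)]

def repair(stripe: list) -> list:
    """Rebuild at most one missing block (marked None) as the XOR of the survivors."""
    missing = [i for i, blk in enumerate(stripe) if blk is None]
    if len(missing) > 1:
        raise ValueError("single-parity codes tolerate only one erasure")
    if missing:
        stripe[missing[0]] = reduce(xor, (blk for blk in stripe if blk is not None))
    return stripe

stripe = encode_stripe([b"AAAA", b"BBBB", b"CCCC"])   # k = 3, n = 4
stripe[1] = None                                      # lose one block
assert repair(stripe)[1] == b"BBBB"
```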

Reed-Solomon Codes

Reed-Solomon codes are a class of optimal maximum distance separable (MDS) erasure codes that achieve the Singleton bound, allowing recovery of the original data from any k out of n encoded symbols, where n - k is the number of parity symbols. These codes operate over finite fields, typically Galois fields GF(q), and are constructed using polynomial evaluation to ensure the MDS property. In the construction of a Reed-Solomon code RS(n, k), the k data symbols are treated as the coefficients of a polynomial f(x) of degree less than k over GF(q), where q ≥ n so that n distinct evaluation points α_i exist. The encoded symbols consist of the n evaluations of f(x) at distinct points α_1, α_2, ..., α_n in the field; in systematic variants the code is arranged so that the first k symbols carry the data directly while the remaining n - k serve as parities. This polynomial-based approach ensures that any n - k erasures can be tolerated, as the minimum distance d = n - k + 1 meets the MDS requirement.

The encoding formula for each symbol is c_i = f(α_i) for i = 1 to n, where f(x) = d_0 + d_1 x + ... + d_{k-1} x^{k-1} and the d_j are the data symbols. Each evaluation can be computed using Horner's method for efficiency, giving an overall encoding complexity of O(n k) field operations. Decoding in the erasure-only case involves recovering f(x) from any k of the received symbols using polynomial interpolation. Lagrange interpolation provides a direct method to reconstruct the polynomial from the k points, while the Berlekamp-Massey algorithm offers an efficient alternative by solving for the shortest linear-feedback shift register that generates the received sequence, enabling recovery in O(k^2) time. These methods handle up to n - k erasures by interpolating the unique polynomial of degree less than k that passes through the available points.

A simple example is the (3,2) Reed-Solomon code, often illustrated for basic error-resilient messaging like "Err-mail." Consider data symbols d_0 and d_1 in GF(3), forming f(x) = d_0 + d_1 x; evaluating at α_1 = 0, α_2 = 1, α_3 = 2 gives c_1 = d_0, c_2 = d_0 + d_1, and c_3 = d_0 + 2 d_1. If c_3 is erased, interpolation from c_1 and c_2 recovers f(x) and thus the data.

Reed-Solomon codes exhibit O(n k) encoding and O(k^2) decoding complexity in their standard forms, making them suitable for practical implementations. They have been widely adopted, such as in compact discs (CDs) for error correction since the 1980s using cross-interleaved (32,28) and (28,24) codes over GF(2^8), and in QR codes for data recovery with varying error correction levels based on Reed-Solomon parities. A key limitation is the field size constraint q ≥ n, which bounds the code length by the field order and necessitates striping data into smaller blocks for large-scale applications where n would exceed practical field sizes like 2^8 or 2^16.
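A compact Python sketch of this evaluate-then-interpolate view, using the prime field GF(257) rather than GF(2^8) so that ordinary modular arithmetic applies (field choice, evaluation points, and symbol values are illustrative, and the encoding shown is the non-systematic coefficient form):

```python
# Reed-Solomon erasure coding sketch over the prime field GF(p).
# Data symbols are polynomial coefficients; codeword symbols are evaluations.
# Any k surviving (position, value) pairs recover the polynomial by
# Lagrange interpolation, hence the data.

P = 257  # prime field order; must satisfy p >= n

def rs_encode(data: list, n: int) -> list:
    """Evaluate f(x) = data[0] + data[1]*x + ... at x = 0, 1, ..., n-1."""
    assert n <= P
    return [sum(d * pow(x, j, P) for j, d in enumerate(data)) % P for x in range(n)]

def rs_decode(received: dict, k: int) -> list:
    """Recover the k coefficients of f from any k surviving evaluations.

    `received` maps evaluation point x -> symbol value f(x).
    """
    points = list(received.items())[:k]
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(points):
        # Build the Lagrange basis polynomial l_i(x), coefficient by coefficient.
        basis = [1]
        denom = 1
        for j, (xj, _) in enumerate(points):
            if i == j:
                continue
            basis = [(-xj * basis[0]) % P] + [
                (basis[t - 1] - xj * basis[t]) % P for t in range(1, len(basis))
            ] + [basis[-1]]                      # multiply basis by (x - xj)
            denom = denom * (xi - xj) % P
        scale = yi * pow(denom, P - 2, P) % P    # Fermat inverse of the denominator
        for t in range(k):
            coeffs[t] = (coeffs[t] + scale * basis[t]) % P
    return coeffs

data = [42, 7, 199]                      # k = 3 data symbols
codeword = rs_encode(data, n=5)          # n = 5, tolerates 2 erasures
survivors = {0: codeword[0], 2: codeword[2], 4: codeword[4]}
assert rs_decode(survivors, k=3) == data
```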

Near-Optimal Erasure Codes

Locally Repairable Codes

Locally repairable codes (LRCs) are a family of erasure codes designed to minimize the amount of data read during the repair of a single failed node in distributed storage systems, achieving near-optimal tradeoffs between storage overhead and repair locality. Unlike maximum distance separable (MDS) codes, LRCs relax the strict MDS property to enable local repairability, where each symbol can be recovered using a small, fixed number r of other symbols within a local group. The construction typically divides the n coded symbols into g local groups, each consisting of r symbols plus one local parity, resulting in g local parities. Additional h global parities are then computed across all data and local parities, with the total redundancy m = g + h - 1, since the sum of the local parities serves as an implicit additional parity and need not be stored separately. This structure ensures locality r, so that any single symbol can be repaired using only r other symbols from its local group.

A representative example is a (10, 6, 3) LRC, featuring two groups each of size 4 (three data symbols and one local parity computed via XOR or a Reed-Solomon code within the group), for a total of six data symbols and two local parities, augmented by two global parities computed over the entire set. In this setup, repairing a single failed symbol requires accessing only three other symbols from its group, significantly reducing I/O compared to reading k systematic symbols in an MDS code. The minimum distance d satisfies the near-MDS bound d = n - k + 1 - (g - 1) = 10 - 6 + 1 - (2 - 1) = 4, allowing correction of up to three erasures globally while maintaining repair efficiency; repair operations scale as O(r) in I/O, versus O(k) for traditional Reed-Solomon codes.

Encoding in LRCs involves first computing local parities independently for each group, often using simple XOR for binary fields or Reed-Solomon codes for enhanced local distance, followed by global parities derived from a systematic overall parity-check matrix that incorporates both data and local parities. This hierarchical approach keeps the code linear and efficient to encode and decode. Tradeoffs include slightly higher storage overhead, such as a 1.5× replication factor versus 1.33× for an equivalent MDS code tolerating three failures, in exchange for 2-3× faster single-node repairs by limiting reads to the local group. Windows Azure Storage deploys an LRC with n = 16 and k = 12, using 12 data symbols, two local parities (g = 2 groups of six data symbols each plus a local parity), and two global parities (h = 2), achieving a 1.33× overhead while roughly halving repair I/O relative to Reed-Solomon baselines.
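The locality property can be illustrated with a minimal Python sketch of two local groups protected by XOR parities; the global parities that a full LRC adds for multi-failure tolerance are deliberately omitted here, so this shows only the single-failure repair path (block contents are illustrative):

```python
# Locality sketch: two local groups of r = 3 data blocks, each with an XOR
# local parity. A single lost block is rebuilt from its own group only
# (r reads), instead of k reads as in an MDS code.

from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def add_local_parities(groups: list) -> list:
    """Append an XOR parity to each local group of data blocks."""
    return [group + [reduce(xor, group)] for group in groups]

def local_repair(group: list, lost_index: int) -> bytes:
    """Rebuild one lost block using only the other blocks of its local group."""
    survivors = [blk for i, blk in enumerate(group) if i != lost_index]
    return reduce(xor, survivors)

groups = add_local_parities([[b"d1", b"d2", b"d3"], [b"d4", b"d5", b"d6"]])
assert local_repair(groups[0], lost_index=1) == b"d2"   # 3 reads, not 6
```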

Regenerating Codes

In distributed storage systems, repairing a failed node using traditional erasure codes like Reed–Solomon requires downloading a full file's worth of data from k surviving nodes, amounting to k symbols in total, which incurs significant network bandwidth costs. Regenerating codes address this by enabling a replacement node to download a reduced amount \gamma of data from d helper nodes (k \leq d < n), where n is the total number of nodes, while ensuring that any k nodes can still reconstruct the original file of size B. This trade-off between per-node storage \alpha and repair bandwidth \gamma = d \beta (with \beta symbols downloaded from each helper) minimizes communication overhead in failure-prone environments like cloud storage.

The cut-set bound, derived from network coding principles, establishes the theoretical trade-off between \alpha and \gamma: analyzing the information flow graph, the minimum cut across any set of k nodes must support reconstruction of the file, which yields B \leq \sum_{i=0}^{k-1} \min(\alpha, (d - i)\beta). Codes meeting this bound with equality are optimal, and the feasible (\alpha, \gamma) pairs trace out the storage-bandwidth trade-off curve.

Minimum Storage Regenerating (MSR) codes operate at the minimum-storage end of the curve, where \alpha = B/k (matching MDS codes such as Reed–Solomon for space efficiency) and \gamma_{MSR} = \frac{dB}{k(d - k + 1)}, which is strictly less than B whenever d > k. Exact-regenerating MSR constructions recover the exact lost data using interference alignment techniques, where helper contributions are linearly combined to cancel interference and isolate the desired symbols. These codes maintain MDS properties for reconstruction while reducing repair traffic, with explicit constructions available for parameters where d \geq 2k - 2.

Minimum Bandwidth Regenerating (MBR) codes sit at the other end of the curve and minimize repair bandwidth, achieving \gamma_{MBR} = \alpha_{MBR} = \frac{2dB}{k(2d - k + 1)}, at the cost of per-node storage higher than B/k. At this point the repaired node stores exactly what it downloads, making MBR suitable for bandwidth-constrained networks. Like MSR codes, MBR codes can be exactly repaired using interference alignment or product-matrix constructions for feasible (n, k, d).

A representative example is the (n = 4, k = 2) MSR code with d = 3 helpers and sub-packetization, where a file of B = 4 sub-packets is stored with \alpha = 2 sub-packets per node; when a node fails, each of the three helpers sends one sub-packet, so \gamma = 3 sub-packets, less than the B = 4 sub-packets a traditional MDS repair would download. Product-matrix and interference-alignment constructions realize such codes by aligning helper contributions across sub-packets, demonstrating bandwidth savings even in small systems.

As of 2025, extensions like piggybacking codes, introduced in 2013, precompute functions of the data during initial writes and embed them in parities, further reducing the effective repair bandwidth in MSR-style setups without altering the core code parameters. Regenerating codes remain primarily in research prototypes, such as experimental distributed file systems, because sub-packetization complexity limits practical deployment. They differ from locally repairable codes in that they reduce the total repair bandwidth drawn from d helpers, rather than limiting the number of nodes contacted for fast single-node fixes.
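A small sketch that simply evaluates the two operating points from the formulas above, using exact fractions (function and parameter names are illustrative):

```python
from fractions import Fraction

def msr_point(B: int, k: int, d: int):
    """Minimum Storage Regenerating point: (alpha, gamma) = (B/k, dB / (k(d-k+1)))."""
    alpha = Fraction(B, k)
    gamma = Fraction(d * B, k * (d - k + 1))
    return alpha, gamma

def mbr_point(B: int, k: int, d: int):
    """Minimum Bandwidth Regenerating point: alpha = gamma = 2dB / (k(2d-k+1))."""
    alpha = gamma = Fraction(2 * d * B, k * (2 * d - k + 1))
    return alpha, gamma

# (n, k, d) = (4, 2, 3) with a file of B = 4 sub-packets:
print(msr_point(4, 2, 3))   # (2, 3): store 2 per node, download 3 to repair (< B = 4)
print(mbr_point(4, 2, 3))   # (12/5, 12/5): more storage, minimum repair traffic
```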

Applications

Distributed Storage Systems

Erasure codes have been integrated into the Hadoop Distributed File System (HDFS) since the release of Hadoop 3.0 in 2017, with initial proposals and development efforts beginning around 2015 through collaborations between Cloudera and Intel. In the parameter notation used here, RS(n, k) denotes a code with n total symbols, of which k are data symbols. HDFS supports configurable erasure coding policies, such as Reed-Solomon (RS) codes with parameters like RS(9,6) or RS(14,10), which replace triple replication for certain data sets, particularly cold or archival data. For example, an RS(14,10) configuration encodes 10 data blocks into 14 total blocks (10 data + 4 parity), enabling automatic tiering where frequently accessed hot data remains replicated while less-accessed data uses erasure coding to save space. This integration allows administrators to apply policies at the directory level, balancing durability with storage efficiency without altering core HDFS workflows.

In cloud storage systems, erasure codes are prominently deployed in Google's Colossus file system and Microsoft Azure Blob Storage. Google's Colossus, the successor to the Google File System (GFS), employs RS(9,6) codes, which provide a redundancy factor of 1.5 while tolerating up to 3 simultaneous failures across distributed clusters. Similarly, Azure utilizes locally repairable codes (LRCs), such as LRC(12,2,2) with 12 data fragments, 2 local parities, and 2 global parities (16 fragments in total), to enhance repair efficiency in large-scale deployments. These LRCs reduce the data read during single-fragment repairs from 12 fragments (as in an equivalent Reed-Solomon code) to 6, achieving approximately 50% lower repair times than equivalent MDS configurations.

The primary benefits of erasure codes in distributed storage include significant cost savings through improved storage efficiency (up to 3 times better than traditional 3-way replication for low-overhead schemes like RS(13,10)) and robust resilience against node failures in large clusters, with tolerance depending on the chosen code parameters. In hyperscale environments, this translates to handling petabyte-scale data with minimal overhead, reducing hardware requirements and operational expenses while maintaining durability. However, challenges persist, including encoding and decoding latency, which can increase read and write times relative to replication due to computational overhead; these costs are often mitigated using hardware-accelerated libraries such as Intel's ISA-L. Additionally, straggler effects during distributed repairs, where slow nodes delay overall recovery, can extend mean time to repair (MTTR) in heterogeneous clusters, though techniques like partial parallel decoding can reduce repair time by up to 59%.

As of 2025, erasure codes are widely adopted in hyperscale distributed storage systems, powering platforms such as HDFS, Colossus, and Azure Blob Storage for deployments approaching zettabyte scale. Open-source libraries such as OpenEC facilitate this adoption by providing a unified framework for implementing and configuring various erasure codes, including Reed-Solomon and LRC variants, directly in systems like Ceph and HDFS, enabling customizable integration without proprietary dependencies.
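The cost and repair trade-offs cited above can be summarized with a back-of-the-envelope Python sketch; the numbers mirror the schemes mentioned in this section, and the failure-tolerance accounting for the LRC is deliberately simplified (it counts only the guaranteed-recoverable failure count):

```python
# Rough comparison of storage schemes discussed above.
# "overhead" = raw bytes stored per byte of user data;
# "repair_reads" = fragments read to rebuild one lost data fragment.

def replication(copies: int):
    return {"overhead": copies, "tolerates": copies - 1, "repair_reads": 1}

def reed_solomon(n: int, k: int):
    # RS(n, k): n total fragments, k data, n - k parity; any k fragments suffice.
    return {"overhead": n / k, "tolerates": n - k, "repair_reads": k}

def azure_lrc(k: int, local: int, global_: int):
    # LRC(k, local, global): local groups of k // local data fragments each;
    # a lost data fragment is rebuilt inside its own local group.
    n = k + local + global_
    return {"overhead": n / k, "tolerates": global_ + 1, "repair_reads": k // local}

print("3-replication:", replication(3))          # 3.0x overhead, 1 repair read
print("RS(14,10):   ", reed_solomon(14, 10))     # 1.4x overhead, 10 repair reads
print("LRC(12,2,2): ", azure_lrc(12, 2, 2))      # ~1.33x overhead, 6 repair reads
```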

Networking and Communication

Erasure codes play a crucial role in networking and communication by enabling reliable data transmission over unreliable channels, where packet losses occur due to congestion, interference, or other impairments. In such scenarios, packets are encoded prior to transmission to include redundancy, allowing receivers to recover lost data without retransmission requests, which is particularly beneficial in multicast or broadcast settings. For instance, Tornado codes, introduced in 1998, were specifically developed for efficient and reliable distribution of large files, achieving near-linear-time encoding and decoding while outperforming traditional Reed-Solomon codes in throughput on erasure channels.

A key advancement in this domain is the development of fountain codes, rateless erasure codes capable of generating an unlimited number of encoded packets from a fixed set of source symbols. LT codes, proposed in 2002, form the foundational class of practical fountain codes, employing a robust soliton degree distribution to ensure efficient decoding; the encoding process randomly selects source symbols according to this distribution and XORs them to create each output symbol. Building on LT codes, Raptor codes, introduced in 2006, add a low-density parity-check precode followed by an LT-like fountain stage, enabling linear-time encoding and decoding and significantly reducing computational overhead for large-scale transmissions. As an illustrative example, for an LT code with 1000 source symbols, a receiver typically collects around 1050 encoded packets (a modest 5% overhead) to decode successfully with high probability using the peeling (belief-propagation) algorithm, which iteratively resolves symbols starting from degree-one outputs.

These codes find widespread application in video streaming and wireless communications, where packet loss is common and retransmission-based automatic repeat request (ARQ) protocols introduce unacceptable delays. In video streaming, fountain codes protect against erasures in real-time delivery, maintaining playback quality without buffering interruptions; for example, they have been integrated into transport schemes for broadcast and multicast scenarios to minimize retransmissions. In wireless links, characterized by variable and sometimes high loss rates, erasure codes improve throughput by correcting erasures proactively, with empirical studies reporting delay reductions of 20-30% compared to ARQ approaches in lossy environments.

As of 2025, erasure codes are increasingly integrated into 5G and beyond-5G networks to support ultra-reliable low-latency communication (URLLC) services, such as industrial automation and autonomous vehicles, by providing robust FEC without feedback overhead. The IETF's FECFRAME framework further standardizes these mechanisms, defining architectures for applying FEC, including fountain-based schemes, to unicast and multicast flows to ensure reliability and efficiency.
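A small Python sketch of the fountain-code idea and the peeling decoder; the degree distribution here is an ad-hoc stand-in rather than the robust soliton distribution an actual LT code would use, and the packet format and function names are illustrative:

```python
# LT (fountain) code sketch: each encoded packet is the XOR of a random
# subset of source blocks, tagged with that subset; the peeling decoder
# repeatedly consumes degree-1 packets.

import random
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def lt_encode(source: list, rng: random.Random):
    """Yield an endless (rateless) stream of (neighbor_indices, payload) packets."""
    k = len(source)
    while True:
        degree = min(k, rng.choice([1, 2, 2, 3, 4]))          # ad-hoc degrees
        neighbors = tuple(rng.sample(range(k), degree))
        yield neighbors, reduce(xor, (source[i] for i in neighbors))

def lt_decode(received: list, k: int):
    """Peeling decoder: returns the k source blocks, or None if it stalls."""
    packets = [[set(nbrs), payload] for nbrs, payload in received]
    decoded = {}
    while len(decoded) < k:
        ripple = next((p for p in packets if len(p[0]) == 1), None)
        if ripple is None:
            return None                       # stalled: receiver needs more packets
        i = ripple[0].pop()
        decoded[i] = ripple[1]
        for p in packets:                     # subtract the new symbol everywhere
            if i in p[0]:
                p[0].discard(i)
                p[1] = xor(p[1], decoded[i])
    return [decoded[i] for i in range(k)]

rng = random.Random(7)
source = [bytes([65 + i]) * 4 for i in range(10)]   # 10 source blocks
stream = lt_encode(source, rng)

received, result = [], None
while result is None:                               # keep collecting until decodable
    received.append(next(stream))
    result = lt_decode(received, k=10)
assert result == source
print(f"decoded after receiving {len(received)} packets")
```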

Emerging Uses

Advanced Storage Architectures

In flash memory and solid-state drive (SSD) storage, erasure and rewriting codes are adapted to address the unique constraints of multilevel cells, such as limited write/erase cycles and wear. Write-Once Memory (WOM) codes enable multiple writes to cells without intervening block erasures by leveraging the asymmetric programming property, where cell charge levels can only increase between erasures. These codes are particularly suited to multilevel cells (MLC) in flash memory, allowing data to be rewritten up to t times per erase cycle while minimizing wear. For instance, rank modulation schemes represent data as permutations of relative cell charge levels rather than absolute values, facilitating rewriting without full erasures and integrating with erasure-correcting mechanisms to tolerate erasures and errors caused by charge drift or read noise. This approach extends device lifetime by reducing erase frequency, with rank-modulation rewriting codes achieving close to 2 bits per cell per write while preserving reliability.

In in-memory storage systems built on volatile clusters, erasure codes provide fault tolerance with lower overhead than traditional replication. EC-Cache, an online erasure coding scheme for cluster caching, splits objects into k data and r parity units during writes, enabling reconstruction from any k units. It employs (10,6) codes, which spread data across 16 units with 1.6× storage overhead, reducing replication overhead by over 3× compared to selective 3-replication while using a similar amount of memory. By reading k + Δ units (e.g., Δ = 1) to absorb stragglers, EC-Cache improves load balancing by more than 3× and reduces median read latency by up to 2.64× and tail latency (99.9th percentile) by 1.79× for large objects (around 40 MB), scaling effectively in memory-based caching clusters.

Disaggregated storage architectures separate compute and memory resources into pooled systems, where erasure codes enhance efficiency in memory pooling. MicroEC optimizes erasure coding for disaggregated memory by reusing auxiliary coding data and pipelining encoding with RDMA transmissions, supporting operations such as writes, reads, degraded reads, and recoveries. It reduces write latency by up to 44.35% compared to baseline erasure coding and 42.14% versus 3-way replication, while improving write throughput by 1.80× over erasure coding and 1.73× over replication for objects of 1 MB or larger. Complementing this, Hydra applies online erasure coding to individual remote memory pages in disaggregated setups, achieving single-digit microsecond access latencies for fault-tolerant remote memory without full replication. It uses configurable coding groups placed via the CodingSets scheme for balanced load and availability, tolerating node failures in pooled environments.

Optimizations in these architectures further mitigate traditional erasure code limitations. Stripeless placement schemes, such as Nos, eliminate stripe alignment by having nodes independently replicate and XOR-encode data into parities based on symmetric balanced incomplete block designs (SBIBD), avoiding centralized coordination and stripe-induced overheads during writes. This enables fault tolerance for up to p node failures while simplifying placement in in-memory systems; evaluations show Nos-based systems achieving 1.61× to 2.60× higher throughput than stripe-based baselines, with comparable or lower latencies. Additionally, hardware acceleration with GPUs speeds up encoding; for example, G-CRS exploits GPU parallelism for Cauchy Reed-Solomon codes, delivering up to 10× higher throughput than CPU-based libraries for large-scale operations. These advances allow erasure codes to scale to petabyte-class deployments in advanced architectures, such as those using NVMe-over-Fabrics (NVMe-oF) high-speed interconnects.

In NVMe-oF fabrics, erasure codes provide robust protection against multiple node or drive failures, as seen in systems like IBM Spectrum Scale Erasure Code Edition, which uses Reed-Solomon codes for declustered protection across petabyte clusters with 1-3 node tolerance depending on configuration (e.g., 8+3 parity). This ensures data availability in disaggregated, high-density environments without excessive replication overhead.
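The "k + Δ" late-binding read described for EC-Cache can be sketched as follows; fetch times are simulated with sleeps, and the decode step is a stand-in for real erasure decoding, so this only illustrates how one extra parallel request hides a straggler:

```python
# Sketch of a k + delta late-binding read: request one extra coded unit and
# decode from whichever k arrive first, so one slow server does not inflate
# read latency. All servers, delays, and decoding are simulated stand-ins.

import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

K, DELTA = 4, 1          # need any K units; request K + DELTA

def fetch_unit(unit_id: int) -> int:
    """Simulate fetching one coded unit; one request may be a straggler."""
    time.sleep(random.choice([0.01, 0.01, 0.01, 0.01, 0.5]))
    return unit_id

def decode_any_k(units: list) -> str:
    """Stand-in for erasure decoding from any K units."""
    return f"object reconstructed from units {sorted(units)}"

with ThreadPoolExecutor(max_workers=K + DELTA) as pool:
    futures = [pool.submit(fetch_unit, i) for i in range(K + DELTA)]
    arrived = []
    for fut in as_completed(futures):
        arrived.append(fut.result())
        if len(arrived) == K:          # ignore the slowest outstanding request
            break
    print(decode_any_k(arrived))
```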

Integration with AI and DNA Storage

Erasure codes have been integrated into distributed machine learning frameworks to enhance resilience during training, particularly by mitigating the impact of stragglers (slow or failed compute nodes) in distributed computations. In coded computing approaches, data and computations are encoded such that the results from a subset of nodes suffice for full recovery, tolerating up to a specified number of stragglers without halting the entire job; for instance, nested gradient codes allow flexible straggler mitigation by concatenating schemes for varying failure counts, reducing training time in large-scale setups. Recent advancements, such as ECRec for recommendation models, couple erasure coding with model-specific traits to achieve fault tolerance while minimizing overhead in distributed training environments. Additionally, machine learning models can predict file access patterns or failure probabilities to enable adaptive erasure coding, as demonstrated in cloud storage systems where learned predictions tune code parameters for dynamic workloads, improving efficiency in AI-driven data management.

In DNA storage, erasure codes address inherent biological errors during synthesis and sequencing, including deletion rates of 6.2% and insertion rates of 5.7% observed in photolithographic synthesis, by encoding data into redundant oligonucleotides that allow reconstruction from partial reads. Fountain codes, adapted for DNA's constraints such as variable oligo recovery and biochemical sequence limits, enable rateless encoding in which any sufficiently large set of fragments decodes the original information regardless of dropouts, as pioneered in the DNA Fountain architecture. This integration supports ultra-high densities, with theoretical capacities reaching up to 33 zettabytes in volumes comparable to a ping-pong ball, far surpassing traditional media and positioning DNA as a viable archival medium for exascale data. Related work applies erasure coding tiering in edge-cloud LSM-tree stores to balance redundancy and performance, an approach extendable to bio-hybrid scenarios for reliable handling of DNA-stored data.

As of 2025, trends emphasize hybrid erasure codes that combine replication and coding for mixed failure modes in AI and DNA contexts, as surveyed in comprehensive reviews of storage systems, with applications in adaptive AI training and error-prone bio-storage. The market for erasure coding technologies is projected to grow significantly by 2031, fueled by these interdisciplinary integrations in resilient computing and archival technologies. Challenges persist, including high decoding error rates in DNA storage due to the dominance of insertions and deletions, which motivate specialized codes like HEDGES for direct indel correction, and the computational overhead of encoding and decoding in distributed AI training, which coded schemes aim to offset through optimized designs.
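A minimal sketch of the coded-computing idea for stragglers, using the classic three-worker gradient coding construction (scalar gradients and coefficient values are illustrative; this is a textbook example, not the nested scheme mentioned above):

```python
# Gradient coding sketch: 3 workers each hold 2 of the 3 data partitions and
# send one coded combination of their local gradients, so the parameter
# server recovers g1 + g2 + g3 from ANY 2 workers, tolerating 1 straggler.

g1, g2, g3 = 0.3, -1.2, 2.5        # per-partition gradients (illustrative scalars)

# Coded message each worker would send:
worker = {
    1: 0.5 * g1 + g2,
    2: g2 - g3,
    3: 0.5 * g1 + g3,
}

# Decoding coefficients for each pair of surviving (non-straggler) workers:
decode = {
    frozenset({1, 2}): {1: 2.0, 2: -1.0},
    frozenset({1, 3}): {1: 1.0, 3: 1.0},
    frozenset({2, 3}): {2: 1.0, 3: 2.0},
}

full_gradient = g1 + g2 + g3
for survivors, coeffs in decode.items():
    recovered = sum(coeffs[w] * worker[w] for w in survivors)
    assert abs(recovered - full_gradient) < 1e-9
print("full gradient recovered from any 2 of 3 workers:", full_gradient)
```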
