Read–modify–write
In computing, particularly in processor architecture and low-level programming, a read–modify–write (RMW) operation is a sequence of steps—or a single hardware-supported instruction—that reads a value from a memory location or register, performs a specified modification on it, and writes the result back to the same location.[1][2]
This idiom is fundamental in embedded systems and assembly language, where it allows selective modification of bits within a register—such as clearing or setting specific fields—while preserving the rest of the value, thereby avoiding disruptions to system functionality controlled by those registers.[1] In ARM architecture, for instance, RMW typically involves transferring the register value to a general-purpose register, applying bitwise operations like BIC (bit clear) or ORR (bitwise OR), and then restoring the modified value.[1]
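The same clear-then-set idiom appears in C when manipulating memory-mapped peripheral registers. The sketch below is a minimal illustration, assuming a hypothetical control register address and field layout; the `&= ~mask` and `|=` steps correspond to BIC and ORR respectively.

```c
#include <stdint.h>

/* Hypothetical memory-mapped control register and field layout. */
#define CTRL_REG   (*(volatile uint32_t *)0x40021000u)
#define MODE_MASK  (0x7u << 4)   /* assumed 3-bit mode field at bits 4-6 */
#define MODE_FAST  (0x3u << 4)   /* assumed encoding for "fast" mode */

void set_fast_mode(void)
{
    uint32_t v = CTRL_REG;   /* read the current register value */
    v &= ~MODE_MASK;         /* clear the mode field (BIC analog) */
    v |= MODE_FAST;          /* set the new field value (ORR analog) */
    CTRL_REG = v;            /* write the result back */
}
```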
In the context of concurrent and parallel programming, RMW operations are often implemented atomically to support synchronization in multiprocessor environments, where the entire read-modify-write cycle executes indivisibly to prevent race conditions and ensure data consistency across threads or cores.[2] Hardware mechanisms, such as the LOCK prefix in x86 processors, enforce this atomicity by locking the bus or cache line during the operation, serializing access to shared memory and enabling reliable inter-processor communication.[2] Atomic RMW instructions are guaranteed for aligned operands up to 64 bits in modern IA-32 architectures, provided they operate on cacheable memory types like write-back (WB) or write-through (WT).[2]
Notable examples of atomic RMW primitives include exchange (XCHG), compare-and-exchange (CMPXCHG), fetch-and-add (XADD), and bit test/manipulation instructions (BTS, BTR, BTC), which underpin synchronization structures like spinlocks, semaphores, and mutual exclusion primitives in operating systems and multithreaded applications.[2][3] These operations are vital for building higher-level concurrency controls, as they provide the indivisible building blocks needed for safe shared-memory access in scalable parallel systems.[3]
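As a brief illustration of how such primitives underpin spinlocks, the C11 sketch below builds one from `atomic_flag`, whose test-and-set operation typically compiles to a LOCK-prefixed XCHG or BTS on x86; this is a minimal sketch, not a production-quality lock.

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void)
{
    /* Atomic RMW: read the flag and set it in one indivisible step.
       Loop until the previous value was clear, i.e. the lock was free. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;  /* busy-wait */
}

void release(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}
```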
Definition and Basic Concepts
Definition
In computing, a read–modify–write (RMW) operation consists of a sequence where a value is read from a specific memory location, a modification is applied to that value based on its content, and the resulting modified value is then written back to the same memory location.[4] This process is fundamental to low-level hardware and software interactions, particularly in environments involving shared memory.[5]
The primary purpose of RMW operations is to facilitate updates that rely on the current state of the data being modified, such as incrementing a shared counter or toggling a status flag in a concurrent system.[6] These operations allow programs to perform state-dependent changes efficiently without requiring complex locking mechanisms in simple cases, though they are often implemented with hardware support to ensure reliability.[4]
For correctness in multi-threaded or multi-processor environments, the entire RMW sequence must execute atomically, appearing indivisible to concurrent operations and thereby preventing inconsistencies like interleaved modifications.[6] Without this atomicity, race conditions can arise, where one operation's read sees a value altered by another before its write completes.[4]
RMW operations emerged in early computing systems during the 1960s and 1970s to manage shared physical resources like memory and peripherals in multiprogramming environments.[7] They were formalized in the concurrent-programming literature in the late 1980s and early 1990s, with key contributions defining correctness criteria such as linearizability for operations on shared objects.[6]
Common Examples
One common read–modify–write operation is the increment of a shared counter, often used to track the number of active threads or processed items in concurrent programs. In this pattern, a process reads the current value N from the shared variable, computes the new value N + 1, and writes it back to the same location. Without atomicity, concurrent executions can lead to lost updates, where one thread's increment is overwritten by another's stale read.[8]
The following pseudocode illustrates a non-atomic increment:
```
read counter into temp    // Read:   temp = counter
temp = temp + 1           // Modify: increment
counter = temp            // Write back
```
This example appears frequently in multithreaded applications managing counts, such as reference counting in memory management.[9]
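The lost-update effect is easy to reproduce. The following C sketch (a minimal demonstration, assuming POSIX threads; the million-iteration count is arbitrary) increments a shared counter from two threads without synchronization, and on most machines prints a total well short of 2000000:

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;   /* shared, deliberately unprotected */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter = counter + 1;   /* non-atomic read-modify-write (a data race) */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}
```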
Another typical pattern is flag toggling, employed in simple synchronization primitives like busy-waiting locks or status indicators. Here, a process reads the current state of a bit (e.g., 0 for unlocked, 1 for locked), inverts it based on a condition (such as acquiring the lock), and writes the updated bit back. Interleaving threads may cause inconsistent states, such as multiple acquisitions of the same lock.[8]
Pseudocode for toggling a lock flag to acquire it:
```
read flag into temp       // Read:  temp = flag (0 or 1)
if temp == 0 then         // Check: is the lock free?
    temp = 1              // Modify: mark as locked
    flag = temp           // Write back
```
Such operations are foundational in software concurrency control, though they highlight the need for atomicity to prevent races.[9]
Accumulator updates represent a broader class of read–modify–write sequences, commonly seen in logging systems or real-time statistics gathering where values are aggregated across multiple threads. A process reads the current total from a shared accumulator, adds a delta (e.g., an elapsed time or event count), and writes the new sum back. Concurrent additions without synchronization can result in undercounting due to overlapping modifications.[8]
Pseudocode for an accumulator update:
```
read total into temp      // Read:   temp = total
temp = temp + delta       // Modify: add contribution
total = temp              // Write back
```
This pattern is prevalent in performance monitoring and distributed counters, emphasizing the risks of non-atomic aggregation.[9]
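Where hardware support is available, the fix is to collapse the three steps into one indivisible operation; a minimal C11 sketch of a safe accumulator using `atomic_fetch_add`:

```c
#include <stdatomic.h>

static atomic_long total = 0;   /* shared accumulator */

void record(long delta)
{
    /* One atomic RMW: no concurrent addition can be lost. */
    atomic_fetch_add(&total, delta);
}
```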
Applications and Contexts
Processor Instructions
In multi-core processors, read–modify–write (RMW) sequences are integral to cache coherence protocols like MESI (Modified, Exclusive, Shared, Invalid), which manage shared data access across cores. Under MESI, a core initiating an RMW must transition a cache line to the Exclusive or Modified state to ensure exclusive ownership, invalidating shared copies in other caches to avoid stale data during the modification phase.[10] This process serializes access, enabling consistent updates but introducing latency from cache-to-cache transfers.[11]
Non-atomic RMW operations in pipelined execution pose significant pitfalls, as modern CPU pipelines—featuring out-of-order execution and speculative fetching—can interleave reads and writes from concurrent threads. For example, one thread's load might capture a value, and another thread's store might land between that load and the subsequent write-back, resulting in lost updates or inconsistent states observable across cores.[12] Such interleaving violates the intended atomicity of the sequence, exacerbating issues in multi-threaded workloads where pipelines process instructions from multiple streams simultaneously.[13]
Basic load-modify-store sequences in assembly languages illustrate non-atomic RMW without hardware guarantees. In x86, a simple increment of a memory operand requires separate instructions, as shown below:
```assembly
mov eax, [mem]    ; Load value from memory into register
inc eax           ; Modify the value
mov [mem], eax    ; Store back to memory
```
This trio is not atomic; intervening operations from other cores can occur between the load and store, leading to race conditions.
In ARM architectures, a comparable non-atomic increment uses load, arithmetic, and store instructions:
```assembly
ldr r0, [r1]      ; Load value from address in r1
add r0, r0, #1    ; Modify by adding 1
str r0, [r1]      ; Store result back
```
Without exclusive monitors or barriers, this sequence allows interleaving, compromising data integrity in shared-memory multi-core systems.[14]
The performance implications of non-atomic RMW are pronounced in high-throughput scenarios like vector processing, where SIMD instructions process multiple data elements in parallel but often lack atomic guarantees for memory operations. Non-atomic vector stores can result in torn writes—partial updates visible to other cores—incurring overhead from required software retries or synchronization barriers to maintain correctness, potentially reducing throughput in contended multi-core environments. This overhead contrasts with scalar atomic primitives but highlights the scalability challenges in parallel vector workloads without dedicated hardware support.[15]
Storage Systems
In storage systems, read–modify–write (RMW) operations are essential for maintaining data redundancy in array-based configurations like RAID levels 4, 5, and 6, where partial stripe writes require updating parity information to preserve fault tolerance.[16] These levels use block-level striping with dedicated or distributed parity to enable recovery from disk failures, but writing less than a full stripe necessitates RMW to recompute parity without disrupting the array's integrity.[16] In RAID 4, parity is stored on a dedicated disk per stripe group, limiting write parallelism due to contention on that disk during updates.[16]
The RMW process for partial writes in these RAID levels involves reading the existing data block and its associated parity (or parities), modifying the data, recalculating the parity using exclusive-OR (XOR) operations on the delta between old and new data, and then writing the updated data and parity back to the array.[16] For RAID 5, which distributes parity across all disks in the stripe, this process allows multiple simultaneous partial writes, improving efficiency over RAID 4 by eliminating the single parity disk bottleneck; a small write typically requires two reads (old data and old parity) and two writes (new data and new parity).[16] In RAID 6, which employs dual distributed parities (P and Q, where Q often uses a more complex scheme like Reed-Solomon for second-failure tolerance), the process extends to three reads (old data, old P, and old Q) and three writes (new data, new P, and new Q) for a single-block partial stripe update, ensuring consistency across both parity sets.[17] This atomic RMW sequence prevents parity inconsistencies that could lead to data loss if a failure occurs mid-operation.
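The parity recalculation reduces to an XOR delta: the new parity equals the old parity XOR the old data XOR the new data, so only the affected stripe units need be read. A minimal C sketch of the RAID 5 small-write computation (the block size is an assumed illustration):

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096   /* assumed stripe-unit size in bytes */

/* RAID 5 small write: compute the new parity block from the old data,
   the new data, and the old parity, leaving other stripe units unread. */
void raid5_new_parity(const uint8_t *old_data, const uint8_t *new_data,
                      const uint8_t *old_parity, uint8_t *new_parity)
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        new_parity[i] = old_parity[i] ^ old_data[i] ^ new_data[i];
}
```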
By enforcing atomicity in parity updates during partial writes, RMW in RAID 4, 5, and 6 enhances data integrity in distributed storage environments, allowing the array to detect and correct errors from single (RAID 4/5) or dual (RAID 6) disk failures without requiring full stripe rewrites.[16] For instance, consistent parity enables reconstruction of lost data blocks using surviving stripes, mitigating risks in scenarios like power failures or concurrent I/O in large-scale systems.[17] This mechanism is particularly valuable in environments with frequent small writes, such as databases or virtualized storage, where it balances redundancy with performance overhead.[18]
The concept of RMW for parity in RAID originated in the late 1980s as part of early redundancy array designs, with RAID levels 4 and 5 formalized in a seminal 1988 technical report by Patterson, Gibson, and Katz at UC Berkeley, which highlighted the need for such operations to address the "small write problem" in inexpensive disk arrays.[16] RAID 6 emerged as an extension in the 1990s to handle growing disk capacities and dual-failure risks, building on these foundations with advanced dual-parity schemes to further leverage RMW for enhanced reliability.[18]
I/O Operations
In input/output (I/O) operations, read–modify–write (RMW) sequences encounter unique challenges due to the architecture of peripheral devices, where registers often serve dual purposes for reading and writing, potentially leading to inconsistencies. Many I/O peripherals, such as those in embedded systems, maintain separate internal latches for output data and input sensing, meaning a read operation captures the current state of external pins or signals rather than the previously written latch value. This separation can result in stale data during the modify phase, as external factors like device activity or noise may alter the pin states between the read and write steps. For instance, in universal asynchronous receiver-transmitter (UART) status registers, reading to check flags (e.g., overrun or framing errors) reflects the instantaneous device state, but any subsequent write to clear or modify those flags risks incorporating outdated pin data if the device updates asynchronously.[19]
These register mismatches frequently produce unexpected results in RMW operations on device buffers or control registers. When modifying a shared buffer—such as a transmit or receive FIFO in a serial peripheral—the read phase might retrieve a snapshot of data that includes prior transmissions, but the write phase could apply changes to a buffer that has since been altered by ongoing device activity, like incoming bytes overwriting slots. This discrepancy arises because I/O devices often do not guarantee atomicity between read and write accesses to the same logical register; the write may target a shadow latch or control path disconnected from the read path, leading to partial updates or lost modifications. A classic example occurs in bit-level operations on I/O ports, where attempting to toggle a single control bit (e.g., enabling a UART interrupt) via RMW can inadvertently corrupt adjacent bits if the read captures transient input from connected peripherals.[19]
To mitigate these RMW challenges in I/O drivers, developers employ polling or interrupt-driven sequences to enforce temporal coherence and verify device states before modifications. In polling-based approaches, the driver repeatedly reads status registers until a stable condition is confirmed (e.g., buffer empty or ready flag set), ensuring the subsequent RMW operates on current, non-stale data and minimizing race-like inconsistencies from device timing. Interrupt-driven methods complement this by triggering CPU intervention only when the peripheral signals completion via hardware interrupts, allowing the driver to sequence reads and writes in a synchronized manner without continuous CPU overhead. These techniques are essential for maintaining data integrity in asynchronous I/O environments.[20][21]
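A typical polling sequence in driver code looks like the C sketch below, which waits for a transmit-ready flag before writing a byte; the register addresses and bit mask are hypothetical, not taken from any particular UART.

```c
#include <stdint.h>

/* Hypothetical memory-mapped UART registers. */
#define UART_STATUS  (*(volatile uint8_t *)0x4000A000u)
#define UART_TXDATA  (*(volatile uint8_t *)0x4000A004u)
#define TX_READY     0x01u   /* assumed: transmit buffer empty */

void uart_send_byte(uint8_t byte)
{
    /* Poll until the device reports a stable ready state, so the write
       cannot race an in-flight transmission. */
    while ((UART_STATUS & TX_READY) == 0)
        ;  /* busy-wait; an interrupt-driven driver would sleep instead */
    UART_TXDATA = byte;
}
```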
Historical cases in early personal computers and embedded systems highlight the prevalence of RMW pitfalls in I/O operations. In 1980s and 1990s IBM PC-compatible systems, programming parallel (LPT) and serial (COM) ports required careful handling of RMW to avoid corrupting control lines, as documented in technical references where direct port I/O instructions like IN and OUT could yield inconsistent results on bidirectional ports due to pin-state reads. Embedded controllers, such as those in early industrial automation, faced similar issues with UART and GPIO ports, where unmitigated RMW led to communication failures, prompting driver guidelines emphasizing full-port writes or status polling. These problems were widely addressed in era-specific hardware manuals, influencing modern driver design practices.[22][23]
Challenges and Hazards
Race Conditions
In multi-threaded environments, non-atomic read–modify–write (RMW) operations can lead to race conditions when multiple threads interleave their accesses to shared data, causing the final state to depend on unpredictable scheduling order.[24] Specifically, the mechanism involves two or more threads reading the same initial value, performing independent modifications, and then writing back, resulting in lost updates where one thread's change overwrites another's.[25] For instance, if two threads both read a shared counter value of N, each increments it to N+1, and both write N+1, the counter ends at N+1 instead of the expected N+2.[24]
Race conditions in RMW sequences manifest in two primary types: write-write races, where concurrent writes to the same location overwrite prior modifications without coordination, and read-write races, which produce inconsistent views of data for subsequent operations.[26] Write-write races directly cause lost updates in structures like counters, while read-write races can lead to threads operating on stale data, amplifying errors in dependent computations.[27]
The impact of these races is severe, often resulting in data corruption within shared structures such as queues or counters, where invariants like total count accuracy or buffer integrity are violated.[27] In queue implementations, for example, non-atomic RMW on index pointers can cause buffer overflows or underflows, leading to lost elements or incorrect dequeuing.[24] Such bugs are typically nondeterministic heisenbugs, manifesting sporadically based on timing and thus complicating debugging.[25]
Detection of RMW-related race conditions relies on tools like thread sanitizers, which instrument code at compile time to monitor memory accesses and flag concurrent unsynchronized reads/writes to shared variables during runtime.[28] These tools, such as ThreadSanitizer integrated in compilers like Clang and GCC, provide dynamic analysis to identify data races without requiring specific test cases, though they may introduce performance overhead.[25]
Device-Specific Hazards
In direct memory access (DMA) controllers, timing hazards arise from delays between the read and write phases of a read–modify–write (RMW) sequence, potentially allowing concurrent DMA operations to alter the target location before the write completes. For instance, if a CPU executes an RMW on a memory location while a DMA controller performs a write to the same address during the interval between the CPU's read and write, the DMA's modification may be overwritten or lost, leading to data corruption or inconsistent system state.[29] This issue is exacerbated in high-throughput environments where DMA transfers occur frequently, such as in embedded systems processing real-time data streams.[29]
Cache coherence problems in non-uniform memory access (NUMA) systems can manifest during RMW operations on remote memory, where inconsistencies between processor caches arise due to varying access latencies and coherence protocol overheads. In cache-coherent NUMA (ccNUMA) architectures, accesses to remote nodes may result in stale cache data if coherence protocols do not promptly invalidate or snoop affected lines, potentially causing incorrect value propagation across nodes. Such discrepancies are particularly pronounced in I/O-intensive workloads involving peripheral devices.[30][31]
Unintended side effects in device operations during RMW sequences often stem from hardware-specific behaviors, such as triggering interrupts or altering peripheral states midway through the operation. For example, in microcontroller I/O ports configured for outputs, RMW instructions on PORT registers can produce unexpected pin level changes due to capacitive loading and device clock speeds, where the read phase captures transient pin states that do not match the intended latch values during the write.[32] Similarly, interrupts may be lost if generated during an RMW on interrupt flag registers, as the operation's read phase might clear pending flags prematurely without acknowledging the new event.[33]
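A common remedy on such microcontrollers is to keep a shadow copy of the output latch in RAM, modify the shadow, and write the whole port in one store, so the sequence never reads transient pin states. A minimal sketch (the port address is hypothetical):

```c
#include <stdint.h>

#define PORT_LATCH  (*(volatile uint8_t *)0x40010000u)  /* assumed address */

static uint8_t port_shadow = 0;   /* software copy of the output latch */

void set_output_bit(unsigned bit)
{
    port_shadow |= (uint8_t)(1u << bit);   /* modify the shadow, not the pins */
    PORT_LATCH = port_shadow;              /* full-port write: no pin read */
}

void clear_output_bit(unsigned bit)
{
    port_shadow &= (uint8_t)~(1u << bit);
    PORT_LATCH = port_shadow;
}
```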
Case studies highlight these hazards in modern peripherals. In graphics processing units (GPUs), intensive RMW-like patterns in memory access—such as repeated row activations in GDDR memory—can exploit Rowhammer vulnerabilities, inducing bit flips in adjacent cells due to electrical interference, which compromises data integrity in compute workloads. This was demonstrated on NVIDIA GPUs post-2020, where attackers could manipulate deep learning models by targeting GPU memory during high-frequency read-write cycles, revealing the risks of non-atomic memory operations in parallel processing environments.[34] For network interface cards (NICs), RMW operations on descriptor rings shared between host and device can lead to inconsistencies if not properly synchronized, potentially resulting in packet processing errors in high-speed Ethernet designs.[35]
Additionally, non-atomic RMW on multi-byte data can result in torn reads or writes, where only part of the data is updated, leading to invalid intermediate states, especially on architectures without hardware support for atomic multi-word operations.[36]
Solutions for Atomicity
Hardware Primitives
Hardware primitives for read–modify–write operations are low-level processor instructions designed to perform atomic updates on shared memory locations, ensuring that the read, modification, and write phases occur indivisibly from the perspective of other processors. These primitives form the foundation for synchronization in multiprocessor systems by preventing race conditions during concurrent access. Common examples include test-and-set, fetch-and-add, compare-and-swap (CAS), and load-link/store-conditional (LL/SC), each providing varying degrees of flexibility for implementing lock-free algorithms.
The test-and-set instruction atomically reads the value from a memory location and sets it to 1 (or a non-zero value), returning the original value to indicate whether the location was previously unset. This primitive, seminal for mutual exclusion, first appeared in the IBM System/360 mainframe family in the 1960s to support multiprocessor synchronization in early parallel computing environments.[37] It remains supported in modern IBM z/Architecture descendants and in x86 via instructions like BTS (bit test and set) with the LOCK prefix. Fetch-and-add, another read–modify–write primitive, atomically adds a specified value to a memory location and returns the prior value, enabling efficient counter increments without full locks. Originating from the NYU Ultracomputer project in the 1980s for handling concurrent memory references, it is implemented in x86 through the XADD instruction (introduced in the 80486 processor in 1989) and in ARM via LDADD in AArch64.
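Fetch-and-add's utility shows clearly in the classic ticket lock, where a single atomic addition hands each thread a unique ticket and threads enter in FIFO order. A C11 sketch (a minimal illustration, assuming the lock's fields start zero-initialized):

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* next ticket to hand out */
    atomic_uint now_serving;   /* ticket currently admitted */
} ticket_lock;                 /* assumed zero-initialized */

void ticket_acquire(ticket_lock *l)
{
    /* Atomic RMW: each caller receives a distinct ticket number. */
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me)
        ;  /* spin until it is this ticket's turn */
}

void ticket_release(ticket_lock *l)
{
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```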
Compare-and-swap (CAS) atomically compares the contents of a memory location with an expected value and, if they match, replaces it with a new value, returning success or failure. This versatile primitive, introduced in the IBM System/370 in 1970 as an advancement over the simpler test-and-set for process parallelism, underpins many non-blocking data structures, and its theoretical power was formalized by Maurice Herlihy in 1991.[38] In CISC architectures like x86, CAS is realized via CMPXCHG prefixed with LOCK (available since the 80486). RISC architectures such as ARM traditionally rely on LL/SC for equivalent functionality, though AArch64 introduced direct CAS instructions, including CASP for paired operations. Load-link/store-conditional (LL/SC) works by "linking" a load operation to an address, establishing a reservation, and allowing a conditional store only if no other processor has modified the location in between; failure returns a flag for retry. Proposed in the 1980s for the S-1 multiprocessor project at Lawrence Livermore National Laboratory, LL/SC is native to RISC designs like ARM (via LDXR/STXR, with reservation granules spanning 8–2048 bytes) and MIPS; because the store-conditional fails on any intervening write, LL/SC also avoids the ABA problem, in which a value reverts to its original before a CAS comparison.[39]
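Both CAS and LL/SC are typically used in an optimistic retry loop: read the old value, compute a new one, and attempt the conditional update, retrying if another processor intervened. The C11 sketch below implements an atomic maximum this way; `atomic_compare_exchange_weak` generally compiles to LOCK CMPXCHG on x86 and to an LDXR/STXR loop on ARM.

```c
#include <stdatomic.h>

/* Atomically raise *target to candidate if candidate is larger. */
void atomic_store_max(atomic_long *target, long candidate)
{
    long old = atomic_load(target);
    while (old < candidate) {
        /* Succeeds only if *target still holds old; on failure, old is
           refreshed with the current value and the loop re-evaluates. */
        if (atomic_compare_exchange_weak(target, &old, candidate))
            break;
    }
}
```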
These primitives achieve atomicity through hardware mechanisms that exploit cache coherence protocols and bus arbitration. In x86 CISC architectures, the LOCK prefix asserts the LOCK# signal, initially locking the system bus to prevent other processors from accessing shared memory during the operation; modern implementations optimize this by locking only the affected cache line via MESI-like protocols, reducing contention.[40] RISC architectures like ARM favor LL/SC, where the load-link sets a reservation bit in the cache, and store-conditional checks for intervening writes via coherence traffic, succeeding only on exclusive ownership. Hardware support for atomic RMW traces back to the mainframes of the 1960s and 1970s, when IBM's System/360 and System/370 introduced atomic instructions such as test-and-set and compare-and-swap amid the shift to multiprogramming; by the 1980s and 1990s, RISC designs adopted LL/SC to fit their load-store paradigm, while CISC architectures like x86 integrated RMW directly into lockable instructions for backward compatibility.[38]
Performance of these primitives varies by architecture and contention but prioritizes low latency for uncontended cases to enable scalable concurrency. On modern x86 processors like Skylake, a locked increment (fetch-and-add by 1) incurs about 20 cycles for L1 cache hits, rising to 100+ cycles on remote cache misses due to coherence overhead, while CAS exhibits similar single-instruction latency but higher effective cost in retry loops under contention.[41] In ARM AArch64, LL/SC pairs achieve comparable uncontended latencies (10–50 cycles) via exclusive monitors, though reservation granularity affects scalability compared to x86's direct RMW. Although these costs exceed those of plain, non-atomic operations, optimizations in recent processors have narrowed the gap, keeping atomic primitives practical for high-throughput synchronization.[41]
Software Techniques
Software techniques for achieving atomicity in read–modify–write (RMW) operations rely on higher-level abstractions provided by operating systems, programming languages, or custom algorithms, rather than low-level hardware instructions. These methods ensure that the read, modification, and write phases of an RMW sequence execute as a single indivisible unit by serializing access to shared data among concurrent threads. By wrapping the RMW operation within a protected critical section, software techniques prevent interleaving that could lead to inconsistent states, such as lost updates in counters or corrupted data structures. This approach is portable across architectures but may introduce overhead from context switches or busy-waiting, depending on the implementation.[42]
Locking mechanisms, such as mutexes and spinlocks, are foundational software techniques for serializing RMW sequences. A mutex (mutual exclusion lock) allows only one thread to acquire the lock and execute the RMW operation at a time; other threads block until the lock is released, ensuring the entire sequence completes atomically. For example, in an atomic increment of a shared counter, a thread locks the mutex, reads the current value, adds one, writes it back, and unlocks, preventing partial overlaps. Mutexes are implemented in libraries like POSIX threads (pthreads), where pthread_mutex_lock and pthread_mutex_unlock provide this serialization, often backed by operating system primitives for blocking efficiency.[43][42]
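A minimal pthreads sketch of this pattern, serializing a shared-counter increment behind a mutex:

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

void increment(void)
{
    pthread_mutex_lock(&lock);    /* enter the critical section */
    counter = counter + 1;        /* the RMW sequence, now serialized */
    pthread_mutex_unlock(&lock);  /* leave the critical section */
}
```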
Spinlocks offer an alternative for short RMW operations, where a thread repeatedly polls (spins) in a tight loop until the lock is available, avoiding the overhead of thread suspension. This busy-waiting approach serializes access by using a shared flag or variable that threads atomically test and set before entering the RMW critical section. Spinlocks are particularly useful in kernel or real-time environments where latency is critical, as seen in the Linux kernel's implementation for protecting short code paths. However, prolonged spinning can waste CPU cycles, making spinlocks suitable only for low-contention scenarios.[44]
Transactional memory provides an optimistic software-based approach to RMW atomicity, treating sequences of operations as transactions that execute in isolation. Software transactional memory (STM) systems buffer reads and writes in a private log during the transaction; upon commit, they validate that no conflicting modifications occurred and apply changes atomically, or abort and retry on conflicts. This enables RMW blocks, such as updating multiple related variables (e.g., a balance and transaction log), to appear atomic without explicit locking, simplifying concurrent programming for complex data structures. Seminal work on STM, including static transaction implementations, demonstrated its viability for lock-free concurrency by leveraging conflict detection to resolve races dynamically.[45]
Classic software algorithms like Dekker's and Peterson's achieve RMW atomicity through mutual exclusion without hardware atomics, using only reads, writes, and shared flags for two threads. Dekker's algorithm, the first software solution to the mutual exclusion problem, employs two boolean flags and a turn indicator to coordinate entry into a critical section containing the RMW operation, ensuring progress and bounded waiting while preventing simultaneous access. It works by having each thread signal intent via its flag and yield based on the turn, serializing the RMW to maintain consistency. Peterson's algorithm refines this for simplicity, using a single turn variable and flags where each thread sets its flag and designates the other as prioritized, guaranteeing mutual exclusion for the RMW sequence with minimal assumptions on memory access atomicity. These algorithms laid the groundwork for software synchronization, applicable in environments lacking atomic primitives.[46]
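Peterson's algorithm for two threads can be sketched as below. On modern hardware, plain loads and stores may be reordered, so this C11 rendering uses sequentially consistent atomics for the flags and turn variable; the original formulation assumed such ordering implicitly.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag[2];   /* flag[i]: thread i wants to enter */
static atomic_int  turn;      /* which thread yields when both want in */

void peterson_lock(int me)            /* me is 0 or 1 */
{
    int other = 1 - me;
    atomic_store(&flag[me], true);    /* announce intent */
    atomic_store(&turn, other);       /* give the other thread priority */
    /* Wait while the other thread wants in and holds priority. */
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ;  /* busy-wait */
}

void peterson_unlock(int me)
{
    atomic_store(&flag[me], false);
}
```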
Standard libraries integrate these techniques for practical RMW atomicity. In pthreads, mutexes wrap RMW operations like increments, as shown in examples where a shared counter is protected to ensure thread-safe updates without races. Similarly, Java's synchronized blocks or methods acquire an intrinsic lock on an object, serializing RMW access; for instance, a synchronized block around reading, modifying, and writing a shared integer guarantees atomicity by enforcing exclusive execution. These library constructs abstract away low-level details, promoting portable software solutions for concurrent RMW in multithreaded applications.[42][47]
Theoretical Foundations
Consensus Number
The consensus number of a read–modify–write (RMW) primitive is defined as the largest integer n (or infinity if no such maximum exists) such that the primitive, in conjunction with read-write registers, can be used to implement a wait-free consensus protocol for n processes.[48] This measure, introduced by Maurice Herlihy, quantifies the synchronization power of the primitive by assessing its ability to solve the consensus problem—where processes must agree on a single value from their initial proposals—without relying on locks or blocking operations, even in the presence of arbitrary process speeds and failures.[48]
In his seminal 1991 paper, Herlihy established a foundational theorem stating that any nontrivial RMW operation on a register has a consensus number of at least 2, meaning it can solve two-process consensus but may or may not extend to higher numbers depending on the operation's semantics.[48] The proof involves constructing a simple two-process protocol using the RMW operation to achieve agreement, while demonstrating that read-write registers alone (with consensus number 1) cannot solve even two-process consensus due to the impossibility of distinguishing process proposals in asynchronous settings.[48] For classification, Herlihy further showed that certain RMW operations, such as test-and-set, have exactly consensus number 2, as they enable two-process consensus but cannot implement three-process consensus without additional primitives; the upper bound proof relies on constructing a bivalent initial state and showing that no sequence of operations can resolve it to a unanimous decision for three processes.[48] In contrast, stronger RMW operations like compare-and-swap achieve infinite consensus number, allowing wait-free consensus for any number of processes.[48]
Examples illustrate this hierarchy: read-write registers, lacking any modification capability beyond simple reads and writes, have consensus number 1 and thus provide no synchronization beyond single-process triviality.[48] Basic RMW primitives like test-and-set on a single bit, commonly used for mutual exclusion, also cap at 2, sufficient for binary agreement but inadequate for multi-process coordination without escalation.[48] More powerful RMW operations, such as compare-and-swap, reach infinity by enabling the implementation of universal constructions that simulate any shared object wait-free.[48] Similarly, abstract data types built atop RMW, like queues and stacks, inherit a consensus number of 2 when their operations (e.g., enqueue/dequeue or push/pop) are wait-free, as they can solve two-process consensus but face the same bivalency issues for three or more.[48]
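Herlihy's two-process construction is compact enough to sketch directly: each process publishes its proposal, then races on a single test-and-set; the winner decides its own value and the loser adopts the winner's. An illustrative C11 rendering (a sketch of the protocol, with hypothetical names, not library code):

```c
#include <stdatomic.h>

static long proposal[2];                     /* per-process published proposals */
static atomic_flag race = ATOMIC_FLAG_INIT;  /* the shared test-and-set object */

/* Wait-free two-process consensus: both callers return the same value. */
long decide(int me, long my_value)           /* me is 0 or 1 */
{
    proposal[me] = my_value;                 /* publish before racing */
    if (!atomic_flag_test_and_set(&race))
        return proposal[me];                 /* won the TAS: decide own value */
    return proposal[1 - me];                 /* lost: adopt the winner's value */
}
```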
The implications of consensus number extend to ensuring progress guarantees in distributed systems, as primitives with higher numbers (especially infinity) allow the construction of wait-free algorithms that tolerate asynchrony and failures without starvation, enabling robust implementations of complex synchronization tasks like locks or barriers for arbitrary process counts.[48] This framework underscores why selecting RMW primitives with sufficient consensus strength is critical for scalable, non-blocking concurrency in multiprocessor environments.[48]
Synchronization Hierarchy
The synchronization hierarchy organizes atomic primitives by their consensus numbers, a metric introduced by Maurice Herlihy to quantify the computational power of synchronization mechanisms in asynchronous shared-memory systems. At the base level, read-write registers have a consensus number of 1, supporting only basic coordination where processes can agree on initial values but cannot resolve contention among multiple participants. Nontrivial read-modify-write (RMW) operations, such as test-and-set (TAS) and fetch-and-add, advance to consensus number 2, allowing implementation of mutual exclusion and two-process consensus but falling short for larger process sets. At the pinnacle, advanced RMW primitives like compare-and-swap (CAS) and load-link/store-conditional (LL/SC) achieve infinite consensus numbers, enabling the construction of any wait-free shared object.[48][49]
Within this hierarchy, basic RMW primitives like swap and test-and-set demonstrate their limitations: they readily implement blocking mutual exclusion and solve wait-free consensus for two processes through simple spin-based protocols, but no algorithm built from them can guarantee wait-free agreement among three or more processes, because an adversarial scheduler can always keep the execution undecided. For instance, TAS can decide a winner between two contending threads by atomically setting a flag, but for n > 2 processes it cannot guarantee agreement on a single value, lacking the conditional-update capability of stronger operations such as CAS. This positioning underscores why RMW operations with consensus number 2 suffice for basic contention resolution but are inadequate for coordination tasks requiring universal solvability.
Universal constructions bridge gaps in the hierarchy by allowing primitives with a given consensus number k to implement any shared object with consensus number at most k; for basic RMW (k=2), this means building stacks or queues that tolerate two-process contention wait-free, but higher consensus demands stronger primitives like CAS. Herlihy's framework proves that no combination of weaker primitives can simulate a stronger one. This framework has been applied to the development of lock-free data structures, where RMW primitives with higher consensus numbers facilitate scalable, non-blocking algorithms for concurrent environments, as seen in implementations of queues and deques that avoid deadlock and livelock under high contention.[48][50]