
Read–modify–write

In computing, particularly in processor architecture and low-level programming, a read–modify–write (RMW) operation is a sequence or hardware-supported instruction that reads a value from a memory location or register, performs a specified modification on it, and writes the resulting value back to the same location. This idiom is fundamental in embedded systems and device drivers, where it allows selective modification of bits within a register—such as clearing or setting specific fields—while preserving the rest of the value, thereby avoiding disruptions to system functionality controlled by the untouched bits. In ARM assembly, for instance, RMW typically involves loading the value into a general-purpose register, applying bitwise operations such as BIC (bit clear) or ORR (bitwise OR), and then writing the modified value back.

In concurrent and parallel programming, RMW operations are often implemented atomically to support synchronization in multiprocessor environments: the entire read–modify–write cycle executes indivisibly, preventing race conditions and ensuring consistency across threads or cores. Hardware mechanisms, such as the LOCK prefix in x86 processors, enforce this atomicity by locking the bus or cache line for the duration of the operation, serializing access to the memory location and enabling reliable inter-processor communication. On modern x86 processors, atomicity is guaranteed for aligned operands up to 64 bits, provided they reside in cacheable memory types such as write-back (WB) or write-through (WT).

Notable examples of atomic RMW primitives include exchange (XCHG), compare-and-exchange (CMPXCHG), fetch-and-add (XADD), and the bit test-and-modify instructions (BTS, BTR, BTC), which underpin synchronization structures such as spinlocks, semaphores, and mutexes in operating systems and multithreaded applications. These operations are vital for building higher-level concurrency controls, as they provide the indivisible building blocks needed for safe shared-memory access in scalable parallel systems.

Definition and Basic Concepts

Definition

In computing, a read–modify–write (RMW) operation consists of a sequence in which a value is read from a specific memory location, a modification is applied to that value based on its content, and the resulting modified value is written back to the same location. The process is fundamental to low-level hardware and software interactions, particularly in environments involving shared memory. The primary purpose of RMW operations is to facilitate updates that depend on the current state of the data being modified, such as incrementing a shared counter or toggling a status flag in a concurrent system. These operations allow programs to perform state-dependent changes efficiently without requiring complex locking mechanisms in simple cases, though they are often implemented with hardware support to ensure reliability. For correctness in multi-threaded or multi-processor environments, the entire RMW sequence must execute atomically, appearing indivisible to concurrent operations and thereby preventing inconsistencies such as interleaved modifications. Without this atomicity, race conditions can arise, where one operation's read sees a value that another operation alters before the first write completes. RMW operations emerged in early systems of the 1960s and 1970s to manage shared physical resources such as memory and peripherals in multiprogramming environments. Their formalization in the concurrent-programming literature occurred in the late 1980s and early 1990s, with key contributions defining correctness criteria such as linearizability to ensure consistency in highly concurrent settings.

Common Examples

One common read–modify–write operation is the increment of a shared counter, often used to track the number of active threads or processed items in concurrent programs. In this pattern, a thread reads the current value N from the shared counter, computes the new value N + 1, and writes it back to the same location. Without atomicity, concurrent executions can lead to lost updates, where one thread's increment is overwritten by another thread working from a stale read. The following pseudocode illustrates a non-atomic increment:
read counter into temp  // temp = counter
temp = temp + 1         // Modify
counter = temp          // Write back
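The same three-step sequence, and its atomic fix, can be sketched in C++ — a minimal illustration, not tied to any particular codebase. `std::atomic::fetch_add` performs the entire read–modify–write as one indivisible hardware operation:

```cpp
#include <atomic>

// Non-atomic version: three separate steps, vulnerable to lost updates
// if another thread writes `counter` between the read and the write.
int increment_unsafe(int& counter) {
    int temp = counter;   // Read
    temp = temp + 1;      // Modify
    counter = temp;       // Write back
    return temp;
}

// Atomic version: the hardware performs read, add, and write as one
// indivisible step; fetch_add returns the value before the addition.
int increment_atomic(std::atomic<int>& counter) {
    return counter.fetch_add(1) + 1;
}
```

Both functions compute the same result in isolation; the difference only matters under concurrent access, where the atomic version cannot lose an update.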
This pattern appears frequently in multithreaded applications that maintain counts. Another typical pattern is flag toggling, employed in simple synchronization primitives such as busy-waiting locks and status indicators. Here, a thread reads the current state of a bit (e.g., 0 for unlocked, 1 for locked), inverts it based on a condition (such as acquiring the lock), and writes the updated bit back. Interleaving threads may cause inconsistent states, such as multiple acquisitions of the same lock. Pseudocode for toggling a lock flag to acquire it:
read flag into temp     // temp = flag (0 or 1)
if temp == 0 then       // Condition check
    temp = 1            // Set to locked
flag = temp             // Write back
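The race-free form of this flag toggle is the hardware test-and-set. A minimal C++ sketch using `std::atomic_flag` (the class name `SpinLock` is illustrative):

```cpp
#include <atomic>

// Minimal spinlock: test_and_set atomically reads the flag and sets it,
// returning the previous value, so only one thread can observe `false`
// and thereby enter the critical section.
class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // Busy-wait: another thread currently holds the lock.
        }
    }
    bool try_lock() {
        return !flag_.test_and_set(std::memory_order_acquire);
    }
    void unlock() { flag_.clear(std::memory_order_release); }
};
```

Unlike the pseudocode above, the check and the write happen in a single indivisible step, so two threads can never both see the flag as 0 and both "acquire" the lock.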
Such operations are foundational in software synchronization, though they highlight the need for atomicity to prevent races. Accumulator updates represent a broader class of read–modify–write sequences, commonly seen in statistics gathering and performance accounting, where values are aggregated across multiple threads. A thread reads the current total from a shared accumulator, adds a delta (e.g., an elapsed time or event count), and writes the new sum back. Concurrent additions without synchronization can result in undercounting due to overlapping modifications. Pseudocode for an accumulator update:
read total into temp    // temp = total
temp = temp + delta     // Add contribution
total = temp            // Write back
This pattern is prevalent in performance monitoring and distributed counters, emphasizing the risks of non-atomic aggregation.
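For accumulator types without a direct atomic add (e.g., `double` before C++20), the update can be made atomic with a compare-and-swap retry loop. A sketch under that assumption:

```cpp
#include <atomic>

// Atomic accumulation of a floating-point delta via a compare-and-swap
// retry loop: if another thread changed `total` between our read and our
// conditional write, the CAS fails, reloads the value, and we retry, so
// no contribution is ever lost.
void atomic_add(std::atomic<double>& total, double delta) {
    double old_val = total.load();
    // compare_exchange_weak stores old_val + delta only if `total` still
    // equals old_val; on failure it updates old_val with the fresh value.
    while (!total.compare_exchange_weak(old_val, old_val + delta)) {
        // Loop retries with the newly observed old_val.
    }
}
```

The loop makes the aggregation linearizable at the successful CAS, at the cost of retries under heavy contention.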

Applications and Contexts

Processor Instructions

In multi-core processors, read–modify–write (RMW) sequences interact closely with cache coherence protocols such as MESI (Modified, Exclusive, Shared, Invalid), which manage shared data access across cores. Under MESI, a core initiating an RMW must transition the cache line to the Modified state to gain exclusive ownership, invalidating shared copies in other caches to avoid stale data during the modification phase. This process serializes access, enabling consistent updates but introducing latency from cache-to-cache transfers. Non-atomic RMW operations in pipelined execution pose significant pitfalls, as modern CPU pipelines—featuring out-of-order execution and speculative fetching—can interleave reads and writes from concurrent threads. For example, one thread's load might capture a value, only for another thread's store to land between the subsequent modification and write-back, resulting in lost updates or inconsistent states observable across cores. Such interleaving violates the intended atomicity of the sequence, exacerbating issues in multi-threaded workloads where pipelines process instructions from multiple streams simultaneously. Basic load-modify-store sequences in assembly language illustrate non-atomic RMW without hardware guarantees. In x86, a simple increment of a memory operand performed with separate instructions looks like this:
mov eax, [mem]  ; Load value into register
inc eax         ; Modify the value
mov [mem], eax  ; Store back to memory
This trio is not atomic; intervening operations from other cores can occur between the load and the store, leading to race conditions. In ARM architectures, a comparable non-atomic increment uses load, arithmetic, and store instructions:
ldr r0, [r1]    ; Load value from address in r1
add r0, r0, #1  ; Modify by adding 1
str r0, [r1]    ; Store result back
Without exclusive monitors or memory barriers, this sequence allows interleaving, compromising data integrity in shared-memory multi-core systems. The performance implications of non-atomic RMW are pronounced in high-throughput scenarios such as vector processing, where SIMD instructions operate on multiple data elements in parallel but often lack atomic guarantees for memory operations. Non-atomic stores can result in torn writes—partial updates visible to other cores—incurring overhead from the software retries or barriers required to maintain correctness, potentially reducing throughput in contended multi-core environments. This overhead contrasts with scalar atomic primitives and highlights the scalability challenges of parallel workloads without dedicated hardware support.

Storage Systems

In storage systems, read–modify–write (RMW) operations are essential for maintaining parity consistency in RAID levels 4, 5, and 6, where partial writes require updating parity information to preserve redundancy. These levels use block-level striping with dedicated or distributed parity to enable recovery from disk failures, but writing less than a full stripe necessitates RMW to recompute parity without disrupting the array's integrity. In RAID 4, parity is stored on a dedicated disk per group, limiting write parallelism due to contention on that disk during updates. The RMW process for partial writes in these RAID levels involves reading the existing data block and its associated parity (or parities), modifying the data, recalculating the parity using exclusive-OR (XOR) operations on the delta between old and new data, and then writing the updated data and parity back to the array. For RAID 5, which distributes parity across all disks in the stripe, this scheme allows multiple simultaneous partial writes, improving efficiency over RAID 4 by eliminating the single parity-disk bottleneck; a small write typically requires two reads (old data and old parity) and two writes (new data and new parity). In RAID 6, which employs dual distributed parities (P and Q, where Q often uses a more complex scheme such as Reed–Solomon coding for second-failure tolerance), the process extends to three reads (old data, old P, and old Q) and three writes (new data, new P, and new Q) for a single-block partial-stripe update, ensuring consistency across both parity sets. Performing this RMW sequence atomically prevents parity inconsistencies that could lead to data corruption if a failure occurs mid-operation. By enforcing atomicity in parity updates during partial writes, RMW in RAID 4, 5, and 6 enhances fault tolerance in distributed storage environments, allowing the array to detect and correct errors from single (RAID 4/5) or dual (RAID 6) disk failures without requiring full stripe rewrites.
For instance, consistent parity enables reconstruction of lost data blocks from the surviving members of a stripe, mitigating risks in scenarios such as power failures or concurrent I/O in large-scale systems. This mechanism is particularly valuable in environments with frequent small writes, such as database or virtualized-storage workloads, where it balances redundancy with performance overhead. The concept of RMW for parity in RAID originated in the late 1980s as part of early redundant-array designs, with RAID levels 4 and 5 formalized in the seminal 1988 paper by Patterson, Gibson, and Katz at UC Berkeley, which highlighted the need for such operations to address the "small write problem" in inexpensive disk arrays. RAID 6 emerged as an extension in the 1990s to handle growing disk capacities and dual-failure risks, building on these foundations with dual-parity schemes that further leverage RMW for enhanced reliability.
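The small-write parity computation described above can be sketched directly: the new parity is the old parity XORed with the delta between old and new data, so the other blocks in the stripe need not be read. A toy illustration with a hypothetical 4-byte block type:

```cpp
#include <array>
#include <cstdint>
#include <cstddef>

// RAID-5 small-write parity update: parity' = parity ^ (old_data ^ new_data).
using Block = std::array<std::uint8_t, 4>;  // toy block size for illustration

Block update_parity(const Block& old_data, const Block& new_data,
                    const Block& old_parity) {
    Block new_parity{};
    for (std::size_t i = 0; i < old_parity.size(); ++i) {
        // XOR in the delta between old and new data, byte by byte.
        new_parity[i] = old_parity[i] ^ (old_data[i] ^ new_data[i]);
    }
    return new_parity;
}
```

The invariant being preserved is that parity always equals the XOR of all data blocks in the stripe, which is what makes reconstruction of a failed block possible.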

I/O Operations

In input/output (I/O) operations, read–modify–write (RMW) sequences encounter unique challenges due to the architecture of peripheral registers, which often serve dual purposes for reading and writing, potentially leading to inconsistencies. Many I/O peripherals, such as those in embedded systems, maintain separate internal paths for output latching and input sensing, meaning a read operation captures the current state of external pins or signals rather than the previously written latch value. This separation can result in stale data during the modify phase, as external factors such as bus activity or electrical noise may alter the pin states between the read and write steps. For instance, in universal asynchronous receiver-transmitter (UART) status registers, reading to check flags (e.g., overrun or framing errors) reflects the instantaneous state, but any subsequent write to clear or modify those flags risks incorporating outdated information if the hardware updates the register asynchronously. These mismatches frequently produce unexpected results in RMW operations on buffers or control registers. When modifying a shared register—such as one associated with a transmit or receive FIFO in a serial peripheral—the read phase might retrieve a snapshot that includes prior transmissions, but the write phase could apply changes to state that has since been altered by ongoing activity, such as incoming bytes overwriting buffer slots. This discrepancy arises because I/O devices often do not guarantee atomicity between read and write accesses to the same logical register; the write may target a shadow latch or control path disconnected from the read path, leading to partial updates or lost modifications. A classic example occurs in bit-level operations on I/O ports, where attempting to toggle a single control bit (e.g., enabling a UART interrupt) via RMW can inadvertently corrupt adjacent bits if the read captures transient input from connected peripherals. To mitigate these RMW challenges in I/O drivers, developers employ polling or interrupt-driven sequences to enforce temporal consistency and verify device states before modifications.
In polling-based approaches, the driver repeatedly reads status registers until a stable condition is confirmed (e.g., a buffer-empty or ready flag is set), ensuring the subsequent RMW operates on current, non-stale data and minimizing race-like inconsistencies from device timing. Interrupt-driven methods complement this by triggering CPU intervention only when the device signals completion via hardware interrupts, allowing the driver to sequence reads and writes in a synchronized manner without continuous CPU overhead. These techniques are essential for maintaining data integrity in embedded environments. Historical cases in early personal computers and embedded systems highlight the prevalence of RMW pitfalls in I/O operations. In 1980s and 1990s PC-compatible systems, programming the parallel (LPT) and serial (COM) ports required careful handling of RMW to avoid corrupting control lines, as documented in technical references noting that direct port I/O instructions such as IN and OUT could yield inconsistent results on bidirectional ports due to pin-state reads. Embedded controllers, such as those in early industrial automation, faced similar issues with UART and GPIO ports, where unmitigated RMW led to communication failures, prompting driver guidelines that emphasized full-port writes or status polling. These problems were widely addressed in era-specific hardware manuals, influencing modern driver-design practices.
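The full-port-write guideline mentioned above is commonly implemented with a "shadow register": the driver keeps a RAM copy of the last value it wrote, does its read–modify–write on that copy, and pushes the whole byte to the device, never reading the (unreliable) pin state. A minimal sketch, where `hw_port` is a stand-in for a memory-mapped output register and the `PortDriver` name is hypothetical:

```cpp
#include <cstdint>

// Shadow-register technique: RMW happens on the RAM copy, so transient
// pin states captured by a hardware read can never corrupt the result.
struct PortDriver {
    volatile std::uint8_t* hw_port;  // memory-mapped output register (assumed)
    std::uint8_t shadow = 0;         // last value written to the device

    void set_bits(std::uint8_t mask) {
        shadow |= mask;      // Modify the shadow, not the live pins
        *hw_port = shadow;   // Write the whole byte back to the device
    }
    void clear_bits(std::uint8_t mask) {
        shadow &= static_cast<std::uint8_t>(~mask);
        *hw_port = shadow;
    }
};
```

On real hardware `hw_port` would point at the device's output register; the technique assumes no other code writes the port behind the shadow's back.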

Challenges and Hazards

Race Conditions

In multi-threaded environments, non-atomic read–modify–write (RMW) operations can lead to race conditions when multiple threads interleave their accesses to shared data, causing the final state to depend on unpredictable scheduling order. Specifically, two or more threads read the same initial value, perform independent modifications, and then write back, resulting in lost updates where one thread's change overwrites another's. For instance, if two threads both read a shared counter value of N, each increments it to N+1, and both write N+1, the counter ends at N+1 instead of the expected N+2. Race conditions in RMW sequences manifest in two primary types: write-write races, where concurrent writes to the same location overwrite prior modifications without coordination, and read-write races, which produce inconsistent views of data for subsequent operations. Write-write races directly cause lost updates in structures such as counters, while read-write races can lead to threads operating on stale data, amplifying errors in dependent computations. The impact of these races is severe, often resulting in corruption of shared structures such as queues or counters, where invariants like count accuracy or queue integrity are violated. In circular-buffer implementations, for example, non-atomic RMW on index pointers can cause overflows or underflows, leading to lost elements or incorrect dequeuing. Such bugs are typically nondeterministic "heisenbugs," manifesting sporadically based on timing and thus complicating debugging. Detection of RMW-related race conditions relies on tools such as thread sanitizers, which instrument code at compile time to monitor memory accesses and flag concurrent unsynchronized reads and writes to shared variables at run time. These tools, such as ThreadSanitizer integrated into compilers like GCC and Clang, provide dynamic analysis that identifies data races without requiring specific test cases, though they introduce performance overhead.
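Because the bad interleaving is timing-dependent, it is easiest to see by replaying the schedule by hand. The following sketch simulates two threads' private temporaries deterministically and reproduces the lost update described above:

```cpp
// Deterministic replay of the lost-update interleaving: both "threads"
// read the same initial value before either writes, so one increment
// is silently discarded.
struct SimThread { int temp = 0; };  // models a thread's private register

int lost_update_demo(int counter) {
    SimThread t1, t2;
    t1.temp = counter;   // T1 reads N
    t2.temp = counter;   // T2 reads the same N (interleaved before T1 writes)
    t1.temp += 1;        // T1 computes N + 1
    t2.temp += 1;        // T2 computes N + 1
    counter = t1.temp;   // T1 writes N + 1
    counter = t2.temp;   // T2 overwrites with N + 1: one update is lost
    return counter;      // N + 1, not the expected N + 2
}
```

Starting from 5, two increments leave the counter at 6 rather than 7 — exactly the write-write race an atomic RMW prevents.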

Device-Specific Hazards

In direct memory access (DMA) controllers, timing hazards arise from delays between the read and write phases of a read–modify–write (RMW) sequence, potentially allowing concurrent DMA operations to alter the target location before the write completes. For instance, if a CPU executes an RMW on a memory location while a DMA controller writes to the same address during the interval between the CPU's read and write, the DMA's modification may be overwritten and lost, leading to data corruption or inconsistent system state. This issue is exacerbated in high-throughput environments where DMA transfers occur frequently, such as systems processing high-bandwidth data streams. Cache coherence problems in non-uniform memory access (NUMA) systems can manifest during RMW operations on remote memory, where inconsistencies between processor caches arise from varying access latencies and coherence-protocol overheads. In cache-coherent NUMA (ccNUMA) architectures, accesses to remote nodes may observe stale cache data if coherence protocols do not promptly invalidate or snoop affected lines, potentially causing incorrect value propagation across nodes. Such discrepancies are particularly pronounced in I/O-intensive workloads involving peripheral devices. Unintended side effects in device operations during RMW sequences often stem from hardware-specific behaviors, such as triggering interrupts or altering peripheral state midway through the operation. For example, on I/O ports configured as outputs, RMW instructions on port registers can produce unexpected pin-level changes due to capacitive loading and device clock speeds, where the read phase captures transient pin states that do not match the intended values during the write. Similarly, interrupts may be lost if one is generated during an RMW on a status register, as the operation's read phase might clear pending flags prematurely without acknowledging the new event. Case studies highlight these hazards in modern peripherals.
In graphics processing units (GPUs), intensive RMW-like access patterns—such as repeated row activations in GDDR memory—can exploit Rowhammer-style vulnerabilities, inducing bit flips in adjacent cells through electrical interference and compromising data integrity in compute workloads. This was demonstrated on GPUs post-2020, where attackers could corrupt machine-learning models by hammering GPU memory with high-frequency read-write cycles, revealing the risks of such access patterns in shared environments. For network interface cards (NICs), RMW operations on descriptor rings shared between host and device can lead to inconsistencies if not properly synchronized, potentially resulting in packet-processing errors in high-speed Ethernet designs. Additionally, non-atomic RMW on multi-byte data can result in torn reads or writes, where only part of the data is updated, leaving invalid intermediate states, especially on architectures without support for atomic multi-word operations.

Solutions for Atomicity

Hardware Primitives

Hardware primitives for read–modify–write operations are low-level instructions designed to perform atomic updates on memory locations, ensuring that the read, modification, and write phases occur indivisibly from the perspective of other processors. These primitives form the foundation for synchronization in multiprocessor systems by preventing race conditions during concurrent access. Common examples include test-and-set, fetch-and-add, compare-and-swap (CAS), and load-link/store-conditional (LL/SC), each providing varying degrees of flexibility for implementing lock-free algorithms. The test-and-set instruction atomically reads the value from a memory location and sets it to 1 (or a non-zero value), returning the original value to indicate whether the location was previously unset. This primitive, seminal for mutual exclusion, was introduced in the IBM System/360 mainframe architecture in the 1960s to support multiprocessor synchronization in early multiprogramming environments. It remains supported in modern IBM z/Architecture descendants and in x86 via instructions such as BTS (bit test and set) with the LOCK prefix. Fetch-and-add, another read–modify–write primitive, atomically adds a specified value to a memory location and returns the prior value, enabling efficient increments without full locks. Popularized by the NYU Ultracomputer project in the 1980s for handling concurrent memory references, it is implemented in x86 through the XADD instruction (introduced with the 80486 processor in 1989) and in ARM via LDADD in AArch64. Compare-and-swap (CAS) atomically compares the contents of a memory location with an expected value and, if they match, replaces it with a new value, returning success or failure. This versatile primitive, introduced in the IBM System/370 in 1970 as an advancement over simpler test-and-set for process parallelism, underpins many non-blocking data structures, and its theoretical power was formalized by Maurice Herlihy in 1991. In CISC architectures like x86, CAS is realized via CMPXCHG prefixed with LOCK (available since the 80486).
RISC architectures such as ARM traditionally rely on LL/SC for equivalent functionality, though AArch64 later added direct CAS instructions, including CASP for paired operations. Load-link/store-conditional (LL/SC) works by "linking" a load operation to an address, establishing a reservation, and allowing a conditional store only if no other processor has modified the location in between; failure returns a flag so the operation can be retried. Proposed in the 1970s for the S-1 multiprocessor at Lawrence Livermore National Laboratory, LL/SC is native to RISC designs such as ARM (via LDXR/STXR, with reservation granules of 8–2048 bytes) and MIPS, and it offers an advantage over CAS by detecting the ABA problem, in which a value reverts to the original between read and write. These primitives achieve atomicity through hardware mechanisms that exploit cache coherence protocols and bus arbitration. In x86, the LOCK prefix historically asserted the LOCK# signal, locking the system bus to prevent other processors from accessing memory during the operation; modern implementations optimize this by locking only the affected cache line via MESI-like protocols, reducing contention. RISC architectures such as ARM and MIPS favor LL/SC, where the load-link sets a reservation in the cache and the store-conditional checks, via coherence traffic, that no intervening write occurred, succeeding only under exclusive ownership. The evolution of these supports traces to 1970s mainframes, where IBM pioneered atomic instructions such as compare-and-swap amid the shift to multiprocessing; by the 1980s and 1990s, RISC designs emphasized separate load and store operations for LL/SC to align with load-store paradigms, while CISC families such as x86 integrated RMW directly into LOCK-prefixed instructions. Performance of these primitives varies by architecture and contention but prioritizes low latency in the uncontended case to enable scalable concurrency.
On modern x86 processors such as Skylake, a locked increment (fetch-and-add by 1) costs roughly 20 cycles on an L1 cache hit, rising to 100+ cycles on remote cache misses due to coherence overhead, while CAS exhibits similar single-instruction latency but a higher effective cost in retry loops under contention. In ARM AArch64, LL/SC pairs achieve comparable uncontended latencies (roughly 10–50 cycles) via exclusive monitors, though reservation granularity affects scalability compared with x86's direct RMW instructions. These costs, though elevated relative to non-atomic operations, establish critical context for their use in high-throughput synchronization, with optimizations in recent processors narrowing the gap to plain read–modify–write sequences.
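CAS is general enough to synthesize RMW operations the hardware does not provide directly. A sketch building a fetch-and-multiply (a hypothetical primitive, chosen because no common ISA offers it) from C++'s `compare_exchange_weak`:

```cpp
#include <atomic>

// CAS as a universal RMW building block: any read-modify-write can be
// expressed as a compare-and-swap retry loop over the desired function.
int fetch_and_multiply(std::atomic<int>& target, int factor) {
    int observed = target.load();
    // Retry until no other thread modified `target` between our read
    // and our conditional write; on failure, `observed` is refreshed.
    while (!target.compare_exchange_weak(observed, observed * factor)) {
    }
    return observed;  // value before the multiplication, like XADD's result
}
```

On x86 the loop compiles down to CMPXCHG with a LOCK prefix; on pre-CAS ARM it would map to an LDXR/STXR retry loop with the same structure.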

Software Techniques

Software techniques for achieving atomicity in read–modify–write (RMW) operations rely on higher-level abstractions provided by operating systems, programming languages, or custom algorithms, rather than on low-level hardware instructions alone. These methods ensure that the read, modification, and write phases of an RMW sequence execute as a single indivisible unit by serializing access to shared data among concurrent threads. By wrapping the RMW operation within a protected critical section, software techniques prevent interleaving that could lead to inconsistent states, such as lost updates in counters or corrupted data structures. This approach is portable across architectures but may introduce overhead from context switches or busy-waiting, depending on the implementation. Locking mechanisms, such as mutexes and spinlocks, are foundational software techniques for serializing RMW sequences. A mutex (mutual exclusion lock) allows only one thread to hold the lock and execute the RMW operation at a time; other threads block until the lock is released, ensuring the entire sequence completes atomically. For example, in an atomic increment of a shared counter, a thread locks the mutex, reads the current value, adds one, writes it back, and unlocks, preventing partial overlaps. Mutexes are provided by libraries such as POSIX threads (pthreads), where pthread_mutex_lock and pthread_mutex_unlock supply this serialization, typically backed by operating-system primitives for efficient blocking. Spinlocks offer an alternative for short RMW operations: a thread repeatedly polls (spins) in a tight loop until the lock becomes available, avoiding the overhead of thread suspension. This busy-waiting approach serializes access through a shared flag that threads update atomically before entering the RMW critical section. Spinlocks are particularly useful in kernel or real-time environments where latency is critical, as seen in the Linux kernel's spinlock implementation for protecting short code paths. However, prolonged spinning wastes CPU cycles, making spinlocks suitable only for low-contention scenarios.
Transactional memory provides an optimistic software-based approach to RMW atomicity, treating sequences of operations as transactions that execute in isolation. Software transactional memory (STM) systems buffer reads and writes in a private log during the transaction; upon commit, they validate that no conflicting modifications occurred and apply the changes atomically, or abort and retry on conflict. This enables RMW blocks, such as updates to multiple related variables (e.g., an account balance and an audit log), to appear atomic without explicit locking, simplifying concurrent programming for complex data structures. Seminal work on STM, including static-transaction implementations, demonstrated its viability for lock-free concurrency by leveraging conflict detection to resolve races dynamically. Classic software algorithms such as Dekker's and Peterson's achieve RMW atomicity through mutual exclusion without hardware atomics, using only reads, writes, and shared flags for two threads. Dekker's algorithm, the first software solution to the mutual exclusion problem, employs two boolean flags and a turn indicator to coordinate entry into a critical section containing the RMW operation, ensuring progress and bounded waiting while preventing simultaneous access. It works by having each thread signal intent via its flag and yield based on the turn, serializing the RMW to maintain consistency. Peterson's algorithm refines this for simplicity, using a single turn variable and flags whereby each thread sets its own flag and grants priority to the other, guaranteeing mutual exclusion for the RMW sequence under minimal assumptions about memory-access atomicity. These algorithms laid the groundwork for software synchronization and remain applicable in environments lacking atomic primitives. Standard libraries integrate these techniques for practical RMW atomicity. In C++, for example, mutexes wrap RMW operations such as increments, protecting a shared counter to ensure thread-safe updates without races.
Similarly, Java's synchronized blocks or methods acquire an intrinsic lock on an object, serializing RMW access; for instance, a synchronized block around reading, modifying, and writing a shared field guarantees atomicity by enforcing exclusive execution. These library constructs abstract away low-level details, providing portable software solutions for concurrent RMW in multithreaded applications.
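A mutex-serialized RMW analogous to the pthread and Java constructs above can be sketched in C++ (the `Account` type and `deposit` method are illustrative names):

```cpp
#include <mutex>

// Mutex-serialized RMW: the lock_guard holds the mutex for the entire
// read-modify-write, so concurrent callers observe the sequence as atomic.
struct Account {
    std::mutex m;
    long balance = 0;

    long deposit(long amount) {
        std::lock_guard<std::mutex> guard(m);  // acquire; released on return
        long current = balance;   // Read
        current += amount;        // Modify
        balance = current;        // Write back
        return current;
    }
};
```

The RAII `lock_guard` releases the mutex on every exit path, avoiding the unlock-omission bugs that plague manual lock/unlock pairs.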

Theoretical Foundations

Consensus Number

The consensus number of a read–modify–write (RMW) primitive is defined as the largest integer n (or infinity if no such maximum exists) such that the primitive, in conjunction with read-write registers, can be used to implement a wait-free consensus protocol for n processes. This measure, introduced by Maurice Herlihy, quantifies the synchronization power of the primitive by assessing its ability to solve the consensus problem—in which processes must agree on a single value chosen from their initial proposals—without relying on locks or blocking operations, even in the presence of arbitrary process speeds and failures. In his seminal 1991 paper, Herlihy established a foundational theorem stating that any nontrivial RMW operation on a shared register has a consensus number of at least 2, meaning it can solve two-process consensus but may or may not extend to higher numbers depending on the operation's semantics. The proof constructs a simple two-process protocol using the RMW operation to achieve agreement, while demonstrating that read-write registers alone (with consensus number 1) cannot solve even two-process consensus, owing to the impossibility of distinguishing process proposals in asynchronous executions. For classification, Herlihy further showed that certain RMW operations, such as test-and-set, have consensus number exactly 2: they enable two-process consensus but cannot implement three-process consensus without stronger primitives; the upper-bound proof relies on constructing a bivalent initial state and showing that no sequence of operations can resolve it to a decision for three processes. In contrast, stronger RMW primitives such as compare-and-swap achieve infinite consensus number, allowing wait-free consensus for any number of processes. Examples illustrate this hierarchy: read-write registers, lacking any modification capability beyond simple reads and writes, have consensus number 1 and thus provide no synchronization power beyond the single-process trivial case. Basic RMW primitives such as test-and-set on a single bit, commonly used for locking, also cap at 2, sufficient for binary agreement between two processes but inadequate for multi-process coordination without escalation.
More powerful RMW operations, such as compare-and-swap, reach infinity by enabling universal constructions that simulate any shared object wait-free. Similarly, abstract data types built atop RMW, such as queues and stacks, have consensus number 2 when their operations (e.g., enqueue/dequeue or push/pop) are wait-free, as they can solve two-process consensus but face the same bivalency obstacle for three or more processes. The implications of consensus number extend to progress guarantees in distributed systems: primitives with higher consensus numbers (especially infinite ones) allow the construction of wait-free algorithms that tolerate asynchrony and failures without blocking, enabling robust implementations of coordination tasks for arbitrary process counts. This framework underscores why selecting RMW primitives with sufficient consensus strength is critical for scalable, non-blocking concurrency in multiprocessor environments.
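The two-process protocol behind "test-and-set has consensus number at least 2" is short enough to sketch. Each process publishes its proposal, then races on an atomic flag; the winner decides its own value and the loser adopts the winner's. This is a Herlihy-style construction with illustrative names (`TasConsensus`, `propose`), not library API:

```cpp
#include <atomic>

// Two-process consensus from test-and-set: proposals are published
// before the race, so the loser can safely read the winner's value.
struct TasConsensus {
    int proposals[2] = {0, 0};
    std::atomic_flag decided = ATOMIC_FLAG_INIT;

    int propose(int id, int value) {   // id is 0 or 1
        proposals[id] = value;          // publish own proposal first
        if (!decided.test_and_set()) {
            return value;               // won the race: decide own value
        }
        return proposals[1 - id];       // lost: adopt the other's proposal
    }
};
```

Both processes return the same value (the first test-and-set winner's), satisfying agreement and validity; the same trick provably cannot be extended to three processes, which is exactly the consensus-number-2 bound.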

Synchronization Hierarchy

The synchronization hierarchy organizes atomic primitives by their consensus numbers, a metric introduced by Maurice Herlihy to quantify the power of synchronization mechanisms in asynchronous shared-memory systems. At the base level, read-write registers have a consensus number of 1, supporting only basic coordination: processes can communicate values but cannot resolve contention among multiple participants. Nontrivial read–modify–write (RMW) operations, such as test-and-set (TAS) and fetch-and-add, advance to consensus number 2, allowing implementation of mutual exclusion and two-process consensus but falling short for larger process sets. At the pinnacle, RMW primitives such as compare-and-swap (CAS) and load-link/store-conditional (LL/SC) achieve infinite consensus number, enabling the construction of any wait-free shared object. Within this hierarchy, basic RMW primitives like swap demonstrate their limitations by supporting mutual exclusion—effectively solving consensus for two processes through simple spin-based protocols—but they cannot extend to three or more processes without risking non-termination under adversarial scheduling. For instance, test-and-set can enforce exclusive access to a critical section for two threads by atomically setting a flag, but attempts to scale this to n > 2 processes fail to guarantee agreement on a single value among all participants, as the primitive lacks the conditional-update capability of stronger operations. This positioning underscores why RMW operations with consensus number 2 suffice for basic contention resolution but are inadequate for coordination tasks requiring universal solvability. Universal constructions bridge gaps in the hierarchy by allowing primitives with consensus number k to implement any shared object whose consensus number is at most k; for basic RMW (k = 2), this means building wait-free stacks or queues that tolerate two-process contention, while higher consensus demands stronger primitives such as CAS. Herlihy's impossibility results prove that no combination of weaker primitives can simulate a stronger one.
This hierarchy has been applied to the development of lock-free data structures, where RMW primitives with higher consensus numbers facilitate scalable, non-blocking algorithms for concurrent environments, as seen in implementations of queues and deques that avoid deadlock and livelock under high contention.
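A concrete instance of such a CAS-based lock-free structure is the Treiber stack, in which push and pop each read the head, build a new state locally, and retry a CAS until no other thread has interfered. The sketch below reuses a lock-simulated CAS reference (the `AtomicRef` helper is hypothetical, standing in for a hardware primitive that pure Python lacks).

```python
import threading

class AtomicRef:
    """Simulated atomic reference with compare-and-swap."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def cas(self, expected, new):
        """Install `new` only if the reference still holds `expected` (by identity)."""
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

class Node:
    def __init__(self, item, next_node):
        self.item, self.next = item, next_node

class TreiberStack:
    """Lock-free stack: each operation is a read, a local modification,
    and a CAS that retries if another thread changed the head in between."""
    def __init__(self):
        self.head = AtomicRef(None)

    def push(self, item):
        while True:
            old = self.head.load()
            if self.head.cas(old, Node(item, old)):
                return

    def pop(self):
        while True:
            old = self.head.load()
            if old is None:
                return None          # stack is empty
            if self.head.cas(old, old.next):
                return old.item

s = TreiberStack()
for i in range(3):
    s.push(i)
popped = [s.pop(), s.pop(), s.pop()]  # LIFO order: [2, 1, 0]
```

Because a failed CAS simply retries with a fresh read of the head, no thread ever blocks holding a lock, so the structure cannot deadlock; under contention some threads may retry repeatedly, but system-wide progress is guaranteed (lock freedom).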
