Compare-and-swap
Compare-and-swap (CAS), also known as compare-and-exchange, is an atomic hardware instruction used in concurrent programming to safely update shared memory locations in multiprocessor systems. It compares the current value stored at a specified memory address with an expected value provided as input: if the two match, it atomically replaces the memory location's value with a new value, also provided as input, and reports success; if they do not match, it leaves the memory unchanged and typically reports failure.[1] This operation ensures indivisibility, preventing race conditions where multiple threads might interfere with each other's updates to the same variable.[2]
First introduced by IBM in the System/370 architecture in 1970, CAS provided a foundational synchronization primitive for operating systems like OS/360 and early UNIX ports, enabling efficient mutual exclusion without traditional locks in certain scenarios.[3] Over time, CAS has become a standard feature in modern processor architectures, including x86 (via the CMPXCHG instruction), ARM (via CAS and CASP), and others, supporting both single-word and double-word variants for 32-bit and 64-bit operations.[4] Its pseudocode is commonly represented as:
```
bool CAS(address, expected, new_value) {
    if (*address == expected) {
        *address = new_value;
        return true;
    }
    return false;
}
```
This simplicity belies its power in enabling non-blocking algorithms, where threads retry operations on failure rather than blocking, improving scalability in high-contention environments.[5]
In theoretical computer science, Maurice Herlihy's 1991 work demonstrated CAS's universality for wait-free synchronization, proving it sufficient to implement any shared object in a wait-free manner—guaranteeing progress for each thread independent of others—unlike weaker primitives like read-write registers or fetch-and-add.[6] This result has profoundly influenced the design of lock-free data structures, such as queues, stacks, and hash tables.[7] Extensions like multi-word CAS (MCAS) address limitations in updating multiple disjoint locations atomically, further expanding its applicability in advanced concurrent algorithms.[8] Despite its benefits, CAS can suffer from the ABA problem, where a value temporarily changes and reverts, leading to incorrect updates, often mitigated by hazard pointers or version tags.[9]
Fundamentals
Definition
Compare-and-swap (CAS) is an atomic primitive implemented as a single, indivisible CPU instruction or software-emulated operation that reads the current value from a specified memory location, compares it to an expected value, and conditionally stores a new value only if the current value matches the expected one.[10] This ensures the entire read-compare-write sequence occurs without interference from other threads, maintaining data consistency in concurrent environments.[11]
The operation requires three primary parameters: a pointer to the target memory address, the expected value (representing the anticipated current state), and the desired new value to write if the comparison succeeds.[10] Upon execution, CAS returns a boolean value indicating success—true if the comparison matched and the write occurred, or false if the memory value had changed in the interim.[12]
In synchronization contexts, CAS facilitates non-blocking (lock-free) concurrency by supporting optimistic update strategies, where threads attempt modifications assuming no conflicts and retry only upon failure, avoiding the overhead and potential deadlocks of traditional mutual exclusion locks.[10][12]
The compare-and-swap (CAS) operation performs an atomic read-modify-write on a shared memory location, proceeding in the following logical steps: first, it loads the current value from the specified memory address; second, it compares this value to an expected value provided as input; third, if the values match, it stores a new value at the address and indicates success; fourth, if they do not match, it leaves the memory unchanged and indicates failure.[13][14]
The following pseudocode illustrates the basic logic of CAS, where the operation takes a memory address, an expected value, and a new value as parameters:
```
function compareAndSwap(address, expected, newValue) returns boolean
    current ← load(address)        // Read current value from memory
    if current == expected then
        store(address, newValue)   // Write new value if match
        return true
    else
        return false
    end if
end function
```
In actual implementations, the entire sequence—reading, comparing, and conditionally writing—must execute atomically to ensure indivisibility and prevent concurrent interference.[13][14]
A non-atomic software emulation of this logic, such as the pseudocode above without hardware support, risks race conditions where concurrent threads could interleave reads and writes, leading to inconsistent states; achieving atomicity in software typically requires mechanisms like disabling interrupts, using memory barriers or fences, or wrapping in locks, whereas hardware implementations (e.g., via dedicated instructions) guarantee indivisibility at the processor level.[14]
The operation returns a boolean value indicating whether the swap succeeded (true if the expected value matched and was replaced, false otherwise), which enables algorithms to detect failures and retry in a loop until success, supporting non-blocking progress in concurrent settings.[13][14]
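To make the emulation caveat concrete, the following is a minimal sketch of lock-based CAS emulation in C, assuming POSIX threads; the function name emulated_cas and the single global lock are illustrative choices rather than a standard API:

```c
#include <stdbool.h>
#include <pthread.h>

// A single global lock guards all emulated CAS calls; a real system
// would typically shard locks by address to reduce contention.
static pthread_mutex_t cas_lock = PTHREAD_MUTEX_INITIALIZER;

// Emulates CAS in software: the mutex makes the read-compare-write
// sequence indivisible, but only with respect to other callers of
// emulated_cas, not to arbitrary stores to *address.
bool emulated_cas(int *address, int expected, int new_value) {
    bool success = false;
    pthread_mutex_lock(&cas_lock);
    if (*address == expected) {
        *address = new_value;
        success = true;
    }
    pthread_mutex_unlock(&cas_lock);
    return success;
}
```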
Applications
Lock-Free Data Structures
Compare-and-swap (CAS) enables the construction of lock-free data structures by providing an atomic primitive for optimistic concurrency control, where threads read shared state, compute updates assuming no interference, and use CAS to install changes only if the state remains unmodified.[13] This approach contrasts with lock-based synchronization, such as mutexes, which serialize access and can lead to deadlocks or priority inversion under contention, whereas lock-free methods ensure that at least one thread makes progress without blocking others.[15] In lock-free designs, failed CAS attempts trigger retries, promoting non-blocking progress in multiprocessor environments.
Common lock-free data structures leverage CAS for pointer manipulations in linked representations. For instance, Treiber's stack implements push and pop operations by atomically swapping the top pointer: a thread reads the current top, links a new node to it for push, or unlinks it for pop, and retries via CAS if concurrent modifications occur.[16] Similarly, the Michael-Scott queue uses CAS to update head and tail pointers in a singly-linked list, enabling lock-free enqueue and dequeue by swinging pointers and handling races through repeated attempts.[17] Hash tables, such as Harris's design, employ CAS for inserting or removing elements in per-bucket lock-free linked lists, maintaining consistency without global locks.[18]
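As an illustration of the pointer-swing pattern described above, here is a minimal sketch of a Treiber-style push using C11 atomics; the node and stack layouts are assumptions for the example, and a production pop would additionally need safe memory reclamation and ABA protection (discussed below):

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

// The stack is just an atomic pointer to the top node.
struct stack {
    _Atomic(struct node *) top;
};

// Lock-free push: link the new node to the observed top, then try to
// swing the top pointer to it; retry if another thread moved top first.
void push(struct stack *s, int value) {
    struct node *n = malloc(sizeof *n);
    n->value = value;
    n->next = atomic_load(&s->top);
    while (!atomic_compare_exchange_weak(&s->top, &n->next, n)) {
        // On failure, n->next has been updated with the current top,
        // so the loop simply retries with the fresh snapshot.
    }
}
```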
These structures offer scalability benefits on multi-core systems by minimizing contention; unlike coarse-grained locks that bottleneck at shared resources, CAS-based retries distribute load across cores, yielding higher throughput under high parallelism as demonstrated in queue benchmarks where performance scales near-linearly with threads.[17] Lock-freedom specifically guarantees system-wide progress—ensuring some thread completes its operation in finite steps amid contention—providing stronger assurances than obstruction-free methods, which only require progress in the absence of interference.
Example: Atomic Adder
One practical application of compare-and-swap (CAS) is implementing a thread-safe atomic increment on a shared counter variable, allowing multiple threads to update the counter without using locks and avoiding race conditions. In this scenario, threads concurrently attempt to increment a shared integer counter, such as tracking the number of processed items in a multithreaded application; CAS ensures that only one thread's update succeeds at a time, with others retrying transparently.[9]
The algorithm operates in a loop: a thread first reads the current value of the counter as the expected value, computes the new value by adding the increment (typically 1), and then attempts a CAS operation to replace the counter's value with the new one only if it still matches the expected value. If the CAS succeeds, the update is complete; if it fails—indicating another thread modified the counter in the interim—the thread discards the computation and retries by reading the updated value. This retry mechanism serializes the increments without blocking, leveraging the atomicity of CAS to detect and resolve conflicts.[9]
The following pseudocode illustrates a general atomic add operation using CAS, where address points to the shared counter and delta is the increment amount:
```
atomic_add(address, delta) {
    do {
        expected = *address;          // snapshot the current value
        new_val = expected + delta;   // compute the updated value
        // Retry until the swap succeeds; this CAS variant returns the
        // value found at address rather than a boolean, so equality
        // with expected signals success.
    } while (CAS(address, expected, new_val) != expected);
    return expected;                  // value just before our increment
}
```
Here, unlike the boolean variant shown earlier, the CAS primitive returns the value it found at address: the original expected value if the swap succeeds, or the differing current value if it fails, prompting another loop iteration. For a simple increment, set delta = 1. This implementation is lock-free: the system as a whole always makes progress, since a CAS failure implies some other thread's CAS succeeded, though individual threads may retry under contention.[9]
Consider a trace of two threads, Thread A and Thread B, both attempting to increment a shared counter initialized to 0:
- Both threads read the counter as expected = 0 and compute new_val = 1.
- Thread A performs CAS(counter, 0, 1), which succeeds since the value is still 0; the counter is now 1, and Thread A returns.
- Thread B's CAS(counter, 0, 1) fails because the counter is now 1 (not matching expected = 0); Thread B retries.
- On retry, Thread B reads expected = 1, computes new_val = 2, and performs CAS(counter, 1, 2), which succeeds; the counter is now 2, and Thread B returns.
This execution ensures the counter reaches 2 atomically, with increments applied in some sequential order despite concurrency.[9]
The operation is atomic because the CAS instruction executes indivisibly as a single hardware primitive, guaranteeing that no other thread can observe or interfere with the comparison and update midway; combined with the retry loop, it emulates a sequentially consistent increment equivalent to a locked ++ operation, preventing lost updates or races.[9]
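A runnable version of this scenario, sketched with C11 atomics and POSIX threads; the two-thread setup and the iteration count of 100000 are arbitrary choices for demonstration:

```c
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

_Atomic int counter = 0;

// Each thread performs 100000 CAS-based increments; failed attempts
// are retried with the refreshed value, so no increment is lost.
void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        int expected = atomic_load(&counter);
        while (!atomic_compare_exchange_weak(&counter, &expected, expected + 1)) {
            // expected now holds the latest value; retry
        }
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %d\n", atomic_load(&counter)); // always 200000
    return 0;
}
```

Were the loop replaced by a plain non-atomic counter++, interleaved read-modify-write sequences could lose updates and the final count would fall short of 200000.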
Challenges
ABA Problem
The ABA problem arises in compare-and-swap (CAS) operations when a thread reads a shared value A from a memory location, performs some computation, and later attempts a CAS to replace it with a new value, but another thread has meanwhile modified the location from A to B and then back to A, causing the CAS to succeed incorrectly despite the intervening change.[19] This issue stems from CAS relying solely on direct value comparison without accounting for the history of modifications to the location.[20] It is particularly prevalent in environments with memory reuse, such as garbage-collected languages or systems using object pools, where deallocated memory can be quickly reallocated with the same pointer value.[21]
A classic illustration occurs in lock-free linked list implementations, such as during node deletion. Suppose thread T1 reads a pointer to node N (value A) from the previous node's next field. While T1 prepares to unlink N, thread T2 removes N, freeing it, and possibly another thread reuses the same memory address for a new node with value A. When T1 executes CAS to update the previous node's next field, skipping over the now-reused N, the operation succeeds because the pointer still appears as A, potentially leading to incorrect linking and loss of subsequent nodes.[21] This scenario highlights how pointer-based structures exacerbate the problem, as addresses can cycle through reuse faster than threads complete operations.[20]
The impact of the ABA problem includes data corruption, such as lost updates or structural inconsistencies in data structures, and can even induce infinite loops in retry-based algorithms if the false success propagates undetected.[19] To detect and mitigate it, one common approach appends a version counter or tag to the value being compared, ensuring uniqueness even if the base value reappears; for instance, in a 64-bit architecture, the low 32 bits might hold the pointer or value, while the high 32 bits store an incrementing counter updated on each modification.[19] In Java, the AtomicStampedReference class implements this by pairing an object reference with an integer stamp, allowing CAS operations like compareAndSet to validate both the reference and stamp atomically, thus preventing false successes in scenarios like the linked list example.[22] This tagged approach maintains the efficiency of single-word CAS while addressing the historical oversight inherent in plain value comparisons.[19]
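The version-tag scheme described above can be sketched in C by packing a 32-bit value and a 32-bit counter into one 64-bit atomic word; the layout and the function tagged_cas are illustrative assumptions:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Low 32 bits hold the value; high 32 bits hold a version counter
// incremented on every successful update.
_Atomic uint64_t tagged = 0;

bool tagged_cas(uint32_t expected_value, uint32_t new_value) {
    uint64_t old = atomic_load(&tagged);
    if ((uint32_t)old != expected_value)
        return false;
    // The new word carries a bumped tag: an A->B->A sequence elsewhere
    // would have advanced the tag, so this CAS would fail even though
    // the 32-bit value has reverted to A.
    uint64_t tag = (old >> 32) + 1;
    uint64_t desired = (tag << 32) | new_value;
    return atomic_compare_exchange_strong(&tagged, &old, desired);
}
```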
Performance
Compare-and-swap (CAS) operations provide significant benefits in concurrent environments, particularly for uncontended access where they execute as a single atomic instruction with low latency, typically around 4-5 CPU cycles on modern Intel processors.[23] This efficiency stems from avoiding the overhead of acquiring and releasing locks, enabling better scalability on multi-core systems by permitting concurrent progress without serializing threads.[24] In read-heavy workloads, CAS-based algorithms often outperform traditional locking mechanisms, as they minimize blocking and support higher throughput under low conflict.[24]
However, under high contention, CAS incurs substantial costs due to the need for retry loops, where failed attempts require reloading values and spinning, which can consume hundreds of CPU cycles per operation—up to 200 cycles or more for remote cache accesses on Intel architectures.[23] This spin-waiting exacerbates cache line bouncing in shared memory systems, as frequent invalidations and coherency traffic degrade performance, especially in write-heavy scenarios where linearizability guarantees demand strict atomicity.[23] Benchmarks indicate that at moderate contention levels (e.g., around 50% success rate), CAS retries can waste 10-100 times more cycles than a successful lock acquisition, depending on the architecture and workload.[23]
Performance varies across architectures; for instance, recent 2020s benchmarks on ARM AArch64 processors like Fujitsu A64FX show CAS exhibiting quadratic latency scaling under high thread counts due to contention, often performing worse than x86 equivalents in such cases, though load-link/store-conditional alternatives can mitigate this on ARM.[25] CAS is ideal for low-conflict operations, such as counters or queues with infrequent updates, but developers should fall back to locks in high-contention environments to avoid excessive retry overhead.[24] In write-heavy workloads, the stronger consistency provided by CAS can lead to slower overall performance compared to coarser-grained locks, as retries enforce linearizability at the expense of throughput.[24]
Implementations
C and C++
In C and related languages, compare-and-swap (CAS) is implemented through standardized atomic libraries and compiler-specific intrinsics, enabling portable lock-free programming across platforms.
The C11 standard introduces the <stdatomic.h> header, which defines atomic types via the _Atomic qualifier and provides generic atomic operations, including atomic_compare_exchange_strong and atomic_compare_exchange_weak for portable CAS on atomic types. These functions atomically load the value from an atomic object, compare it to the value pointed to by the expected argument and, if they match, store a desired value and return true; otherwise they write the object's current value back into expected and return false.
A representative example is an atomic increment of a shared counter using the weak variant atomic_compare_exchange_weak, which may spuriously fail for performance reasons but must be used in a retry loop:
```c
#include <stdatomic.h>

_Atomic int counter = 0;

void increment(void) {
    int expected = atomic_load(&counter);
    // On failure, atomic_compare_exchange_weak writes the current value
    // back into expected, so the loop retries with a fresh snapshot and
    // no explicit reload is needed.
    while (!atomic_compare_exchange_weak(&counter, &expected, expected + 1))
        ;
}
```
This pattern ensures progress despite contention or spurious failures, with the weak form often compiling to more efficient instructions on hardware supporting it.[26]
In C++, the <atomic> header extends this with std::atomic<T>, offering compare_exchange_strong (guaranteed no spurious failure) and compare_exchange_weak (allows spurious failure for potential speedups in retry loops, such as lock-free queues). The weak variant is recommended for performance-critical loops where retries are expected anyway: on LL/SC architectures such as ARM it avoids the hidden inner loop that the strong variant needs to rule out spurious failure, while on x86 both typically compile to the same CMPXCHG instruction.[27]
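The same strong/weak distinction exists in C11's atomic_compare_exchange_strong and atomic_compare_exchange_weak; the sketch below contrasts a one-shot attempt, where a spurious failure would be indistinguishable from real contention, with a retry loop, where a spurious failure merely costs one iteration (the function names are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

_Atomic int flag = 0;

// One-shot claim of a flag: use the strong form, because a spurious
// weak failure here would wrongly report the flag as already taken.
bool try_claim(void) {
    int expected = 0;
    return atomic_compare_exchange_strong(&flag, &expected, 1);
}

// Retry loop: the weak form suffices, since a spurious failure just
// triggers one extra iteration.
void add(_Atomic int *target, int delta) {
    int expected = atomic_load(target);
    while (!atomic_compare_exchange_weak(target, &expected, expected + delta)) {
        // expected refreshed on failure; loop retries
    }
}
```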
For platform dependencies, GCC's legacy __sync_val_compare_and_swap intrinsic (superseded by the __atomic builtins introduced in GCC 4.7) provided a simple CAS but lacked memory-order control; its replacement, __atomic_compare_exchange_n, supports C11 semantics and explicit ordering. On Microsoft Visual C++, InterlockedCompareExchange delivers a 32-bit CAS with an implicit full barrier, suitable for Windows threading.[28][29]
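For comparison, a sketch of the same retry loop written against the GCC builtin mentioned above; the argument list follows the documented __atomic_compare_exchange_n signature (pointer, expected in/out, desired, weak flag, success order, failure order), and the function name is illustrative:

```c
#include <stdbool.h>

// GCC/Clang builtin form of the CAS retry loop.
void gcc_atomic_add(int *target, int delta) {
    int expected = __atomic_load_n(target, __ATOMIC_RELAXED);
    while (!__atomic_compare_exchange_n(target, &expected, expected + delta,
                                        true,               // weak variant
                                        __ATOMIC_ACQ_REL,   // order on success
                                        __ATOMIC_ACQUIRE))  // order on failure
        ;  // expected was refreshed with the current value; retry
}
```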
Best practices emphasize memory ordering: use memory_order_acquire on loads and memory_order_release on successful CAS stores to synchronize dependent operations between threads without the cost of memory_order_seq_cst full barriers, which impose a single total order. Common pitfalls include forgetting that a failed compare-exchange updates the expected value, so the desired value must be recomputed on each iteration, and defaulting to the strong variant inside retry loops, which on LL/SC hardware adds a needless inner loop under contention.[30][27]
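In C11 this ordering advice maps onto the _explicit variants; below is a minimal sketch of a release-on-success publish, where the relaxed failure order is permitted because it is no stronger than the success order (the slot and message types are assumptions for the example):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct msg { int payload; };
_Atomic(struct msg *) slot = NULL;

// Publish a message only if the slot is empty. Release ordering on
// success makes the payload write visible to any consumer that loads
// the slot with acquire ordering; the failure path needs no ordering.
bool publish(struct msg *m) {
    struct msg *expected = NULL;
    return atomic_compare_exchange_strong_explicit(
        &slot, &expected, m,
        memory_order_release,   // ordering if the CAS succeeds
        memory_order_relaxed);  // ordering if it fails
}
```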
Hardware-Level Support
The compare-and-swap (CAS) operation is natively supported in the x86 architecture through the CMPXCHG instruction, an atomic compare-and-exchange introduced in the Intel 80486 processor in 1989.[31] This instruction compares the value in the accumulator (AL, AX, EAX, or RAX depending on operand size) with a memory operand and, if equal, replaces it with a source value while setting the zero flag accordingly; a LOCK prefix ensures atomicity across multi-core systems.[32] Extensions followed with CMPXCHG8B for 64-bit operations in the original Pentium processor in 1993, enabling atomic swaps on doublewords essential for 64-bit concurrency primitives.[33] Further evolution included CMPXCHG16B in the x86-64 extensions, specified by AMD in 2000, which supports 128-bit atomic operations for advanced synchronization in 64-bit environments; the earliest AMD64 processors shipped without it, so software must check a dedicated CPUID flag (CX16) before relying on it.[34]
In the ARM architecture, direct single-word CAS was long absent as a primitive instruction; instead, the Load-Exclusive (LDREX or LDXR) and Store-Exclusive (STREX or STXR) pair, introduced in ARMv6 (announced 2001, first implemented 2002), emulates CAS through a reservation-based mechanism that detects intervening modifications between load and store.[35] True hardware CAS arrived with the Large System Extensions of ARMv8.1 (announced 2014), which added the CAS instruction family, including CASP for paired compare-and-swap on two registers (128 bits in total), facilitating lock-free algorithms on doublewords without exclusive monitors.[36]
Other architectures provide analogous hardware support: PowerPC has employed the Load Word And Reserve Indexed (lwarx) and Store Word Conditional Indexed (stwcx.) instructions since the PowerPC 601 in 1993 to implement CAS via reservation, ensuring atomicity for 32- and 64-bit operations in a manner similar to ARM's exclusive access. RISC-V's A (atomics) extension, part of the base ISA specification ratified in 2019, uses Load-Reserved/Store-Conditional (LR/SC) loops for CAS emulation, with Atomic Memory Operations (AMOs) like atomic read-modify-write providing related primitives; the Zacas extension, ratified in November 2023, introduces direct amocas.w, amocas.d, and amocas.q instructions for explicit 32-, 64-, and 128-bit CAS to reduce loop overhead in software.[37] By the early 2000s, hardware CAS or equivalent support had become ubiquitous in 64-bit CPUs across x86, ARM, and PowerPC, driven by the rise of multi-core processors and the need for scalable concurrency, and later ISAs such as RISC-V followed suit.[38]
Hardware atomicity for CAS is enforced through mechanisms like bus locking in early designs or, more efficiently, cache coherence protocols such as MESI (Modified, Exclusive, Shared, Invalid), which ensure a core gains exclusive ownership of the cache line before performing the read-compare-write sequence, preventing interference from other cores.[39] Under MESI, the atomic operation transitions the line to the Modified state, serializing access via snooping or directory-based coherence to guarantee visibility and consistency across cores without full bus locking in modern implementations.[40] Post-2020 developments, such as RISC-V's Zacas, address gaps in direct CAS support while maintaining compatibility with existing coherence models.[41]
Extensions
Load-Link/Store-Conditional
Load-link/store-conditional (LL/SC) is a synchronization primitive consisting of two instructions designed to enable atomic read-modify-write operations in load-store architectures. The load-link (LL) instruction retrieves the current value from a specified memory address into a register while establishing a hardware reservation on that address, marking it for exclusive monitoring. The subsequent store-conditional (SC) instruction attempts to write a new value from a register back to the same address, succeeding and updating the memory only if no other processor or thread has performed a store to that address since the LL; in case of failure, the SC does not modify memory and typically returns a failure indicator to the issuing register, prompting a retry. This mechanism relies on hardware enforcement of the reservation, often through cache coherence protocols or dedicated monitors, ensuring atomicity without locking the entire system.[42][43]
LL/SC provides functional equivalence to the compare-and-swap (CAS) operation, allowing CAS to be implemented using a retry loop: an LL loads the expected value, which is compared locally to determine the desired update, followed by an SC to apply the change if the reservation remains valid. If the SC fails, the loop reloads via LL and retries, ensuring the update occurs atomically relative to the original load. Unlike CAS, which compares only values and therefore cannot observe a modification that has since been undone, LL/SC inherently mitigates the ABA problem by failing the SC upon any intervening store to the address, irrespective of whether the value has reverted to its prior state. This detection of modifications at the address level, rather than the value level, makes LL/SC particularly robust for pointer-based structures where value recycling could otherwise lead to incorrect behavior.[44]
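The retry-loop construction can be sketched as follows, using hypothetical load_linked and store_conditional wrappers that stand in for hardware instructions such as ARM's LDXR/STXR; no such portable C functions exist, so the sketch is purely illustrative:

```c
#include <stdbool.h>

// Hypothetical wrappers for the hardware instructions; they are
// declared here only so the sketch type-checks.
extern int  load_linked(int *address);                    // e.g. LDXR
extern bool store_conditional(int *address, int value);   // e.g. STXR

// CAS built from LL/SC: the store-conditional fails if ANY store hit
// the address after the load-link, which is why this construction is
// inherently ABA-free.
bool ll_sc_cas(int *address, int expected, int new_value) {
    do {
        int current = load_linked(address);
        if (current != expected)
            return false;  // value differs: report failure, no store
    } while (!store_conditional(address, new_value)); // lost reservation: retry
    return true;
}
```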
Several architectures provide native LL/SC support, including MIPS with its LL and SC instructions for 32- or 64-bit words, ARM through variants like Load-Exclusive (LDREX or LDAXR) and Store-Exclusive (STREX or STLXR), and PowerPC via Load Word And Reserve Indexed (lwarx) and Store Word Conditional Indexed (stwcx.). These instructions are typically restricted to cached, aligned memory accesses to leverage coherence hardware efficiently. In software ecosystems, LL/SC underpins atomic operations in the Java platform's sun.misc.Unsafe class, which emulates CAS on LL/SC-capable architectures like ARM and PowerPC by wrapping the instructions in retry loops for portable lock-free programming. The Linux kernel similarly employs LL/SC on ARM and PowerPC to realize generic atomic primitives, such as atomic increments and exchanges, ensuring cross-architecture consistency in concurrent code.[42][43][45]
Compared to CAS, LL/SC offers advantages in ABA avoidance and alignment with RISC principles by separating the load and store phases, potentially enabling more flexible compiler optimizations. However, it introduces disadvantages, including overhead from reservation metadata (such as per-core exclusive monitors or reservation bits in caches) that tracks addresses and can consume additional hardware resources. Additionally, LL/SC loops risk livelock from spurious SC failures, often triggered by store forwarding within the same core or unrelated coherence events that invalidate the reservation without an actual conflicting store, necessitating backoff strategies for progress guarantees. Historically, LL/SC originated in a 1987 proposal by Jensen, Hagensen, and Broughton for the S-1 AAP multiprocessor project at Lawrence Livermore National Laboratory, serving as an early foundation for non-blocking synchronization in shared-memory systems and influencing subsequent RISC designs before the widespread adoption of fused CAS instructions.[46]
Multi-Word Variants
Multi-word variants of compare-and-swap (CAS) extend the atomic primitive to multiple memory locations, enabling compound updates that maintain consistency in concurrent environments without intermediate states visible to other threads. The double compare-and-swap (DCAS) operation, which atomically compares the contents of two distinct memory locations against expected values and replaces both with new values if the comparison succeeds, was proposed in the mid-1990s as a foundation for non-blocking synchronization techniques in operating system structures.[47] This extension addresses limitations of single-word CAS in scenarios requiring simultaneous updates to related data, such as pointer pairs in linked structures.
Implementations of DCAS and broader multi-word CAS (MCAS) primarily use software emulation due to sparse hardware support. Lock-based software methods ensure atomicity but incur blocking overhead and reduce scalability under contention. Lock-free emulations built atop single-word CAS, such as the descriptor-based approach for arbitrary N-word operations, provide practical alternatives with disjoint-access parallelism, though they require careful management of helper threads and retry mechanisms.[48] On the hardware side, instructions like x86's CMPXCHG16B offer double-width CAS for 128-bit operands—effectively two adjacent 64-bit words—but apply only to packed data, not arbitrary non-contiguous locations, and are available only on processors supporting 64-bit extensions.[49]
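The adjacent-word restriction can be illustrated at half width: two related 32-bit fields packed into one 64-bit atomic can be updated by a single CAS, the same trick CMPXCHG16B enables for two adjacent 64-bit words; the head/tail naming is an illustrative assumption:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Head and tail indices of a bounded ring, packed into one word so
// both can be updated by a single CAS, mimicking double-width CAS.
_Atomic uint64_t head_tail = 0;

bool update_both(uint32_t old_head, uint32_t old_tail,
                 uint32_t new_head, uint32_t new_tail) {
    uint64_t expected = ((uint64_t)old_head << 32) | old_tail;
    uint64_t desired  = ((uint64_t)new_head << 32) | new_tail;
    // Succeeds only if neither field changed since they were read.
    return atomic_compare_exchange_strong(&head_tail, &expected, desired);
}
```

Note that this only works because the two fields share one naturally aligned machine word; atomically updating two arbitrary, non-adjacent locations is exactly the case that requires DCAS/MCAS emulation.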
DCAS facilitates atomic operations in concurrent data structures, such as updating both head and tail pointers in double-ended queues (deques) to avoid races during enqueue or dequeue.[50] It also supports consistent hash table modifications by atomically swapping key-value pairs, preventing partial updates that could lead to inconsistencies.[51] Generic transactional memory libraries, including Microsoft's Persistent Multi-word Compare-and-Swap (PMwCAS), employ MCAS to enable lock-free atomic updates across multiple variables, particularly in non-volatile memory systems for durable, concurrent programming.
Key challenges in multi-word variants include the rarity of dedicated hardware support, limited to specialized instructions on select CPUs like those with x86 extensions, which restricts performance and portability. Scalability degrades beyond 2-4 words due to heightened contention in emulations and exponential growth in state validation complexity. Progress in the 2020s leverages Intel's Transactional Synchronization Extensions (TSX) to implement MCAS via hardware transactions, reducing aborts from retry-heavy single-CAS sequences by providing all-or-nothing multi-word atomicity, albeit with a 10-15% performance overhead in lock-free algorithms.[52]