Compare-and-swap
Compare-and-swap (CAS), also known as compare-and-exchange, is an atomic hardware instruction used in concurrent programming to safely update shared memory locations in multiprocessor systems. It compares the current value stored at a specified memory address with an expected value provided as input: if the two match, it atomically replaces the memory location's value with a new value, also provided as input, and reports success; if they do not match, it leaves the memory unchanged and typically reports failure.[1] This operation ensures indivisibility, preventing race conditions where multiple threads might interfere with each other's updates to the same variable.[2]
First introduced by IBM in the System/370 architecture in 1970, CAS provided a foundational synchronization primitive for operating systems like OS/360 and early UNIX ports, enabling efficient mutual exclusion without traditional locks in certain scenarios.[3] Over time, CAS has become a standard feature in modern processor architectures, including x86 (via the CMPXCHG instruction), ARM (via CAS and CASP), and others, supporting both single-word and double-word variants for 32-bit and 64-bit operations.[4] Its pseudocode is commonly represented as:
```
bool CAS(address, expected, new_value) {
    if (*address == expected) {
        *address = new_value;
        return true;
    }
    return false;
}
```
This simplicity belies its power in enabling non-blocking algorithms, where threads retry operations on failure rather than blocking, improving scalability in high-contention environments.[5]
In theoretical computer science, Maurice Herlihy's 1991 work demonstrated CAS's universality for wait-free synchronization, proving it sufficient to implement any shared object in a wait-free manner—guaranteeing progress for each thread independent of others—unlike weaker primitives like read-write registers or fetch-and-add.[6] This result has profoundly influenced the design of lock-free data structures, such as queues, stacks, and hash tables.[7] Extensions like multi-word CAS (MCAS) address limitations in updating multiple disjoint locations atomically, further expanding its applicability in advanced concurrent algorithms.[8] Despite its benefits, CAS can suffer from the ABA problem, where a value temporarily changes and reverts, leading to incorrect updates, often mitigated by hazard pointers or version tags.[9]
Fundamentals
Definition
Compare-and-swap (CAS) is an atomic primitive implemented as a single, indivisible CPU instruction or software-emulated operation that reads the current value from a specified memory location, compares it to an expected value, and conditionally stores a new value only if the current value matches the expected one.[10] This ensures the entire read-compare-write sequence occurs without interference from other threads, maintaining data consistency in concurrent environments.[11]
The operation requires three primary parameters: a pointer to the target memory address, the expected value (representing the anticipated current state), and the desired new value to write if the comparison succeeds.[10] Upon execution, CAS returns a boolean value indicating success—true if the comparison matched and the write occurred, or false if the memory value had changed in the interim.[12]
In synchronization contexts, CAS facilitates non-blocking (lock-free) concurrency by supporting optimistic update strategies, where threads attempt modifications assuming no conflicts and retry only upon failure, avoiding the overhead and potential deadlocks of traditional mutual exclusion locks.[10][12]
The compare-and-swap (CAS) operation performs an atomic read-modify-write on a shared memory location, proceeding in the following logical steps: first, it loads the current value from the specified memory address; second, it compares this value to an expected value provided as input; third, if the values match, it stores a new value at the address and indicates success; fourth, if they do not match, it leaves the memory unchanged and indicates failure.[13][14]
The following pseudocode illustrates the basic logic of CAS, where the operation takes a memory address, an expected value, and a new value as parameters:
```
function compareAndSwap(address, expected, newValue) returns boolean
    current ← load(address)        // Read current value from memory
    if current == expected then
        store(address, newValue)   // Write new value if match
        return true
    else
        return false
    end if
end function
```
In actual implementations, the entire sequence—reading, comparing, and conditionally writing—must execute atomically to ensure indivisibility and prevent concurrent interference.[13][14]
A non-atomic software emulation of this logic, such as the pseudocode above without hardware support, risks race conditions where concurrent threads could interleave reads and writes, leading to inconsistent states; achieving atomicity in software typically requires mechanisms like disabling interrupts, using memory barriers or fences, or wrapping in locks, whereas hardware implementations (e.g., via dedicated instructions) guarantee indivisibility at the processor level.[14]
The operation returns a boolean value indicating whether the swap succeeded (true if the expected value matched and was replaced, false otherwise), which enables algorithms to detect failures and retry in a loop until success, supporting non-blocking progress in concurrent settings.[13][14]
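To make the emulation caveat concrete, the following is a minimal sketch of lock-based CAS emulation in C, assuming POSIX threads; the function name emulated_cas and the single global lock are illustrative choices rather than a standard API:

```c
#include <stdbool.h>
#include <pthread.h>

// A single global lock guards all emulated CAS calls; a real system
// would typically shard locks by address to reduce contention.
static pthread_mutex_t cas_lock = PTHREAD_MUTEX_INITIALIZER;

// Emulates CAS in software: the mutex makes the read-compare-write
// sequence indivisible, but only with respect to other callers of
// emulated_cas, not to arbitrary stores to *address.
bool emulated_cas(int *address, int expected, int new_value) {
    bool success = false;
    pthread_mutex_lock(&cas_lock);
    if (*address == expected) {
        *address = new_value;
        success = true;
    }
    pthread_mutex_unlock(&cas_lock);
    return success;
}
```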
Applications
Lock-Free Data Structures
Compare-and-swap (CAS) enables the construction of lock-free data structures by providing an atomic primitive for optimistic concurrency control, where threads read shared state, compute updates assuming no interference, and use CAS to install changes only if the state remains unmodified.[13] This approach contrasts with lock-based synchronization, such as mutexes, which serialize access and can lead to deadlocks or priority inversion under contention, whereas lock-free methods ensure that at least one thread makes progress without blocking others.[15] In lock-free designs, failed CAS attempts trigger retries, promoting non-blocking progress in multiprocessor environments.
Common lock-free data structures leverage CAS for pointer manipulations in linked representations. For instance, Treiber's stack implements push and pop operations by atomically swapping the top pointer: a thread reads the current top, links a new node to it for push, or unlinks it for pop, and retries via CAS if concurrent modifications occur.[16] Similarly, the Michael-Scott queue uses CAS to update head and tail pointers in a singly-linked list, enabling lock-free enqueue and dequeue by swinging pointers and handling races through repeated attempts.[17] Hash tables, such as Harris's design, employ CAS for inserting or removing elements in per-bucket lock-free linked lists, maintaining consistency without global locks.[18]
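As an illustration of the pointer-swing pattern described above, here is a minimal sketch of a Treiber-style push using C11 atomics; the node and stack layouts are assumptions for the example, and a production pop would additionally need safe memory reclamation and ABA protection (discussed below):

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

// The stack is just an atomic pointer to the top node.
struct stack {
    _Atomic(struct node *) top;
};

// Lock-free push: link the new node to the observed top, then try to
// swing the top pointer to it; retry if another thread moved top first.
void push(struct stack *s, int value) {
    struct node *n = malloc(sizeof *n);
    n->value = value;
    n->next = atomic_load(&s->top);
    while (!atomic_compare_exchange_weak(&s->top, &n->next, n)) {
        // On failure, n->next has been updated with the current top,
        // so the loop simply retries with the fresh snapshot.
    }
}
```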
These structures offer scalability benefits on multi-core systems by minimizing contention; unlike coarse-grained locks that bottleneck at shared resources, CAS-based retries distribute load across cores, yielding higher throughput under high parallelism as demonstrated in queue benchmarks where performance scales near-linearly with threads.[17] Lock-freedom specifically guarantees system-wide progress—ensuring some thread completes its operation in finite steps amid contention—providing stronger assurances than obstruction-free methods, which only require progress in the absence of interference.
Example: Atomic Adder
One practical application of compare-and-swap (CAS) is implementing a thread-safe atomic increment on a shared counter variable, allowing multiple threads to update the counter without using locks and avoiding race conditions. In this scenario, threads concurrently attempt to increment a shared integer counter, such as tracking the number of processed items in a multithreaded application; CAS ensures that only one thread's update succeeds at a time, with others retrying transparently.[9]
The algorithm operates in a loop: a thread first reads the current value of the counter as the expected value, computes the new value by adding the increment (typically 1), and then attempts a CAS operation to replace the counter's value with the new one only if it still matches the expected value. If the CAS succeeds, the update is complete; if it fails—indicating another thread modified the counter in the interim—the thread discards the computation and retries by reading the updated value. This retry mechanism serializes the increments without blocking, leveraging the atomicity of CAS to detect and resolve conflicts.[9]
The following pseudocode illustrates a general atomic add operation using CAS, where address points to the shared counter and delta is the increment amount:
```
atomic_add(address, delta) {
    do {
        expected = *address;          // snapshot the current value
        new_val = expected + delta;   // compute the updated value
        // Retry until the swap succeeds; this CAS variant returns the
        // value found at address rather than a boolean, so equality
        // with expected signals success.
    } while (CAS(address, expected, new_val) != expected);
    return expected;                  // value just before our increment
}
```
Here, unlike the boolean variant shown earlier, the CAS primitive returns the value it found at address: the original expected value if the swap succeeds, or the differing current value if it fails, prompting another loop iteration. For a simple increment, set delta = 1. This implementation is lock-free: the system as a whole always makes progress, since a CAS failure implies some other thread's CAS succeeded, though individual threads may retry under contention.[9]
Consider a trace of two threads, Thread A and Thread B, both attempting to increment a shared counter initialized to 0:
- Both threads read the counter as expected = 0 and compute new_val = 1.
- Thread A performs CAS(counter, 0, 1), which succeeds since the value is still 0; the counter is now 1, and Thread A returns.
- Thread B's CAS(counter, 0, 1) fails because the counter is now 1 (not matching expected = 0); Thread B retries.
- On retry, Thread B reads expected = 1, computes new_val = 2, and performs CAS(counter, 1, 2), which succeeds; the counter is now 2, and Thread B returns.
This execution ensures the counter reaches 2 atomically, with increments applied in some sequential order despite concurrency.[9]
The operation is atomic because the CAS instruction executes indivisibly as a single hardware primitive, guaranteeing that no other thread can observe or interfere with the comparison and update midway; combined with the retry loop, it emulates a sequentially consistent increment equivalent to a locked ++ operation, preventing lost updates or races.[9]
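A runnable version of this scenario, sketched with C11 atomics and POSIX threads; the two-thread setup and the iteration count of 100000 are arbitrary choices for demonstration:

```c
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

_Atomic int counter = 0;

// Each thread performs 100000 CAS-based increments; failed attempts
// are retried with the refreshed value, so no increment is lost.
void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        int expected = atomic_load(&counter);
        while (!atomic_compare_exchange_weak(&counter, &expected, expected + 1)) {
            // expected now holds the latest value; retry
        }
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %d\n", atomic_load(&counter)); // always 200000
    return 0;
}
```

Were the loop replaced by a plain non-atomic counter++, interleaved read-modify-write sequences could lose updates and the final count would fall short of 200000.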
Challenges
ABA Problem
The ABA problem arises in compare-and-swap (CAS) operations when a thread reads a shared value A from a memory location, performs some computation, and later attempts a CAS to replace it with a new value, but another thread has meanwhile modified the location from A to B and then back to A, causing the CAS to succeed incorrectly despite the intervening change.[19] This issue stems from CAS relying solely on direct value comparison without accounting for the history of modifications to the location.[20] It is particularly prevalent in environments with memory reuse, such as garbage-collected languages or systems using object pools, where deallocated memory can be quickly reallocated with the same pointer value.[21]
A classic illustration occurs in lock-free linked list implementations, such as during node deletion. Suppose thread T1 reads a pointer to node N (value A) from the previous node's next field. While T1 prepares to unlink N, thread T2 removes N, freeing it, and possibly another thread reuses the same memory address for a new node with value A. When T1 executes CAS to update the previous node's next field, skipping over the now-reused N, the operation succeeds because the pointer still appears as A, potentially leading to incorrect linking and loss of subsequent nodes.[21] This scenario highlights how pointer-based structures exacerbate the problem, as addresses can cycle through reuse faster than threads complete operations.[20]
The impact of the ABA problem includes data corruption, such as lost updates or structural inconsistencies in data structures, and can even induce infinite loops in retry-based algorithms if the false success propagates undetected.[19] To detect and mitigate it, one common approach appends a version counter or tag to the value being compared, ensuring uniqueness even if the base value reappears; for instance, in a 64-bit architecture, the low 32 bits might hold the pointer or value, while the high 32 bits store an incrementing counter updated on each modification.[19] In Java, the AtomicStampedReference class implements this by pairing an object reference with an integer stamp, allowing CAS operations like compareAndSet to validate both the reference and stamp atomically, thus preventing false successes in scenarios like the linked list example.[22] This tagged approach maintains the efficiency of single-word CAS while addressing the historical oversight inherent in plain value comparisons.[19]
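The version-tag scheme described above can be sketched in C by packing a 32-bit value and a 32-bit counter into one 64-bit atomic word; the layout and the function tagged_cas are illustrative assumptions:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Low 32 bits hold the value; high 32 bits hold a version counter
// incremented on every successful update.
_Atomic uint64_t tagged = 0;

bool tagged_cas(uint32_t expected_value, uint32_t new_value) {
    uint64_t old = atomic_load(&tagged);
    if ((uint32_t)old != expected_value)
        return false;
    // The new word carries a bumped tag: an A->B->A sequence elsewhere
    // would have advanced the tag, so this CAS would fail even though
    // the 32-bit value has reverted to A.
    uint64_t tag = (old >> 32) + 1;
    uint64_t desired = (tag << 32) | new_value;
    return atomic_compare_exchange_strong(&tagged, &old, desired);
}
```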
Performance
Compare-and-swap (CAS) operations provide significant benefits in concurrent environments, particularly for uncontended access where they execute as a single atomic instruction with low latency, typically around 4-5 CPU cycles on modern Intel processors.[23] This efficiency stems from avoiding the overhead of acquiring and releasing locks, enabling better scalability on multi-core systems by permitting concurrent progress without serializing threads.[24] In read-heavy workloads, CAS-based algorithms often outperform traditional locking mechanisms, as they minimize blocking and support higher throughput under low conflict.[24]
However, under high contention, CAS incurs substantial costs due to the need for retry loops, where failed attempts require reloading values and spinning, which can consume hundreds of CPU cycles per operation—up to 200 cycles or more for remote cache accesses on Intel architectures.[23] This spin-waiting exacerbates cache line bouncing in shared memory systems, as frequent invalidations and coherency traffic degrade performance, especially in write-heavy scenarios where linearizability guarantees demand strict atomicity.[23] Benchmarks indicate that at moderate contention levels (e.g., around 50% success rate), CAS retries can waste 10-100 times more cycles than a successful lock acquisition, depending on the architecture and workload.[23]
Performance varies across architectures; for instance, recent 2020s benchmarks on ARM AArch64 processors like Fujitsu A64FX show CAS exhibiting quadratic latency scaling under high thread counts due to contention, often performing worse than x86 equivalents in such cases, though load-link/store-conditional alternatives can mitigate this on ARM.[25] CAS is ideal for low-conflict operations, such as counters or queues with infrequent updates, but developers should fall back to locks in high-contention environments to avoid excessive retry overhead.[24] In write-heavy workloads, the stronger consistency provided by CAS can lead to slower overall performance compared to coarser-grained locks, as retries enforce linearizability at the expense of throughput.[24]
Implementations
C and C++
In C and related languages, compare-and-swap (CAS) is implemented through standardized atomic libraries and compiler-specific intrinsics, enabling portable lock-free programming across platforms.
The C11 standard introduces the <stdatomic.h> header, which defines atomic types via the _Atomic qualifier and provides generic atomic operations, including atomic_compare_exchange_strong and atomic_compare_exchange_weak for portable CAS on atomic types. These functions atomically load the value from an atomic object, compare it to the value pointed to by the expected argument and, if they match, store a desired value and return true; otherwise they write the object's current value back into expected and return false.
A representative example is an atomic increment of a shared counter using the weak variant atomic_compare_exchange_weak, which may spuriously fail for performance reasons but must be used in a retry loop:
```c
#include <stdatomic.h>

_Atomic int counter = 0;

void increment(void) {
    int expected = atomic_load(&counter);
    // On failure, atomic_compare_exchange_weak writes the current value
    // back into expected, so the loop retries with a fresh snapshot and
    // no explicit reload is needed.
    while (!atomic_compare_exchange_weak(&counter, &expected, expected + 1))
        ;
}
```
This pattern ensures progress despite contention or spurious failures, with the weak form often compiling to more efficient instructions on hardware supporting it.[26]
In C++, the <atomic> header extends this with std::atomic<T>, offering compare_exchange_strong (guaranteed no spurious failure) and compare_exchange_weak (allows spurious failure for potential speedups in retry loops, such as lock-free queues). The weak variant is recommended for performance-critical loops where retries are expected anyway: on LL/SC architectures such as ARM it avoids the hidden inner loop that the strong variant needs to rule out spurious failure, while on x86 both typically compile to the same CMPXCHG instruction.[27]
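The same strong/weak distinction exists in C11's atomic_compare_exchange_strong and atomic_compare_exchange_weak; the sketch below contrasts a one-shot attempt, where a spurious failure would be indistinguishable from real contention, with a retry loop, where a spurious failure merely costs one iteration (the function names are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

_Atomic int flag = 0;

// One-shot claim of a flag: use the strong form, because a spurious
// weak failure here would wrongly report the flag as already taken.
bool try_claim(void) {
    int expected = 0;
    return atomic_compare_exchange_strong(&flag, &expected, 1);
}

// Retry loop: the weak form suffices, since a spurious failure just
// triggers one extra iteration.
void add(_Atomic int *target, int delta) {
    int expected = atomic_load(target);
    while (!atomic_compare_exchange_weak(target, &expected, expected + delta)) {
        // expected refreshed on failure; loop retries
    }
}
```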
For platform dependencies, GCC's legacy __sync_val_compare_and_swap intrinsic (superseded by the __atomic builtins introduced in GCC 4.7) provided a simple CAS but lacked memory-order control; its replacement, __atomic_compare_exchange_n, supports C11 semantics and explicit ordering. On Microsoft Visual C++, InterlockedCompareExchange delivers a 32-bit CAS with an implicit full barrier, suitable for Windows threading.[28][29]
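For comparison, a sketch of the same retry loop written against the GCC builtin mentioned above; the argument list follows the documented __atomic_compare_exchange_n signature (pointer, expected in/out, desired, weak flag, success order, failure order), and the function name is illustrative:

```c
#include <stdbool.h>

// GCC/Clang builtin form of the CAS retry loop.
void gcc_atomic_add(int *target, int delta) {
    int expected = __atomic_load_n(target, __ATOMIC_RELAXED);
    while (!__atomic_compare_exchange_n(target, &expected, expected + delta,
                                        true,               // weak variant
                                        __ATOMIC_ACQ_REL,   // order on success
                                        __ATOMIC_ACQUIRE))  // order on failure
        ;  // expected was refreshed with the current value; retry
}
```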
Best practices emphasize memory ordering: use memory_order_acquire on loads and memory_order_release on successful CAS stores to synchronize dependent operations between threads without the cost of memory_order_seq_cst full barriers, which impose a single total order. Common pitfalls include forgetting that a failed compare-exchange updates the expected value, so the desired value must be recomputed on each iteration, and defaulting to the strong variant inside retry loops, which on LL/SC hardware adds a needless inner loop under contention.[30][27]
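In C11 this ordering advice maps onto the _explicit variants; below is a minimal sketch of a release-on-success publish, where the relaxed failure order is permitted because it is no stronger than the success order (the slot and message types are assumptions for the example):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct msg { int payload; };
_Atomic(struct msg *) slot = NULL;

// Publish a message only if the slot is empty. Release ordering on
// success makes the payload write visible to any consumer that loads
// the slot with acquire ordering; the failure path needs no ordering.
bool publish(struct msg *m) {
    struct msg *expected = NULL;
    return atomic_compare_exchange_strong_explicit(
        &slot, &expected, m,
        memory_order_release,   // ordering if the CAS succeeds
        memory_order_relaxed);  // ordering if it fails
}
```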
Hardware-Level Support
The compare-and-swap (CAS) operation is natively supported in the x86 architecture through the CMPXCHG instruction, an atomic compare-and-exchange introduced in the Intel 80486 processor in 1989.[31] This instruction compares the value in the accumulator (AL, AX, EAX, or RAX depending on operand size) with a memory operand and, if equal, replaces it with a source value while setting the zero flag accordingly; a LOCK prefix ensures atomicity across multi-core systems.[32] Extensions followed with CMPXCHG8B for 64-bit operations in the original Pentium processor in 1993, enabling atomic swaps on doublewords essential for 64-bit concurrency primitives.[33] Further evolution included CMPXCHG16B in the x86-64 extensions, specified by AMD in 2000, which supports 128-bit atomic operations for advanced synchronization in 64-bit environments; the earliest AMD64 processors shipped without it, so software must check a dedicated CPUID flag (CX16) before relying on it.[34]
In the ARM architecture, direct single-word CAS was long absent as a primitive instruction; instead, the Load-Exclusive (LDREX or LDXR) and Store-Exclusive (STREX or STXR) pair, introduced in ARMv6 (announced 2001, first implemented 2002), emulates CAS through a reservation-based mechanism that detects intervening modifications between load and store.[35] True hardware CAS arrived with the Large System Extensions of ARMv8.1 (announced 2014), which added the CAS instruction family, including CASP for paired compare-and-swap on two registers (128 bits in total), facilitating lock-free algorithms on doublewords without exclusive monitors.[36]
Other architectures provide analogous hardware support: PowerPC has employed the Load Word And Reserve Indexed (lwarx) and Store Word Conditional Indexed (stwcx.) instructions since the PowerPC 601 in 1993 to implement CAS via reservation, ensuring atomicity for 32- and 64-bit operations in a manner similar to ARM's exclusive access. RISC-V's A (atomics) extension, part of the base ISA specification ratified in 2019, uses Load-Reserved/Store-Conditional (LR/SC) loops for CAS emulation, with Atomic Memory Operations (AMOs) like atomic read-modify-write providing related primitives; the Zacas extension, ratified in November 2023, introduces direct amocas.w, amocas.d, and amocas.q instructions for explicit 32-, 64-, and 128-bit CAS to reduce loop overhead in software.[37] By the early 2000s, hardware CAS or equivalent support had become ubiquitous in 64-bit CPUs across x86, ARM, and PowerPC, driven by the rise of multi-core processors and the need for scalable concurrency, and later ISAs such as RISC-V followed suit.[38]
Hardware atomicity for CAS is enforced through mechanisms like bus locking in early designs or, more efficiently, cache coherence protocols such as MESI (Modified, Exclusive, Shared, Invalid), which ensure a core gains exclusive ownership of the cache line before performing the read-compare-write sequence, preventing interference from other cores.[39] Under MESI, the atomic operation transitions the line to the Modified state, serializing access via snooping or directory-based coherence to guarantee visibility and consistency across cores without full bus locking in modern implementations.[40] Post-2020 developments, such as RISC-V's Zacas, address gaps in direct CAS support while maintaining compatibility with existing coherence models.[41]
Extensions
Load-Link/Store-Conditional
Load-link/store-conditional (LL/SC) is a synchronization primitive consisting of two instructions designed to enable atomic read-modify-write operations in load-store architectures. The load-link (LL) instruction retrieves the current value from a specified memory address into a register while establishing a hardware reservation on that address, marking it for exclusive monitoring. The subsequent store-conditional (SC) instruction attempts to write a new value from a register back to the same address, succeeding and updating the memory only if no other processor or thread has performed a store to that address since the LL; in case of failure, the SC does not modify memory and typically returns a failure indicator to the issuing register, prompting a retry. This mechanism relies on hardware enforcement of the reservation, often through cache coherence protocols or dedicated monitors, ensuring atomicity without locking the entire system.[42][43]
LL/SC provides functional equivalence to the compare-and-swap (CAS) operation, allowing CAS to be implemented using a retry loop: an LL loads the expected value, which is compared locally to determine the desired update, followed by an SC to apply the change if the reservation remains valid. If the SC fails, the loop reloads via LL and retries, ensuring the update occurs atomically relative to the original load. Unlike CAS, which compares only values and therefore cannot observe a modification that has since been undone, LL/SC inherently mitigates the ABA problem by failing the SC upon any intervening store to the address, irrespective of whether the value has reverted to its prior state. This detection of modifications at the address level, rather than the value level, makes LL/SC particularly robust for pointer-based structures where value recycling could otherwise lead to incorrect behavior.[44]
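The retry-loop construction can be sketched as follows, using hypothetical load_linked and store_conditional wrappers that stand in for hardware instructions such as ARM's LDXR/STXR; no such portable C functions exist, so the sketch is purely illustrative:

```c
#include <stdbool.h>

// Hypothetical wrappers for the hardware instructions; they are
// declared here only so the sketch type-checks.
extern int  load_linked(int *address);                    // e.g. LDXR
extern bool store_conditional(int *address, int value);   // e.g. STXR

// CAS built from LL/SC: the store-conditional fails if ANY store hit
// the address after the load-link, which is why this construction is
// inherently ABA-free.
bool ll_sc_cas(int *address, int expected, int new_value) {
    do {
        int current = load_linked(address);
        if (current != expected)
            return false;  // value differs: report failure, no store
    } while (!store_conditional(address, new_value)); // lost reservation: retry
    return true;
}
```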
Several architectures provide native LL/SC support, including MIPS with its LL and SC instructions for 32- or 64-bit words, ARM through variants like Load-Exclusive (LDREX or LDAXR) and Store-Exclusive (STREX or STLXR), and PowerPC via Load Word And Reserve Indexed (lwarx) and Store Word Conditional Indexed (stwcx.). These instructions are typically restricted to cached, aligned memory accesses to leverage coherence hardware efficiently. In software ecosystems, LL/SC underpins atomic operations in the Java platform's sun.misc.Unsafe class, which emulates CAS on LL/SC-capable architectures like ARM and PowerPC by wrapping the instructions in retry loops for portable lock-free programming. The Linux kernel similarly employs LL/SC on ARM and PowerPC to realize generic atomic primitives, such as atomic increments and exchanges, ensuring cross-architecture consistency in concurrent code.[42][43][45]
Compared to CAS, LL/SC offers advantages in ABA avoidance and alignment with RISC principles by separating the load and store phases, potentially enabling more flexible compiler optimizations. However, it introduces disadvantages, including overhead from reservation metadata (such as per-core exclusive monitors or reservation bits in caches) that tracks addresses and can consume additional hardware resources. Additionally, LL/SC loops risk livelock from spurious SC failures, often triggered by store forwarding within the same core or unrelated coherence events that invalidate the reservation without an actual conflicting store, necessitating backoff strategies for progress guarantees. Historically, LL/SC originated in a 1987 proposal by Jensen, Hagensen, and Broughton for the S-1 AAP multiprocessor project at Lawrence Livermore National Laboratory, serving as an early foundation for non-blocking synchronization in shared-memory systems and influencing subsequent RISC designs before the widespread adoption of fused CAS instructions.[46]
Multi-Word Variants
Multi-word variants of compare-and-swap (CAS) extend the atomic primitive to multiple memory locations, enabling compound updates that maintain consistency in concurrent environments without intermediate states visible to other threads. The double compare-and-swap (DCAS) operation, which atomically compares the contents of two distinct memory locations against expected values and replaces both with new values if the comparison succeeds, was proposed in the mid-1990s as a foundation for non-blocking synchronization techniques in operating system structures.[47] This extension addresses limitations of single-word CAS in scenarios requiring simultaneous updates to related data, such as pointer pairs in linked structures.
Implementations of DCAS and broader multi-word CAS (MCAS) primarily use software emulation due to sparse hardware support. Lock-based software methods ensure atomicity but incur blocking overhead and reduce scalability under contention. Lock-free emulations built atop single-word CAS, such as the descriptor-based approach for arbitrary N-word operations, provide practical alternatives with disjoint-access parallelism, though they require careful management of helper threads and retry mechanisms.[48] On the hardware side, instructions like x86's CMPXCHG16B offer double-width CAS for 128-bit operands—effectively two adjacent 64-bit words—but apply only to packed data, not arbitrary non-contiguous locations, and are available only on processors supporting 64-bit extensions.[49]
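The adjacent-word restriction can be illustrated at half width: two related 32-bit fields packed into one 64-bit atomic can be updated by a single CAS, the same trick CMPXCHG16B enables for two adjacent 64-bit words; the head/tail naming is an illustrative assumption:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Head and tail indices of a bounded ring, packed into one word so
// both can be updated by a single CAS, mimicking double-width CAS.
_Atomic uint64_t head_tail = 0;

bool update_both(uint32_t old_head, uint32_t old_tail,
                 uint32_t new_head, uint32_t new_tail) {
    uint64_t expected = ((uint64_t)old_head << 32) | old_tail;
    uint64_t desired  = ((uint64_t)new_head << 32) | new_tail;
    // Succeeds only if neither field changed since they were read.
    return atomic_compare_exchange_strong(&head_tail, &expected, desired);
}
```

Note that this only works because the two fields share one naturally aligned machine word; atomically updating two arbitrary, non-adjacent locations is exactly the case that requires DCAS/MCAS emulation.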
DCAS facilitates atomic operations in concurrent data structures, such as updating both head and tail pointers in double-ended queues (deques) to avoid races during enqueue or dequeue.[50] It also supports consistent hash table modifications by atomically swapping key-value pairs, preventing partial updates that could lead to inconsistencies.[51] Generic transactional memory libraries, including Microsoft's Persistent Multi-word Compare-and-Swap (PMwCAS), employ MCAS to enable lock-free atomic updates across multiple variables, particularly in non-volatile memory systems for durable, concurrent programming.
Key challenges in multi-word variants include the rarity of dedicated hardware support, limited to specialized instructions on select CPUs like those with x86 extensions, which restricts performance and portability. Scalability degrades beyond 2-4 words due to heightened contention in emulations and exponential growth in state validation complexity. Progress in the 2020s leverages Intel's Transactional Synchronization Extensions (TSX) to implement MCAS via hardware transactions, reducing aborts from retry-heavy single-CAS sequences by providing all-or-nothing multi-word atomicity, albeit with a 10-15% performance overhead in lock-free algorithms.[52]