Test-and-set
Test-and-set is an atomic hardware instruction used for synchronization in computer architecture. It atomically reads the value of a specified memory location and sets it to 1, returning the original value, which enables the implementation of mutual exclusion mechanisms in multiprocessor and multithreaded environments.[1] This primitive ensures that operations on shared resources occur indivisibly, preventing race conditions where multiple processes or threads attempt simultaneous access.[2]
Test-and-set was introduced as a single machine instruction by IBM in its System/360 series of computers, announced in 1964, providing an early hardware-based solution for process synchronization that uses a single bit to represent lock states (0 for free, 1 for busy).[3][4] The instruction's pseudocode is typically expressed as int test_and_set(int *m) { int old = *m; *m = 1; return old; }, where the entire sequence executes without interruption.[1] In practice, it forms the basis for spinlocks, in which a process repeatedly tests the lock until it acquires it by setting the flag, then clears the flag upon release.[2]
Key advantages include its simplicity and efficiency for short critical sections, as it leverages hardware support to avoid software overhead in checking and updating shared variables.[1] However, prolonged spinning can lead to busy-waiting and potential starvation in high-contention scenarios, prompting optimizations like test-and-test-and-set or integration with higher-level constructs such as semaphores.[5] Modern processors, including x86 and ARM architectures, provide variants of test-and-set (e.g., via LOCK CMPXCHG or LDREX/STREX instructions) to support scalable concurrency in operating systems and parallel programming.[6][7]
Fundamentals
Definition
The test-and-set operation is an atomic read-modify-write instruction that examines the value of a shared memory location—typically a single bit or byte initialized to 0—and unconditionally sets it to 1 while returning the original value to the issuing processor or thread.[8] This ensures that only one contender can successfully acquire the resource when the value was 0, as subsequent attempts will read 1 and fail until the lock is released.[9]
In pseudocode, the operation can be represented as follows, where the entire sequence executes indivisibly:
boolean testAndSet(boolean *lock) {
    boolean old = *lock;   // read the current value
    *lock = true;          // unconditionally mark the lock taken
    return old;            // true means it was already held
}
[10]
The atomicity of test-and-set guarantees that no other processor or thread can observe or interfere with an intermediate state between reading the original value and writing the new one, preventing race conditions in concurrent environments.[8] In the original IBM implementation, the instruction fetches a byte from main storage, sets it to all ones (hex FF), and uses the leftmost bit (bit 0) to determine the condition code: 0 if the bit was originally 0 (and now set to 1), or 1 if it was already 1.[8]
Test-and-set was introduced in the 1960s to support multiprocessor synchronization, with its first hardware formalization in the IBM System/360 architecture, announced in 1964.[11] The instruction, opcode 0x93 in the System/360 Principles of Operation, provided a primitive for inspecting and setting a storage bit atomically, enabling reliable shared resource management across compatible mainframe models.[8]
Purpose in Concurrency
In concurrent programming, race conditions arise when multiple threads or processes simultaneously access and modify shared resources, potentially leading to unpredictable and inconsistent program states. Without proper synchronization, such as when two threads increment a shared counter, the final value may be incorrect due to interleaved operations where one thread's update overwrites the other's.[12] This issue is particularly prevalent in multi-processor systems, where the lack of atomicity in standard load-modify-store sequences exacerbates the problem, necessitating primitives that ensure indivisible operations on shared memory.[13]
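As a concrete illustration (a minimal sketch using POSIX threads, not drawn from the cited sources), the classic lost-update race on a shared counter can be reproduced as follows:

#include <pthread.h>
#include <stdio.h>

/* Sketch: the lost-update race described above. `counter++` compiles
 * to a load-modify-store sequence, so two threads' updates can
 * interleave and one increment overwrites the other. */
static int counter = 0;                 /* shared, unsynchronized */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                      /* non-atomic read-modify-write */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* Expected 2000000; typically prints less due to lost updates. */
    printf("counter = %d\n", counter);
    return 0;
}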
Test-and-set addresses these challenges by providing a mechanism to protect critical sections—segments of code that access shared data and require mutual exclusion to maintain consistency. In a critical section, only one thread should execute at a time to prevent data corruption; test-and-set facilitates this by atomically testing a lock variable while unconditionally setting it to the locked state, so that acquisition succeeds only when the prior value was unlocked. This enables the implementation of busy-waiting locks, where threads spin until the lock is acquired, ensuring progress without suspending execution or relying on the operating system's scheduler.[12] By atomically determining the lock's state and claiming it in one instruction, test-and-set guarantees that exactly one thread enters the critical section while others immediately detect the failure and retry, thus avoiding the need for explicit inter-thread signaling or complex coordination.[14]
The non-blocking nature of test-and-set stems from its ability to allow continuous polling without blocking the thread in a kernel queue, which contrasts with higher-overhead mechanisms like sleep-wake cycles. This approach minimizes latency in short critical sections but can lead to CPU waste in contended scenarios; nonetheless, it provides a lightweight foundation for synchronization.[15] Furthermore, test-and-set serves as a fundamental building block for more advanced synchronization constructs in operating systems, such as semaphores, where atomic wait and signal operations can be implemented atop it to manage resource counts; monitors, which encapsulate mutual exclusion with condition variables; and barriers, which coordinate thread arrival in parallel computations. These higher-level abstractions rely on the atomicity of test-and-set to compose reliable concurrency control without reinventing low-level atomic guarantees.[15]
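As an illustration of how such constructs can be layered on top of test-and-set, the following sketch (names and structure are illustrative, not from a particular system) implements a busy-waiting counting semaphore whose count is protected by a test-and-set guard, using C11's atomic_flag; real semaphores block in the kernel rather than spinning:

#include <stdatomic.h>

/* Illustrative sketch (hypothetical names): a busy-waiting counting
 * semaphore whose count is protected by a test-and-set guard. */
typedef struct {
    atomic_flag guard;  /* test-and-set lock protecting `count` */
    int count;          /* number of available resources */
} spin_sem_t;

static void spin_sem_init(spin_sem_t *s, int count) {
    atomic_flag_clear(&s->guard);
    s->count = count;
}

static void spin_sem_wait(spin_sem_t *s) {           /* P operation */
    for (;;) {
        while (atomic_flag_test_and_set(&s->guard))
            ;                                        /* spin for the guard */
        if (s->count > 0) {                          /* resource available */
            s->count--;
            atomic_flag_clear(&s->guard);
            return;
        }
        atomic_flag_clear(&s->guard);                /* none free: retry */
    }
}

static void spin_sem_signal(spin_sem_t *s) {         /* V operation */
    while (atomic_flag_test_and_set(&s->guard))
        ;
    s->count++;
    atomic_flag_clear(&s->guard);
}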
Implementations
Hardware Variants
The test-and-set operation is typically realized in hardware through dedicated atomic read-modify-write (RMW) instructions that ensure indivisibility in multi-processor environments. In x86 architectures, it is commonly implemented using instructions such as XCHG (exchange) or BTS (bit test and set). For instance, a test-and-set can be performed by loading 1 into a register and executing XCHG between that register and the memory location: the instruction atomically swaps the two and leaves the original value in the register. XCHG with a memory operand is implicitly locked, while other RMW instructions such as BTS require an explicit LOCK prefix to guarantee atomicity.[16] This mechanism relies on the processor's ability to either lock the system bus during the operation or use cache coherence protocols to prevent concurrent modifications from other cores.[16]
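A concrete sketch of this idiom in GCC/Clang inline assembly (the function name is illustrative) might look as follows; because XCHG with a memory operand is implicitly locked, no LOCK prefix appears:

/* Sketch: test-and-set via x86 XCHG (GCC/Clang inline assembly).
 * XCHG with a memory operand asserts locking implicitly. */
static inline int tas_xchg(volatile int *lock) {
    int value = 1;                        /* value to swap in */
    __asm__ __volatile__("xchgl %0, %1"
                         : "+r"(value), "+m"(*lock)
                         :
                         : "memory");
    return value;                         /* previous value: 0 means acquired */
}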
A prominent variation is the compare-and-swap (CAS) instruction, which generalizes test-and-set by atomically comparing a memory location to an expected value and, if equal, replacing it with a desired value; test-and-set corresponds to the special case where the expected value is 0 and the desired value is 1. In x86, this is provided by the CMPXCHG instruction, which compares the destination operand to the accumulator (EAX/RAX) and exchanges it with the source if they match, with the LOCK prefix ensuring atomic execution across cores.[16] The operation can be formally described as:
if (*ptr == expected) {
    *ptr = desired;
    return true;
} else {
    return false;
}
where, for test-and-set, expected = 0 and desired = 1.[16] This flexibility allows CAS to support more complex synchronization patterns beyond simple binary flags while maintaining the same hardware guarantees.
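In C11, this special case can be written with atomic_compare_exchange_strong; the sketch below (illustrative naming) acquires the flag only when the location was 0:

#include <stdatomic.h>
#include <stdbool.h>

/* Sketch: test-and-set expressed as the CAS special case above
 * (expected = 0, desired = 1). Returns true if the lock was
 * acquired, i.e. the old value was 0. */
static bool tas_via_cas(atomic_int *lock) {
    int expected = 0;
    return atomic_compare_exchange_strong(lock, &expected, 1);
}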
In RISC architectures such as ARM, test-and-set is often implemented using a pair of load-exclusive (LDREX) and store-exclusive (STREX) instructions, which together provide RMW semantics without a single monolithic instruction. LDREX loads a value from memory and sets an exclusive monitor for the address in the processor's monitor hardware, while STREX attempts to store a new value only if no intervening modification occurred, returning success or failure accordingly; a loop retries on failure to achieve atomicity for operations such as setting a flag to 1 if it was 0. Other RMW primitives, such as fetch-and-add, can be built from the same exclusive pairs (LDAXR/STLXR in ARMv8 add acquire-release semantics), though the exclusive pair is most directly suited to test-and-set emulation.
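Rather than writing the exclusive pair by hand, portable code usually relies on the compiler. The sketch below (illustrative) uses GCC's __atomic_exchange_n builtin, which ARM compilers typically lower to an exclusive-pair retry loop of the kind just described:

/* Sketch: a portable test-and-set that ARM compilers typically lower
 * to an exclusive-pair retry loop, roughly:
 *   1: ldrex r1, [r0]        ; load and set the exclusive monitor
 *      strex r2, r3, [r0]    ; try to store 1
 *      cmp   r2, #0
 *      bne   1b              ; reservation lost: retry
 */
static inline int tas_llsc(volatile int *lock) {
    return __atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE);
}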
In multi-processor systems, hardware atomicity for these instructions is enforced at the bus or interconnect level to handle inter-core interference. For x86, the LOCK prefix can trigger a locked bus cycle, where the processor asserts the LOCK# signal to prevent other agents from reading or writing the target location during the RMW operation; modern implementations instead lock the affected cache line through the coherence protocol to avoid full bus locking.[16] Similarly, ARM's exclusive monitors integrate with cache coherence protocols such as MOESI (a variant of MESI) across shareable domains, where the global monitor tracks reservations and invalidates them on conflicting accesses from other cores, ensuring that a STREX succeeds only for the originating processor. These extensions to standard cache coherence protocols, such as snoop filters and directory-based coherence, minimize contention while preserving atomic guarantees without excessive bus traffic.
Software Emulations
In environments lacking dedicated hardware support for test-and-set, software emulation can achieve similar atomic read-modify-write semantics using operating system primitives or higher-level synchronization mechanisms. One common approach in uniprocessor systems involves briefly disabling interrupts to prevent context switches during the operation, ensuring that the read of the lock variable followed by its conditional write executes without interruption.[14] This method is particularly suitable for kernel-level code where direct control over interrupts is available, as it simulates atomicity for short critical sections without relying on multiprocessor coordination.[17] However, it is ineffective in multiprocessor settings, where multiple CPUs can still interleave operations independently of interrupts on a single processor.[14]
Library-based emulations leverage higher-level constructs provided by operating systems or threading libraries to wrap non-atomic sequences. For instance, POSIX threads (pthreads) mutexes can protect a software test-and-set implementation, where the mutex ensures exclusive access during the read and write, and the underlying OS often utilizes hardware atomics if present to optimize the mutex itself.[18] Similarly, Windows Critical Sections offer a lightweight synchronization primitive for threads within a single process, emulating atomic test-and-set by combining spin-waiting with kernel-assisted blocking, which falls back to OS-level locking when hardware support is unavailable.[19] These approaches promote portability across platforms but introduce dependencies on the OS's synchronization facilities.
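For example, the pthread route can be sketched as follows (hypothetical helper, shown only to illustrate the wrapping):

#include <pthread.h>

/* Sketch: software test-and-set emulated with a POSIX mutex.
 * The mutex makes the read and write indivisible with respect to
 * other threads calling the same helper. */
static pthread_mutex_t tas_guard = PTHREAD_MUTEX_INITIALIZER;

static int tas_pthread(int *lock) {
    pthread_mutex_lock(&tas_guard);
    int old = *lock;   /* read the prior value */
    *lock = 1;         /* unconditionally set */
    pthread_mutex_unlock(&tas_guard);
    return old;
}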
Compiler intrinsics and standard library functions provide portable ways to emulate test-and-set by generating appropriate machine code or falling back to software mechanisms. In GCC, the __sync_lock_test_and_set builtin performs an atomic exchange, storing a value (typically 1) and returning the prior contents; on architectures with limited atomic support, it may restrict operations to constant values like 1, relying on the compiler's code generation for basic atomicity.[20] The C11 standard's atomic_exchange function similarly atomically swaps values in a designated atomic object, with implementations using lock-based fallbacks—such as mutexes or spinlocks—when hardware primitives are absent, often hashing addresses to select from a pool of locks for efficiency.[21]
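A brief sketch of both routes (illustrative wrapper names):

#include <stdatomic.h>

/* GCC builtin: atomically store 1 and return the prior contents
 * (the builtin carries acquire semantics only). */
static int tas_gcc(volatile int *lock) {
    return __sync_lock_test_and_set(lock, 1);
}

/* C11: atomic swap on an atomic object; the implementation may fall
 * back to an internal lock when hardware atomics are unavailable. */
static int tas_c11(atomic_int *lock) {
    return atomic_exchange(lock, 1);
}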
Software emulations generally incur higher overhead compared to hardware instructions due to potential context switches in lock-based fallbacks or repeated retries in spin-based ones, making them less suitable for high-contention scenarios. A simple uniprocessor emulation using interrupt control, as seen in early kernel designs, follows this pseudocode pattern (using x86-specific cli and sti for disable/enable):
cli();                    // disable interrupts (x86)
old = *lock;              // read the lock variable
if (old == 0) *lock = 1;  // claim the lock if it was free
sti();                    // re-enable interrupts
return old;
This ensures the sequence is uninterruptible but limits applicability to single-processor contexts without OS privileges.[14]
Synchronization Applications
Mutual Exclusion Locks
A spinlock is a synchronization primitive that enforces mutual exclusion through busy-waiting, where contending threads repeatedly test a shared lock variable until they can acquire it. Implemented using the test-and-set operation, the spinlock relies on the atomicity of test-and-set to ensure that only one thread successfully transitions the lock from unlocked (0) to locked (1), while others continue spinning.[9] This approach is suitable for short critical sections in multiprocessor environments, as it avoids context switches but consumes CPU cycles during contention.[9]
A straightforward implementation of a test-and-set spinlock in pseudo-C code appears as follows:
typedef struct {
    int flag;   // 0 = unlocked, 1 = locked
} lock_t;

void acquire(lock_t *lock) {
    while (test_and_set(&lock->flag) == 1)
        ;   // spin until test_and_set returns 0 (lock acquired)
}

void release(lock_t *lock) {
    lock->flag = 0;   // clear the flag to unlock
}
In the acquire function, the loop invokes test_and_set, which atomically reads the flag and sets it to 1, returning the prior value; if the prior value is 1 (locked), the thread keeps spinning, and if it is 0, the thread has acquired the lock and exits the loop to enter the critical section.[9] The release function simply clears the flag with a plain store, since no atomic read-modify-write is needed to unlock. This design provides no fairness guarantees: which spinning thread acquires the lock after a release depends on timing and cache effects, so later arrivals can repeatedly overtake earlier ones and a given thread may starve.[22]
On x86 architectures, test-and-set for spinlocks leverages the LOCK BTS (Bit Test and Set) instruction, which atomically tests and sets a specified bit in memory. A minimal assembly implementation for acquire might look like this:
mutex_acquire:
    lock bts [var], 0    ; atomically test and set bit 0 of var
    jc   mutex_acquire   ; carry set: bit was already 1, retry (spin)
Here, var holds the lock flag; the LOCK prefix ensures atomicity across processors, and the carry flag indicates whether the bit was already set, triggering a retry loop for unsuccessful attempts.[23] Release is implemented as mov [var], 0, clearing the flag without requiring an atomic operation.[23]
To protect a critical section, a thread invokes acquire prior to accessing shared resources and release afterward; upon failure in the test-and-set loop, contending threads spin in the while loop, polling until the releasing thread clears the flag to 0, at which point one thread atomically acquires it.[9] This mechanism, grounded in the atomicity of test-and-set, guarantees that only one thread executes the critical section at a time.[9]
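Using the lock_t implementation above, a protected critical section reads:

lock_t lock = { 0 };   /* initially unlocked */

acquire(&lock);
/* critical section: operate on shared data */
release(&lock);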
Evaluation Metrics
Evaluation of test-and-set locks relies on metrics that quantify their efficiency in concurrent environments, particularly focusing on latency, scalability under contention, busy-waiting energy costs, and standardized benchmarking approaches.
Latency metrics primarily assess the time required for the atomic test-and-set operation itself. On x86 processors such as Intel Haswell (2013) or Skylake (2015) architectures, an uncontended test-and-set—often implemented via instructions like LOCK XCHG or LOCK CMPXCHG—typically incurs 10-20 cycles, encompassing the atomic swap and any associated serialization.[24] On newer architectures like AMD Zen 4 (2022), latencies can be lower, around 7-9 cycles for similar operations.[25] This duration increases under contention due to bus lock overhead, where the LOCK prefix serializes access to the memory bus, potentially extending latency to dozens of cycles as the system ensures exclusive ownership across cores.[26]
Scalability under contention evaluates how performance degrades as the number of threads increases. Test-and-set locks suffer from cache line bouncing, where frequent atomic accesses invalidate and migrate the shared lock variable across processor caches, leading to significant throughput drops under high contention due to inter-core coherence traffic.[27] A simple analytical model captures this effect: if N threads retry at uniform rates, each test-and-set attempt succeeds with probability approximately 1/N, so the number of attempts until success is geometrically distributed with mean N, implying an expected O(N) retries per acquisition and roughly linear degradation in overall system throughput.[28]
Busy-waiting energy cost measures the inefficiency of spinning loops during lock contention, where threads repeatedly execute test-and-set without yielding the CPU. This results in wasted CPU cycles and elevated power consumption, with measurements on Intel Xeon servers indicating up to 3-4% higher total system power (around 2-3 watts per core) compared to idle states during local spinning, escalating further with global contention due to sustained core activity at high frequencies.[29] The PAUSE instruction, intended to hint the processor toward lower activity in spin loops, generally reduces power consumption compared to unpaused busy-waiting.[30] Even with such hints, however, busy-waiting remains inherently energy-intensive compared to blocking alternatives.[29]
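The test-and-test-and-set optimization mentioned earlier addresses both costs: spinning on a plain load avoids the coherence traffic of repeated atomic writes, and a PAUSE hint lowers power while waiting. A sketch using C11 atomics and the x86 _mm_pause() intrinsic (illustrative naming; other architectures use different spin-wait hints):

#include <stdatomic.h>
#include <immintrin.h>   /* _mm_pause(), x86 */

/* Sketch: test-and-test-and-set (TTAS) acquire. The inner load spins
 * locally in cache; the atomic exchange is attempted only when the
 * lock appears free, reducing cache-line bouncing. */
static void ttas_acquire(atomic_int *lock) {
    for (;;) {
        if (atomic_load_explicit(lock, memory_order_relaxed) == 0 &&
            atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
            return;                    /* old value was 0: acquired */
        _mm_pause();                   /* spin-wait hint: lower power */
    }
}

static void ttas_release(atomic_int *lock) {
    atomic_store_explicit(lock, 0, memory_order_release);
}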
Benchmarking tools for test-and-set locks emphasize custom microbenchmarks to precisely measure acquire and release times in multiprocessor environments, often using timestamp counters (e.g., RDTSC on x86) over iterations of contended and uncontended accesses.[31] While suites like lmbench provide indirect insights into system latency, they do not directly benchmark lock primitives; instead, tools such as Synchrobench or tailored synthetic workloads simulate varying contention levels to report median acquire times, typically in nanoseconds to microseconds per operation across multi-socket setups.[32][27]
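A minimal sketch of such a microbenchmark (uncontended case only; real measurements additionally need core pinning, serializing instructions such as CPUID or RDTSCP, and warm-up iterations):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc(), x86 */

int main(void) {
    atomic_int lock = 0;
    enum { ITERS = 1000000 };

    uint64_t start = __rdtsc();
    for (int i = 0; i < ITERS; i++) {
        while (atomic_exchange(&lock, 1))   /* test-and-set acquire */
            ;                               /* never spins here: uncontended */
        atomic_store(&lock, 0);             /* release */
    }
    uint64_t end = __rdtsc();

    printf("~%.1f cycles per acquire/release pair\n",
           (double)(end - start) / ITERS);
    return 0;
}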
Comparisons to Alternatives
Test-and-set (TAS) serves as a straightforward atomic primitive for binary synchronization, such as simple spinlocks, but it is less versatile than compare-and-swap (CAS), which supports conditional updates on arbitrary values and facilitates the construction of lock-free data structures like concurrent queues and stacks.[33] While TAS unconditionally sets a lock variable, often resulting in high contention and cache-coherence overhead in multiprocessor systems, CAS reduces such inefficiencies by only committing updates if the expected value matches, thereby minimizing retries and enhancing scalability.[33] This added flexibility of CAS, however, introduces greater implementation complexity compared to the minimalistic design of TAS.[33]
In contrast to semaphores, which employ blocking mechanisms for resource coordination, TAS-based spinlocks prioritize low overhead by busy-waiting without context switches, rendering them more efficient for brief critical sections in multiprocessor environments where delays are minimal.[34] Semaphores, by suspending waiting threads and relying on kernel scheduling, better accommodate longer critical sections or high-contention scenarios but incur the expense of context-switch latency and potential priority inversion, issues less pronounced in TAS spinlocks for short holds.[34] Nonetheless, prolonged spinning with TAS can lead to wasted CPU cycles and energy consumption, whereas semaphores promote idleness among contenders, improving overall system utilization for extended waits.[35]
Ticket locks address the unfairness of basic TAS locks—where later arrivals may overtake earlier ones—by employing fetch-and-add to assign sequential service numbers, ensuring FIFO ordering and moderate scalability improvements over TAS in low-to-medium contention.[36] MCS (Mellor-Crummey and Scott) locks build on this by forming a per-thread queue with local spinning on private flags, drastically reducing shared interconnect traffic and providing superior scalability to both TAS and ticket locks under high contention.[36] Empirical evaluations on shared-memory multiprocessors demonstrate that MCS locks achieve nearly constant acquisition latency as processor counts increase, while TAS exhibits linear degradation and ticket locks show intermediate performance with proportional backoff.[36]
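A sketch of the ticket-lock scheme in C11 atomics (illustrative naming) shows how fetch-and-add yields FIFO ordering that plain test-and-set lacks:

#include <stdatomic.h>

/* Sketch: a ticket lock built on fetch-and-add. Each thread takes a
 * ticket, then spins until its number is being served. */
typedef struct {
    atomic_uint next_ticket;   /* next number to hand out */
    atomic_uint now_serving;   /* number currently allowed in */
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *l) {
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);  /* take a number */
    while (atomic_load(&l->now_serving) != my)
        ;  /* spin until called */
}

static void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);  /* serve the next waiter */
}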
Contemporary hardware primitives like load-link/store-conditional (LL/SC), implemented in architectures such as ARM and PowerPC, circumvent the ABA problem plaguing CAS-based algorithms by invalidating conditional stores upon any intervening write, thus enabling more robust non-blocking synchronization than TAS without relying on version tags.[33] Although LL/SC demands retry loops akin to CAS retries, its explicit detection of modifications yields stronger progress guarantees in lock-free settings, trading TAS's simplicity for enhanced reliability in complex concurrent structures.[33]