Fact-checked by Grok 2 weeks ago

Test-and-set

Test-and-set is an hardware instruction used in for , which atomically reads the value of a specified location and sets it to 1 while returning the original value, enabling the implementation of mechanisms in multiprocessor and multithreaded environments. This primitive ensures that operations on shared resources occur indivisibly, preventing race conditions where multiple processes or threads attempt simultaneous access. Introduced by in its System/360 series of computers, announced in , as a single machine instruction, test-and-set provided an early solution for hardware-based synchronization, using a single bit to represent lock states (0 for free, 1 for busy). The instruction's pseudocode is typically expressed as int test_and_set(int *m) { int old = *m; *m = 1; return old; }, where the entire sequence executes without interruption. In practice, it forms the basis for spinlocks, where a process repeatedly tests the lock until it acquires it by setting the flag, then clears the flag upon release. Key advantages include its simplicity and efficiency for short critical sections, as it leverages hardware support to avoid software overhead in checking and updating shared variables. However, prolonged spinning can lead to busy-waiting and potential in high-contention scenarios, prompting optimizations like test-and-test-and-set or integration with higher-level constructs such as semaphores. Modern processors, including x86 and architectures, provide variants of test-and-set (e.g., via LOCK CMPXCHG or LDREX/STREX instructions) to support scalable concurrency in operating systems and parallel programming.

Fundamentals

Definition

The test-and-set operation is an read-modify-write that examines the of a location—typically a single bit or byte initialized to 0—and unconditionally sets it to 1 while returning the original to the issuing or . This ensures that only one contender can successfully acquire the resource when the value was 0, as subsequent attempts will read 1 and fail until the lock is released. In , the operation can be represented as follows, where the entire sequence executes indivisibly:
boolean testAndSet(boolean *lock) {
    boolean old = *lock;
    *lock = true;
    return old;
}
The atomicity of test-and-set guarantees that no other or can observe or interfere with an between reading the original value and writing the new one, preventing race conditions in concurrent environments. In the original implementation, the instruction fetches a byte from main storage, sets it to all ones ( FF), and uses the leftmost bit (bit 0) to determine the condition code: 0 if the bit was originally 0 (and now set to 1), or 1 if it was already 1. Test-and-set was introduced in the to support multiprocessor , with its first hardware formalization in the architecture, announced in 1964. The instruction, 0x93 in the System/360 Principles of Operation, provided a primitive for inspecting and setting a storage bit atomically, enabling reliable management across compatible mainframe models.

Purpose in Concurrency

In concurrent programming, race conditions arise when multiple threads or processes simultaneously access and modify , potentially leading to unpredictable and inconsistent program states. Without proper , such as when two threads increment a , the final value may be incorrect due to interleaved operations where one thread's overwrites the other's. This issue is particularly prevalent in multi-processor systems, where the lack of atomicity in standard load-modify-store sequences exacerbates the problem, necessitating primitives that ensure indivisible operations on . Test-and-set addresses these challenges by providing a to protect s—segments of code that access shared data and require to maintain consistency. In a , only one should execute at a time to prevent ; test-and-set facilitates this through a simple, test of a lock followed by its set to a locked if previously unlocked. This enables the of busy-waiting locks, where threads spin until the lock is acquired, ensuring without suspending execution or relying on the operating system's scheduler. By atomically determining the lock's and claiming it in one , test-and-set guarantees that exactly one enters the while others immediately detect the failure and retry, thus avoiding the need for explicit inter-thread signaling or complex coordination. The non-blocking nature of test-and-set stems from its ability to allow continuous polling without blocking the thread in a kernel queue, which contrasts with higher-overhead mechanisms like sleep-wake cycles. This approach minimizes latency in short critical sections but can lead to CPU waste in contended scenarios; nonetheless, it provides a lightweight foundation for synchronization. Furthermore, test-and-set serves as a fundamental building block for more advanced synchronization constructs in operating systems, such as semaphores, where atomic wait and signal operations can be implemented atop it to manage resource counts; monitors, which encapsulate mutual exclusion with condition variables; and barriers, which coordinate thread arrival in parallel computations. These higher-level abstractions rely on the atomicity of test-and-set to compose reliable concurrency control without reinventing low-level atomic guarantees.

Implementations

Hardware Variants

The test-and-set operation is typically realized in hardware through dedicated atomic read-modify-write (RMW) instructions that ensure indivisibility in multi-processor environments. In x86 architectures, this is commonly implemented using instructions such as (exchange) or (bit test and set) prefixed with LOCK to guarantee atomicity. For instance, a test-and-set can be performed by executing LOCK [XCHG](/page/Exchange) [memory], #1, which atomically swaps the memory location with 1 and returns the original value, leveraging the processor's LOCK prefix to serialize access. This mechanism relies on the processor's ability to either lock the during the or use protocols to prevent concurrent modifications from other cores. A prominent variation is the (CAS) instruction, which generalizes test-and-set by atomically comparing a memory location to an expected value and, if equal, replacing it with a desired value; test-and-set corresponds to the special case where the expected value is 0 and the desired value is 1. In x86, this is provided by the CMPXCHG instruction, which compares the destination operand to the accumulator (/RAX) and exchanges it with the source if they match, with the LOCK prefix ensuring atomic execution across cores. The operation can be formally described as: \begin{align*} &\text{if } (*ptr == \text{expected}) \{ \\ &\quad *ptr = \text{desired}; \\ &\quad \text{return true}; \\ &\} \end{align*} else return false, where for test-and-set, expected = 0 and desired = 1. This flexibility allows to support more complex patterns beyond simple flags while maintaining the same guarantees. In RISC architectures such as , test-and-set is often implemented using a pair of load-exclusive (LDREX) and store-exclusive (STREX) instructions, which provide RMW semantics without a single monolithic instruction. LDREX loads a value from and sets an exclusive for the in the processor's local , while STREX attempts to store a new value only if no intervening modification occurred, returning success or failure accordingly; a loop retries on failure to achieve atomicity for operations like setting a flag to 1 if it was 0. Other RMW primitives, such as (e.g., via LDAXR/STLXR in newer versions), can be adapted for flags by incrementing a or using conditional logic, though the exclusive pair is more directly suited for test-and-set . In multi-processor systems, hardware atomicity for these instructions is enforced at the bus or interconnect level to handle inter-core interference. For x86, the LOCK prefix can trigger a locked bus cycle, where the processor asserts the LOCK# signal to prevent other agents from reading or writing the target cache line during the RMW operation, or it uses dual-snoop or cache-lock mechanisms in modern implementations to avoid full bus locking. Similarly, ARM's exclusive monitors integrate with cache coherence protocols like MOESI (a variant of MESI) across shareable domains, where the global monitor tracks reservations and invalidates them on conflicting accesses from other cores, ensuring the STREX succeeds only for the originating processor. These extensions to standard cache coherence protocols, such as snoop filters and directory-based coherence, minimize contention while preserving atomic guarantees without excessive bus traffic.

Software Emulations

In environments lacking dedicated hardware support for test-and-set, software can achieve similar read-modify-write semantics using operating system primitives or higher-level mechanisms. One common approach in uniprocessor systems involves briefly disabling interrupts to prevent context switches during the operation, ensuring that the read of the lock followed by its conditional write executes without interruption. This method is particularly suitable for kernel-level code where direct control over interrupts is available, as it simulates atomicity for short critical sections without relying on multiprocessor coordination. However, it is ineffective in multiprocessor settings, where multiple CPUs can still interleave operations independently of interrupts on a single processor. Library-based emulations leverage higher-level constructs provided by operating systems or threading libraries to wrap non-atomic sequences. For instance, threads (pthreads) mutexes can protect a software test-and-set implementation, where the mutex ensures exclusive during the read and write, and the underlying OS often utilizes atomics if present to optimize the mutex itself. Similarly, Windows Critical Sections offer a lightweight primitive for threads within a single , emulating atomic test-and-set by combining spin-waiting with kernel-assisted blocking, which falls back to OS-level locking when support is unavailable. These approaches promote portability across platforms but introduce dependencies on the OS's facilities. Compiler intrinsics and standard library functions provide portable ways to emulate test-and-set by generating appropriate machine code or falling back to software mechanisms. In GCC, the __sync_lock_test_and_set builtin performs an atomic exchange, storing a value (typically 1) and returning the prior contents; on architectures with limited atomic support, it may restrict operations to constant values like 1, relying on the compiler's code generation for basic atomicity. The C11 standard's atomic_exchange function similarly atomically swaps values in a designated atomic object, with implementations using lock-based fallbacks—such as mutexes or spinlocks—when hardware primitives are absent, often hashing addresses to select from a pool of locks for efficiency. Software generally incur higher overhead compared to instructions due to potential switches in lock-based fallbacks or repeated retries in spin-based ones, making them less suitable for high-contention scenarios. A simple uniprocessor using control, as seen in early designs, follows this pattern (using x86-specific cli and sti for disable/enable):
cli();  // Disable [interrupts](/page/Interrupt)
old = *lock;
if (old == 0) *lock = 1;
sti();  // Enable [interrupts](/page/Interrupt)
return old;
This ensures the sequence is uninterruptible but limits applicability to single-processor contexts without OS privileges.

Synchronization Applications

Mutual Exclusion Locks

A is a primitive that enforces through busy-waiting, where contending repeatedly test a shared lock variable until they can acquire it. Implemented using the test-and-set operation, the spinlock relies on the atomicity of test-and-set to ensure that only one thread successfully transitions the lock from unlocked (0) to locked (1), while others continue spinning. This approach is suitable for short critical sections in multiprocessor environments, as it avoids context switches but consumes CPU cycles during contention. A straightforward implementation of a test-and-set spinlock in pseudo-C code appears as follows:
c
typedef struct {
    int flag;
} lock_t;

void acquire(lock_t *lock) {
    while (test_and_set(&lock->flag) == 1)
        ;  // spin until success
}

void release(lock_t *lock) {
    lock->flag = 0;
}
In the acquire function, the loop invokes test-and-set, which atomically reads the flag and sets it to 1, returning the prior value; if 1 (locked), the thread spins, but if 0, it acquires the lock and exits the loop to enter the critical section. The release function simply clears the flag non-atomically, as no contention arises during unlock. This design introduces fairness issues, exhibiting first-waiter-wins behavior where the earliest spinning thread typically acquires the lock upon release, potentially starving later arrivals due to timing and cache effects. On x86 architectures, test-and-set for spinlocks leverages the LOCK BTS (Bit Test and Set) instruction, which atomically tests and sets a specified bit in . A minimal implementation for might look like this:
mutex_acquire:
    lock bts [var], 0    ; atomically test and set bit 0 of var
    jc mutex_acquire     ; if [carry flag](/page/Carry_flag) set (bit was 1), retry ([spin](/page/Spin))
Here, var holds the lock flag; the LOCK prefix ensures atomicity across processors, and the indicates if the bit was already set, triggering a retry loop for unsuccessful attempts. Release is implemented as mov [var], 0, clearing the bit without atomicity needs. To protect a , a invokes [acquire](/page/Acquire) prior to accessing shared resources and release afterward; upon failure in the test-and-set loop, contending spin in the , polling until the releasing thread clears the flag to 0, at which point one atomically acquires it. This mechanism, grounded in the atomicity of test-and-set, guarantees that only one executes the at a time.

Performance and Limitations

Evaluation Metrics

Evaluation of test-and-set locks relies on metrics that quantify their efficiency in concurrent environments, particularly focusing on , under contention, busy-waiting energy costs, and standardized approaches. metrics primarily assess the time required for the test-and-set operation itself. On x86 processors such as Haswell (2013) or Skylake (2015) architectures, an uncontended test-and-set—often implemented via instructions like LOCK XCHG or LOCK CMPXCHG—typically incurs 10-20 cycles, encompassing the swap and any associated serialization. On newer architectures like AMD Zen 4 (2022), latencies can be lower, around 7-9 cycles for similar operations. This duration increases under contention due to bus lock overhead, where the LOCK prefix serializes access to the memory bus, potentially extending to dozens of cycles as the system ensures exclusive ownership across cores. Scalability under contention evaluates how performance degrades as the number of threads increases. Test-and-set locks suffer from cache line bouncing, where frequent atomic accesses invalidate and migrate the shared lock variable across caches, leading to significant throughput drops under high contention due to inter-core traffic. A simple analytical model captures this effect, approximating the success probability of a single test-and-set attempt as \approx 1/N for N contending threads assuming uniform retry rates, which implies an expected O(N) retries per acquisition and linear degradation in overall system throughput. Busy-waiting energy cost measures the inefficiency of spinning loops during lock contention, where threads repeatedly execute test-and-set without yielding the CPU. This results in wasted CPU cycles and elevated consumption, with measurements on servers indicating up to 3-4% higher total system (around 2-3 watts per core) compared to idle states during local spinning, escalating further with global contention due to sustained core activity at high frequencies. The PAUSE instruction, intended to hint the processor for lower activity in spin loops, generally reduces consumption compared to unpaused busy-waiting. However, relative to other techniques like memory barriers, busy-waiting remains inherently energy-intensive compared to blocking alternatives. Benchmarking tools for test-and-set locks emphasize custom microbenchmarks to precisely measure and release times in multiprocessor environments, often using counters (e.g., RDTSC on x86) over iterations of contended and uncontended accesses. While suites like lmbench provide indirect insights into system latency, they do not directly benchmark lock primitives; instead, tools such as Synchrobench or tailored synthetic workloads simulate varying contention levels to report median times, typically in nanoseconds to microseconds per operation across multi-socket setups.

Comparisons to Alternatives

Test-and-set (TAS) serves as a straightforward atomic primitive for binary synchronization, such as simple spinlocks, but it is less versatile than (CAS), which supports conditional updates on arbitrary values and facilitates the construction of lock-free data structures like concurrent queues and stacks. While TAS unconditionally sets a lock , often resulting in high contention and cache-coherence overhead in multiprocessor systems, CAS reduces such inefficiencies by only committing updates if the expected value matches, thereby minimizing retries and enhancing . This added flexibility of CAS, however, introduces greater implementation complexity compared to the minimalistic design of TAS. In contrast to semaphores, which employ blocking mechanisms for resource coordination, TAS-based spinlocks prioritize low overhead by busy-waiting without context switches, rendering them more efficient for brief critical sections in multiprocessor environments where delays are minimal. Semaphores, by suspending waiting threads and relying on kernel scheduling, better accommodate longer critical sections or high-contention scenarios but incur the expense of context-switch latency and potential priority inversion, issues less pronounced in TAS spinlocks for short holds. Nonetheless, prolonged spinning with TAS can lead to wasted CPU cycles and energy consumption, whereas semaphores promote idleness among contenders, improving overall system utilization for extended waits. Ticket locks address the unfairness of basic TAS locks—where later arrivals may overtake earlier ones—by employing to assign sequential service numbers, ensuring ordering and moderate scalability improvements over TAS in low-to-medium contention. MCS (Mellor-Crummey and Scott) locks build on this by forming a per-thread queue with local spinning on private flags, drastically reducing shared interconnect traffic and providing superior scalability to both TAS and locks under high contention. Empirical evaluations on shared-memory multiprocessors demonstrate that MCS locks achieve nearly constant acquisition as processor counts increase, while TAS exhibits linear degradation and locks show intermediate performance with proportional backoff. Contemporary hardware primitives like (LL/SC), implemented in architectures such as and PowerPC, circumvent the plaguing -based algorithms by invalidating conditional stores upon any intervening write, thus enabling more robust non-blocking synchronization than without relying on version tags. Although LL/SC demands retry loops akin to CAS retries, its explicit detection of modifications yields stronger guarantees in lock-free settings, trading TAS's for enhanced reliability in concurrent structures.

References

  1. [1]
    [PDF] Atomic Instructions! - Cornell: Computer Science
    Synchronization! Atomic Test and Set! Using test-and-set for mutual exclusion! Use test-‐and-‐set to implement mutex / spinlock / crit.
  2. [2]
    Operating Systems Lecture Notes Lecture 5 Implementing ...
    Test-And-Set. The test and set instruction atomically checks if a memory location is zero, and if so, sets the memory location to 1. If the memory location is ...
  3. [3]
    [PDF] CSC 453 Operating Systems - Lecture 6 : Process Synchronization
    Test and Set Lock. • Test and Set is a single machine instruction introduced by IBM in its Series 360 computers. – A single bit stored the value of the lock ...<|separator|>
  4. [4]
    [PDF] Systems Reference Library IBM System/360 Principles of Operation
    This publication is the machine reference manual for the IBM System/360. It provides a direct, comprehen- sive description of the system structure; ...
  5. [5]
    [PDF] threads-locks.pdf - cs.wisc.edu
    The reason it is called “test and set” is that it enables you to “test” the old value (which is what is returned) while simultaneously. “setting” the memory ...
  6. [6]
    Chapter Five -- Process Synchronization -- Lecture Notes
    Our text gives the examples of what can be done with an atomic test-and-set instruction or an atomic swap instruction. DEFINITION OF THE TestAndSet INSTRUCTION
  7. [7]
    [PDF] System/360 and Beyond
    The evolution of modern large-scale computer architecture within IBM is described, starting with the announcement of. System/360 in 1964 and covering the latest ...
  8. [8]
    None
    ### Summary of Test-and-Set, Concurrency, Race Conditions, and Critical Sections
  9. [9]
    [PDF] Concurrent Programming: Critical Sections and Locks - CS@Cornell
    A Data Race occurs when two threads try to access the same variable and at least one access is non-atomic and at least one access is an update. o The outcome of ...
  10. [10]
    [PDF] Lecture 5: Synchronization with Locks - UCSD CSE
    Atomic instructions (e.g., test-and-set). ♢. Disable/enable interrupts (prevents context switches). Page 18. CSE 120 – Lecture 5: Synchronization. 18. Test-And- ...
  11. [11]
    [PDF] Synchronization Principles Race Condition The Critical-Section ...
    Sep 25, 2008 · Synchronization. • Test-and-set – atomic exchange. • Fetch-and-op (e.g., increment) – returns value and atomically performs op (e.g., increments ...
  12. [12]
    [PDF] Intel® 64 and IA-32 Architectures Software Developer's Manual
    NOTE: The Intel 64 and IA-32 Architectures Software Developer's Manual consists of four volumes: Basic Architecture, Order Number 253665; Instruction Set ...
  13. [13]
    L15 - PDOS-MIT
    The paper describes several ways to implement test-and-set in software: Emulation. The kernel provides a TSL system call, which is implemented atomically by, ...
  14. [14]
    Mutual Exclusion | Operating Systems: updated 11 Jan 2024
    The pthreads library in Linux and Minix offer an efficient implementation of mutexes. Rather than wasting CPU time by 'spinning', the thread is put to sleep if ...<|separator|>
  15. [15]
    Critical Section Objects - Win32 apps - Microsoft Learn
    Jan 7, 2021 · A critical section object provides synchronization similar to that provided by a mutex object, except that a critical section can be used only by the threads ...
  16. [16]
    __sync Builtins (Using the GNU Compiler Collection (GCC))
    The `__sync` functions are legacy built-ins for atomic memory access, compatible with Intel Itanium, implemented using `__atomic` and should not be used for ...Missing: hardware | Show results with:hardware
  17. [17]
    An implementation of the C11 <code><stdatomic.h></code> interface
    ### Summary: Fallback Implementations for C11 Atomic Operations (e.g., `atomic_exchange`)
  18. [18]
    [PDF] Review: Test-and-set spinlock
    • Test-and-set spinlock has several advantages. - Simple to implement and understand. - One memory location for arbitrarily many CPUs. • But also has ...
  19. [19]
    [PDF] Synchronization - Cornell: Computer Science
    Hardware Primitive: Test and Set. Test-and-set is a typical way to achieve synchronization when only one processor is allowed to access a critical section ...
  20. [20]
    [PDF] Solution of a Problem in Concurrent Programming Control
    The solution ensures only one of N cyclic processes is in its critical section at a time, using a common store and a boolean array and integer.
  21. [21]
    [PDF] Lecture Summary Mutual Exclusion
    This section presents two algorithms that solve the mutual exclusion problem. The first one is due to Dekker and it solves the mutual exclusion problem for two ...
  22. [22]
    [PDF] Instruction latencies and throughput for AMD and Intel x86 processors
    Aug 2, 2019 · The latency is the indicated 5 cycles for the upper product half. The latency for the lower part is 4 cycles. 2. The latency is data ...
  23. [23]
    Evaluating the Cost of Atomic Operations on Modern Architectures
    PDF | On Oct 1, 2015, Hermann Schweizer and others published Evaluating the Cost of Atomic Operations on Modern Architectures | Find, read and cite all the ...
  24. [24]
    [PDF] More Than You Ever Wanted to Know about Synchronization
    Test-and-set spinlocks are prone to the bouncing problem where acquiring a lock invalidates the cache of all threads reading the lock as they are waiting ...<|separator|>
  25. [25]
    [PDF] The Performance of Spin Lock Alternatives for Shared-Memory ...
    The waiting processor sees the change in state and performs a test-and-set; if someone acquired the lock in the interim, the processor can resume spinning in ...Missing: probability | Show results with:probability
  26. [26]
    [PDF] Unlocking Energy - USENIX
    Jun 22, 2016 · In this section, we evaluate the costs of busy waiting and sleeping, and examine different ways of reducing them. 4.1 Power: The Price of Busy ...
  27. [27]
    [PDF] Everything You Still Don't Know About Synchronization and How It ...
    In this study, we present an analysis of various layers - hardware atomic instructions, software synchronization prim- itives and applications which use these ...
  28. [28]
    [PDF] Portable tools for performance analysis - lmbench
    Abstract lmbench is a micro-benchmark suite designed to focus attention on the basic building blocks of many common system applications, such as databases, ...
  29. [29]
    [PDF] Art of Multiprocessor Programming - School of Computing Science
    ... Test-And-Set Locks. 144. 7.3 TAS-Based Spin Locks Revisited. 146. 7.4 ... art of multiprocessor programming spend time solving the problems presented.
  30. [30]
    [PDF] Lecture 6: Semaphores and Monitors - UCSD CSE
    Locks work, but they have some drawbacks when critical sections are long. ♢. Spinlocks – inefficient. ♢. Disabling interrupts – can miss or delay important ...Missing: disadvantages | Show results with:disadvantages
  31. [31]
    Chapter Six -- Synchronization Tools -- Lecture Notes
    Spinlocks have the advantage of not requiring those context switches, so systems often use spinlocks in situations where the wait is expected to take time less ...
  32. [32]
    [PDF] Algorithms for scalable synchronization on shared-memory ...
    Three algorithms-the test-and-set lock with exponential backoff, the ticket lock with proportional backoff, and the MCS lock-all scale extremely well. Ignoring ...