Cache coherence
Cache coherence refers to the discipline that ensures all processors in a multiprocessor system maintain a consistent view of shared memory data stored in their local caches, preventing inconsistencies such as stale data from unpropagated writes.[1] This uniformity is essential in shared-memory multiprocessors, where multiple caches may hold copies of the same data block, and changes to shared operands must be propagated in a timely manner to avoid the cache coherence problem, in which processors observe different values for the same memory location.[2][1] The problem arises primarily in systems with private caches per processor: each cache reduces memory access latency but can lead to multiple inconsistent copies of shared data, and without coherence mechanisms, parallel programs relying on shared read-write data structures would produce incorrect results due to visibility issues.[1] Cache coherence protocols enforce key invariants, such as the Single-Writer-Multiple-Reader (SWMR) rule, under which a memory location has at most one writer or any number of readers at a time, and the Data-Value Invariant, which ensures that the value written is visible to subsequent readers after the writer's epoch ends.[1] These protocols operate at the granularity of cache blocks (typically 64 bytes), using states like Modified (M), Exclusive (E), Shared (S), and Invalid (I) in variants such as MSI or MESI to track and update data.[1][2]

Mechanisms for achieving coherence fall into hardware-based and software-based categories, with hardware approaches dominating modern implementations for their transparency to programmers.[2] Snooping protocols, common in bus-based systems, broadcast coherence transactions (e.g., invalidations or updates) to all caches, which "snoop" the bus and respond accordingly; examples include the Write-Once and Berkeley protocols from the 1980s.[1][2] For scalability in larger systems, directory-based protocols use a centralized or distributed directory to track cache block locations and multicast targeted messages, avoiding broadcast overhead; pioneering systems such as Stanford DASH (1990) and SGI Origin 2000 (mid-1990s) demonstrated their viability, with the Origin architecture designed to scale to as many as 1024 processors.[1][2] Write-invalidate policies, which invalidate other copies on a write, are more common than write-update due to lower bandwidth needs, though hybrids exist for optimization.[2]

Historically, cache coherence emerged with early multiprocessors in the 1970s and 1980s, building on Leslie Lamport's 1979 definition of sequential consistency, which requires operations to appear in a single total order to all processors.[1] Seminal work includes the MOESI state model introduced by Sweazey and Smith in 1986, and surveys like Stenström's 1990 review of schemes, which classified hardware solutions as snoopy, directory, or network-based.[1][2] Today, coherence extends to heterogeneous systems including GPUs and accelerators, with optimizations like token coherence (2003) and scoped models for efficient synchronization in compute-intensive workloads.[1] These advancements ensure cache coherence remains foundational for performance and correctness in multicore and distributed architectures.[1]
Introduction
Overview
Cache coherence is the discipline that ensures shared data stored in multiple local caches remains consistent across processors in a shared-memory multiprocessor system.[2] This uniformity prevents discrepancies that could arise when processors independently cache copies of the same data from main memory.[2] In symmetric multiprocessing (SMP) systems, where each processor maintains a private cache to exploit data locality, cache coherence plays a critical role in preventing data inconsistencies by synchronizing updates across all relevant cache copies.[2] Without it, processors might operate on stale or divergent versions of shared data, leading to incorrect computations in parallel applications.[2] Coherence is typically maintained at the granularity of a cache block, or line, which serves as the atomic unit for data transfers between caches and main memory.[2] This approach enables fast local access to data while upholding global consistency, often handling over 95% of memory requests entirely within caches to reduce average access latency and bus traffic.[2] A key challenge in cache coherence arises from bandwidth overhead in bus-based systems, where maintaining consistency may necessitate broadcasting updates or invalidations to all processors, potentially limiting scalability.[2]
Historical Development
The cache coherence problem emerged in the late 1970s as multiprocessor systems began incorporating private caches per processor to improve performance, leading to inconsistencies where different processors could hold stale copies of shared data. Early parallel machines like the ILLIAC IV, operational in 1972, operated as SIMD array processors without private caches, avoiding coherence issues but highlighting the need for scalable memory access in parallel computing. By the late 1970s, the first formal recognition of multicache coherence challenges appeared, with proposals for directory-based tracking of cache states to ensure consistency without broadcasts. Initial symmetric multiprocessors (SMPs) in the early 1980s often lacked hardware coherence mechanisms, relying instead on software flushes or write-through policies that sacrificed performance for correctness. Snooping protocols were introduced in 1983 with James Goodman's write-once protocol, proposed in the context of the IEEE Futurebus standardization effort; in this approach, caches monitor ("snoop") bus transactions and invalidate or update their local copies as needed. This enabled efficient hardware enforcement of coherence in bus-based SMPs. Commercial adoption followed with systems like the Sequent Symmetry in 1987, a 20-processor i386-based SMP that implemented snooping for copy-back caches, demonstrating practical scalability for shared-memory multiprocessing.[2] The late 1980s saw the development of the MSI (Modified, Shared, Invalid) protocol, a foundational snooping-based invalidate scheme that reduced bus traffic by distinguishing modified blocks from shared ones, as evaluated in early multiprocessor simulations.[2] An optimization, the MESI (Modified, Exclusive, Shared, Invalid) protocol, originally described as the Illinois protocol by Papamarcos and Patel in 1984, was integrated into Intel processors such as the Pentium in the early 1990s; its Exclusive state avoids unnecessary invalidations and improves bandwidth efficiency when blocks are not actually shared.

As bus-based snooping struggled with scalability beyond dozens of processors, the 1990s marked a shift to directory-based protocols, exemplified by the Stanford DASH project in 1991, which used distributed directories to track cache block ownership via point-to-point messages, enabling coherent shared memory for up to 64 processors with latencies around 120 cycles for remote accesses.[3]

Post-2020 developments have focused on heterogeneous and multi-socket systems, with ARM's AMBA CHI specification reaching Issue E in 2021 to support scalable coherent interconnects across diverse accelerators and CPUs, incorporating features like atomic operations and flow control for chiplet-based designs.[4] Multi-socket AMD EPYC processors, such as the 7003 series (Milan, 2021), employ directory-based coherence over Infinity Fabric links to maintain consistency across up to 64 cores per socket.[5] Similarly, Intel Xeon processors like Sapphire Rapids (2023) integrate MESIF extensions with directory protocols for multi-socket scalability, supporting up to 8 sockets while minimizing coherence overhead in cloud and HPC environments.[6]
Fundamentals
Cache Systems in Multiprocessors
In multiprocessor systems, particularly multi-core CPUs, each processing core typically features private L1 and L2 caches dedicated to its local computations, while higher-level caches such as L3 are shared among multiple cores to improve overall access efficiency.[7] The private L1 cache, split into instruction and data subsets, provides the fastest access for frequently used data, whereas the L2 cache offers larger capacity at somewhat higher latency.[8] Shared L3 caches, often serving as the last-level cache (LLC), aggregate resources across cores to reduce latency for inter-core data sharing and minimize trips to main memory.[9] The cache hierarchy in these systems is organized into multiple levels, with L1 being the smallest and fastest (typically 32-64 KB per core), L2 intermediate (256 KB to 2 MB per core), and L3 the largest (8-128 MB shared or more).[10] Caches employ set-associative mapping, where associativity determines the number of cache lines per set to balance hit rates and hardware complexity. Cache blocks, or lines, are commonly 64 bytes in modern processors, a size chosen to align with memory bus widths and typical data access patterns.[11]

Multiprocessors operate under shared memory models, primarily uniform memory access (UMA) or non-uniform memory access (NUMA). In UMA systems, all processors access the shared memory with equal latency via a common interconnect, simplifying design but limiting scalability.[12] NUMA architectures, common in larger systems, assign memory modules locally to processor nodes, resulting in faster access to local memory but higher latency for remote accesses, which influences cache design for locality optimization.[13]

Cores interact with memory hierarchically: a core first checks its private L1 cache for reads or writes; misses propagate to L2, then to shared L3, and finally to main memory if needed.[14] This process enables data replication, where the same memory block can reside in multiple caches simultaneously to exploit spatial and temporal locality and reduce bus contention.[14] Such replication enhances performance but introduces the potential for inconsistencies across caches.

Inclusion policies dictate how data propagates between levels: inclusive policies require all data in lower-level caches (e.g., L1/L2) to also exist in the higher-level cache (e.g., L3), as seen in many Intel architectures like Nehalem.[15] Exclusive policies, conversely, prohibit overlap, allowing higher utilization of total cache space, which AMD processors often employ in their LLC designs.[15]
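The address-to-set mapping described above can be illustrated with a short computation. The sketch below assumes a 32 KB, 8-way set-associative cache with 64-byte lines; these parameters are chosen for the example rather than taken from any particular processor.

```cpp
#include <cstdint>
#include <cstdio>

// Minimal sketch of address-to-set mapping for a set-associative cache.
// The parameters (32 KB, 64-byte lines, 8-way) are illustrative assumptions.
constexpr uint64_t kCacheBytes = 32 * 1024;
constexpr uint64_t kLineBytes  = 64;
constexpr uint64_t kWays       = 8;
constexpr uint64_t kNumSets    = kCacheBytes / (kLineBytes * kWays);  // 64 sets

struct CacheIndex {
    uint64_t offset;  // byte within the 64-byte line
    uint64_t set;     // which set the line maps to
    uint64_t tag;     // remaining high-order bits, compared on lookup
};

CacheIndex decompose(uint64_t addr) {
    return CacheIndex{
        addr % kLineBytes,               // offset within the line
        (addr / kLineBytes) % kNumSets,  // set index
        addr / (kLineBytes * kNumSets)   // tag
    };
}

int main() {
    // Two addresses kNumSets * kLineBytes = 4096 bytes apart map to the same
    // set and therefore compete for the same 8 ways.
    const uint64_t addrs[2] = {0x12345678ULL, 0x12345678ULL + kNumSets * kLineBytes};
    for (uint64_t addr : addrs) {
        CacheIndex idx = decompose(addr);
        std::printf("addr=0x%llx tag=0x%llx set=%llu offset=%llu\n",
                    (unsigned long long)addr, (unsigned long long)idx.tag,
                    (unsigned long long)idx.set, (unsigned long long)idx.offset);
    }
    return 0;
}
```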
The Coherence Problem
In multiprocessor systems equipped with private caches, the cache coherence problem emerges when multiple copies of the same memory block are cached across processors, allowing one processor's modification to go unreflected in others' caches and thereby producing inconsistent data views. This issue stems from the replication of data blocks to exploit locality; without coordination, it violates the expectation that shared memory provides a single, well-defined value for any location at any time.[1] A fundamental scenario illustrating the problem involves two processors accessing a shared variable A initialized to 42 in main memory. Processor P1 reads A into its cache, obtaining 42, and processor P2 likewise reads A into its cache, also obtaining 42. P1 then writes 43 to A, updating its local cache copy while, under a write-back policy, leaving main memory unchanged. When P2 later rereads A, it hits in its own cache and retrieves the stale 42 instead of the updated 43, observing an outdated value despite P1's intervening write.[1] The following table traces the progression of cache states in this two-processor example without coherence mechanisms, for the shared block containing A (initially 42 in memory), assuming a basic state model in which blocks can be invalid (I), shared (S), or modified (M/dirty):
| Operation | P1 Cache State for Block | P2 Cache State for Block | Main Memory |
|---|---|---|---|
| Initial | I | I | 42 |
| P1 reads A | S, 42 | I | 42 |
| P2 reads A | S, 42 | S, 42 | 42 |
| P1 writes 43 to A | M, 43 (dirty) | S, 42 | 42 |
| P2 rereads A | M, 43 (dirty) | S, 42 (hit, stale) | 42 |
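The stale-read sequence traced in the table can be reproduced with a toy software model of two private write-back caches and no coherence mechanism; the classes, addresses, and values below are purely illustrative and do not model any real hardware.

```cpp
#include <cstdio>
#include <unordered_map>

// Toy model: each processor has a private write-back cache with no coherence.
// All names, addresses, and values are illustrative.
struct ToyCache {
    std::unordered_map<int, int> lines;  // address -> cached value

    int read(int addr, const std::unordered_map<int, int>& memory) {
        auto it = lines.find(addr);
        if (it != lines.end()) return it->second;  // cache hit, possibly stale
        int v = memory.at(addr);                   // miss: fetch from memory
        lines[addr] = v;
        return v;
    }
    void write(int addr, int value) {
        lines[addr] = value;  // write-back: main memory is NOT updated yet
    }
};

int main() {
    std::unordered_map<int, int> memory{{0x100, 42}};  // shared variable A = 42
    ToyCache p1, p2;

    std::printf("P1 reads A: %d\n", p1.read(0x100, memory));            // 42
    std::printf("P2 reads A: %d\n", p2.read(0x100, memory));            // 42
    p1.write(0x100, 43);                                                // only P1's copy changes
    std::printf("P2 rereads A: %d (stale)\n", p2.read(0x100, memory));  // still 42
    std::printf("Memory still holds: %d\n", memory.at(0x100));          // 42
    return 0;
}
```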
Requirements and Models
Formal Definition
Cache coherence is formally defined as a protocol-level property in shared-memory multiprocessor systems that maintains a consistent view of shared data across multiple private caches, ensuring that the effects of memory operations are uniformly observable by all processors. This definition encompasses two fundamental invariants: the single-writer-multiple-reader (SWMR) invariant and the data-value invariant. The SWMR invariant stipulates that, for any shared memory location, at any logical time, either exactly one processor has permission to both read from and write to it (acting as the sole writer and reader), or multiple processors have read-only permission with no writer allowed.[1] The data-value invariant requires that the value stored in a memory location at the conclusion of a read-only epoch matches the value established at the end of the preceding read-write epoch for that location.[1]

These invariants are upheld through three essential requirements: write propagation, serialization, and read reflection. Write propagation ensures that every write operation to a shared location eventually becomes visible to all other processors in the system, preventing stale data from persisting indefinitely in remote caches.[1] Serialization imposes a total order on all writes to the same memory location, such that every processor perceives these writes in the identical sequence, thereby avoiding conflicting interpretations of update history.[1] Read reflection guarantees that any read operation following a write to the same location will retrieve either the value from that write or from a later write in the serialized order, ensuring reads accurately reflect the current state.[1] A precise formal statement of cache coherence is as follows: for any memory address A, if a processor executes a write W to A followed later by a read R to A (possibly on a different processor), then R must return the value produced by W or by some write W' to A that follows W in the total serialization order of writes to A; moreover, all reads across processors must observe the writes to A in this same total order.[1]

Cache coherence must be distinguished from cache consistency in a single-processor environment. In uniprocessor systems, cache consistency pertains solely to synchronizing a single cache with main memory, via write-through or write-back policies that govern when updates reach main memory. By contrast, cache coherence in multiprocessor systems extends this to coordinate multiple caches, ensuring inter-cache agreement on shared data values and preventing inconsistencies arising from concurrent access by multiple processors.[1] Coherence is maintained at the granularity of cache lines (also called cache blocks), treating these fixed-size units (typically 64 bytes) as the atomic entities for consistency management, rather than finer-grained bytes or words, to optimize hardware efficiency and reduce protocol overhead.[1] Edge cases include uncached memory accesses, where data bypasses caches and interacts directly with main memory; coherence protocols must ensure these do not disrupt invariants, often by routing such accesses through the same serialization points as cached operations. Additionally, I/O coherence addresses direct memory access (DMA) by peripherals, requiring extensions like flush or snoop mechanisms to propagate device writes to processor caches and invalidate stale copies, maintaining system-wide consistency.[1]
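The SWMR invariant can be made concrete with a small trace checker that divides a location's logical time into epochs and verifies that every read-write epoch has exactly one participant; the epoch encoding below is an assumption for illustration, not part of any published protocol specification.

```cpp
#include <cstdio>
#include <set>
#include <vector>

// Illustrative check of the single-writer-multiple-reader (SWMR) invariant.
// An epoch for one memory location is either read-write (exactly one core
// may read and write) or read-only (any number of cores may read).
struct Epoch {
    bool read_write;      // true: read-write epoch, false: read-only epoch
    std::set<int> cores;  // cores holding permission during this epoch
};

bool satisfies_swmr(const std::vector<Epoch>& epochs_for_location) {
    for (const Epoch& e : epochs_for_location) {
        if (e.read_write && e.cores.size() != 1) return false;  // one writer only
        // Read-only epochs may have any number of readers.
    }
    return true;
}

int main() {
    // Passes: a single writer (core 2), then three readers.
    std::vector<Epoch> ok  = {{true, {2}}, {false, {0, 1, 3}}};
    // Fails: two cores claim write permission in the same epoch.
    std::vector<Epoch> bad = {{true, {0, 1}}};

    std::printf("ok trace:  %s\n", satisfies_swmr(ok)  ? "SWMR holds" : "violation");
    std::printf("bad trace: %s\n", satisfies_swmr(bad) ? "SWMR holds" : "violation");
    return 0;
}
```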
Consistency Models
Memory consistency models specify the permissible orderings of memory operations (reads and writes) as observed across multiple processors in a shared-memory multiprocessor system, defining a contract between hardware and software on how operations interleave and become visible.[1] These models ensure that parallel programs execute predictably, but they differ fundamentally from cache coherence, which focuses solely on ensuring a unique value for each memory location at any time, without prescribing inter-operation ordering.[1] Cache coherence provides the foundational visibility of updates (e.g., via invalidation or update protocols), enabling the implementation of various consistency models by guaranteeing that writes propagate correctly, while consistency models build atop this to enforce broader ordering guarantees for correctness in parallel programming.[1] The strongest such model is sequential consistency (SC), introduced by Lamport, which mandates that the results of any execution appear as if all memory operations from all processors occurred atomically in some global total order that respects the program order on each individual processor.[16] Under SC, operations on distinct memory locations must also appear sequentially ordered, preventing reordering that could alter program semantics, such as a processor seeing its own write before another processor's intervening write.[16] This model simplifies programming by mimicking the behavior of a uniprocessor but imposes strict hardware constraints, often requiring full serialization of operations to maintain the global order.[1]

Weaker models relax these constraints to enhance performance through reordering and buffering, while still leveraging cache coherence for value visibility. Processor consistency, proposed by Goodman, preserves per-processor program order among reads (R→R) and among writes (W→W) but relaxes the ordering between a write and a subsequent read from the same processor, allowing the read to bypass the write while it waits in a store buffer; writes from any single processor must still become visible to all others in program order. This permits store buffering, where writes are delayed before becoming globally visible, reducing contention but requiring programmers to use synchronization for cross-processor ordering. Release consistency, developed by Gharachorloo et al., further weakens guarantees by tying ordering to explicit synchronization events: ordinary data accesses may be reordered and buffered freely between synchronization points, while an acquire ensures that subsequent accesses do not begin before it completes and a release ensures that all prior accesses have completed before it, allowing aggressive hardware optimizations such as delayed updates. Hardware implements these models via cache coherence protocols combined with mechanisms like store buffers, load bypassing, and memory fences (barriers).
For SC, coherence protocols must enforce immediate global visibility and strict serialization, often at high cost; weaker models like processor or release consistency tolerate relaxed propagation, using fences to insert ordering points that flush buffers or stall reordering, ensuring coherence-maintained updates respect synchronization.[1] For example, the x86 architecture's Total Store Order (TSO) model, a variant of processor consistency, allows a store to wait in a store buffer while later loads execute, so a load may complete before an earlier store becomes globally visible; however, it preserves the order of loads relative to loads and of stores relative to stores, and it establishes a single global order of all stores, relying on cache coherence for write propagation and on fence or locked instructions (e.g., MFENCE) for stronger guarantees. In contrast, ARM's weakly ordered model permits extensive reordering of loads and stores unless constrained by explicit barriers (e.g., DMB); cache coherence still ensures that updates become visible, but programmers must insert barriers to achieve sequential-like behavior, highlighting that coherence alone does not suffice without model-specific synchronization.
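The classic store-buffering litmus test makes the gap between coherence and consistency concrete. The sketch below uses C++ std::atomic with relaxed ordering as a stand-in for a weak hardware memory model; a given run may or may not exhibit the relaxed outcome, and switching both accesses to memory_order_seq_cst would forbid it.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Store-buffering litmus test. Under sequential consistency the outcome
// r1 == 0 && r2 == 0 is forbidden; under weaker models (e.g., x86 TSO or the
// relaxed ordering used here) it is allowed, because each store may sit in a
// store buffer while the subsequent load executes.
std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread0() {
    x.store(1, std::memory_order_relaxed);   // W x = 1
    r1 = y.load(std::memory_order_relaxed);  // R y
}

void thread1() {
    y.store(1, std::memory_order_relaxed);   // W y = 1
    r2 = x.load(std::memory_order_relaxed);  // R x
}

int main() {
    std::thread t0(thread0), t1(thread1);
    t0.join();
    t1.join();
    // Coherence guarantees both threads eventually observe both writes; it does
    // not by itself forbid the reordering that makes r1 == 0 && r2 == 0 possible.
    std::printf("r1=%d r2=%d%s\n", r1, r2,
                (r1 == 0 && r2 == 0) ? "  (relaxed/TSO-style outcome)" : "");
    return 0;
}
```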
Core Mechanisms
Snooping Protocols
Snooping protocols maintain cache coherence in bus-based shared-memory multiprocessor systems by having each cache controller continuously monitor, or "snoop," all transactions on the shared bus for memory addresses that match lines in its cache.[17] This monitoring allows caches to detect relevant read or write requests from other processors and respond accordingly to ensure data consistency across the system.[17] The approach relies on a broadcast medium like a bus, where every transaction is visible to all caches, enabling simple hardware implementation without centralized coordination.[17]

Cache controllers in snooping systems typically track line states such as modified (data has been altered and is the only valid copy), shared (data is unchanged and may exist in multiple caches), and invalid (data is no longer valid and must be fetched from memory or another cache).[18] State transitions occur in response to bus events: for example, a processor's read miss triggers a bus read request, which other caches snoop and supply data if in the modified state, transitioning to shared; a write request may invalidate shared copies in other caches to prevent stale data.[18] These transitions ensure that no processor reads outdated data while allowing efficient local access to unmodified lines.[18] Specific protocols like the MSI (Modified-Shared-Invalid) protocol exemplify this by defining precise rules for state changes on bus requests.[18]

Snooping protocols operate under two primary bus strategies: write-invalidate and write-update. In write-invalidate protocols, a writing processor broadcasts an invalidate signal on the bus before updating its copy, forcing other caches to mark matching lines invalid and refetch on future reads.[18] This minimizes bus traffic for repeated writes by a single processor but can increase misses if sharing is frequent.[18] In contrast, write-update protocols broadcast the new data value to all caches on a write, allowing shared copies to update in place without invalidation, which reduces latency for subsequent reads by others but consumes more bandwidth due to data broadcasts.[18] Write-invalidate is more common in practice due to lower overall traffic in typical workloads.[18]

These protocols offer low-latency coherence enforcement in small-scale systems, as snooping enables quick interventions without directory lookups or point-to-point messaging.[19] However, they suffer from bus bandwidth contention caused by frequent broadcasts, limiting scalability to modest numbers of processors, typically 16 to 32, beyond which traffic overwhelms the shared medium.[19] Implementations are prevalent in uniform memory access (UMA) architectures, such as early generations of Intel Xeon processors that used the Front Side Bus (FSB) for multi-socket coherence via snooping.[20]
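The snooping side of a write-invalidate controller can be sketched as a routine that each cache executes for every bus transaction it observes. The state and transaction names below follow the MSI-style conventions used in this article but are otherwise illustrative assumptions, not a model of any specific design.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative snooping logic for a write-invalidate protocol with MSI-style
// states. Every cache controller runs this for each observed bus transaction.
enum class LineState { Invalid, Shared, Modified };
enum class BusOp { BusRd, BusRdX, BusUpgr };

struct CacheLine {
    uint64_t tag;
    LineState state;
    // data payload omitted
};

// Returns true if this cache must supply (flush) the block on the bus.
bool snoop(CacheLine& line, uint64_t addr_tag, BusOp op) {
    if (line.state == LineState::Invalid || line.tag != addr_tag)
        return false;                              // no matching valid copy

    bool was_dirty = (line.state == LineState::Modified);
    switch (op) {
        case BusOp::BusRd:                         // another core wants to read
            if (was_dirty)
                line.state = LineState::Shared;    // supply dirty data, downgrade
            return was_dirty;
        case BusOp::BusRdX:                        // another core wants to write
        case BusOp::BusUpgr:
            line.state = LineState::Invalid;       // drop the local copy
            return was_dirty && op == BusOp::BusRdX;
    }
    return false;
}

int main() {
    CacheLine line{0x42, LineState::Modified};
    bool supplied = snoop(line, 0x42, BusOp::BusRd);
    std::printf("snooped BusRd on a Modified line: supply=%d, new state=%s\n",
                supplied, line.state == LineState::Shared ? "Shared" : "other");
    return 0;
}
```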
Directory-Based Protocols
Directory-based protocols address the scalability limitations of snooping protocols by maintaining a centralized or distributed directory that tracks the state and location of each cache block across the system. Unlike snooping mechanisms that rely on broadcasting requests to all caches, directory-based approaches use point-to-point communication to notify only the relevant caches, making them suitable for large-scale non-uniform memory access (NUMA) and distributed shared-memory systems. The directory typically resides with the home memory node for each block and records which caches hold copies of the block (sharers) and in what state, such as shared, exclusive, or modified.[21]

The core operation involves consulting the directory on a cache miss. For a read miss, the requesting processor sends a request to the home directory; if the block is clean and shared, it is supplied from memory or forwarded from a sharer, and the directory updates to include the new sharer. For a write miss, the directory identifies all current sharers and sends invalidation messages point-to-point to them, ensuring exclusive access before supplying the block to the requester. This selective notification reduces network traffic compared to broadcasts, though it introduces indirection latency as the home node coordinates responses. Distributed directories, where entries are spread across memory controllers, further enhance scalability by localizing access and avoiding a single point of contention.[22][23]

Several variants optimize the directory structure to balance storage overhead and performance. The full-bit vector scheme uses one bit per processor to indicate presence, plus state bits (e.g., a dirty bit), allowing precise tracking of all sharers but incurring high storage costs (e.g., up to 50% overhead for 256 processors). Coarse vector variants group processors into clusters and use fewer bits per group, approximating sharer locations to reduce space while tolerating some imprecision. Pointer-based schemes store a limited number of pointers (e.g., 4-8) to exact sharer locations, which is efficient when few caches share blocks but requires overflow handling. When the pointer limit is exceeded, protocols may evict a random pointer (potentially causing unnecessary invalidations), fall back to broadcasting, or use chaining to link additional entries, though this adds complexity and latency. These trade-offs were evaluated in early simulations showing full-bit vectors provide the best accuracy but scale poorly in storage, while limited pointers suit systems with low sharing degrees.[21][22]

Directory-based protocols offer significant advantages in scalability, supporting hundreds of processors without the broadcast storms that limit snoopy systems to dozens, but they suffer from higher access latency due to directory lookups and multi-message exchanges. Seminal implementations include the Stanford DASH multiprocessor (1990), which pioneered distributed full-bit vector directories for up to 64 processors, influencing later designs. In the 1990s, SGI Origin servers employed hybrid bit-vector and coarse-vector directories, scaling to 1024 processors in cc-NUMA configurations with low remote access penalties. Modern examples, such as multi-socket AMD EPYC systems, integrate directory-based coherence over Infinity Fabric links, using probe filters in the L3 cache to track inter-socket sharers efficiently in MOESI protocols.[24]
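A full-bit-vector directory entry and the point-to-point fan-out of invalidations on a write miss can be sketched as follows; the fixed processor count, state names, and message handling are assumptions for illustration only.

```cpp
#include <bitset>
#include <cstdio>
#include <vector>

// Illustrative full-bit-vector directory entry for a system with 64 processors.
constexpr int kProcs = 64;

enum class DirState { Uncached, Shared, Exclusive };

struct DirectoryEntry {
    DirState state = DirState::Uncached;
    std::bitset<kProcs> sharers;  // one presence bit per processor
};

// Handle a write miss from `requester`: collect the caches that must be
// invalidated, then record the requester as the exclusive owner.
std::vector<int> handle_write_miss(DirectoryEntry& entry, int requester) {
    std::vector<int> invalidation_targets;
    for (int p = 0; p < kProcs; ++p)
        if (entry.sharers.test(p) && p != requester)
            invalidation_targets.push_back(p);  // point-to-point invalidations

    entry.sharers.reset();
    entry.sharers.set(requester);
    entry.state = DirState::Exclusive;
    return invalidation_targets;
}

int main() {
    DirectoryEntry e;
    e.state = DirState::Shared;
    e.sharers.set(3); e.sharers.set(7); e.sharers.set(12);

    for (int target : handle_write_miss(e, 7))
        std::printf("send invalidate to P%d\n", target);  // P3 and P12 only
    std::printf("owner is now P7 (exclusive)\n");
    return 0;
}
```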
Specific Protocols
Invalidation-Based Protocols
Invalidation-based protocols maintain cache coherence by invalidating copies of a cache line in other caches when a processor writes to it, ensuring that only the writing cache has a valid copy. This approach contrasts with update-based methods by avoiding the broadcast of write data, thereby conserving bandwidth in shared-bus systems. These protocols typically operate in snoopy implementations, where caches monitor bus transactions to update their states accordingly.[1]

The MSI (Modified, Shared, Invalid) protocol is a foundational invalidation-based scheme using three stable states for each cache line: Invalid (I), indicating the line is uninitialized or stale; Shared (S), indicating the line is clean and possibly replicated across multiple caches; and Modified (M), indicating the line is dirty and uniquely held by one cache. Transitions occur on processor requests (PrRd for read, PrWr for write) or bus actions (BusRd for shared reads, BusRdX for exclusive reads/writes, BusUpgr for upgrades). On a read hit, the state remains unchanged whether in S or M. A read miss from I fetches the line via BusRd and installs it in S. A write hit in S issues BusUpgr to invalidate other copies, moving to M; in M, it stays in M without a bus action. A write miss from I issues BusRdX, fetching the line and invalidating other copies to enter M. On eviction of a line in M, the cache supplies the dirty data via BusWr and the line transitions to I.[1] The following table summarizes key state transitions in the MSI protocol for a snoopy bus implementation:
| Event | Current State | Action/Bus Request | New State | Notes |
|---|---|---|---|---|
| PrRd (read hit) | S | None | S | Data from cache |
| PrRd (read hit) | M | None | M | Data from cache |
| PrRd (read miss) | I | BusRd | S | Fetch from memory or sharer |
| PrWr (write hit) | S | BusUpgr | M | Invalidate other caches |
| PrWr (write hit) | M | None | M | Update cache only |
| PrWr (write miss) | I | BusRdX | M | Fetch, invalidate others |
| Eviction/Write-back | M | BusWr | I | Supply dirty data to bus |
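The processor-side transitions in the table above can be expressed as a small function that maps the current state and request to the bus transaction issued and the next state. This is a sketch of the table only; snooped transitions and write-backs are omitted, and the type names are illustrative.

```cpp
#include <cstdio>

// Sketch of the processor-side MSI transitions from the table above.
enum class State { I, S, M };
enum class Request { PrRd, PrWr };
enum class BusTx { None, BusRd, BusRdX, BusUpgr };

struct Outcome { BusTx bus; State next; };

Outcome msi_processor_event(State current, Request req) {
    switch (current) {
        case State::I:
            return (req == Request::PrRd) ? Outcome{BusTx::BusRd,  State::S}    // read miss
                                          : Outcome{BusTx::BusRdX, State::M};   // write miss
        case State::S:
            return (req == Request::PrRd) ? Outcome{BusTx::None,   State::S}    // read hit
                                          : Outcome{BusTx::BusUpgr, State::M};  // invalidate others
        case State::M:
            return Outcome{BusTx::None, State::M};                              // hit, no bus action
    }
    return Outcome{BusTx::None, current};
}

int main() {
    Outcome o = msi_processor_event(State::S, Request::PrWr);
    std::printf("S + PrWr -> issue %s, next state %s\n",
                o.bus == BusTx::BusUpgr ? "BusUpgr" : "other",
                o.next == State::M ? "M" : "other");
    return 0;
}
```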
The MESI (Modified, Exclusive, Shared, Invalid) protocol extends MSI with an Exclusive (E) state, which marks a clean line held by exactly one cache; a subsequent write can then upgrade the line to Modified silently, without a bus transaction. The following table summarizes key transitions introduced or changed by the E state:
| Event | Current State | Action/Bus Request | New State | Notes |
|---|---|---|---|---|
| PrRd (read hit) | E | None | E | Data from cache |
| PrRd (read miss) | I | BusRd | E | If no sharers (clean fetch) |
| PrRd (read miss) | I | BusRd | S | If sharers exist |
| PrWr (write hit) | E | None | M | Silent upgrade |
| PrWr (write hit) | S | BusUpgr | M | Invalidate others |
| Snooped BusRd | E | None | S | Downgrade to shared |
| Eviction | E | None (PutE) | I | No data write-back |
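The key MESI optimization, the silent upgrade from Exclusive to Modified, can be contrasted with the Shared-state upgrade in a short sketch; the enumerations are illustrative and only the write-hit cases from the tables above are shown.

```cpp
#include <cstdio>

// Sketch of the MESI write-hit behavior: Exclusive upgrades to Modified
// silently, while Shared must broadcast an upgrade to invalidate other copies.
enum class State { I, S, E, M };
enum class BusTx { None, BusUpgr };

struct Outcome { BusTx bus; State next; };

Outcome mesi_write_hit(State current) {
    switch (current) {
        case State::E: return {BusTx::None,    State::M};  // silent upgrade
        case State::S: return {BusTx::BusUpgr, State::M};  // invalidate other copies
        case State::M: return {BusTx::None,    State::M};
        default:       return {BusTx::None,    current};   // I: handled as a write miss
    }
}

int main() {
    std::printf("E + PrWr: %s\n",
                mesi_write_hit(State::E).bus == BusTx::None ? "no bus traffic" : "BusUpgr");
    std::printf("S + PrWr: %s\n",
                mesi_write_hit(State::S).bus == BusTx::BusUpgr ? "BusUpgr" : "no bus traffic");
    return 0;
}
```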
The MOESI protocol, whose state model was introduced by Sweazey and Smith in 1986, adds an Owned (O) state that lets a cache hold a dirty line while other caches keep shared copies; the owner services read requests directly and defers the write-back to memory. The following table summarizes key transitions involving the O state:
| Event | Current State | Action/Bus Request | New State | Notes |
|---|---|---|---|---|
| PrRd (read hit) | O | None | O | Data from cache |
| PrWr (write hit) | O | None | M | Update owned line |
| Snooped BusRd | M | BusWr (supply data) | O | Share dirty data, retain ownership |
| Snooped BusRd | O | BusWr (supply data) | O | Service from owner |
| PrRd (read miss) | I | BusRd | S | Dirty data supplied by owner; requester installs a shared copy |
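The defining behavior of the Owned state, servicing other readers from a dirty line without a memory write-back, can be sketched as follows. The structures are illustrative, and the memory update is intentionally omitted to show that the owner retains responsibility for the dirty data.

```cpp
#include <cstdio>

// Sketch of the Owned-state behavior: a snooped BusRd is serviced by the
// owner cache-to-cache, without updating main memory.
enum class State { I, S, E, O, M };

struct Line { State state; int value; };

// Owner side: respond to a snooped BusRd on a dirty line.
int owner_snoop_busrd(Line& owner) {
    if (owner.state == State::M) owner.state = State::O;  // keep ownership, stay dirty
    return owner.value;  // supply data on the bus; memory is NOT updated
}

// Requester side: a read miss serviced by an owner installs a Shared copy.
Line requester_read_miss(int supplied_value) {
    return Line{State::S, supplied_value};
}

int main() {
    Line owner{State::M, 43};  // dirty value 43; memory still holds the old value
    Line reader = requester_read_miss(owner_snoop_busrd(owner));
    std::printf("owner state: %s, reader state: %s, reader value: %d\n",
                owner.state == State::O ? "O" : "?",
                reader.state == State::S ? "S" : "?", reader.value);
    return 0;
}
```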