
Cache coherence

Cache coherence refers to the discipline that ensures all processors in a multiprocessor system maintain a consistent view of data stored in their local caches, preventing inconsistencies such as stale data from unpropagated writes. This uniformity is essential in multiprocessors, where multiple caches may hold copies of the same data block and changes to shared operands must be propagated in a timely manner; without it, processors may observe different values for the same memory location. The problem arises primarily in systems with private caches per processor: each cache reduces memory access latency but can lead to multiple inconsistent copies of shared data, and without coherence mechanisms, parallel programs relying on shared read-write data structures would produce incorrect results due to visibility issues.

Cache coherence protocols enforce key invariants, such as the single-writer-multiple-reader (SWMR) rule, under which a memory location has at most one writer or multiple readers at any time, and the data-value invariant, which ensures that the value written is visible to subsequent readers after the writer's epoch ends. These protocols operate at the granularity of cache blocks (typically 64 bytes), using states such as Modified (M), Exclusive (E), Shared (S), and Invalid (I) in variants such as MSI or MESI to track and update each block's status. Mechanisms for achieving coherence fall into hardware-based and software-based categories, with hardware approaches dominating modern implementations because of their transparency to programmers. Snooping protocols, common in bus-based systems, broadcast coherence transactions (e.g., invalidations or updates) to all caches, which "snoop" the bus and respond accordingly; examples include the write-once protocol and related schemes from the 1980s. For scalability in larger systems, directory-based protocols use a centralized or distributed directory to track cache block locations and send targeted messages, avoiding broadcast overhead; pioneering systems like Stanford DASH (1990) and SGI Origin 2000 (mid-1990s) demonstrated their viability for up to 1024 processors. Write-invalidate policies, which invalidate other copies on a write, are more common than write-update policies due to lower bandwidth needs, though hybrids exist for optimization.

Historically, cache coherence emerged with early multiprocessors in the 1970s and 1980s, building on Leslie Lamport's 1979 definition of sequential consistency, which requires operations to appear in a single total order to all processors. Seminal work includes the MOESI state model introduced by Sweazey and Smith in 1986 and surveys such as Stenström's 1990 review of coherence schemes, which classified hardware solutions as snoopy, directory-based, or network-based. Today, coherence extends to heterogeneous systems including GPUs and accelerators, with optimizations like token coherence (2003) and scoped memory models for efficient synchronization in compute-intensive workloads. These advancements ensure cache coherence remains foundational for performance and correctness in multicore and distributed architectures.

Introduction

Overview

Cache coherence is the discipline that ensures shared data stored in multiple local caches remains consistent across processors in a shared-memory multiprocessor system. This uniformity prevents discrepancies that could arise when processors independently cache copies of the same data from main memory. In symmetric multiprocessing (SMP) systems, where each processor maintains a private cache to exploit locality, coherence plays a critical role in preventing inconsistencies by synchronizing updates across all relevant copies. Without it, processors might operate on stale or divergent versions of shared data, leading to incorrect computations in parallel applications. Coherence is typically maintained at the granularity of a cache block, or line, which serves as the atomic unit for data transfers between caches and main memory. This approach enables fast local access to data while upholding global consistency, often handling over 95% of memory requests entirely within the caches, which reduces average access latency and bus traffic. A key challenge in cache coherence arises from bandwidth overhead in bus-based systems, where maintaining consistency may necessitate broadcasting updates or invalidations to all processors, potentially limiting scalability.

Historical Development

The cache coherence problem emerged in the late 1970s as multiprocessor systems began incorporating private caches per processor to improve performance, leading to inconsistencies where different processors could hold stale copies of shared data. Early parallel machines like the ILLIAC IV, operational in 1972, operated as SIMD array processors without private caches, avoiding coherence issues but highlighting the need for scalable memory access in parallel architectures. By the late 1970s, the first formal recognition of multicache coherence challenges appeared, with proposals for directory-based tracking of cache states to ensure consistency without broadcasts. Initial symmetric multiprocessors (SMPs) in the early 1980s, such as those from Encore and other early commercial vendors, often lacked hardware coherence mechanisms, relying instead on software flushes or write-through policies that sacrificed performance for correctness.

Snooping protocols were introduced in 1983 with Goodman's write-once scheme and were later standardized through the IEEE Futurebus effort, which defined a standard bus supporting cache consistency via broadcast snooping, where caches monitor bus transactions to invalidate or update local copies. This approach enabled efficient hardware enforcement of coherence in bus-based SMPs. Commercial adoption followed with systems like the Sequent Symmetry in 1987, a 20-processor i386-based SMP that implemented snooping for copy-back caches, demonstrating practical scalability for shared-memory multiprocessing. The late 1980s saw the development of the MSI (Modified, Shared, Invalid) protocol, a foundational snooping-based invalidate scheme that reduced bus traffic by distinguishing modified blocks from shared ones, as evaluated in early multiprocessor simulations. An optimization, the MESI (Modified, Exclusive, Shared, Invalid) protocol, added an exclusive state to avoid unnecessary invalidations and improve bandwidth efficiency, and was integrated into commercial processors such as the Intel Pentium in the early 1990s.

As bus-based snooping struggled to scale beyond dozens of processors, the 1990s marked a shift to directory-based protocols, exemplified by the Stanford DASH project, operational in 1991, which used distributed directories to track cache block ownership via point-to-point messages, enabling coherent shared memory for up to 64 processors with latencies around 120 cycles for remote accesses. Post-2020 developments have focused on heterogeneous and multi-socket systems, with ARM's AMBA CHI specification reaching Issue E to support scalable coherent interconnects across diverse accelerators and CPUs, incorporating features like atomic operations and flow control for chiplet-based designs. Multi-socket processors, such as the AMD EPYC 7003 series (Milan, 2021), employ directory-based coherence over Infinity Fabric links to maintain consistency across up to 64 cores per socket. Similarly, Intel Xeon Scalable processors integrate MESIF extensions with directory-based protocols for multi-socket scalability, supporting up to 8 sockets while minimizing coherence overhead in cloud and HPC environments.

Fundamentals

Cache Systems in Multiprocessors

In multiprocessor systems, particularly multi-core CPUs, each processing core typically features private L1 and L2 caches dedicated to its local computations, while higher-level caches such as L3 are shared among multiple cores to improve overall access efficiency. The private L1 cache, split into instruction and data subsets, provides the fastest access for frequently used data, whereas the L2 cache offers larger capacity for data with less immediate locality. Shared L3 caches, often serving as the last-level cache (LLC), aggregate resources across cores to reduce latency for inter-core sharing and minimize trips to main memory. The cache hierarchy in these systems is organized into multiple levels, with L1 being the smallest and fastest (typically 32-64 KB per core), L2 intermediate (256 KB to 2 MB per core), and L3 the largest (8-128 MB shared or more). Caches employ set-associative mapping, where associativity determines the number of cache lines per set to balance hit rates and lookup complexity. Cache blocks, or lines, are commonly standardized at 64 bytes to align with bus widths and typical access patterns. Multiprocessors operate under shared memory models, primarily uniform memory access (UMA) or non-uniform memory access (NUMA). In UMA systems, all processors access the shared memory with equal latency via a common interconnect, simplifying design but limiting scalability. NUMA architectures, common in larger systems, assign memory modules locally to processor nodes, resulting in faster access to local memory but higher latency for remote accesses, which influences cache design for locality optimization. Cores interact with memory hierarchically: a core first checks its private L1 cache for reads or writes; misses propagate to L2, then to the shared L3, and finally to main memory if needed. This process enables replication, where the same block can reside in multiple caches simultaneously to exploit spatial and temporal locality and reduce bus contention. Such replication enhances performance but introduces the potential for inconsistencies across caches. Inclusion policies dictate how data propagates between levels: inclusive policies require all blocks in lower-level caches (e.g., L1/L2) to also exist in the higher-level cache (e.g., L3), as seen in Intel architectures like Nehalem. Exclusive policies, conversely, prohibit overlap, allowing higher utilization of total cache capacity, an approach AMD processors often employ in their LLC designs.
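The hierarchical lookup path described above can be summarized in code. The following C++ sketch is illustrative only: the class names, the single-word blocks, and the inclusive-style fill on a miss are simplifying assumptions, not a description of any particular processor.

```cpp
#include <cstdint>
#include <iostream>
#include <optional>
#include <unordered_map>

// Illustrative model of a three-level lookup: a load probes the private L1,
// then the private L2, then the shared L3, and finally falls back to main
// memory, filling the levels on the way back. Real hardware uses tag/set
// logic, fixed capacities, and coherence actions omitted here.
struct CacheLevel {
    std::unordered_map<uint64_t, uint64_t> lines;  // block address -> data (one word per block)
    std::optional<uint64_t> lookup(uint64_t block) const {
        auto it = lines.find(block);
        return it == lines.end() ? std::nullopt : std::optional<uint64_t>(it->second);
    }
    void fill(uint64_t block, uint64_t data) { lines[block] = data; }
};

struct Core { CacheLevel l1, l2; };

uint64_t load(Core& core, CacheLevel& sharedL3,
              std::unordered_map<uint64_t, uint64_t>& memory, uint64_t block) {
    if (auto v = core.l1.lookup(block)) return *v;                        // L1 hit: fastest path
    if (auto v = core.l2.lookup(block)) { core.l1.fill(block, *v); return *v; }  // L2 hit
    if (auto v = sharedL3.lookup(block)) {                                // shared L3 hit
        core.l2.fill(block, *v);
        core.l1.fill(block, *v);
        return *v;
    }
    uint64_t v = memory[block];                                           // miss everywhere: main memory
    sharedL3.fill(block, v);                                              // inclusive-style allocation
    core.l2.fill(block, v);
    core.l1.fill(block, v);
    return v;
}

int main() {
    Core core;
    CacheLevel l3;
    std::unordered_map<uint64_t, uint64_t> memory{{0x40, 42}};
    std::cout << load(core, l3, memory, 0x40) << "\n";  // cold miss: filled from memory
    std::cout << load(core, l3, memory, 0x40) << "\n";  // L1 hit
}
```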

The Coherence Problem

In multiprocessor systems equipped with private caches, the cache coherence problem emerges when multiple copies of the same memory block are cached across processors, allowing one processor's modification to go unreflected in others' caches and thereby producing inconsistent views of the data. This issue stems from the replication of data blocks to exploit locality, but without coordination it violates the expectation that memory provides a single, uniform value for any location at any time. A fundamental scenario illustrating the problem involves two processors, P1 and P2, accessing a shared variable A initialized to 42 in main memory. P1 reads A into its cache, obtaining 42. Subsequently, P1 writes 43 to A, updating its local cache copy while leaving main memory unchanged under a write-back policy. Meanwhile, processor P2 reads A into its cache before P1's write, also obtaining 42, and later rereads A from its cache, retrieving the stale 42 instead of the updated 43. This demonstrates how P2 observes an outdated value despite P1's intervening write. To highlight the progression of cache states in this two-processor example without coherence mechanisms, consider the operations on the shared block containing A (initially 42 in memory), assuming a basic state model where blocks can be invalid (I), shared (S), or modified (M/dirty):
Operation | P1 cache state for block | P2 cache state for block | Main memory
Initial | I | I | 42
P1 reads A | S, 42 | I | 42
P2 reads A | S, 42 | S, 42 | 42
P1 writes 43 to A | M, 43 (dirty) | S, 42 | 42
P2 rereads A | M, 43 (dirty) | S, 42 (hit, stale) | 42
Here, P2's reread hits in its cache but yields the inconsistent stale value, as P1's modification remains local and unpropagated. Key types of inconsistencies include the "stale data" problem, where a read returns an obsolete value after another processor's write, and "lost updates" in write-back caches, where concurrent writes to the same block by different processors result in one update overwriting the other without visibility to the first writer. These arise particularly in systems relying on replication for performance, and the risk is amplified when writes are buffered locally before eviction to memory. Without addressing coherence, such inconsistencies manifest as race conditions in parallel programs, where execution order determines which processor's view prevails, leading to non-deterministic outcomes that undermine program reliability and reproducibility. For instance, an increment operation on a shared counter may undercount if one processor's update is lost, causing incorrect results regardless of the program's logical intent. This problem directly impacts programming models that rely on shared variables for inter-thread communication, such as OpenMP and POSIX threads (pthreads), where variables declared as shared are expected to reflect updates across all threads but can appear inconsistent due to uncoordinated caching, necessitating explicit synchronization to enforce a consistent view. In OpenMP, for example, the default shared scope for variables declared outside parallel regions assumes a coherent view, yet cache-level discrepancies can cause threads to operate on divergent copies unless barriers or atomics are used. Similarly, in pthreads, shared global variables face the same risks, and mutexes or condition variables must serialize access to mitigate coherence-induced races.
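The lost-update hazard on a shared counter can be reproduced directly at the program level. The sketch below uses C++ threads and atomics rather than the pthreads or OpenMP interfaces named above; the thread count, iteration count, and printed messages are arbitrary choices for illustration.

```cpp
#include <atomic>
#include <iostream>
#include <thread>

// Two threads increment a shared counter. The plain int version performs a
// non-atomic read-modify-write (deliberately a data race, shown only for
// illustration), so concurrent increments can be lost; the std::atomic
// version serializes the updates and always reaches the expected total.
int main() {
    constexpr int kIters = 1'000'000;

    int racy = 0;                 // unsynchronized shared counter (data race)
    std::atomic<int> safe{0};     // synchronized shared counter

    auto work = [&] {
        for (int i = 0; i < kIters; ++i) {
            ++racy;                                       // may lose updates
            safe.fetch_add(1, std::memory_order_relaxed); // never loses updates
        }
    };

    std::thread t1(work), t2(work);
    t1.join();
    t2.join();

    std::cout << "racy: " << racy << " (expected " << 2 * kIters << ", often less)\n";
    std::cout << "safe: " << safe << " (always "   << 2 * kIters << ")\n";
}
```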

Requirements and Models

Formal Definition

Cache coherence is formally defined as a protocol-level property of shared-memory multiprocessor systems that maintains a consistent view of shared data across multiple private caches, ensuring that the effects of memory operations are uniformly observable by all processors. This definition encompasses two fundamental invariants: the single-writer-multiple-reader (SWMR) invariant and the data-value invariant. The SWMR invariant stipulates that, for any shared memory location, at any logical time, either exactly one cache has permission to both read from and write to it (acting as the sole writer and reader), or multiple caches have read-only permission with no writer allowed. The data-value invariant requires that the value of a memory location at the start of any epoch matches the value established at the end of the most recent preceding read-write epoch for that location. These invariants are upheld through three essential requirements: write propagation, write serialization, and read reflection. Write propagation ensures that every write to a shared location eventually becomes visible to all other processors in the system, preventing stale data from persisting indefinitely in remote caches. Write serialization imposes a total order on all writes to the same location, such that every processor perceives these writes in the identical sequence, thereby avoiding conflicting interpretations of update history. Read reflection guarantees that any read following a write to the same location will retrieve either the value from that write or from a later write in the serialized order, ensuring reads accurately reflect the current state. A precise formal statement of cache coherence is as follows: for any memory address A, if a processor executes a write W to A followed later by a read R to A (possibly on a different processor), then R must return the value produced by W or by some write W' to A that follows W in the total serialization order of writes to A; moreover, all processors must observe the writes to A in this same total order.

Cache coherence must be distinguished from cache consistency in a single-processor environment. In uniprocessor systems, cache consistency pertains solely to synchronizing a single cache with main memory, often via mechanisms like write-through or write-back policies, so that the processor sees its own updates promptly. By contrast, coherence in multiprocessor systems extends this to coordinate multiple caches, ensuring inter-cache agreement on shared data values and preventing inconsistencies arising from concurrent access by multiple processors. Coherence is maintained at the granularity of cache lines (also called cache blocks), treating these fixed-size units, typically 64 bytes, as the atomic entities for consistency management rather than finer-grained bytes or words, in order to optimize hardware efficiency and reduce protocol overhead. Edge cases include uncached memory accesses, where data bypasses caches and interacts directly with main memory; coherence protocols must ensure these do not disrupt the invariants, often by routing such accesses through the same serialization points as cached operations. Additionally, I/O coherence addresses direct memory access (DMA) by peripherals, requiring extensions such as flush or snoop mechanisms to propagate device writes to processor caches and invalidate stale copies, maintaining system-wide consistency.
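As a concrete illustration of write serialization, the following sketch checks whether per-processor observation histories for a single address are consistent with one global order of writes. The data, function names, and simplified trace format are assumptions made for this example rather than part of any standard verification flow.

```cpp
#include <iostream>
#include <vector>

// Illustrative check of write serialization for a single address: every
// processor's observation history (the successive values it saw for the
// address) must be a subsequence of one global order of the writes to that
// address, i.e. no processor may observe an older value after a newer one,
// and no two processors may see the writes in conflicting orders.
bool isSubsequence(const std::vector<int>& observed, const std::vector<int>& writeOrder) {
    size_t pos = 0;
    for (int value : observed) {
        while (pos < writeOrder.size() && writeOrder[pos] != value) ++pos;
        if (pos == writeOrder.size()) return false;  // value seen out of order (or never written)
    }
    return true;
}

int main() {
    std::vector<int> writeOrder = {42, 43, 44};   // serialized writes to address A

    std::vector<std::vector<int>> histories = {
        {42, 43, 44},   // P1: saw every write in order            -> coherent
        {42, 44},       // P2: skipped 43 but never went backwards -> coherent
        {43, 42},       // P3: observed stale 42 after 43          -> violates serialization
    };

    for (size_t p = 0; p < histories.size(); ++p) {
        std::cout << "P" << p + 1 << ": "
                  << (isSubsequence(histories[p], writeOrder) ? "coherent" : "violation")
                  << "\n";
    }
}
```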

Consistency Models

Memory consistency models specify the permissible orderings of memory operations, reads and writes, as observed across multiple processors in a shared-memory multiprocessor system, defining a contract between hardware and software on how operations may interleave and become visible. These models ensure that parallel programs execute predictably, but they differ fundamentally from cache coherence, which focuses solely on ensuring a unique value for each memory location at any time, without prescribing ordering across different locations. Coherence provides the foundational propagation of updates (e.g., via invalidation or update protocols), enabling the implementation of various consistency models by guaranteeing that writes propagate correctly, while consistency models build atop this to enforce broader ordering guarantees for correctness in parallel programming.

The strongest such model is sequential consistency (SC), introduced by Lamport, which mandates that the results of any execution appear as if all memory operations from all processors occurred atomically in some global total order that respects the program order of each individual processor. Under SC, operations on distinct memory locations must also appear sequentially ordered, preventing reordering that could alter program semantics, such as a processor seeing its own write before another processor's intervening write. This model simplifies programming by mimicking the behavior of a uniprocessor but imposes strict constraints, often requiring full serialization of operations to maintain the global order.

Weaker models relax these constraints to enhance performance through reordering and buffering, while still leveraging coherence for value visibility. Processor consistency, proposed by Goodman, preserves per-processor program order for reads following reads (R→R) and writes following writes (W→W), but allows a read to see an old value even after a write to the same location by another processor, as long as writes establish a global order visible to future reads. This permits store buffering, where writes are delayed before becoming globally visible, reducing contention but requiring programmers to use explicit synchronization for cross-processor ordering. Release consistency, developed by Gharachorloo et al., further weakens guarantees by tying ordering to explicit synchronization events: ordinary data accesses can be reordered freely between synchronization points, while acquire operations order subsequent accesses and release operations ensure that preceding accesses have completed, allowing aggressive hardware optimizations such as delayed updates.

Hardware implements these models via coherence protocols combined with mechanisms like store buffers, load bypassing, and memory fences (barriers). For sequential consistency, coherence protocols must enforce immediate global visibility and strict ordering, often at high cost; weaker models such as processor or release consistency tolerate relaxed ordering, using fences to insert ordering points that flush buffers or stall reordering, ensuring that coherence-maintained updates respect synchronization. For example, the x86 architecture's Total Store Order (TSO) model, a processor-consistency-like model, allows stores to buffer before becoming globally visible (so a processor may read its own buffered store before other processors can see it) but preserves the order among stores and among loads, relying on coherence for write serialization and on fence instructions (e.g., MFENCE) for stronger guarantees. In contrast, ARM's weakly ordered memory model permits extensive reordering of loads and stores unless constrained by explicit barriers (e.g., DMB): coherence ensures that updates become visible, but programmers and compilers must insert barriers to achieve sequentially consistent behavior, highlighting how coherence alone does not suffice without model-specific ordering enforcement.
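The difference between relaxed hardware ordering and sequential consistency can be observed with the classic store-buffering litmus test. The C++ sketch below is a demonstration under assumed conditions (results vary by machine and compiler); it uses std::atomic memory orderings as a portable stand-in for architecture-specific fences such as MFENCE or DMB.

```cpp
#include <atomic>
#include <iostream>
#include <thread>

// Store-buffering litmus test. With relaxed atomics, store buffering can let
// both threads read 0, an outcome forbidden under sequential consistency;
// with seq_cst operations the r0 == 0 && r1 == 0 outcome cannot occur.
std::atomic<int> x{0}, y{0};
int r0 = 0, r1 = 0;

template <std::memory_order Order>
void run_once() {
    x.store(0, std::memory_order_relaxed);
    y.store(0, std::memory_order_relaxed);

    std::thread t0([] { x.store(1, Order); r0 = y.load(Order); });
    std::thread t1([] { y.store(1, Order); r1 = x.load(Order); });
    t0.join();
    t1.join();
}

int main() {
    int weak_both_zero = 0, sc_both_zero = 0;
    for (int i = 0; i < 20000; ++i) {
        run_once<std::memory_order_relaxed>();
        if (r0 == 0 && r1 == 0) ++weak_both_zero;

        run_once<std::memory_order_seq_cst>();
        if (r0 == 0 && r1 == 0) ++sc_both_zero;
    }
    std::cout << "relaxed: both-zero observed " << weak_both_zero << " times\n";
    std::cout << "seq_cst: both-zero observed " << sc_both_zero  << " times (always 0)\n";
}
```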

Core Mechanisms

Snooping Protocols

Snooping protocols maintain coherence in bus-based shared-memory multiprocessor systems by having each cache controller continuously monitor, or "snoop," all transactions on the shared bus for addresses that match lines in its cache. This monitoring allows caches to detect relevant read or write requests from other processors and respond accordingly to ensure data consistency across the system. The approach relies on a broadcast medium such as a bus, where every transaction is visible to all caches, enabling simple implementation without centralized coordination. Cache controllers in snooping systems typically track line states such as modified (the data has been altered and is the only valid copy), shared (the data is unchanged and may exist in multiple caches), and invalid (the data is no longer valid and must be fetched from memory or another cache). State transitions occur in response to bus events: for example, a processor's read miss triggers a bus read request, which another cache holding the line in the modified state snoops and services by supplying the data and transitioning to shared; a write request may invalidate shared copies in other caches to prevent stale data. These transitions ensure that no processor reads outdated data while allowing efficient local access to unmodified lines. Specific protocols like the MSI (Modified-Shared-Invalid) protocol exemplify this by defining precise rules for state changes on bus requests.

Snooping protocols operate under two primary bus strategies: write-invalidate and write-update. In write-invalidate protocols, a writing processor broadcasts an invalidate signal on the bus before updating its copy, forcing other caches to mark matching lines invalid and refetch them on future reads. This minimizes bus traffic for repeated writes by a single processor but can increase misses if sharing is frequent. In contrast, write-update protocols broadcast the new data value to all caches on a write, allowing shared copies to update in place without invalidation, which reduces misses for subsequent reads by other processors but consumes more bandwidth due to data broadcasts. Write-invalidate is more common in practice because it generates lower overall traffic in typical workloads. These protocols offer low-latency coherence enforcement in small-scale systems, as snooping enables quick interventions without directory lookups or point-to-point messaging. However, they suffer from bus bandwidth contention caused by frequent broadcasts, limiting scalability to modest numbers of processors, typically 16 to 32, beyond which traffic overwhelms the shared medium. Implementations are prevalent in uniform memory access (UMA) architectures, such as early generations of Intel Xeon processors that used the Front Side Bus (FSB) for multi-socket coherence via snooping.
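The behavioral difference between the two bus strategies can be sketched as the action a cache takes when it snoops another processor's write. The C++ model below is illustrative: the class, state names, and single-word lines are simplifications, and real controllers also handle bus arbitration and transient states.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// How a snooping cache reacts to a write observed on the shared bus under the
// write-invalidate and write-update policies described above.
enum class Policy { WriteInvalidate, WriteUpdate };
enum class State { Invalid, Shared, Modified };

struct SnoopingCache {
    struct Line { State state = State::Invalid; uint64_t data = 0; };
    std::unordered_map<uint64_t, Line> lines;

    // Called for every write transaction this cache snoops on the bus.
    void snoopWrite(uint64_t addr, uint64_t newData, Policy policy) {
        auto it = lines.find(addr);
        if (it == lines.end() || it->second.state == State::Invalid) return;  // no copy held
        if (policy == Policy::WriteInvalidate) {
            it->second.state = State::Invalid;   // drop the stale copy; refetch on next read
        } else {
            it->second.data = newData;           // write-update: patch the shared copy in place
            it->second.state = State::Shared;    // line remains shared and valid here
        }
    }
};

int main() {
    SnoopingCache c1, c2;
    c1.lines[0x40] = {State::Shared, 42};        // both caches hold the block at 0x40
    c2.lines[0x40] = {State::Shared, 42};

    // c1 writes 43; the bus broadcast is snooped by c2 under each policy.
    c2.snoopWrite(0x40, 43, Policy::WriteInvalidate);
    std::cout << "invalidate: c2 state=" << static_cast<int>(c2.lines[0x40].state) << "\n";  // Invalid (0)

    c2.lines[0x40] = {State::Shared, 42};
    c2.snoopWrite(0x40, 43, Policy::WriteUpdate);
    std::cout << "update: c2 data=" << c2.lines[0x40].data << "\n";                          // 43
}
```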

Directory-Based Protocols

Directory-based protocols address the scalability limitations of snooping protocols by maintaining a centralized or distributed directory that tracks the location and state of each cache block across the system. Unlike snooping mechanisms that rely on broadcasting requests to all caches, directory-based approaches use point-to-point communication to notify only the relevant caches, making them suitable for large-scale non-uniform memory access (NUMA) and distributed shared-memory systems. The directory typically resides with the home memory node for each block and records which caches hold copies of the block (the sharers) and in what state, such as shared, exclusive, or modified.

The core operation involves consulting the directory on a cache miss. For a read miss, the requesting cache sends a request to the home node; if the block is clean and shared, it is supplied from memory or forwarded from a sharer, and the directory is updated to include the new sharer. For a write miss, the directory identifies all current sharers and sends invalidation messages point-to-point to them, ensuring exclusive access before supplying the block to the requester. This selective notification reduces network traffic compared to broadcasts, though it introduces indirection latency as the home node coordinates responses. Distributed directories, where entries are spread across memory controllers, further enhance scalability by localizing access and avoiding a single point of contention.

Several variants optimize the directory to balance storage overhead and performance. The full bit-vector scheme uses one presence bit per cache, plus state bits (e.g., a dirty bit), allowing precise tracking of all sharers but incurring high storage costs (e.g., up to 50% overhead for 256 processors). Coarse-vector variants group processors into clusters and use one bit per group, approximating sharer locations to reduce space while tolerating some imprecision. Pointer-based schemes store a limited number of pointers (e.g., 4-8) to exact sharer locations, which is efficient when few caches share a block but requires overflow handling. When the pointer limit is exceeded, protocols may evict a random pointer (potentially causing unnecessary invalidations), fall back to broadcast, or chain additional entries, though this adds complexity and latency. These trade-offs were evaluated in early simulations showing that full bit vectors provide the best accuracy but scale poorly in storage, while limited pointers suit systems with low sharing degrees.

Directory-based protocols offer significant advantages in scalability, supporting hundreds of processors without the broadcast storms that limit snooping systems to dozens, but they suffer from higher access latency due to directory lookups and multi-message exchanges. Seminal implementations include the Stanford DASH multiprocessor (1990), which pioneered distributed full bit-vector directories for up to 64 processors, influencing later designs. In the 1990s, SGI Origin servers employed hybrid bit-vector and coarse-vector directories, scaling to 1024 processors in cc-NUMA configurations with low remote access penalties. Modern examples, such as multi-socket AMD EPYC systems, integrate directory-based coherence over Infinity Fabric links, using probe filters in the L3 cache to track inter-socket sharers efficiently in their MOESI-based protocols.
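A full bit-vector directory entry and its handling of read and write misses can be sketched as follows. The structure and message printing are illustrative assumptions (a fixed cache count, no transient states or acknowledgements), not a description of DASH or any commercial protocol.

```cpp
#include <bitset>
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Full-bit-vector directory entry: one presence bit per cache plus a dirty
// bit, as described above. Messages are printed instead of being sent.
constexpr int kNumCaches = 8;

struct DirectoryEntry {
    std::bitset<kNumCaches> sharers;  // which caches hold a copy of the block
    bool dirty = false;               // true if exactly one cache holds it modified
};

struct Directory {
    std::unordered_map<uint64_t, DirectoryEntry> entries;

    // Read miss from `requester`: record the new sharer; if the block is dirty,
    // the single owner supplies the data and downgrades to shared.
    void handleReadMiss(uint64_t block, int requester) {
        DirectoryEntry& e = entries[block];
        if (e.dirty) {
            std::cout << "forward data from owner cache " << findOwner(e) << "\n";
            e.dirty = false;                          // owner downgrades to shared
        }
        e.sharers.set(requester);
    }

    // Write miss (or upgrade) from `requester`: invalidate every other sharer
    // point-to-point, then grant exclusive (dirty) ownership to the requester.
    void handleWriteMiss(uint64_t block, int requester) {
        DirectoryEntry& e = entries[block];
        for (int c = 0; c < kNumCaches; ++c)
            if (e.sharers.test(c) && c != requester)
                std::cout << "send invalidation to cache " << c << "\n";
        e.sharers.reset();
        e.sharers.set(requester);
        e.dirty = true;
    }

    static int findOwner(const DirectoryEntry& e) {
        for (int c = 0; c < kNumCaches; ++c)
            if (e.sharers.test(c)) return c;
        return -1;
    }
};

int main() {
    Directory dir;
    dir.handleReadMiss(0x100, 0);   // cache 0 reads the block
    dir.handleReadMiss(0x100, 3);   // cache 3 reads the block
    dir.handleWriteMiss(0x100, 0);  // cache 0 writes: cache 3 is invalidated
}
```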

Specific Protocols

Invalidation-Based Protocols

Invalidation-based protocols maintain cache coherence by invalidating copies of a cache line in other caches when a processor writes to it, ensuring that only the writing cache retains a valid copy. This approach contrasts with update-based methods by avoiding the broadcast of write data, thereby conserving bandwidth in shared-bus systems. These protocols typically operate in snooping implementations, where caches monitor bus transactions to update their states accordingly. The MSI (Modified, Shared, Invalid) protocol is a foundational invalidation-based scheme using three stable states for each cache line: Invalid (I), indicating the line is uninitialized or stale; Shared (S), indicating the line is clean and possibly replicated across multiple caches; and Modified (M), indicating the line is dirty and uniquely held by one cache. Transitions occur on processor requests (PrRd for read, PrWr for write) or bus actions (BusRd for shared reads, BusRdX for exclusive reads/writes, BusUpgr for upgrades). On a read hit, the state remains unchanged whether in S or M. A read miss from I fetches the line via BusRd, transitioning to S. A write hit in S issues BusUpgr to invalidate other copies, moving to M; in M, the line stays in M without bus action. A write miss from I issues BusRdX, fetching the line and invalidating other copies to enter M. On eviction or write-back from M, the line transitions to I after supplying its dirty data via BusWr. The following table summarizes key state transitions in the MSI protocol for a snoopy bus implementation:
Event | Current state | Action/bus request | New state | Notes
PrRd (read hit) | S | None | S | Data from cache
PrRd (read hit) | M | None | M | Data from cache
PrRd (read miss) | I | BusRd | S | Fetch from memory or a sharer
PrWr (write hit) | S | BusUpgr | M | Invalidate other caches
PrWr (write hit) | M | None | M | Update cache only
PrWr (write miss) | I | BusRdX | M | Fetch, invalidate others
Eviction/write-back | M | BusWr | I | Supply dirty data to bus
These transitions ensure the single-writer-multiple-reader (SWMR) invariant, where writes serialize access. The MESI protocol extends MSI by adding an Exclusive (E) state, which represents a clean, unique copy of the line not shared with other caches, reducing bus traffic for subsequent writes. This fourth state allows silent upgrades: a write hit in E transitions to M without bus intervention, avoiding unnecessary invalidations. On a read miss from I, if no other cache holds the line, it enters E via BusRd; otherwise, it enters S. A read hit in E remains in E. On snooping a BusRd from another cache while in E, the line transitions to S without supplying data, since it is clean and memory can respond. Eviction from E to I requires no data write-back (PutE), as the data matches memory. MESI is widely used in systems such as Intel processors to optimize for common private-data access patterns. Key MESI state transitions build on MSI as follows:
Event | Current state | Action/bus request | New state | Notes
PrRd (read hit) | E | None | E | Data from cache
PrRd (read miss) | I | BusRd | E | If no sharers (clean fetch)
PrRd (read miss) | I | BusRd | S | If sharers exist
PrWr (write hit) | E | None | M | Silent upgrade
PrWr (write hit) | S | BusUpgr | M | Invalidate others
Snooped BusRd | E | None | S | Downgrade to shared
Eviction | E | None (PutE) | I | No data write-back
This extension minimizes coherence overhead by distinguishing clean exclusive accesses. The MOESI variant further refines MESI by introducing an Owned (O) state, which allows a modified line to be shared with other caches while retaining ownership for servicing future requests, without immediately updating memory. This is particularly beneficial in hierarchical systems with a last-level cache (LLC), as it reduces write-backs to the LLC. In MOESI, a read hit in O remains in O, with the owner supplying data to requesters if needed. A write hit in O must first invalidate the other shared copies (e.g., via BusUpgr) before transitioning to M. On a shared read (BusRd) while in M, the line transitions to O, supplying the dirty data without updating memory. The O state enables efficient handling of producer-consumer patterns. MOESI is employed in AMD systems, such as Opteron processors, to optimize multi-core scalability. MOESI transitions extend MESI with O-specific behavior:
Event | Current state | Action/bus request | New state | Notes
PrRd (read hit) | O | None | O | Data from cache
PrWr (write hit) | O | BusUpgr | M | Invalidate other sharers before writing
Snooped BusRd | M | BusWr (supply data) | O | Share dirty data, retain ownership
Snooped BusRd | O | BusWr (supply data) | O | Service request from owner
PrRd (read miss) | I | BusRd | S | Owner supplies dirty data if present
These states support both bus-based and directory-based realizations. Optimizations in these protocols often favor write-back policies over write-through: writes update the cache immediately, and dirty lines are written to memory only on eviction, reducing bus traffic compared to write-through's update of memory on every store. Write-back is standard in MSI, MESI, and MOESI to handle modified data efficiently. To mitigate false sharing, where unrelated data in the same line causes unnecessary invalidations, techniques like sub-block invalidation or finer-grained coherence track ownership at smaller granularities, though they increase complexity. Transient states (e.g., intermediate states for pending upgrades or fills) further optimize implementations by allowing non-stalling operation while bus transactions are in flight.
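The stable-state MSI transitions tabulated above can be expressed directly as a transition function. The sketch below is a simplified model: it covers only stable states, treats a snooped BusUpgr like a BusRdX, and omits the transient states and data movement a real controller requires.

```cpp
#include <iostream>

// MSI transition function: given the current state of a line and the event
// seen by its cache controller, return the next state and the bus request
// (if any) that the controller must issue.
enum class State { I, S, M };
enum class Event { PrRd, PrWr, BusRd, BusRdX, Evict };
enum class BusReq { None, BusRd, BusRdX, BusUpgr, BusWr };

struct Transition { State next; BusReq request; };

Transition msi(State s, Event e) {
    switch (s) {
    case State::I:
        if (e == Event::PrRd)  return {State::S, BusReq::BusRd};    // read miss
        if (e == Event::PrWr)  return {State::M, BusReq::BusRdX};   // write miss
        return {State::I, BusReq::None};                            // snoops ignored
    case State::S:
        if (e == Event::PrRd)   return {State::S, BusReq::None};    // read hit
        if (e == Event::PrWr)   return {State::M, BusReq::BusUpgr}; // upgrade, invalidate others
        if (e == Event::BusRdX) return {State::I, BusReq::None};    // another cache writes
        if (e == Event::Evict)  return {State::I, BusReq::None};    // clean eviction
        return {State::S, BusReq::None};                            // snooped BusRd: stay shared
    case State::M:
        if (e == Event::PrRd || e == Event::PrWr)
            return {State::M, BusReq::None};                        // hits, no bus traffic
        if (e == Event::BusRd)  return {State::S, BusReq::BusWr};   // supply dirty data, downgrade
        if (e == Event::BusRdX) return {State::I, BusReq::BusWr};   // supply data, then invalidate
        return {State::I, BusReq::BusWr};                           // eviction: write back
    }
    return {State::I, BusReq::None};  // unreachable
}

int main() {
    Transition t = msi(State::S, Event::PrWr);
    std::cout << "S + PrWr -> state " << static_cast<int>(t.next)
              << ", bus request " << static_cast<int>(t.request) << "\n";  // M, BusUpgr
}
```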

Update-Based Protocols

Update-based protocols, also known as write-update protocols, maintain cache coherence by propagating each write directly to all caches that hold a copy of the data block, thereby keeping shared copies consistent without invalidating them. This approach contrasts with invalidation-based protocols, where copies are invalidated upon a write, forcing subsequent reads to fetch the updated data from memory or the writer's cache. A seminal example of an update-based protocol is the Dragon protocol, developed for the Xerox PARC Dragon multiprocessor in the 1980s. In the Dragon protocol, caches maintain states such as exclusive (for private modified data), shared clean, and shared modified to facilitate update propagation; when a processor writes to a block, it broadcasts the update to all caches holding the block in shared states, updating their copies while deferring memory writes until eviction. This design combines update broadcasts for shared blocks with copy-back behavior for exclusive ones, using bus transactions like BusUpdt to push changes selectively. Update-based protocols offer advantages in workloads where reads frequently follow writes by other processors, as they reduce read misses by keeping remote caches up to date. However, they incur high bandwidth costs for broadcasting updates, particularly in systems with many sharers or frequent writes, leading to inefficiencies when updates propagate to caches that do not immediately need the data. Additionally, the granularity mismatch between cache lines and synchronization units can result in unnecessary updates or false-sharing overhead. To mitigate these drawbacks, hybrid approaches combine update and invalidation mechanisms, selectively applying updates based on sharing patterns. The Firefly protocol, implemented in the DEC Firefly multiprocessor workstation, exemplifies this by using copy-back updates for private blocks and broadcast updates for shared ones, while allowing multiple caches to hold a block simultaneously. This design reduces bus traffic compared to pure update schemes but still uses invalidation when exclusive access is needed. In modern systems, pure update-based protocols are rare due to their bandwidth demands in large-scale multiprocessors, though variants persist in niche scenarios such as certain GPU coherence schemes where read-heavy sharing patterns benefit from proactive updates. For instance, proposals for GPU architectures explore update mechanisms to minimize coherence overhead in heterogeneous CPU-GPU environments with fine-grained sharing.

Advanced Aspects

Hierarchical and Inclusive Coherence

In the multi-level cache hierarchies common in modern multiprocessors, coherence is achieved through protocols that propagate updates and requests across cache levels, from private per-core L1 caches to shared L2 and L3 caches. When a core modifies a cache line in its L1 cache, the protocol triggers actions such as invalidations or updates toward higher-level caches (L2/L3) to maintain the single-writer-multiple-reader (SWMR) invariant, ensuring all caches observe a consistent view of the data. This often uses directory-based mechanisms at higher levels, where shared caches can filter unnecessary snoop requests to the L1 if the line is absent, reducing intra-chip traffic. For example, in systems like the IBM Power5, ring-based interconnects facilitate ordered coherence traffic within a chip, while global directories track states across levels.

Cache inclusion policies dictate whether data in lower-level caches (e.g., L1) must also reside in higher-level caches (e.g., L2/L3), directly influencing coherence efficiency and effective capacity. In an inclusive policy, higher-level caches contain all blocks present in lower-level caches, simplifying coherence by allowing the last-level cache (LLC) to serve as a central point for tracking global states without needing explicit directories for lower levels. This approach, used in Intel processors such as Nehalem and client Skylake parts (where the L3 LLC is inclusive of L1 and L2), reduces broadcast traffic and eases protocol implementation but incurs redundancy, lowering effective capacity and potentially causing inclusion victims, that is, evictions from lower caches forced by LLC replacements. Performance studies show inclusive designs can underperform non-inclusive ones by up to 12% in capacity-sensitive workloads. Conversely, an exclusive policy prohibits overlap, ensuring a cache line resides in only one level at a time, which maximizes aggregate capacity by eliminating duplication and can reduce conflict misses. AMD's Zen architectures employ mostly exclusive L3 caches, enabling higher effective storage but complicating coherence: victim lines from lower caches must be explicitly inserted into higher levels, and upgrades (e.g., from exclusive to modified state) require additional protocol actions like silent evictions. This increases design complexity and inter-level data movement, particularly in shared-memory systems. Many modern CPUs adopt a non-inclusive/non-exclusive (NI/NE) policy for flexibility, where higher-level caches may or may not hold lower-level data, avoiding strict inclusion's overhead while permitting some overlap for optimization. For instance, Intel's L2 caches in Nehalem and Skylake are non-inclusive of the L1, reducing back-invalidations and improving adaptability to workloads, though this necessitates auxiliary structures such as snoop filters or directories for tracking. This hybrid approach has become a default in recent designs, balancing capacity and protocol simplicity.

Coherence in multi-chip modules, such as multi-socket systems, extends these hierarchies across chips, posing challenges in latency and bandwidth for propagating updates between sockets. Intel's QuickPath Interconnect (QPI) and its successor Ultra Path Interconnect (UPI) handle inter-socket coherence traffic by forwarding requests and responses over point-to-point links, but limited link bandwidth (e.g., in 4-socket configurations) can bottleneck performance, with coherence traffic consuming a significant share of link capacity. Similarly, AMD's Infinity Fabric enables coherent access across chiplets and sockets in multi-chip packages, supporting memory sharing via a scalable protocol, yet it faces contention in high-traffic scenarios like GPU-CPU interactions. These interconnects require hybrid protocols, snooping intra-chip and directory-based inter-chip, to maintain coherence without a single inclusive LLC, amplifying validation complexity due to distributed states.
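The eviction-time consequences of the inclusion policies discussed above can be illustrated with a small model. The code below is a sketch under simplifying assumptions (sets of block addresses stand in for caches, and only the inclusion bookkeeping is shown), not a depiction of any specific product's policy.

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_set>

// Contrast of the eviction behaviour implied by inclusive and exclusive
// hierarchies. Sets of block addresses stand in for the caches.
struct Hierarchy {
    std::unordered_set<uint64_t> l1;   // private core cache
    std::unordered_set<uint64_t> llc;  // shared last-level cache
};

// Inclusive LLC: evicting a block from the LLC must back-invalidate any copy
// in the core caches, so the inclusion property (L1 is a subset of LLC) holds.
void evictFromInclusiveLLC(Hierarchy& h, uint64_t block) {
    h.llc.erase(block);
    if (h.l1.erase(block))
        std::cout << "back-invalidated block 0x" << std::hex << block << std::dec << " in L1\n";
}

// Exclusive LLC: a block lives in only one level, so an L1 victim is inserted
// into the LLC instead of being duplicated there while it is resident in L1.
void evictFromL1Exclusive(Hierarchy& h, uint64_t block) {
    if (h.l1.erase(block)) {
        h.llc.insert(block);   // victim fill: the LLC acts as a victim cache
        std::cout << "moved block 0x" << std::hex << block << std::dec << " from L1 to LLC\n";
    }
}

int main() {
    Hierarchy inclusive{{0x80}, {0x80, 0xC0}};   // L1 contents duplicated in the LLC
    evictFromInclusiveLLC(inclusive, 0x80);

    Hierarchy exclusive{{0x80}, {0xC0}};         // no overlap between levels
    evictFromL1Exclusive(exclusive, 0x80);
}
```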

Scalability Challenges

Cache coherence protocols introduce significant performance overheads in large-scale systems, primarily from protocol traffic and directory access latencies. In multiprocessor systems, coherence traffic can consume 30-50% of the interconnect bandwidth, particularly in workloads with high degrees of sharing, as observed in simulations and benchmarks where coherence messages dominate communication and reduce effective bandwidth to below 70% of peak. Directory-based protocols, while scalable, incur additional latency due to directory lookups and pointer traversals, with cross-socket accesses adding 3-4 times the delay of intra-socket operations in modern NUMA systems like AMD EPYC.

To mitigate these overheads, several optimizations have been developed. Victim forwarding in snooping protocols allows a cache evicting a dirty line to supply it directly to a requesting cache, reducing unnecessary memory accesses or bus interventions and lowering traffic by up to 20% in sharing-heavy workloads. Speculative techniques, such as predictive invalidations based on access patterns, enable proactive resolution of conflicts, decreasing average latency by 15-25% in adaptive protocols. Coherence-aware prefetching further helps by anticipating sharing events and preloading directory entries or data, which can reduce miss penalties in prefetch-friendly benchmarks by 10-30%.

Scalability limits become evident as core counts increase. Snooping protocols, reliant on broadcasts, fail to scale beyond a few dozen cores due to excessive bandwidth demands and contention, leading to quadratic traffic growth. Directory-based approaches extend to thousands of cores but introduce NUMA penalties, where remote directory accesses inflate latencies by factors of 2-5, impacting performance in distributed workloads by 12-38% as measured in SPEC CPU benchmarks. Metrics such as coherence miss rates and traffic volume highlight these issues, with SPEC benchmarks showing coherence overheads contributing up to a 38% slowdown in memory-intensive cases.

In modern extensions, cache coherence adapts to heterogeneous and distributed environments. For GPUs, NVIDIA's NVLink interconnect supports hardware cache coherence, enabling unified memory access between CPUs and GPUs with low-latency atomic operations and efficient cache-line transfers, scaling to multi-GPU setups while maintaining consistency. In CPU-GPU heterogeneous systems, protocols like Heterogeneous System Coherence (HSC) address bandwidth mismatches by using region-based coherence, reducing coherence requests by 94% and achieving 2x average gains over traditional block-level schemes. Selective caching in GPUs eliminates full hardware coherence overheads by caching only data that does not require coherence actions, rivaling hardware-coherent designs in performance for integrated systems. For distributed shared memory (DSM) in data centers, in-network coherence such as CONCORDIA leverages programmable switches for scalable coherence management, minimizing latency in disaggregated setups and supporting thousands of nodes with reduced overhead. As of 2025, advancements include protocols like the Shared-Exclusive Latch-based Cache Coherence (SELCC) protocol for disaggregated memory over CXL, enabling efficient coherence in cloud-scale systems.
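One common, measurable source of coherence traffic is false sharing, mentioned earlier as a target of mitigation techniques. The sketch below contrasts two counters that share a cache line with counters padded onto separate lines; the timing difference is machine-dependent, and the 64-byte alignment is an assumption matching common line sizes.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// Two counters on the same 64-byte line ping-pong between the cores' caches
// (false sharing); padding them onto separate lines removes the unnecessary
// invalidations. This is a demonstration, not a benchmark methodology.
struct SharedLine  { std::atomic<long> a{0}, b{0}; };                 // same cache line
struct PaddedLines { alignas(64) std::atomic<long> a{0};
                     alignas(64) std::atomic<long> b{0}; };           // separate lines

template <typename Counters>
double run(Counters& c) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (long i = 0; i < 10'000'000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < 10'000'000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    SharedLine shared;
    PaddedLines padded;
    std::cout << "false sharing: " << run(shared) << " s\n";   // typically slower
    std::cout << "padded layout: " << run(padded) << " s\n";   // typically faster
}
```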

References

  1. A Primer on Memory Consistency and Cache Coherence, Second Edition (Synthesis Lectures on Computer Architecture).
  2. A survey of cache coherence schemes for multiprocessors (MIT).
  3. FOS: a low-power cache organization for multicores.
  4. Improving Real-Time Performance by Utilizing Cache Allocation Technology (Intel).
  5. Why On-Chip Cache Coherence Is Here to Stay.
  6.
  7. A Practical Cache Partitioning Method for Multi-Core Processor on a ...
  8. Optimizing Applications for NUMA (Intel).
  9. A case for uniform memory access multiprocessors.
  10. Performance Analysis of Cache Coherence Protocols for Multi-core ...
  11. The impact of cache inclusion policies on cache management techniques.
  12. How to Make a Correct Multiprocess Program Execute Correctly on ...
  13. Using cache memory to reduce processor-memory traffic. James R. Goodman.
  14.
  15. Formal Verification and its Impact on the Snooping versus Directory ...
  16. Practical Cache Coherence. Yizhou Shan's Home Page.
  17. An evaluation of directory schemes for cache coherence.
  18. Directory-Based Cache Coherence.
  19. The directory-based cache coherence protocol for the DASH ...
  20. A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus. Paul Sweazey and Alan Jay Smith.
  21. Update-Based Cache Coherence Protocols for ...
  22. Cache Coherence Protocols: Evaluation Using a Multiprocessor ...
  23. Firefly: a multiprocessor workstation (Bitsavers.org).
  24. Invalidate or Update? Revisiting Coherence for Tomorrow's Cache ...
  25. Memory Consistency Model-Aware Cache Coherence for ...
  26. The impact of cache inclusion policies on cache management ...
  27. Achieving Non-Inclusive Cache Performance with Inclusive Caches.
  28. Understanding Data Movement in AMD Multi-GPU Systems ... (arXiv).
  29. Demystifying Cache Coherency in Modern Multiprocessor Systems.
  30. Memory Forwarding: Enabling Aggressive Layout Optimizations by ...
  31. The Locality-Aware Adaptive Cache Coherence Protocol.
  32. A Survey of Recent Prefetching Techniques for Processor Caches.
  33. A Quantitative Analysis of the Performance and Scalability of ...
  34. NVLink - Nvidia (WikiChip).
  35. Heterogeneous System Coherence for Integrated CPU-GPU Systems.
  36. Selective GPU Caches to Eliminate CPU–GPU HW Cache Coherence.
  37. Distributed Shared Memory with In-Network Cache Coherence.