
Memory coherence

Memory coherence is a fundamental property in shared-memory multiprocessor systems, ensuring that all processors maintain a consistent view of memory data by guaranteeing that a read operation returns the value of the most recent write to that location, as determined by a hypothetical serial order across all processors. This coherence is typically achieved through cache coherence protocols, which manage multiple copies of data in private caches to enforce invariants such as the Single-Writer-Multiple-Reader (SWMR) rule—where only one cache can modify a data block at a time—and the Data-Value Invariant, ensuring all readers see the same value after a write. While closely related, memory coherence differs from memory consistency models, which define the global ordering of operations across different memory locations (e.g., sequential consistency preserves program order for all loads and stores, whereas Total Store Order permits store-to-load reordering for performance); coherence focuses on per-location value correctness and serves as a prerequisite for consistency.

In practice, cache coherence protocols are categorized into snooping-based and directory-based approaches. Snooping protocols, suitable for bus-connected systems with a small number of processors, rely on each cache controller monitoring (or "snooping") bus transactions to maintain cache-line states; common variants include MSI (Modified, Shared, Invalid), which handles basic write-invalidation, and its optimizations like MESI (adding Exclusive to reduce traffic) and MOESI (adding Owned for better write-back efficiency). Directory-based protocols scale to larger systems by using a centralized or distributed directory to track cache locations of data blocks and route coherence messages as unicasts rather than broadcasts, as seen in systems like the SGI Origin 2000, which supported up to 1024 processors. Emerging paradigms, such as temporal coherence with leases or self-invalidation, further optimize coherence for relaxed consistency models in GPUs and heterogeneous systems, where hardware such as NVIDIA's GPU architectures integrates coherence support for GPGPU workloads.

The concept traces back to early shared-memory multiprocessors in the 1970s and 1980s, with foundational work on sequential consistency by Leslie Lamport in 1979 establishing the need for ordered memory operations, and subsequent innovations like weak ordering (1986) and release consistency (1990) balancing correctness with performance gains from relaxed models such as x86's Total Store Order. Today, memory coherence remains essential for scalable parallelism in multicore processors, preventing issues like stale data or race conditions, though challenges in verification—addressed by formal tools like Murphi and TLA+—persist as core counts grow.

Fundamentals

Definition and Core Principles

Memory coherence is a fundamental property in shared-memory multiprocessor systems that ensures all processors maintain a consistent and unified view of memory contents, such that any write operation to a location becomes visible to all subsequent reads across processors in a well-defined order. This property makes caches functionally transparent, propagating writes from one processor's cache to others to avoid discrepancies in data values observed by different processors. The core principles underpinning memory coherence include the single-writer-multiple-reader (SWMR) invariant, which ensures that only one cache can modify a data block at a time while multiple caches can read it, and the Data-Value Invariant, which guarantees that all reads return the value from the most recent write in a total serialization order of operations. A related visibility rule ensures that a write to a memory location eventually becomes visible to all processors, so that reads return the most recent write value in a serialized order. In multi-level cache hierarchies, the inclusion property—where the outer caches (e.g., L2 or L3) contain a superset of the blocks held in the caches closer to the processor—can simplify coherence management by ensuring actions at the outer levels propagate inward, though not all systems enforce strict inclusion. For instance, in a two-processor system, if processor A writes a value to memory location X, processor B's subsequent read must observe that updated value before B can perform its own write to X, ensuring serialized writes and consistent visibility across the system. Memory coherence emerged as a design concern in the early 1980s amid the development of shared-memory multiprocessors, notably through projects like the IBM RP3, initiated in 1983, which addressed coherence challenges in scalable parallel architectures using software-assisted hardware mechanisms.
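The two invariants can be pictured as a per-location timeline of "epochs," in the style of textbook treatments of coherence. The sketch below is illustrative only and not drawn from any particular protocol; the types and the function data_value_invariant_holds are hypothetical names chosen for this example.

```cpp
// Minimal sketch: the SWMR invariant viewed as a per-location timeline of epochs.
// At any time a location is either in a single-writer epoch (one cache may read and
// write) or a multi-reader epoch (any number of caches may read). The data-value
// invariant says each epoch starts with the value left by the last writer epoch.
#include <cstdint>
#include <vector>

enum class EpochKind { SingleWriter, MultiReader };

struct Epoch {
    EpochKind kind;
    int writer;                  // meaningful only for SingleWriter epochs
    std::uint64_t value_at_end;  // value of the location when the epoch closes
};

// Data-value invariant: the value observed throughout a read-only epoch must equal
// the value left behind by the most recent single-writer epoch (or the initial value).
bool data_value_invariant_holds(const std::vector<Epoch>& history,
                                std::uint64_t initial_value) {
    std::uint64_t expected = initial_value;
    for (const Epoch& e : history) {
        if (e.kind == EpochKind::MultiReader) {
            if (e.value_at_end != expected) return false;  // readers saw a stale value
        } else {
            expected = e.value_at_end;  // the single writer may change the value
        }
    }
    return true;
}
```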

Role in Multiprocessor Architectures

In multiprocessor systems, memory coherence is essential to prevent inconsistencies that arise when multiple processors maintain private caches of shared memory locations. Without coherence mechanisms, a processor might operate on stale data from its local cache, leading to errors in program execution when one processor's write must be visible to others accessing the same location. Coherence protocols enforce invariants such as single-writer-multiple-reader (SWMR), ensuring that all processors perceive a consistent view of memory despite local caching. Memory coherence integrates seamlessly into symmetric multiprocessor (SMP) architectures, which provide uniform memory access (UMA) through a shared interconnect like a bus, allowing all processors to access any memory location with equal latency. In such systems, coherence typically relies on snooping protocols in which caches monitor shared bus traffic to maintain consistency across local caches. In contrast, non-uniform memory access (NUMA) architectures distribute memory across nodes, introducing varying access latencies, and extend coherence to span both local and remote caches using mechanisms like directories to track shared data locations efficiently. This integration preserves the abstraction of a single shared address space while accommodating physically distributed memory. The primary benefit of memory coherence in these architectures is enabling efficient parallel execution by permitting private caches to reduce latency and bandwidth demands on the shared interconnect, all while upholding global data consistency. For instance, in early bus-based systems like the Sequent Symmetry introduced in 1987, coherence protocols minimized bus traffic by allowing write-back caching and invalidation-based updates, supporting a moderate number of processors with improved throughput for multiuser workloads compared to uncached designs. This approach not only avoids the pitfalls of inconsistent views but also scales parallel program performance without requiring programmers to manage data placement explicitly.

Coherence Models

While memory coherence ensures that all processors see a consistent value for each memory location, the models discussed in this section are memory consistency models that define the ordering of operations across different locations and rely on underlying coherence protocols for per-location correctness.

Sequential Consistency Model

Sequential consistency, proposed by Leslie Lamport in 1979, is a memory consistency model that ensures all memory operations appear to execute instantaneously in some global order consistent with the program order on each processor. This model provides a straightforward semantic guarantee for multiprocessor systems, making it intuitive for programmers by mimicking the behavior of a uniprocessor executing operations sequentially. The rules of sequential consistency dictate that every read must return the value of the most recent write according to the global order, and that operations from each processor are never reordered relative to its program order. Formally, as defined by Lamport, a multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some sequential order, with the operations of each individual processor appearing in this sequence in the order specified by its program. For instance, if a write W1 to location A precedes a read R1 of A in the global order with no intervening write to A, sequential consistency requires R1 to return the value written by W1, and any operation R2 that follows W1 in its processor's program order must also follow it in the global order. This model offers a strong guarantee against unexpected reorderings, enabling correct execution of shared-memory programs without subtle race conditions. For example, in the store-buffering (Dekker-style) idiom where each of two threads writes its own flag and then reads the other's, sequential consistency guarantees that at least one thread observes the other's write, as sketched in the code below. However, achieving sequential consistency imposes significant hardware overhead, as it restricts common optimizations like operation reordering and write buffering, often requiring strict serialization that limits parallelism and increases latency.
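The store-buffering idiom can be written directly with C++ atomics, which request sequentially consistent ordering by default. The following sketch assumes a conforming C++11 (or later) compiler and standard library; under memory_order_seq_cst the outcome r1 == 0 && r2 == 0 is forbidden, whereas weaker orderings would permit it.

```cpp
// Store-buffering (Dekker-style) litmus test: under sequential consistency the
// outcome r1 == 0 && r2 == 0 is impossible, because some interleaving must place
// one of the stores before both loads.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() {
    x.store(1, std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_seq_cst);
}

void thread2() {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    // With seq_cst operations, at least one thread must observe the other's write.
    assert(r1 == 1 || r2 == 1);
    return 0;
}
```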

Relaxed Consistency Models

Relaxed consistency models relax the ordering constraints imposed by sequential consistency to enable greater concurrency and performance in multiprocessor systems, allowing hardware to reorder, buffer, or overlap memory operations while preserving enough guarantees for correct program execution. These models recognize that most programs rely on explicit synchronization for correctness, permitting optimizations that would violate the total global order of sequential consistency but align with typical usage patterns. By categorizing accesses based on their synchronization context, relaxed models reduce unnecessary coherence traffic and improve performance, particularly in systems with write buffers or non-blocking caches. Key variants include processor consistency, weak ordering, and release consistency, each offering progressive relaxations tailored to different implementation trade-offs. In contrast to sequential consistency, which serves as the strict baseline requiring all operations to appear atomically and in a single global order, these models shift the burden to software to enforce critical orderings.

Processor consistency guarantees that each processor observes its own memory operations in program order, but allows writes from different processors to become visible to other processors in varying orders, without a single global order across all writes. This non-atomic visibility of remote writes enables effective use of write buffers without strict serialization, reducing latency while maintaining per-processor ordering for reads and writes. The model preserves each processor's view of its own program order but relaxes inter-processor write serialization, making it suitable for systems where synchronization primitives handle cross-processor dependencies. First described by Goodman, processor consistency provides a modest relaxation that supports efficient caching while avoiding many anomalies in unsynchronized code.

Weak ordering further relaxes constraints by partitioning memory operations into synchronized and non-synchronized categories, permitting arbitrary reordering, buffering, or overlapping among non-synchronized accesses but enforcing strict ordering around synchronization operations like locks or barriers. Non-synchronized accesses, which include most ordinary loads and stores not tied to synchronization, can be freely reordered relative to each other and even to some synchronized accesses, as long as synchronization points establish a consistent order around critical sections. This approach, originally defined by Dubois, Scheurich, and Briggs and later formalized by Adve and Hill, allows hardware to aggressively optimize common-case operations while relying on programmer-inserted synchronization to delimit regions where ordering matters, thereby enhancing pipelining and reducing stalls in large multiprocessors.

Release consistency builds on weak ordering by explicitly distinguishing acquire (e.g., lock acquisition) and release (e.g., lock release) synchronization points, treating ordinary memory accesses as potentially unordered except when bound to these events. Under this model, writes need only become visible to subsequent acquires on the same synchronization variable, enabling lazy update propagation and decoupling data coherence from strict operation ordering. A key optimization, lazy release consistency (LRC), defers the delivery of released updates until an acquire explicitly requests them, minimizing unnecessary invalidations and communication in distributed settings by using diff-based tracking of changes since the last synchronization. Release consistency was formulated by Gharachorloo and colleagues, and the lazy variant by Keleher, Cox, and Zwaenepoel; both support high-performance implementations by aligning coherence with program synchronization semantics rather than assuming all accesses are critical.
These relaxed models offer significant trade-offs: they enhance performance by reducing global ordering overhead—significantly reducing coherence traffic in benchmark workloads—and allow better resource utilization in pipelined architectures, but demand that programmers explicitly manage ordering through fences, barriers, or synchronization instructions to avoid subtle bugs from unexpected reorderings, as illustrated in the sketch below.
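One way these ideas surface to programmers is through acquire/release operations, as in C++'s std::memory_order_acquire and std::memory_order_release. The sketch below is a minimal message-passing example, not taken from the source: the releasing store publishes the ordinary data write, and the acquiring load imports it, while unrelated accesses remain free to be reordered.

```cpp
// Message-passing idiom under acquire/release ordering: the data write is ordered
// before the releasing store of the flag, and the acquiring load of the flag orders
// the subsequent data read, so the consumer never sees flag == true with stale data.
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                    // ordinary (non-synchronizing) shared data
std::atomic<bool> flag{false};   // synchronization variable

void producer() {
    data = 42;                                     // ordinary write
    flag.store(true, std::memory_order_release);   // release: publish prior writes
}

void consumer() {
    while (!flag.load(std::memory_order_acquire)) {}  // acquire: import writes
    assert(data == 42);                               // guaranteed visible
}

int main() {
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
    return 0;
}
```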

Cache Coherence Protocols

Snoopy-Based Protocols

Snoopy-based protocols maintain cache coherence in shared-bus multiprocessor systems by having each cache controller monitor all bus transactions initiated by other processors, a process known as snooping. This allows caches to invalidate or update their local copies of shared data blocks as needed to ensure consistency without centralized coordination. The protocols rely on a broadcast medium like a bus, where every transaction is visible to all caches, enabling simple hardware implementation through finite-state machines that track the state of individual cache lines. The foundational MSI protocol employs a three-state model for each cache line: Modified (M), indicating the line has been locally updated and is the sole valid copy; Shared (S), meaning the line is clean and potentially present in multiple caches; and Invalid (I), denoting the line is unusable and must be fetched from memory or another cache. In operation, a processor's read miss triggers a bus read request; if another cache holds the line in M state, it supplies the data and transitions to S, while the requesting cache enters S. A write miss, however, requires the cache to issue a bus invalidate (or read-exclusive) request, forcing all other copies to the I state before the write proceeds and transitioning the local line to M. This invalidate-based approach enforces a single writer among multiple possible readers, supporting consistency models such as sequential consistency by ensuring writes are propagated appropriately; a sketch of these transitions as a per-line state machine appears below. Extensions to MSI address inefficiencies in state transitions and bus traffic. The MESI protocol introduces an Exclusive (E) state, a clean version of M in which the line is the unique copy, allowing the cache to silently discard it on eviction without bus notification. The MOESI protocol further adds an Owned (O) state, in which the cache holds a potentially modified line but permits other caches to read it without an immediate write-back to memory, optimizing data sharing in systems such as AMD's Opteron processors. These protocols excel in simplicity and low overhead for small-scale systems with up to 8–16 processors, as the broadcast nature avoids complex directory structures while leveraging the bus as a natural serialization point. They were employed in early multiprocessor implementations, such as Intel's Pentium-series processors for on-chip cache coherence and Sun's bus-based symmetric multiprocessing servers. However, as processor counts grow, bus contention from frequent broadcasts degrades performance, restricting scalability beyond modest configurations.
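The MSI transitions described above can be condensed into a per-line finite-state machine. The following sketch is a deliberately simplified model (event and state names are chosen for this example, and transient states and bus arbitration are omitted), intended only to show the shape of a snooping controller.

```cpp
// Simplified per-line MSI state machine for one cache in a snooping protocol.
// Bus actions the cache would need to issue (BusRd, BusRdX, flush/write-back) are
// noted in comments; a real controller also stalls in transient states.
enum class State { Modified, Shared, Invalid };
enum class Event {
    PrRd,    // local processor read
    PrWr,    // local processor write
    BusRd,   // another cache's read observed on the bus
    BusRdX   // another cache's read-exclusive/invalidate observed on the bus
};

State next_state(State s, Event e) {
    switch (s) {
        case State::Invalid:
            if (e == Event::PrRd)   return State::Shared;    // issue BusRd, fetch line
            if (e == Event::PrWr)   return State::Modified;  // issue BusRdX, gain ownership
            return State::Invalid;                           // snooped traffic: stay Invalid
        case State::Shared:
            if (e == Event::PrWr)   return State::Modified;  // issue BusRdX/upgrade
            if (e == Event::BusRdX) return State::Invalid;   // another writer: invalidate
            return State::Shared;                            // PrRd or BusRd: stay Shared
        case State::Modified:
            if (e == Event::BusRd)  return State::Shared;    // supply data, write back
            if (e == Event::BusRdX) return State::Invalid;   // supply data, then invalidate
            return State::Modified;                          // local hits stay Modified
    }
    return State::Invalid;  // unreachable; silences compiler warnings
}
```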

Directory-Based Protocols

Directory-based protocols maintain cache coherence in multiprocessor systems by using a directory to track the state of each memory block across caches, avoiding the broadcast overhead inherent in snooping protocols designed for small-scale systems. The directory, which can be centralized or distributed, records information such as whether a block is uncached, exclusively cached by one processor, or shared among multiple processors, along with the identities of the caching nodes. When a processor requests a block, it sends a point-to-point message to the block's home node; the directory then responds by granting permission, invalidating copies in other caches if necessary, or supplying data from a sharer, ensuring coherence through targeted interventions rather than system-wide broadcasts. A full-map directory employs a bit vector for each memory block, where each bit corresponds to a specific processor or cache, indicating whether that cache holds a copy of the block; a sketch of such an entry appears below. This approach precisely tracks all sharers, supporting scalability to hundreds of processors by allowing efficient point-to-point invalidations or data supplies on write or read misses. However, the storage overhead is significant, requiring approximately one bit per processor per memory block, which can consume substantial memory in large systems. To mitigate the memory demands of full-map directories, coarse directories use compressed representations such as limited pointers or bit-vector summaries that group multiple processors. In a limited-pointer scheme, the directory allocates a fixed number of pointers (e.g., 2–4) to explicitly list sharers; if more sharers exist, it may overflow to a coarse indicator like "all caches" or a hierarchical summary, introducing minor imprecision but reducing per-block storage from linear to roughly logarithmic in the number of processors. Hierarchical directories further enhance this for non-uniform memory access (NUMA) systems by organizing directories in a tree, where local nodes handle intra-cluster sharing and higher levels manage inter-cluster coherence, balancing storage and access latency. Early implementations of directory-based protocols include the DASH (Directory Architecture for SHared memory) multiprocessor developed at Stanford University around 1990, which used a distributed directory across processing nodes to enforce a directory-based invalidation protocol supporting up to 64 processors. In DASH, each node maintains a portion of the directory for its local memory, and coherence actions propagate via point-to-point messages over a scalable interconnect, demonstrating effective handling of shared-memory workloads without bus contention. A later, larger-scale example is the SGI Origin 2000 server from the late 1990s, which employed a directory-based protocol in a ccNUMA architecture scalable to 1,024 processors, combining full-bit-vector directories for small sharer sets with coarse-vector approximations for efficiency. These protocols excel in large-scale systems by eliminating broadcast traffic, which grows quadratically with processor count in snooping schemes, and by exploiting the sparse sharing patterns common in applications where most blocks are cached by few processors. This results in lower network contention and better scalability, with studies showing directory protocols maintaining performance up to 128 processors and beyond, where snooping systems saturate. Additionally, they support efficient handling of infrequently shared data through precise tracking in full-map variants or approximations in coarse ones, minimizing unnecessary messages.
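A full-map directory entry can be modeled as a presence bit vector plus an owner field. The sketch below is an illustrative simplification with hypothetical message names (GetS, GetM, Inv) and a fixed processor count; a real directory controller would also manage transient states and data forwarding.

```cpp
// Full-map directory entry for one memory block: one presence bit per processor
// plus a dirty-owner indicator. On GetM, only the caches named in the bit vector
// receive invalidations; there is no broadcast.
#include <bitset>
#include <optional>
#include <vector>

constexpr int kNumProcs = 64;

struct DirectoryEntry {
    std::bitset<kNumProcs> sharers;   // which caches hold a copy
    std::optional<int> owner;         // set when one cache holds the block dirty

    // Read-shared request (GetS) from processor p: record the new sharer.
    // A real protocol would first retrieve dirty data from the owner, if any.
    void handle_GetS(int p) {
        if (owner) { sharers.set(*owner); owner.reset(); }  // downgrade owner to sharer
        sharers.set(p);
    }

    // Read-exclusive request (GetM) from processor p: return the caches that must
    // receive point-to-point invalidations (Inv), then make p the sole owner.
    std::vector<int> handle_GetM(int p) {
        std::vector<int> invalidations;
        for (int i = 0; i < kNumProcs; ++i)
            if (i != p && sharers.test(i)) invalidations.push_back(i);  // send Inv to i
        if (owner && *owner != p && !sharers.test(*owner))
            invalidations.push_back(*owner);
        sharers.reset();
        owner = p;
        return invalidations;  // targeted messages, not a broadcast
    }
};
```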

Implementation and Challenges

Hardware Support Mechanisms

Hardware support for memory coherence relies on specialized cache states and transition mechanisms to track and update data copies across processors. The MESI protocol exemplifies this through four primary cache block states: Modified (M), where the block is dirty and exclusively owned by one cache, allowing read and write access while the memory copy is stale; Exclusive (E), a clean exclusive state where the block matches memory and permits read or write access by a single cache; Shared (S), a read-only state where multiple caches hold identical clean copies matching memory; and Invalid (I), indicating no valid data in the cache block. These states enable caches to respond appropriately to coherence events, ensuring the single-writer-multiple-reader invariant. In snoopy-based protocols, transitions occur via bus signals: from I to E or S on a processor read (PrRd), depending on whether other sharers exist; from E to M on a local write hit without bus activity; from S to I on a bus read-exclusive (BusRdX) signal from another processor, invalidating shared copies; from M to S on a BusRd, supplying data while flushing to memory; and from M to I on BusRdX, triggering a write-back. Directory-based systems use point-to-point messages for transitions: from I to S on a read-shared request (GetS), adding the requester to a sharers list; from I to E on a read-exclusive request (GetM) if no sharers exist; from S to I via directory-issued invalidations (Inv) on a write request; and from E or M to S by forwarding data and updating the directory.

Interconnect support distinguishes write-back from write-through policies in bus-based systems. Write-back caches delay memory updates until eviction or coherence events, reducing bus traffic but requiring ownership tracking so that a cache can supply data to others during transitions such as M to S. Write-through policies immediately propagate writes to memory, simplifying invalidation in simple protocols but increasing bus contention, as seen in write-through-with-invalidate schemes where all caches snoop and invalidate on writes. In directory systems, interconnects handle request types such as read-shared for non-exclusive reads, read-exclusive for acquiring write permission, and invalidate messages to revoke permissions from sharers.

Coherence operates at the granularity of cache blocks, typically 64 bytes, to balance transfer overhead and locality. Larger blocks exploit spatial locality but amplify false sharing, where unrelated data in the same block triggers unnecessary invalidations across processors, elevating miss rates without the sharp improvements seen in uniprocessors. Granularity effects manifest as processors contending for blocks containing independent variables, leading to coherence traffic disproportionate to true sharing; the padding sketch below illustrates the effect and its standard mitigation.

Hardware mitigates coherence misses—accesses requiring intervention due to invalid or outdated states—and protocol races through transient states and escalation mechanisms. Coherence misses are classified during cache lookups, distinguishing them from capacity, conflict, or compulsory misses to enable targeted handling, such as reissuing requests if a transient state (e.g., a pending invalidation) blocks access. Protocol races, arising from concurrent requests on unordered interconnects, can be resolved with token counting, in which each block has a fixed number of tokens and a cache needs at least one token to read and all tokens to write; transient requests are reissued on failure, and persistent requests activated after timeouts forward tokens to override races without global serialization.
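Because coherence is tracked per block, two logically unrelated variables that land in the same 64-byte line can ping-pong between caches even though no data is actually shared. The sketch below demonstrates the standard padding/alignment mitigation; the 64-byte line size and the iteration counts are assumptions for illustration.

```cpp
// False sharing demonstration: the counters are independent, but if they share a
// 64-byte cache block, each writer's store invalidates the other's copy, generating
// coherence traffic. Aligning each counter to its own block avoids this.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>

constexpr std::size_t kCacheLine = 64;  // assumed coherence granularity

struct Unpadded {                        // both counters likely in one block
    std::atomic<std::uint64_t> a{0}, b{0};
};

struct Padded {                          // each counter in its own block
    alignas(kCacheLine) std::atomic<std::uint64_t> a{0};
    alignas(kCacheLine) std::atomic<std::uint64_t> b{0};
};

template <typename Counters>
void run(Counters& c) {
    std::thread t1([&] { for (int i = 0; i < 1'000'000; ++i) c.a.fetch_add(1); });
    std::thread t2([&] { for (int i = 0; i < 1'000'000; ++i) c.b.fetch_add(1); });
    t1.join();
    t2.join();
}

int main() {
    Unpadded u; run(u);   // suffers false sharing: frequent M -> I transitions
    Padded   p; run(p);   // independent blocks: far less coherence traffic
    return 0;
}
```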
In Arm-based systems, the CoreLink CMN-600 coherent mesh network provides hardware coherence via the AMBA Coherent Hub Interface (CHI) protocol, supporting snoop filters and scalable directory structures to track line states and route requests across clusters of multi-core CPUs, enabling efficient data sharing in large-scale Armv8-A configurations such as servers.

Scalability and Performance Issues

Snoopy-based protocols face significant scalability challenges due to their reliance on broadcast mechanisms over shared interconnects, leading to bottlenecks as the number of processors increases. The broadcast traffic required for coherence actions, such as invalidations and interventions, grows with the number of processors, overwhelming bus bandwidth and increasing latency. These protocols typically scale effectively only up to around 8–16 processors, beyond which the interconnect becomes a major limiter. In contrast, directory-based protocols address broadcast issues by using point-to-point communication but introduce per-block storage overhead that scales as O(N) with N processors. A full bit-vector directory requires one bit per processor per cache line to track sharers, resulting in memory overhead proportional to the product of processor count and memory size; for example, with 256 processors and 64-byte lines, directory state can approach or exceed 50% of total memory. Limited-pointer directories mitigate this by tracking only a fixed number of sharers, but they risk overflow and additional traffic for unresolved cases.

Coherence mechanisms impose substantial overhead in shared workloads, where traffic from invalidations and interventions can consume over 50% of available bandwidth, particularly in decision support system workloads. This overhead arises from frequent coherence misses, which force additional interconnect traversals and delay data access. Such misses elevate the average memory access time (AMAT) in multiprocessors compared to uniprocessors, as AMAT incorporates coherence-induced penalties alongside hit time, miss rate, and miss penalty; a worked example appears below.

To alleviate these issues, techniques like victim caches and prediction-based prefetching have been proposed, though they offer partial relief rather than comprehensive solutions. A victim cache, a small fully associative cache holding lines evicted from a direct-mapped cache, removes many of the conflict misses that exacerbate coherence traffic—up to 95% of them in some configurations. Prediction-based prefetching anticipates shared data accesses to preempt coherence requests, lowering overhead in chip multiprocessors by coordinating prefetches with synchronization events; for instance, synchronization-aware schemes can cut invalidation traffic while minimizing useless prefetches.

A notable example from the 1990s involves supercomputers like the Cray T3D, where the potential coherence overhead of large-scale systems prompted a shift to non-cache-coherent NUMA designs. The T3D's architecture provided globally addressable remote memory access without hardware cache coherence, relying on software management to avoid broadcast and directory costs, enabling scalability to thousands of processors at the expense of programmer effort for consistency. This approach optimized performance for NUMA latencies in regular scientific workloads.

Emerging challenges in 2025 arise from disaggregated memory architectures in data centers, where compute and memory resources are pooled separately using interconnects like Compute Express Link (CXL). Traditional protocols struggle to scale across remote memory nodes, leading to high latency and difficulty in maintaining coherence; recent research proposes selective coherence or hardware-software co-designed solutions to address the "coherence conundrum" in these systems. In modern evaluations, tools like the gem5 simulator facilitate detailed protocol analysis for scalability, modeling metrics such as interconnect utilization, latency under varying core counts, and coherence message distributions in multicore systems. These simulations reveal that coherence traffic can dominate interconnect usage in shared-data benchmarks, guiding optimizations like snoop filters to reduce unnecessary broadcasts by up to 50%.
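The directory-overhead and AMAT figures above follow from simple arithmetic. The sketch below reproduces both calculations; the line size, miss rates, and penalty values are assumed for illustration rather than measured.

```cpp
// Back-of-the-envelope scalability arithmetic for directory overhead and AMAT.
#include <cstdio>

int main() {
    // Full bit-vector directory: one presence bit per processor per cache line.
    const int processors = 256;
    const int line_bytes = 64;  // 512 bits of data per line
    const double dir_overhead =
        static_cast<double>(processors) / (line_bytes * 8);   // directory bits / data bits
    std::printf("directory overhead: %.0f%% of memory\n", dir_overhead * 100);  // ~50%

    // Coherence-aware average memory access time (cycles):
    // AMAT = hit_time + miss_rate * effective_miss_penalty, where coherence misses
    // add extra interconnect traversals to the effective penalty.
    const double hit_time = 1.0;                   // assumed L1 hit latency
    const double miss_rate = 0.02;                 // assumed miss rate
    const double base_miss_penalty = 100.0;        // assumed cycles to memory
    const double coherence_miss_fraction = 0.3;    // assumed share of misses
    const double coherence_extra_penalty = 60.0;   // assumed extra cycles per coherence miss
    const double amat = hit_time +
        miss_rate * (base_miss_penalty +
                     coherence_miss_fraction * coherence_extra_penalty);
    std::printf("AMAT: %.2f cycles\n", amat);
    return 0;
}
```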

Applications and Extensions

Coherence in Distributed Systems

Distributed shared memory (DSM) systems extend the abstraction of a unified address space across networked nodes, emulating shared memory in environments where physical memory is distributed, such as clusters of workstations or wide-area networks. Unlike tightly coupled multiprocessors with hardware cache coherence, DSM relies on software or hybrid hardware-software mechanisms to manage coherence over higher-latency interconnects. Early examples include the IVY system, developed in 1988, which implemented page-based DSM on a token-ring network of Apollo workstations using techniques like page migration—transferring entire pages to the accessing node—and replication to maintain consistency while minimizing communication overhead. These approaches allow programmers to use familiar shared-memory models without explicit message passing, though they introduce challenges in balancing consistency with performance.

Key protocols in DSM emphasize relaxed consistency models to reduce communication costs, building on concepts like release consistency. The Munin system, introduced in 1990, implemented software release consistency in which updates from multiple writers are propagated only at synchronization points such as lock releases, using multiple coherence protocols tailored to the expected access pattern of each shared object (e.g., write-once, migratory, producer-consumer) to optimize for different sharing behaviors. Similarly, TreadMarks, developed in 1994, combined lazy release consistency with diff-based updates in a page-fault-driven manner, computing and exchanging only the differences between page versions rather than full pages (a simplified sketch of this twin-and-diff idea appears at the end of this section), which significantly reduced transfer volumes in update-heavy workloads on standard workstations connected via an ATM LAN. These protocols leverage software coherence managers to track page states and handle invalidations or updates, often integrating with the virtual memory system for fault handling.

DSM implementations face significant challenges, including hiding the high latency of network communication and maintaining consistency in the presence of failures like network partitions. Latency hiding is achieved through prefetching, overlapping computation with communication via software-managed queues, and optimistic protocols that delay coherence actions until necessary, as explored in systems that use compiler support or protocol optimizations to mask remote access delays. Network partitions require mechanisms such as quorum-based replication or version vectors to detect and resolve inconsistencies upon reconnection, ensuring eventual consistency without violating programmer expectations. Hybrid approaches mitigate these issues by combining DSM with explicit message passing; for instance, on Beowulf-style clusters—commodity Linux-based systems—programmers use shared-memory threading within SMP nodes and MPI for inter-node communication, allowing fine-grained control over data movement while retaining shared-memory semantics for intra-node locality.

The evolution of DSM has progressed from research prototypes to large-scale production systems, adapting coherence to global distributions. Modern examples include Google's Spanner, deployed in 2012, which provides external consistency—equivalent to strict serializability—across datacenters using synchronized clocks (the TrueTime API) and two-phase commit over Paxos-based replication, enabling globally consistent reads and writes despite planetary-scale latencies. Recent advancements (2023–2025) leverage technologies like Compute Express Link (CXL) for memory disaggregation and RDMA for low-latency remote access in AI and cloud workloads, including systems like Shray for distributed array storage and RMAI for remote memory access in inference tasks.
This represents a shift toward fault-tolerant, geo-replicated shared memory integrated with database semantics and hardware acceleration, influencing cloud computing paradigms while inheriting relaxed consistency models for scalability.
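The diff-based propagation used by TreadMarks-style systems can be illustrated with a short sketch. The code below is not TreadMarks itself; the page size, type names, and helper functions are assumptions chosen to show the twin/diff idea: snapshot the page before the first write, then ship only the changed bytes at a synchronization point.

```cpp
// Diff-based page propagation in the style of lazy-release-consistency DSMs:
// before the first write, the runtime snapshots a "twin" of the page; at a
// synchronization point it compares page and twin, and only the changed byte
// ranges (the "diff") are sent and applied to remote replicas.
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kPageSize = 4096;

struct DiffEntry {
    std::size_t offset;    // byte offset within the page
    std::uint8_t value;    // new byte value at that offset
};

// Produce the diff between the current page contents and its pre-write twin.
std::vector<DiffEntry> make_diff(const std::uint8_t* page, const std::uint8_t* twin) {
    std::vector<DiffEntry> diff;
    for (std::size_t i = 0; i < kPageSize; ++i) {
        if (page[i] != twin[i]) diff.push_back({i, page[i]});
    }
    return diff;  // typically much smaller than the full 4 KB page
}

// Apply a received diff to a remote replica of the page.
void apply_diff(std::uint8_t* replica, const std::vector<DiffEntry>& diff) {
    for (const DiffEntry& d : diff) replica[d.offset] = d.value;
}
```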

Modern Extensions in Multicore Processors

Modern multicore processors, as of 2025, have evolved to incorporate chiplet-based architectures and heterogeneous elements, necessitating extensions to traditional coherence protocols for improved scalability and performance. These extensions often build on directory-based mechanisms to handle larger core counts and diverse workloads, while integrating accelerators like GPUs and FPGAs into the coherence domain. For instance, AMD's 5th Gen EPYC (Turin) processors, released in 2024, employ a chiplet design in which multiple core dies are interconnected via Infinity Fabric—a coherent interconnect that implements directory-based coherence across chiplets—enabling efficient NUMA handling in multi-socket systems with up to 128 cores per socket. This approach reduces latency for inter-die communication compared to earlier monolithic designs, supporting scalable coherence traffic. Intel's Xeon 6 (Granite Rapids) processors, launched in 2024, utilize a mesh interconnect for on-chip communication, combining directory-based coherence with snoop-filter optimizations to maintain coherence across up to 128 cores per socket. This extension enhances bandwidth allocation and reduces contention in NUMA configurations, where remote memory accesses are minimized through intelligent data placement. In Arm-based systems, the AMBA CHI (Coherent Hub Interface) protocol serves as a key extension, providing a scalable, link-layer protocol for cache coherence in multicore SoCs, as seen in designs like NVIDIA's Grace CPU (2023), with recent integrations via NVLink Fusion for full coherency as of November 2025. CHI supports efficient snoop filtering and data sharing across heterogeneous cores, enabling low-latency coherence in big.LITTLE configurations and integrated GPU setups.

To address the demands of heterogeneous multicore environments, standards like Compute Express Link (CXL) extend coherence beyond the CPU socket to accelerators and memory devices. CXL leverages the PCIe physical layer with protocols such as CXL.cache and CXL.mem to enable asymmetric MESI-style coherence, allowing devices to participate in the host's coherence domain for fine-grained sharing. This results in latencies comparable to remote-socket access (around 57 ns) and bandwidth scaling up to 256 GB/s, benefiting heterogeneous and disaggregated systems by pooling memory resources without software-managed copies. Additionally, fine-grain coherence specialization, as proposed in recent research frameworks (2022), tailors protocol actions per access—such as update forwarding or owner prediction—to optimize for low-locality workloads in CPU-GPU setups, reducing execution time by up to 61% and network traffic by 99% in benchmarks. These extensions prioritize simplicity while enhancing reuse of coherence infrastructure across diverse accelerator ecosystems.
