
Cache hierarchy

Cache hierarchy refers to the multi-level structure of cache memories in modern computer architectures, where smaller, faster caches are organized in successive layers between the CPU and main memory to store frequently accessed data and instructions, thereby bridging the performance gap between the CPU and slower main memory. This organization exploits principles of temporal locality (reuse of recently accessed data) and spatial locality (access to data near recently used locations) to minimize average access latency. The primary purpose of the cache hierarchy is to compensate for the widening disparity in access speeds between processors and main memory, a gap that has grown significantly since the 1980s due to faster CPU clock rates compared to slower DRAM improvements. By placing caches at multiple levels, systems achieve a balance of speed, capacity, and cost, using expensive but fast static RAM (SRAM) for upper levels and transitioning to larger, cheaper dynamic RAM (DRAM) further down. Data is transferred in fixed-size blocks between levels, with hardware-managed policies determining placement, replacement, and coherence to ensure efficient operation.

In typical implementations, the hierarchy consists of three primary on-chip cache levels: L1, L2, and L3. The L1 cache, closest to the cores, is the smallest (often 32-64 KB per core) and fastest (1-4 cycles of latency), frequently split into separate instruction (L1i) and data (L1d) caches to optimize performance. The L2 cache is larger (256 KB to 2 MB per core) and slightly slower (5-10 cycles), serving as a backstop for L1 misses while still being on-chip for low latency. The L3 cache, shared among multiple cores, is the largest (up to 64 MB or more) and slowest among the caches (20+ cycles), acting as a last on-chip buffer before accessing main memory, which has latencies around 100 cycles or higher.

This design has evolved with multi-core processors, where shared lower-level caches like L3 improve inter-core data sharing but introduce challenges in coherence and contention. Overall, the cache hierarchy significantly enhances system performance, with hit rates often exceeding 90% in the upper levels for well-behaved workloads, though effectiveness depends on application locality patterns.

Introduction

Definition and Purpose

A cache hierarchy is a multi-tiered memory system in computer architecture, consisting of multiple levels of cache memory arranged between the CPU and main memory. Each level serves as a smaller, faster buffer for the next larger and slower one, typically including L1, L2, and sometimes L3 caches, with progressively increasing capacity and decreasing access speed. This organization exploits data locality to store copies of frequently used data closer to the CPU, thereby minimizing the time required for data retrieval from slower main memory. The primary purpose of a cache hierarchy is to bridge the significant performance gap between the rapid execution speeds of modern processors and the comparatively slower access times of dynamic RAM (DRAM). By positioning fast static RAM (SRAM)-based caches on or near the processor chip, the hierarchy enables the CPU to access commonly used instructions and data with minimal latency, effectively masking the delays inherent in deeper memory layers. This design is essential in contemporary systems, where processor clock speeds have outpaced memory bandwidth improvements for decades. Key benefits include substantial reductions in average memory access latency, higher instruction throughput, and improved overall system efficiency, particularly in performance-critical applications like scientific computing and real-time processing. For instance, successful data hits in upper cache levels can deliver access times orders of magnitude faster than main memory fetches, leading to measurable gains in processor utilization. Each cache level functions as a buffer for its successor, with the L1 cache being the smallest and fastest, often integrated directly on the processor die to support split instruction and data storage. The effectiveness of this approach relies on the principle of locality of reference, where programs tend to reuse recently accessed data.

Historical Evolution

The concept of cache memory as a fast "slave" store to supplement slower main memory was first formalized by Maurice Wilkes in 1965, laying the theoretical groundwork for hierarchical memory systems. The first commercial implementation appeared in the IBM System/360 Model 85 mainframe in 1968, which introduced a 16 KB high-speed buffer storage operating as a cache to accelerate access to the larger main memory. During the 1970s and 1980s, single-level cache designs predominated in processor architectures, exemplified by the experimental IBM 801 minicomputer project initiated in 1975 and prototyped by 1980, which integrated separate instruction and data caches to support its reduced instruction set computing approach.

The 1990s marked a pivotal shift toward multi-level hierarchies to address growing performance demands from increasing clock speeds and application complexity. Intel's Pentium Pro, released in 1995, pioneered the inclusion of a secondary (L2) cache implemented off the processor die with sizes up to 1 MB, separate from the on-chip L1 cache, to extend hit rates for larger working sets. By the late 1990s, on-die integration of L2 caches became feasible, as demonstrated by AMD's K6-III in 1999, which featured 256 KB of L2 cache directly on the die running at core speed to reduce latency. A key milestone during this era was the transition from asynchronous caches, common in early mainframes like the System/360, to synchronous designs aligned with processor clocks, enabling tighter pipelining and higher frequencies starting in the mid-1980s with RISC processors.

In the 2000s, the rise of multi-core processors drove the adoption of tertiary L3 caches, often shared across cores to optimize coherence and bandwidth. Intel's Nehalem microarchitecture, introduced in 2008 with the Core i7 processors, integrated up to 8 MB of shared inclusive L3 cache on-die, facilitating efficient data sharing in multi-threaded environments. The 2010s saw further refinements, including non-inclusive L3 policies to enhance effective capacity by avoiding duplication of L1 and L2 data; Intel's server Skylake-SP (Xeon Scalable) processors in 2017 implemented a non-inclusive shared L3 cache of up to 38.5 MB, reducing snoop traffic in multi-core setups. Integration with multi-core designs accelerated in this period, with caches evolving to support coherence protocols like Intel's MESIF for shared resources.

The 2020s have emphasized scaling cache sizes for data-intensive workloads, particularly AI and machine learning, where large datasets benefit from reduced memory latency. AMD's EPYC 9004 "Genoa" series, launched in 2022, exemplifies this with up to 384 MB of shared L3 cache per socket in its 96-core configurations, boosting performance in AI training by minimizing off-chip accesses. Subsequent advancements include the AMD EPYC 9005 "Turin" series, launched in October 2024 with up to 192 Zen 5 cores and 384 MB of L3 cache per socket, alongside Intel's Granite Rapids Xeon processors in 2024 featuring up to 128 cores and enlarged L3 caches exceeding 300 MB in high-end models, continuing to balance capacity, latency, and power within multi-level hierarchies.

Fundamentals

Locality of Reference

Locality of reference is a fundamental principle in computer architecture that describes the tendency of programs to access a relatively small subset of their address space repeatedly over short periods, enabling efficient memory designs such as caches. This principle, formalized by Peter J. Denning in his analysis of program behavior, underpins the effectiveness of caching by predicting that memory references cluster in both time and space, reducing the need to fetch data from slower main memory. Temporal locality refers to the likelihood that a memory location recently accessed will be reused soon afterward, often observed in program constructs like loops where the same variables or data structures are repeatedly referenced. For instance, in iterative algorithms, control flow repeatedly accesses the same code or data elements, creating reuse patterns that caches can exploit by retaining recently used items. Spatial locality, on the other hand, arises when programs access data located near previously referenced addresses, such as during sequential traversal of arrays or structures stored contiguously in memory. These behaviors stem from the structured nature of typical programs, where execution flows through localized regions of code and data. The theoretical foundation of locality derives from empirical analyses of program execution traces, revealing that most references occur within a "working set" of active pages or data blocks, as modeled by Denning to approximate program memory demands over time windows. This has implications under Amdahl's law for memory-bound applications, where poor locality amplifies the serial fraction of execution time dominated by slow memory accesses, limiting overall speedup despite faster processors; conversely, strong locality mitigates this by minimizing effective memory latency. In cache hierarchies, these principles enable hit rates exceeding 90% in smaller, faster levels like L1 caches for typical workloads, as recent benchmarks on modern processors demonstrate 95-97% L1 hit rates due to clustered references. A classic example is matrix multiplication, where temporal locality manifests as each matrix element is reused O(n) times across nested loops, and spatial locality appears in row- or column-wise traversals that access contiguous blocks. Techniques like hardware prefetching extend these benefits by anticipating spatial patterns to load data proactively, further boosting hit rates in hierarchies without altering core program behavior. Real-world benchmarks, such as those on SPEC CPU suites, consistently show locality driving 90%+ hit rates in L1 caches for compute-intensive tasks, underscoring its role in achieving performance close to ideal memory speeds.
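
The effect of spatial locality can be seen directly in traversal order. The minimal C sketch below (array size and values are arbitrary illustrative choices) computes the same sum two ways: the row-wise loop matches the row-major layout of C arrays and exploits spatial locality, while the column-wise loop strides through memory and typically runs several times slower on cached hardware.

```c
#include <stdio.h>

#define N 1024

/* Large static array (8 MB) so traversal order actually matters for caches. */
static double a[N][N];

/* Row-wise sum: walks memory contiguously, so each fetched cache line is
 * fully used before moving on (spatial locality). */
static double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];            /* consecutive addresses */
    return s;
}

/* Column-wise sum: strides by N doubles per access, touching a new cache
 * line almost every time and typically missing far more often. */
static double sum_column_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];            /* stride of N * sizeof(double) bytes */
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("row-major sum:    %.0f\n", sum_row_major());
    printf("column-major sum: %.0f\n", sum_column_major());
    return 0;
}
```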

Cache Levels and Organization

The cache hierarchy in modern processors is typically organized into multiple levels, each with distinct characteristics in size, speed, and scope to optimize performance by exploiting locality of reference. The first level, L1 cache, is the smallest and fastest, usually ranging from 32 to 64 KB per core, with latencies of 1 to 4 cycles. It is positioned closest to the CPU execution units and is commonly split into separate instruction (I-cache) and data (D-cache) components to allow simultaneous access for fetching instructions and loading or storing data. The second level, L2 cache, serves as a backup to the L1 and is larger, typically 256 KB to 2 MB per core, with access latencies of 10 to 20 cycles. It is often implemented as a unified cache, holding both instructions and data, and is usually private to each core to reduce contention in multi-core systems. This level provides a balance between capacity and speed, capturing data that does not fit in L1 but is still frequently accessed. The third level, L3 cache or last-level cache (LLC), is the largest in the on-chip hierarchy, shared among multiple cores with sizes from 8 to 100 MB or more, and access latencies of 20 to 50 cycles. It acts as a communal resource to filter misses from the smaller caches above it before accessing main memory, which has much higher latencies (typically 100-300 cycles). In contemporary designs, L1, L2, and L3 caches are predominantly on-chip for reduced latency, though earlier systems placed L2 or L3 off-chip. Cache organization within each level commonly employs associativity to map memory blocks to cache lines, including direct-mapped (one possible location per block), set-associative (multiple locations within a set), or fully associative (any location) arrangements. The hierarchy can be inclusive (lower levels contain all data from upper levels), exclusive (no overlap between levels), or non-inclusive (partial overlap allowed), influencing how data propagates through the levels. Overall, these levels form a progressive filtering mechanism, where misses at one level trigger searches in the next, minimizing expensive main memory accesses.
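
To make the mapping concrete, the following C sketch decomposes an address into the tag, set index, and block offset fields used by a set-associative cache; the 32 KB, 8-way, 64-byte-line geometry is an illustrative assumption rather than a description of any particular processor.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters: a 32 KB, 8-way set-associative cache with
 * 64-byte lines, giving 64 sets. */
#define CACHE_SIZE (32 * 1024)
#define LINE_SIZE  64
#define WAYS       8
#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * WAYS))

/* Splits an address into the fields a set-associative cache uses: the
 * offset selects a byte within the line, the index selects a set, and the
 * tag identifies which memory block occupies a way within that set. */
static void decompose(uint64_t addr) {
    uint64_t offset = addr % LINE_SIZE;
    uint64_t index  = (addr / LINE_SIZE) % NUM_SETS;
    uint64_t tag    = addr / (LINE_SIZE * NUM_SETS);
    printf("addr=0x%llx -> tag=0x%llx set=%llu offset=%llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)index, (unsigned long long)offset);
}

int main(void) {
    decompose(0x7ffd1240);
    decompose(0x7ffd1240 + LINE_SIZE);            /* next line, next set */
    decompose(0x7ffd1240 + LINE_SIZE * NUM_SETS); /* same set, different tag */
    return 0;
}
```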

Multi-Level Design

Average Access Time

In multi-level cache hierarchies, the average access time (AAT), often referred to as the average memory access time (AMAT), represents the expected time to retrieve data from the memory system, accounting for hits and misses across all levels and main memory. This metric quantifies the overall performance of the hierarchy by weighting the access time of each level by its hit probability, providing a single value that reflects the system's effectiveness in bridging the gap between the processor and main memory. The AAT for a three-level cache hierarchy is calculated recursively as follows: \text{AAT} = h_1 t_1 + (1 - h_1) \left[ h_2 t_2 + (1 - h_2) \left[ h_3 t_3 + (1 - h_3) t_m \right] \right] where h_i denotes the hit rate at cache level i (with 0 \leq h_i \leq 1), t_i is the access latency at level i (typically in clock cycles), and t_m is the access time for main memory. This formula assumes hit rates are independent across levels and that misses at one level propagate to the next. To derive this, begin at the lowest level: the effective access time for level 3 (L3 cache) is \text{AMAT}_3 = h_3 t_3 + (1 - h_3) t_m, as an L3 miss requires fetching from main memory. For level 2 (L2 cache), the miss penalty is the AMAT of L3, yielding \text{AMAT}_2 = h_2 t_2 + (1 - h_2) \text{AMAT}_3. Extending this to level 1 (L1 cache) gives the overall AAT as \text{AMAT}_1 = h_1 t_1 + (1 - h_1) \text{AMAT}_2, which expands to the full expression above. For a simplified two-level hierarchy (a single cache plus main memory), the formula reduces to \text{AAT} = h t_c + (1 - h) t_m, where h is the cache hit rate, t_c is the cache access time, and t_m is the main memory access time; this highlights how even small improvements in hit rate can dramatically lower AAT due to the high penalty of main memory accesses. Several factors influence AAT, primarily the hit and miss ratios at each level—which depend on workload characteristics like locality—and the latency differences between levels, where upper-level caches (e.g., L1) prioritize low latency over capacity. For instance, consider a two-level system with L1 hit rate h_1 = 0.95, L1 access time t_1 = 1 cycle, L2 hit rate h_2 = 0.90 (conditional on an L1 miss), L2 access time t_2 = 14 cycles, and main memory access time t_m = 200 cycles. The L2 AMAT is 0.90 \times 14 + 0.10 \times 200 = 12.6 + 20 = 32.6 cycles, so the overall AAT = 0.95 \times 1 + 0.05 \times 32.6 \approx 2.58 cycles; adjusting for more realistic multi-level hit rates and latencies often yields AAT values of 5–10 cycles in modern designs, underscoring the hierarchy's role in keeping access times close to L1 latencies. In practice, AAT is measured using hardware performance counters during benchmarks to capture hit and miss events, enabling computation via the above formulas. Tools like the perf utility record these counters (e.g., L1-dcache-load-misses) for specific workloads, providing empirical miss rates and latencies to evaluate and tune performance without relying solely on analytical models.
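
The recursive formula translates directly into code. The C sketch below folds the hierarchy from main memory upward and reproduces the two-level example above; the hit rates, latencies, and 200-cycle memory penalty are the illustrative figures used in the text, not measured values.

```c
#include <stdio.h>

/* Computes the average memory access time for a hierarchy of n cache
 * levels followed by main memory, using the recursive formula:
 *   AMAT_i = h_i * t_i + (1 - h_i) * AMAT_{i+1},  with AMAT_{n+1} = t_mem.
 * Hit rates are conditional on a miss in the level above. */
static double amat(const double hit[], const double lat[], int n, double t_mem) {
    double penalty = t_mem;
    for (int i = n - 1; i >= 0; i--)          /* fold from the bottom up */
        penalty = hit[i] * lat[i] + (1.0 - hit[i]) * penalty;
    return penalty;
}

int main(void) {
    /* Two-level example from the text: 95% L1 hits at 1 cycle, 90% L2 hits
     * (conditional) at 14 cycles, 200-cycle main memory. */
    double hit[] = {0.95, 0.90};
    double lat[] = {1.0, 14.0};
    printf("AAT = %.2f cycles\n", amat(hit, lat, 2, 200.0));   /* ~2.58 */
    return 0;
}
```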

Inclusion Policies

In cache hierarchies, inclusion policies dictate whether and how data blocks from higher-level caches (closer to the processor, such as L1) are required to be present in lower-level caches (farther from the processor, such as L2 or L3). These policies influence effective capacity, coherence overhead, and overall system performance by managing data replication across levels. The inclusive policy mandates that all data in higher-level caches is also duplicated in the corresponding lower-level caches, making the lower levels a strict superset of the upper ones. This approach was common in earlier designs, such as Intel's Nehalem-based Core i7, where the shared L3 inclusively contained the contents of the L1 and L2 caches. Inclusive policies simplify coherence protocols by ensuring that snoops to the lower level can capture all relevant data without needing to query upper levels separately, which aids directory-based coherence in multi-core systems. However, they lead to upper-level pollution, where evictions from the lower level (known as victims) back-invalidate hot data in upper caches, reducing effective capacity and hit rates, particularly as core counts increase.

In contrast, the exclusive policy prohibits data overlap between cache levels, ensuring that a block present in a higher-level cache cannot reside in a lower-level one, thereby maximizing total unique storage capacity across the hierarchy. Early AMD Zen architectures employed a mostly exclusive victim L3 cache design, treating it as a cache for lines evicted from the private L1 and L2 caches. While this boosts hit rates by avoiding replication—potentially improving performance by up to 9.4% in workloads sensitive to capacity—it complicates eviction and insertion processes, as data must be moved between levels upon misses or replacements, increasing on-chip traffic by as much as 72.6%. Exclusive policies also heighten coherence challenges, requiring additional coordination to prevent race conditions during transfers.

The non-inclusive policy, often termed victim-inclusive, relaxes strict inclusion by allowing the lower-level cache (typically L3) to hold a mix of data: some from upper levels (L1/L2) plus evicted victims, but without mandating full duplication. Intel's Skylake-X (2017) processors adopted a non-inclusive L3 cache, balancing capacity and coherence by permitting partial overlap. This design mitigates the back-invalidation pollution of inclusive caches—improving performance by 5.9% over strict non-inclusive setups in some benchmarks—while reducing snoop traffic compared to exclusive policies, though it may still incur up to 50% more back-invalidations in multi-core scenarios. Non-inclusive approaches support hybrid coherence, where directories track sharers without full inclusion guarantees.

Trade-offs among these policies center on hit rates, snoop traffic, and design complexity: inclusive designs favor simpler coherence at the cost of 3-8% performance penalties from back-invalidations in large hierarchies, while exclusive and non-inclusive variants enhance effective capacity (up to 18% miss reduction) but elevate traffic and complexity. The evolution shifted from predominantly inclusive policies in the 2000s, in uniprocessors and early chip multiprocessors, to non-inclusive and exclusive designs in the 2010s and continuing into the 2020s, driven by multi-core demands for higher effective capacity and reduced replication in processors like AMD's Zen and Intel's Skylake-SP. Strict inclusion remains beneficial for directory-based coherence protocols, as it streamlines sharer tracking and minimizes broadcast snoops.
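
The practical difference between the policies shows up most clearly on a last-level-cache eviction. The C sketch below is a conceptual illustration only—the cache structures and helper functions are hypothetical stand-ins, not any vendor's implementation—showing that an inclusive hierarchy must back-invalidate upper-level copies to preserve the subset property, whereas exclusive and non-inclusive hierarchies may leave them in place.

```c
#include <stdio.h>

/* Policy choices discussed in the text. */
typedef enum { INCLUSIVE, EXCLUSIVE, NON_INCLUSIVE } policy_t;

/* Hypothetical hooks standing in for real cache machinery. */
static void invalidate_in_upper_levels(unsigned long block) {
    printf("back-invalidate block 0x%lx in L1/L2\n", block);
}
static void write_back_if_dirty(unsigned long block) {
    printf("write back block 0x%lx if dirty\n", block);
}

/* What an LLC (L3) eviction implies under each policy: inclusive
 * hierarchies must also purge the block from the upper caches to keep the
 * lower level a superset, which is the source of the "pollution" described
 * above; exclusive and non-inclusive hierarchies need not. */
static void on_llc_eviction(policy_t policy, unsigned long block) {
    write_back_if_dirty(block);
    if (policy == INCLUSIVE)
        invalidate_in_upper_levels(block);
    /* EXCLUSIVE / NON_INCLUSIVE: upper-level copies may survive. */
}

int main(void) {
    on_llc_eviction(INCLUSIVE, 0x1000);
    on_llc_eviction(NON_INCLUSIVE, 0x2000);
    return 0;
}
```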

Write Policies

In cache hierarchies, write policies determine how write operations to cached data are handled to balance performance, consistency, and complexity. The write-through policy updates both the cache and the next level of the hierarchy simultaneously upon a write, ensuring immediate data consistency across the hierarchy. This simplicity makes it suitable for scenarios requiring prompt persistence, such as I/O buffers where data must be promptly visible to devices, but it increases bandwidth demands because every write propagates downward. The write-back policy, also known as copy-back or write-behind, updates only the cache on a write and defers propagation to lower levels until the block is evicted, replaced, or explicitly flushed. A dirty bit per cache block tracks modifications to enable selective write-backs, reducing unnecessary traffic while risking temporary inconsistency or data loss on cache failure without backups. This approach is prevalent in modern processors, such as the Intel Core i7's L2 and L3 caches, to minimize bus contention in bandwidth-constrained systems.

Write policies interact with allocation strategies on write misses. Write-allocate fetches the entire block from memory into the cache before applying the write, leveraging spatial locality for future accesses; it is typically paired with write-back to amortize fetch costs over multiple writes. In contrast, no-write-allocate bypasses cache allocation, writing directly to lower-level memory to prevent pollution from one-time writes; this is often combined with write-through to avoid fetching unused data. In multi-level hierarchies, policies can be tailored per level: L1 data caches commonly use write-back with write-allocate to buffer writes efficiently and exploit spatial locality, while L2 and L3 likewise employ write-back with write-allocate to buffer writes before main memory access. For example, many processors implement write-back in the L1 data cache and the unified L2, using dirty bits to track changes.

Performance impacts vary by workload: write-back reduces memory traffic by 50-70% in write-intensive scenarios compared to write-through, lowering miss penalties by 20-30% in multi-level setups and yielding 10-20% overall gains in SPEC benchmarks. However, write-through incurs a 5-10% performance loss from elevated bandwidth usage, though write buffers can mitigate stalls under both policies. In multiprocessor environments, write-back elevates coherence overhead due to delayed visibility of updates.
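
The two common pairings can be sketched with a toy direct-mapped cache in C. This is an illustrative model only (the geometry, helpers, and simplified behavior are assumptions, not a real controller): write-back with write-allocate defers memory traffic until a dirty line is evicted, while write-through with no-write-allocate sends every write to memory and bypasses the cache on misses.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64
#define NUM_LINES 16

typedef struct { uint64_t tag; bool valid, dirty; } line_t;
static line_t cache[NUM_LINES];

static void memory_write(uint64_t addr) { printf("  memory write @0x%llx\n", (unsigned long long)addr); }
static void memory_fetch(uint64_t addr) { printf("  memory fetch @0x%llx\n", (unsigned long long)addr); }

/* Write-back + write-allocate: a miss first writes back a dirty victim,
 * then fetches the target block (write-allocate); the line is updated in
 * place and marked dirty, so memory sees the data only at eviction time. */
static void write_back_allocate(uint64_t addr) {
    uint64_t idx = (addr / LINE_SIZE) % NUM_LINES;
    uint64_t tag = addr / (LINE_SIZE * NUM_LINES);
    line_t *l = &cache[idx];
    if (!l->valid || l->tag != tag) {                            /* write miss */
        if (l->valid && l->dirty)
            memory_write((l->tag * NUM_LINES + idx) * LINE_SIZE); /* victim */
        memory_fetch(addr);                                       /* allocate */
        l->valid = true; l->tag = tag; l->dirty = false;
    }
    l->dirty = true;                                              /* defer */
}

/* Write-through + no-write-allocate: every write goes straight to memory;
 * on a miss the cache is left untouched. */
static void write_through_no_allocate(uint64_t addr) {
    uint64_t idx = (addr / LINE_SIZE) % NUM_LINES;
    uint64_t tag = addr / (LINE_SIZE * NUM_LINES);
    if (cache[idx].valid && cache[idx].tag == tag) {
        /* hit: the cached copy would also be updated here */
    }
    memory_write(addr);                                           /* always */
}

int main(void) {
    printf("write-back + write-allocate:\n");
    write_back_allocate(0x1040);
    write_back_allocate(0x1040);          /* hit: no memory traffic */
    printf("write-through + no-write-allocate:\n");
    write_through_no_allocate(0x2000);
    write_through_no_allocate(0x2000);    /* memory written both times */
    return 0;
}
```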

Organizational Variants

Unified versus Split Caches

In cache hierarchy design, unified caches store both instructions and data in a single structure, providing a shared pool that simplifies hardware implementation by reducing the need for duplicate storage and management logic. This design enhances capacity utilization in workloads with varying instruction and data access patterns, as the cache can dynamically allocate space without fixed partitioning. In contrast, split caches employ separate instruction (I-cache) and data (D-cache) structures, typically at the L1 level, enabling simultaneous accesses to instructions and data for improved parallelism and reduced contention during fetch and load/store operations. This separation aligns with Harvard architecture principles, potentially doubling bandwidth compared to a unified cache of equivalent size by avoiding resource conflicts between instruction fetches and data manipulations. However, split caches increase overhead through duplicated control logic and tags, which can lower the overall hit rate for a given total capacity since resources cannot be flexibly shared. Unified caches mitigate tag storage overhead but risk internal contention when instruction and data accesses compete for the same banks, potentially degrading performance in instruction-intensive or data-heavy phases. Split caches, while offering better isolation to ease design and minimize structural hazards, may underutilize one side if access patterns are imbalanced, leading to inefficiencies in space allocation. The adoption of split caches at the L1 level emerged in the 1980s, as seen in RISC designs such as the MIPS R3000, which used dedicated instruction and data caches (commonly 64 KB each) to sustain pipelined execution without bottlenecks. Higher-level caches, such as L2 and L3, have predominantly remained unified to facilitate sharing across instructions and data, promoting higher effective hit rates in multi-level hierarchies. Contemporary trends in system-on-chip (SoC) designs incorporate hybrid approaches, such as virtually split caches, which logically partition a unified structure dynamically based on access demands to balance the bandwidth benefits of splitting with the flexibility of unification, thereby enhancing power efficiency in resource-constrained embedded systems. These designs can reduce dynamic power by up to 29% while improving performance by around 2.5% compared to traditional unified configurations. The choice between unified and split caches also influences integration with private or shared multi-core setups, where split L1 designs per core complement unified higher levels for coherent sharing.

Shared versus Private Caches

In multi-core processors, private caches are dedicated exclusively to a single core, ensuring low-latency access without interference from concurrent accesses by other cores. The L1 caches, both instruction and data, are universally private in modern designs to minimize the access times critical for instruction fetch and load/store operations. Extending this model, L2 caches are often private as well, particularly in architectures like AMD's Zen series, where each core is allocated 512 KB to 1 MB of L2 cache to insulate it from higher-level latencies. This configuration eliminates remote access overhead and contention, allowing each core to operate with dedicated bandwidth and reduced power consumption for local accesses.

In contrast, shared caches are accessible by multiple cores, typically implemented at the L3 level to facilitate data sharing and reduce overall memory redundancy across the chip. For instance, Intel processors from the Nehalem microarchitecture onward feature a shared L3 cache among all cores on the die, ranging from 8 MB to over 100 MB depending on the model, which serves as a victim cache for L2 evictions and holds shared data. Similarly, AMD's EPYC processors employ shared L3 caches within core complexes (CCXs), with 32 MB per eight-core group in Zen 4 designs, enabling efficient inter-core data reuse in server workloads. Shared caches promote higher effective capacity by avoiding duplication of frequently accessed shared data, such as in multi-threaded applications, and simplify management of common datasets.

The choice between private and shared caches involves key trade-offs in performance, power, and complexity. Private caches minimize intra-chip contention and provide faster local access—often 20-50 cycles quicker than reaching a shared last-level cache—but demand larger total on-die area to avoid capacity waste from redundant copies, increasing manufacturing costs and power draw. Shared caches, while enhancing utilization and reducing off-chip memory traffic in sharing-heavy workloads, introduce potential bandwidth bottlenecks and higher average access latencies due to interconnect traversal and snoop overhead, particularly for cross-core requests. In shared setups, coherence protocols are essential to maintain data consistency, though they add marginal overhead compared to fully private hierarchies.

Since the mid-2000s, hybrid designs combining private L1 and L2 caches with a shared L3 have become standard in multi-core processors, striking a balance between per-core speed and system-wide sharing; for example, Intel's Core 2 Duo (2006) used an L2 shared between paired cores, evolving to private L2 with shared L3 in subsequent generations. This approach supports efficient multi-threading by keeping hot data close to each core while pooling resources for cold or shared data. For scalability in larger systems, shared L3 caches work well for 2-8 cores per domain, but in high-core-count processors like AMD EPYC with up to 128 cores, hybrid partitioning—such as per-chiplet shared L3 slices of 32 MB each—mitigates latency and contention by localizing sharing within smaller groups. Such designs provide performance gains in parallel workloads compared to all-private hierarchies.

Banked and Interleaved Designs

Banked caches divide the physical cache storage into multiple independent banks, each capable of operating autonomously to handle simultaneous memory accesses from different cores or threads. This design typically employs 4 to 16 banks in shared last-level caches like the L3, where each bank serves a subset of the address space and is connected via a crossbar or ring interconnect to enable parallel operations without contention on a single monolithic structure. In multi-core processors, this partitioning mitigates wire delays and supports higher throughput by allowing multiple requests to proceed concurrently, as seen in the sliced L3 caches of Intel's multi-core architectures. Interleaved, or striped, cache designs extend banking by systematically distributing cache lines across banks based on address bits, reducing the likelihood of conflicts in multi-threaded workloads. Low-order interleaving assigns consecutive addresses to sequential banks, while more advanced hashing—such as XOR-based mapping—uses bit permutations and exclusive-OR operations on physical address fields to evenly spread accesses and minimize hotspots. For instance, Intel's L3 caches employ model-specific XOR hashing with 4 to 10 slices (banks), where address bits are selectively XORed against permutation masks to determine the target slice, enhancing scalability in processors like the Core i9-10900K. This interleaving is particularly vital in high-core-count CPUs and GPUs, where it distributes concurrent vector operations across banks to exploit thread-level parallelism.

The primary advantages of banked and interleaved designs include significantly increased effective bandwidth, enabling multiple simultaneous accesses per cycle—such as six loads/stores in wide-issue processors—without the area overhead of fully multi-ported caches. In GPUs, multi-banking reduces load-to-use stalls by allowing parallel servicing of requests across banks, improving throughput for memory-intensive kernels. These techniques have evolved from the single-bank caches dominant in 1990s processors to widespread multi-bank adoption in the 2000s for shared hierarchies in multi-core systems. However, these designs introduce complexities, such as the need for sophisticated addressing logic that can increase latency when conflicts occur, and the potential for thrashing in poorly interleaved setups where repeated accesses overload specific banks. Bank conflicts, where multiple requests target the same bank, can degrade performance by up to 49% in some workloads if spatial locality is not managed, as in interleaving schemes with high conflict rates. Additional drawbacks include elevated wiring demands for inter-bank connections, raising area and power costs in dense on-chip layouts.
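
A simplified C sketch of the two interleaving approaches is shown below; the bank count and bit-folding hash are illustrative assumptions rather than any processor's actual slice function. It demonstrates how a power-of-two stride that repeatedly hits one bank under low-order interleaving is spread across all banks by an XOR-based hash.

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_SHIFT 6          /* 64-byte cache lines */
#define NUM_BANKS  8          /* power of two for simplicity */

/* Low-order interleaving: consecutive cache lines map to consecutive banks,
 * so a stride of NUM_BANKS lines hammers a single bank. */
static unsigned bank_low_order(uint64_t addr) {
    return (unsigned)((addr >> LINE_SHIFT) % NUM_BANKS);
}

/* XOR-based hashing: fold higher line-address bits into the bank index so
 * that power-of-two strides spread across banks instead of colliding. */
static unsigned bank_xor_hash(uint64_t addr) {
    uint64_t line = addr >> LINE_SHIFT;
    unsigned idx = 0;
    while (line != 0) {
        idx ^= (unsigned)(line & (NUM_BANKS - 1));   /* fold 3 bits at a time */
        line >>= 3;                                  /* log2(NUM_BANKS) */
    }
    return idx;
}

int main(void) {
    /* A stride of NUM_BANKS lines (512 bytes here): pathological for
     * low-order interleaving, evenly spread under XOR hashing. */
    for (uint64_t addr = 0; addr < 8 * NUM_BANKS * 64; addr += NUM_BANKS * 64)
        printf("addr=0x%04llx  low-order bank=%u  xor bank=%u\n",
               (unsigned long long)addr, bank_low_order(addr), bank_xor_hash(addr));
    return 0;
}
```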

Performance Considerations

Trade-offs and Evolution

One fundamental trade-off in cache hierarchy design is between cache size and access speed. Larger caches can accommodate more data, thereby reducing miss rates and improving overall system performance by minimizing accesses to slower main memory. However, increasing cache size leads to higher access latency due to longer signal propagation delays across the larger on-chip area, as well as greater power consumption and silicon area requirements. For SRAM-based caches, which dominate on-die implementations, scaling size by a factor of two roughly doubles the area and cost while exacerbating these issues, prompting designers to balance capacity against these penalties.

Another key compromise involves cache associativity, which determines how flexibly data blocks can be placed within the cache to avoid conflicts. Higher set-associativity, such as 8-way or 16-way configurations, enhances hit rates by reducing conflict misses compared to direct-mapped or lower-associativity designs, allowing better utilization of cache space for diverse access patterns. Yet, this comes at the expense of increased lookup latency, as parallel comparisons across more ways require additional hardware and time—typically adding 2-4 clock cycles to access times in modern processors. Designers often select moderate associativity levels (e.g., 4-8 ways for L1 caches) to optimize this trade-off between hit rate improvements and the overhead of complex tag matching.

Power efficiency and manufacturing cost further constrain cache design, particularly in choices between on-die SRAM and alternatives like embedded DRAM (eDRAM). SRAM offers low latency and high speed but consumes significant static power and die area due to its six-transistor cell structure, making large caches expensive for high-performance computing. In contrast, eDRAM provides higher density (up to 3-4x that of SRAM) and lower static power—reducing dissipation by factors of 5 or more—while maintaining comparable access speeds for last-level caches, though it requires periodic refresh overhead. These trade-offs have driven adoption of eDRAM in select high-capacity implementations, such as in some IBM and Intel processors, to mitigate the power and cost burdens of scaling SRAM-based hierarchies.

The evolution of cache hierarchies reflects ongoing adaptations to these trade-offs, progressing from single-level designs in the late 1980s—where processors like the Intel 80486 relied solely on a small on-chip L1 cache of 8 KB—to multi-level structures in the 1990s, incorporating L2 and later L3 caches to bridge widening processor-memory speed gaps. Off-chip L2 caches became standard in the 1990s, evolving into on-die L2 and shared on-die L3 caches during the 2000s for better latency and bandwidth. In the 2020s, hierarchies have expanded to include massive last-level caches (e.g., 100+ MB shared L3) and emerging system-level caching mechanisms to support multi-core and AI workloads, prioritizing capacity over strict level counts.

Historical shifts in inclusion policies have also addressed multi-core challenges. Early designs favored exclusive or inclusive policies to simplify coherence, but as core counts grew in the 2000s, non-inclusive (or victim-inclusive) approaches gained prominence, particularly in AMD processors, to avoid duplicating L1 data in shared L2/L3 caches and maximize effective capacity without excessive coherence traffic. This transition reduced redundancy in multi-core environments, improving scalability while maintaining coherence through directory-based protocols.
Prefetcher integration has evolved to mitigate miss latencies, with hardware prefetchers becoming standard in CPU caches since the early 2000s to anticipate data accesses based on patterns observed in L1 and L2 misses. Advanced implementations, such as stride or stream prefetchers, now reside at the L2 and L3 levels, issuing proactive loads to hide latency in irregular workloads, though they introduce bandwidth overhead when their accuracy is low. Looking ahead, future cache hierarchies may incorporate disaggregated designs enabled by standards like Compute Express Link (CXL), allowing coherent sharing of remote memory pools across devices to scale capacity beyond on-chip limits while reducing per-core power. Emerging research into optical caches, leveraging photonic interconnects for ultra-low-latency data movement, promises to alleviate electrical signaling bottlenecks in dense hierarchies, though integration challenges remain. These trends aim to extend the hierarchy's benefits amid slowing transistor scaling.
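
As a rough illustration of the stride prefetchers mentioned above, the C sketch below tracks the last address and stride per load instruction and issues a prefetch once the same stride repeats; the table size, confidence threshold, and PC-indexed organization are simplifying assumptions, not a description of any shipping prefetcher.

```c
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 64

typedef struct {
    uint64_t pc, last_addr;
    int64_t  stride;
    int      confidence;
} stride_entry_t;

static stride_entry_t table[TABLE_SIZE];

static void issue_prefetch(uint64_t addr) {
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* Trains on every demand access (real designs often train on L1/L2 misses):
 * when a load's stride repeats, prefetch the next predicted address. */
static void train_and_prefetch(uint64_t pc, uint64_t addr) {
    stride_entry_t *e = &table[pc % TABLE_SIZE];
    if (e->pc == pc) {
        int64_t stride = (int64_t)(addr - e->last_addr);
        if (stride != 0 && stride == e->stride) {
            if (++e->confidence >= 2)               /* stride seen twice */
                issue_prefetch(addr + (uint64_t)stride);
        } else {
            e->stride = stride;                     /* retrain on new stride */
            e->confidence = 0;
        }
    } else {
        e->pc = pc;                                 /* new entry */
        e->stride = 0;
        e->confidence = 0;
    }
    e->last_addr = addr;
}

int main(void) {
    /* A load at a fixed (hypothetical) PC streaming through memory with a
     * 64-byte stride quickly triggers prefetches for upcoming lines. */
    for (int i = 0; i < 6; i++)
        train_and_prefetch(0x400123, 0x10000 + (uint64_t)i * 64);
    return 0;
}
```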

Gains and Limitations

Cache hierarchies deliver significant performance gains by mitigating the latency disparity between high-speed processors and slower main memory, often yielding speedups of 10 to 100 times for frequently accessed data. High hit rates, typically around 95% or better in well-designed systems, enable effective access times to drop from approximately 100 cycles for main memory fetches to 1-5 cycles when data resides in the cache. These improvements stem from the proximity and speed of on-chip storage, allowing processors to sustain high instruction throughput without stalling on memory operations.

Despite these benefits, cache hierarchies face limitations in certain workloads, particularly irregular access patterns common in sparse-data processing, where cache pollution and compulsory misses degrade hit rates. Cache pollution occurs when prefetched or irrelevant data evicts useful content, exacerbating misses in sparse or unpredictable datasets such as graph analytics or machine learning training on large-scale data. In multi-core environments, cache coherence protocols introduce additional overhead, consuming 5-20% of available bandwidth due to snoop traffic and invalidations needed to maintain data consistency across caches.

Key disadvantages include the inherent complexity of designing multi-level hierarchies, which requires balancing capacity, associativity, and replacement and inclusion policies to avoid pathologies like thrashing or increased miss penalties. Caches are also vulnerable to side-channel attacks, such as Spectre, which exploit timing differences in cache access to leak sensitive information across security boundaries. Furthermore, caches account for 20-30% of a processor's total power budget in modern designs, driven by dynamic access energy and static leakage in large on-chip arrays.

To address these limitations, software techniques like cache blocking—restructuring loops to maximize data reuse within cache-sized blocks—can reduce misses by up to 50% in matrix computations. Hardware prefetchers complement this by anticipating data needs and loading data proactively, improving hit rates in streaming workloads without excessive pollution when tuned properly. Benchmark results underscore these gains; for instance, adding or enlarging L3 caches in SPEC CPU suites has delivered 20-50% performance uplifts in memory-intensive workloads like scientific simulations, highlighting the hierarchy's role in overall system efficiency.
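
The cache-blocking technique mentioned above can be sketched as a tiled matrix multiply in C; the matrix and tile sizes are illustrative assumptions chosen so that a few tiles fit comfortably in a typical L1 or L2 cache.

```c
#include <stdio.h>

#define N     512
#define BLOCK 64   /* tile edge; illustrative choice so that a few
                      BLOCK x BLOCK tiles fit in cache */

static double A[N][N], B[N][N], C[N][N];

/* Blocked (tiled) matrix multiply: iterating over BLOCK x BLOCK tiles keeps
 * the active pieces of A, B, and C resident in cache, so each loaded line
 * is reused many times before eviction. N must be a multiple of BLOCK. */
static void matmul_blocked(void) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int k = kk; k < kk + BLOCK; k++) {
                        double a = A[i][k];              /* reused across j */
                        for (int j = jj; j < jj + BLOCK; j++)
                            C[i][j] += a * B[k][j];      /* unit-stride */
                    }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }
    matmul_blocked();
    printf("C[0][0] = %.0f\n", C[0][0]);   /* expect N = 512 */
    return 0;
}
```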

Modern Implementations

Intel Processors

Intel's cache hierarchy implementations in recent processors emphasize scalability, power efficiency, and optimizations tailored to client and server workloads. In the Arrow Lake family, released in 2024 as part of the Core Ultra 200S series, the redesigned hierarchy features Lion Cove performance cores (P-cores) with 3 MB of private L2 cache per core, marking a significant increase from prior generations to enhance single-threaded performance and reduce latency for core-local accesses. The shared L3 cache totals 36 MB across the chip, adopting a non-inclusive policy that allows more flexible placement without duplicating lower-level contents in the L3, thereby improving overall capacity utilization and hit rates in multi-core scenarios. This configuration supports up to 24 cores (8 P-cores and 16 efficiency cores) while prioritizing higher L2 capacity to mitigate bottlenecks in the shared L3.

For mobile and low-power applications, the Lunar Lake architecture, also launched in 2024 under the Core Ultra 200V series, optimizes the hierarchy for efficiency with a 12 MB shared L3 cache designed for reduced power consumption in thin-and-light devices. Each of the four Lion Cove P-cores includes 2.5 MB of private L2 cache, balancing capacity with energy efficiency to handle AI and productivity workloads. The design integrates on-package LPDDR5X memory, which minimizes latency by placing DRAM directly alongside the compute tiles, effectively extending the cache hierarchy with lower access times compared to traditional off-package configurations.

In server environments, the Emerald Rapids processors, introduced in 2023 as the 5th Generation Xeon Scalable family, maintain an inclusive L3 policy to ensure coherence across high-core-count systems, with up to 320 MB of shared L3 per socket supporting up to 64 cores. Each core features 2 MB of private L2 cache, enabling efficient local access, while the inclusive L3 duplicates lower-level contents for simplified snoop protocols in multi-socket setups.

Key innovations in Intel's recent designs include the mesh interconnect, which facilitates low-latency access to the distributed L3 cache slices by routing requests across a 2D grid of nodes in multi-core processors, improving scalability over ring-based topologies. Since Intel's 2023 architectures, shared caches have incorporated victim-cache elements to retain recently evicted lines, reducing miss rates and enhancing reuse in hybrid core configurations. Overall, Intel's shift toward tile-based designs enhances cache hierarchy scalability by modularizing components—such as compute, I/O, and memory tiles—connected via a high-bandwidth fabric, allowing easier customization and higher core densities without monolithic die constraints.

AMD Processors

AMD's cache hierarchy in its Zen-based processors emphasizes a chiplet-based design, in which multiple core complex dies (CCDs) are interconnected via Infinity Fabric to enable scalable core counts while maintaining coherence across shared L3 caches. This approach pairs private L1 and L2 caches per core for low-latency access with larger shared L3 caches per CCD to support high-performance workloads such as gaming and HPC.

In the Zen 3 architecture, introduced in 2020-2021 with the Ryzen 5000 and EPYC 7003 series, each CCD features a unified 32 MB L3 cache serving up to eight cores, operating under a largely exclusive policy in which data evicted from L2 is stored in L3 without duplication. Each core has a 512 KB L2 cache (totaling 4 MB per CCD), along with 32 KB L1 instruction and 32 KB L1 data caches. This design improved intra-CCD latency and bandwidth compared to Zen 2, enhancing single-threaded performance by unifying the L3 slice per CCD.

The Zen 4 architecture, launched in 2022-2023 with the Ryzen 7000 and EPYC 9004 series, features private 32 KB L1 instruction and 32 KB L1 data caches and 1 MB of private L2 per core, with 32 MB of L3 per CCD managed in a non-inclusive manner. The EPYC 9684X, a 96-core model with 3D V-Cache, provides 1,152 MB of total L3 across 12 CCDs (96 MB per CCD). Infinity Fabric ensures coherence across chiplets, supporting up to 128 PCIe 5.0 lanes and high-bandwidth inter-CCD communication.

Zen 5, released in 2024 with the Ryzen 9000 and EPYC 9005 series, refines this hierarchy with a 1 MB L2 cache per core featuring 16-way associativity and 64 bytes per cycle of bandwidth, paired with 48 KB L1 data and 32 KB L1 instruction caches. The L3 remains 32 MB per CCD under a non-inclusive policy, with reduced latency relative to Zen 4 to improve hit rates in multi-core scenarios. Each standard CCD supports up to eight cores, interconnected via Infinity Fabric through the I/O die for server scalability up to 192 cores in dense configurations.

A hallmark of AMD's design is the use of 3D die stacking for increased cache density, particularly in 3D V-Cache variants, where additional L3 layers are bonded directly to the CCD using through-silicon vias with sub-nanosecond added access latency. This enables configurations like 96 MB of L3 per CCD in gaming-focused models. Recent trends focus on expanding L3 capacity for cache-sensitive workloads, with V-Cache variants reaching roughly 144 MB of combined L2 and L3 cache in dual-CCD desktop processors to accommodate larger datasets and reduce memory accesses. As of 2025, previews of next-generation architectures continue to emphasize larger L3 capacities and efficiency improvements.

ARM-Based SoCs

ARM-based systems-on-chip (SoCs) integrate cache hierarchies optimized for power efficiency and performance, particularly in mobile and embedded applications. These designs often employ split L1 caches per core, private or cluster-shared L2 caches, and a shared system-level cache (SLC) or L3 to balance latency, bandwidth, and power across the CPU, GPU, and other accelerators. The unified memory architecture (UMA) common in many such SoCs reduces the need for complex coherency traffic by allowing all agents direct access to a common memory pool, streamlining data sharing in integrated environments.

Apple's M4 chip, introduced in 2024, exemplifies this approach with its 10-core CPU configuration of 4 performance (P) cores and 6 efficiency (E) cores. Each P core reportedly features a split L1 with 192 KB for instructions and 128 KB for data, while E cores have 128 KB instruction and 64 KB data caches; these are backed by 16 MB of shared L2 for the P-core cluster and a smaller L2 for the E cores, with no dedicated L3—instead, the system leverages up to 32 MB of SLC integrated into the UMA for low-latency access across the chip, including the GPU and neural engine. This non-inclusive design prioritizes SoC-wide coherency and efficiency, enabling seamless data movement in unified memory configurations of up to 128 GB.

The Apple M1 Ultra, launched in 2022, scales this hierarchy for high-end desktops with dual-die integration via UltraFusion. It includes 128 KB L1 instruction and 64 KB L1 data caches per E core, with P cores at a 192 KB/128 KB split; L2 is 12 MB per cluster (48 MB total for the P clusters across dies), complemented by a 96 MB SLC shared system-wide. This non-inclusive SLC serves as the final cache level before unified memory, optimized for integration by caching data for the CPU clusters, GPU (up to 64 cores), and media engines, reducing main memory accesses in bandwidth-intensive tasks.

In contrast, Qualcomm's Snapdragon 8 Gen 4 (also known as the Snapdragon 8 Elite), released in 2024, adopts a big.LITTLE-like structure with 2 Prime Oryon cores and 6 Performance cores, with private L2 caches of 2 MB per Prime core and 1 MB per Performance core. A shared 8 MB system-level cache provides last-level caching, supporting the GPU and accelerators in a UMA configuration, with emphasis on low-latency access for mobile workloads.

Key innovations in these SoCs include UMA, which minimizes hierarchy depth by unifying CPU, GPU, and I/O memory access, thereby reducing coherency overhead and misses in graphics and AI processing. The SLC, often exclusive of lower-level caches, delivers low-latency sharing for heterogeneous processing elements, as seen in Apple's designs where it caches up to 96 MB for multi-die scaling. Trends in these SoCs emphasize power efficiency through techniques like per-core and cluster-level power gating for L1/L2 caches, which cuts leakage in idle states while retaining state for quick resumption, and dynamic cache sizing to adjust capacity based on workload demands, enabling up to 95% power reduction in dormant modes without performance loss upon activation.

  77. [77]
    [PDF] Notices & Disclaimers - Intel
    with 3 MB L2 Cache. AI-based power management. VEC. INT. 16.67MHz. 8x. Wider predict. Across both allocation/rename & retire. Up to 36MB shared L3 cache on ...
  78. [78]
    Examining Intel's Arrow Lake, at the System Level - Chips and Cheese
    Dec 4, 2024 · So ever since Intel moved to non-inclusive cache design, their strategy is to emphasize the private caches at the expense of the L3, while ...
  79. [79]
    Intel Lunar Lake Technical Deep Dive - The CPU Cores: Part 1
    Rating 5.0 · Review by W1zzard (TPU)Jun 3, 2024 · On Arrow Lake, this core gets 3 MB (3,072 KB) of L2 dedicated L2 cache. The four P-cores on the Lunar Lake silicon share a 12 MB L3 cache.Missing: hierarchy details
  80. [80]
    Intel's Lunar Lake intricacies revealed in new high-resolution die shots
    May 17, 2025 · The TSMC N3B fabbed Compute Tile hosts four Lion Cove-based Performance (P) cores, sharing 12MB of L3 cache, with 2.5MB of private L2 cache per ...
  81. [81]
    What Is the Difference in Cache Memory Between CPUs for Intel ...
    This article contains information about L3 cache of an Intel® Xeon® Scalable Processor and why the value is higher than L1 cache.
  82. [82]
    [PDF] Reverse Engineering the Intel Cascade Lake Mesh Interconnect
    Generally, processors provide two levels of private caches (L1 and L2) for each core and a shared, lower-level L3 cache, otherwise known as a last-level cache ( ...
  83. [83]
    Previewing Meteor Lake at CES - by Chester Lam - Chips and Cheese
    Jan 11, 2024 · ... L2 region suggests Intel has changed up L2 cache's replacement policy. At L3, E-Cores on both Raptor Lake and Meteor Lake see 16.6 ns of latency ...Missing: victim | Show results with:victim
  84. [84]
    The 'Blank Sheet' that Delivered Intel's Most Significant SoC Design ...
    Jan 17, 2024 · “The tiles can be easily swapped, adapting chip capabilities to different requirements. A new, scalable fabric means all blocks within the SoC ...
  85. [85]
    [PDF] 4th Gen AMD EPYC Processor Architecture
    be augmented with 3D V-Cache technology to bring the L3 cache capacity to 96 ... Infinity Fabric interfaces, allowing for double the CPU-core-to-I/O.
  86. [86]
    Discussing AMD's Zen 5 at Hot Chips 2024 - by Chester Lam
    Sep 15, 2024 · AMD is proud that the L1 data cache has a 50% capacity and associativity increase while maintaining a 4-cycle load-to-use latency. L2 ...
  87. [87]
    AMD "Zen" Core Architecture
    With a core engine that supports simultaneous multi-threading for future-looking workloads; a leading-edge cache system and neural-net prediction, to help ...
  88. [88]
    AMD EPYC™ 9684X
    Realize exceptional time-to-results and energy efficiency for your business-critical applications with AMD EPYC™ 9004 Series processors for modern data centers.
  89. [89]
    5th Generation AMD EPYC™ Processors
    AMD EPYC 9005 Series processors include up to 192 “Zen 5” or “Zen 5c” cores with exceptional memory bandwidth and capacity. The innovative AMD chiplet ...AMD EPYC™ 9965 · AMD EPYC™ 9175F · AMD EPYC™ 9575F · Document 70353
  90. [90]
    AMD 3D V-Cache™ Technology
    2nd Gen AMD 3D V-Cache™ Technology · Up to 8-core “Zen 5” CCD · 64MB L3 Cache Die · Through Silicon Vias (TSVs) for Silicon-to-silicon Communication · Direct Copper ...
  91. [91]
    Exploiting Exclusive System-Level Cache in Apple M-Series SoCs ...
    Apr 18, 2025 · In this paper, we target the System-Level Cache (SLC) of Apple M-series SoCs, which is exclusive to higher-level CPU caches.Missing: innovations unified power gating
  92. [92]
    Apple M1 Pro and M1 Max: Specs, Performance, Everything We Know
    Oct 25, 2021 · The M1 Pro and M1 Max, professional-grade processors debuting in the 14-inch and 16-inch MacBook Pros. Here's everything you need to know about the M1 Pro and ...
  93. [93]
    Apple MacBook Pro "M4 Max" 14 CPU/32 GPU 14" Specs
    Oct 30, 2024 · Each performance core is believed to also have a 32 MB L2 cache and each efficiency core is believed to have a 4 MB L2 cache.Missing: hierarchy SLC
  94. [94]
    Analyzing the memory ordering models of the Apple M1
    Each processor encompasses separate L1 instruction (L1i) and L1 data (L1d) caches, while an L2 cache is associated with each cluster. Information about a shared ...
  95. [95]
    Snapdragon 8 Elite Mobile Platform - Qualcomm
    Oct 21, 2024 · Qualcomm Oryon CPU is custom-built with the fastest mobile CPU speeds up to 4.47 GHz · Largest shared cache in mobile industry.
  96. [96]
    Qualcomm Snapdragon 8 Elite (Gen 4): specs and benchmarks
    CPU ; L1 cache, 192 KB ; L2 cache, 12 MB ; L3 cache, 8 MB ; Process, 3 nanometers ; TDP (Sustained Power Limit), 8 W.VS · Apple A18 Pro vs Qualcomm... · Vivo iQOO Neo 10 Pro Plus · OnePlus 13
  97. [97]
    Snapdragon 8 Elite: Everything You Need To Know - Forbes
    Oct 24, 2024 · The Snapdragon 8 Elite cores feature 192KB L1 cache per Prime core and 128KB per Performance core for a total of 12MB L2 cache per user. As ...
  98. [98]
  99. [99]
    [PDF] 27 DPCS: Dynamic Power/Capacity Scaling for SRAM Caches in the ...
    Our mechanism combines multilevel voltage scaling with optional architectural support for power gating of blocks as they become faulty at low voltages. A static ...