Cache hierarchy
Cache hierarchy refers to the multi-level structure of cache memories in modern computer architectures, in which smaller, faster caches are organized in successive layers between the processor and main memory to store frequently accessed data and instructions, thereby bridging the performance gap between the CPU and slower DRAM.[1] This organization exploits the principles of temporal locality (reuse of recently accessed data) and spatial locality (access to data near recently used locations) to minimize average access latency.[2] The primary purpose of the cache hierarchy is to compensate for the widening disparity in access speeds between processors and main memory, a gap that has grown significantly since the 1980s because CPU clock rates have improved far faster than DRAM access times.[1] By placing caches at multiple levels, systems balance speed, capacity, and cost, using expensive but fast static RAM (SRAM) for the upper levels and transitioning to larger, cheaper dynamic RAM (DRAM) further down.[3] Data is transferred in fixed-size blocks between levels, with hardware-managed policies determining placement, replacement, and coherence to ensure efficient operation.[2]
In typical implementations, the hierarchy consists of three primary on-chip cache levels: L1, L2, and L3. The L1 cache, closest to the processor cores, is the smallest (often 32–64 KB per core) and fastest (1–4 cycles of latency), and is frequently split into separate instruction (L1i) and data (L1d) caches to optimize performance.[3] The L2 cache is larger (256 KB to 2 MB per core) and slightly slower (5–10 cycles), serving as a backup for L1 misses while remaining on-chip for low latency.[1] The L3 cache, shared among multiple cores, is the largest (up to 64 MB or more) and slowest of the caches (20+ cycles), acting as a last on-chip buffer before main memory, whose latency is around 100 cycles or higher.[2]
This design has evolved with multi-core processors, where shared lower-level caches such as the L3 improve inter-core data sharing but introduce challenges of coherence and contention.[3] Overall, the cache hierarchy significantly enhances system performance, with hit rates often exceeding 90% in the upper levels for well-behaved workloads, though its effectiveness depends on application locality patterns.[1]
Introduction
Definition and Purpose
A cache hierarchy is a multi-tiered memory system in computer architecture, consisting of multiple levels of cache memory arranged between the processor and main memory. Each level serves as a smaller, faster buffer for the next larger and slower one, typically including L1, L2, and sometimes L3 caches, with progressively increasing capacity and decreasing access speed. This organization exploits data locality to store copies of frequently used data closer to the CPU, thereby minimizing the time required to retrieve data from slower main memory.[4]
The primary purpose of a cache hierarchy is to bridge the significant performance gap between the rapid execution speeds of modern processors and the comparatively slower access times of dynamic random-access memory (DRAM). By positioning fast static random-access memory (SRAM)-based caches on or near the chip, the hierarchy enables the processor to access commonly used instructions and data with minimal latency, effectively masking the delays inherent in deeper memory layers. This design is essential in contemporary computing systems, where processor speeds have improved far more quickly than memory access times for decades.[5]
Key benefits include substantial reductions in average memory access latency, higher instruction throughput, and improved overall system efficiency, particularly in performance-critical applications such as scientific computing and real-time processing. For instance, hits in the upper cache levels can deliver access times orders of magnitude faster than main memory fetches, leading to measurable gains in processor utilization. Each cache level functions as a buffer for its successor, with the L1 cache being the smallest and fastest, often integrated directly on the processor die and split into separate instruction and data caches. The effectiveness of this approach relies on the principle of locality of reference, where programs tend to reuse recently accessed data.[4][5]
Historical Evolution
The concept of cache memory as a fast "slave" store supplementing slower main memory was first formalized by Maurice Wilkes in 1965, laying the theoretical groundwork for hierarchical memory systems.[6] The first commercial implementation appeared in the IBM System/360 Model 85 mainframe in 1968, which introduced a 16 KB high-speed buffer storage operating as a cache to accelerate access to the larger main memory.[7] During the 1970s and 1980s, single-level cache designs predominated in processor architectures, exemplified by the experimental IBM 801 minicomputer project initiated in 1975 and prototyped by 1980, which integrated separate instruction and data caches to support its reduced instruction set computing approach.[8]
The 1990s marked a pivotal shift toward multi-level cache hierarchies to address growing performance demands from increasing clock speeds and application complexity. Intel's Pentium Pro processor, released in 1995, pioneered packaging a secondary L2 cache of up to 1 MB on a separate die alongside the CPU die, distinct from the on-chip L1 cache, to extend hit rates for larger working sets. By the late 1990s, on-die integration of L2 caches became feasible, as demonstrated by AMD's K6-III processor in 1999, which featured 256 KB of L2 cache directly on the die running at core speed to reduce latency. A related milestone was the transition from asynchronous caches, common in early mainframes such as the System/360, to synchronous designs aligned with the processor clock, enabling tighter pipelining and higher frequencies beginning in the mid-1980s with RISC processors.
In the 2000s, the rise of multi-core processors drove the adoption of tertiary L3 caches, often shared across cores to optimize coherence and bandwidth. Intel's Nehalem microarchitecture, introduced in 2008 with the Core i7 processors, integrated up to 8 MB of shared inclusive L3 cache on-die, facilitating efficient data sharing in multi-threaded environments.[9] The 2010s saw further refinements, including non-inclusive L3 policies that enhance effective capacity by avoiding duplication of L1 and L2 data; Intel's server-class Skylake-SP (Xeon Scalable) processors in 2017 implemented a non-inclusive shared L3 cache of up to 38.5 MB, reducing snoop traffic in multi-core setups.[10] Integration with multi-core designs accelerated in this period, with caches evolving to support coherence protocols such as Intel's MESIF for shared resources.
The 2020s have emphasized scaling cache sizes for data-intensive workloads, particularly AI and machine learning, where large datasets benefit from reduced memory latency. AMD's EPYC 9004 "Genoa" series, launched in 2022, exemplifies this with up to 384 MB of shared L3 cache per socket in its 96-core configurations, boosting performance in AI training by minimizing off-chip accesses.[11] Subsequent advancements include the AMD EPYC 9005 "Turin" series, launched in October 2024 with up to 192 Zen 5c cores and 384 MB of L3 cache per socket, alongside Intel's Granite Rapids Xeon processors in 2024, featuring up to 128 cores and enlarged L3 caches exceeding 300 MB in high-end models, continuing to balance capacity, latency, and power within multi-level hierarchies.[12][13]
Fundamentals
Locality of Reference
Locality of reference is a fundamental principle in computer architecture describing the tendency of programs to access a relatively small subset of their address space repeatedly over short periods, which enables efficient memory hierarchy designs such as caches. The principle, formalized by Peter J. Denning in his analysis of program behavior, underpins the effectiveness of caching by predicting that memory references cluster in both time and space, reducing the need to fetch data from slower main memory.[14]
Temporal locality refers to the likelihood that a recently accessed memory location will be reused soon afterward, often observed in program constructs such as loops, where the same variables or data structures are repeatedly referenced. For instance, in iterative algorithms, control flow repeatedly accesses the same code or data elements, creating reuse patterns that caches can exploit by retaining recently used items. Spatial locality, by contrast, arises when programs access data located near previously referenced addresses, as during sequential traversal of arrays or structures stored contiguously in memory. These behaviors stem from the structured nature of typical programs, where execution flows through localized regions of code and data.[14]
The theoretical foundation of locality derives from empirical analyses of program execution traces, which reveal that most references occur within a "working set" of active pages or data blocks, as modeled by Denning to approximate a program's demands over time windows. This has implications under Amdahl's law for memory-bound applications: poor locality amplifies the fraction of execution time dominated by slow memory accesses, limiting overall speedup despite faster processors, whereas strong locality mitigates this by minimizing effective memory latency. In cache hierarchies, these principles enable hit rates exceeding 90% in the smaller, faster levels such as L1 caches for typical workloads; benchmarks on modern processors demonstrate 95–97% L1 hit rates due to clustered references.[14][15][16]
A classic example is matrix multiplication, where temporal locality manifests as each matrix element being reused O(n) times across the nested loops, and spatial locality appears in row- or column-wise traversals that access contiguous blocks. Techniques such as hardware prefetching extend these benefits by anticipating spatial patterns and loading data proactively, further boosting hit rates in the hierarchy without altering program behavior. Benchmarks such as the SPEC CPU suites consistently show locality driving hit rates above 90% in L1 caches for compute-intensive tasks, underscoring its role in achieving performance close to ideal memory speeds.[17][16]
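The performance impact of spatial locality can be demonstrated directly. The following C sketch (an illustrative example, not drawn from the cited sources; the matrix size and timing method are arbitrary choices) sums the same large array twice, once row by row and once column by column. Because the row-major sweep consumes each fetched cache line completely before moving on, it typically runs several times faster than the column-major sweep on common hardware, even though both perform identical arithmetic.

    /* Illustrative sketch: contrasts row-major and column-major traversal of
     * the same matrix to show how spatial locality affects cache behavior.
     * N is an assumed size chosen so the data exceeds typical L1/L2/L3 capacity. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 4096                      /* 4096 x 4096 doubles, about 128 MB */

    static double *a;

    /* Row-major sweep: consecutive iterations touch adjacent addresses,
     * so each fetched cache line is fully used (good spatial locality). */
    static double sum_rows(void) {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += a[i * N + j];
        return s;
    }

    /* Column-major sweep: consecutive iterations jump N*sizeof(double) bytes,
     * so almost every access lands in a different cache line (poor locality). */
    static double sum_cols(void) {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += a[i * N + j];
        return s;
    }

    static double seconds(clock_t t0, clock_t t1) {
        return (double)(t1 - t0) / CLOCKS_PER_SEC;
    }

    int main(void) {
        a = malloc((size_t)N * N * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < (size_t)N * N; i++)
            a[i] = 1.0;

        clock_t t0 = clock();
        double s1 = sum_rows();
        clock_t t1 = clock();
        double s2 = sum_cols();
        clock_t t2 = clock();

        printf("row-major:    sum=%.0f  %.2f s\n", s1, seconds(t0, t1));
        printf("column-major: sum=%.0f  %.2f s\n", s2, seconds(t1, t2));
        free(a);
        return 0;
    }

Compiled with modest optimization (for example, -O1) so the compiler does not interchange the loops, the timing difference between the two functions reflects the cache-line reuse described above rather than any difference in the arithmetic performed.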
Cache Levels and Organization
The cache hierarchy in modern processors is typically organized into multiple levels, each with distinct size, speed, and scope, to optimize performance by exploiting locality of reference. The first level, L1 cache, is the smallest and fastest, usually ranging from 32 to 64 KB per core, with access latencies of 1 to 4 cycles.[18] It sits closest to the CPU execution units and is commonly split into separate instruction (I-cache) and data (D-cache) components to allow simultaneous instruction fetches and data loads and stores.[19]
The second level, L2 cache, serves as a backup to the L1 and is larger, typically 256 KB to 2 MB per core, with access latencies of 10 to 20 cycles.[20] It is often implemented as a unified cache holding both instructions and data, and is usually private to each core to reduce contention in multi-core systems.[21] This level balances capacity and speed, capturing data that does not fit in L1 but is still frequently accessed.
The third level, L3 cache or last-level cache (LLC), is the largest in the on-chip hierarchy, shared among multiple cores, with sizes from 8 to 100 MB or more and access latencies of 20 to 50 cycles.[22][23] It acts as a communal resource that filters misses from the L1 and L2 caches before they reach main memory, whose latency is much higher (typically 100–300 cycles).[24] In contemporary designs, the L1, L2, and L3 caches are predominantly on-chip for reduced latency, though earlier systems placed the L2 or L3 off-chip.
Cache organization within each level commonly relies on associativity to map memory blocks to cache lines: direct-mapped (one possible location per block), set-associative (multiple locations within a set), or fully associative (any location).[25] The hierarchy can be inclusive (the lower, larger levels contain all data held in the upper levels), exclusive (no overlap between levels), or non-inclusive (partial overlap allowed), which influences how data propagates through the levels.[26] Overall, these levels form a progressive filtering mechanism in which a miss at one level triggers a search in the next, minimizing expensive main memory accesses.[27]
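To make the mapping concrete, the following C sketch (a simplified model; the 32 KB, 8-way, 64-byte-line geometry is an assumed example rather than a figure from the cited sources) shows how a set-associative cache splits an address into offset, set-index, and tag fields. A direct-mapped cache is the one-way special case, and a fully associative cache is the case with a single set.

    /* Illustrative sketch: address decomposition for a set-associative cache. */
    #include <stdint.h>
    #include <stdio.h>

    struct cache_geom {
        uint32_t size_bytes;   /* total capacity */
        uint32_t line_bytes;   /* cache line (block) size */
        uint32_t ways;         /* associativity; 1 = direct-mapped */
    };

    /* Number of sets = capacity / (line size * associativity). */
    static uint32_t num_sets(struct cache_geom g) {
        return g.size_bytes / (g.line_bytes * g.ways);
    }

    /* Integer log2; x is assumed to be a power of two. */
    static unsigned log2u(uint32_t x) {
        unsigned n = 0;
        while (x > 1) { x >>= 1; n++; }
        return n;
    }

    /* Split an address into tag, set index, and byte offset for geometry g. */
    static void decode(struct cache_geom g, uint64_t addr) {
        unsigned off_bits = log2u(g.line_bytes);
        unsigned idx_bits = log2u(num_sets(g));
        uint64_t offset = addr & ((1ULL << off_bits) - 1);
        uint64_t index  = (addr >> off_bits) & ((1ULL << idx_bits) - 1);
        uint64_t tag    = addr >> (off_bits + idx_bits);
        printf("addr 0x%llx -> tag 0x%llx, set %llu, offset %llu\n",
               (unsigned long long)addr, (unsigned long long)tag,
               (unsigned long long)index, (unsigned long long)offset);
    }

    int main(void) {
        /* Hypothetical 32 KB, 8-way L1 data cache with 64-byte lines:
           32768 / (64 * 8) = 64 sets -> 6 index bits, 6 offset bits. */
        struct cache_geom l1d = { 32 * 1024, 64, 8 };
        decode(l1d, 0x7ffd1234);
        decode(l1d, 0x7ffd1234 + 64);   /* the next line maps to the next set */
        return 0;
    }

On a lookup, the hardware selects the set from the index bits and compares the stored tags of all ways in that set in parallel; a match is a hit, and otherwise the request falls through to the next level of the hierarchy.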
Multi-Level Design
Average Access Time
In multi-level cache hierarchies, the average access time (AAT), often referred to as the average memory access time (AMAT), represents the expected time to retrieve data from the memory system, accounting for hits and misses across all cache levels and main memory. This metric quantifies the overall performance of the hierarchy by weighting the access time of each level by its hit probability, producing a single value that reflects how effectively the system bridges the latency gap between the processor and main memory.[28][29]
For a three-level cache hierarchy, the AAT is calculated recursively as
\text{AAT} = h_1 t_1 + (1 - h_1) \left[ h_2 t_2 + (1 - h_2) \left[ h_3 t_3 + (1 - h_3) t_m \right] \right]
where h_i denotes the hit rate at cache level i (with 0 \leq h_i \leq 1), t_i is the access latency at level i (typically in processor clock cycles), and t_m is the access time of main memory. The formula assumes that hit rates are independent across levels and that a miss at one level propagates to the next.[29][30]
To derive this, begin at the lowest level: the effective access time at level 3 (L3 cache) is \text{AMAT}_3 = h_3 t_3 + (1 - h_3) t_m, since an L3 miss requires fetching from main memory. For level 2 (L2 cache), the miss penalty is the AMAT of L3, yielding \text{AMAT}_2 = h_2 t_2 + (1 - h_2) \text{AMAT}_3. Extending this to level 1 (L1 cache) gives the overall AAT as \text{AMAT}_1 = h_1 t_1 + (1 - h_1) \text{AMAT}_2, which expands to the full expression above. For a simplified two-level hierarchy, the formula reduces to \text{AAT} = h t_c + (1 - h) t_m, where h is the cache hit rate, t_c is the cache access time, and t_m is the main memory access time; this highlights how even small improvements in hit rate can dramatically lower the AAT because of the high penalty of memory accesses.[29][3][31]
Several factors influence the AAT, primarily the hit and miss ratios at each level, which depend on workload characteristics such as locality, and the latency differences between levels, where upper-level caches (e.g., L1) prioritize low latency over capacity. For instance, consider a two-level system with L1 hit rate h_1 = 0.95, L1 access time t_1 = 1 cycle, L2 hit rate h_2 = 0.90 (conditional on an L1 miss), L2 access time t_2 = 14 cycles, and main memory time t_m = 200 cycles. The L2 AMAT is 0.90 \times 14 + 0.10 \times 200 = 12.6 + 20 = 32.6 cycles, so the overall AAT is 0.95 \times 1 + 0.05 \times 32.6 \approx 2.58 cycles; with more realistic multi-level hit rates and latencies, processor designs often see AAT values of 5–10 cycles, underscoring the hierarchy's role in keeping access times close to L1 latency.[32][3]
In practice, the AAT is measured using hardware performance counters during benchmarks to capture hit and miss events, enabling computation via the formulas above. Tools such as the Linux perf utility record these counters (e.g., L1-dcache-load-misses) for specific workloads, providing empirical hit rates and latencies to evaluate and tune hierarchy performance without relying solely on simulation.[33][34]
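The recurrence is straightforward to evaluate numerically. The following C sketch (a minimal illustration using the figures from the worked example above, not a tool from the cited sources) folds per-level hit rates and latencies into a single AMAT value.

    /* Minimal sketch of the AMAT recurrence described in the text. */
    #include <stdio.h>

    /* Levels are ordered from L1 outward; hit[i] is the hit rate conditional on
       reaching level i, lat[i] its access latency in cycles, and mem_lat the
       main-memory access time that terminates the recursion. */
    static double amat(const double *hit, const double *lat, int levels,
                       double mem_lat) {
        if (levels == 0)
            return mem_lat;
        /* AMAT_i = h_i * t_i + (1 - h_i) * AMAT_{i+1} */
        return hit[0] * lat[0]
             + (1.0 - hit[0]) * amat(hit + 1, lat + 1, levels - 1, mem_lat);
    }

    int main(void) {
        /* Two-level example from the text: h1 = 0.95, t1 = 1, h2 = 0.90,
           t2 = 14, t_m = 200 cycles. */
        double hit[] = { 0.95, 0.90 };
        double lat[] = { 1.0, 14.0 };
        printf("AMAT = %.2f cycles\n", amat(hit, lat, 2, 200.0));  /* prints 2.58 */
        return 0;
    }

In practice, the hit rates fed into such a calculation come from hardware counters; for example, perf stat -e L1-dcache-loads,L1-dcache-load-misses ./workload reports the L1 data-cache events mentioned above, although the exact set of available event names varies by processor.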