
CPU cache

A CPU cache is a small, high-speed memory integrated into the central processing unit (CPU) that temporarily stores copies of data or instructions frequently accessed from the main memory, thereby reducing the average latency of memory operations and improving overall performance. The primary purpose of the CPU cache is to bridge the significant speed gap between the fast CPU core and the comparatively slower main memory (such as DRAM), by leveraging two key principles of program behavior: temporal locality, where recently accessed data is likely to be used again soon, and spatial locality, where data near a recently accessed location is also likely to be referenced. This approach allows the processor to fulfill most memory requests directly from the cache—a "cache hit"—in a fraction of the time required for a main memory access, often achieving effective access times close to the cache's own latency of just a few clock cycles.

Modern CPU caches are structured as a multi-level hierarchy to balance speed, size, and cost, typically including Level 1 (L1), Level 2 (L2), and Level 3 (L3) caches, with each successive level being larger in capacity but slower in access time and farther from the CPU core. The L1 cache, the closest to the execution units, is usually the smallest (e.g., 32–64 KB per core) and divided into separate instruction (L1i) and data (L1d) caches to optimize for different access patterns; L2 caches are moderately larger (e.g., 256 KB–1 MB per core) and often private to each core; while L3 caches, which can span several megabytes to tens of megabytes, are typically shared across multiple cores in multi-core processors to facilitate data sharing and reduce inter-core communication overhead. Data is transferred between levels and main memory in fixed-size units called cache lines (usually 64 bytes), and cache management involves strategies for mapping addresses to cache locations—such as direct-mapped, set-associative, or fully associative organizations—to handle placement, identification via tags, and replacement policies (e.g., least recently used) when space is full.

In multi-core systems, cache hierarchies also address coherence challenges to ensure consistent data views across cores, often using protocols like MESI (Modified, Exclusive, Shared, Invalid) to maintain synchronization without excessive overhead. Advances in cache design continue to focus on increasing hit rates through larger inclusive or non-inclusive structures, prefetching mechanisms, and optimizations for specific workloads, making the cache a critical determinant of CPU efficiency in everything from embedded devices to supercomputers.

Fundamentals

Definition and Purpose

A CPU cache is a small, high-speed memory integrated near the processor core, designed to store copies of frequently accessed data and instructions from the larger but slower main memory (DRAM). This proximity allows the cache to deliver data in a fraction of the time required for main memory access, exploiting the principle of locality—where programs tend to reuse recently accessed data or nearby addresses—to minimize average memory latency.

The fundamental purpose of the CPU cache emerged from the historical disparity in speed between processors and main memory, a gap that has widened dramatically since the 1980s. Early motivations, as seen in the IBM System/360 Model 85—the first commercial computer with a cache, introduced in 1968—stemmed from the need to accelerate effective memory performance, where a main memory cycle took several processor cycles (about 13 times longer), limiting overall throughput. Today, this gap exceeds 100-fold, with modern CPU clock cycles in the sub-nanosecond range contrasting against DRAM latencies of 50–100 nanoseconds, compelling multilevel cache hierarchies to sustain processor utilization.

By reducing average memory access time through fast cache hits, the CPU cache enhances instruction execution throughput, enabling processors to operate closer to their peak speeds while also improving energy efficiency by avoiding costly main memory fetches. Cache performance is quantified by the hit rate—the fraction of memory requests satisfied directly from the cache—and the miss rate, defined as 1 minus the hit rate, which determines the frequency of slower main memory interventions. Typical hit rates in well-designed systems range from 90–99%, underscoring the cache's critical role in modern computing.

Basic Components

The fundamental building blocks of a CPU cache consist of tag storage, a data array, valid and dirty bits per cache entry, and control logic to manage these elements. These components work together to enable efficient temporary storage of data from main memory, leveraging the principle of locality to bridge the speed gap between the processor and slower memory systems.

Tag storage captures the upper bits of the physical address for each cached block, allowing the cache to distinguish which specific memory region a given entry represents within its set or line. This field is essential for verifying whether a requested address matches a stored block during access attempts. The data array provides the primary storage for the actual contents of the cached blocks and is typically implemented using static random-access memory (SRAM) for its high speed and low latency, suitable for on-chip integration. Each entry in the data array holds a fixed-size block of data, commonly 64 bytes in modern designs, though sizes such as 32 or 128 bytes appear in various architectures depending on the generation and optimization goals.

Associated with each cache entry are control bits, including a valid bit that signals whether the block contains usable data (set to 1 upon loading a block from memory) and a dirty bit that indicates whether the block has been modified since loading (set to 1 on writes in write-back caches). The valid bit ensures that only initialized entries are considered during lookups, while the dirty bit tracks changes to facilitate efficient write-back to lower levels only when necessary. Control logic encompasses hardware elements such as comparators that match incoming address tags against stored tags and finite state machines that orchestrate updates to the cache state, including bit flips and data movements within the array. These circuits ensure reliable operation by handling edge cases such as the initial cache state in which all valid bits are unset.

Cache designers must balance block sizes, as larger blocks amortize tag storage overhead across more data bytes—reducing the relative cost of tags and control bits—but they also amplify miss penalties by requiring more data transfer time from main memory on misses. For instance, doubling the block size from 32 to 64 bytes can halve the tag overhead per byte stored but roughly doubles the data transfer time incurred on compulsory or capacity misses.
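To make the roles of these fields concrete, the following sketch models a single cache entry in C. It is an illustrative data structure only: the 64-byte line size, the field widths, and the names are assumptions for this example, not the layout of any particular processor's hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 64   /* assumed block size in bytes */

/* One cache entry: control bits, tag storage, and the data block itself. */
struct cache_line {
    bool     valid;            /* 1 = entry holds usable data              */
    bool     dirty;            /* 1 = modified since load (write-back)     */
    uint64_t tag;              /* upper address bits identifying the block */
    uint8_t  data[LINE_SIZE];  /* the cached block, backed by on-chip SRAM */
};

/* A hit requires both a tag match and a set valid bit. */
static inline bool line_matches(const struct cache_line *line, uint64_t tag)
{
    return line->valid && line->tag == tag;
}
```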

Operation Principles

Cache Entries and Addressing

In CPU caches, data is stored and retrieved in fixed-size units known as cache lines or blocks, which typically range from 32 to 64 bytes in modern processors to balance transfer efficiency and overhead. Each cache line holds a contiguous block of memory along with metadata, such as a tag for identification and a valid bit to indicate usability. The block offset, comprising the lowest \log_2(block size) bits of the address, specifies the particular byte or word within that line.

To map a memory address to a cache entry, the address is decomposed into three primary fields: the block offset (lowest bits), the index (middle bits), and the tag (highest bits). The block offset, as noted, selects the byte within the line. The index bits determine the specific cache line or set where the data resides, with the number of index bits equal to \log_2(number of lines or sets). The tag bits, consisting of the remaining upper address bits, uniquely identify the originating memory block to ensure correct matching during access. This decomposition enables efficient hardware lookup by first using the index to locate candidate entries, then comparing the tag for validation.

For example, in a direct-mapped cache of 4 KB capacity with 32-byte lines and a 32-bit address, the block offset requires 5 bits ($2^5 = 32$). The cache holds 128 lines ($4096 / 32 = 128$), so the index requires 7 bits ($2^7 = 128$). The tag then occupies the remaining 20 bits ($32 - 5 - 7 = 20$). The tag is obtained by extracting the high-order bits of the full address, excluding those allocated to the index and offset fields, allowing the cache to distinguish between different memory blocks that may map to the same line. In set-associative caches, the index selects a set of lines, with tag comparisons performed across all ways within that set.
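The decomposition in the example above can be expressed directly in code. The sketch below extracts the offset, index, and tag fields for the hypothetical 4 KB direct-mapped cache with 32-byte lines and 32-bit addresses; the constants and the sample address are chosen to mirror the worked example, not any specific processor.

```c
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE  4096u    /* 4 KB capacity (example)      */
#define LINE_SIZE     32u    /* 32-byte cache lines          */
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)   /* 128 lines   */

#define OFFSET_BITS  5u      /* log2(32)  */
#define INDEX_BITS   7u      /* log2(128) */

int main(void)
{
    uint32_t addr = 0x12345678u;

    uint32_t offset = addr & (LINE_SIZE - 1);                   /* low 5 bits  */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_LINES - 1);  /* next 7 bits */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);       /* top 20 bits */

    printf("addr=0x%08x -> tag=0x%05x index=%u offset=%u\n",
           (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```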

Access Process

The read access process in a CPU cache initiates when the processor issues a memory read request accompanied by a memory address. The cache controller first decodes this address, partitioning it into tag bits, index bits, and block offset bits to locate the potential data. The index bits select a specific set of cache lines, and the tag bits are simultaneously compared against the stored tags in all ways (associative entries) within that set using content-addressable memory (CAM) or equivalent parallel comparators. If a tag match occurs and the associated valid bit is asserted—confirming the cache line holds current, usable data—a hit is detected, and the requested bytes are fetched from the data storage array using the offset bits before being delivered to the processor. This valid bit check serves as a basic validity safeguard to prevent serving stale or uninitialized data. The entire hit resolution for an L1 cache typically requires 1–4 clock cycles, accounting for parallel tag lookup, valid bit inspection, and data array access. On a miss, where no matching valid tag is found, the request propagates to the next memory level (e.g., the L2 cache or main memory), potentially triggering a block fetch and replacement.

Write access mirrors the read process in address decoding and tag comparison but diverges in data handling upon a hit. Following tag match and valid bit confirmation, the incoming data updates the targeted bytes in the cache line's data array, selected by the offset bits. The specific write policy then governs further actions: in write-through caches, the data is simultaneously written to the next memory level, while in write-back caches, the line is marked dirty for deferred propagation. Misses during writes may allocate a new line (write-allocate) or bypass the cache (no-write-allocate), forwarding the operation directly to lower levels without altering the cache state. Write hit latency remains comparable to reads at 1–4 cycles for L1, as the core tag lookup and data update occur swiftly.
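The read path described above (index selection, parallel tag comparison across the ways, and a valid-bit check) can be sketched as a functional model in C. The 4-way geometry, 64-byte lines, and helper names below are assumptions for illustration; real hardware performs the per-way comparisons in parallel rather than in a loop.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS        4
#define SETS        64
#define LINE_SIZE   64
#define OFFSET_BITS 6    /* log2(64) */
#define INDEX_BITS  6    /* log2(64) */

struct line { bool valid; uint64_t tag; uint8_t data[LINE_SIZE]; };
struct set  { struct line way[WAYS]; };

static struct set cache[SETS];

/* Functional model of a read access: returns true on a hit and copies the
 * requested byte into *out; on a miss the request would be forwarded to the
 * next level (not modeled here). */
static bool cache_read_byte(uint64_t addr, uint8_t *out)
{
    uint64_t offset = addr & (LINE_SIZE - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & (SETS - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* Hardware compares all ways in parallel; software iterates. */
    for (int w = 0; w < WAYS; w++) {
        struct line *l = &cache[index].way[w];
        if (l->valid && l->tag == tag) {   /* tag match + valid bit => hit */
            *out = l->data[offset];
            return true;
        }
    }
    return false;   /* miss: fetch from L2/DRAM, choose a victim, refill */
}

int main(void)
{
    uint8_t byte;
    uint64_t addr = 0x1f40;

    bool hit1 = cache_read_byte(addr, &byte);   /* cold cache: miss */

    /* Simulate the refill a controller would perform after the miss. */
    uint64_t index = (addr >> OFFSET_BITS) & (SETS - 1);
    cache[index].way[0].valid = true;
    cache[index].way[0].tag   = addr >> (OFFSET_BITS + INDEX_BITS);

    bool hit2 = cache_read_byte(addr, &byte);   /* now a hit */
    return (!hit1 && hit2) ? 0 : 1;
}
```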

Design Policies

Replacement Policies

When a cache cannot accommodate a new cache line because the relevant set is full, a replacement policy determines which existing line to evict to make space. These policies aim to minimize cache misses by exploiting patterns in memory access, such as temporal locality, where recently accessed data is likely to be accessed again soon. Common policies balance prediction accuracy against implementation complexity in hardware.

The Least Recently Used (LRU) policy selects for eviction the cache line that has gone the longest without being accessed. It maintains an ordering of lines based on recency, often using a stack or counters to track access order, making it well suited to workloads exhibiting strong temporal locality. Implementing true LRU requires significant hardware overhead, such as O(n) state updates per access for an n-way associative cache, which becomes prohibitive for large associativities.

The First-In-First-Out (FIFO) policy treats the set as a queue, evicting the oldest line inserted regardless of its usage history. This approach is simple and incurs minimal hardware cost, typically just a counter or pointer, but it performs poorly on accesses with temporal locality since it ignores recency and can evict frequently used lines prematurely. Random replacement selects a line for eviction using a pseudo-random source, such as a linear-feedback shift register. It requires negligible storage and update overhead compared to LRU or FIFO, avoiding complex bookkeeping, but its performance is suboptimal because it does not exploit access patterns, leading to higher miss rates in locality-heavy workloads.

To address LRU's complexity, approximations such as pseudo-LRU (PLRU) are commonly used in modern CPU caches. PLRU employs a tree-based structure or bit fields to approximate recency ordering at reduced cost; for example, a tree-PLRU for an 8-way set uses only 7 bits per set instead of full LRU ordering state. Evaluations show PLRU achieves miss rates close to those of LRU (often within 1–5%) while using far less area and power, making it preferable for high-associativity designs. A theoretical reference point is Belady's optimal policy, which evicts the line whose next access is farthest in the future. This requires complete knowledge of future accesses, rendering it impractical for real hardware but invaluable for establishing bounds in analysis. Well-chosen policies like LRU and PLRU can reduce miss rates by 20–50% compared to random or FIFO replacement in typical workloads, though exact gains depend on access patterns.
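As an illustration of how cheaply tree-PLRU can be implemented, the sketch below maintains the 7 tree bits for one 8-way set and derives a victim from them. The bit-ordering convention and function names are assumptions chosen for this example; actual designs differ in detail but follow the same tree idea.

```c
#include <stdint.h>
#include <stdio.h>

#define WAYS 8   /* associativity; tree-PLRU needs WAYS - 1 = 7 bits per set */

/* Tree bits for one set, packed in one byte; bit i corresponds to node i.
 * Convention: node bit == 0 -> the pseudo-LRU victim lies in the left subtree,
 *             node bit == 1 -> it lies in the right subtree. */
typedef struct { uint8_t plru; } plru_set_t;

/* Record an access to `way`: set the bits on the root-to-leaf path so that
 * they now point away from the just-used way. */
static void plru_touch(plru_set_t *set, unsigned way)
{
    unsigned node = 0;   /* start at the root (node 0) */
    for (unsigned level_bit = WAYS >> 1; level_bit > 0; level_bit >>= 1) {
        unsigned right = (way & level_bit) != 0;   /* does the path go right? */
        if (right)
            set->plru &= (uint8_t)~(1u << node);   /* victim hint: left  */
        else
            set->plru |= (uint8_t)(1u << node);    /* victim hint: right */
        node = 2 * node + 1 + right;               /* descend to that child */
    }
}

/* Follow the tree bits from the root to pick the pseudo-LRU victim way. */
static unsigned plru_victim(const plru_set_t *set)
{
    unsigned node = 0, way = 0;
    for (unsigned level = 0; level < 3; level++) {   /* log2(8) = 3 levels */
        unsigned right = (set->plru >> node) & 1u;
        way = (way << 1) | right;
        node = 2 * node + 1 + right;
    }
    return way;
}

int main(void)
{
    plru_set_t set = { 0 };
    /* Touch only ways 0..3 (the left half of the tree); the PLRU victim is
     * then guaranteed to come from the untouched right half (ways 4..7). */
    for (unsigned w = 0; w < 4; w++)
        plru_touch(&set, w);
    printf("victim way = %u\n", plru_victim(&set));   /* prints 4 here */
    return 0;
}
```

In the example, only ways 0 through 3 are touched, so the tree bits point away from them and the selected victim comes from the untouched half, showing how one bit per internal node approximates recency ordering.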

Write Policies

Write policies in CPU caches dictate how write operations from the processor are managed, balancing factors such as data consistency, latency, and bandwidth usage between the cache and lower levels of the memory hierarchy. These policies primarily address whether and when updates are propagated to main memory, as well as how cache lines are handled on write misses. Common strategies include write-through, write-back, and write-around approaches, often combined with allocation decisions.

The write-through policy updates both the cache and the backing main memory synchronously for every write operation. This ensures immediate consistency between the cache and main memory, simplifying recovery in case of failures and avoiding the need for complex tracking mechanisms. However, it incurs higher write latency and increased demands on the memory bus due to frequent writes, making it suitable for systems where data integrity is paramount over speed.

In contrast, the write-back policy (also known as copy-back or write-deferred) updates only the cache line immediately, deferring the write to main memory until the line is evicted or explicitly flushed. A dirty bit is set in the cache entry to flag modifications, indicating that the data differs from main memory. This approach reduces bus traffic by batching writes and avoiding repeated updates to the same block, thereby improving overall performance in write-intensive workloads. Drawbacks include the risk of data loss during power failures or crashes before eviction, as well as increased complexity in maintaining coherence across multiple cache levels. On eviction, the dirty bit triggers the write to main memory as part of the replacement process.

Write-around, or write-through with no-write-allocation, bypasses the cache entirely for write misses by updating only main memory directly. This prevents cache pollution from temporary or streaming write data that is unlikely to be read soon, conserving cache space for read-heavy accesses. It is particularly effective for workloads with large sequential writes, though it offers no caching benefit for subsequent reads of the written data unless a separate read miss populates the cache.

Allocation policies complement write strategies by determining whether to fetch a block into the cache on a write miss. Write-allocate brings the entire block from main memory into the cache before performing the update, and is commonly paired with write-back to support future accesses to the same line. No-write-allocate, often used with write-through or write-around, skips this fetch and writes directly to main memory, avoiding the overhead of loading irrelevant data and reducing initial latency for one-time writes.

The dirty bit is a critical one-bit flag per cache line, used primarily in write-back policies to denote whether the line's contents have been modified since being loaded from main memory. When a write occurs, the bit is set, ensuring that the updated data is propagated to main memory only when necessary, such as during eviction or cache flushes. This optimizes bandwidth by distinguishing modified lines from clean ones but adds overhead for bit maintenance and coherence checks.
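The sketch below contrasts the two propagation strategies on a write hit and shows where the dirty bit comes into play. It is a minimal functional model under stated assumptions: the line structure and the next_level_write helper are illustrative stand-ins, not the interface of any real cache controller.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64

struct line { bool valid, dirty; uint64_t tag; uint8_t data[LINE_SIZE]; };

/* Stand-in for forwarding data to L2/DRAM (assumed helper, does nothing here). */
static void next_level_write(uint64_t addr, const void *buf, int len)
{
    (void)addr; (void)buf; (void)len;
}

enum write_policy { WRITE_THROUGH, WRITE_BACK };

/* Handle a write *hit* of `len` bytes at byte `offset` within a line. */
static void write_hit(struct line *l, uint64_t addr, uint64_t offset,
                      const uint8_t *src, int len, enum write_policy policy)
{
    memcpy(&l->data[offset], src, len);      /* update the cached copy        */

    if (policy == WRITE_THROUGH)
        next_level_write(addr, src, len);    /* propagate immediately         */
    else
        l->dirty = true;                     /* defer: write back on eviction */
}

/* On eviction under write-back, only dirty lines are written downstream. */
static void evict(struct line *l, uint64_t line_addr)
{
    if (l->valid && l->dirty)
        next_level_write(line_addr, l->data, LINE_SIZE);
    l->valid = false;
    l->dirty = false;
}

int main(void)
{
    struct line l = { .valid = true, .tag = 0x1 };
    uint8_t v = 42;
    write_hit(&l, 0x1040, 0, &v, 1, WRITE_BACK);   /* marks the line dirty  */
    evict(&l, 0x1040);                             /* triggers one writeback */
    return 0;
}
```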

Associativity Mechanisms

Direct-Mapped Caches

In a direct-mapped cache, each memory block from main memory is mapped to exactly one specific cache line, selected by a portion of the address known as the index bits. This mapping is determined by dividing the physical address into three fields: the tag, the index, and the block offset. The index bits select the line, while the tag bits are compared against the tag stored in that line to verify whether it corresponds to the requested block; a full tag match with the valid bit set is required for a hit. If the tags match, the data from the selected line is output via a simple multiplexer; otherwise, the access is a miss.

The hardware implementation of a direct-mapped cache is straightforward, requiring only one tag entry per cache line and a single comparator for tag matching during access. The design uses a multiplexer to route the data from the indexed line to the output, minimizing the complexity of the cache controller. As a result, direct-mapped caches occupy less silicon area than higher-associativity designs.

Key advantages of direct-mapped caches include fast lookup times, as only one tag comparison is needed, leading to lower access latency and reduced power consumption due to fewer parallel comparisons and simpler wiring. The minimal overhead also makes them suitable for high-speed, area-constrained environments such as primary caches in microprocessors. However, direct-mapped caches suffer from high conflict miss rates, where multiple memory blocks map to the same cache line and thrash each other out, particularly in workloads whose accesses stride across the cache size, such as matrix operations or linked-list traversals. This can degrade overall performance more than in set-associative caches, which mitigate such conflicts by allowing multiple lines per index.

For example, in a 64 KB direct-mapped L1 cache with 32-byte lines, the cache holds 2048 lines, requiring 11 index bits to select a line (since $2^{11} = 2048$); the remaining upper bits form the tag for comparison, assuming a typical 32-bit or larger address in which the low 5 bits are the block offset.
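The following sketch models a small direct-mapped cache and shows how two addresses that share an index repeatedly evict each other, producing the conflict misses described above. The 64 KB, 32-byte-line geometry follows the example in this section; everything else is an illustrative assumption.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE  (64 * 1024)              /* 64 KB       */
#define LINE_SIZE   32
#define NUM_LINES   (CACHE_SIZE / LINE_SIZE) /* 2048 lines  */
#define OFFSET_BITS 5                        /* log2(32)    */
#define INDEX_BITS  11                       /* log2(2048)  */

static struct { bool valid; uint32_t tag; } lines[NUM_LINES];

/* Touch an address; return true on hit, false on miss (and fill the line). */
static bool cache_access(uint32_t addr)
{
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_LINES - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);

    if (lines[index].valid && lines[index].tag == tag)
        return true;                          /* single comparator: hit  */

    lines[index].valid = true;                /* miss: displace occupant */
    lines[index].tag   = tag;
    return false;
}

int main(void)
{
    /* Two addresses exactly one cache size (64 KB) apart share an index,
     * so alternating between them misses every time (thrashing). */
    uint32_t a = 0x00010000, b = a + CACHE_SIZE;
    for (int i = 0; i < 4; i++)
        printf("A:%s  B:%s\n", cache_access(a) ? "hit " : "miss",
                               cache_access(b) ? "hit " : "miss");
    return 0;
}
```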

Set-Associative Caches

Set-associative caches organize the cache memory into multiple sets, where each set contains a fixed number of cache lines, known as ways. For an n-way set-associative cache, the lower bits of the memory address serve as an index to select one specific set from the total number of sets, while the upper tag bits are compared simultaneously against the tags stored in all n lines within that set to identify a potential hit. This parallel comparison allows efficient lookup, as only the lines in the indexed set need to be examined rather than the entire cache.

The primary benefit of set-associative caches lies in their ability to mitigate the conflict misses that plague direct-mapped caches, where multiple memory blocks map to the same line and thrash each other out. By permitting a block to reside in any of the n ways within its designated set, these caches provide greater flexibility in block placement, leading to lower overall miss rates for a given cache size. This design strikes a balance between the simplicity and speed of direct-mapped caches and the high hit rates of fully associative caches, offering improved performance without the prohibitive hardware cost of searching the entire cache on every access.

In terms of hardware implementation, tag matching in set-associative caches typically employs a comparator for each way in the selected set, often combined with multiplexer trees to route the matching way's data to the output. Alternatively, content-addressable memory (CAM) can be used for the tag array within each set to perform simultaneous comparisons, though this increases power and area overhead compared to standard SRAM-based tags with dedicated comparators. To further optimize access speed and reduce energy, way prediction techniques forecast the most likely way containing the requested data, allowing the cache to initially access only that way's data array while verifying the tag in parallel; a misprediction incurs a small penalty but avoids probing all ways upfront.

Variations on the standard set-associative design include skewed-associative caches, which apply different hash functions or bit permutations to the index for each way, dispersing conflicting blocks across sets to further reduce misses without increasing associativity. Another approach is the pseudo-associative cache, which begins with a direct-mapped lookup and, upon a miss, sequentially probes one or more alternative locations using a secondary hash or a fixed index transformation, approximating higher associativity with minimal additional hardware. However, increasing the degree of associativity introduces scalability challenges, as higher n requires more parallel comparisons and wider multiplexers, which raise access latency, silicon area, and dynamic power dissipation. For instance, moving from 2-way to 8-way associativity can double the hit latency in some designs due to the longer critical path in the tag-matching logic. Victim caches, small fully associative buffers that store recently evicted lines, serve as a complementary extension that captures many of the residual conflict misses at low overhead.
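To illustrate how a skewed-associative cache disperses conflicts, the sketch below derives a different index for each way by folding tag bits into the index with XOR. The specific hash is an arbitrary illustrative choice, not the function used in any shipping design.

```c
#include <stdint.h>
#include <stdio.h>

#define WAYS        2
#define SETS        256          /* sets per way     */
#define OFFSET_BITS 6            /* 64-byte lines    */
#define INDEX_BITS  8            /* log2(256)        */

/* Conventional set-associative index: the same set in every way. */
static uint32_t index_conventional(uint32_t addr)
{
    return (addr >> OFFSET_BITS) & (SETS - 1);
}

/* Skewed index: each way mixes in a different slice of the tag, so blocks
 * that collide in one way tend to land in different sets of the other way. */
static uint32_t index_skewed(uint32_t addr, unsigned way)
{
    uint32_t idx = (addr >> OFFSET_BITS) & (SETS - 1);
    uint32_t tag = addr >> (OFFSET_BITS + INDEX_BITS);
    /* Illustrative hash: fold a way-dependent portion of the tag into the index. */
    return (idx ^ (tag >> (4 * way))) & (SETS - 1);
}

int main(void)
{
    uint32_t a = 0x00012340, b = 0x00052340;   /* same conventional index */
    printf("conventional: a->%u  b->%u\n",
           (unsigned)index_conventional(a), (unsigned)index_conventional(b));
    for (unsigned w = 0; w < WAYS; w++)
        printf("skewed way %u: a->%u  b->%u\n",
               w, (unsigned)index_skewed(a, w), (unsigned)index_skewed(b, w));
    return 0;
}
```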

Performance Aspects

Cache Misses and Hits

A cache hit occurs when the requested data or instruction is found in the cache, allowing the processor to access it directly without retrieving it from a lower level of the memory hierarchy. Hits are categorized by the type of access: instruction hits, where fetched instructions are present in the cache; read hits, where data for a load is available in the cache; and write hits, where the target line for a store is located in the cache, enabling an immediate update according to the cache's write policy. Hits typically take 1–4 clock cycles, depending on the cache level, significantly faster than accessing main memory.

In contrast, a cache miss happens when the requested item is not present in the cache, requiring the processor to fetch it from a slower memory level, such as another cache or DRAM. Misses are classified into three main types: compulsory misses, which occur on the first access to a memory block that has never been referenced before; capacity misses, resulting from the cache being too small to hold all actively used blocks during program execution; and conflict misses, arising in set-associative or direct-mapped caches when multiple blocks compete for the same cache set, leading to evictions even if space is available elsewhere. Conflict misses can be influenced by replacement policies, such as least recently used (LRU), which determine which block is evicted during contention.

The miss penalty represents the additional clock cycles incurred to resolve a miss, often ranging from 10 to 100 cycles for an L1 miss that escalates to DRAM access, though it can exceed 200 cycles in modern systems as processor speeds continue to outpace memory latency. Upon a miss, traditional blocking caches cause the processor to stall, halting instruction execution until the data is fetched. Non-blocking caches, however, permit the processor to continue executing independent instructions during miss resolution, a technique known as hit-under-miss, which overlaps computation with memory operations. Modern processors further mitigate miss impacts through hardware prefetching, which anticipates and loads likely future data into the cache ahead of time, and out-of-order execution, which rearranges instructions to proceed past dependent loads while misses are serviced. These approaches reduce effective stall times without altering the fundamental hit/miss classification.

Performance Metrics

The performance of a CPU cache is evaluated using several key metrics that quantify its efficiency in reducing memory access latency and improving overall system throughput. The hit time represents the latency incurred for a successful cache access, typically measured in processor clock cycles. For instance, the level-1 (L1) cache often achieves a hit time of 1 cycle in idealized models or simple designs, allowing rapid data retrieval without stalling the pipeline. In contrast, the miss penalty is the additional time required to service a cache miss, which involves fetching data from a lower level of the memory hierarchy, such as main memory, and can range from tens to hundreds of cycles depending on the system configuration.

A fundamental metric integrating these factors is the Average Memory Access Time (AMAT), which provides a comprehensive measure of effective memory latency. The formula for AMAT is given by:

\text{AMAT} = \text{hit time} + (\text{miss rate} \times \text{miss penalty})

where the miss rate is the fraction of memory accesses that result in a cache miss. This equation captures how improvements in hit rate or reductions in miss penalty directly lower the average access time, thereby enhancing processor performance.

Another important metric is the effective bandwidth, which assesses the cache's data throughput by accounting for both hits and misses. It is calculated as the total data transferred divided by the total time, incorporating the high-speed delivery on hits (limited by cache port bandwidth) and the slower fetches on misses (constrained by lower-level memory bandwidth). In bandwidth-bound workloads, effective bandwidth can be significantly lower than the cache's peak bandwidth due to miss-induced stalls, emphasizing the need to minimize miss rates for sustained performance.

Cache performance metrics are typically measured using hardware performance monitoring counters (PMCs), which track events such as cache misses and hits at runtime. For example, Intel processors provide PMCs to count L1 and L2 cache miss events, enabling precise calculation of miss rates without simulation overhead. Simulation tools, such as those based on trace-driven models, complement hardware measurements by evaluating design variations under diverse workloads. A key trade-off in cache design involves balancing size against hit time: larger caches reduce capacity-related misses and improve overall hit rates, but they increase hit time due to longer access paths and higher power consumption. Studies indicate that cache sizes between 32 and 128 KB often optimize this trade-off, as further increases yield diminishing returns in miss rate reduction while proportionally raising cycle times.
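As a worked example of the AMAT formula, the short program below evaluates it for a single cache level. The figures (a 1-cycle hit, a 100-cycle penalty, and miss rates of 3% and 1%) are assumptions chosen to fall within the ranges quoted in this section, not measurements of a specific processor.

```c
#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    double hit_time     = 1.0;    /* cycles for an L1 hit (illustrative)  */
    double miss_penalty = 100.0;  /* cycles to reach DRAM (illustrative)  */

    /* A small change in miss rate has a large effect on the average. */
    printf("miss rate 3%%: AMAT = %.1f cycles\n", amat(hit_time, 0.03, miss_penalty));
    printf("miss rate 1%%: AMAT = %.1f cycles\n", amat(hit_time, 0.01, miss_penalty));
    return 0;
}
```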

Advanced Features

Cache Hierarchy

Modern CPUs typically employ a multi-level cache hierarchy to balance speed, size, and cost, consisting of at least three levels: L1, L2, and L3 caches. The L1 cache is the smallest and fastest, usually split into separate instruction (L1i) and data (L1d) caches, each around 32 KB per core, and is private to each core for minimal latency in accessing frequently used instructions and data. The L2 cache is larger, typically 256 KB to 1 MB per core, and can be either private to each core or shared among a few cores, serving as an intermediate buffer with slightly higher latency than L1. The L3 cache, often called the last-level cache (LLC), is the largest, ranging from several MB to tens of MB, and is shared among all cores on a die or socket to provide a unified pool for less frequently accessed data.

Cache hierarchies adopt inclusion policies to manage data duplication across levels, with inclusive and exclusive being the primary variants. In an inclusive hierarchy, all data in the higher levels (L1 and L2) is also present in the lower level (L3), ensuring that the LLC contains a superset of upper-level contents, which simplifies coherence but may waste space due to duplication. Conversely, an exclusive hierarchy prohibits overlap, so data held in L1 or L2 is not also stored in L3, maximizing effective capacity but complicating tracking of cache states. Many processors, such as recent Intel Xeon designs, use non-inclusive policies for the LLC to balance these trade-offs, allowing some but not all upper-level data in the LLC.

At the L1 level, caches are often separate for instructions and data to optimize fetch and load/store operations, while L2 and L3 are typically unified, handling both types of accesses in a single structure to improve utilization in mixed workloads. In multi-core processors, L1 and L2 remain private to each core for low-latency access, whereas the shared L3 facilitates inter-core data sharing and reduces off-chip traffic, enhancing overall system coherency. To maintain consistency across the hierarchy and cores, snoop protocols are employed, where caches monitor (or "snoop") bus or interconnect transactions to detect modifications and invalidate or update local copies accordingly. In multi-level setups, snoop filters track cache line locations in private caches, filtering unnecessary probes to the shared L3 and minimizing overhead in multi-socket systems.
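The AMAT formula from the performance metrics section extends naturally to a multi-level hierarchy by treating each level's miss penalty as the average access time of the level below it. The sketch below applies that recurrence to illustrative L1/L2/L3 latencies and local miss rates; all of the numbers are assumptions for the example.

```c
#include <stdio.h>

struct level {
    const char *name;
    double hit_time;    /* cycles to hit at this level                 */
    double miss_rate;   /* local miss rate: misses / accesses at level */
};

int main(void)
{
    /* Illustrative three-level hierarchy backed by DRAM. */
    struct level levels[] = {
        { "L1", 4.0,  0.05 },
        { "L2", 12.0, 0.20 },
        { "L3", 40.0, 0.30 },
    };
    double dram_latency = 200.0;   /* cycles (illustrative) */

    /* Work bottom-up: AMAT(i) = hit_time(i) + miss_rate(i) * AMAT(i+1). */
    double below = dram_latency;
    for (int i = 2; i >= 0; i--)
        below = levels[i].hit_time + levels[i].miss_rate * below;

    printf("effective average memory access time: %.1f cycles\n", below);
    return 0;
}
```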

Specialized Cache Types

Specialized caches address specific performance bottlenecks in processors by tailoring their structure and mechanisms to particular types of data or operations, often complementing the standard cache hierarchy.

The victim cache, introduced by Norman Jouppi, is a small, fully associative buffer that captures cache lines recently evicted from a direct-mapped primary cache, thereby reducing conflict misses by providing quick access to these "victims" without fetching from lower memory levels. Simulations in the original work showed that a victim cache of just 1 to 5 entries could eliminate up to 85% of conflict misses in direct-mapped caches for benchmark workloads. This design is particularly effective where associativity is kept low to minimize hardware cost and latency.

Trace caches store sequences of decoded instructions, known as traces, rather than raw instruction bytes, enabling faster fetch in out-of-order processors by delivering contiguous micro-operations (μops) across branch boundaries. Pioneered in 1996 by Rotenberg et al., this approach improves fetch bandwidth by making non-contiguous instructions appear sequential, with evaluations indicating up to 30% higher fetch rates in wide-issue superscalar designs. Intel implemented trace caching in the Pentium 4 processor's NetBurst microarchitecture, where the execution trace cache served as the primary L1 instruction cache, holding up to 12K μops and achieving hit rates comparable to 8–16 KB conventional instruction caches.

Micro-op caches, or μop caches, hold pre-decoded sequences of micro-operations to bypass the instruction decoders in subsequent executions, reducing front-end latency and power in complex instruction set architectures like x86. First deployed in Intel's Sandy Bridge microarchitecture, this cache stores 1.5K μops per core, with subsequent architectures expanding capacity to 4K–6K μops, allowing hot code regions to avoid repeated decoding and improving overall throughput by 5–10% in integer workloads. The structure is typically set-associative and integrated near the decoders, prioritizing frequently executed paths to minimize power and delay in the front end.

The branch target buffer (BTB) is a specialized cache that stores the target addresses of branch instructions alongside prediction information, enabling rapid resolution of branches by prefetching instructions from predicted targets without full address decoding. As described by Lee and Smith, BTBs function as a small cache indexed by the branch instruction's address, with early designs using 512–2048 entries to achieve misprediction penalties under 10 cycles in pipelined processors. Modern implementations often employ multi-level BTBs with varying associativities to balance accuracy and latency, significantly boosting instruction fetch efficiency in branch-heavy workloads.

Write coalescing caches merge multiple small or non-contiguous writes into larger, contiguous blocks before propagating them to lower memory levels, thereby improving bandwidth utilization and reducing write overhead in write-intensive scenarios. This technique, explored in last-level cache designs, maximizes in-cache coalescing of overwrites, with studies showing up to 40% reductions in writeback traffic for applications with fragmented stores. It avoids unnecessary line evictions by buffering writes until a full cache line is accumulated.

Smart and reconfigurable caches enable dynamic reallocation of cache resources, such as shifting ways or sets between uses based on workload demands, to optimize hit rates without fixed partitioning. Proposed in reconfigurable cache architectures, this adaptability allows processors to shift capacity toward compute-bound tasks, yielding savings of 20–50% in media-processing benchmarks through runtime partitioning. Intel's Smart Cache extends the idea to multi-core shared L2/L3 levels, dynamically assigning cache space to active cores, while core-level reconfiguration remains prominent in research on single-thread efficiency.
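A functional sketch of the victim-cache idea described at the start of this section: on a primary-cache miss, a small fully associative buffer of recently evicted lines is probed before going to the next level. The four-entry size, the FIFO replacement within the buffer, and the function names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4   /* small, fully associative (illustrative size) */

struct vc_line { bool valid; uint64_t line_addr; };

static struct vc_line victim_cache[VC_ENTRIES];
static unsigned vc_next;   /* simple FIFO replacement within the victim cache */

/* Called on a primary-cache miss: returns true if the line is found among
 * recent victims, in which case it can be swapped back without going to L2. */
static bool victim_cache_probe(uint64_t line_addr)
{
    for (int i = 0; i < VC_ENTRIES; i++)
        if (victim_cache[i].valid && victim_cache[i].line_addr == line_addr)
            return true;   /* conflict miss absorbed by the victim cache */
    return false;
}

/* Called whenever the primary cache evicts a line. */
static void victim_cache_insert(uint64_t line_addr)
{
    victim_cache[vc_next].valid = true;
    victim_cache[vc_next].line_addr = line_addr;
    vc_next = (vc_next + 1) % VC_ENTRIES;
}

int main(void)
{
    victim_cache_insert(0x40);                 /* primary cache evicts line 0x40 */
    return victim_cache_probe(0x40) ? 0 : 1;   /* a later miss on 0x40 hits here */
}
```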

Implementation Techniques

Address Translation in Caches

In modern processors, CPU caches must interface with virtual memory systems, where programs operate on virtual addresses (VAs) that are translated to physical addresses (PAs) by the memory management unit (MMU). This translation introduces complexities in cache design, as caches need to handle both virtual and physical addressing efficiently to minimize latency while ensuring correctness. The primary approaches are virtually indexed, physically tagged (VIPT) and physically indexed, physically tagged (PIPT) schemes, each balancing speed, size, and correctness differently.

VIPT caches, commonly used for L1 instruction and data caches, generate the cache index from bits of the virtual address while storing the physical address in the tag field for comparison. This design allows the cache lookup to proceed in parallel with the translation lookaside buffer (TLB) lookup, as the virtual index bits are available immediately from the processor, overlapping the translation with the cache probe and reducing overall access time. However, VIPT requires that the index and block offset bits fit within the page offset so that they are identical in the virtual and physical addresses; this limits the size of each cache way to the page size and constrains total cache capacity unless aliasing is handled separately.

In contrast, PIPT caches use the full physical address for both indexing and tagging, necessitating a complete VA-to-PA translation before any cache access can occur. This approach eliminates the aliasing problems inherent in virtual addressing but incurs higher latency, as the TLB lookup (and any page-table walk) must complete before the cache probe, making it more suitable for larger, lower-level caches such as L2 or L3, where consistency is prioritized over speed.

A key challenge for virtually indexed caches is the synonym problem, where multiple virtual addresses map to the same physical location, potentially leading to multiple cache entries holding the same data and causing inconsistencies during updates or coherence operations. This arises because different processes or threads may alias the same physical page via distinct virtual mappings, and solutions typically involve flushing the cache on context switches or using hardware or software invalidation mechanisms to ensure only one valid copy remains. A related issue occurs when distinct physical pages have addresses sharing the same low-order bits used for indexing, resulting in avoidable conflicts within the same set. This is mitigated through page coloring, where the operating system allocates physical pages such that their low-order page-number bits (colors) match the corresponding virtual page bits, preventing unrelated pages from colliding in the cache.

The TLB plays a crucial role in efficient address translation for caches by holding recent VA-to-PA mappings from the page tables, allowing translations to be resolved quickly without frequent main memory accesses. In VIPT designs, the TLB provides the physical tag bits concurrently with the virtual index, enabling hit detection shortly after translation; a TLB miss, however, triggers a page table walk that can stall the cache access. Some advanced designs incorporate virtual address hints to further optimize translation paths in hybrid caching schemes.
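The VIPT size constraint can be checked with simple arithmetic: the index and block offset bits must fit within the page offset, which is equivalent to requiring that each cache way be no larger than a page. The sketch below performs that check for two illustrative configurations; the 4 KB page size and the cache parameters are assumptions.

```c
#include <stdbool.h>
#include <stdio.h>

/* A VIPT cache avoids synonyms in the index when index + offset bits fit
 * inside the page offset, i.e. when (cache size / associativity) <= page size. */
static bool vipt_synonym_free(unsigned cache_bytes, unsigned ways,
                              unsigned page_bytes)
{
    return cache_bytes / ways <= page_bytes;
}

int main(void)
{
    unsigned page = 4096;   /* 4 KB pages (illustrative) */

    /* 32 KB, 8-way: each way is 4 KB, so virtual and physical index bits match. */
    printf("32 KB 8-way: %s\n",
           vipt_synonym_free(32 * 1024, 8, page) ? "VIPT-safe" : "needs extra handling");

    /* 64 KB, 4-way: each way is 16 KB, so 2 index bits lie above the page offset. */
    printf("64 KB 4-way: %s\n",
           vipt_synonym_free(64 * 1024, 4, page) ? "VIPT-safe" : "needs extra handling");
    return 0;
}
```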

Multi-Port and On-Chip Designs

Multi-ported caches enable simultaneous access to cached data from multiple pipeline stages or execution units in modern processors, supporting higher throughput in pipelined and out-of-order designs. Dual-ported caches, for instance, allow one port for instruction fetches and another for data operations, while triple-ported variants accommodate additional simultaneous reads from execution or load/store units, though they increase area by 20–50% compared to single-ported equivalents. These designs mitigate bottlenecks in wide-issue processors, where issue widths exceed four instructions per cycle, but require careful banking to avoid port conflicts.

On-chip caches are predominantly implemented using static random-access memory (SRAM) cells, with the conventional 6-transistor (6T) cell providing stable single-port access suitable for basic read/write operations in L1 and L2 caches. For multi-ported needs, 8-transistor (8T) cells are employed, incorporating separate read and write ports to enhance stability and speed under contention, albeit at the cost of about 20% higher area per bit than 6T cells. Compared to off-chip dynamic RAM (DRAM), on-chip SRAM offers 5–10x lower latency and no refresh overhead, but exhibits 10–100x lower density and higher static power consumption per bit due to always-on leakage.

To optimize access times and area, cache implementations often separate the tag RAM from the data array, using dedicated high-speed SRAM for tags that store address identifiers while the larger data array holds the actual cache lines. This allows parallel tag lookups without activating the full data array until a hit is confirmed, reducing access energy by up to 25% in set-associative caches and enabling smaller, faster tag storage optimized for the comparison logic.

In contemporary multi-core processors, caches are tightly integrated on-chip within system-on-chip (SoC) designs, with shared L3 caches scaling to 256–1,152 MB in server chips like the AMD EPYC 9005 series (as of 2025) to support inter-core data sharing and reduce off-chip traffic. Advanced process nodes, such as 3 nm, enable denser integration, as seen in chips with up to 96 MB of stacked L3 cache per core complex die using technologies like AMD's 3D V-Cache (introduced in 2022), balancing capacity gains with fabrication feasibility. Fabricating large on-chip caches poses challenges in yield and thermal management, as increasing array sizes amplifies defect probabilities, potentially dropping yields below 80% for multi-megabyte structures without redundancy techniques. Heat dissipation intensifies with cache size, given SRAM's leakage-dominated power profile, leading to hotspots that can elevate die temperatures by 20–30°C and necessitate advanced cooling such as microchannel liquid systems in high-performance servers.

Historical Development

Early Innovations

The concept of cache memory originated in the mid-1960s with Maurice Wilkes' proposal for "slave memories," a small, fast auxiliary store acting as a buffer between the processor and a larger, slower main memory to exploit locality and reduce access times. In his seminal paper, Wilkes described a slave memory of around 32,000 words operating as a high-speed buffer for frequently accessed data, emphasizing dynamic allocation to minimize effective access time while keeping costs low compared to building the entire main store from fast memory. This idea laid the groundwork for modern caching by introducing the notion of a hierarchical system in which the slave holds the most active portions of the program, fetched on demand from the main store.

The first commercial implementation of a CPU cache appeared in the IBM System/360 Model 85, announced in 1968 and shipped starting in 1969. This mainframe featured a high-speed buffer storage, or cache, of 16 KB standard capacity (expandable to 32 KB), implemented with fast monolithic integrated circuits to bridge the speed gap between the processor and the core main memory. The cache employed a set-associative design with dynamic sector assignment, in which 16 sectors of 1 KB each were mapped to main storage blocks, using an activity list for replacement to optimize hit rates and performance in scientific workloads. Studies leading to its adoption showed hit rates exceeding 90% for typical programs, significantly boosting effective processor speed without requiring fully fast main storage.

In the 1970s, IBM extended caching principles to support virtual addressing with the introduction of translation lookaside buffers (TLBs) in the System/370 architecture, announced in 1970. The TLB served as a dedicated cache for page table entries, storing recent virtual-to-real address mappings to accelerate dynamic address translation (DAT) and reduce overhead in virtual memory systems. Models such as the 155, 158, 165, and 168 integrated TLB hardware to handle paging without excessive table walks, enabling efficient multiprogramming on mainframes with up to 8 MB of memory. This innovation was crucial for the System/370's virtual storage capabilities, as it kept translation latency to a few cycles per access.

By the early 1980s, caches began appearing in microprocessors, with the Motorola 68020—announced in 1982 and released in 1984—introducing the first on-chip instruction cache in the 68k family. This 256-byte cache improved performance by prefetching instructions ahead of execution, achieving hit rates that reduced average fetch times in pipelined operation. The design addressed the growing processor-memory speed gap in 32-bit systems, using a simple direct-mapped organization for cost-effective integration on a single chip.

Early cache implementations faced significant challenges, primarily the high cost of fast memory technologies such as magnetic core in the 1960s and early static RAM (SRAM) in the 1970s, which limited cache sizes to kilobytes and restricted adoption to high-end systems. Core memory, while reliable, was expensive at $1–$5 per bit and slow (around 1–2 μs access), making larger caches uneconomical; early SRAM offered sub-μs speeds but at 10–100 times the cost per bit of dynamic RAM. Additionally, the lack of standardized design methodologies complicated optimization, as architects balanced associativity, replacement policies, and block sizes without established benchmarks, often relying on simulations of specific workloads.

Evolution in Processor Architectures

The evolution of CPU caches within major instruction set architectures (ISAs) since the late 1980s has been driven by the need to balance increasing core counts, power efficiency, and performance demands. In the x86 architecture, Intel's 80486 microprocessor, introduced in 1989, marked the first integration of an on-chip cache, featuring an 8 KB unified instruction and data cache to reduce memory latency compared to off-chip designs. This unified approach stored both instructions and data in a single structure, using a write-through policy for simplicity in early systems. By 1993, the Pentium processor advanced this design with a split level-1 (L1) cache, comprising separate 8 KB instruction and 8 KB data caches, which allowed simultaneous access to instructions and data, mitigating bottlenecks in superscalar execution. In modern x86 implementations, such as Intel's Xeon Scalable processors of the 2020s, multi-level hierarchies have scaled dramatically, with Sapphire Rapids (4th Gen, 2023) offering up to 112.5 MB of shared L3 cache per socket to support up to 60 cores (as in the Xeon Platinum 8490H), enhancing data sharing in multi-threaded workloads. Parallel developments in RISC architectures included early-1990s designs such as the MIPS R4000 (1991) with separate 8 KB instruction and data caches, and the PowerPC 601 (1993) with an integrated on-chip cache, influencing subsequent designs.

Parallel advancements in the ARM ISA have emphasized mobile and embedded efficiency, evolving from single-level to sophisticated multi-level caches. The StrongARM SA-110, released in 1996 by Digital Equipment Corporation (whose StrongARM business was later acquired by Intel), introduced a split cache with 16 KB for instructions and 16 KB for data, enabling better pipelining in low-power devices such as PDAs. Subsequent Cortex-A series processors, starting with the Cortex-A8 in 2005, adopted multi-level hierarchies, including private L1 caches per core and a shared L2 cache configurable up to 1 MB, with later models like the Cortex-A76 incorporating support for improved coherence in multi-core setups. In recent developments as of 2025, Apple's M-series chips, such as the M5, integrate AI-optimized cache architectures within their unified memory system, featuring a large shared system-level cache and unified memory bandwidth exceeding 500 GB/s, tailored for on-device acceleration via the Neural Engine.

The RISC-V ISA, as an open-standard alternative, has seen rapid integration in commercial multi-core designs. SiFive's Performance P870 series (2023 onward) employs coherent multi-core clusters with private L1 and L2 caches per core, alongside a shared L3 cache of up to several MB, supporting up to 32 cores for scalable embedded and datacenter applications.

Key milestones in cache evolution across ISAs include the adoption of private caches per core in early multi-core processors around 2005, as seen in Intel's Pentium D and AMD's Athlon 64 X2, with later designs adopting shared L2 caches around 2006 (e.g., the Core 2 Duo) to minimize data replication and improve inter-core communication. In 2011, Intel's Sandy Bridge introduced the micro-op (L0) cache, a specialized L1-level structure holding up to 1.5K decoded μops to bypass the decode stage for hot code paths, reducing power and latency in the front end. Cache coherency protocols like MESI (Modified, Exclusive, Shared, Invalid), standard in x86 since the early 1990s, have been essential for these multi-core designs, ensuring consistent data visibility across cores via snooping mechanisms.

Current research explores new cache paradigms to address emerging workloads. Approximate caching techniques, such as those in Proximity (2025), exploit similarity between queries to reuse cache entries with controlled error rates, reducing database hits by up to 50% in retrieval-augmented generation systems. Experimental optical caches, like Pho$ (2022), propose hybrid opto-electronic hierarchies using photonic interconnects for shared last-level caches, potentially slashing latency and energy in multi-core setups by integrating optical memory cells.

  73. [73]
    Difference between SRAM and DRAM - GeeksforGeeks
    Jul 12, 2025 · SRAM stores data in voltage using transistors, is faster and used for cache. DRAM stores data in electric charges using capacitors, is slower ...Static Random Access Memory... · Dynamic Random Access Memory... · Difference Between Static...
  74. [74]
    Examining Intel's Arrow Lake, at the System Level - Chips and Cheese
    Dec 4, 2024 · Recent Arm server chips also use large (1 or 2 MB) L2 caches to mitigate L3 latency. Thus other CPU makers are also taking advantage of process ...Missing: modern size
  75. [75]
    Full article: Challenges in Cooling Design of CPU Packages for High ...
    Jul 14, 2010 · Cooling technologies that address high-density and asymmetric heat dissipation in CPU packages of high-performance servers are discussed.
  76. [76]
    Slave Memories and Dynamic Storage Allocation - Semantic Scholar
    Slave Memories and Dynamic Storage Allocation · M. Wilkes · Published in IEEE Transactions on… 1 April 1965 · Computer Science, Engineering.
  77. [77]
    IBM's Single-Processor Supercomputer Efforts
    Dec 1, 2010 · Cache memory was a new concept at the time; the IBM S/360 Model 85 in 1969 was IBM's first commercial computer system to use cache. DOI ...
  78. [78]
    [PDF] Chapter 51
    We found ourselves well into the 1970s making changes in the architecture of System/360 to remove ambiguities and, in some cases, to adjust the function ...<|separator|>
  79. [79]
    [PDF] IBM System/370 - Your.Org
    The System/370 got off to a surprisingly low-key start in. June 1970, when IBM introduced the large-scale Models. 155 and 165. Though they offered significant ...
  80. [80]
    Motorola 68020 - Wikipedia
    The Motorola 68020 is a 32-bit microprocessor from Motorola, released in 1984. ... The 68020 replaced this with a proper instruction cache of 256 bytes, the ...Missing: 1982 | Show results with:1982
  81. [81]
    [PDF] MC68020 MC68EC020 - NXP Semiconductors
    Sep 29, 1995 · The M68020 User's Manual describes the capabilities, operation, and programming of the. MC68020 32-bit, second-generation, enhanced ...Missing: 1982 | Show results with:1982
  82. [82]
    [PDF] Evolution of Memory Architecture
    The 1960s and 1970s saw the prolific use of the mag- netic core memory which, unlike drum memories, had no moving parts and provided random access to any word ...
  83. [83]
    [PDF] System/360 and Beyond
    on the Model 85 with a 16K-byte cache, typically. 97% of fetches were satisfied with data from the cache. With larger caches, in scientific applications "hit" ...
  84. [84]
    [PDF] i486™ MICROPROCESSOR
    An 8 Kbyte unified code and data cache combined with a 106 Mbyte/Sec burst bus at 33.3 MHz ensure high system throughput even with inexpensive DRAMs.
  85. [85]
    The Pentium: An Architectural History of the World's Most Famous ...
    Jul 11, 2004 · First among these improvements was the an on-die, split L1 cache that was doubled in size to 32K. This larger L1 helped boost performance ...
  86. [86]
    Intel 4th Gen Xeon CPUs Official: Sapphire Rapids With Up To 60 ...
    Jan 10, 2023 · L3 Cache, 384 MB, 105 MB ; Memory Support, DDR5-5200, DDR5-4800 ; Memory Capacity, 12 TB, 8 TB ; Memory Channels, 12-Channel, 8-Channel.Missing: size | Show results with:size
  87. [87]
    DEC StrongARM SA-110 | Processor Specs - PhoneDB.net
    Sep 3, 2007 · 1996, Application Processor, 32 bit, single-core, Memory Interface(s): Yes, 16 Kbyte I-Cache, 16 Kbyte D-Cache, 350 nm, Embedded GPU: N/A, ...
  88. [88]
    Cache architecture - Arm Developer
    A guide for software developers programming Arm Cortex-A series processors based on the Armv7-R architecture.
  89. [89]
    Apple unleashes M5, the next big leap in AI performance for Apple ...
    Oct 15, 2025 · M5 offers unified memory bandwidth of 153GB/s, providing a nearly 30 percent increase over M4 and more than 2x over M1. The unified memory ...
  90. [90]
    SiFive Performance™ P800 Series - P870-D
    P870-D is fully compliant with the RVA23 RISC‑V Instruction Profile. It incorporates a shared cluster cache enabling up to -32 cores to be connected coherently.
  91. [91]
    Industry Trends: Chip Makers Turn to Multicore Processors
    Chip makers are turning to multicore processors due to slowing performance increases, power and heat issues, and the need for more energy-efficient chips.<|separator|>
  92. [92]
    What cache coherence solution do modern x86 CPUs use?
    May 31, 2020 · MESI states for each cache line can be tracked / updated with messages and a snoop filter (basically a directory) to avoid broadcasting those messages.Can I force cache coherency on a multicore x86 CPU?Which cache-coherence-protocol does Intel and AMD use?More results from stackoverflow.com
  93. [93]
    Leveraging Approximate Caching for Faster Retrieval-Augmented ...
    Mar 7, 2025 · We introduce Proximity, an approximate key-value cache that optimizes the RAG workflow by leveraging similarities in user queries.
  94. [94]
    A Practical Shared Optical Cache With Hybrid MWSR/R-SWMR NoC ...
    Oct 13, 2022 · The optical cache banks are fabricated on separate optical dies, while the processor cores remain on their original electronic die. The cores ...