
CPU cache

A CPU cache is a small, high-speed memory integrated into the central processing unit (CPU) that temporarily stores copies of data or instructions frequently accessed from the main memory, thereby reducing the average latency of memory operations and improving overall performance. The primary purpose of the CPU cache is to bridge the significant speed gap between the fast CPU core and the comparatively slower main memory (such as DRAM), by leveraging two key principles of program behavior: temporal locality, where recently accessed data is likely to be used again soon, and spatial locality, where data near a recently accessed location is also likely to be referenced. This approach allows the processor to fulfill most memory requests directly from the cache—a "cache hit"—in a fraction of the time required for a main memory access, often achieving effective access times close to the cache's own latency of just a few clock cycles.

Modern CPU caches are structured as a multi-level hierarchy to balance speed, size, and cost, typically including Level 1 (L1), Level 2 (L2), and Level 3 (L3) caches, with each successive level being larger in capacity but slower in access time and farther from the CPU core. The L1 cache, the closest to the execution units, is usually the smallest (e.g., 32–64 KB per core) and divided into separate instruction (L1i) and data (L1d) caches to optimize for different access patterns; L2 caches are moderately larger (e.g., 256 KB–1 MB per core) and often private to each core; while L3 caches, which can span several megabytes to tens of megabytes, are typically shared across multiple cores in multi-core processors to facilitate data sharing and reduce inter-core communication overhead. Data is transferred between levels and main memory in fixed-size units called cache lines (usually 64 bytes), and cache management involves strategies for mapping addresses to cache locations—such as direct-mapped, set-associative, or fully associative organizations—to handle placement, identification via tags, and replacement policies (e.g., least recently used) when space is full.

In multi-core systems, cache hierarchies also address coherence challenges to ensure consistent data views across cores, often using protocols like MESI (Modified, Exclusive, Shared, Invalid) to maintain synchronization without excessive overhead. Advances in cache design continue to focus on increasing hit rates through larger inclusive or non-inclusive structures, prefetching mechanisms, and optimizations for specific workloads, making the cache a critical determinant of CPU efficiency in everything from embedded devices to supercomputers.

Fundamentals

Definition and Purpose

A CPU cache is a small, high-speed memory integrated near the processor core, designed to store copies of frequently accessed data and instructions from the larger but slower main memory (DRAM). This proximity allows the cache to deliver data in a fraction of the time required for main memory access, exploiting the principle of locality—where programs tend to reuse recently accessed data or nearby addresses—to minimize average memory latency.

The fundamental purpose of the CPU cache emerged from the historical disparity in speed between processors and main memory, a gap that has widened dramatically since the 1980s. Early motivations, as seen in the IBM System/360 Model 85—the first commercial computer with a cache, introduced in 1968—stemmed from the need to accelerate effective memory performance, where a main memory cycle took several processor cycles (about 13 times longer), limiting overall throughput. Today, this gap exceeds 100-fold, with modern CPU clock cycles in the sub-nanosecond range contrasting against DRAM latencies of 50–100 nanoseconds, compelling multilevel cache hierarchies to sustain processor utilization.

By reducing average memory access time through fast cache hits, the CPU cache enhances instruction execution throughput, enabling processors to operate closer to their peak speeds while also improving energy efficiency by avoiding costly main memory fetches. Cache performance is quantified by the hit rate—the fraction of memory requests satisfied directly from the cache—and the miss rate, defined as 1 minus the hit rate, which determines the frequency of slower main memory interventions. Typical hit rates in well-designed systems range from 90–99%, underscoring the cache's critical role in modern computing.

Basic Components

The fundamental building blocks of a CPU cache consist of tag storage, a data array, valid and dirty bits per cache entry, and control logic to manage these elements. These components work together to enable efficient temporary storage of data from main memory, leveraging the principle of locality to bridge the speed gap between the processor and slower memory systems.

Tag storage captures the upper bits of the physical address for each cached block, allowing the cache to distinguish which specific memory region a given entry represents within its set or line. This field is essential for verifying whether a requested address matches a stored block during access attempts. The data array provides the primary storage for the actual contents of the cached blocks and is typically implemented using static random-access memory (SRAM) for its high speed and low latency, suitable for on-chip integration. Each entry in the data array holds a fixed-size block of data, commonly 64 bytes in modern designs, though sizes such as 32 or 128 bytes appear in various architectures depending on the generation and optimization goals.

Associated with each cache entry are control bits, including a valid bit that signals whether the block contains usable data (set to 1 upon loading a block from memory) and a dirty bit that indicates whether the block has been modified since loading (set to 1 on writes in write-back caches). The valid bit ensures that only initialized entries are considered during lookups, while the dirty bit tracks changes to facilitate efficient write-back to lower levels only when necessary. Control logic encompasses hardware elements such as comparators that match incoming address tags against stored tags and finite state machines that orchestrate updates to the cache state, including bit flips and data movements within the array. These circuits ensure reliable operation by handling edge cases such as the initial cache state in which all valid bits are unset.

Cache designers must balance block sizes, as larger blocks amortize tag storage overhead across more data bytes—reducing the relative cost of tags and control bits—but they also amplify miss penalties by requiring more data transfer time from main memory on misses. For instance, doubling the block size from 32 to 64 bytes can halve the tag overhead per byte stored but roughly doubles the data transfer time incurred on compulsory or capacity misses.
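To make the roles of these fields concrete, the following sketch models a single cache entry in C. It is an illustrative data structure only: the 64-byte line size, the field widths, and the names are assumptions for this example, not the layout of any particular processor's hardware.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 64   /* assumed block size in bytes */

/* One cache entry: control bits, tag storage, and the data block itself. */
struct cache_line {
    bool     valid;            /* 1 = entry holds usable data              */
    bool     dirty;            /* 1 = modified since load (write-back)     */
    uint64_t tag;              /* upper address bits identifying the block */
    uint8_t  data[LINE_SIZE];  /* the cached block, backed by on-chip SRAM */
};

/* A hit requires both a tag match and a set valid bit. */
static inline bool line_matches(const struct cache_line *line, uint64_t tag)
{
    return line->valid && line->tag == tag;
}
```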

Operation Principles

Cache Entries and Addressing

In CPU caches, data is stored and retrieved in fixed-size units known as cache lines or blocks, which typically range from 32 to 64 bytes in modern processors to balance transfer efficiency and overhead. Each cache line holds a contiguous block of memory along with metadata, such as a tag for identification and a valid bit to indicate usability. The block offset, comprising the lowest \log_2(block size) bits of the address, specifies the particular byte or word within that line.

To map a memory address to a cache entry, the address is decomposed into three primary fields: the block offset (lowest bits), the index (middle bits), and the tag (highest bits). The block offset, as noted, selects the byte within the line. The index bits determine the specific cache line or set where the data resides, with the number of index bits equal to \log_2(number of lines or sets). The tag bits, consisting of the remaining upper address bits, uniquely identify the originating memory block to ensure correct matching during access. This decomposition enables efficient hardware lookup by first using the index to locate candidate entries, then comparing the tag for validation.

For example, in a direct-mapped cache of 4 KB capacity with 32-byte lines and a 32-bit address, the block offset requires 5 bits ($2^5 = 32$). The cache holds 128 lines ($4096 / 32 = 128$), so the index requires 7 bits ($2^7 = 128$). The tag then occupies the remaining 20 bits ($32 - 5 - 7 = 20$). The tag is obtained by extracting the high-order bits of the full address, excluding those allocated to the index and offset fields, allowing the cache to distinguish between different memory blocks that may map to the same line. In set-associative caches, the index selects a set of lines, with tag comparisons performed across all ways within that set.
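The decomposition in the example above can be expressed directly in code. The sketch below extracts the offset, index, and tag fields for the hypothetical 4 KB direct-mapped cache with 32-byte lines and 32-bit addresses; the constants and the sample address are chosen to mirror the worked example, not any specific processor.

```c
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE  4096u    /* 4 KB capacity (example)      */
#define LINE_SIZE     32u    /* 32-byte cache lines          */
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)   /* 128 lines   */

#define OFFSET_BITS  5u      /* log2(32)  */
#define INDEX_BITS   7u      /* log2(128) */

int main(void)
{
    uint32_t addr = 0x12345678u;

    uint32_t offset = addr & (LINE_SIZE - 1);                   /* low 5 bits  */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_LINES - 1);  /* next 7 bits */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);       /* top 20 bits */

    printf("addr=0x%08x -> tag=0x%05x index=%u offset=%u\n",
           (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```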

Access Process

The read access process in a CPU cache initiates when the processor issues a memory read request accompanied by a memory address. The cache controller first decodes this address, partitioning it into tag bits, index bits, and block offset bits to locate the potential data. The index bits select a specific set of cache lines, and the tag bits are simultaneously compared against the stored tags in all ways (associative entries) within that set using content-addressable memory (CAM) or equivalent parallel comparators. If a tag match occurs and the associated valid bit is asserted—confirming the cache line holds current, usable data—a hit is detected, and the requested bytes are fetched from the data storage array using the offset bits before being delivered to the processor. This valid bit check serves as a basic validity safeguard to prevent serving stale or uninitialized data. The entire hit resolution for an L1 cache typically requires 1–4 clock cycles, accounting for parallel tag lookup, valid bit inspection, and data array access. On a miss, where no matching valid tag is found, the request propagates to the next memory level (e.g., the L2 cache or main memory), potentially triggering a block fetch and replacement.

Write access mirrors the read process in address decoding and tag comparison but diverges in data handling upon a hit. Following tag match and valid bit confirmation, the incoming data updates the targeted bytes in the cache line's data array, selected by the offset bits. The specific write policy then governs further actions: in write-through caches, the data is simultaneously written to the next memory level, while in write-back caches, the line is marked dirty for deferred propagation. Misses during writes may allocate a new line (write-allocate) or bypass the cache (no-write-allocate), forwarding the operation directly to lower levels without altering the cache state. Write hit latency remains comparable to reads at 1–4 cycles for L1, as the core tag lookup and data update occur swiftly.
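The read path described above (index selection, parallel tag comparison across the ways, and a valid-bit check) can be sketched as a functional model in C. The 4-way geometry, 64-byte lines, and helper names below are assumptions for illustration; real hardware performs the per-way comparisons in parallel rather than in a loop.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS        4
#define SETS        64
#define LINE_SIZE   64
#define OFFSET_BITS 6    /* log2(64) */
#define INDEX_BITS  6    /* log2(64) */

struct line { bool valid; uint64_t tag; uint8_t data[LINE_SIZE]; };
struct set  { struct line way[WAYS]; };

static struct set cache[SETS];

/* Functional model of a read access: returns true on a hit and copies the
 * requested byte into *out; on a miss the request would be forwarded to the
 * next level (not modeled here). */
static bool cache_read_byte(uint64_t addr, uint8_t *out)
{
    uint64_t offset = addr & (LINE_SIZE - 1);
    uint64_t index  = (addr >> OFFSET_BITS) & (SETS - 1);
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* Hardware compares all ways in parallel; software iterates. */
    for (int w = 0; w < WAYS; w++) {
        struct line *l = &cache[index].way[w];
        if (l->valid && l->tag == tag) {   /* tag match + valid bit => hit */
            *out = l->data[offset];
            return true;
        }
    }
    return false;   /* miss: fetch from L2/DRAM, choose a victim, refill */
}

int main(void)
{
    uint8_t byte;
    uint64_t addr = 0x1f40;

    bool hit1 = cache_read_byte(addr, &byte);   /* cold cache: miss */

    /* Simulate the refill a controller would perform after the miss. */
    uint64_t index = (addr >> OFFSET_BITS) & (SETS - 1);
    cache[index].way[0].valid = true;
    cache[index].way[0].tag   = addr >> (OFFSET_BITS + INDEX_BITS);

    bool hit2 = cache_read_byte(addr, &byte);   /* now a hit */
    return (!hit1 && hit2) ? 0 : 1;
}
```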

Design Policies

Replacement Policies

When a cache cannot accommodate a new cache line because the relevant set is full, a replacement policy determines which existing line to evict to make space. These policies aim to minimize cache misses by exploiting patterns in memory access, such as temporal locality, where recently accessed data is likely to be accessed again soon. Common policies balance prediction accuracy against implementation complexity in hardware.

The Least Recently Used (LRU) policy selects for eviction the cache line that has gone the longest without being accessed. It maintains an ordering of lines based on recency, often using a stack or counters to track access order, making it well suited to workloads exhibiting strong temporal locality. Implementing true LRU requires significant hardware overhead, such as O(n) state updates per access for an n-way associative cache, which becomes prohibitive for large associativities.

The First-In-First-Out (FIFO) policy treats the set as a queue, evicting the oldest line inserted regardless of its usage history. This approach is simple and incurs minimal hardware cost, typically just a counter or pointer, but it performs poorly on accesses with temporal locality since it ignores recency and can evict frequently used lines prematurely. Random replacement selects a line for eviction using a pseudo-random source, such as a linear-feedback shift register. It requires negligible storage and update overhead compared to LRU or FIFO, avoiding complex bookkeeping, but its performance is suboptimal because it does not exploit access patterns, leading to higher miss rates in locality-heavy workloads.

To address LRU's complexity, approximations such as pseudo-LRU (PLRU) are commonly used in modern CPU caches. PLRU employs a tree-based structure or bit fields to approximate recency ordering at reduced cost; for example, a tree-PLRU for an 8-way set uses only 7 bits per set instead of full LRU ordering state. Evaluations show PLRU achieves miss rates close to those of LRU (often within 1–5%) while using far less area and power, making it preferable for high-associativity designs. A theoretical reference point is Belady's optimal policy, which evicts the line whose next access is farthest in the future. This requires complete knowledge of future accesses, rendering it impractical for real hardware but invaluable for establishing bounds in analysis. Well-chosen policies like LRU and PLRU can reduce miss rates by 20–50% compared to random or FIFO replacement in typical workloads, though exact gains depend on access patterns.
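As an illustration of how cheaply tree-PLRU can be implemented, the sketch below maintains the 7 tree bits for one 8-way set and derives a victim from them. The bit-ordering convention and function names are assumptions chosen for this example; actual designs differ in detail but follow the same tree idea.

```c
#include <stdint.h>
#include <stdio.h>

#define WAYS 8   /* associativity; tree-PLRU needs WAYS - 1 = 7 bits per set */

/* Tree bits for one set, packed in one byte; bit i corresponds to node i.
 * Convention: node bit == 0 -> the pseudo-LRU victim lies in the left subtree,
 *             node bit == 1 -> it lies in the right subtree. */
typedef struct { uint8_t plru; } plru_set_t;

/* Record an access to `way`: set the bits on the root-to-leaf path so that
 * they now point away from the just-used way. */
static void plru_touch(plru_set_t *set, unsigned way)
{
    unsigned node = 0;   /* start at the root (node 0) */
    for (unsigned level_bit = WAYS >> 1; level_bit > 0; level_bit >>= 1) {
        unsigned right = (way & level_bit) != 0;   /* does the path go right? */
        if (right)
            set->plru &= (uint8_t)~(1u << node);   /* victim hint: left  */
        else
            set->plru |= (uint8_t)(1u << node);    /* victim hint: right */
        node = 2 * node + 1 + right;               /* descend to that child */
    }
}

/* Follow the tree bits from the root to pick the pseudo-LRU victim way. */
static unsigned plru_victim(const plru_set_t *set)
{
    unsigned node = 0, way = 0;
    for (unsigned level = 0; level < 3; level++) {   /* log2(8) = 3 levels */
        unsigned right = (set->plru >> node) & 1u;
        way = (way << 1) | right;
        node = 2 * node + 1 + right;
    }
    return way;
}

int main(void)
{
    plru_set_t set = { 0 };
    /* Touch only ways 0..3 (the left half of the tree); the PLRU victim is
     * then guaranteed to come from the untouched right half (ways 4..7). */
    for (unsigned w = 0; w < 4; w++)
        plru_touch(&set, w);
    printf("victim way = %u\n", plru_victim(&set));   /* prints 4 here */
    return 0;
}
```

In the example, only ways 0 through 3 are touched, so the tree bits point away from them and the selected victim comes from the untouched half, showing how one bit per internal node approximates recency ordering.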

Write Policies

Write policies in CPU caches dictate how write operations from the processor are managed, balancing factors such as data consistency, latency, and bandwidth usage between the cache and lower levels of the memory hierarchy. These policies primarily address whether and when updates are propagated to main memory, as well as how cache lines are handled on write misses. Common strategies include write-through, write-back, and write-around approaches, often combined with allocation decisions.

The write-through policy updates both the cache and the backing main memory synchronously for every write operation. This ensures immediate consistency between the cache and main memory, simplifying recovery in case of failures and avoiding the need for complex tracking mechanisms. However, it incurs higher write latency and increased demands on the memory bus due to frequent writes, making it suitable for systems where data integrity is paramount over speed.

In contrast, the write-back policy (also known as copy-back or write-deferred) updates only the cache line immediately, deferring the write to main memory until the line is evicted or explicitly flushed. A dirty bit is set in the cache entry to flag modifications, indicating that the data differs from main memory. This approach reduces bus traffic by batching writes and avoiding repeated updates to the same block, thereby improving overall performance in write-intensive workloads. Drawbacks include the risk of data loss during power failures or crashes before eviction, as well as increased complexity in maintaining coherence across multiple cache levels. On eviction, the dirty bit triggers the write to main memory as part of the replacement process.

Write-around, or write-through with no-write-allocation, bypasses the cache entirely for write misses by updating only main memory directly. This prevents cache pollution from temporary or streaming write data that is unlikely to be read soon, conserving cache space for read-heavy accesses. It is particularly effective for workloads with large sequential writes, though it offers no caching benefit for subsequent reads of the written data unless a separate read miss populates the cache.

Allocation policies complement write strategies by determining whether to fetch a block into the cache on a write miss. Write-allocate brings the entire block from main memory into the cache before performing the update, and is commonly paired with write-back to support future accesses to the same line. No-write-allocate, often used with write-through or write-around, skips this fetch and writes directly to main memory, avoiding the overhead of loading irrelevant data and reducing initial latency for one-time writes.

The dirty bit is a critical one-bit flag per cache line, used primarily in write-back policies to denote whether the line's contents have been modified since being loaded from main memory. When a write occurs, the bit is set, ensuring that the updated data is propagated to main memory only when necessary, such as during eviction or cache flushes. This optimizes bandwidth by distinguishing modified lines from clean ones but adds overhead for bit maintenance and coherence checks.
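The sketch below contrasts the two propagation strategies on a write hit and shows where the dirty bit comes into play. It is a minimal functional model under stated assumptions: the line structure and the next_level_write helper are illustrative stand-ins, not the interface of any real cache controller.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64

struct line { bool valid, dirty; uint64_t tag; uint8_t data[LINE_SIZE]; };

/* Stand-in for forwarding data to L2/DRAM (assumed helper, does nothing here). */
static void next_level_write(uint64_t addr, const void *buf, int len)
{
    (void)addr; (void)buf; (void)len;
}

enum write_policy { WRITE_THROUGH, WRITE_BACK };

/* Handle a write *hit* of `len` bytes at byte `offset` within a line. */
static void write_hit(struct line *l, uint64_t addr, uint64_t offset,
                      const uint8_t *src, int len, enum write_policy policy)
{
    memcpy(&l->data[offset], src, len);      /* update the cached copy        */

    if (policy == WRITE_THROUGH)
        next_level_write(addr, src, len);    /* propagate immediately         */
    else
        l->dirty = true;                     /* defer: write back on eviction */
}

/* On eviction under write-back, only dirty lines are written downstream. */
static void evict(struct line *l, uint64_t line_addr)
{
    if (l->valid && l->dirty)
        next_level_write(line_addr, l->data, LINE_SIZE);
    l->valid = false;
    l->dirty = false;
}

int main(void)
{
    struct line l = { .valid = true, .tag = 0x1 };
    uint8_t v = 42;
    write_hit(&l, 0x1040, 0, &v, 1, WRITE_BACK);   /* marks the line dirty  */
    evict(&l, 0x1040);                             /* triggers one writeback */
    return 0;
}
```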

Associativity Mechanisms

Direct-Mapped Caches

In a direct-mapped cache, each memory block from main memory is mapped to exactly one specific cache line, selected by a portion of the address known as the index bits. This mapping is determined by dividing the physical address into three fields: the tag, the index, and the block offset. The index bits select the line, while the tag bits are compared against the tag stored in that line to verify whether it corresponds to the requested block; a full tag match with the valid bit set is required for a hit. If the tags match, the data from the selected line is output via a simple multiplexer; otherwise, the access is a miss.

The hardware implementation of a direct-mapped cache is straightforward, requiring only one tag entry per cache line and a single comparator for tag matching during access. The design uses a multiplexer to route the data from the indexed line to the output, minimizing the complexity of the cache controller. As a result, direct-mapped caches occupy less silicon area than higher-associativity designs.

Key advantages of direct-mapped caches include fast lookup times, as only one tag comparison is needed, leading to lower access latency and reduced power consumption due to fewer parallel comparisons and simpler wiring. The minimal overhead also makes them suitable for high-speed, area-constrained environments such as primary caches in microprocessors. However, direct-mapped caches suffer from high conflict miss rates, where multiple memory blocks map to the same cache line and thrash each other out, particularly in workloads whose accesses stride across the cache size, such as matrix operations or linked-list traversals. This can degrade overall performance more than in set-associative caches, which mitigate such conflicts by allowing multiple lines per index.

For example, in a 64 KB direct-mapped L1 cache with 32-byte lines, the cache holds 2048 lines, requiring 11 index bits to select a line (since $2^{11} = 2048$); the remaining upper bits form the tag for comparison, assuming a typical 32-bit or larger address in which the low 5 bits are the block offset.
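The following sketch models a small direct-mapped cache and shows how two addresses that share an index repeatedly evict each other, producing the conflict misses described above. The 64 KB, 32-byte-line geometry follows the example in this section; everything else is an illustrative assumption.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE  (64 * 1024)              /* 64 KB       */
#define LINE_SIZE   32
#define NUM_LINES   (CACHE_SIZE / LINE_SIZE) /* 2048 lines  */
#define OFFSET_BITS 5                        /* log2(32)    */
#define INDEX_BITS  11                       /* log2(2048)  */

static struct { bool valid; uint32_t tag; } lines[NUM_LINES];

/* Touch an address; return true on hit, false on miss (and fill the line). */
static bool cache_access(uint32_t addr)
{
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_LINES - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);

    if (lines[index].valid && lines[index].tag == tag)
        return true;                          /* single comparator: hit  */

    lines[index].valid = true;                /* miss: displace occupant */
    lines[index].tag   = tag;
    return false;
}

int main(void)
{
    /* Two addresses exactly one cache size (64 KB) apart share an index,
     * so alternating between them misses every time (thrashing). */
    uint32_t a = 0x00010000, b = a + CACHE_SIZE;
    for (int i = 0; i < 4; i++)
        printf("A:%s  B:%s\n", cache_access(a) ? "hit " : "miss",
                               cache_access(b) ? "hit " : "miss");
    return 0;
}
```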

Set-Associative Caches

Set-associative caches organize the cache memory into multiple sets, where each set contains a fixed number of cache lines, known as ways. For an n-way set-associative cache, the lower bits of the memory address serve as an index to select one specific set from the total number of sets, while the upper tag bits are compared simultaneously against the tags stored in all n lines within that set to identify a potential hit. This parallel comparison allows efficient lookup, as only the lines in the indexed set need to be examined rather than the entire cache.

The primary benefit of set-associative caches lies in their ability to mitigate the conflict misses that plague direct-mapped caches, where multiple memory blocks map to the same line and thrash each other out. By permitting a block to reside in any of the n ways within its designated set, these caches provide greater flexibility in block placement, leading to lower overall miss rates for a given cache size. This design strikes a balance between the simplicity and speed of direct-mapped caches and the high hit rates of fully associative caches, offering improved performance without the prohibitive hardware cost of searching the entire cache on every access.

In terms of hardware implementation, tag matching in set-associative caches typically employs a comparator for each way in the selected set, often combined with multiplexer trees to route the matching way's data to the output. Alternatively, content-addressable memory (CAM) can be used for the tag array within each set to perform simultaneous comparisons, though this increases power and area overhead compared to standard SRAM-based tags with dedicated comparators. To further optimize access speed and reduce energy, way prediction techniques forecast the most likely way containing the requested data, allowing the cache to initially access only that way's data array while verifying the tag in parallel; a misprediction incurs a small penalty but avoids probing all ways upfront.

Variations on the standard set-associative design include skewed-associative caches, which apply different hash functions or bit permutations to the index for each way, dispersing conflicting blocks across sets to further reduce misses without increasing associativity. Another approach is the pseudo-associative cache, which begins with a direct-mapped lookup and, upon a miss, sequentially probes one or more alternative locations using a secondary hash or a fixed index transformation, approximating higher associativity with minimal additional hardware. However, increasing the degree of associativity introduces scalability challenges, as higher n requires more parallel comparisons and wider multiplexers, which raise access latency, silicon area, and dynamic power dissipation. For instance, moving from 2-way to 8-way associativity can double the hit latency in some designs due to the longer critical path in the tag-matching logic. Victim caches, small fully associative buffers that store recently evicted lines, serve as a complementary extension that captures many of the residual conflict misses at low overhead.
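To illustrate how a skewed-associative cache disperses conflicts, the sketch below derives a different index for each way by folding tag bits into the index with XOR. The specific hash is an arbitrary illustrative choice, not the function used in any shipping design.

```c
#include <stdint.h>
#include <stdio.h>

#define WAYS        2
#define SETS        256          /* sets per way     */
#define OFFSET_BITS 6            /* 64-byte lines    */
#define INDEX_BITS  8            /* log2(256)        */

/* Conventional set-associative index: the same set in every way. */
static uint32_t index_conventional(uint32_t addr)
{
    return (addr >> OFFSET_BITS) & (SETS - 1);
}

/* Skewed index: each way mixes in a different slice of the tag, so blocks
 * that collide in one way tend to land in different sets of the other way. */
static uint32_t index_skewed(uint32_t addr, unsigned way)
{
    uint32_t idx = (addr >> OFFSET_BITS) & (SETS - 1);
    uint32_t tag = addr >> (OFFSET_BITS + INDEX_BITS);
    /* Illustrative hash: fold a way-dependent portion of the tag into the index. */
    return (idx ^ (tag >> (4 * way))) & (SETS - 1);
}

int main(void)
{
    uint32_t a = 0x00012340, b = 0x00052340;   /* same conventional index */
    printf("conventional: a->%u  b->%u\n",
           (unsigned)index_conventional(a), (unsigned)index_conventional(b));
    for (unsigned w = 0; w < WAYS; w++)
        printf("skewed way %u: a->%u  b->%u\n",
               w, (unsigned)index_skewed(a, w), (unsigned)index_skewed(b, w));
    return 0;
}
```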

Performance Aspects

Cache Misses and Hits

A cache hit occurs when the requested data or instruction is found in the cache, allowing the processor to access it directly without retrieving it from a lower level of the memory hierarchy. Hits are categorized by the type of access: instruction hits, where fetched instructions are present in the cache; read hits, where data for a load is available in the cache; and write hits, where the target line for a store is located in the cache, enabling an immediate update according to the cache's write policy. Hits typically take 1–4 clock cycles, depending on the cache level, significantly faster than accessing main memory.

In contrast, a cache miss happens when the requested item is not present in the cache, requiring the processor to fetch it from a slower memory level, such as another cache or DRAM. Misses are classified into three main types: compulsory misses, which occur on the first access to a memory block that has never been referenced before; capacity misses, resulting from the cache being too small to hold all actively used blocks during program execution; and conflict misses, arising in set-associative or direct-mapped caches when multiple blocks compete for the same cache set, leading to evictions even if space is available elsewhere. Conflict misses can be influenced by replacement policies, such as least recently used (LRU), which determine which block is evicted during contention.

The miss penalty represents the additional clock cycles incurred to resolve a miss, often ranging from 10 to 100 cycles for an L1 miss that escalates to DRAM access, though it can exceed 200 cycles in modern systems as processor speeds continue to outpace memory latency. Upon a miss, traditional blocking caches cause the processor to stall, halting instruction execution until the data is fetched. Non-blocking caches, however, permit the processor to continue executing independent instructions during miss resolution, a technique known as hit-under-miss, which overlaps computation with memory operations. Modern processors further mitigate miss impacts through hardware prefetching, which anticipates and loads likely future data into the cache ahead of time, and out-of-order execution, which rearranges instructions to proceed past dependent loads while misses are serviced. These approaches reduce effective stall times without altering the fundamental hit/miss classification.

Performance Metrics

The performance of a CPU cache is evaluated using several key metrics that quantify its efficiency in reducing memory access latency and improving overall system throughput. The hit time represents the latency incurred for a successful cache access, typically measured in processor clock cycles. For instance, the level-1 (L1) cache often achieves a hit time of 1 cycle in idealized models or simple designs, allowing rapid data retrieval without stalling the pipeline. In contrast, the miss penalty is the additional time required to service a cache miss, which involves fetching data from a lower level of the memory hierarchy, such as main memory, and can range from tens to hundreds of cycles depending on the system configuration.

A fundamental metric integrating these factors is the Average Memory Access Time (AMAT), which provides a comprehensive measure of effective memory latency. The formula for AMAT is given by:

\text{AMAT} = \text{hit time} + (\text{miss rate} \times \text{miss penalty})

where the miss rate is the fraction of memory accesses that result in a cache miss. This equation captures how improvements in hit rate or reductions in miss penalty directly lower the average access time, thereby enhancing processor performance.

Another important metric is the effective bandwidth, which assesses the cache's data throughput by accounting for both hits and misses. It is calculated as the total data transferred divided by the total time, incorporating the high-speed delivery on hits (limited by cache port bandwidth) and the slower fetches on misses (constrained by lower-level memory bandwidth). In bandwidth-bound workloads, effective bandwidth can be significantly lower than the cache's peak bandwidth due to miss-induced stalls, emphasizing the need to minimize miss rates for sustained performance.

Cache performance metrics are typically measured using hardware performance monitoring counters (PMCs), which track events such as cache misses and hits at runtime. For example, Intel processors provide PMCs to count L1 and L2 cache miss events, enabling precise calculation of miss rates without simulation overhead. Simulation tools, such as those based on trace-driven models, complement hardware measurements by evaluating design variations under diverse workloads. A key trade-off in cache design involves balancing size against hit time: larger caches reduce capacity-related misses and improve overall hit rates, but they increase hit time due to longer access paths and higher power consumption. Studies indicate that cache sizes between 32 and 128 KB often optimize this trade-off, as further increases yield diminishing returns in miss rate reduction while proportionally raising cycle times.
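As a worked example of the AMAT formula, the short program below evaluates it for a single cache level. The figures (a 1-cycle hit, a 100-cycle penalty, and miss rates of 3% and 1%) are assumptions chosen to fall within the ranges quoted in this section, not measurements of a specific processor.

```c
#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    double hit_time     = 1.0;    /* cycles for an L1 hit (illustrative)  */
    double miss_penalty = 100.0;  /* cycles to reach DRAM (illustrative)  */

    /* A small change in miss rate has a large effect on the average. */
    printf("miss rate 3%%: AMAT = %.1f cycles\n", amat(hit_time, 0.03, miss_penalty));
    printf("miss rate 1%%: AMAT = %.1f cycles\n", amat(hit_time, 0.01, miss_penalty));
    return 0;
}
```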

Advanced Features

Cache Hierarchy

Modern CPUs typically employ a multi-level cache hierarchy to balance speed, size, and cost, consisting of at least three levels: L1, L2, and L3 caches. The L1 cache is the smallest and fastest, usually split into separate instruction (L1i) and data (L1d) caches, each around 32 KB per core, and is private to each core for minimal latency in accessing frequently used instructions and data. The L2 cache is larger, typically 256 KB to 1 MB per core, and can be either private to each core or shared among a few cores, serving as an intermediate buffer with slightly higher latency than L1. The L3 cache, often called the last-level cache (LLC), is the largest, ranging from several MB to tens of MB, and is shared among all cores on a die or socket to provide a unified pool for less frequently accessed data.

Cache hierarchies adopt inclusion policies to manage data duplication across levels, with inclusive and exclusive being the primary variants. In an inclusive hierarchy, all data in the higher levels (L1 and L2) is also present in the lower level (L3), ensuring that the LLC contains a superset of upper-level contents, which simplifies coherence but may waste space due to duplication. Conversely, an exclusive hierarchy prohibits overlap, so data held in L1 or L2 is not also stored in L3, maximizing effective capacity but complicating tracking of cache states. Many processors, such as recent Intel Xeon designs, use non-inclusive policies for the LLC to balance these trade-offs, allowing some but not all upper-level data in the LLC.

At the L1 level, caches are often separate for instructions and data to optimize fetch and load/store operations, while L2 and L3 are typically unified, handling both types of accesses in a single structure to improve utilization in mixed workloads. In multi-core processors, L1 and L2 remain private to each core for low-latency access, whereas the shared L3 facilitates inter-core data sharing and reduces off-chip traffic, enhancing overall system coherency. To maintain consistency across the hierarchy and cores, snoop protocols are employed, where caches monitor (or "snoop") bus or interconnect transactions to detect modifications and invalidate or update local copies accordingly. In multi-level setups, snoop filters track cache line locations in private caches, filtering unnecessary probes to the shared L3 and minimizing overhead in multi-socket systems.
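The AMAT formula from the performance metrics section extends naturally to a multi-level hierarchy by treating each level's miss penalty as the average access time of the level below it. The sketch below applies that recurrence to illustrative L1/L2/L3 latencies and local miss rates; all of the numbers are assumptions for the example.

```c
#include <stdio.h>

struct level {
    const char *name;
    double hit_time;    /* cycles to hit at this level                 */
    double miss_rate;   /* local miss rate: misses / accesses at level */
};

int main(void)
{
    /* Illustrative three-level hierarchy backed by DRAM. */
    struct level levels[] = {
        { "L1", 4.0,  0.05 },
        { "L2", 12.0, 0.20 },
        { "L3", 40.0, 0.30 },
    };
    double dram_latency = 200.0;   /* cycles (illustrative) */

    /* Work bottom-up: AMAT(i) = hit_time(i) + miss_rate(i) * AMAT(i+1). */
    double below = dram_latency;
    for (int i = 2; i >= 0; i--)
        below = levels[i].hit_time + levels[i].miss_rate * below;

    printf("effective average memory access time: %.1f cycles\n", below);
    return 0;
}
```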

Specialized Cache Types

Specialized caches address specific performance bottlenecks in processors by tailoring their structure and mechanisms to particular types of data or operations, often complementing the standard cache hierarchy.

The victim cache, introduced by Norman Jouppi, is a small, fully associative buffer that captures cache lines recently evicted from a direct-mapped primary cache, thereby reducing conflict misses by providing quick access to these "victims" without fetching from lower memory levels. Simulations in the original work showed that a victim cache of just 1 to 5 entries could eliminate up to 85% of conflict misses in direct-mapped caches for benchmark workloads. This design is particularly effective where associativity is kept low to minimize hardware cost and latency.

Trace caches store sequences of decoded instructions, known as traces, rather than raw instruction bytes, enabling faster fetch in out-of-order processors by delivering contiguous micro-operations (μops) across branch boundaries. Pioneered in 1996 by Rotenberg et al., this approach improves fetch bandwidth by making non-contiguous instructions appear sequential, with evaluations indicating up to 30% higher fetch rates in wide-issue superscalar designs. Intel implemented trace caching in the Pentium 4 processor's NetBurst microarchitecture, where the execution trace cache served as the primary L1 instruction cache, holding up to 12K μops and achieving hit rates comparable to 8–16 KB conventional instruction caches.

Micro-op caches, or μop caches, hold pre-decoded sequences of micro-operations to bypass the instruction decoders in subsequent executions, reducing front-end latency and power in complex instruction set architectures like x86. First deployed in Intel's Sandy Bridge microarchitecture, this cache stores 1.5K μops per core, with subsequent architectures expanding capacity to 4K–6K μops, allowing hot code regions to avoid repeated decoding and improving overall throughput by 5–10% in integer workloads. The structure is typically set-associative and integrated near the decoders, prioritizing frequently executed paths to minimize power and delay in the front end.

The branch target buffer (BTB) is a specialized cache that stores the target addresses of branch instructions alongside prediction information, enabling rapid resolution of branches by prefetching instructions from predicted targets without full address decoding. As described by Lee and Smith, BTBs function as a small cache indexed by the branch instruction's address, with early designs using 512–2048 entries to achieve misprediction penalties under 10 cycles in pipelined processors. Modern implementations often employ multi-level BTBs with varying associativities to balance accuracy and latency, significantly boosting instruction fetch efficiency in branch-heavy workloads.

Write coalescing caches merge multiple small or non-contiguous writes into larger, contiguous blocks before propagating them to lower memory levels, thereby improving bandwidth utilization and reducing write overhead in write-intensive scenarios. This technique, explored in last-level cache designs, maximizes in-cache coalescing of overwrites, with studies showing up to 40% reductions in writeback traffic for applications with fragmented stores. It avoids unnecessary line evictions by buffering writes until a full cache line is accumulated.

Smart and reconfigurable caches enable dynamic reallocation of cache resources, such as shifting ways or sets between uses based on workload demands, to optimize hit rates without fixed partitioning. Proposed in reconfigurable cache architectures, this adaptability allows processors to shift capacity toward compute-bound tasks, yielding savings of 20–50% in media-processing benchmarks through runtime partitioning. Intel's Smart Cache extends the idea to multi-core shared L2/L3 levels, dynamically assigning cache space to active cores, while core-level reconfiguration remains prominent in research on single-thread efficiency.
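A functional sketch of the victim-cache idea described at the start of this section: on a primary-cache miss, a small fully associative buffer of recently evicted lines is probed before going to the next level. The four-entry size, the FIFO replacement within the buffer, and the function names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4   /* small, fully associative (illustrative size) */

struct vc_line { bool valid; uint64_t line_addr; };

static struct vc_line victim_cache[VC_ENTRIES];
static unsigned vc_next;   /* simple FIFO replacement within the victim cache */

/* Called on a primary-cache miss: returns true if the line is found among
 * recent victims, in which case it can be swapped back without going to L2. */
static bool victim_cache_probe(uint64_t line_addr)
{
    for (int i = 0; i < VC_ENTRIES; i++)
        if (victim_cache[i].valid && victim_cache[i].line_addr == line_addr)
            return true;   /* conflict miss absorbed by the victim cache */
    return false;
}

/* Called whenever the primary cache evicts a line. */
static void victim_cache_insert(uint64_t line_addr)
{
    victim_cache[vc_next].valid = true;
    victim_cache[vc_next].line_addr = line_addr;
    vc_next = (vc_next + 1) % VC_ENTRIES;
}

int main(void)
{
    victim_cache_insert(0x40);                 /* primary cache evicts line 0x40 */
    return victim_cache_probe(0x40) ? 0 : 1;   /* a later miss on 0x40 hits here */
}
```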

Implementation Techniques

Address Translation in Caches

In modern processors, CPU caches must interface with virtual memory systems, where programs operate on virtual addresses (VAs) that are translated to physical addresses (PAs) by the memory management unit (MMU). This translation introduces complexities in cache design, as caches need to handle both virtual and physical addressing efficiently to minimize latency while ensuring correctness. The primary approaches are virtually indexed, physically tagged (VIPT) and physically indexed, physically tagged (PIPT) schemes, each balancing speed, size, and correctness differently.

VIPT caches, commonly used for L1 instruction and data caches, generate the cache index from bits of the virtual address while storing the physical address in the tag field for comparison. This design allows the cache lookup to proceed in parallel with the translation lookaside buffer (TLB) lookup, as the virtual index bits are available immediately from the processor, overlapping the translation with the cache probe and reducing overall access time. However, VIPT requires that the index and block offset bits fit within the page offset so that they are identical in the virtual and physical addresses; this limits the size of each cache way to the page size and constrains total cache capacity unless aliasing is handled separately.

In contrast, PIPT caches use the full physical address for both indexing and tagging, necessitating a complete VA-to-PA translation before any cache access can occur. This approach eliminates the aliasing problems inherent in virtual addressing but incurs higher latency, as the TLB lookup (and any page-table walk) must complete before the cache probe, making it more suitable for larger, lower-level caches such as L2 or L3, where consistency is prioritized over speed.

A key challenge for virtually indexed caches is the synonym problem, where multiple virtual addresses map to the same physical location, potentially leading to multiple cache entries holding the same data and causing inconsistencies during updates or coherence operations. This arises because different processes or threads may alias the same physical page via distinct virtual mappings, and solutions typically involve flushing the cache on context switches or using hardware or software invalidation mechanisms to ensure only one valid copy remains. A related issue occurs when distinct physical pages have addresses sharing the same low-order bits used for indexing, resulting in avoidable conflicts within the same set. This is mitigated through page coloring, where the operating system allocates physical pages such that their low-order page-number bits (colors) match the corresponding virtual page bits, preventing unrelated pages from colliding in the cache.

The TLB plays a crucial role in efficient address translation for caches by holding recent VA-to-PA mappings from the page tables, allowing translations to be resolved quickly without frequent main memory accesses. In VIPT designs, the TLB provides the physical tag bits concurrently with the virtual index, enabling hit detection shortly after translation; a TLB miss, however, triggers a page table walk that can stall the cache access. Some advanced designs incorporate virtual address hints to further optimize translation paths in hybrid caching schemes.
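The VIPT size constraint can be checked with simple arithmetic: the index and block offset bits must fit within the page offset, which is equivalent to requiring that each cache way be no larger than a page. The sketch below performs that check for two illustrative configurations; the 4 KB page size and the cache parameters are assumptions.

```c
#include <stdbool.h>
#include <stdio.h>

/* A VIPT cache avoids synonyms in the index when index + offset bits fit
 * inside the page offset, i.e. when (cache size / associativity) <= page size. */
static bool vipt_synonym_free(unsigned cache_bytes, unsigned ways,
                              unsigned page_bytes)
{
    return cache_bytes / ways <= page_bytes;
}

int main(void)
{
    unsigned page = 4096;   /* 4 KB pages (illustrative) */

    /* 32 KB, 8-way: each way is 4 KB, so virtual and physical index bits match. */
    printf("32 KB 8-way: %s\n",
           vipt_synonym_free(32 * 1024, 8, page) ? "VIPT-safe" : "needs extra handling");

    /* 64 KB, 4-way: each way is 16 KB, so 2 index bits lie above the page offset. */
    printf("64 KB 4-way: %s\n",
           vipt_synonym_free(64 * 1024, 4, page) ? "VIPT-safe" : "needs extra handling");
    return 0;
}
```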

Multi-Port and On-Chip Designs

Multi-ported caches enable simultaneous access to cached data from multiple pipeline stages or execution units in modern processors, supporting higher throughput in pipelined and out-of-order designs. Dual-ported caches, for instance, allow one port for instruction fetches and another for data operations, while triple-ported variants accommodate additional simultaneous reads from execution or load/store units, though they increase area by 20–50% compared to single-ported equivalents. These designs mitigate bottlenecks in wide-issue processors, where issue widths exceed four instructions per cycle, but require careful banking to avoid port conflicts.

On-chip caches are predominantly implemented using static random-access memory (SRAM) cells, with the conventional 6-transistor (6T) cell providing stable single-port access suitable for basic read/write operations in L1 and L2 caches. For multi-ported needs, 8-transistor (8T) cells are employed, incorporating separate read and write ports to enhance stability and speed under contention, albeit at the cost of about 20% higher area per bit than 6T cells. Compared to off-chip dynamic RAM (DRAM), on-chip SRAM offers 5–10x lower latency and no refresh overhead, but exhibits 10–100x lower density and higher static power consumption per bit due to always-on leakage.

To optimize access times and area, cache implementations often separate the tag RAM from the data array, using dedicated high-speed SRAM for tags that store address identifiers while the larger data array holds the actual cache lines. This allows parallel tag lookups without activating the full data array until a hit is confirmed, reducing access energy by up to 25% in set-associative caches and enabling smaller, faster tag storage optimized for the comparison logic.

In contemporary multi-core processors, caches are tightly integrated on-chip within system-on-chip (SoC) designs, with shared L3 caches scaling to 256–1,152 MB in server chips like the AMD EPYC 9005 series (as of 2025) to support inter-core data sharing and reduce off-chip traffic. Advanced process nodes, such as 3 nm, enable denser integration, as seen in chips with up to 96 MB of stacked L3 cache per core complex die using technologies like AMD's 3D V-Cache (introduced in 2022), balancing capacity gains with fabrication feasibility. Fabricating large on-chip caches poses challenges in yield and thermal management, as increasing array sizes amplifies defect probabilities, potentially dropping yields below 80% for multi-megabyte structures without redundancy techniques. Heat dissipation intensifies with cache size, given SRAM's leakage-dominated power profile, leading to hotspots that can elevate die temperatures by 20–30°C and necessitate advanced cooling such as microchannel liquid systems in high-performance servers.

Historical Development

Early Innovations

The concept of cache memory originated in the mid-1960s with Maurice Wilkes' proposal for "slave memories," a small, fast auxiliary store acting as a buffer between the processor and a larger, slower main memory to exploit locality and reduce access times. In his seminal paper, Wilkes described a slave memory of around 32,000 words operating as a high-speed buffer for frequently accessed data, emphasizing dynamic allocation to minimize effective access time while keeping costs low compared to building the entire main store from fast memory. This idea laid the groundwork for modern caching by introducing the notion of a hierarchical system in which the slave holds the most active portions of the program, fetched on demand from the main store.

The first commercial implementation of a CPU cache appeared in the IBM System/360 Model 85, announced in 1968 and shipped starting in 1969. This mainframe featured a high-speed buffer storage, or cache, of 16 KB standard capacity (expandable to 32 KB), implemented with fast monolithic integrated circuits to bridge the speed gap between the processor and the core main memory. The cache employed a set-associative design with dynamic sector assignment, in which 16 sectors of 1 KB each were mapped to main storage blocks, using an activity list for replacement to optimize hit rates and performance in scientific workloads. Studies leading to its adoption showed hit rates exceeding 90% for typical programs, significantly boosting effective processor speed without requiring fully fast main storage.

In the 1970s, IBM extended caching principles to support virtual addressing with the introduction of translation lookaside buffers (TLBs) in the System/370 architecture, announced in 1970. The TLB served as a dedicated cache for page table entries, storing recent virtual-to-real address mappings to accelerate dynamic address translation (DAT) and reduce overhead in virtual memory systems. Models such as the 155, 158, 165, and 168 integrated TLB hardware to handle paging without excessive table walks, enabling efficient multiprogramming on mainframes with up to 8 MB of memory. This innovation was crucial for the System/370's virtual storage capabilities, as it kept translation latency to a few cycles per access.

By the early 1980s, caches began appearing in microprocessors, with the Motorola 68020—announced in 1982 and released in 1984—introducing the first on-chip instruction cache in the 68k family. This 256-byte cache improved performance by prefetching instructions ahead of execution, achieving hit rates that reduced average fetch times in pipelined operation. The design addressed the growing processor-memory speed gap in 32-bit systems, using a simple direct-mapped organization for cost-effective integration on a single chip.

Early cache implementations faced significant challenges, primarily the high cost of fast memory technologies such as magnetic core in the 1960s and early static RAM (SRAM) in the 1970s, which limited cache sizes to kilobytes and restricted adoption to high-end systems. Core memory, while reliable, was expensive at $1–$5 per bit and slow (around 1–2 μs access), making larger caches uneconomical; early SRAM offered sub-μs speeds but at 10–100 times the cost per bit of dynamic RAM. Additionally, the lack of standardized design methodologies complicated optimization, as architects balanced associativity, replacement policies, and block sizes without established benchmarks, often relying on simulations of specific workloads.

Evolution in Processor Architectures

The evolution of CPU caches within major instruction set architectures (ISAs) since the late 1980s has been driven by the need to balance increasing core counts, power efficiency, and performance demands. In the x86 architecture, Intel's 80486 microprocessor, introduced in 1989, marked the first integration of an on-chip cache, featuring an 8 KB unified instruction and data cache to reduce memory latency compared to off-chip designs. This unified approach stored both instructions and data in a single structure, using a write-through policy for simplicity in early systems. By 1993, the Pentium processor advanced this design with a split level-1 (L1) cache, comprising separate 8 KB instruction and 8 KB data caches, which allowed simultaneous access to instructions and data, mitigating bottlenecks in superscalar execution. In modern x86 implementations, such as Intel's Xeon Scalable processors of the 2020s, multi-level hierarchies have scaled dramatically, with Sapphire Rapids (4th Gen, 2023) offering up to 112.5 MB of shared L3 cache per socket to support up to 60 cores (as in the Xeon Platinum 8490H), enhancing data sharing in multi-threaded workloads. Parallel developments in RISC architectures included early-1990s designs such as the MIPS R4000 (1991) with separate 8 KB instruction and data caches, and the PowerPC 601 (1993) with an integrated on-chip cache, influencing subsequent designs.

Parallel advancements in the ARM ISA have emphasized mobile and embedded efficiency, evolving from single-level to sophisticated multi-level caches. The StrongARM SA-110, released in 1996 by Digital Equipment Corporation (whose StrongARM business was later acquired by Intel), introduced a split cache with 16 KB for instructions and 16 KB for data, enabling better pipelining in low-power devices such as PDAs. Subsequent Cortex-A series processors, starting with the Cortex-A8 in 2005, adopted multi-level hierarchies, including private L1 caches per core and a shared L2 cache configurable up to 1 MB, with later models like the Cortex-A76 incorporating support for improved coherence in multi-core setups. In recent developments as of 2025, Apple's M-series chips, such as the M5, integrate AI-optimized cache architectures within their unified memory system, featuring a large shared system-level cache and unified memory bandwidth exceeding 500 GB/s, tailored for on-device acceleration via the Neural Engine.

The RISC-V ISA, as an open-standard alternative, has seen rapid integration in commercial multi-core designs. SiFive's Performance P870 series (2023 onward) employs coherent multi-core clusters with private L1 and L2 caches per core, alongside a shared L3 cache of up to several MB, supporting up to 32 cores for scalable embedded and datacenter applications.

Key milestones in cache evolution across ISAs include the adoption of private caches per core in early multi-core processors around 2005, as seen in Intel's Pentium D and AMD's Athlon 64 X2, with later designs adopting shared L2 caches around 2006 (e.g., the Core 2 Duo) to minimize data replication and improve inter-core communication. In 2011, Intel's Sandy Bridge introduced the micro-op (L0) cache, a specialized L1-level structure holding up to 1.5K decoded μops to bypass the decode stage for hot code paths, reducing power and latency in the front end. Cache coherency protocols like MESI (Modified, Exclusive, Shared, Invalid), standard in x86 since the early 1990s, have been essential for these multi-core designs, ensuring consistent data visibility across cores via snooping mechanisms.

Current research explores new cache paradigms to address emerging workloads. Approximate caching techniques, such as those in Proximity (2025), exploit similarity between queries to reuse cache entries with controlled error rates, reducing database hits by up to 50% in retrieval-augmented generation systems. Experimental optical caches, like Pho$ (2022), propose hybrid opto-electronic hierarchies using photonic interconnects for shared last-level caches, potentially slashing latency and energy in multi-core setups by integrating optical memory cells.

  73. [73]
    Difference between SRAM and DRAM - GeeksforGeeks
    Jul 12, 2025 · SRAM stores data in voltage using transistors, is faster and used for cache. DRAM stores data in electric charges using capacitors, is slower ...Static Random Access Memory... · Dynamic Random Access Memory... · Difference Between Static...
  74. [74]
    Examining Intel's Arrow Lake, at the System Level - Chips and Cheese
    Dec 4, 2024 · Recent Arm server chips also use large (1 or 2 MB) L2 caches to mitigate L3 latency. Thus other CPU makers are also taking advantage of process ...Missing: modern size
  75. [75]
    Full article: Challenges in Cooling Design of CPU Packages for High ...
    Jul 14, 2010 · Cooling technologies that address high-density and asymmetric heat dissipation in CPU packages of high-performance servers are discussed.
  76. [76]
    Slave Memories and Dynamic Storage Allocation - Semantic Scholar
    Slave Memories and Dynamic Storage Allocation · M. Wilkes · Published in IEEE Transactions on… 1 April 1965 · Computer Science, Engineering.
  77. [77]
    IBM's Single-Processor Supercomputer Efforts
    Dec 1, 2010 · Cache memory was a new concept at the time; the IBM S/360 Model 85 in 1969 was IBM's first commercial computer system to use cache. DOI ...
  78. [78]
    [PDF] Chapter 51
    We found ourselves well into the 1970s making changes in the architecture of System/360 to remove ambiguities and, in some cases, to adjust the function ...<|separator|>
  79. [79]
    [PDF] IBM System/370 - Your.Org
    The System/370 got off to a surprisingly low-key start in. June 1970, when IBM introduced the large-scale Models. 155 and 165. Though they offered significant ...
  80. [80]
    Motorola 68020 - Wikipedia
    The Motorola 68020 is a 32-bit microprocessor from Motorola, released in 1984. ... The 68020 replaced this with a proper instruction cache of 256 bytes, the ...Missing: 1982 | Show results with:1982
  81. [81]
    [PDF] MC68020 MC68EC020 - NXP Semiconductors
    Sep 29, 1995 · The M68020 User's Manual describes the capabilities, operation, and programming of the. MC68020 32-bit, second-generation, enhanced ...Missing: 1982 | Show results with:1982
  82. [82]
    [PDF] Evolution of Memory Architecture
    The 1960s and 1970s saw the prolific use of the mag- netic core memory which, unlike drum memories, had no moving parts and provided random access to any word ...
  83. [83]
    [PDF] System/360 and Beyond
    on the Model 85 with a 16K-byte cache, typically. 97% of fetches were satisfied with data from the cache. With larger caches, in scientific applications "hit" ...
  84. [84]
    [PDF] i486™ MICROPROCESSOR
    An 8 Kbyte unified code and data cache combined with a 106 Mbyte/Sec burst bus at 33.3 MHz ensure high system throughput even with inexpensive DRAMs.
  85. [85]
    The Pentium: An Architectural History of the World's Most Famous ...
    Jul 11, 2004 · First among these improvements was the an on-die, split L1 cache that was doubled in size to 32K. This larger L1 helped boost performance ...
  86. [86]
    Intel 4th Gen Xeon CPUs Official: Sapphire Rapids With Up To 60 ...
    Jan 10, 2023 · L3 Cache, 384 MB, 105 MB ; Memory Support, DDR5-5200, DDR5-4800 ; Memory Capacity, 12 TB, 8 TB ; Memory Channels, 12-Channel, 8-Channel.Missing: size | Show results with:size
  87. [87]
    DEC StrongARM SA-110 | Processor Specs - PhoneDB.net
    Sep 3, 2007 · 1996, Application Processor, 32 bit, single-core, Memory Interface(s): Yes, 16 Kbyte I-Cache, 16 Kbyte D-Cache, 350 nm, Embedded GPU: N/A, ...
  88. [88]
    Cache architecture - Arm Developer
    A guide for software developers programming Arm Cortex-A series processors based on the Armv7-R architecture.
  89. [89]
    Apple unleashes M5, the next big leap in AI performance for Apple ...
    Oct 15, 2025 · M5 offers unified memory bandwidth of 153GB/s, providing a nearly 30 percent increase over M4 and more than 2x over M1. The unified memory ...
  90. [90]
    SiFive Performance™ P800 Series - P870-D
    P870-D is fully compliant with the RVA23 RISC‑V Instruction Profile. It incorporates a shared cluster cache enabling up to -32 cores to be connected coherently.
  91. [91]
    Industry Trends: Chip Makers Turn to Multicore Processors
    Chip makers are turning to multicore processors due to slowing performance increases, power and heat issues, and the need for more energy-efficient chips.<|separator|>
  92. [92]
    What cache coherence solution do modern x86 CPUs use?
    May 31, 2020 · MESI states for each cache line can be tracked / updated with messages and a snoop filter (basically a directory) to avoid broadcasting those messages.Can I force cache coherency on a multicore x86 CPU?Which cache-coherence-protocol does Intel and AMD use?More results from stackoverflow.com
  93. [93]
    Leveraging Approximate Caching for Faster Retrieval-Augmented ...
    Mar 7, 2025 · We introduce Proximity, an approximate key-value cache that optimizes the RAG workflow by leveraging similarities in user queries.
  94. [94]
    A Practical Shared Optical Cache With Hybrid MWSR/R-SWMR NoC ...
    Oct 13, 2022 · The optical cache banks are fabricated on separate optical dies, while the processor cores remain on their original electronic die. The cores ...