
Locality of reference

Locality of reference, also known as the principle of locality, is the observed tendency of computer programs to cluster their memory references within small subsets of their address space over extended intervals of time. This behavior manifests in two primary forms: temporal locality, where recently accessed data is likely to be referenced again soon, and spatial locality, where data located near a recently accessed address is likely to be accessed next. These patterns arise from common programming structures, such as loops for temporal clustering and contiguous data arrays or modules for spatial clustering. The concept emerged in the late 1950s during the development of early virtual memory systems, such as the Atlas computer at the University of Manchester, to address inefficiencies in paging and reduce thrashing in multiprogrammed environments. It was formally articulated by Peter J. Denning in his 1968 paper introducing the working set model, which defines the working set of a process as the set of pages referenced within a recent time window, providing a predictive framework for memory allocation based on locality. Denning's model demonstrated that programs exhibit stable localities, allowing systems to allocate just enough memory to avoid excessive page faults while maintaining performance. Locality of reference underpins the design of modern memory hierarchies, including caches, virtual memory, and prefetching mechanisms, enabling dramatic improvements in system throughput by minimizing data movement between slow and fast storage layers. In contemporary architectures, it influences compiler optimizations, database query processing, and even web caching algorithms, where exploiting locality can reduce access latency by orders of magnitude. Violations of locality, such as in pointer-chasing workloads, highlight its critical role, often necessitating algorithmic redesigns to enhance spatial and temporal reuse.

Core Concepts

Definition

Locality of reference is a fundamental principle in computer science describing the tendency of computer programs to access a relatively small portion of their total address space, or to repeatedly reference the same data items, over relatively short periods of time. This behavior arises because programs, influenced by their structure and the algorithms they implement, cluster references into localized subsets rather than distributing them uniformly or randomly across the entire address space. The concept was developed by Peter J. Denning during his doctoral research at MIT and formally introduced in his 1968 paper on the working set model, emerging from efforts to model program behavior in virtual memory systems and to resolve performance issues such as thrashing in multiprogrammed environments. In the von Neumann architecture, where instructions and data share a unified linear address space accessed sequentially by the processor, memory access patterns reflect the linear flow of program execution and the modular organization of code and data structures. These patterns deviate from purely random behavior, exhibiting predictability that stems from algorithmic locality, such as loops reusing variables or sequential scans of arrays, thereby enabling system designers to optimize for clustered rather than scattered references. The principle primarily manifests as spatial locality (accesses to nearby addresses) and temporal locality (reuse of recently accessed data), though these are explored in greater detail elsewhere.

Types of Locality

Spatial locality refers to the tendency of a program to access data or instructions located in close physical proximity to recently accessed items in memory. This pattern arises because many algorithms process data in a linear or clustered manner, allowing multiple related elements to be loaded into a cache line or page simultaneously. A classic example occurs in sequential traversals of arrays or linked structures within loops, where consecutive elements are stored adjacently, enabling efficient prefetching and reducing the number of distinct memory fetches. Temporal locality, in contrast, captures the propensity for the same memory location—whether data or instructions—to be referenced again shortly after its initial access. This reuse stems from repetitive control flows in programs, such as iterations in loops or frequent evaluations of conditional statements, where variables or code blocks are accessed multiple times in quick succession. For instance, loop counters and accumulated results in iterative computations exhibit strong temporal locality, as they are read and updated repeatedly during execution. In practice, spatial and temporal localities frequently interplay within program execution, as seen in nested loops that both reuse variables (temporal) and scan adjacent array elements (spatial), amplifying overall predictability in reference streams. Despite this synergy, the two can be separated for analysis; a program might demonstrate robust temporal locality in branch-heavy code but limited spatial locality if accesses skip intervening locations. This separation aids in targeted optimizations, such as adjusting data layouts for spatial gains without altering reuse patterns. These locality types differ markedly from non-local patterns like strided access, which involves regular intervals between references (e.g., sampling every fifth element), offering partial spatial benefits only if strides align with cache line boundaries but generally weaker reuse than contiguous access.
Random access patterns, characterized by unpredictable and dispersed references, exhibit negligible locality in both dimensions, leading to frequent misses without the clustering that defines spatial or temporal behaviors.
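The contrast between these access patterns can be made concrete with a toy cache simulation. The sketch below models a small direct-mapped cache (the block size and line count are illustrative parameters, not tied to any real hardware) and shows that a sequential scan enjoys a high hit rate from spatial locality, while a stride equal to the block size touches each line exactly once and never hits:

```python
BLOCK_SIZE = 16   # addresses per cache line (illustrative)
NUM_LINES = 64    # number of lines in the cache (illustrative)

def hit_rate(addresses):
    """Simulate a direct-mapped cache and return the fraction of hits."""
    lines = {}  # line index -> block (tag) currently stored there
    hits = 0
    for addr in addresses:
        block = addr // BLOCK_SIZE      # which cache line-sized block
        index = block % NUM_LINES       # which cache line it maps to
        if lines.get(index) == block:   # tag match: hit
            hits += 1
        else:                           # miss: fill the line
            lines[index] = block
    return hits / len(addresses)

sequential = list(range(4096))                   # strong spatial locality
strided = [i * BLOCK_SIZE for i in range(4096)]  # one access per line
print(f"sequential: {hit_rate(sequential):.2f}")  # 0.94 (15 of 16 accesses hit)
print(f"strided:    {hit_rate(strided):.2f}")     # 0.00 (every access misses)
```

The sequential scan misses only on the first access to each block, so 15/16 of its references hit; the strided scan defeats spatial locality entirely, matching the text's observation that strides aligned to line boundaries forfeit the benefit of contiguous access.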

Theoretical and Empirical Basis

Empirical Evidence

Early empirical studies of program behavior provided foundational evidence for locality of reference through analysis of address traces. In 1968, Peter J. Denning examined traces from various computational workloads and found that references were highly concentrated within small sets of recently accessed pages, with the working set size stabilizing to capture the majority of accesses during execution phases. This observation demonstrated that programs tend to reuse a limited subset of locations over extended periods, validating the existence of both temporal and spatial locality patterns. Subsequent work by Denning and Stuart C. Schwartz in 1972 reinforced these findings by deriving properties of the working set model from trace data, showing consistent clustering of references, with the probability of reuse dropping sharply outside the working set. Validations using standardized workloads, such as the SPEC CPU2017 benchmark suite, confirm the prevalence of locality. Trace-driven simulations reveal average L1 data cache miss rates around 3.4% for typical configurations, corresponding to hit rates exceeding 95% and highlighting strong temporal reuse in loop-dominated code sections. Spatial locality is evident in reduced miss rates with larger cache lines, as sequential accesses within data structures account for a significant portion of references. In operating systems, locality manifests in low page fault frequencies during normal operation, implying near-complete reuse of resident pages and minimal disk I/O due to temporal patterns in application behavior. The extent of locality observed in these studies is influenced by program characteristics, including repetitive code structures that foster temporal reuse and compact data arrangements that promote spatial proximity of accessed elements.

Mathematical Modeling

The reuse distance metric provides a formal measure of temporal locality in access patterns. For a sequence of references r_1, r_2, \dots, r_n, the reuse distance d of a reference r_i to item x is defined as the number of distinct items accessed between the previous reference to x at position j < i and r_i, i.e., d = |\{ r_k \mid j < k < i \}|, where the set contains each intervening item once. This distance indicates the likelihood of a cache hit: if d is smaller than the cache size, the item is still resident under a stack-based replacement policy such as LRU. Lower average reuse distances across a trace signify stronger temporal locality, enabling predictions of cache performance independent of specific hardware parameters. This metric originates from early evaluations of storage hierarchies, where it was used to simulate stack-based replacement algorithms. The working set model offers another foundational approach to modeling locality, particularly for paging systems. Introduced by Denning, it defines the working set W(t, \theta) at time t with window size \theta as the set of unique pages p referenced in the recent past: W(t, \theta) = \{ p \mid \exists i : t - \theta < a_i < t \land \text{page}(a_i) = p \}, where a_i denotes the timestamp of the i-th memory access. The size |W(t, \theta)| represents the active memory footprint, and choosing \theta such that |W(t, \theta)| fits in physical memory prevents thrashing by ensuring locality is preserved. This model assumes that references within the window capture the program's current locality phase, guiding resource allocation in multiprogrammed environments. Spatial locality metrics focus on the tendency to access nearby data items, often quantified through miss ratio curves or access stride patterns. Miss ratio curves depict the miss rate as a function of block size or cache capacity, showing how larger blocks exploit spatial reuse to reduce misses substantially in many workloads.
For stride-based accesses, such as in linear array traversals, the probability of a reference falling within a spatial distance k of a prior access can be modeled as P(k) = 1 - e^{-\lambda k}, where \lambda > 0 reflects the decay rate of access likelihood with distance; this captures decaying spatial correlation in reference streams. Such models help predict the benefits of prefetching or larger cache lines by fitting empirical traces to \lambda. Despite their utility, these mathematical models rely on key assumptions that limit their applicability. A primary limitation is the assumption of stationarity in access patterns, where locality metrics like reuse distances or working set sizes are presumed invariant over time and across inputs; in practice, dynamic workloads with phase changes or varying data sets violate this assumption, leading to inaccurate predictions unless the models are adapted at runtime.
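Both metrics above are simple to compute over a reference trace. The sketch below implements a minimal, discrete-time variant of each (treating access index as the timestamp, an assumption not in the formal definitions): reuse distance as the count of distinct items between successive accesses to the same item, and the working set as the distinct items in the last θ accesses before time t:

```python
def reuse_distance(trace):
    """Per-access reuse distance: the number of distinct items referenced
    since the previous access to the same item (inf for first accesses)."""
    last_seen = {}
    distances = []
    for i, x in enumerate(trace):
        if x in last_seen:
            j = last_seen[x]
            distances.append(len(set(trace[j + 1:i])))  # distinct intervening items
        else:
            distances.append(float("inf"))
        last_seen[x] = i
    return distances

def working_set(trace, t, theta):
    """W(t, theta): distinct items referenced in the last theta accesses
    before discrete time t (a simple index-based window)."""
    return set(trace[max(0, t - theta):t])

trace = ["a", "b", "a", "c", "b", "a"]
print(reuse_distance(trace))             # [inf, inf, 1, inf, 2, 2]
print(working_set(trace, t=6, theta=3))  # {'a', 'b', 'c'}
```

A reuse distance of 1 for the third access means one distinct item ("b") intervened since "a" was last touched; under LRU, any cache holding at least 2 items would score this access as a hit, which is how the metric predicts hit rates independently of hardware details.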

Performance Implications

Relevance to Efficiency

Locality of reference significantly enhances computational efficiency by reducing the average memory access time, as programs tend to repeatedly reference a limited subset of data and instructions, allowing these to be kept in faster storage levels while minimizing fetches from slower ones. This principle underpins memory hierarchies, where exploiting temporal and spatial locality avoids the high latency of distant storage, such as main memory or disk, which can be orders of magnitude slower than on-chip caches—disk seeks, for instance, are approximately 10 million times slower than CPU cycles (e.g., 10 ms vs. 1 ns). By limiting page faults and cache misses, locality ensures that systems operate closer to their peak performance, directly tying into Amdahl's law, which emphasizes balancing components to maximize overall throughput; without locality-aware management, thrashing can degrade efficiency to near zero, but proper exploitation maintains high utilization across the hierarchy. Quantitative impacts of locality are evident in cache hit rate improvements, which can yield substantial speedups in execution time. For example, optimizing cache hierarchies to better exploit locality, such as through separate caches for arrays and scalars, has demonstrated miss rate reductions of up to 43%, improving performance by increasing reuse and reducing misses. In blocked matrix multiplication, exploiting locality via blocking reduces miss rates from around 2 misses per iteration to as low as 0.1, resulting in 3x to 4x overall speedups on typical processors of the era. These gains highlight how even modest hit rate enhancements (e.g., from 90% to 95%) compound across billions of accesses to deliver transformative efficiency. Beyond raw speed, locality contributes to broader efficiency through energy savings and improved scalability in large systems.
On-chip cache accesses consume far less energy than off-chip memory accesses—typically 10-100 times less, owing to reduced data movement across buses—making locality crucial for power-constrained environments like mobile devices and data centers. In scalable architectures, such as clustered databases or distributed caches, locality enables efficient resource partitioning, allowing systems to handle growing workloads without proportional increases in memory or bandwidth, as working sets remain manageable even as total data volume scales. However, enhancing locality introduces trade-offs, particularly the overhead of techniques like prefetching or data reorganization. Prefetching data based on predicted locality can anticipate needs but incurs costs when predictions fail at phase boundaries, leading to unnecessary cache pollution or bandwidth waste; similarly, restructuring data for better locality may add computational overhead that offsets the gains on small datasets. These costs must be weighed against benefits, often requiring runtime monitoring to adapt dynamically.
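The compounding effect of hit rate on performance follows directly from the standard average memory access time (AMAT) formula, AMAT = hit time + miss rate x miss penalty. The sketch below uses hypothetical but representative figures (1 ns hit, 100 ns penalty) to show how raising the hit rate from 90% to 95%, as mentioned above, nearly halves the average access time:

```python
def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time for a single cache level."""
    return hit_time_ns + miss_rate * miss_penalty_ns

# Hypothetical figures: 1 ns cache hit, 100 ns penalty to main memory.
before = amat(1.0, 0.10, 100.0)  # 90% hit rate -> 11.0 ns average
after = amat(1.0, 0.05, 100.0)   # 95% hit rate ->  6.0 ns average
print(before, after, round(before / after, 2))  # 11.0 6.0 1.83
```

A 5-percentage-point hit rate improvement yields a 1.83x speedup in average access time here, because the rare misses dominate the average; this is why even modest locality gains compound across billions of accesses.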

Impact on System Design

The principle of locality of reference serves as a foundational guideline for designing memory hierarchies in computing systems, enabling efficient data access by aligning hardware structures with observed access patterns. Multi-level caches exemplify this by being dimensioned according to the characteristic windows of temporal and spatial locality; L1 caches, typically small (e.g., 32-64 KB), prioritize rapid exploitation of short-term temporal locality for frequently reused data, while the progressively larger L2 (256 KB-1 MB) and L3 (several MB) caches extend coverage to spatial locality across broader address ranges and longer reuse intervals. This tiered sizing reduces average access latency by capturing locality at scales that match program behavior, as established in early analyses of reference patterns. Virtual memory systems integrate locality principles through page replacement algorithms that approximate temporal locality to manage limited physical memory. The least recently used (LRU) policy, for example, evicts pages based on the recency of access, assuming that recently referenced pages are likely to be reused soon, which aligns with empirical observations of program phasing and minimizes thrashing in multiprogrammed environments. Compilers and operating systems further embed locality-aware mechanisms into system design: compilers apply transformations like loop tiling to reorder iterations, thereby enhancing data reuse and fitting access patterns to cache sizes during compilation. Meanwhile, OS schedulers preserve locality by prioritizing thread assignments to processors that retain shared data in local caches, particularly in cache-coherent non-uniform memory access (CC-NUMA) architectures, to avoid costly data migrations. System performance benchmarking leverages locality through metrics such as cache miss rates, which quantify how effectively a design exploits reference patterns—lower miss rates indicate better alignment between cache capacities and locality windows, providing a direct measure of overall efficiency.
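The LRU page replacement policy described above can be sketched in a few lines using an ordering structure that tracks recency. This is a minimal simulation for counting faults, not an OS implementation:

```python
from collections import OrderedDict

def lru_faults(reference_string, num_frames):
    """Count page faults under LRU replacement for a page reference string."""
    frames = OrderedDict()  # pages ordered from least to most recently used
    faults = 0
    for page in reference_string:
        if page in frames:
            frames.move_to_end(page)        # refresh recency on a hit
        else:
            faults += 1
            if len(frames) == num_frames:
                frames.popitem(last=False)  # evict the least recently used page
            frames[page] = None
    return faults

refs = [1, 2, 3, 1, 2, 4, 1, 2, 5]
print(lru_faults(refs, 3))  # 5: pages 1 and 2 stay resident thanks to reuse
```

The heavily reused pages 1 and 2 are never evicted, illustrating how LRU exploits temporal locality: only the cold pages (3, 4, 5) and the initial fills fault.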

Applications in Computing

Memory Hierarchies

Memory hierarchies in computer systems are structured as a pyramid of levels, each with greater capacity and access latency but lower cost per bit than the level above, designed to exploit the locality of reference principle. At the apex are CPU registers, which offer sub-nanosecond access times and hold a few dozen entries for immediate operands, capitalizing on temporal locality by retaining recently computed values. Below them lie on-chip caches: the L1 cache (typically 32-64 KB per core) provides latencies around 1-4 cycles and targets both spatial and temporal locality for instructions and data; the L2 cache (256 KB to 1 MB per core) extends this with 10-20 cycle latencies; and the shared L3 cache (several MB to tens of MB) serves multiple cores at 20-50 cycles, further buffering main memory. DRAM main memory follows with latencies of hundreds of cycles and gigabyte-scale capacity, while secondary storage like SSDs or disks at the base offers millisecond latencies for archival data. This tiered design ensures that programs, which exhibit locality by reusing nearby or recent data, experience average access times closer to the fastest levels rather than the slowest, as misses in higher levels fetch blocks into all intervening caches to prime future accesses. Cache management policies within these hierarchies directly leverage locality to maximize hit rates, which measure the fraction of accesses served from the cache without resorting to slower levels. For temporal locality, the Least Recently Used (LRU) policy approximates optimality by evicting the block unused for the longest time, assuming recent accesses predict future ones; this performs well in workloads with looping structures but incurs overhead from tracking recency. Spatial locality is exploited via prefetching mechanisms, where hardware anticipates and loads adjacent blocks upon a miss, such as stride prefetchers detecting regular access patterns in arrays.
Cache associativity influences hit rates by allowing multiple blocks per set: direct-mapped caches (associativity of 1) are fast but prone to conflict misses that disrupt locality, while higher associativity (e.g., 8- or 16-way set-associative) reduces such conflicts by 20-50% in typical benchmarks, though at the cost of slightly longer hit times due to parallel tag comparisons. These policies collectively tune the hierarchy toward the 90%+ L1 hit rates observed in locality-rich programs. Belady's anomaly illustrates a counterintuitive failure of some replacement policies under locality assumptions, in which increasing the number of page frames does not always reduce misses and can even increase them for certain access patterns. Named after László Bélády, the anomaly arises in non-optimal policies like FIFO (First-In-First-Out), which evict the oldest block regardless of future use; for a reference string like 1-2-3-4-1-2-5-1-2-3-4-5, a 3-frame configuration incurs 9 faults, but a 4-frame configuration incurs 10, violating the expectation that larger memories better capture temporal locality. Even the optimal Bélády algorithm—evicting the block referenced farthest in the future—relies on perfect knowledge of future references, which real systems approximate imperfectly, leading to scenarios where scaling up exposes pathological patterns not aligned with typical reuse. This highlights the need for locality-aware policies beyond naive sizing in hierarchy design. In modern multi-core CPUs, cache inclusion policies extend hierarchies to handle shared access while preserving locality benefits. Inclusive caches, common in Intel processors, mandate that the L3 contain all L1 and L2 contents, simplifying coherence by allowing snoop broadcasts to check a single point, but potentially duplicating data and underutilizing L3 space when temporal locality keeps data in private per-core caches.
Non-inclusive (or exclusive) designs, used in some other architectures, permit the L3 to hold data absent from the private caches, maximizing effective capacity and improving hit rates by up to 10-15% in multi-threaded workloads with strong locality, though they complicate coherence protocols with additional invalidations. These extensions balance locality exploitation against inter-core sharing, influencing overall system performance as hierarchies scale to dozens of cores.
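Belady's anomaly is easy to reproduce by simulating FIFO replacement on the classic reference string given above. The sketch below counts faults and confirms that adding a fourth frame makes things worse:

```python
from collections import deque

def fifo_faults(reference_string, num_frames):
    """Count page faults under FIFO replacement for a page reference string."""
    frames = deque()
    faults = 0
    for page in reference_string:
        if page not in frames:
            faults += 1
            if len(frames) == num_frames:
                frames.popleft()  # evict the oldest page, ignoring any reuse
            frames.append(page)
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(refs, 3))  # 9
print(fifo_faults(refs, 4))  # 10 -- more frames, more faults
```

FIFO ignores recency entirely, so adding capacity merely shifts which pages happen to be resident; LRU, by contrast, tracks reuse and is provably free of this anomaly.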

Algorithm Optimization

Algorithm optimization techniques leverage locality of reference to minimize memory access latencies by restructuring data layouts and access patterns. These methods target both spatial and temporal locality, ensuring that data likely to be reused is kept in faster levels of the hierarchy such as caches or registers. By analyzing access patterns and applying transformations, algorithms can achieve significant performance gains without altering their functional correctness. Seminal works in this area emphasize the importance of modeling reuse to guide optimizations, enabling developers and compilers to tune code for modern hardware hierarchies. Loop transformations, such as tiling (also called blocking), enhance spatial locality by partitioning loops into smaller blocks whose data fits within cache lines, reducing misses during nested iterations. For instance, in matrix multiplication, dividing the computation into blocks sized to the cache capacity localizes accesses, allowing repeated use of the same submatrix within the cache before eviction. This approach, introduced in early blocked linear algebra optimizations, can improve performance by factors of 2 to 10 depending on the problem size and cache configuration. Tiling also preserves temporal locality by reordering iterations to reuse data soon after loading it. Data structure selection plays a crucial role in exploiting locality, with contiguous structures like arrays preferred over scattered ones like linked lists to capitalize on spatial locality. Arrays enable prefetching of adjacent elements into the cache during sequential access, whereas linked lists often scatter nodes across memory, leading to frequent cache misses and degraded performance in traversal-heavy operations. Cache-oblivious algorithms and data structures further generalize this by designing layouts that perform well across unknown cache hierarchies without explicit tuning, such as recursive layouts that maintain locality at multiple levels. These choices can yield up to 5-10 times faster execution for memory-bound computations compared to non-optimized alternatives.
Compiler optimizations, including register allocation and instruction scheduling, target temporal locality by keeping frequently accessed data in registers and reordering instructions to maximize reuse proximity. Register allocation assigns variables with high reuse frequency to the limited number of registers, minimizing spills to slower memory and exploiting short-term temporal locality; graph-coloring techniques ensure conflict-free assignments, reducing execution time by up to 50% in register-poor architectures. Instruction scheduling rearranges code to overlap latencies and group accesses with similar reuse patterns, further enhancing temporal reuse without hardware-specific knowledge. These passes, often performed late in compilation, integrate locality models to prioritize allocations based on live ranges and access distances. Analysis based on reuse distance aids in tuning algorithms by quantifying the number of distinct references between successive accesses to the same data, providing a metric to evaluate and improve locality. Reuse distances shorter than the cache capacity indicate good temporal locality and allow prediction of miss rates under various cache configurations. Tools employing this metric, such as dynamic trace analyzers, sample execution traces to recommend transformations like loop reordering, achieving 20-40% reductions in cache misses for loop-dominated applications. By focusing on whole-program behavior, these tools enable iterative optimization without exhaustive simulation.
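The loop tiling transformation described above can be sketched directly. The version below restructures a triple loop over n x n matrices into block-sized chunks so that each tile of the operands is reused while it would still be cache-resident; it is written in plain Python for clarity (interpreter overhead masks real cache timing, so the point here is the access-pattern restructuring, which a compiled implementation would translate into fewer misses):

```python
def blocked_matmul(A, B, n, block):
    """Multiply two n x n matrices (lists of lists) using loop tiling."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Within this tile, rows of A and B are revisited many
                # times in quick succession (temporal locality), and the
                # innermost loop scans C and B rows contiguously
                # (spatial locality).
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + block, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j for j in range(n)] for i in range(n)]
naive = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
assert blocked_matmul(A, B, n, block=2) == naive  # same result, tiled order
```

The transformation changes only the iteration order, not the arithmetic, which is why compilers can apply it automatically: correctness is preserved while the working set per tile shrinks to fit the cache.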

Specialized Examples

One prominent example of exploiting locality in computing is matrix multiplication, where the standard O(n³) algorithm suffers from poor spatial locality due to strided memory accesses that evict useful data from the cache before reuse. By dividing the matrices into blocks sized to fit within the cache, the blocked algorithm reuses submatrices multiple times while they remain resident, thereby improving both spatial and temporal locality. This approach achieves the asymptotically optimal number of cache misses, O(n³/√M), where M is the cache size in elements, minimizing the capacity and conflict misses that plague the naive algorithm's strided accesses. In database query processing, B-tree indexing leverages spatial locality by organizing keys in sorted, contiguous leaf nodes, enabling range queries to perform sequential disk or memory block accesses rather than scattered random reads. This structure minimizes misses during scans, as adjacent keys are likely to be fetched together into the buffer, supporting efficient prefetching and reducing I/O in systems like relational databases. For graph traversals, breadth-first search (BFS) exploits temporal locality by processing vertices level by level, reusing adjacency lists of neighboring vertices that remain in cache during the frontier expansion phase. In compressed sparse row representations, this can amortize misses across multiple traversals, improving reuse when computing connected components. Performance comparisons highlight the impact of these optimizations; for instance, the general matrix multiply (GEMM) routine in BLAS libraries, which employs blocked algorithms, achieves reuse efficiencies approaching 90% on modern processors for moderately sized matrices, leading to near-peak floating-point throughput. In contrast, unoptimized implementations may incur 10-50 times more cache misses, slowing execution by factors of 5-20 depending on matrix dimensions and hardware.
A notable counterexample arises in branchy code, where frequent conditional branches disrupt the predictability of access patterns, leading to branch mispredictions that stall the pipeline and invalidate assumptions about future references. This scatters memory accesses, reducing effective temporal locality as recently cached data is evicted before reuse; misprediction rates exceeding 20% in irregular workloads can effectively double cache miss penalties.
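The BFS example above relies on the compressed sparse row (CSR) layout, in which each vertex's neighbor list is a contiguous slice of one index array. A minimal sketch (graph encoding chosen for illustration) shows why frontier expansion is a sequential scan rather than pointer chasing:

```python
from collections import deque

def bfs_csr(indptr, indices, source):
    """Level-order BFS over a CSR graph, returning distances from source.
    Each vertex's neighbors occupy indices[indptr[u]:indptr[u+1]], so
    every frontier expansion reads a contiguous slice (spatial locality)."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in indices[indptr[u]:indptr[u + 1]]:  # contiguous scan
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist

# 4-node path graph 0-1-2-3 in CSR form (neighbor lists packed end to end).
indptr = [0, 1, 3, 5, 6]
indices = [1, 0, 2, 1, 3, 2]
print(bfs_csr(indptr, indices, 0))  # {0: 0, 1: 1, 2: 2, 3: 3}
```

Contrast this with an adjacency structure of heap-allocated node objects, where each neighbor lookup may land on a different cache line; the CSR layout is what lets BFS amortize misses across the frontier.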

Advanced and Emerging Uses

Parallel and Distributed Systems

In shared-memory parallel systems, non-uniform memory access (NUMA) architectures introduce varying access latencies, where local memory fetches are significantly faster than remote ones, impacting spatial locality by increasing the cost of accessing data across nodes. For instance, in NUMA machines like the ACE, local memory access times are about 0.65 μs per fetch compared to 1.5 μs for global memory, leading to performance degradation when data placement does not align with access patterns, as unrelated objects on the same page can force frequent remote accesses. Operating system techniques, such as page migration and replication in cache-coherent NUMA (CC-NUMA) systems, mitigate this by moving or duplicating pages to the processors with the highest access rates, reducing remote cache misses by up to 52% in some workloads and improving execution time by 29%. Cache coherence protocols, such as the MESI (Modified, Exclusive, Shared, Invalid) protocol, preserve temporal locality by managing cache line states to allow repeated local accesses without unnecessary invalidations, enabling private caching of high-locality data blocks in multi-core environments. Locality-aware extensions to MESI further adapt coherence based on observed reuse, classifying cache lines with a Private Caching Threshold (e.g., 4 accesses) to exploit spatio-temporal locality, yielding up to 15% faster completion times on 64-core systems. In distributed systems, data locality optimizes computation placement to minimize network overhead, as exemplified by frameworks like Hadoop, where tasks are scheduled on nodes holding input data replicas to avoid transferring large datasets. The MapReduce model, built on distributed file systems like GFS, stores files in 64 MB blocks with three replicas and prioritizes local reads during the map phase, ensuring that most input data is processed without network involvement in large-scale operations.
This approach reduces network bandwidth usage by aligning computation with data storage, directly leveraging locality of reference to scale processing across clusters while handling failures through redundant replicas. From a network perspective, joint scheduling algorithms in Hadoop-like systems prefetch data and balance task assignment with data locality, stabilizing throughput at higher arrival rates (e.g., 48 tasks per time slot in 200-node clusters) compared to traditional fair schedulers. A key challenge in parallel systems is false sharing, where multiple threads access distinct variables that happen to map to the same cache line, violating spatial locality assumptions through unnecessary coherence traffic and invalidations. This "ping-ponging" effect can degrade performance by up to 160 times in microbenchmarks and 10 times in real applications, as updates to one variable evict the line from other caches despite there being no logical sharing. Optimizations address this via affinity scheduling, which binds threads or iterations to specific processors to maintain data proximity, such as locality-based dynamic scheduling (LDS) for NUMA systems, which partitions loops into small subtasks and prioritizes local execution, outperforming static methods on varied workloads. Data partitioning further enhances locality by dividing datasets into balanced chunks aligned with node topology, as in spatial-aware strategies like Spatial-Aware Bucket Balanced Split (SABBS), which groups related data while limiting load per node, achieving up to 1.64 times performance gains in distributed multimedia retrieval over 60 nodes with billions of descriptors.
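The data-local scheduling idea behind Hadoop-style placement can be sketched as a greedy assignment: each task goes to the least-loaded node that already holds a replica of its input block, so the computation moves to the data rather than the reverse. All names below (nodes, blocks, tasks) are hypothetical, and real schedulers add fallbacks for remote reads and fairness constraints that this sketch omits:

```python
def schedule(tasks, replicas, node_load):
    """Greedy data-local placement: assign each (task, block) pair to the
    least-loaded node holding a replica of that block. Mutates node_load."""
    assignment = {}
    for task, block in tasks:
        local_nodes = replicas[block]               # nodes with a replica
        node = min(local_nodes, key=lambda n: node_load[n])
        assignment[task] = node
        node_load[node] += 1                        # account for the new task
    return assignment

replicas = {"b0": ["n1", "n2", "n3"], "b1": ["n2", "n3", "n4"]}
load = {"n1": 0, "n2": 0, "n3": 0, "n4": 0}
tasks = [("t0", "b0"), ("t1", "b1"), ("t2", "b0")]
print(schedule(tasks, replicas, load))
# {'t0': 'n1', 't1': 'n2', 't2': 'n3'}
```

Every task lands on a node holding its input, so no block crosses the network; the load counter spreads work across replicas, which is the same balance between locality and load that production schedulers negotiate.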

Modern Hardware Adaptations

In modern graphics processing units (GPUs) and other accelerators, the single-instruction, multiple-thread (SIMT) execution model inherently promotes thread-level locality by executing groups of threads in lockstep, allowing shared data accesses to benefit from coalesced memory operations that reduce bandwidth pressure and improve spatial locality. Within this model, CUDA's shared memory serves as a programmer-controlled on-chip resource that captures temporal locality by enabling threads within a block to reuse data loaded from global memory, minimizing repeated off-chip accesses and enhancing overall throughput for data-parallel workloads. Emerging interconnect standards like Compute Express Link (CXL) extend locality of reference across disaggregated devices by providing cache-coherent memory semantics over PCIe, allowing remote memory to be treated as part of a unified address space and enabling efficient data sharing without explicit copying, which mitigates the latency penalties of traditional NUMA hierarchies. Complementing this, processing-in-memory (PIM) architectures reduce latency gaps in the memory hierarchy by embedding compute units directly within or near DRAM, thereby localizing data movement and exploiting both spatial and temporal locality for bandwidth-bound applications, as demonstrated in low-overhead PIM instruction sets that adaptively monitor access patterns to avoid unnecessary data transfers. In AI workloads, particularly transformer models, attention mechanisms can leverage spatial locality in token embeddings through IO-aware algorithms like FlashAttention, which employ blockwise tiling to compute attention scores while keeping intermediate results in fast GPU SRAM, avoiding full materialization of large attention matrices in slower HBM and achieving up to 3x speedups on sequence lengths common in language models.
Looking to future trends, quantum computing algorithms for tasks like quantum chemistry exploit inherent locality in molecular Hamiltonians by mapping sparse, local interactions to qubit operators with reduced circuit depth, as in locality-optimized variants of the Bravyi-Kitaev transformation, potentially lowering the qubit and gate requirements for practical simulations. Similarly, neuromorphic systems, inspired by biological neural architectures, enforce locality constraints in synaptic connections and learning rules, enabling on-chip, distributed processing that mirrors the brain's sparse, local activations and supports energy-efficient computation without relying on von Neumann bottlenecks.

References

  1. [1]
    The Locality Principle - the denning institute
    Locality of reference is one of the cornerstones of com- puter science. It was born from efforts to make virtual memory systems work well. Vir-.Missing: original | Show results with:original
  2. [2]
    The working set model for program behavior - ACM Digital Library
    The working set model for program behavior. Author: Peter J. Denning. Peter J ... Published: 01 May 1968 Publication History. 762citation6,999Downloads.Cited By · Information & Contributors · Author Tags
  3. [3]
    The Working Set Model for Program Behavior - cs.wisc.edu
    The Working Set Model for Program Behavior. P. J. Denning @ MIT. Communications of the ACM, May 1968, pages 323-333. Goal. Develop a unified approach to tackle ...
  4. [4]
    [PDF] Locality of Reference - CS@Cornell
    If a process access a memory location, then it is likely that the same memory location is going to be accessed again in the near future (temporal locality).
  5. [5]
    The locality principle
    Locality of reference is one of the cornerstones of com- puter science. It was born from efforts to make virtual memory systems work well. Vir-.
  6. [6]
    [PDF] Quantifying Locality In The Memory Access Patterns of HPC ...
    This paper proposes a method to quantify spatial and temporal locality in memory access patterns of HPC applications using architecture-independent metrics.
  8. [8]
    The working set model for program behavior
    The working set model is a new model where the working set of pages, the most recently used pages, provides vital knowledge for dynamic memory management.
  9. [9]
    Properties of the working-set model
    A program's working set W(t, T) at time t is the set of distinct pages among the T most recently referenced pages. Relations between the average.
  10. [10]
    Cache performance of SPEC CPU2000 - Computer Sciences Dept.
    On this webpage, we present functional cache miss ratios and related statistics for the SPEC CPU2000 suite.
  11. [11]
    Evaluation techniques for storage hierarchies - IEEE Xplore
    IBM Systems Journal, Volume 9, Issue 2. R. L. Mattson; J. Gecsei; D. R. Slutz.
  12. [12]
    The Stack Growth Function: Cache Line Reference Models
    SGFs measured for some 40 real programs show that a simple exponential model fits the SGFs reasonably well over a range of execution intervals. Parameters of ...
  13. [13]
    [PDF] Program locality analysis using reuse distance - Research
    Mattson et al. [1970] gave the first algorithm, which used a stack. Bennett and Kruskal [1975] observed that a stack was too slow to measure ...
  14. [14]
    [PDF] Locality Principle. - the denning institute
    Jun 22, 2008 · Locality is a universal behavior of all computational processes: They tend to refer repeatedly to subsets of their resources over extended ...
  15. [15]
    A Study of Separate Array and Scalar Caches. - ResearchGate
    ... locality of reference. [4], which is a property exhibited by most programs. A ... We achieve almost a 10X speedup in execution time compared to non-optimized C ...
  16. [16]
    [PDF] The Cache Performance and Optimization of Blocked Algorithms
    By reusing data in the faster level of the hierarchy, it cuts down the average access latency. It also reduces the number of references made to slower levels ...
  17. [17]
    The Locality Principle - Communications of the ACM
    Jul 1, 2005 · Locality of reference is one of the cornerstones of computer science. It was born from efforts to make virtual memory systems work well.
  18. [18]
    [PDF] A Data Locality Optimizing Algorithm - SUIF Compiler
    In general, tiling transforms an n-deep loop nest into a 2n-deep loop nest where the inner n loops execute a compiler-determined number of iterations. Figure 4 ...
  19. [19]
    [PDF] Measuring Cache and TLB Performance and Their Effect on ...
    Our measurement technique involves the timing of operations executed repeatedly within small loops; in such cases, few cache and TLB misses are encountered.
  20. [20]
    [PDF] The Memory Hierarchy
    Sep 23, 2025 · ⬛ References array elements in succession (spatial locality). ⬛ References sum and i each iteration (temporal locality; put in registers).
  21. [21]
    [PDF] Memory Management - cs.Princeton
    • Registers as cache of L1/L2 cache and main memory. • Main memory as a cache for the disk. • Disk as a cache of files from remote storage. • Locality of ...
  22. [22]
    [PDF] 03-04-memory-hierarchy.pdf - Overview of 15-740
    Temporal Locality: If a location is referenced it is likely to be referenced again in the near future. Spatial Locality: If a location is referenced it is ...
  23. [23]
    [PDF] Chapter 2: Memory Hierarchy Design – Part 2
    Cache size is the total capacity of the cache. Bigger caches exploit temporal locality better than smaller caches. But are not always better.
  24. [24]
    [PDF] Beyond Physical Memory: Policies - cs.wisc.edu
    The optimal replacement policy leads to the fewest number of misses overall. Belady showed that a simple (but, unfortunately, difficult to implement!)
  25. [25]
    [PDF] Cache Replacement Policy - Washington
    Belady's Anomaly illustrated with FIFO replacement (3 slots). MIN is optimal: replace the page or cache entry that will be used furthest in the future.
  26. [26]
    [PDF] Achieving Non-Inclusive Cache Performance with Inclusive Caches
    We propose Temporal Locality Aware (TLA) cache management policies to allow an inclusive LLC to be aware of the temporal locality of lines in the core caches.
  27. [27]
    [PDF] Non-Inclusion Property in Multi-level Caches Revisited
    The inclusion property, which dictates that the contents of a lower level cache be a subset of those of a higher level cache, is highly desired in a ...
  28. [28]
    [PDF] Compiler Optimizations for Improving Data Locality
    the best loop permutation. Wolf and Lam use unimodular transformations and tiling with estimates of temporal and spatial reuse to improve data locality ...
  29. [29]
    [PDF] Improving Data Locality with Loop Transformations
    In this article, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and ...
  30. [30]
    Improving the cache locality of memory allocation - ACM Digital Library
    Our measurements show that poor locality in sequential-fit allocation algorithms reduces program performance, both by increasing paging and cache miss rates.
  31. [31]
    [PDF] Cache-Oblivious Algorithms and Data Structures - Erik Demaine
    Cache-obliviousness, introduced by Frigo, Leiserson, Prokop, and Ramachandran in 1999. Cache-oblivious algorithms perform well on a multilevel memory hierarchy without ...
  32. [32]
    [PDF] Predicting Whole-Program Locality through Reuse Distance Analysis
    First, reuse distance is at most a linear function of program data size. The search space is therefore much smaller for pattern recognition and prediction.
  33. [33]
    [PDF] Cache-Oblivious Algorithms
    Many of the results in this thesis are based on a joint paper [21] coauthored by Matteo Frigo, Charles E. Leiserson, and Sridhar Ramachandran.
  34. [34]
    [PDF] Fractal Prefetching B -Trees: Optimizing Both Cache and Disk ...
    This paper, however, is the first to propose a B+-Tree index structure that effec- tively optimizes both CPU cache and disk performance on modern processors, ...
  35. [35]
    [PDF] The More the Merrier: Efficient Multi-Source Graph Traversal
    Multi-Source BFS (MS-BFS) runs multiple concurrent BFSs on a single CPU core, sharing computation, reducing memory accesses, and avoiding synchronization costs.
  36. [36]
    CGAcc: A Compressed Sparse Row Representation-Based BFS ...
    Nov 7, 2018 · Graph traversal suffers from memory stalls and cache misses due to limited spatial locality. Spatial locality for traversal only exists in ...
  37. [37]
    [PDF] High-Performance Implementation of the Level-3 BLAS
    Right: The curves labeled “Pack A” and “Pack B” indicate the percent of total time spent in each of these operations.
  38. [38]
    [PDF] The Effects of Predicated Execution on Branch Prediction
    Predication removes branch instructions, affecting branch prediction accuracy, branch penalty, and basic block size. Branch prediction eliminates branch delay.
  39. [39]
    [PDF] Branch prediction, instruction-window size, and cache size
    Using an RUU eliminates artifacts arising from interactions between active-list size and issue- queue size and reduces the already large number of variables we ...
  40. [40]
    Locality-Driven Dynamic GPU Cache Bypassing - ACM Digital Library
    This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. L1 data caches (L1 D-caches) are critical ...
  41. [41]
    A locality-aware memory hierarchy for energy-efficient GPU ...
    We investigate the locality of reference at different cache levels in the memory hierarchy. At the L1 cache level, we look into the locality behavior at the ...
  42. [42]
    An Introduction to the Compute Express Link (CXL) Interconnect
    Jul 8, 2024 · CXL enables coherency and memory semantics and builds on top of PCIe' s physical subsystem. Non-coherent accesses work well for streaming I/O ...
  43. [43]
    a low-overhead, locality-aware processing-in-memory architecture
    In addition, we introduce a simple hardware structure that monitors the locality of data accessed by a PIM-enabled instruction at runtime to adaptively execute ...
  44. [44]
    Exploiting Locality in Quantum Computation for Quantum Chemistry
    We show that the utilization of spatial locality combined with the Bravyi–Kitaev transformation offers an improvement in the scaling of known quantum algorithms ...
  45. [45]
    [PDF] Loihi: A Neuromorphic Manycore Processor with On-Chip Learning
    Once a learning rule satisfies the locality constraint, the inherent parallelism offered by SNNs will allow the adaptive network to be scaled up to large ...