
Prefetcher

A prefetcher is a hardware component integrated into modern central processing units (CPUs) that predicts upcoming memory accesses for data or instructions and proactively fetches them into the on-chip cache hierarchy before they are explicitly requested, thereby hiding memory latency and boosting execution performance. This technique addresses the growing disparity between processor speed and memory access times, often referred to as the "memory wall," by overlapping data retrieval with computation. Prefetchers are a standard feature in high-performance processors from major vendors, including Intel's Xeon and Core series as well as AMD's Opteron and Ryzen lines, where they operate transparently to analyze access patterns such as sequential streams or constant strides.

Prefetching mechanisms originated in the early 1990s as a response to cache miss penalties in direct-mapped caches, with Norman Jouppi's 1990 proposal of stream buffers marking a foundational advancement by prefetching sequential cache lines on misses. Subsequent innovations in the mid-1990s, such as stride-based prefetchers and Markov models for irregular patterns, evolved into more sophisticated hardware implementations by the 2000s, appearing in commercial microarchitectures from vendors such as IBM and Intel.

Today, prefetchers are classified primarily into hardware, software, and hybrid variants, with hardware prefetchers dominating due to their low overhead and automatic operation. Hardware prefetchers, the most common type, monitor runtime memory access histories to detect patterns and issue prefetch requests; for instance, Intel's Data Cache Unit (DCU) prefetcher targets the L1 data cache for sequential or strided loads, while the Streamer identifies access streams and prefetches 128-byte blocks for streaming workloads. In contrast, software prefetching involves compiler-inserted instructions (e.g., Intel's PREFETCHT0 for L1 loading) to explicitly hint at future accesses, offering fine-grained control for specialized applications such as matrix multiplications but requiring careful tuning to avoid overheads. Hybrid approaches combine both, leveraging hardware for general cases and software for domain-specific optimizations, as in systems handling scale-out workloads.

The effectiveness of prefetchers is measured by metrics like coverage (fraction of misses avoided), accuracy (ratio of useful to total prefetches), and timeliness (ensuring data arrives just in time), with modern designs incorporating adaptive throttling to mitigate issues such as cache pollution or bandwidth contention in multicore systems. Despite these benefits, prefetchers can introduce challenges, including cache pollution and wasted bandwidth from unnecessary fetches as well as vulnerabilities such as side-channel exploits, prompting configurable controls via model-specific registers (MSRs) in many processors. Ongoing research emphasizes context-aware prefetching for emerging workloads to sustain gains in future architectures.

Fundamentals

Definition and Purpose

A prefetcher is a hardware or software mechanism in central processing units (CPUs) that anticipates future memory accesses and proactively loads data or instructions from main memory into faster cache levels before the processor explicitly requests them. This anticipatory fetching targets patterns in program behavior to overlap memory operations with computation, minimizing idle time for the processor core. The primary purpose of a prefetcher is to mitigate the widening performance disparity between rapidly advancing CPU speeds and comparatively slower main memory access times, thereby reducing the penalties associated with cache misses and enhancing overall system throughput and execution efficiency. By prefetching likely-needed blocks into the cache hierarchy, it effectively hides memory latency without requiring changes to the core instruction execution flow. Prefetching techniques emerged prominently in the 1990s as the processor-memory performance gap began to dominate system bottlenecks, with early hardware implementations appearing in commercial processors by the late 1990s. Seminal work, including Jouppi's 1990 proposal for prefetch buffers to improve cache hit rates, laid the groundwork for these developments by demonstrating how speculative fetching could address compulsory and capacity misses in direct-mapped caches. This historical evolution underscores prefetching's role as a foundational optimization in modern CPU designs to sustain performance scaling amid persistent memory latency challenges.

Basic Mechanisms

Prefetchers operate by monitoring access patterns, such as addresses and strides, generated by the CPU's load/store unit to predict and fetch future data accesses based on historical patterns. This monitoring enables the identification of regularities in data references without requiring explicit programmer or compiler intervention in hardware-based designs. Common prediction strategies include one-block lookahead (OBL) for sequential accesses, where a prefetch is initiated for the next block (b+1) upon accessing the current block b, often triggered on a miss to exploit spatial locality. Stride detection identifies regular patterns, such as fixed increments in array traversals, by computing the difference between consecutive addresses; for addresses A_1 and A_2, the stride is S = A_2 - A_1, and the next prefetch address is A_n + S. Stream detection, meanwhile, recognizes directional data flows, such as sequential streams, by allocating buffers to prefetch multiple successive blocks from a starting address. The fetch involves loading predicted data into the cache in fixed-size blocks, typically 64 bytes, to align with the cache line size and minimize transfer overhead. To avoid contention with demand fetches for actively requested data, prefetchers employ dedicated queues or buffers that prioritize and manage speculative requests separately. Timing is critical for effectiveness: prefetching must initiate early enough to mask memory latency but not so prematurely as to pollute the cache with unused data, leading to premature evictions. This balance is controlled by the degree of prefetching, which specifies the number of blocks fetched ahead (e.g., K > 1 for aggressive prefetching), and the prefetch distance, often calculated as \delta = \lceil l / s \rceil, where l is the average memory latency in cycles and s is the cycle time of the shortest loop iteration.
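
The following C sketch, using illustrative names and constants rather than any vendor's implementation, shows the two calculations described above: deriving the next prefetch address from an observed stride, and computing the prefetch distance \delta = \lceil l / s \rceil.

```c
#include <stdio.h>

/* Minimal sketch of the stride and prefetch-distance arithmetic above.
 * All names and constants are illustrative, not a hardware description. */

/* Given two consecutive addresses from the same access stream, compute the
 * stride and the next candidate prefetch address: S = A2 - A1, next = A2 + S. */
static unsigned long next_prefetch_addr(unsigned long a1, unsigned long a2)
{
    long stride = (long)(a2 - a1);
    return a2 + stride;
}

/* Prefetch distance: delta = ceil(l / s), where l is the memory latency in
 * cycles and s is the number of cycles consumed per loop iteration. */
static unsigned prefetch_distance(unsigned latency_cycles, unsigned cycles_per_iter)
{
    return (latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}

int main(void)
{
    /* Loads at 0x1000 and 0x1040 imply a 64-byte stride, so the next
     * prefetch targets 0x1080. */
    printf("next prefetch: 0x%lx\n", next_prefetch_addr(0x1000UL, 0x1040UL));

    /* Assumed 300-cycle memory latency and 25 cycles per iteration
     * -> prefetch 12 iterations ahead. */
    printf("distance: %u iterations\n", prefetch_distance(300, 25));
    return 0;
}
```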

Types of Prefetchers

Hardware Prefetchers

Hardware prefetchers are autonomous circuits integrated into the central processing unit (CPU) that monitor memory access patterns and proactively issue prefetch requests to bring data or instructions into the cache hierarchy without requiring software intervention. These mechanisms address the memory wall by predicting future accesses based on observed behavior, thereby reducing cache miss latencies and improving overall system performance.

Common subtypes of hardware prefetchers include next-line prefetchers, stride prefetchers, and stream prefetchers. Next-line prefetchers detect sequential patterns and fetch the adjacent cache line immediately following a demand access, which is particularly effective for sequential instruction execution or contiguous data traversal. Stride prefetchers identify regular arithmetic progressions in memory addresses, such as array traversals with constant increments, by maintaining a stride table that records instruction addresses, recently accessed data addresses, and computed stride values to anticipate and prefetch future elements. Stream prefetchers extend this capability by tracking directional streams of misses—either forward or backward—across larger regions, often using dedicated stream buffers to hold prefetched blocks and correlate sequences of accesses for higher coverage in irregular but predictable patterns.

These prefetchers are typically situated near the L1 or L2 cache levels to minimize the latency of issuing requests, with predictions generated using structures like pattern history tables (PHTs) for correlating access histories or Markov models that map observed triggers to likely successor addresses. Notable implementations include Intel's stream prefetcher, which monitors miss streams at the L2 cache and generates prefetch requests for up to 16 concurrent streams with adjustable degrees and distances. Compared to software approaches, hardware prefetchers impose lower runtime overhead since they operate in parallel with instruction execution and automatically adapt to dynamic access patterns without recompilation or explicit prefetch instructions. This parallelism enables them to hide latencies effectively, often achieving coverage rates exceeding 50% for common workloads while consuming minimal additional resources.

The evolution of hardware prefetchers has progressed from basic adjacent-line fetching in early 2000s processors, such as those in the Pentium 4 series, to sophisticated multi-level designs in contemporary CPUs that combine stride, stream, and temporal prediction for broader applicability. Modern implementations, like those in recent Intel and AMD processors, incorporate feedback mechanisms and spatio-temporal correlations to balance accuracy, timeliness, and bandwidth usage, reflecting ongoing advancements to handle diverse application behaviors.
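
A rough software model of the per-instruction stride table described above might look like the following C sketch; the table size, fields, and training policy are simplifying assumptions for illustration, not a description of any commercial prefetcher.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified software model of a per-IP stride table, in the spirit of a
 * reference prediction table. Sizes and policies are illustrative only. */

#define TABLE_SIZE 64

typedef struct {
    uint64_t ip;         /* instruction address of the load */
    uint64_t last_addr;  /* last data address seen for this load */
    int64_t  stride;     /* last observed stride */
    bool     confirmed;  /* stride has repeated at least once */
    bool     valid;
} StrideEntry;

static StrideEntry table[TABLE_SIZE];

/* On each load, update the entry indexed by the instruction address and,
 * if the stride repeats, return a prefetch candidate via *prefetch_addr. */
bool stride_train(uint64_t ip, uint64_t addr, uint64_t *prefetch_addr)
{
    StrideEntry *e = &table[ip % TABLE_SIZE];

    if (!e->valid || e->ip != ip) {          /* allocate a fresh entry */
        *e = (StrideEntry){ .ip = ip, .last_addr = addr, .valid = true };
        return false;
    }

    int64_t stride = (int64_t)(addr - e->last_addr);
    e->confirmed = (stride != 0 && stride == e->stride);
    e->stride = stride;
    e->last_addr = addr;

    if (e->confirmed) {                      /* stable stride: predict next */
        *prefetch_addr = addr + (uint64_t)stride;
        return true;
    }
    return false;
}
```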

Software Prefetchers

Software prefetching refers to techniques where programmers or compilers explicitly direct the processor to load data into the cache in advance of its anticipated use, thereby overlapping memory latency with ongoing computations to mitigate cache misses. This approach employs non-blocking prefetch instructions that do not stall execution, allowing the processor to continue processing while the prefetched data is fetched asynchronously. Unlike hardware prefetchers, which operate autonomously, software prefetching provides explicit programmer or compiler control, making it suitable for complex or irregular patterns that automated mechanisms may fail to predict.

Key mechanisms include temporal and non-temporal prefetching. Temporal prefetching, such as the x86 PREFETCHT0, PREFETCHT1, and PREFETCHT2 instructions, loads data into specific cache levels based on expected reuse: PREFETCHT0 targets all levels (L1, L2, L3) for high temporal locality, PREFETCHT1 bypasses L1 to load into L2 and higher for moderate reuse, and PREFETCHT2 directs data to L3 or beyond for low locality. Non-temporal prefetching, exemplified by Intel's PREFETCHNTA, fetches data into a non-temporal structure close to the processor to minimize cache pollution, ideal for streaming workloads where data is used once without reuse, such as write streams. In ARM architectures, instructions like PLD (Preload Data) and PRFM (Prefetch Memory) initiate cache linefills on misses for cacheable addresses, retiring immediately to enable background fetching without blocking; PLD targets L1, while PRFM can direct data to specific cache levels. These mechanisms are particularly effective in loops with stride or stream access patterns, where prefetches are inserted one or more iterations ahead to hide memory latency. However, they introduce overheads, including instruction fetch costs and risks of cache pollution from mispredicted prefetches, which can increase execution time by up to 28% if not optimized.

Compilers play a central role by automatically analyzing code to insert prefetch hints, often through prefetch distance analysis that estimates latency and access strides. In GCC, the -fprefetch-loop-arrays option enables generation of prefetch instructions for loops accessing large arrays, inserting them based on target-specific support to prefetch data ahead of use. This is beneficial for streaming workloads like multimedia processing, where sequential or strided accesses dominate, potentially eliminating nearly all cache misses in suitable benchmarks.

Software prefetching excels in scenarios involving large datasets or non-uniform accesses, such as scientific simulations (e.g., PDE solvers) and database queries with irregular patterns like pointer chasing or indexed arrays. In these cases, it can improve performance by up to 45% in high-bandwidth environments by reducing memory stalls, outperforming hardware methods for patterns like sparse matrices or graph traversals. For database applications, techniques like call graph prefetching exploit predictable invocation patterns to preload query data, enhancing throughput in shared-memory systems.
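
As an illustration of these hint levels, the following C sketch uses the _mm_prefetch intrinsic (available on x86 compilers with SSE support) to contrast a temporal T0 hint against a non-temporal NTA hint; the prefetch distance DIST is an assumed value that would need tuning on real hardware.

```c
#include <xmmintrin.h>  /* _mm_prefetch and _MM_HINT_* (x86 with SSE) */
#include <stddef.h>

#define DIST 16  /* assumed prefetch distance, in array elements */

/* Sum an array expected to be reused soon: hint T0 pulls lines toward L1. */
float sum_with_reuse(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            _mm_prefetch((const char *)&a[i + DIST], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}

/* Stream through data used only once: the NTA hint asks the hardware to
 * avoid polluting the regular cache hierarchy. */
float sum_streaming(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            _mm_prefetch((const char *)&a[i + DIST], _MM_HINT_NTA);
        s += a[i];
    }
    return s;
}
```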

Hybrid Prefetchers

Hybrid prefetchers combine hardware and software techniques to leverage the strengths of both approaches, using hardware for general-purpose pattern detection and software hints for domain-specific or irregular accesses. This integration allows for adaptive prefetching in mixed workloads, such as server applications, where hardware handles common streams and software optimizes critical paths, reducing overhead while improving coverage and accuracy.

Implementation in Modern Processors

Intel Processors

Intel introduced hardware prefetchers with the Pentium 4 processor in November 2000, utilizing adjacent-line prefetching to automatically fetch the neighboring cache line when a miss occurs, thereby reducing latency for sequential accesses. This mechanism, combined with stream detection for both instructions and data, marked an early evolution toward proactive memory optimization in the NetBurst microarchitecture. Over subsequent generations, Intel's prefetching evolved into multi-tier systems integrating L1, L2, and last-level cache (LLC) components, adapting to increasingly complex workloads while balancing bandwidth demands.

Core to Intel's design is the L1 data prefetcher, which employs an instruction pointer (IP)-based stride mechanism to monitor load addresses, detect regular stride patterns (typically small multiples of cache lines), and prefetch into the L1 data cache in forward or backward directions. Complementing this, L2 prefetchers include adjacent-line fetching to capture the next 64-byte line on a miss and stream detection that tracks multiple concurrent forward or backward streams, allocating resources dynamically based on access history. Beginning in 2022, Intel added the Data Dependent Prefetcher (DDP), a content-aware system that parses memory contents—limited to cacheable, non-privileged memory—to predict and prefetch pointer-based accesses, enhancing accuracy for irregular patterns. In recent architectures such as Meteor Lake (2023) and Arrow Lake (2024), prefetchers feature expanded configurability through Model-Specific Registers (MSRs), including over 35 parameters for E-core tuning like stream distance, demand density, and throttling thresholds to optimize for diverse access behaviors. Arrow Lake builds on Meteor Lake with further optimizations for hybrid core efficiency, including adaptive throttling for P-cores.

The IP-based stride prefetcher continues to track per-instruction patterns, restarting detection at 4 KB page boundaries to adapt to new address mappings. For non-inclusive L3 caches in Xeon Scalable and client processors, an LLC prefetcher enables direct allocation into the shared cache, bypassing private L1/L2 to minimize redundancy and pollution. These prefetchers typically reduce L2 and LLC miss rates by 20-50% in memory-bound SPEC CPU benchmarks, such as those exhibiting stream or stride patterns, leading to performance uplifts of 5-15% overall. However, in low-power modes on Atom-based E-cores, they are often disabled or throttled to limit bandwidth usage and power draw.

AMD and ARM Processors

In AMD's Zen architecture family, introduced with the Ryzen processors in 2017, prefetching is implemented across multiple cache levels to enhance cache utilization while prioritizing power efficiency. The L1 data cache includes a stride prefetcher that detects regular access patterns, such as constant increments in memory addresses, to proactively fetch subsequent cache lines. Complementing this, the L2 cache employs next-line and stream prefetchers, which identify sequential accesses and track directional streams to prefetch blocks ahead of demand requests. These mechanisms are designed with aggressive throttling to balance performance gains against power and bandwidth consumption, dynamically adjusting prefetch aggressiveness based on workload characteristics and bandwidth constraints. AMD's prefetchers collectively track multiple concurrent streams, allowing for efficient handling of multiple access patterns without excessive cache pollution. Integration with the Infinity Fabric interconnect ensures coherent prefetch operations across multi-chiplet configurations, where prefetch requests propagate through the fabric to remote memory controllers while maintaining consistency.

ARM architectures emphasize low-power designs, particularly in mobile and embedded systems, with prefetching tailored to energy efficiency rather than maximum throughput. In the Cortex-A series, such as the Cortex-A510 core, the L1 data cache implements a simple hardware prefetcher that detects adjacent-line accesses and repetitive patterns, prefetching lines into L1 or higher cache levels and using virtual addresses to cross page boundaries when permitted. This approach minimizes energy overhead in battery-constrained environments. More advanced heterogeneous prefetching appears in big.LITTLE configurations, where prefetch policies adapt to non-volatile RAM (NVRAM) in mixed-memory systems, selectively prefetching from high-bandwidth or low-latency tiers based on core type (performance vs. efficiency). Server-oriented Neoverse platforms, like the Neoverse V2, incorporate predictor-based prefetchers that use machine learning-inspired models to forecast irregular accesses, including sampling indirect prefetches for pointer chasing and table-walk predictions for page tables. These enhance server workloads by improving hit rates in large caches. ARM designs support optional disabling of prefetchers via control registers, aiding low-power modes or security hardening by preventing speculative fetches that could leak information. Both AMD and ARM architectures accommodate software hints; for instance, ARM's PRFM instruction allows explicit prefetching to specific cache levels with policies like keep or stream.

Key differences between AMD and ARM prefetching lie in their optimization foci: AMD's Zen series stresses power-efficient throttling in high-performance x86 environments, scaling prefetch depth dynamically to avoid bandwidth waste, while ARM prioritizes minimal overhead in RISC-based, low-power scenarios, often with configurable disables for security or energy savings. Prefetch integration ties into coherence protocols, such as AMD's Infinity Fabric for chiplet-scale data movement and ARM's Coherent Hub Interface (CHI) for hint propagation in multi-socket systems. As of 2025, recent updates include AMD Zen 5's enhancements to prefetch algorithms, improving stride accuracy and coverage for better irregular access handling. In ARMv9, prefetchers incorporate side-channel mitigations, such as restricted speculative prefetching across security domains, to address vulnerabilities without fully disabling functionality.
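
The PRFM hints mentioned above can be emitted explicitly from C on AArch64 targets; the following sketch assumes GCC or Clang inline assembly and an arbitrary look-ahead of eight cache lines, so it illustrates the hint encoding (target level and keep/stream policy) rather than a tuned optimization.

```c
#include <stddef.h>

/* AArch64-only sketch of explicit PRFM prefetch hints. The prefetch
 * distance of 8 cache lines is an assumption, not a measured value. */
#if defined(__aarch64__)
static inline void prefetch_l1_keep(const void *p)
{
    /* Fetch toward L1 with a "keep" (temporal) retention policy. */
    __asm__ volatile("prfm pldl1keep, [%0]" :: "r"(p));
}

static inline void prefetch_l2_stream(const void *p)
{
    /* Fetch toward L2 with a "streaming" (non-temporal) policy. */
    __asm__ volatile("prfm pldl2strm, [%0]" :: "r"(p));
}

long sum_bytes(const unsigned char *buf, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 * 64 < n)
            prefetch_l1_keep(&buf[i + 8 * 64]); /* 8 cache lines ahead */
        s += buf[i];
    }
    return s;
}
#endif
```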

Configuration and Optimization

Hardware Configuration

Hardware prefetchers in modern processors can be configured through firmware interfaces such as BIOS/UEFI settings, which typically allow system-wide enabling or disabling of prefetch mechanisms. For Intel processors, the UEFI setup often includes toggles like "Hardware Prefetcher" and "Adjacent Cache Line Prefetch," enabling users to disable these features globally to mitigate potential cache pollution in specific workloads. These settings apply across all cores and take effect upon system reboot, influencing boot time by potentially reducing unnecessary memory accesses during initialization and affecting power consumption by limiting speculative fetches.

Low-level adjustments are possible using Model-Specific Registers (MSRs), providing finer granularity such as per-core control. On Intel architectures, bits 0-3 of MSR 0x1A4 (MSR_MISC_FEATURE_CONTROL) allow independent enabling or disabling of prefetchers: bit 0 for the L2 hardware prefetcher, bit 1 for the L2 adjacent cache line prefetcher, bit 2 for the L1 data cache (DCU) hardware prefetcher, and bit 3 for the L1 IP/stride prefetcher (1 = disable). Access to MSRs requires privileged tools: on Linux, msr-tools (the rdmsr/wrmsr commands) or the /dev/cpu/*/msr device files enable per-core modifications after loading the msr kernel module. On Windows, third-party utilities like RWEverything or custom drivers facilitate MSR writes, though registry keys do not directly control prefetchers.

Vendor implementations differ in configurability. Intel supports independent L1 and L2 prefetch control via the aforementioned MSRs, allowing targeted adjustments for hybrid architectures as of 2025. For AMD Zen-based processors, BIOS menus under AMD CBS (Common BIOS Settings) provide options like "Hardware Prefetcher" enable/disable and prefetch throttling controls, often system-wide, with MSRs such as C001_1022 offering per-core control of L1 prefetch disabling. Best practices recommend disabling prefetchers for latency-sensitive applications, such as gaming or real-time workloads, where aggressive prefetching can cause cache pollution by evicting useful data, leading to increased micro-stutters; enabling them suits bandwidth-bound workloads like scientific computing.
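
On Linux, the per-core MSR interface described above can be driven from a small C program; the sketch below assumes an Intel core that implements MSR 0x1A4, a loaded msr kernel module, and root privileges, and is illustrative rather than a supported configuration tool—writing MSRs carelessly can destabilize a system.

```c
/* Toggle an Intel prefetcher bit in MSR 0x1A4 (MSR_MISC_FEATURE_CONTROL)
 * through the Linux msr driver. Requires root and "modprobe msr". */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_MISC_FEATURE_CONTROL 0x1A4

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDWR);   /* CPU 0 only, for brevity */
    if (fd < 0) { perror("open msr"); return 1; }

    uint64_t val;
    if (pread(fd, &val, sizeof(val), MSR_MISC_FEATURE_CONTROL) != sizeof(val)) {
        perror("read msr"); return 1;
    }
    printf("current value: 0x%llx\n", (unsigned long long)val);

    val |= 1ULL << 0;   /* set bit 0: disable the L2 hardware prefetcher */
    if (pwrite(fd, &val, sizeof(val), MSR_MISC_FEATURE_CONTROL) != sizeof(val)) {
        perror("write msr"); return 1;
    }

    close(fd);
    return 0;
}
```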

Software Tuning

Software tuning for prefetching involves techniques at the compiler, operating system, and application levels to insert or optimize prefetch instructions, aiming to reduce memory latency without relying solely on hardware mechanisms. Compilers can automatically generate prefetch hints based on code analysis, such as loop structures and access patterns, to anticipate data needs. In compiler optimizations, tools like the Intel C++ Compiler Classic (part of oneAPI) use the -qopt-prefetch option to control the aggressiveness of automatic prefetch insertion, with levels from 0 (disabled) to 5 (most aggressive), enabling prefetches for loops and pointer-based accesses to hide cache miss latencies. Similarly, GCC supports data prefetch through built-in functions like __builtin_prefetch, which programmers or the compiler can invoke to load data into caches ahead of time, often integrated during optimization passes at -O2 or higher levels. Compilers also analyze factors like loop trip counts to determine prefetch distance, typically calculated as the ratio of memory latency to the time per loop iteration, ensuring prefetches complete just before data use without excessive bandwidth consumption.

At the OS level, tuning often involves accessing model-specific registers (MSRs) to adjust prefetch behavior dynamically. In Linux, tools like msr-tools allow writing to MSRs such as 0x1A4 (MSR_MISC_FEATURE_CONTROL) to enable or disable specific prefetchers per core; for example, bit 2 disables the L1 (DCU) data prefetcher. Windows supports similar adjustments via drivers or utilities that interface with MSRs, often used in performance-critical drivers to fine-tune prefetch behavior for I/O operations. These OS interventions complement compiler efforts by aligning system-wide prefetch behavior with workload demands.

Application-level techniques focus on manual insertion of prefetch instructions in performance hotspots, such as array traversals in numerical code. In C/C++, the _mm_prefetch intrinsic from <xmmintrin.h> loads data into specified cache levels (e.g., _MM_HINT_T0 for L1), allowing developers to hint at future accesses; for instance, prefetching array elements several iterations ahead in a loop can reduce stalls by 20-50% in memory-bound kernels. Profiling tools like Intel VTune Profiler identify prefetch opportunities by analyzing cache miss events and access streams, recommending insertion points based on runtime traces to maximize hit rates.

Advanced software tuning employs hybrid approaches, blending compiler-generated hints with manual ones and hardware configurations to optimize for specific architectures; for example, using compiler flags alongside MSR tweaks to scale prefetch distance dynamically. In multi-threaded applications, tuning must consider contention, such as limiting prefetches per thread to avoid cache pollution or bandwidth saturation, often guided by workload profiling to balance parallelism and prefetch efficacy.

To evaluate tuning effectiveness, developers monitor metrics like cache miss rates using performance counters; in Linux, the perf tool captures events such as L1-dcache-load-misses, where reductions of 10-30% post-tuning indicate successful latency hiding without increased misses elsewhere. These metrics ensure adjustments yield net gains, prioritizing conceptual improvements over exhaustive benchmarks.
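
A typical application-level insertion, sketched below with GCC/Clang's __builtin_prefetch, prefetches source and destination elements a fixed number of iterations ahead; the distance of 12 iterations is an assumed figure derived from the latency-over-iteration-time rule rather than a measured value, and would normally be chosen after profiling.

```c
#include <stddef.h>

/* Assumed distance: roughly a 300-cycle miss latency over ~25 cycles of
 * work per iteration gives ceil(300 / 25) = 12 iterations of look-ahead. */
#define PF_DIST 12

void scale(double *dst, const double *src, double k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n) {
            /* args: address, rw (0 = read, 1 = write), locality (0..3) */
            __builtin_prefetch(&src[i + PF_DIST], 0, 3);
            __builtin_prefetch(&dst[i + PF_DIST], 1, 3);
        }
        dst[i] = k * src[i];
    }
}
```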

ARM Processors

For ARM-based processors, such as Apple's M-series chips or Cortex-based SoCs, prefetchers are typically configured via firmware or OS interfaces rather than user-accessible MSRs. In Linux on ARM (e.g., Cortex-A series), sysfs interfaces or device tree overlays allow toggling prefetch-related modes, while Apple provides only limited control through macOS settings. BIOS/UEFI equivalents in embedded systems enable or disable prefetching via firmware or DTB parameters, focusing on stream and stride detection for mobile and server workloads.

Performance and Security Considerations

Benefits and Drawbacks

Prefetchers offer significant performance advantages in workloads with predictable access patterns by anticipating and fetching data into the cache ahead of time, thereby reducing average memory access latency. In predictable workloads, such as those involving sequential or strided accesses, prefetchers can decrease average access time by 20-90%, depending on the hardware implementation and benchmark characteristics. For instance, evaluations on SPEC CPU2017 benchmarks using Intel Xeon processors demonstrate that hardware prefetchers can reduce last-level cache misses by over 90% in memory-intensive applications like 510.parest_r. Additionally, prefetchers effectively hide memory latency in streaming applications, leading to improvements in instructions per cycle (IPC) of 5-20% across various workloads, as observed in studies of stream-based prefetching on SPEC CPU2000 benchmarks, where average performance gains reached 6.5%.

Despite these gains, prefetchers introduce notable drawbacks, particularly in terms of resource utilization and potential performance degradation. A primary issue is cache pollution, where inaccurately prefetched data evicts useful lines, increasing cache miss rates by up to 50% in irregular or random access patterns; in some SPEC CPU benchmarks, prefetcher-induced pollution has been shown to degrade performance by up to 24% before mitigation techniques are applied. This pollution also contributes to bandwidth waste, as unnecessary prefetches generate extra memory traffic, imposing an overhead of 10-30% in bandwidth consumption for affected workloads, though this is minimal (under a 5% increase in BPKI) for most SPEC CPU2006 benchmarks on modern systems. Furthermore, prefetchers can elevate power consumption by 5-15% in mobile CPUs due to the additional speculative memory accesses and cache management overhead, challenging energy efficiency in battery-constrained environments.

The effectiveness of prefetchers thus involves key trade-offs, performing well for sequential and regular access patterns—such as matrix traversals or database scans—while potentially harming irregular ones like graph traversals or pointer chasing, where mispredictions amplify pollution and bandwidth overheads. Incorrectly enabling or configuring prefetchers can degrade overall performance by 10-20% in database workloads with sporadic accesses, underscoring the need for workload-specific tuning. To evaluate these impacts, tools like perf for Linux-based miss rate analysis or LIKWID for hardware counter monitoring are commonly used to measure metrics such as cache miss rates and bandwidth usage. Prefetcher accuracy, a critical indicator of usefulness, is quantified via the equation: \text{Accuracy} = \left( \frac{\text{Useful prefetches}}{\text{Total prefetches}} \right) \times 100\% where useful prefetches are those consumed by the processor before eviction. High accuracy (above 70%) correlates with net benefits, while lower values highlight drawbacks like pollution.
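
As an illustrative calculation with assumed counter values rather than measured data: a prefetcher that issues 10,000 prefetches of which 8,000 lines are referenced before eviction has \text{Accuracy} = (8000 / 10000) \times 100\% = 80\%, comfortably above the 70% threshold associated with net benefit; if the same run incurs 40,000 demand misses without prefetching and 25,000 with it, coverage is (40000 - 25000) / 40000 = 37.5\%.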

Security Implications

Prefetchers, designed to anticipate and preload data into caches, can inadvertently introduce security vulnerabilities by creating observable side effects that leak sensitive information through timing, power, or cache state variations. These mechanisms enable side-channel attacks where adversaries exploit prefetch-induced behaviors to infer memory accesses or contents without direct access. For instance, prefetch instructions on AMD processors exhibit timing and power differences that allow unprivileged user-space code to leak kernel data, as demonstrated in a 2022 analysis showing kernel memory leakage at up to 59 B/s. Similarly, speculative execution attacks like Meltdown can be extended by triggering hardware prefetchers during transient instruction execution, pulling privileged data into shared caches for subsequent measurement via timing channels, with the original vulnerability disclosure highlighting prefetchers as a vector for amplifying leaks beyond direct speculative reads.

Specific vulnerabilities have been identified across major architectures. Data memory-dependent prefetchers (DMPs), which inspect memory contents for pointer-like values to initiate fetches, enable leaks of data at rest—information never architecturally loaded into the core—through transient behavior that brings otherwise inaccessible memory into the cache hierarchy; such vulnerabilities were demonstrated on Apple silicon in attacks disclosed in 2022. AMD Zen architectures aid prime+probe attacks by prefetching patterns that alter cache occupancy, allowing adversaries to measure eviction times for cross-core inference of victim activity, with variations in prefetch aggressiveness exacerbating leakage in shared last-level caches, as shown in the 2025 ZenLeak analysis. In the ARM Cortex-A series, such as the Cortex-A72, prefetchers create side channels by prefetching based on observed access patterns, leaking address information across security domains even after mitigations like FEAT_CSV2, as characterized in a 2023 study updated with 2025 disclosures confirming impacts on Armv8 implementations.

Mitigation strategies focus on restricting prefetcher behavior or isolating sensitive operations. Operating systems like Linux can disable hardware prefetchers via model-specific registers (MSRs) during security-critical code execution, such as in kernel modules handling cryptographic material, to prevent leakage while preserving performance in non-sensitive contexts. Software techniques include non-temporal prefetch hints (e.g., PREFETCHNTA on x86), which direct data into non-temporal structures to avoid polluting shared caches and limit side effects. Hardware-level protections encompass speculation barriers in ARMv9, such as enhanced speculation features that fence prefetch operations dependent on secrets, reducing transient fetches. Vendors have issued updates, including Intel's 2022 microcode revisions that constrain data-dependent prefetching to write-back memory types and cap prefetch depth, effectively mitigating content-dependent leaks without full disablement. Despite these advances, research gaps persist, particularly in post-2020 coverage of prefetcher security, with traditional encyclopedic sources overlooking recent exploits such as AMD power-based prefetch attacks, GoFetch (a 2024 DMP attack on constant-time cryptography in Apple silicon), and 2025 last-level cache attacks on AMD processors.

Emerging neural network-based prefetchers, which use learned models to predict access patterns from historical data, introduce novel risks such as model inversion attacks that could reconstruct training data (including sensitive addresses) from observable prefetch decisions, though defenses for such models remain underexplored as of 2025. These vulnerabilities have significant impacts, notably enabling cross-virtual-machine (VM) attacks in cloud environments where prefetchers shared across tenants erode isolation guarantees, allowing one VM to infer another's memory layout or keys via co-located contention, as evidenced in multi-tenant scenarios.
