
Pseudo-LRU

Pseudo-LRU (PLRU) is a cache replacement policy that approximates the Least Recently Used (LRU) algorithm in set-associative caches, providing a low-overhead mechanism to select eviction victims based on an approximate recency ordering of cache lines. By employing a binary tree structure, PLRU maintains a partial order of usage recency with minimal hardware resources, making it a practical choice for processors where a full LRU implementation would incur excessive storage and update costs. The core of tree-based PLRU involves associating a complete binary tree with each set, where the leaves represent the individual cache ways and each internal node holds a single bit indicating the direction to the older child subtree. On a hit or insertion, the PLRU bits along the path from the accessed leaf to the root are set to point away from the accessed line, promoting its recency and ensuring that recently used blocks are less likely to be evicted. For replacement on a miss, the algorithm traverses the tree by following the bits pointing to the older subtrees until reaching a leaf, which identifies the approximate least recently used victim. This design requires only n-1 bits per set for an n-way associative cache and performs updates in O(log n) time, starkly contrasting with true LRU's O(n) update complexity and O(n log n) storage needs. PLRU offers significant advantages in hardware efficiency, enabling its use in the L1 and L2 caches of state-of-the-art processors, where it achieves miss rates very close to those of LRU while reducing power consumption through fewer bit transitions. In last-level caches, PLRU serves as a foundation for advanced variants, such as those incorporating dynamic insertion and promotion techniques, which further mitigate issues like cache thrashing and improve performance on memory-intensive workloads by up to 15% over baselines. Its balance of simplicity and effectiveness has made PLRU a staple in commercial and research designs, outperforming simpler policies such as random or FIFO replacement by 20% or more in miss reduction.

Background

Cache Replacement Fundamentals

In cache systems, replacement refers to the mechanism for selecting a cache line to evict when a set is full and a new block must be brought in following a miss. This process is essential in set-associative caches, where multiple lines share a set, to maintain efficient operation without excessive conflicts. Common replacement policies include first-in first-out (FIFO), which evicts the line that has been in the set the longest; Random replacement, which selects a victim without maintaining access history; and Most Recently Used (MRU), which prioritizes evicting the line most recently accessed, often beneficial for workloads with sequential or stack-like access patterns. These policies vary in complexity, with FIFO and Random requiring minimal tracking compared to more sophisticated approaches. Cache associativity plays a key role in replacement needs: direct-mapped caches require no replacement decision, as each block maps to exactly one line, while set-associative designs group lines into sets (e.g., 4-way means four lines per set), reducing conflict misses but demanding a replacement decision when sets fill. Higher associativity improves hit rates by allowing more flexible placement but escalates the computational and storage demands of selecting victims efficiently. Replacement policies are evaluated using metrics such as miss rate (the proportion of memory accesses not found in the cache) and hit rate (one minus the miss rate), which directly impact overall system performance. Hardware overhead, typically quantified in bits per set for tracking state (e.g., counters or flags), balances effectiveness against area and power costs in implementation. The Least Recently Used (LRU) policy serves as a reference for capturing temporal locality by evicting the least recently accessed line.
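
As a small concrete sketch of these quantities, the following Python snippet derives the set count of a set-associative cache from its capacity, line size, and associativity, maps an address to its set, and computes miss and hit rates from access counts; the specific sizes are arbitrary example values, not drawn from any particular processor.

# Example cache geometry (arbitrary assumed values, not a specific processor).
CACHE_BYTES = 32 * 1024      # total capacity: 32 KiB
LINE_BYTES = 64              # cache line size: 64 bytes
WAYS = 4                     # 4-way set associative

num_lines = CACHE_BYTES // LINE_BYTES   # 512 lines in total
num_sets = num_lines // WAYS            # 128 sets of 4 lines each

def set_index(addr):
    # Drop the offset bits within a line, then select a set by modulo.
    return (addr // LINE_BYTES) % num_sets

# Miss rate and hit rate from raw access counts.
accesses, misses = 1000, 150
miss_rate = misses / accesses            # 0.15
hit_rate = 1 - miss_rate                 # 0.85
print(num_sets, set_index(0x1F40), miss_rate, hit_rate)   # 128 125 0.15 0.85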

Limitations of Exact LRU

The exact Least Recently Used (LRU) replacement policy maintains a precise ordering of cache lines within each set based on their recency of access, typically implemented by tracking the exact age of every line using a sorted list or by assigning and incrementing counters for each line upon access. This approach ensures that the least recently accessed line is evicted first, providing a strong baseline for workloads that exhibit temporal locality. However, implementing true LRU in hardware incurs significant overhead due to the need for extensive comparison logic; for an N-way associative cache, a full LRU requires approximately N*(N-1)/2 bits per set in a pairwise comparison-matrix implementation to maintain the ordering on each access, along with additional storage for pointers or timestamps. This results in substantial increases in area and power consumption, particularly as associativity grows, making it resource-intensive even for moderate associativities. For instance, the replacement state alone can dominate the cache's tag array area in designs with many ways. Scalability becomes a critical barrier for exact LRU in modern processors, where L1 and L2 caches often employ 8-way or higher associativity to improve hit rates without excessive capacity. Beyond 4-8 ways, the quadratic growth in complexity renders full LRU impractical, as the added area and dynamic power from frequent comparisons outweigh the marginal gains. Studies indicate significant area increases for high-associativity exact LRU compared to simpler policies while offering only modest miss rate reductions. Exact LRU was used in some early low-associativity cache designs. As cache sizes and associativities increased to meet rising demands, architects shifted toward approximations to mitigate these costs, paving the way for more efficient alternatives like Pseudo-LRU.
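
To illustrate the quadratic growth concretely, a comparison-matrix implementation needs N*(N-1)/2 bits per set: 6 bits at 4 ways, 28 bits at 8 ways, and 120 bits at 16 ways, whereas the tree-based Pseudo-LRU described below needs only N-1 bits (3, 7, and 15 bits, respectively).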

Core Concepts

Approximation Strategies

Pseudo-LRU (PLRU) serves as an efficient approximation to the exact Least Recently Used (LRU) cache replacement policy by employing simple heuristics to estimate the relative recency of cache lines without maintaining a complete ordering among them. This approach avoids the need for tracking every access precisely, instead relying on lightweight state that updates in O(\log n) time and occupies only n-1 bits per cache set in tree-based variants. By focusing on likely eviction candidates based on recent access patterns, PLRU effectively captures temporal locality in workloads while drastically reducing implementation complexity in hardware. The fundamental strategy in PLRU is to impose a partial ordering on the lines within a set, which guides the selection of the victim line during eviction. This partial order is maintained through bit fields or tree structures that indicate which subsets of lines are more or less recently used, allowing the policy to approximate the LRU ordering under common access patterns exhibiting temporal locality. Unlike exact LRU, which requires updating positions for all lines on each access, PLRU limits updates to a small number of bits, ensuring low latency and power consumption. Such approximations enable PLRU to mimic LRU's decisions in most cases, prioritizing lines that have not been accessed recently while tolerating minor inaccuracies for the sake of efficiency. In terms of storage efficiency, an N-way set-associative cache using true LRU demands approximately N \log_2 N bits per set to encode the full ordering of line recencies, scaling poorly with associativity. In contrast, tree-based PLRU achieves comparable functionality with only N-1 bits per set, a reduction of roughly a factor of \log_2 N that yields significant savings in hardware area and power. This bit efficiency stems directly from the partial ordering, which encodes only enough information to distinguish probable victims without full precision. The primary advantages of PLRU approximation strategies lie in their balance of performance and resource demands, yielding miss rates that closely track those of LRU—typically within 1-5% in evaluations—while incurring far lower complexity. For instance, evaluations on SPEC CPU2000 workloads demonstrate that PLRU variants maintain near-LRU accuracy across diverse cache organizations, occasionally even outperforming LRU in highly associative designs. These benefits make PLRU a practical choice for real-world systems where exact LRU's costs are prohibitive.
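
To make the storage comparison concrete, the short Python sketch below tabulates the replacement-state bits per set implied by these formulas for a few associativities; the minimal ⌈\log_2 N!⌉ permutation encoding is included as an extra reference point of ours, not a figure from the text.

# Replacement-state bits per set for an N-way set, per the formulas above.
from math import ceil, log2, factorial

def state_bits(n_ways):
    return {
        "full LRU, N*log2(N) encoding": n_ways * int(log2(n_ways)),
        "full LRU, minimal ceil(log2(N!))": ceil(log2(factorial(n_ways))),
        "tree-PLRU, N-1 bits": n_ways - 1,
        "bit-PLRU, N bits": n_ways,
    }

for ways in (4, 8, 16):
    print(ways, state_bits(ways))
# 16-way: 64 bits for N*log2(N) full LRU vs. 15 bits for tree-PLRU,
# roughly the factor-of-log2(N) saving discussed above.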

Update and Eviction Processes

In Pseudo-LRU policies, the update and eviction processes are designed to approximate the recency order of cache lines efficiently, using lightweight state structures to track usage without the full overhead of exact LRU maintenance. These processes handle cache hits by promoting accessed lines in the approximation and manage misses by selecting and replacing a victim line before inserting new data, keeping the policy hardware-friendly for high-associativity caches. On a cache hit, the policy updates its state to mark the accessed line as recently used, adjusting the structure—such as bits or pointers—to reflect its promoted position in the recency order. This update avoids scanning or reordering all lines, instead targeting only the relevant portions of the structure, thereby simulating the movement of the line toward the most recently used end of the LRU stack. For example, in tree-based PLRU, this involves setting the bits along the path from the accessed leaf to the root so that future traversals are directed away from it. On a cache miss, if the cache set is not full, the new line is simply inserted, and its state is updated to mark it as recently used, similar to a hit. When the set is full, however, the process first selects a victim line using the current approximation of the least recently used position, evicts it to free space, inserts the new line in the vacated slot, and then updates the state to promote the new line to a recent position. This sequence ensures that insertions prioritize recency for the incoming data while approximating eviction of the oldest line. Eviction selection in Pseudo-LRU relies on querying the state structure to identify a likely least recently used line without a full scan of the set, leveraging the approximation to guide the choice efficiently. The structure is traversed or evaluated based on its recency indicators—such as following "least recent" pointers in a tree or scanning for specific bit patterns in a bit-based scheme—to reach a candidate victim, which balances accuracy with low latency and hardware cost. For instance, in bit-PLRU, the victim is chosen as a line whose bit indicates it has not been recently used. A high-level pseudocode outline for the combined update and eviction processes in Pseudo-LRU policies is as follows:
if access is hit:
    update_structure(accessed_line)   // Promote accessed line in the approximation
else:  // Miss
    if cache_set is full:
        victim = select_victim()      // Query structure for the approximated LRU line
        evict(victim)
    insert(new_line)
    update_structure(new_line)        // Promote new line in the approximation
This procedural workflow applies across Pseudo-LRU variants, enabling fast replacement decisions in hardware implementations.
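
For concreteness, a minimal executable rendering of this workflow is sketched below in Python; the `touch`/`victim` method names and the `CacheSet` class are illustrative assumptions mirroring the pseudocode rather than a standard API, and eviction is modeled simply as overwriting the victim way.

# Minimal sketch of the workflow above for a single cache set, in Python.
# `policy` is any object exposing touch(way) and victim() -- e.g. the
# Tree-PLRU or Bit-PLRU classes sketched in the sections that follow.

class CacheSet:
    def __init__(self, n_ways, policy):
        self.lines = [None] * n_ways      # stored tags; None marks an empty way
        self.policy = policy

    def access(self, tag):
        if tag in self.lines:             # hit: promote the line in the approximation
            self.policy.touch(self.lines.index(tag))
            return True
        if None in self.lines:            # miss, set not full: use an empty way
            way = self.lines.index(None)
        else:                             # miss, set full: pick the approximated LRU way
            way = self.policy.victim()
        self.lines[way] = tag             # eviction is modeled as overwriting the way
        self.policy.touch(way)            # promote the newly inserted line
        return False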

Algorithms

Tree-PLRU

Tree-based Pseudo-LRU (Tree-PLRU) is a hierarchical approximation of the least recently used (LRU) policy, utilizing a binary tree structure to track the relative recency of cache lines with minimal overhead. The tree has N leaves corresponding to the N lines in a set, with each internal node storing a single bit that indicates which subtree contains the least recently used line (0 for the left subtree, 1 for the right). This results in exactly N-1 bits of state per set, one per internal node, significantly less than the O(N log N) bits required for exact LRU. Upon a cache hit, the update process traverses the tree from the root to the accessed leaf, setting the bit in each internal node along the path so that it points away from the accessed subtree, effectively promoting the hit line toward a pseudo-most-recently-used (PMRU) position. This operation approximates moving the line to the MRU position without maintaining a full ordering. For eviction on a miss, the replacement policy starts at the root and follows the bits downward—traversing left if the bit is 0 and right if 1—until reaching a leaf, which identifies the pseudo-LRU victim for replacement. Both operations require O(log N) time and can be implemented with a tree of multiplexers for efficient hardware realization. To illustrate for a 4-way associative cache (N=4), the tree consists of three internal nodes—a root and two children—with leaves labeled A, B, C, D from left to right. Each internal node bit points to the subtree presumed to hold the LRU line. Suppose the initial bits are root=0 (LRU in the left half), left child=1 (LRU in the right of the left half, i.e., B), and right child=0 (LRU in the left of the right half, i.e., C). Eviction would follow root=0 → left child=1 → B as the victim. Consider an access to A: the update traverses root-left-left, setting the root to 1 (now pointing to the right half as LRU) and the left child to 1 (unchanged, pointing to the right of the left half as LRU). The bits are now root=1, left child=1, right child=0, so eviction would go root=1 → right child=0 → C. A subsequent access to D traverses root-right-right, setting the root to 0 (now pointing to the left half as LRU) and the right child to 0 (unchanged, pointing to the left of the right half as LRU). The structure approximates recency but may fail to distinguish fine-grained orders, potentially evicting recently accessed lines under sequential or thrashing access patterns. Tree-PLRU's advantages include its simplicity in hardware—requiring only N-1 bits and a root-to-leaf traversal for updates and victim searches—making it suitable for small to medium associativities (4-16 ways) where exact LRU is costly. However, its approximation can lead to suboptimal performance compared to exact LRU in workloads with complex recency patterns, though it often achieves 80-95% of LRU's hit rate with far lower complexity. Unlike flat bit-vector approaches such as Bit-PLRU, the tree provides a more nuanced partial ordering at the expense of slightly higher update latency.
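
A compact Python sketch of this tree scheme follows (for a power-of-two number of ways, with the N-1 node bits kept in a flat heap-ordered array); the class and method names are illustrative choices, and the demo at the end reproduces the 4-way walk-through just described.

class TreePLRU:
    # Tree-based Pseudo-LRU state for one cache set; n_ways must be a power
    # of two. bits[i] belongs to internal node i in heap layout (children of
    # node i are 2*i+1 and 2*i+2): 0 means the presumed-LRU line lies in the
    # left subtree, 1 means it lies in the right subtree, as described above.

    def __init__(self, n_ways):
        assert n_ways >= 2 and (n_ways & (n_ways - 1)) == 0
        self.n = n_ways
        self.bits = [0] * (n_ways - 1)

    def victim(self):
        # Follow the bits from the root down to a leaf: that leaf is the
        # pseudo-LRU way chosen for eviction.
        node, span, way = 0, self.n, 0
        while span > 1:
            span //= 2
            if self.bits[node] == 0:       # presumed LRU in the left half
                node = 2 * node + 1
            else:                          # presumed LRU in the right half
                way += span
                node = 2 * node + 2
        return way

    def touch(self, way):
        # On a hit or insertion, set every bit on the root-to-leaf path to
        # point away from the accessed way (promote it toward pseudo-MRU).
        node, lo, span = 0, 0, self.n
        while span > 1:
            span //= 2
            if way < lo + span:            # accessed way is in the left half
                self.bits[node] = 1        # so the LRU candidate is on the right
                node = 2 * node + 1
            else:                          # accessed way is in the right half
                self.bits[node] = 0        # so the LRU candidate is on the left
                lo += span
                node = 2 * node + 2


# Reproducing the 4-way example above (ways A, B, C, D = 0, 1, 2, 3):
t = TreePLRU(4)
t.bits = [0, 1, 0]                 # root=0, left child=1, right child=0
assert t.victim() == 1             # eviction would select B
t.touch(0)                         # access A
assert t.bits == [1, 1, 0] and t.victim() == 2   # next victim is C
t.touch(3)                         # access D
assert t.bits == [0, 1, 0]         # root and right child point left again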

Bit-PLRU

Bit-PLRU utilizes an array of Most Recently Used (MRU) bits, one per cache way, requiring N bits for an N-way set-associative cache to represent a pseudo-ordering of access recency. These bits collectively approximate the relative ages of lines by tracking recent accesses in a linear, index-ordered manner. The update mechanism operates on hits or after inserting a new line on a miss: the MRU bit for the accessed way is set to 1, marking it as recently used and effectively shifting it toward a pseudo-MRU position in the ordering. If all MRU bits are already 1 prior to the update, the entire array is reset to 0 to ensure ongoing operation, simulating a circular reset that prevents selection deadlocks. This bit-setting approach requires only simple logic, such as an OR gate per bit, for efficient hardware implementation. For eviction on a miss, the victim is chosen as the lowest-indexed way with an MRU bit of 0, prioritizing earlier indices among non-recently-accessed lines to approximate LRU behavior without full ordering. If no such bit exists (all bits are 1), way 0 is selected as the victim, with the reset occurring as part of the update after insertion. This selection can be performed via a priority encoder or sequential scan, enabling low-latency decisions in hardware. Consider an 8-way set with initial MRU bits all at 0 (00000000). On sequential misses:
  • First miss: Evict way 0, insert the new line, set bit 0 to 1 (10000000).
  • Second miss: Evict way 1, insert, set bit 1 to 1 (11000000).
  • Continuing this pattern through the eighth miss: all bits reach 1 (11111111).
  • Ninth miss: All bits are 1, so evict way 0, reset to 00000000, insert, and set bit 0 to 1 (10000000).
This sequence demonstrates the rotational history, where the bit pattern cycles through progressively filled states and periodic resets, mimicking a shifting window of recency for ordered accesses. Bit-PLRU offers advantages in minimal wiring complexity and rapid VLSI integration due to its flat structure and O(1) update/eviction operations via bitwise logic. However, it can yield poorer eviction decisions under some access patterns, as the fixed index bias leads to suboptimal evictions compared to hierarchical alternatives like tree-PLRU. Historically, it saw early adoption in hardware designs for its low overhead, enabling efficient cache management without the area costs of exact LRU.
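
The following Python sketch implements the specific MRU-bit variant described above (lowest-indexed zero bit as the victim, way 0 as the fallback, reset when all bits are already set prior to an update); the class and method names are illustrative, and the short demo reproduces the 8-way sequential-miss sequence.

class BitPLRU:
    # Bit-based Pseudo-LRU (MRU bits) for one cache set, following the
    # variant described above: one MRU bit per way, whole-array reset when
    # every bit is already 1 prior to an update, lowest-indexed zero bit as
    # the victim, and way 0 as the fallback when all bits are set.

    def __init__(self, n_ways):
        self.mru = [0] * n_ways

    def victim(self):
        for way, bit in enumerate(self.mru):   # lowest-indexed way with bit 0
            if bit == 0:
                return way
        return 0                               # all bits set: fall back to way 0

    def touch(self, way):
        if all(self.mru):                      # every way marked recently used
            self.mru = [0] * len(self.mru)     # circular reset
        self.mru[way] = 1                      # mark the accessed/inserted way


# Reproducing the 8-way sequential-miss walk-through above:
b = BitPLRU(8)
for i in range(8):
    w = b.victim(); b.touch(w)                 # misses 1-8 evict ways 0..7 in order
    assert w == i
assert b.mru == [1] * 8                        # 11111111 after the eighth miss
w = b.victim(); b.touch(w)                     # ninth miss: all 1s -> way 0, then reset
assert w == 0 and b.mru == [1, 0, 0, 0, 0, 0, 0, 0]   # 10000000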

Applications

Hardware Implementations

Pseudo-LRU replacement policies have evolved from research proposals on efficient cache management in the 1990s to practical implementations in commercial processors, becoming a standard choice for multi-core designs due to their balance of performance and overhead. Early work on approximations to full LRU, including tree-based variants, addressed the scalability issues of exact LRU in higher-associativity caches, paving the way for adoption in production systems. In CPU designs, Pseudo-LRU has seen notable adoption, such as in the PowerPC 7450 from the early 2000s, where a PLRU policy was implemented for its 8-way set-associative L1 caches to manage replacements efficiently. ARM processor families commonly employ variants of Pseudo-LRU in their caches to reduce the complexity of tracking access order. Intel processors, such as those in the Skylake family, utilize variants of tree-PLRU in their caches to approximate LRU behavior while minimizing state storage. These implementations typically target set-associative caches with 8 to 32 ways, where full LRU would require excessive resources, and Pseudo-LRU bits are integrated directly into cache tag arrays for on-the-fly updates. In some designs, Pseudo-LRU is combined with other policies in hybrid approaches to further optimize for specific workloads, such as blending recency with frequency tracking. Beyond CPUs, Pseudo-LRU variants appear in graphics processing units (GPUs), where they manage cache replacements to handle high-throughput memory accesses with low overhead. In network systems, such as Named Data Networking caches, Pseudo-LRU policies improve hit rates and reduce latency by approximating least-recently-used eviction in content routers. For solid-state drive (SSD) controllers, modified Pseudo-LRU algorithms such as PUD-LRU are used in write buffers to minimize writes to flash, indirectly supporting wear-leveling by distributing erasures more evenly across cells.

Performance Analysis

Pseudo-LRU replacement policies generally exhibit miss rates that are 1-10% higher than true least recently used (LRU) in evaluations on SPEC CPU benchmarks, depending on associativity and workload characteristics. For instance, in SPEC CPU2000 simulations for L1 data caches, mask-based and tree-based variants of Pseudo-LRU (PLRUm and PLRUt) showed miss rates within 0.1-2% of LRU for 2- to 4-way associativities, while outperforming first-in first-out (FIFO) and random policies by 8-16% in miss rate reduction. In SPEC CPU2006 last-level cache evaluations, an enhanced tree-based Pseudo-LRU variant achieved 91% of LRU's misses (equivalently, about 9% fewer misses) on a 16-way 4MB cache. Compared to simpler policies, Pseudo-LRU provides 20-50% lower miss rates than FIFO or random replacement in memory-intensive workloads, as those policies lack locality awareness. Hardware overhead for Pseudo-LRU is substantially lower than for true LRU, with area savings of 20-30% in typical implementations due to reduced state storage. For a 16-way cache, tree-based Pseudo-LRU requires 15 bits per set versus 64 bits for LRU, yielding a 77% reduction in storage overhead, which translates to lower area. Power consumption is also reduced owing to fewer comparators and simpler update logic in the tree or bit structures. Update latency benefits from O(log N) complexity in tree-based variants, compared to linear or worse cost for full LRU maintenance in high-associativity caches. Pseudo-LRU performs well on workloads with strong sequential or temporal locality, where its tree approximation closely mimics LRU behavior, but it can underperform in thrashing scenarios with weak locality due to imprecise ordering. Studies on tree-based insertion policies highlight sensitivity to access patterns, with set-dueling mechanisms helping adapt to sequential accesses while multiple insertion policy variants (IPVs) mitigate thrashing by balancing promotion and eviction. Recent improvements integrate Pseudo-LRU with hybrid policies like dynamic insertion and promotion (DIP) or learning-based approaches such as Hawkeye, achieving 5-9% speedups over baseline LRU in multi-core SPEC CPU2006 setups by refining insertion points. Formal verification techniques have identified bugs in Pseudo-LRU implementations, such as erroneous evictions of recently used lines or inaccessibility of cache ways, helping ensure reliability without performance loss. As a quantitative example, in an 8-way cache, tree-based Pseudo-LRU uses 7 bits per set (versus 16 for LRU), delivering approximately 95% of LRU's accuracy while reducing bit overhead by more than half.
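
As a check on these figures: for a 16-way set, exact LRU encoded with N \log_2 N bits requires 16 * 4 = 64 bits, versus N - 1 = 15 bits for tree-PLRU, and 1 - 15/64 ≈ 77%. For an 8-way set, the 16-bit LRU figure matches the minimal encoding of a full recency permutation, ⌈\log_2 8!⌉ = 16 bits (this interpretation is an assumption on our part; a per-way N \log_2 N encoding would instead need 24 bits), versus 8 - 1 = 7 bits for tree-PLRU.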
