Pseudo-LRU
Pseudo-LRU (PLRU) is a cache replacement policy that approximates the Least Recently Used (LRU) algorithm in set-associative caches, providing a low-overhead mechanism to select eviction victims based on approximate recency ordering of cache lines.[1] By employing a binary tree structure, PLRU maintains a partial order of usage recency with minimal hardware resources, making it a practical choice for processors where full LRU implementation would incur excessive storage and update costs.[2]
The core of tree-based PLRU involves associating a complete binary tree with each cache set, where the leaves represent the individual cache ways and each internal node holds a single bit indicating which child subtree holds the less recently used lines.[2] On a cache hit or insertion, the PLRU bits along the path from the accessed leaf to the root are set to point away from the accessed line, promoting its recency and ensuring that recently used blocks are less likely to be evicted.[1] For replacement on a miss, the algorithm traverses the tree from the root by following the bits toward the older subtrees until reaching a leaf, which identifies the approximate least recently used victim.[2] This design requires only n-1 bits per set for an n-way associative cache and performs updates in O(log n) time, in contrast with true LRU's O(n) update complexity and O(n log n) storage needs.[1]
PLRU offers significant advantages in hardware efficiency, enabling its use in L1 and L2 caches of state-of-the-art processors, where it achieves miss rates very close to those of LRU while reducing power consumption through fewer bit transitions.[3] In last-level caches, PLRU serves as a foundational policy for advanced variants, such as those incorporating dynamic insertion and promotion techniques, which further mitigate issues like cache pollution and improve performance on memory-intensive workloads by up to 15% over baselines.[2] Its balance of simplicity and effectiveness has made PLRU a staple in commercial and research cache designs, outperforming simpler policies like random or FIFO by 20% or more in miss reduction.[3]
Background
Cache Replacement Fundamentals
In cache systems, replacement refers to the mechanism for selecting a victim cache line to evict when a set is full and a new block must be brought in following a miss.[4] This process is essential in set-associative caches, where multiple lines share a set, to maintain efficient data access without excessive conflicts.[5]
Common replacement policies include First-In-First-Out (FIFO), which evicts the line that has been in the set the longest; Random replacement, which selects a victim without maintaining access history; and Most Recently Used (MRU), which prioritizes evicting the line most recently accessed, often beneficial for workloads with sequential or stack-like access patterns.[4][6] These policies vary in complexity, with FIFO and Random requiring minimal tracking compared to more sophisticated approaches.[7]
Cache associativity plays a key role in replacement needs: direct-mapped caches have no choice, as each block maps to one line, while set-associative designs group lines into sets (e.g., 4-way means four lines per set), reducing conflict misses but demanding a replacement decision when sets fill.[5] Higher associativity improves hit rates by allowing more flexible mapping but escalates the computational and hardware demands of selecting victims efficiently.[8]
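As a simplified illustration of this mapping, the following sketch shows how an address selects a set, after which the replacement policy only chooses among the ways of that one set (the block size and cache geometry below are arbitrary example values, not taken from any cited design):

    # Toy set-associative mapping: the replacement decision only arises within one set
    BLOCK_SIZE = 64        # bytes per cache line (example value)
    NUM_SETS = 128         # number of sets (example value)
    WAYS = 4               # 4-way associativity: four candidate lines per set

    def set_index(address: int) -> int:
        """Set that an address maps to; replacement then chooses among WAYS lines there."""
        return (address // BLOCK_SIZE) % NUM_SETS

    # Two addresses exactly NUM_SETS blocks apart map to the same set and can conflict
    a, b = 0x1234, 0x1234 + BLOCK_SIZE * NUM_SETS
    assert set_index(a) == set_index(b)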
Replacement policies are evaluated using metrics such as miss rate (the proportion of memory accesses not found in the cache) and hit rate (one minus the miss rate), which directly impact overall system performance.[5] Hardware overhead, typically quantified in bits per cache set for tracking state (e.g., counters or flags), balances effectiveness against area and power costs in implementation.[7] The Least Recently Used (LRU) policy serves as a reference for capturing temporal locality by evicting the least recently accessed line.[9]
Limitations of Exact LRU
The exact Least Recently Used (LRU) replacement policy maintains a precise ordering of cache lines within each set based on their recency of access, typically implemented by tracking the age of every line with a sorted list or by assigning and updating a counter per line on each access. This approach ensures that the least recently accessed line is always evicted first, which works well for workloads with strong temporal locality.
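For contrast with the approximations discussed later, exact LRU for a single set can be modelled in a few lines of software (an illustrative sketch only; hardware implementations use per-line counters or ordering matrices rather than lists):

    class ExactLRUSet:
        """Exact LRU for one cache set, modelled as a recency-ordered list of tags."""

        def __init__(self, num_ways):
            self.num_ways = num_ways
            self.lines = []                    # index 0 = most recently used

        def access(self, tag):
            """Return True on a hit; on a miss, insert tag, evicting the LRU line if full."""
            if tag in self.lines:
                self.lines.remove(tag)         # hit: move the line to the MRU position
                self.lines.insert(0, tag)
                return True
            if len(self.lines) == self.num_ways:
                self.lines.pop()               # set full: evict the least recently used line
            self.lines.insert(0, tag)          # miss fill enters at the MRU position
            return False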
However, implementing true LRU in hardware incurs significant overhead due to the need for extensive comparison logic; for an N-way associative cache, a full LRU mechanism requires approximately N(N-1)/2 comparators to maintain the ordering on each access, along with additional storage for pointers or timestamps. This results in substantial increases in area and power consumption, particularly as associativity grows, making it resource-intensive even for moderate cache sizes. For instance, the comparator network alone can dominate the cache's tag array area in high-associativity designs.
Scalability becomes a critical barrier for exact LRU in modern processors, where L1 and L2 caches often employ 8-way or higher associativity to improve hit rates without increasing capacity. Beyond 4-8 ways, the quadratic growth in hardware complexity renders full LRU impractical, as the added silicon area and dynamic power from frequent comparisons outweigh the marginal performance gains. Studies indicate significant area increases for high-associativity exact LRU compared to simpler policies, while offering only modest miss rate reductions.[1]
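To make the quadratic growth concrete, the following short loop evaluates the N(N-1)/2 comparator estimate quoted above for a few associativities (an illustrative calculation, not a gate-level model):

    # Comparators needed to maintain a full recency ordering, using the N(N-1)/2 estimate
    for ways in (4, 8, 16, 32):
        print(f"{ways:2d}-way set: ~{ways * (ways - 1) // 2:3d} comparators")
    # 4-way: ~6, 8-way: ~28, 16-way: ~120, 32-way: ~496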
Exact LRU was used in some early low-associativity caches from the 1980s. By the late 1980s and into the 1990s, as cache sizes and associativities increased to meet rising memory demands, processor architects shifted toward approximations to mitigate these costs, paving the way for more efficient alternatives like Pseudo-LRU.
Core Concepts
Approximation Strategies
Pseudo-LRU (PLRU) serves as an efficient approximation to the exact Least Recently Used (LRU) cache replacement policy by employing simple heuristics to estimate the relative recency of cache lines without maintaining a complete total order among them. This approach avoids tracking every access precisely, relying instead on lightweight state that, in tree-based variants, is updated in O(log n) time and occupies only n-1 bits per cache set. By focusing on likely eviction candidates based on recent access patterns, PLRU effectively captures temporal locality in workloads while drastically reducing implementation complexity in hardware.
The fundamental approximation strategy in PLRU involves imposing a partial ordering on the cache lines within a set, which guides the selection of the victim line during eviction. This partial order is maintained through bit fields or tree structures that indicate which subsets of lines are more or less recently used, allowing the policy to approximate the ideal LRU behavior under common access patterns exhibiting temporal locality. Unlike exact LRU, which requires updating positions for all lines on each access, PLRU limits updates to a small number of bits, ensuring low latency and power consumption. Such strategies enable PLRU to mimic LRU's eviction decisions in most cases, prioritizing lines that have not been accessed recently while tolerating minor inaccuracies for the sake of efficiency.
In terms of storage efficiency, an N-way set-associative cache using true LRU demands approximately N log2 N bits per set to encode the full permutation of line recencies, scaling poorly with associativity. In contrast, tree-based PLRU achieves comparable functionality with only N-1 bits per set, a substantial reduction in hardware area and power. This bit efficiency stems directly from the partial ordering, which encodes only enough information to distinguish probable victims without full precision.
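The same comparison can be tabulated directly from the expressions above (the N log2 N estimate for true LRU and the N-1 bits for tree-based PLRU come from the text; the script itself is only illustrative):

    import math

    # Recency-state bits per set: true LRU (~N*log2(N)) versus tree-based PLRU (N-1)
    for ways in (4, 8, 16, 32):
        lru_bits = ways * int(math.log2(ways))   # one log2(N)-bit rank per way
        plru_bits = ways - 1                     # one bit per internal tree node
        print(f"{ways:2d}-way: true LRU ~{lru_bits:3d} bits, tree-PLRU {plru_bits:2d} bits")
    # For a 16-way set this gives ~64 bits versus 15 bits, matching figures cited later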
The primary advantages of PLRU approximation strategies lie in their balance of performance and resource demands, yielding miss rates that closely track those of LRU (typically within 1-5% in benchmark evaluations) while incurring far lower hardware complexity. For instance, evaluations on SPEC CPU2000 workloads demonstrate that PLRU variants maintain near-LRU miss rates across diverse cache organizations, occasionally even achieving slightly lower miss rates than LRU on particular workloads. These benefits make PLRU a practical choice for real-world systems where exact LRU's costs are prohibitive.
Update and Eviction Processes
In Pseudo-LRU policies, the update and eviction processes are designed to approximate the recency order of cache lines efficiently, using lightweight state structures to track usage without the full overhead of exact LRU maintenance. These processes handle cache hits by promoting accessed lines in the approximation and manage misses by selecting and replacing a victim line before inserting new data, ensuring the policy remains hardware-friendly for high-associativity caches.[10]
On a cache hit, the policy updates the state to mark the accessed line as recently used, adjusting the approximation structure (such as bits or pointers) to reflect its promoted position in the recency order. This update avoids scanning or reordering all lines, instead targeting only the relevant portions of the structure, thereby simulating the movement of the line toward the most recently used end of the stack. For example, in tree-based PLRU, this involves setting the bits along the path from the accessed leaf to the root so that future traversals are directed away from it.[2]
On a cache miss, if the cache set is not full, the new line is simply inserted, and its state is updated to mark it as recently used, similar to a hit. When the set is full, however, the process first selects a victim line using the current approximation of the least recently used position, evicts it to free space, inserts the new line in the vacated slot, and then updates the state to promote the new line to a recent position. This sequence ensures that insertions prioritize recency for the incoming data while approximating eviction of the oldest line.[10]
Eviction selection in Pseudo-LRU relies on querying the state structure to identify a likely least recently used line without a full scan of the set, leveraging the approximation to guide the choice efficiently. The structure is traversed or evaluated based on its recency indicators, such as following "least recent" bits in a tree or scanning for a specific bit pattern in a bit-based scheme, to reach a candidate victim, balancing accuracy with low latency and power. For instance, in bit-PLRU the victim is the lowest-indexed way whose MRU bit is unset.[11]
A high-level pseudocode outline for the combined update and eviction processes in Pseudo-LRU policies is as follows:
    if access is hit:
        update_structure(accessed_line)    // promote the accessed line in the approximation
    else:                                  // miss
        if cache_set is full:
            victim = select_victim()       // query the structure for the approximated LRU line
            evict(victim)
        insert(new_line)
        update_structure(new_line)         // promote the new line in the approximation
This procedural workflow applies across Pseudo-LRU variants, enabling fast decisions in hardware implementations.[10]
Algorithms
Tree-PLRU
Tree-based Pseudo-LRU (Tree-PLRU) is a hierarchical approximation of the least recently used (LRU) cache replacement policy, utilizing a binary tree structure to track the relative recency of cache lines with minimal hardware overhead. The tree has N leaves corresponding to the N cache lines in a set, with each internal node storing a single bit that indicates which subtree contains the least recently used line (0 for the left subtree, 1 for the right). This results in exactly N-1 bits of state per set, one per internal node, significantly less than the O(N log N) bits required for exact LRU.[10]
Upon a cache hit, the update process traverses the tree from the root to the accessed leaf, setting the bit in each internal node along the path so that it points away from the accessed subtree, effectively promoting the hit line toward a pseudo-most-recently-used (PMRU) position. This operation approximates moving the line to the MRU position without maintaining full ordering. For eviction on a miss, the replacement policy starts at the root and follows the bits downward—traversing left if the bit is 0 and right if 1—until reaching a leaf, which identifies the pseudo-LRU victim for replacement. Both operations require O(log N) time, implementable via a tree of multiplexers for efficient hardware realization.[10]
To illustrate for a 4-way associative cache (N=4), the tree consists of three internal nodes: a root and two children, with leaves labeled A, B, C, D from left to right. Each internal node bit points to the subtree presumed to hold the LRU line. Suppose initial bits are root=0 (LRU in left half), left child=1 (LRU in right of left half, i.e., B), right child=0 (LRU in left of right half, i.e., C). Eviction would follow root=0 → left child=1 → B as victim.
Consider an access to A: traverse root-left-left, setting root to 1 (now points to right half as LRU) and left child to 1 (remains 1, points to right of left half as LRU). Now bits: root=1, left child=1, right child=0; eviction would go root=1 → right child=0 → C. Subsequent access to D: traverse root-right-right, setting root to 0 (now points to left half as LRU) and right child to 0 (remains 0, points to left of right half as LRU). The structure approximates recency but may fail to distinguish fine-grained orders, potentially evicting recently accessed lines under sequential or thrashing access patterns.[10]
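The traversal and update rules above, together with the 4-way walk-through, can be captured in a small software model (a Python sketch using the bit convention of this section, with way indices 0-3 standing in for lines A-D; not a description of any particular hardware design):

    class TreePLRU:
        """Tree-PLRU state for one set of an N-way cache (N a power of two).
        Each internal node holds one bit: 0 = pseudo-LRU line is in the left
        subtree, 1 = it is in the right subtree."""

        def __init__(self, num_ways):
            assert num_ways >= 2 and num_ways & (num_ways - 1) == 0
            self.n = num_ways
            self.bits = [0] * (num_ways - 1)     # N-1 internal-node bits in heap order

        def victim(self):
            """Follow the bits from the root down to a leaf to pick the eviction victim."""
            node = 0
            while node < self.n - 1:             # internal nodes are indices 0 .. N-2
                node = 2 * node + 1 + self.bits[node]
            return node - (self.n - 1)           # convert leaf index to a way number

        def access(self, way):
            """On a hit or fill, make every node on the path point away from this way."""
            lo, hi, node = 0, self.n, 0
            while node < self.n - 1:
                mid = (lo + hi) // 2
                if way < mid:                    # accessed line is in the left subtree,
                    self.bits[node] = 1          # so mark the right subtree as pseudo-LRU
                    node, hi = 2 * node + 1, mid
                else:                            # accessed line is in the right subtree,
                    self.bits[node] = 0          # so mark the left subtree as pseudo-LRU
                    node, lo = 2 * node + 2, mid

    # Reproducing the 4-way example: bits = [root, left child, right child], ways A=0 .. D=3
    plru = TreePLRU(4)
    plru.bits = [0, 1, 0]
    assert plru.victim() == 1                    # B is the pseudo-LRU victim
    plru.access(0)                               # access A
    assert plru.bits == [1, 1, 0] and plru.victim() == 2   # victim is now C
    plru.access(3)                               # access D
    assert plru.bits == [0, 1, 0]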
Tree-PLRU's advantages include its simplicity in hardware—requiring only N-1 bits and a multiplexer tree for updates and searches—making it suitable for small to medium associativities (4-16 ways) where exact LRU is costly. However, its approximation can lead to suboptimal performance compared to exact LRU in workloads with complex recency patterns, though it often achieves 80-95% of LRU's hit rate with far lower complexity. Unlike flat bit-vector approaches such as Bit-PLRU, the tree structure provides a more nuanced partial ordering at the expense of slightly higher update latency.[10]
Bit-PLRU
Bit-PLRU utilizes an array of Most Recently Used (MRU) bits, one per cache way, requiring N bits for an N-way set-associative cache to represent a pseudo-ordering of access recency. These bits collectively approximate the relative ages of lines by tracking recent accesses in a linear, index-ordered manner.[12]
The update mechanism operates on hits or after inserting a new line on a miss: the MRU bit for the accessed way is set to 1, marking it as recently used and effectively shifting it toward a pseudo-MRU position in the ordering. If all MRU bits are already 1 prior to the update, the entire array is reset to 0 to ensure ongoing operation, simulating a circular reset that prevents selection deadlocks. This bit-setting approach requires only simple logic, such as an OR gate per bit, for efficient hardware implementation.[12][7]
For eviction on a cache miss, the victim is chosen as the lowest-indexed way with an MRU bit of 0, prioritizing earlier indices among non-recently accessed lines to approximate LRU behavior without full ordering. If no such bit exists (all 1s), index 0 is selected as the victim, with the reset occurring as part of the update after eviction. This selection can be performed via a priority encoder or sequential scan, enabling low-latency decisions in hardware.[13][7]
Consider an 8-way cache with initial MRU bits all at 0 (00000000). On sequential misses:
- First miss: Evict way 0, insert new line, set bit 0 to 1 (10000000).
- Second miss: Evict way 1, insert, set bit 1 to 1 (11000000).
- Continuing this pattern through the eighth miss: all bits reach 1 (11111111).
- Ninth miss: All bits are 1, so evict way 0, reset to 00000000, insert, and set bit 0 to 1 (10000000).
This sequence demonstrates the rotational history, where the bit pattern cycles through progressive sets and periodic resets, mimicking a shifting window of recency for ordered accesses.[7]
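A compact software model of this scheme, following the reset-and-fallback behavior described above (an illustrative sketch; exact reset timing and tie-breaking vary across real designs):

    class BitPLRU:
        """Bit-PLRU (MRU-bit) state for one cache set: one bit per way."""

        def __init__(self, num_ways):
            self.bits = [0] * num_ways

        def victim(self):
            """Lowest-indexed way whose MRU bit is 0, or way 0 if every bit is set."""
            for way, bit in enumerate(self.bits):
                if bit == 0:
                    return way
            return 0

        def access(self, way):
            """Mark this way as recently used, resetting the array first if it is saturated."""
            if all(self.bits):                    # all ways currently marked recently used
                self.bits = [0] * len(self.bits)  # periodic reset keeps future victims available
            self.bits[way] = 1

    # Reproducing the 8-way miss sequence above
    plru = BitPLRU(8)
    for expected in range(8):                     # first eight misses fill ways 0 through 7
        assert plru.victim() == expected
        plru.access(expected)                     # insert the new line and set its MRU bit
    assert plru.bits == [1] * 8                   # eighth miss leaves 11111111
    assert plru.victim() == 0                     # ninth miss: all bits set, fall back to way 0
    plru.access(0)                                # reset to all zeros, then mark way 0
    assert plru.bits == [1, 0, 0, 0, 0, 0, 0, 0]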
Bit-PLRU offers advantages in minimal wiring complexity and rapid VLSI integration due to its flat structure and O(1) update/eviction operations via bitwise logic. However, it yields poorer LRU approximation under random access patterns, as the fixed index bias can lead to suboptimal evictions compared to hierarchical alternatives like tree-PLRU. Historically, it saw early adoption in embedded systems for its low hardware overhead, enabling efficient cache management without the area costs of exact LRU.[12][13]
Applications
Hardware Implementations
Pseudo-LRU replacement policies have evolved from theoretical proposals in 1990s research on efficient cache management to practical implementations in commercial hardware by the early 2000s, becoming a standard choice for multi-core processors in the 2010s due to their balance of performance and hardware overhead.[14] Early work on approximations to full LRU, including tree-based variants, addressed the scalability issues of exact LRU in higher-associativity caches, paving the way for adoption in production systems.[15]
In CPU designs, Pseudo-LRU has seen notable adoption, such as in the PowerPC 7450 microprocessor from the early 2000s, where a binary tree PLRU algorithm was implemented for its 8-way set-associative L1 caches to manage replacements efficiently.[16] ARM Cortex-A series processors commonly employ variants of Pseudo-LRU in their L2 caches to reduce complexity in tracking access order. Intel processors, such as those in the Skylake family, utilize variants of tree-PLRU in L2 caches to approximate LRU behavior while minimizing state storage.[17]
These implementations typically target set-associative caches with 8 to 32 ways, where full LRU would require excessive hardware resources, and Pseudo-LRU bits are integrated directly into cache tags for on-the-fly updates.[18] In some designs, Pseudo-LRU is combined with other policies in hybrid approaches to further optimize for specific workloads, such as blending recency with frequency tracking.[11]
Beyond CPUs, Pseudo-LRU variants appear in graphics processing units (GPUs), including NVIDIA architectures, where they manage shader cache replacements to handle high-throughput memory accesses with low overhead.[19] In network systems, such as Named Data Networking caches, Pseudo-LRU policies improve hit rates and reduce latency by approximating least-recently-used eviction in content routers.[14] For solid-state drive (SSD) controllers, modified Pseudo-LRU algorithms like PUD-LRU are used in write buffers to minimize flash memory writes, indirectly supporting wear-leveling by distributing erasures more evenly across cells.[20]
Pseudo-LRU replacement policies generally exhibit miss rates that are 1-10% higher than true least recently used (LRU) in evaluations on SPEC CPU benchmarks, depending on associativity and workload characteristics.[3][2] For instance, in SPEC CPU2000 simulations for L1 data caches, tree-based and mask-based variants of Pseudo-LRU (PLRUm and PLRUt) showed miss rates within 0.1-2% of LRU for 2- to 4-way associativities, while outperforming first-in first-out (FIFO) and random policies by 8-16% in miss rate reduction.[3] In SPEC CPU2006 last-level cache evaluations, an enhanced tree-based Pseudo-LRU variant achieved 91% of LRU's misses (equivalently, about 9% fewer misses) on a 16-way 4MB cache.[2] Compared to simpler policies, Pseudo-LRU provides 20-50% lower miss rates than FIFO or random in memory-intensive workloads, as these lack locality awareness.[3]
Hardware overhead for Pseudo-LRU is substantially lower than for true LRU, with area savings of 20-30% in typical implementations due to reduced state storage.[21] For a 16-way cache, tree-based Pseudo-LRU requires 15 bits per set versus 64 bits for LRU, yielding a 77% reduction in storage overhead, which translates to lower silicon area.[2] Power consumption is also reduced owing to fewer comparators and simpler update logic in the approximation tree or bit structures.[21] Eviction latency benefits from O(log N) complexity in tree-based variants, compared to O(N) for full LRU scans in high-associativity caches.[3]
Pseudo-LRU performs well on workloads with strong sequential or temporal locality, where its tree approximations closely mimic LRU behavior, but it can underperform in thrashing scenarios with weak locality due to imprecise ordering.[2] Studies on tree insertions highlight sensitivity to access patterns, with set-dueling mechanisms helping adapt to sequential accesses while multiple insertion policy variants (IPVs) mitigate thrashing by balancing promotion and eviction.[2]
Recent improvements integrate Pseudo-LRU with hybrid policies like dynamic insertion and promotion (DIP) or learning-based approaches such as Hawkeye, achieving 5-9% speedups over baseline LRU in multi-core SPEC CPU2006 setups by refining insertion points.[2][6] Formal verification techniques have identified bugs in Pseudo-LRU implementations, such as erroneous evictions of recently used lines or inaccessibility of cache ways, ensuring higher reliability without performance loss.[21] As a quantitative example, in an 8-way cache, tree-based Pseudo-LRU uses 7 bits per set (versus 16 for LRU), delivering approximately 95% of LRU's accuracy while halving bit overhead.[21][3]