Memory latency
Memory latency is the time delay between a processor's initiation of a request to read or write data in memory and the completion of that operation, when the data is delivered or the acknowledgment is received.[1] This delay, often measured in nanoseconds (ns) or clock cycles, arises from the inherent slowness of memory access relative to processor speeds and is a fundamental bottleneck in computer system performance.[2] In modern computing architectures, memory latency encompasses the entire memory hierarchy, including on-chip caches (L1, L2, L3), main memory such as dynamic random-access memory (DRAM), and secondary storage like solid-state drives (SSDs).[1] For instance, L1 cache access might take as little as 0.5 ns, while DRAM access can exceed 100 ns, and SSD reads average around 150 µs for small blocks.[1]

A key component in DRAM latency is the column address strobe (CAS) latency, which specifies the number of clock cycles required after the column address is issued until data is available on the output pins.[3] The effective latency in nanoseconds is computed as the CAS latency multiplied by the inverse of the memory clock frequency (e.g., for DDR4-3200 at 1600 MHz, a CL16 yields 10 ns).[4] Over the past decades, while memory bandwidth has scaled by orders of magnitude, latency has improved only modestly (by about 1.3× since the 1990s), widening the gap between processor and memory capabilities.[2]

The impact of memory latency is profound, as it causes processors to stall while awaiting data, directly reducing instruction throughput and application efficiency.[5] To mitigate this, systems employ strategies like multilevel caching to keep hot data close to the processor, prefetching to anticipate accesses, out-of-order execution to overlap computation with memory operations, and latency-tolerant designs such as simultaneous multithreading.[1] In domains like high-performance computing and graphics processing, where memory-bound workloads dominate, ongoing research focuses on novel DRAM architectures and voltage scaling to further reduce latency without compromising reliability.[2]
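As a concrete check of the CAS-latency conversion described above, a few lines of C suffice; the helper name is arbitrary and the DDR4-3200/CL16 figures simply restate the example already given.

```c
#include <stdio.h>

/* Effective CAS latency in nanoseconds: CL cycles multiplied by the
 * memory clock period. For DDR4-3200 the memory clock is 1600 MHz
 * (half the 3200 MT/s transfer rate).                               */
static double cas_latency_ns(int cl_cycles, double clock_mhz)
{
    double period_ns = 1000.0 / clock_mhz;   /* 1 / f, in nanoseconds */
    return cl_cycles * period_ns;
}

int main(void)
{
    printf("DDR4-3200 CL16: %.1f ns\n", cas_latency_ns(16, 1600.0));  /* 10.0 ns */
    return 0;
}
```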
Fundamentals
Definition
Memory latency refers to the time elapsed from the issuance of a memory request, such as a read or write operation, until the requested data is available for use by the processor. This delay encompasses the initiation, processing, and return of the memory access result, making it a critical factor in overall system performance.[6] Unlike throughput or bandwidth, which quantify the volume of data transferred over time, memory latency specifically measures the response time for individual access requests. It focuses on the delay experienced by a single operation rather than the aggregate data movement capacity of the memory system.[7]

In conceptual terms, memory latency can be expressed as the sum of access time and transfer time:

\text{Latency} = t_{\text{access}} + t_{\text{transfer}}

where t_{\text{access}} represents the initial time to locate and prepare the data, and t_{\text{transfer}} accounts for the duration to move the data to the processor. This breakdown highlights how latency arises from both the preparation and delivery phases of the access process.[8] For instance, in synchronous dynamic random-access memory (SDRAM), key components of latency include the row address to column address delay (t_{\text{RCD}}), which is the time to activate a row and access a column, and the column address strobe latency (t_{\text{CL}}), which measures the delay from column selection to data output. These timings contribute to the overall access delay in DRAM-based systems.[9]
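To make the decomposition concrete, the following sketch adds an assumed access time to the time needed to transfer one cache line at a given sustained bandwidth; the 60 ns, 64-byte, and 25.6 GB/s figures are hypothetical and not drawn from the cited sources.

```c
#include <stdio.h>

/* Latency = t_access + t_transfer, where t_transfer is the time to move
 * one block of data at the sustained bandwidth of the memory interface. */
static double latency_ns(double t_access_ns, double block_bytes,
                         double bandwidth_gb_per_s)
{
    /* 1 GB/s moves roughly one byte per nanosecond. */
    double t_transfer_ns = block_bytes / bandwidth_gb_per_s;
    return t_access_ns + t_transfer_ns;
}

int main(void)
{
    /* Hypothetical figures: 60 ns access time, one 64-byte cache line,
     * 25.6 GB/s of sustained channel bandwidth.                        */
    printf("total = %.1f ns\n", latency_ns(60.0, 64.0, 25.6));  /* 62.5 ns */
    return 0;
}
```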
Historical Development
The concept of memory latency emerged in the early days of electronic computing during the 1940s, when systems like the ENIAC and EDVAC relied on vacuum tube-based memory technologies such as acoustic delay lines, which introduced access delays on the order of tens to hundreds of microseconds due to signal propagation through mercury-filled tubes or quartz crystals.[10] These delay lines, functioning as serial storage mechanisms, required mechanical or acoustic components to recirculate data pulses, resulting in minimum access times around 48 microseconds for EDVAC's design and up to 384 microseconds for full cycles in similar implementations. This era highlighted latency as a fundamental bottleneck, as memory access dominated computation cycles in these pioneering machines.

The 1960s and 1970s marked a pivotal shift with the advent of semiconductor memory, dramatically reducing latency from the microsecond range of magnetic core or delay line systems to nanoseconds. The Intel 1103, introduced in 1970 as the first commercially successful dynamic random-access memory (DRAM) chip, achieved access times of approximately 150 nanoseconds, enabling faster random access and paving the way for denser, more efficient storage that supplanted core memory's 1-microsecond delays. This transition aligned with Moore's Law, which predicted exponential growth in transistor density and contributed to latency scaling by allowing smaller, quicker memory cells, though the law's benefits were more pronounced in capacity and bandwidth than in raw access speed reductions.[2]

By the 1980s and 1990s, the standardization of synchronous DRAM (SDRAM) by JEDEC in 1993 introduced clock-synchronized operations and key latency metrics like column address strobe (CAS) latency, which measured the delay from address issuance to data output in clock cycles, typically 2-3 cycles at initial frequencies around 100 MHz.[11] This synchronization improved predictability and pipelining, reducing effective latency to around 20-30 nanoseconds in early implementations.

Entering the 2000s, evolutions in double data rate (DDR) SDRAM, such as DDR4 standardized by JEDEC in 2012, focused on higher bandwidth but encountered a latency plateau, with access times stabilizing at 10-20 nanoseconds despite clock speeds exceeding 2000 MHz, as CAS latencies rose to 15-18 cycles to accommodate denser dies and power constraints.[12] Similarly, DDR5, released by JEDEC in 2020, maintained this range at around 15-17 nanoseconds at launch speeds of 4800 MT/s, prioritizing density and efficiency over further latency cuts.[13] This stagnation reflects the "memory wall" concept, articulated by William A. Wulf and Sally A. McKee in 1995, which described diminishing returns in memory access speed relative to processor advancements, projecting an ever-growing number of processor operations per memory access because memory latency improves far more slowly than processor performance.[14] As of 2025, DDR5 variants have achieved slight latency reductions through higher speeds and optimized timings, with some modules reaching effective latencies below 12 ns, though the memory wall persists.[15]
Components
Access Latency
Access latency constitutes the core delay incurred during the internal data retrieval process within memory modules, encompassing the time from address decoding to data availability at the output. In dynamic random-access memory (DRAM), this latency arises primarily from the sequential operations needed to access data stored in a two-dimensional array of cells organized into rows and columns. The process begins with row activation, followed by column selection, and concludes with signal amplification and data sensing, all of which contribute to the overall delay.[16]

The breakdown of access latency in DRAM typically sums the row-to-column delay (t_{\text{RCD}}), the column address strobe latency (t_{\text{CL}}), and the row precharge time (t_{\text{RP}}). Here, t_{\text{RCD}} represents the time to activate a specific row by driving the wordline and allowing charge sharing between the cell capacitor and the bitline; t_{\text{CL}} is the delay from column address assertion to data output; and t_{\text{RP}} is the duration required to precharge the bitlines back to an equilibrium voltage after the access. The total access latency can be expressed as:

\text{Total access latency} = t_{\text{RCD}} + t_{\text{CL}} + t_{\text{RP}}

These timings are standardized parameters defined by the Joint Electron Device Engineering Council (JEDEC) for synchronous DRAM generations, such as DDR4, where typical values might range from 13.75 ns to 18 ns for t_{\text{RCD}} and t_{\text{CL}} at common clock rates, with t_{\text{RP}} similarly in the 13-15 ns range, leading to an aggregate of approximately 40-50 ns for a full random access cycle.[17][16]

Mechanistically, the process involves bitline precharging, where complementary bitlines (BL and BL-bar) are equalized to V_DD/2 to maximize voltage differential sensitivity during reads. Upon row activation, the selected wordline connects the DRAM cell's storage capacitor to the bitline, causing a small charge redistribution that develops a differential voltage (typically 100-200 mV). Sense amplifiers, which are cross-coupled latch circuits, are then activated to detect and amplify this differential into full-swing logic levels (0 or V_DD), enabling reliable data transfer to the output while restoring the cell charge. This amplification step is critical for overcoming noise and ensuring data integrity, but it introduces additional delay due to the need for precise timing control.[18]

Access latency varies significantly between memory types due to their underlying architectures. Static random-access memory (SRAM), which uses bistable flip-flop cells without capacitors, achieves access times around 1 ns through direct transistor-based storage and simpler decoding, making it suitable for high-speed caches. In contrast, DRAM's reliance on capacitor refresh and the multi-step row-column access results in latencies of 10-50 ns, influenced by factors like cell density and refresh overhead, though optimizations in sense amplifier design can mitigate variations.[19]
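Under the JEDEC timing convention described above, the three components can be converted from cycles to nanoseconds and summed; the DDR4-2666 19-19-19 figures in this sketch are an assumed example, chosen only because they land in the 40-50 ns aggregate quoted here.

```c
#include <stdio.h>

/* Sum the three timing components of a random DRAM access:
 * row activate (tRCD), column access (tCL), and precharge (tRP). */
struct dram_timings {
    int trcd_cycles;
    int tcl_cycles;
    int trp_cycles;
    double clock_mhz;      /* memory (not transfer) clock */
};

static double random_access_ns(const struct dram_timings *t)
{
    double period_ns = 1000.0 / t->clock_mhz;
    return (t->trcd_cycles + t->tcl_cycles + t->trp_cycles) * period_ns;
}

int main(void)
{
    /* Illustrative DDR4-2666 19-19-19 part: 19 cycles at 1333 MHz is
     * roughly 14.25 ns per component, about 42.8 ns in total.        */
    struct dram_timings ddr4 = { 19, 19, 19, 1333.0 };
    printf("tRCD+tCL+tRP = %.1f ns\n", random_access_ns(&ddr4));
    return 0;
}
```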
Propagation and Queueing Delays
Propagation delay refers to the time required for electrical signals to traverse the physical paths between components in a memory system, such as buses or interconnects, fundamentally limited by the speed of light in the transmission medium. This delay arises after the core memory access but before the data reaches its destination, contributing to overall latency in distributed or multi-component architectures. It is calculated as the distance divided by the signal velocity, where velocity is approximately 0.67 times the speed of light (c) in typical printed circuit board (PCB) traces due to the dielectric properties of materials like FR-4.[20][21] For instance, in copper traces on PCBs, this equates to roughly 1.5 ns per foot of trace length, emphasizing the need for compact layouts to minimize such delays in high-performance systems.[22]

Queueing delays occur when memory requests accumulate in buffers within memory controllers or interconnect queues, awaiting processing due to contention from multiple sources. These delays are modeled using queueing theory, particularly the M/M/1 model for single-server systems with Poisson arrivals and exponential service times, where the average waiting time in the queue is given by \frac{\lambda}{\mu(\mu - \lambda)}, with \lambda as the arrival rate and \mu as the service rate.[23] In memory controllers, this model helps predict buffering impacts under varying workloads, as controllers prioritize and schedule requests to avoid excessive buildup, though high contention can lead to significant waits.[24] Adaptive scheduling techniques, informed by such models, dynamically adjust to traffic patterns to bound these delays.[25]

In practice, propagation delays manifest in interconnects like PCIe buses, where round-trip latencies typically range from 300 to 1000 ns depending on generation and configuration, adding overhead to remote memory accesses.[26][27] For multi-socket systems, fabric delays, which arise from inter-socket communication over links like Intel's UPI or AMD's Infinity Fabric, can introduce an additional 50-200 ns of latency for cross-socket memory requests, exacerbated by routing and contention in the interconnect topology.[28][29] These delays highlight the importance of locality-aware data placement to reduce reliance on remote paths.
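Both delay sources lend themselves to back-of-the-envelope estimates. The sketch below computes a trace flight time from the 0.67c figure and an M/M/1 waiting time from assumed arrival and service rates (one request per 20 ns and one per 10 ns, respectively); these rates are illustrative rather than measured values.

```c
#include <stdio.h>

#define SPEED_OF_LIGHT_M_PER_NS 0.299792458   /* metres per nanosecond */

/* Propagation delay over a PCB trace: distance / (velocity_factor * c). */
static double propagation_delay_ns(double trace_len_m, double velocity_factor)
{
    return trace_len_m / (velocity_factor * SPEED_OF_LIGHT_M_PER_NS);
}

/* M/M/1 mean queueing delay W_q = lambda / (mu * (mu - lambda)).
 * Rates are in requests per nanosecond; lambda must stay below mu. */
static double mm1_wait_ns(double lambda, double mu)
{
    return lambda / (mu * (mu - lambda));
}

int main(void)
{
    /* A 30 cm (~1 ft) trace at 0.67c: roughly 1.5 ns of flight time. */
    printf("propagation: %.2f ns\n", propagation_delay_ns(0.30, 0.67));

    /* Assumed controller: serves one request per 10 ns (mu = 0.1/ns)
     * under an offered load of one request per 20 ns (lambda = 0.05/ns). */
    printf("queueing wait: %.1f ns\n", mm1_wait_ns(0.05, 0.10));
    return 0;
}
```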
Measurement and Metrics
Key Performance Indicators
Memory latency is quantified through several key performance indicators that capture different aspects of access times in computer systems, enabling comparisons across hardware generations and configurations. The primary metrics focus on observable response times, emphasizing both typical and extreme behaviors to assess system reliability and efficiency.

Average latency represents the mean response time across a series of memory access requests, providing a baseline measure of expected performance in steady-state operations.[30] This metric is particularly useful for evaluating overall system throughput in bandwidth-constrained environments, where sustained access patterns dominate. For instance, in multi-core processors, average latency accounts for aggregated delays from cache hierarchies to main memory, often derived from microbenchmarks that simulate representative workloads.

Tail latency, typically the 99th percentile response time, highlights worst-case delays that can disproportionately impact user-perceived performance in interactive or real-time applications.[31] In memory systems, tail latency arises from factors like queueing in shared resources or intermittent contention, making it critical for distributed architectures where even rare high-latency accesses can degrade service level objectives.[32]

Cache hit latency measures the time required to retrieve data when it is successfully found in a cache level, serving as a direct indicator of the efficiency of the memory hierarchy's fastest tiers.[33] This metric is essential for understanding intra-component performance, as it reflects the inherent speed of cache designs without the overhead of misses propagating to slower storage.

These indicators are commonly expressed in nanoseconds (ns) for absolute time or clock cycles for relative processor speed, allowing normalization across varying frequencies.[1] For example, a latency of 14 cycles on a 3 GHz processor equates to approximately 4.67 ns, calculated as cycles divided by frequency in GHz.[34]

Standard benchmarks facilitate the measurement and comparison of these KPIs. The STREAM benchmark evaluates effective latency in bandwidth-bound scenarios by simulating large-scale data movement, revealing how latency interacts with throughput in memory-intensive tasks.[35] LMbench, through tools like lat_mem_rd, provides micro-benchmarking of raw access times by varying memory sizes and strides, yielding precise latency profiles for caches and main memory.[36] A simplified pointer-chasing sketch in that spirit follows the table below.

In 2025-era CPUs, typical values illustrate the scale of these metrics across the memory hierarchy:

| Component | Typical Latency (Cycles) | Approximate Time (ns at 4 GHz) |
|---|---|---|
| L1 Cache Hit | 1–5 | 0.25–1.25 |
| Main Memory | 200–400 | 50–100 |
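The lat_mem_rd approach mentioned above can be approximated with a pointer-chasing micro-benchmark: a buffer is turned into a randomly permuted cycle so that each load depends on the previous one, defeating prefetchers, and the average time per dependent load approximates the latency of whichever level of the hierarchy the buffer fits in. The sketch below is a simplified, self-contained illustration, not the LMbench implementation; the buffer size, iteration count, and use of rand() are arbitrary choices for demonstration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static volatile size_t sink;   /* keeps the pointer chase from being optimized away */

/* Average time per dependent load over a randomly permuted cycle.
 * Each element stores the index of the next one, so every load must
 * wait for the previous load to complete.                            */
static double chase_ns_per_load(size_t n_elems, size_t iters)
{
    size_t *next = malloc(n_elems * sizeof *next);
    if (!next) return -1.0;

    /* Sattolo's algorithm: build a permutation that forms one single cycle. */
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;           /* j in [0, i-1] */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    struct timespec t0, t1;
    size_t idx = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        idx = next[idx];                         /* serialized, dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    sink = idx;
    free(next);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}

int main(void)
{
    /* A 512 MiB working set far exceeds on-chip caches, so the result
     * approaches main-memory latency; smaller sizes probe L1/L2/L3.    */
    printf("%.1f ns per load\n", chase_ns_per_load(64u << 20, 20000000u));
    return 0;
}
```

Compiled with optimizations (e.g., cc -O2) and swept across buffer sizes from kilobytes to hundreds of megabytes, this kind of probe reproduces the cache and DRAM latency plateaus that tools such as lat_mem_rd report.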
Calculation and Simulation Methods
Analytical methods for calculating memory latency typically rely on models that decompose the access process into key components, such as address decoding and data retrieval times, expressed in terms of clock cycles or time units. A foundational approach is the Average Memory Access Time (AMAT) model, which computes the effective latency as the sum of the hit time in the cache and the miss penalty weighted by the miss rate:

\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}

This formula allows architects to estimate overall latency by incorporating cache hit probabilities and the additional cycles required for lower-level memory fetches, often derived from cycle-accurate breakdowns like address decoding time plus data fetch cycles, divided by the system clock frequency.[38]

For more detailed predictions, analytical models extend to hierarchical memory systems by recursively calculating latencies across levels, such as in two-level caches where the average latency \lambda_{\text{avg}} is given by:

\lambda_{\text{avg}} = P_{L1}(h) \times \lambda_{L1} + (1 - P_{L1}(h)) \left[ P_{L2}(h) \times \lambda_{L2} + (1 - P_{L2}(h)) \times \lambda_{\text{RAM}} \right]

Here, P_{L1}(h) and P_{L2}(h) represent hit probabilities for the L1 and L2 caches, while \lambda_{L1}, \lambda_{L2}, and \lambda_{\text{RAM}} denote the respective access latencies; this model integrates reuse distance distributions from memory traces to predict performance without full simulation.[39]

Simulation tools provide cycle-accurate modeling of memory latency in full-system environments. The gem5 simulator unifies timing and functional memory accesses through modular MemObjects and ports, enabling detailed prediction of latency in CPU-memory interactions across various architectures, including support for classic and Ruby memory models that capture queueing and contention effects.[40] Similarly, DRAMSim2 offers a publicly available, cycle-accurate simulator for DDR2/3 memory subsystems, allowing trace-based or full-system integration to forecast latency by modeling DRAM timing parameters and bank conflicts with high fidelity.[41] For modern DDR5 systems, tools like Ramulator 2.0 provide extensible, cycle-accurate simulation supporting contemporary DRAM standards.[42]

Empirical measurement techniques capture real-world memory latency through hardware and software instrumentation. Oscilloscope tracing measures signal delays in memory interfaces by quantifying propagation times between address signals and data return, providing precise nanosecond-level insights into physical-layer latencies during hardware validation. In software environments, profilers like Intel VTune enable end-to-end latency profiling by analyzing memory access stalls, cache misses, and bandwidth utilization from application traces, offering breakdowns of average and tail latencies without requiring hardware modifications.[43]
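The two-level analytical model above reduces to a few multiplications and can be sketched directly in code; the hit probabilities and per-level latencies in this fragment are assumed example values, not figures from the cited studies.

```c
#include <stdio.h>

/* Two-level cache latency model:
 * avg = P_L1*L1 + (1 - P_L1) * (P_L2*L2 + (1 - P_L2)*RAM)          */
static double avg_latency_ns(double p_l1, double lat_l1_ns,
                             double p_l2, double lat_l2_ns,
                             double lat_ram_ns)
{
    return p_l1 * lat_l1_ns +
           (1.0 - p_l1) * (p_l2 * lat_l2_ns + (1.0 - p_l2) * lat_ram_ns);
}

int main(void)
{
    /* Assumed values: 95% L1 hits at 1 ns, 80% L2 hits at 5 ns,
     * 80 ns to DRAM on a double miss -> 1.95 ns average.          */
    printf("average latency = %.2f ns\n",
           avg_latency_ns(0.95, 1.0, 0.80, 5.0, 80.0));
    return 0;
}
```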
Advanced statistical modeling addresses queueing latency under high loads by treating memory requests as Poisson arrivals in queueing systems, such as M/M/1 models for memory controllers. In these frameworks, queueing delay is derived from the arrival rate \lambda and service rate \mu, yielding an average waiting time W_q = \frac{\lambda}{\mu(\mu - \lambda)} for single-server scenarios, which predicts rapid, nonlinear latency increases as utilization approaches saturation in multi-bank DRAM systems.[44] This approach, often combined with fixed-point iterations to resolve traffic-latency dependencies, facilitates rapid evaluation of contention-induced delays in multiprocessor environments.[45]
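To see the saturation behavior this model predicts, the short sweep below evaluates W_q across increasing utilization for an assumed service rate of one request per 10 ns; the numbers are illustrative only.

```c
#include <stdio.h>

/* Sweep offered load against a fixed service rate to show how the
 * M/M/1 waiting time W_q = lambda / (mu * (mu - lambda)) grows
 * sharply as utilization rho = lambda / mu approaches 1.           */
int main(void)
{
    const double mu = 0.10;                 /* one request per 10 ns */
    for (double rho = 0.1; rho <= 0.95; rho += 0.1) {
        double lambda = rho * mu;
        double wq_ns = lambda / (mu * (mu - lambda));
        printf("utilization %.0f%%: wait %.1f ns\n", rho * 100.0, wq_ns);
    }
    return 0;
}
```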
Influencing Factors
Hardware Design Elements
Transistor scaling has been a cornerstone of reducing memory latency through advancements in semiconductor process nodes. As feature sizes decrease (for instance, TSMC's 5 nm node enables finer transistor geometries), gate delays in critical memory components like sense amplifiers and decoders diminish, allowing faster signal propagation and shorter overall access times. However, the breakdown of Dennard scaling, observed since around the 90 nm node (circa 2006), has introduced diminishing returns: while transistor density continues to increase, power density rises without proportional voltage reductions, limiting frequency scaling and constraining latency improvements in advanced nodes.[46] This effect is particularly evident in memory circuits, where subthreshold leakage and thermal constraints hinder the expected performance gains from scaling below 7 nm.[47]

Memory types fundamentally dictate baseline latency profiles due to their physical structures and access mechanisms. NAND flash memory, commonly used in storage applications, exhibits read latencies on the order of 25 μs for page accesses, stemming from sequential charge-based sensing that requires time for threshold voltage stabilization.[48] In contrast, high-bandwidth memory (HBM) integrated into GPUs achieves random access latencies around 100 ns, benefiting from wide interface buses and stacked dies that minimize data movement overhead.[49] 3D-stacked DRAM further optimizes this by vertically integrating layers via through-silicon vias (TSVs), which shorten interconnect lengths and reduce RC delays, yielding latency reductions of up to 50% in access times compared to planar DRAM.[50]

Interconnect design plays a pivotal role in propagation delays within memory hierarchies. On-chip buses, fabricated on the same die as the processor, incur propagation delays in the picosecond range due to low capacitance and short wire lengths, whereas off-chip buses introduce delays an order of magnitude higher from package inductance and board-level signaling.[38] Innovations like Intel's Embedded Multi-Die Interconnect Bridge (EMIB) address this by embedding high-density silicon bridges between dies, enabling localized, high-bandwidth links that cut propagation times relative to traditional off-package routing without full 3D stacking overhead.[51]

Power constraints impose trade-offs in voltage scaling that directly impact memory latency. Reducing the supply voltage (Vdd) lowers dynamic power consumption quadratically but slows transistor switching speeds, particularly in sub-1 V regimes where near-threshold operation amplifies delays. For DRAM, operating at reduced voltages can increase access latencies by 20-30%, as bitline sensing and precharge times extend due to diminished drive currents.[52] This balance is critical in energy-constrained systems, where aggressive scaling below 0.8 V exacerbates variability and necessitates compensatory circuit techniques.[53]
System-Level Interactions
Operating system scheduling mechanisms profoundly influence memory latency by introducing overheads during thread management and memory allocation. Context switches, essential for multitasking, incur costs of 10 to 100 microseconds, primarily from saving and restoring CPU state, including translation lookaside buffer (TLB) flushes that disrupt memory access patterns.[54] Page faults exacerbate this further; when a required memory page resides in secondary storage, resolution times extend to milliseconds due to disk I/O operations, dwarfing typical DRAM access latencies of tens of nanoseconds.

Workload characteristics, especially access patterns, interact dynamically with virtual memory subsystems to modulate effective latency. Sequential accesses benefit from prefetching and locality, maintaining low latencies, whereas random accesses strain page replacement algorithms, leading to higher miss rates. In virtual memory thrashing, which occurs when the aggregate working set exceeds physical memory capacity, excessive paging activity dominates, increasing effective memory latency by up to 10 times as computational progress halts for frequent disk swaps.[55]

Concurrency in multi-core systems amplifies latency through resource sharing and architectural asymmetries. In Non-Uniform Memory Access (NUMA) configurations, remote node accesses incur 2 to 3 times the latency of local memory due to cross-node interconnect delays, compelling software to optimize thread-to-node affinity. Thread contention on shared caches and memory controllers in multi-core environments compounds this, with high-contention scenarios elevating average memory latency by factors of up to 7 via queuing and coherence overheads.[56][57]

Virtualization layers in cloud infrastructures, such as hypervisors managing AWS EC2 instances, impose additional latency on memory operations through nested address translations and interception. These mechanisms typically add 5 to 20 percent overhead to memory access times, stemming from extended page table walks and VM exits, particularly under memory-intensive workloads.[58] Queueing delays from concurrent virtual machines can compound these effects, though such contention is primarily a hardware-level interaction discussed elsewhere.[59]
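One way to reason about NUMA placement is a simple weighted average of local and remote latencies. The sketch below uses assumed values (90 ns locally, a 2.5× remote penalty) consistent with the 2-3× range cited above, purely for illustration of how thread affinity shifts effective latency.

```c
#include <stdio.h>

/* Effective memory latency under a NUMA access mix:
 * a weighted average of local and remote node latencies. */
static double numa_effective_ns(double local_fraction,
                                double local_ns, double remote_ns)
{
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns;
}

int main(void)
{
    /* Assumed two-socket system: 90 ns local, 225 ns remote.
     * Poor placement (50% remote) vs. pinned threads (95% local). */
    printf("50%% local: %.0f ns\n", numa_effective_ns(0.50, 90.0, 225.0));
    printf("95%% local: %.0f ns\n", numa_effective_ns(0.95, 90.0, 225.0));
    return 0;
}
```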
Optimization Approaches
Caching and Prefetching
Caching and prefetching are established techniques in computer architecture designed to mitigate memory latency by exploiting temporal and spatial locality in data accesses. Cache hierarchies typically consist of multiple levels, such as L1, L2, and L3 caches, each with increasing capacity but higher access latencies, organized to store frequently used data closer to the processor. The L1 cache, often split into instruction and data caches, provides the fastest access (around 1-4 cycles) but the smallest size (e.g., 32 KB per core), while L2 caches (256 KB to 1 MB, 10-20 cycles latency) serve as a backup, and shared L3 caches (several MB to tens of MB, 30-50 cycles latency) further buffer main memory accesses across cores.[60][61] Set associativity in these caches, such as the 8-way set-associative designs common in modern processors, enhances reuse by allowing multiple blocks per set, thereby reducing conflict misses and effective miss latency through better data retention.[38]

The effectiveness of caching is quantified by hit and miss ratios, where a cache hit delivers data in minimal time, but a miss incurs a significant penalty from fetching from lower levels or main memory. The average memory access time (AMAT) incorporates this via the equation:

\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}

For instance, with a 1-cycle hit time and a main memory miss penalty of approximately 100 cycles, even a low miss rate of 1% can double the effective access time compared to perfect hits.[60][38] Higher associativity, like 8-way, typically lowers the miss rate by 10-20% in workloads with moderate locality, further amortizing the penalty.[61]

Prefetching complements caching by proactively loading anticipated data into caches to overlap latency, and is divided into hardware and software mechanisms. Hardware prefetchers, such as stride-based units in Intel CPUs (e.g., those detecting regular access patterns like array traversals with fixed offsets), monitor load addresses and issue fetches for predicted future lines, often reducing L3-to-memory miss latency by 20-50% in sequential workloads by hiding up to 200 cycles of DRAM access time.[62][63] Software prefetching, implemented via compiler intrinsics like Intel's _mm_prefetch, allows programmers or compilers to insert explicit prefetch instructions, enabling fine-tuned control for irregular patterns where hardware may underperform, such as in pointer-chasing, potentially cutting effective latency by inserting prefetches 100-200 cycles ahead.[64]
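As a rough sketch of software prefetching, the loop below issues _mm_prefetch a fixed distance ahead of the element being consumed. The prefetch distance, array size, and sequential access pattern are illustrative choices; hardware prefetchers usually handle a sequential sum like this on their own, and the technique pays off more for irregular access patterns.

```c
#include <stdio.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

#define PREFETCH_DISTANCE 16   /* elements ahead; tuned per workload */

/* Sum an array while prefetching a fixed distance ahead, so the cache
 * line for a[i + PREFETCH_DISTANCE] is (ideally) already resident by
 * the time the loop reaches it.                                       */
static double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            _mm_prefetch((const char *)&a[i + PREFETCH_DISTANCE], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}

int main(void)
{
    static double data[1 << 20];          /* 8 MiB of doubles */
    for (size_t i = 0; i < (1u << 20); i++) data[i] = 1.0;
    printf("sum = %.0f\n", sum_with_prefetch(data, 1u << 20));
    return 0;
}
```

In practice the distance would be chosen so that each prefetch is issued roughly one memory-latency period before its data is used, in line with the 100-200 cycle guideline mentioned above.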
Despite these benefits, prefetching introduces trade-offs, particularly cache pollution from inaccurate predictions, where useless data evicts useful content, potentially increasing overall miss rates and latency. Inaccurate hardware prefetches can elevate cache pollution by filling sets with non-reused lines, leading to performance degradation of 5-15% in bandwidth-sensitive or low-locality workloads, necessitating throttling mechanisms like confidence counters to modulate prefetch aggressiveness.[65][66] Software prefetches risk similar issues if mistimed, amplifying instruction overhead without latency gains.