
Memory architecture

Memory architecture in computer systems refers to the organizational design and structure of components that enable efficient data access, management, and utilization by the central processing unit (CPU), typically structured as a hierarchy to balance speed, capacity, cost, and power consumption. This design addresses the fundamental trade-offs in memory technologies, where faster memory is smaller and more expensive, while slower memory offers greater capacity at lower cost. At the core of memory architecture is the memory hierarchy, which organizes storage into multiple levels progressing from the fastest, smallest units closest to the processor to larger, slower ones further away. The primary levels include registers, caches, main memory, and secondary storage. The effectiveness of this hierarchy relies on principles of locality of reference, where programs exhibit temporal locality (reusing recently accessed data) and spatial locality (accessing nearby data soon after). Cache organizations, such as direct-mapped, fully associative, or set-associative mappings, further optimize hit rates by determining how data blocks are placed and searched. Virtual memory extends this architecture by abstracting physical limitations through paging and segmentation, allowing processes to use more memory than physically available via disk swapping. Modern memory architectures also incorporate advanced features like error-correcting codes (ECC) in DRAM to enhance reliability, multi-port designs for concurrent access in multiprocessor systems, and emerging non-volatile technologies such as MRAM, ReRAM, and phase-change memory to reduce power usage and narrow performance gaps. These evolutions are driven by the "memory wall" challenge, where processor speeds outpace memory bandwidth, necessitating innovations like high-bandwidth memory interfaces and near-data processing to sustain overall system performance.

Fundamentals

Definition and Scope

Memory architecture refers to the structural arrangement and design of memory systems within computer systems, encompassing the methods for storing, retrieving, and managing data to support efficient computation. This organization ensures that data is accessible at varying levels of speed and capacity, tailored to the needs of the processor and overall system performance. The scope of memory architecture focuses primarily on hardware-level implementations, spanning from the smallest, fastest storage units integrated into the processor to larger, slower external devices, while interfacing with software mechanisms such as virtual memory and file systems. It addresses the integration of these components into a cohesive system that interfaces with the CPU, I/O devices, and buses, prioritizing physical design principles that support programmatic control. At its core, memory architecture plays a critical role in balancing key trade-offs: access speed for rapid data retrieval, capacity for handling large datasets, cost for economic feasibility, and volatility to determine data persistence without power. Fundamental components include registers, which provide the quickest access for immediate operands within the CPU; cache memory, a small intermediary buffer for frequently used data; main memory, typically implemented as DRAM, for holding active programs and data; and secondary storage devices for long-term retention. These elements collectively form a memory hierarchy that optimizes performance by exploiting locality of reference.

Historical Evolution

The development of memory architecture began in the 1940s with rudimentary technologies constrained by the limitations of early electronic computing. The ENIAC, completed in 1945, relied on vacuum tube-based flip-flop registers for its primary memory, providing a capacity of just 20 words of 10 decimal digits each, which was sufficient for its calculator-like operations but required frequent reconfiguration for different tasks. To address the need for larger, more reliable storage in subsequent machines, J. Presper Eckert proposed mercury delay line memory in the mid-1940s for the EDVAC design, using sound waves propagating through liquid mercury-filled tubes to store bits acoustically; this technology was first implemented in computers like the EDSAC in 1949 and the UNIVAC I in 1951, offering capacities up to several thousand bits with access times around 1 millisecond. By the early 1950s, magnetic-core memory emerged as a transformative advancement, supplanting delay lines due to its non-volatility, faster access (around 1 microsecond), and greater reliability. Independently developed by An Wang and by Jay Forrester at MIT, whose 1951 patent for the coincident-current selection method enabled efficient addressing of tiny ferrite rings magnetized to represent bits, core memory was first deployed in MIT's Whirlwind computer in 1953 and became the dominant technology through the 1970s, powering systems like later UNIVAC models (e.g., the UNIVAC 1105 with core planes storing 4096 words). Key architectural innovations during this era included the introduction of virtual memory in the Manchester Atlas computer in 1962, which used paging to create the illusion of a larger address space by transferring pages between core memory and a drum backing store, significantly improving multiprogramming efficiency. Similarly, cache memory debuted in the IBM System/360 Model 85 in 1968, employing a small, high-speed buffer to bridge the growing speed gap between processors and main memory, marking the onset of hierarchical designs. The shift to semiconductor memory in the 1970s revolutionized density and cost, driven by advances in integrated circuits. Intel introduced the 1103 chip in October 1970, the first commercially successful DRAM with 1-kilobit capacity, which required periodic refreshing but enabled much higher densities than core memory at lower cost, rapidly displacing magnetic technologies. Static RAM (SRAM), invented in 1963 at Fairchild Semiconductor as a semiconductor memory using flip-flop circuits for each bit without refresh needs, complemented DRAM in applications requiring speed, such as registers and caches, with early commercial versions appearing in the mid-1960s. This semiconductor era was propelled by Gordon Moore's 1965 observation—later termed Moore's law—that the number of components on an integrated circuit would double approximately every year (revised to every two years in 1975), fostering exponential miniaturization and integration that underpinned denser memory hierarchies.

Memory Hierarchy

Levels and Components

The memory hierarchy in modern computer systems is organized as a multi-tiered structure, designed to balance speed, capacity, and cost by exploiting the principle of locality. At the apex are CPU registers, which are the fastest and smallest units, typically numbering 32 to 128 per core and holding individual words of 32 to 64 bits each for immediate manipulation during instruction execution. Below registers lie multilevel caches, implemented in static RAM (SRAM): L1 caches (split into instruction and data subsets) provide the first line of rapid access outside registers, followed by L2 and shared L3 caches that stage larger blocks of data closer to the processor. Main memory, usually dynamic RAM (DRAM), serves as the primary working storage for active programs and data. Secondary storage, such as hard disk drives (HDDs) or solid-state drives (SSDs), holds persistent data at much larger scales, while tertiary storage like magnetic tape handles archival needs. This structure is justified by the principle of locality, which observes that programs tend to reuse recently accessed data (temporal locality) and access data located near recently referenced items (spatial locality), allowing most operations to hit faster upper levels rather than slower lower ones. Temporal locality arises because computational patterns, such as loops, repeatedly reference the same variables, while spatial locality stems from sequential data access in arrays or code instructions, enabling block transfers that capture nearby items. These properties ensure that the effective access time approximates that of the fastest level for a significant portion of references, providing the illusion of a large, uniform memory system. The components of the hierarchy vary in scale and performance, as summarized in the following table of representative specifications for a typical modern system (e.g., x86-64 architectures as of 2025 in desktops/servers):
Level | Typical Capacity | Access Latency | Purpose and Technology
Registers | 256 bytes to 1 KB (32-128 × 64-bit words) | <1 ns (0.3-1 cycle at 3-5 GHz) | CPU-integrated for operands; SRAM-like speed.
L1 cache | 32-128 KB per core | 1-4 ns (3-12 cycles) | On-chip, per-core; holds active instructions/data blocks.
L2 cache | 256 KB-2 MB per core | 3-10 ns (7-20 cycles) | On-chip, per-core or shared; extends L1 for larger working sets.
L3 cache | 8-128 MB shared | 10-25 ns (20-50 cycles) | On-chip, multi-core shared; buffers main memory accesses.
Main memory (DRAM) | 16-256 GB system-wide | 50-100 ns | Off-chip modules; volatile bulk storage for running processes.
Secondary (HDD/SSD) | 500 GB-10 TB | HDD: 5-10 ms; SSD: 0.05-0.1 ms | Persistent, non-volatile; for files and OS.
Tertiary (Tape) | 10 TB to multi-PB archival | Seconds to minutes | Offline, removable; for backups.
These metrics highlight the central trade-off: upper levels offer sub-nanosecond speeds but limited capacity, while lower levels provide terabyte-scale capacity at far higher latencies. Levels interact seamlessly through dedicated pathways: registers exchange data with L1 caches via the CPU's internal datapath, while cache controllers automatically manage block transfers between caches and main memory over high-speed on-chip interconnects. A memory controller bridges main memory to the processor, facilitating burst transfers to/from secondary storage via I/O controllers, ensuring coherent data flow without software intervention for upper levels. This integration relies on buses like the front-side bus (or modern equivalents such as QPI/UPI) for inter-level communication, with protocols handling misses by fetching from the next lower level.

Design Principles and Trade-offs

The design of memory hierarchies relies on fundamental principles that exploit program behavior to balance performance and resource constraints. Central to this is the principle of locality, which posits that programs exhibit temporal locality—recently accessed data is likely to be accessed again soon—and spatial locality—data near recently accessed locations is likely to be referenced next. This behavior allows smaller, faster memory levels to effectively store frequently used data, reducing the need to access slower, larger storage. Seminal work formalized these observations, enabling hierarchies that approximate the speed of the fastest components while approaching the cost of the cheapest. In multi-level cache designs, the principle of inclusion further guides organization by ensuring that data in higher-speed levels (closer to the processor, such as L1 caches) is a subset of data in lower-speed levels (such as L2 caches), facilitating simpler coherence management and snoop filtering. To handle evictions when capacity is exceeded, replacement policies select victims for removal; the least recently used (LRU) policy, a widely adopted approximation of optimal replacement, prioritizes evicting the item unused for the longest time, balancing implementation simplicity with effectiveness in exploiting temporal locality. Key trade-offs in memory hierarchy design revolve around speed, cost, and capacity. Faster technologies, such as static RAM (SRAM) used in caches, provide low-latency access but at a high cost per bit because each cell requires more transistors, limiting practical sizes to kilobytes or megabytes. In contrast, slower dynamic RAM (DRAM) for main memory offers higher density at lower cost per bit but with increased access times, necessitating careful sizing to optimize overall system performance. Additionally, volatility and persistence present a further trade-off: volatile memories enable rapid read/write operations suitable for active computation but require power to retain data, whereas non-volatile options provide large-scale persistence for long-term storage at the expense of slower access speeds and higher write latencies. These choices are quantified through empirical modeling, showing that hierarchies can achieve effective speeds within 10-20% of ideal while keeping costs close to bulk storage levels. Performance metrics focus on hit rate—the proportion of memory requests satisfied by a given level—and miss rate (1 minus hit rate), which determine effective access efficiency. The average access time T_{avg} for a two-level hierarchy is calculated as T_{avg} = h \cdot T_{cache} + (1 - h) \cdot T_{main}, where h is the hit rate, T_{cache} is the cache access time, and T_{main} is the main memory access time (including any penalty for fetching missed blocks). Typical hit rates in well-designed caches range from 90-99%, dramatically reducing average latency compared to main memory alone. Optimization goals emphasize minimizing this T_{avg} to lower overall latency while maximizing throughput, particularly to alleviate the von Neumann bottleneck—the limitation imposed by a shared bus constraining data movement between processor and memory, which can cap system performance despite advances in computation speed.
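The formula above can be checked numerically; the following C sketch uses assumed latencies of 2 ns for the cache and 80 ns for main memory (illustrative values, not taken from any specific system) to show how the average access time falls as the hit rate rises.

    #include <stdio.h>

    /* Average access time for a two-level hierarchy:
     * T_avg = h * T_cache + (1 - h) * T_main
     * Latencies below are illustrative assumptions, not measured values. */
    static double avg_access_time(double hit_rate, double t_cache_ns, double t_main_ns)
    {
        return hit_rate * t_cache_ns + (1.0 - hit_rate) * t_main_ns;
    }

    int main(void)
    {
        double t_cache = 2.0;    /* assumed cache access time in ns */
        double t_main  = 80.0;   /* assumed main-memory access time in ns */

        for (double h = 0.90; h <= 0.991; h += 0.03) {
            printf("hit rate %.2f -> T_avg = %.1f ns\n",
                   h, avg_access_time(h, t_cache, t_main));
        }
        return 0;
    }

Even at a 90% hit rate the effective latency is dominated by the miss term, which is why small improvements in hit rate near the top of the range yield large gains.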

Primary Memory Technologies

Volatile Memory Types

Volatile memory types encompass technologies that lose stored data upon power removal, primarily static random-access memory (SRAM) and dynamic random-access memory (DRAM), which serve as foundational elements for high-speed data access in computing systems. SRAM employs static cells based on bi-stable flip-flops to retain each bit without periodic maintenance, while DRAM uses dynamic cells relying on capacitors that necessitate regular refreshing to counteract charge leakage. SRAM cells typically adopt a 6-transistor (6T) configuration, consisting of two cross-coupled inverters forming the flip-flop (with four transistors: two NMOS pull-downs and two PMOS loads) and two NMOS access transistors for read/write operations. This design ensures data stability as long as power is supplied, eliminating the need for refresh cycles and enabling faster access times compared to DRAM, though at the cost of lower density due to the higher transistor count per bit. In contrast, DRAM stores each bit in a single transistor-capacitor pair, where the capacitor's charge level (charged for logic 1, discharged for logic 0) represents the data, allowing for greater density but requiring periodic refresh operations every 64 ms to restore leaked charge across all 8192 rows in a typical module. DRAM chips are organized into multiple independent banks, each containing arrays of rows and columns; access begins with activating a row via the Row Address Strobe (RAS) to transfer its data to a row buffer, followed by column selection using the Column Address Strobe (CAS) for reading or writing specific bits. Common DRAM variants include Synchronous DRAM (SDRAM), which synchronizes operations with the system clock for improved performance, and Double Data Rate (DDR) SDRAM evolutions, such as DDR5, standardized in July 2020 by JEDEC with initial speeds up to 6.4 Gbps and updated to support up to 8.8 Gbps as of 2024, featuring on-die error correction and reduced voltage operation. These technologies find applications in main system memory, where DRAM dominates for its cost-effective capacity in general-purpose computing, and in caches and embedded systems, where SRAM provides rapid, reliable storage for critical operations.
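Using the refresh parameters cited above (a full pass over 8192 rows every 64 ms), a quick calculation gives the interval at which a DRAM controller must issue per-row refresh commands; the short C sketch below works it out.

    #include <stdio.h>

    /* Illustrative DRAM refresh arithmetic using the figures cited above:
     * a full refresh of 8192 rows must complete every 64 ms. */
    int main(void)
    {
        double refresh_window_ms = 64.0;
        int    rows              = 8192;

        double per_row_us = (refresh_window_ms * 1000.0) / rows;  /* ms -> us */
        printf("One row refresh roughly every %.3f us\n", per_row_us);
        return 0;
    }

This works out to a refresh command roughly every 7.8 microseconds, time the memory controller must interleave with ordinary read and write traffic.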

Non-Volatile Memory Types

Non-volatile memory retains data without power supply, distinguishing it from volatile types by providing persistent storage essential for firmware and boot processes. Primary non-volatile memories, such as read-only memory (ROM) variants and flash memory, utilize floating-gate transistor structures to store charge-based information, enabling reliable retention over extended periods. These technologies are foundational in systems where data retention upon power-off is critical. ROM variants form the basis of non-reprogrammable or limited-reprogrammable storage. Mask ROM is programmed during manufacturing through fixed patterning of the silicon, such as via metal contacts or channel implants, rendering it non-erasable and ideal for high-volume, unchanging data like embedded software or character fonts. PROM (programmable ROM) allows one-time user programming post-manufacture using fuses or anti-fuses to blow connections, offering flexibility for low-volume customization without the cost of mask changes. EPROM (erasable PROM), a legacy technology largely obsolete since the early 2000s, employs a floating-gate structure, where data is programmed by hot-electron injection under elevated voltage to trap charge on the isolated gate, shifting the transistor's threshold voltage; erasure occurs via ultraviolet (UV) light exposure through a quartz window, typically taking about 20 minutes, with densities historically reaching up to 32 Mbit. EEPROM (electrically erasable PROM) advances this with electrical erasure and byte-level addressing, using Fowler-Nordheim tunneling to add or remove charge from the floating gate, supporting up to 10^4 write cycles and densities of 1-2 Mbit for applications in adaptive controllers. The core mechanism in these ROM variants and in flash memory is the floating-gate transistor, where a conductive polysilicon layer isolated by dielectrics stores electrons, altering the transistor's threshold voltage to represent binary states; this charge persists for over 10 years due to the high energy barrier of the surrounding oxide. Flash memory, an evolution of EEPROM, enables block-level erasure for efficiency. NOR flash features a parallel array allowing byte-addressable reads and direct code execution, with read times under 80 ns but slower sector erasure (up to 1 second), suited for densities typically up to 2 Gbit as of 2025. In contrast, NAND flash arranges cells in series for higher density (typically 512 Gbit to 1 Tbit or more as of 2025), supporting page-based access with faster block erases (about 1 ms) and writes (about 400 μs per page), though random access is slower. Flash cell types vary by bits stored: SLC (single-level cell) holds 1 bit for high endurance (up to 100,000 cycles), MLC (multi-level cell) stores 2 bits via four voltage levels for balanced density, TLC (triple-level cell) encodes 3 bits with eight levels, maximizing capacity at the cost of reduced reliability, and QLC (quad-level cell) stores 4 bits with 16 levels, further increasing density but with lower endurance. These non-volatile types find primary use in firmware storage, such as BIOS and UEFI code in computer systems, where NOR flash enables execute-in-place for boot code, and embedded devices rely on EEPROM and flash for persistent configuration data. While also applied in secondary storage like SSDs, their role in primary memory emphasizes low-latency persistence for system initialization.
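The relationship between bits per cell and the number of voltage levels a cell must resolve (levels = 2^bits) can be tabulated directly, as in the sketch below. The endurance figures are rough illustrative ranges drawn from this article where stated; the MLC value in particular is an assumed ballpark, not a figure from the text.

    #include <stdio.h>

    /* Bits per flash cell versus required voltage levels (2^bits).
     * Endurance strings are rough, illustrative ranges; the MLC figure
     * is an assumption rather than a value stated in this article. */
    int main(void)
    {
        const char *types[]  = { "SLC", "MLC", "TLC", "QLC" };
        const char *endure[] = { "~100,000", "~10,000 (assumed)", "~1,000", "~100-1,000" };

        for (int bits = 1; bits <= 4; bits++) {
            printf("%s: %d bit(s)/cell, %d voltage levels, endurance %s P/E cycles\n",
                   types[bits - 1], bits, 1 << bits, endure[bits - 1]);
        }
        return 0;
    }

Each extra bit per cell doubles the number of charge levels that must be distinguished, which is why density gains come with progressively tighter margins and lower endurance.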

Secondary and Auxiliary Storage

Magnetic and Optical Media

Magnetic storage devices, such as hard disk drives (HDDs), represent a cornerstone of secondary storage in memory architectures, utilizing ferromagnetic materials to encode data through the alignment of magnetic domains. In HDDs, data is stored on rotating platters coated with magnetic material, where read/write heads positioned on actuator arms magnetize small regions to represent binary states—either north-south or south-north domain orientations for 0s and 1s, respectively. This process relies on principles established in the 1950s with early drum and disk storage but evolved significantly with the introduction of rigid disk drives in the 1970s by IBM, enabling reliable non-volatile storage for large datasets. Magnetic tape drives serve as another important auxiliary storage medium, particularly for archival and backup purposes, where data is stored linearly on tape reels using similar magnetic alignment. Modern Linear Tape-Open (LTO) formats, such as LTO-9, offer uncompressed capacities up to 18 terabytes per cartridge, with LTO-10 reaching 36 terabytes as of 2025, providing cost-effective, high-density storage for data centers despite slower access times in the seconds range. The areal density of HDDs, which measures storage capacity per unit area on the platters, has grown exponentially due to advancements in recording technologies, reaching over 1 terabit per square inch by the mid-2020s through techniques like heat-assisted magnetic recording (HAMR). HAMR involves heating a tiny spot on the platter with a laser to temporarily lower the coercivity of the magnetic material, allowing the write head to align domains more precisely and densely without interference from adjacent bits. This innovation, commercialized by companies like Seagate in products such as the Mozaic 3+ platform, supports drive capacities exceeding 36 terabytes in enterprise settings as of 2025, maintaining HDDs' role in cost-effective, high-volume data persistence despite competition from faster alternatives. Optical media, including compact discs (CDs), digital versatile discs (DVDs), and Blu-ray discs, store data by creating physical variations in a reflective layer that alter the light patterns detectable by a laser. Data is encoded as pits (depressions) and lands (flat areas) along a spiral track, where a laser beam of specific wavelength reads the transitions: light reflected cleanly from lands indicates one state, while light scattered at pit edges signals the other. Introduced in the 1980s by Philips and Sony, CDs achieve a standard capacity of 700 megabytes using a 780 nm infrared laser, while DVDs increase that to 4.7 gigabytes per layer with a 650 nm red laser for finer pit geometry, and Blu-ray reaches up to 50 gigabytes on dual-layer discs via a 405 nm blue-violet laser that enables smaller pits around 0.16 micrometers. The fundamental mechanisms of these media highlight their reliance on physical phenomena for data retention: in magnetic systems, the stability of aligned domains persists without power due to material remanence, allowing sequential or random access via head movement over spinning platters at 5,400 to 15,000 RPM. Optical mechanisms exploit the reflection and scattering of light off microscopic topography, ensuring read-only or writable formats (e.g., recordable discs with dye layers that become opaque when heated) maintain data integrity through environmental isolation. Both technologies offer high capacities—HDDs scaling to petabytes in arrays and optical discs providing archival portability—but suffer from mechanical vulnerabilities like head crashes in HDDs or scratching in optical media, contributing to their gradual decline in favor of solid-state drives for primary secondary storage roles.

Solid-State and Emerging Storage

Solid-state drives (SSDs) represent a major advancement in secondary storage, utilizing NAND flash memory to provide non-volatile, high-speed data persistence without mechanical components. Unlike traditional hard disk drives, SSDs enable access times in the microsecond range, making them suitable for applications requiring frequent read and write operations. At the core of SSDs is NAND flash, organized into blocks and pages, where data is written electronically but requires erasure at the block level before rewriting. SSD controllers manage these operations, implementing wear leveling to distribute write and erase cycles evenly across blocks, thereby preventing premature failure of heavily used blocks. This is critical because NAND flash has limited write endurance; for instance, triple-level cell (TLC) NAND typically supports around 1,000 program/erase cycles per cell before reliability degrades. Emerging quad-level cell (QLC) NAND extends capacities further, storing 4 bits per cell to enable SSDs up to 16 terabytes or more as of 2025, though with reduced endurance of approximately 100–1,000 cycles, balanced by advanced error correction. To extend lifespan, SSDs employ over-provisioning, reserving a portion of capacity (often 7-25% beyond user-visible space) for internal use in garbage collection and replacement of worn cells. SSDs connect via standardized interfaces that influence performance. The Serial ATA (SATA) interface, common in early consumer models, limits sequential throughput to about 600 MB/s. In contrast, the Non-Volatile Memory Express (NVMe) protocol, built on the PCIe bus, supports much higher speeds; by the 2020s, PCIe 4.0-based NVMe SSDs achieve sequential read/write rates exceeding 7 GB/s, with PCIe 5.0 variants surpassing 14 GB/s in high-end configurations. Hybrid drives, known as solid-state hybrid drives (SSHDs), integrate a small NAND flash cache (typically 8-32 GB) with a larger HDD platter to balance capacity and speed. The flash portion caches frequently accessed data, accelerating boot times and application launches while leveraging the HDD for bulk storage, though adoption has waned with declining SSD costs. In consumer applications, SSDs dominate markets for laptops and desktops, offering capacities from 256 GB to 8 TB with power efficiency for mobile use. In enterprise environments, they form the backbone of data centers, handling high-IOPS workloads in servers and storage arrays, where reliability features like error correction and RAID integration ensure data integrity over petabyte-scale deployments.
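Wear leveling can be illustrated with a toy allocator that always directs the next erase to the block with the fewest accumulated program/erase cycles. The C sketch below is a conceptual illustration under that simple policy, not any vendor's actual controller algorithm.

    #include <stdio.h>

    #define NUM_BLOCKS 8   /* toy flash array; real devices have millions of blocks */

    /* Minimal wear-leveling idea: when a write needs a fresh block, pick
     * the block with the lowest erase count so wear spreads evenly. */
    static unsigned erase_count[NUM_BLOCKS];

    static int pick_least_worn_block(void)
    {
        int best = 0;
        for (int i = 1; i < NUM_BLOCKS; i++)
            if (erase_count[i] < erase_count[best])
                best = i;
        return best;
    }

    int main(void)
    {
        for (int write = 0; write < 40; write++) {
            int b = pick_least_worn_block();
            erase_count[b]++;            /* erase-before-write consumes one P/E cycle */
        }
        for (int i = 0; i < NUM_BLOCKS; i++)
            printf("block %d: %u erase cycles\n", i, erase_count[i]);
        return 0;
    }

After 40 writes every block has accumulated a nearly identical erase count, which is the property wear leveling aims for; real controllers must additionally migrate static data and track bad blocks.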

Addressing and Access Mechanisms

Physical and Logical Addressing

Physical addressing refers to the direct identification of locations in physical memory using absolute byte addresses. These addresses are transmitted over the memory bus via dedicated address pins on the CPU and memory modules, enabling the selection of specific byte positions within the physical array. The physical address space encompasses all actual storage locations available in the system's installed memory, typically organized as a contiguous range starting from address 0. In many modern architectures, the physical address space employs a flat model, treating memory as a single, linear array of bytes where each location is uniquely identified by its offset from the start of the range. This contrasts with older segmented models, where physical addresses are computed by combining a segment base (specifying the starting point of a memory segment) with an offset (the displacement within that segment), allowing for variable-sized blocks of memory and facilitating relocation and protection. The flat model simplifies addressing and is predominant in contemporary systems like x86-64, while segmented approaches, as seen in early systems like the Intel 8086, provided flexibility for modular code but added complexity to address calculations. Logical addressing, in contrast, provides an abstraction managed by the operating system, where programs operate on virtual addresses that do not directly correspond to physical locations. These logical addresses are generated by the CPU during instruction execution and represent positions within a process's virtual address space. The memory management unit (MMU), a hardware component integrated into the CPU, translates these logical addresses to physical addresses using translation tables, such as page tables, to access the actual memory hardware. This mechanism isolates processes and enables efficient memory sharing without direct hardware addressing. Addressing schemes also encompass conventions for multi-byte ordering and alignment to optimize access. In 64-bit architectures, the virtual address space theoretically spans 2^{64} bytes (approximately 16 exabytes), though practical implementations often limit physical addressing to 48 or 52 bits due to pin constraints and cost considerations. For multi-byte values like integers, byte order determines the arrangement: big-endian stores the most significant byte at the lowest address (e.g., the 32-bit value 0x01234567 appears as 01 23 45 67 in memory), common in network protocols and architectures like PowerPC, while little-endian stores the least significant byte first (67 45 23 01), as used in x86 processors. This ordering affects data portability and interoperability between systems. To enhance access speed and avoid hardware penalties, data structures are aligned to natural boundaries matching the processor's word size—typically 32 bits (4 bytes) or double-word (64 bits, 8 bytes). Alignment requires that the starting address of such data be a multiple of the boundary size (e.g., addresses ending in 0, 4, 8, or C in hexadecimal for 32-bit word alignment), ensuring fetches occur in single bus cycles. Misaligned data may trigger exceptions or require multiple accesses, increasing latency; for instance, a 32-bit word at an unaligned address might necessitate two 16-bit reads. Compilers automatically insert padding bytes in structures to enforce alignment, balancing access speed with memory overhead.
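The byte-order example and the compiler padding behavior described above can be observed directly with a short C program. The struct layout shown assumes a typical 32/64-bit ABI, and the printed byte order on a little-endian x86 machine matches the 67 45 23 01 arrangement.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Demonstrates byte order and structure padding; results depend on
     * the host ABI and endianness. */
    struct example {
        char     c;      /* 1 byte; typical ABIs insert 3 padding bytes after it */
        uint32_t word;   /* must start at a 4-byte-aligned offset               */
    };

    int main(void)
    {
        uint32_t value = 0x01234567;
        unsigned char *bytes = (unsigned char *)&value;

        printf("byte layout of 0x01234567: %02X %02X %02X %02X\n",
               bytes[0], bytes[1], bytes[2], bytes[3]);   /* 67 45 23 01 on little-endian */

        printf("offset of 'word': %zu\n", offsetof(struct example, word));
        printf("sizeof(struct example): %zu\n", sizeof(struct example));
        return 0;
    }

On common x86-64 toolchains the struct occupies 8 bytes rather than 5, showing the padding the compiler adds to keep the 32-bit member naturally aligned.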

Memory Access Patterns and Buses

Memory access patterns refer to the ways in which processors request data from memory, influencing performance due to optimizations for locality. Random access involves non-sequential retrievals from scattered locations, often incurring higher latency as each request requires full address decoding and offers no exploitation of nearby data. In contrast, sequential access fetches data in linear order, leveraging spatial locality to reduce overhead by prefetching adjacent blocks. Burst mode enhances sequential patterns by transferring multiple words (e.g., 4 to 8) from a single row activation in DRAM without reissuing addresses, minimizing row access time and boosting throughput for block operations. This mode is particularly effective in fast page mode DRAM, where subsequent column accesses to the open row buffer occur in as little as one clock cycle each after the initial row hit. Memory buses facilitate data transfer between the processor and memory modules, typically comprising three main components. The address bus is unidirectional, carrying memory location signals from the processor to memory, with widths like 24 bits in early standards to support up to 16 MB of addressing. The data bus is bidirectional, enabling read and write operations between processor and memory, often 8 or 16 bits wide in legacy systems but scaled to 64 bits in modern configurations for higher bandwidth. The control bus, also unidirectional from processor to memory, conveys signals for timing, read/write commands, and synchronization. Memory access protocols govern the timing and coordination of these bus transfers. Asynchronous protocols, common in older DRAM, operate without a system clock, relying on strobe signals like RAS and CAS for event-driven responses, which suits variable-speed systems but limits scalability. Synchronous protocols, prevalent in SDRAM and its DDR successors, align operations to clock edges for predictable timing, with data transfers on rising (SDR) or both (DDR) edges to double effective rates (e.g., 200 MT/s at a 100 MHz clock). Pipelining in modern synchronous systems overlaps command issuance and data handling, supporting burst lengths of 4 or more to hide latencies and interleave bank accesses for concurrent operations. A key bottleneck in memory access arises from the von Neumann architecture, where instructions and data share the same bus, creating contention that limits throughput as processor speeds outpace memory. This shared pathway, known as the von Neumann bottleneck, constrains bandwidth since instruction fetches and data loads compete for the bus. The Harvard architecture mitigates this by separating instruction and data buses, allowing simultaneous transfers and reducing contention in specialized systems like embedded processors.
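The performance gap between sequential and scattered access can be seen with a simple experiment: the C sketch below walks the same array once sequentially and once with a large stride. On typical hardware the sequential pass is noticeably faster because it benefits from spatial locality and burst transfers; the absolute timings are machine-dependent and only the relative difference matters.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* 16M ints (~64 MB), large enough to exceed on-chip caches */

    int main(void)
    {
        int *a = malloc((size_t)N * sizeof *a);
        if (!a) return 1;
        for (long i = 0; i < N; i++) a[i] = (int)i;

        long sum = 0;
        clock_t t0 = clock();
        for (long i = 0; i < N; i++) sum += a[i];           /* sequential walk */
        clock_t t1 = clock();
        for (long s = 0; s < 4096; s++)                     /* same elements, stride 4096 */
            for (long i = s; i < N; i += 4096) sum += a[i];
        clock_t t2 = clock();

        printf("sequential: %.3f s, strided: %.3f s (sum=%ld)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
        free(a);
        return 0;
    }

Both loops touch every element exactly once, so the difference in time reflects the access pattern rather than the amount of work.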

Cache Systems

Cache Organization Strategies

Cache organization strategies determine how data blocks from main memory are mapped to cache locations, balancing factors such as access speed, complexity, and miss rates to optimize overall system performance. These strategies primarily involve mapping techniques, address decomposition into fields for lookup, replacement algorithms for evicting blocks, and inclusion policies for coordinating multiple cache levels. Seminal evaluations have shown that practical designs often favor compromises between simplicity and flexibility to achieve high hit rates without excessive costs.

Mapping Strategies

Cache mapping strategies define how memory blocks are assigned to cache slots, with direct-mapped, set-associative, and fully associative being the primary approaches. Direct-mapped caches assign each block to exactly one slot, determined by a portion of the block's address, offering simplicity and fast access via a single tag comparison but suffering from conflict misses when multiple blocks compete for the same slot. For instance, in a direct-mapped cache, simulations on SPEC benchmarks indicate miss rates around 5-10% higher than more flexible designs due to these conflicts. Fully associative caches allow any block to map to any cache slot, eliminating conflict misses by comparing the tag against all slots in parallel, which maximizes flexibility and hit rates but requires complex hardware with many comparators, making it feasible only for small caches due to increased power and area costs. Set-associative caches, a hybrid, divide the cache into sets of n slots (e.g., 2-way or 4-way), where a block maps to one specific set but can occupy any slot within it, reducing conflict misses compared to direct-mapped while limiting comparisons to n tags for manageable hardware overhead. Evaluations demonstrate that 4-way set-associative caches achieve miss rates within 1-2% of fully associative equivalents for typical workloads, with diminishing returns beyond 8-way due to capacity and compulsory misses dominating.
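A toy simulation makes the conflict-miss behavior of direct mapping concrete. The sketch below models a hypothetical direct-mapped cache with 256 slots of 32-byte blocks (illustrative parameters, not those of a real processor); the address trace contains two blocks that map to the same slot and repeatedly evict each other.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SLOTS  256
    #define BLOCK_SIZE 32

    static uint32_t tags[NUM_SLOTS];
    static bool     valid[NUM_SLOTS];

    /* Direct mapping: each block has exactly one candidate slot. */
    static bool access_cache(uint32_t addr)
    {
        uint32_t block = addr / BLOCK_SIZE;
        uint32_t slot  = block % NUM_SLOTS;     /* index bits select the slot  */
        uint32_t tag   = block / NUM_SLOTS;     /* remaining bits form the tag */

        if (valid[slot] && tags[slot] == tag)
            return true;                        /* hit */
        valid[slot] = true;                     /* miss: fill (evict old tag)  */
        tags[slot]  = tag;
        return false;
    }

    int main(void)
    {
        uint32_t trace[] = { 0x0000, 0x0004, 0x2000, 0x0008, 0x2000 };
        for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
            printf("0x%04X -> %s\n", trace[i], access_cache(trace[i]) ? "hit" : "miss");
        return 0;
    }

Addresses 0x0000 and 0x2000 differ only in their tag, so they alternate in slot 0 and keep missing; a 2-way set-associative organization would let both blocks coexist in the same set.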

Block Structure

The memory address in cache systems is decomposed into three fields—tag, index, and offset—to facilitate efficient lookup and access. The offset bits select a specific byte within a block (or line), typically 4-64 bytes in size, ensuring the cache operates on fixed-size units for block-transfer and prefetching benefits. The index bits identify the set or slot, with the number of index bits determined by the number of slots, itself the cache size divided by the block size (e.g., a 1 KB direct-mapped cache with 4-byte blocks yields 256 slots and thus 8 index bits). The remaining high-order bits store the tag, the unique identifier of the block, compared against stored tags to confirm a hit; in direct-mapped caches, the tag comprises the upper address bits excluding index and offset, while in fully associative designs there is no index field, enlarging the tag. This structure enables parallel tag matching within a set, with hardware like content-addressable memory (CAM) in associative caches accelerating comparisons. For a 32-bit address and 32-byte blocks in a 2-way set-associative cache with 512 entries, the breakdown allocates 5 bits to the offset (for 32 bytes), 8 bits to the index (for 256 sets), and 19 bits to the tag, keeping lookup latency to a few cycles.
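For the 32-bit example just given (5 offset bits, 8 index bits, 19 tag bits), extracting the fields amounts to a few shifts and masks, as in this short C sketch; the address value itself is arbitrary.

    #include <stdio.h>
    #include <stdint.h>

    /* Splits a 32-bit address for the configuration described above:
     * 32-byte blocks, 2-way set-associative, 512 entries (256 sets),
     * giving 5 offset bits, 8 index bits, and 19 tag bits. */
    #define OFFSET_BITS 5
    #define INDEX_BITS  8

    int main(void)
    {
        uint32_t addr   = 0xDEADBEEF;    /* arbitrary example address */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        printf("address 0x%08X -> tag 0x%05X, index %u, offset %u\n",
               addr, tag, index, offset);
        return 0;
    }

The index selects which set to probe, and only the two tags stored in that set need to be compared against the extracted tag.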

Replacement Policies

When a cache miss occurs in a full set, a replacement policy selects the victim block to evict, with least recently used (LRU) and first-in first-out (FIFO) being widely implemented for their balance of performance and hardware feasibility. LRU evicts the block unused for the longest time since last access, exploiting temporal locality by assuming recent accesses predict future ones, and is typically realized with per-set counters or shift registers that update on each hit to rank blocks by recency. Hardware approximations like tree-based pseudo-LRU use tree structures to track approximate recency order with logarithmic overhead, avoiding full LRU's exponential state space in high-associativity caches. FIFO, in contrast, evicts the oldest block based solely on insertion order, implemented via a simple circular queue or incrementing counters per set, requiring minimal state but ignoring usage patterns and thus performing worse under locality-heavy workloads, with miss rates 10-20% higher than LRU in benchmarks. Both policies integrate with write-back or write-through strategies, but LRU's adaptability makes it prevalent in modern processors like Intel's, where it reduces conflict and capacity misses effectively.
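The following C sketch tracks true LRU for a single hypothetical 4-way set using per-way age counters; the counter scheme is illustrative, since real hardware typically uses the cheaper tree pseudo-LRU approximation mentioned above.

    #include <stdio.h>

    #define WAYS 4   /* one 4-way set, for illustration only */

    /* Each way carries an "age"; the way with the largest age is the LRU victim. */
    static int tags[WAYS];
    static int age[WAYS];
    static int used = 0;

    static void access_block(int tag)
    {
        int hit = -1, victim = 0;
        for (int w = 0; w < used; w++) {
            if (tags[w] == tag) hit = w;
            if (age[w] > age[victim]) victim = w;
        }
        for (int w = 0; w < used; w++) age[w]++;   /* everyone gets older */

        if (hit >= 0) {
            age[hit] = 0;                          /* refresh recency on hit */
            printf("tag %d: hit\n", tag);
        } else if (used < WAYS) {
            tags[used] = tag; age[used] = 0; used++;
            printf("tag %d: miss (fill)\n", tag);
        } else {
            printf("tag %d: miss (evict LRU tag %d)\n", tag, tags[victim]);
            tags[victim] = tag; age[victim] = 0;
        }
    }

    int main(void)
    {
        int trace[] = { 1, 2, 3, 4, 1, 5, 2 };
        for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
            access_block(trace[i]);
        return 0;
    }

In this trace, block 5 evicts block 2 (the least recently used after re-touching block 1), and the subsequent access to block 2 misses again, illustrating how LRU rewards recently reused blocks.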

Multi-Level Caches

In multi-level cache hierarchies (e.g., L1, L2, L3), organization extends single-level strategies with inclusive or exclusive policies to manage duplication and consistency across levels. Inclusive policies require all L1 blocks to also reside in L2, simplifying coherence by allowing L2 to serve as a backing store, but potentially wasting space on duplicates of L1 contents and leading to higher capacity misses. Exclusive policies, conversely, ensure no block is present in both L1 and L2 simultaneously, maximizing effective capacity by avoiding redundancy and reducing total misses by up to 20% in some traces, though they complicate management as L1 misses may require probing L2 for potential relocation. Seminal analysis shows inclusive hierarchies preserve strict inclusion for easier verification but underutilize lower-level space, while exclusive variants enhance bandwidth efficiency in pipelined designs, with non-inclusive non-exclusive (NINE) hybrids blending benefits for multi-core systems. These policies integrate with shared replacement mechanisms, often propagating LRU decisions upward to maintain consistent recency order.

Cache Performance and Coherence

Cache performance is primarily evaluated through metrics such as hit rate, miss rate, and miss penalty, which quantify the efficiency of data retrieval from the cache relative to main memory. The hit rate represents the fraction of memory accesses that are satisfied by the cache, while the miss rate is its complement (1 - hit rate); a high miss rate indicates frequent accesses to slower main memory. Miss penalty denotes the additional time required to handle a cache miss, typically involving fetching data from lower levels of the hierarchy. These metrics are foundational in assessing cache effectiveness, as they directly influence overall system performance. A key summary measure is the Average Memory Access Time (AMAT), which integrates these factors into a single measure of expected access latency. The AMAT is calculated as: \text{AMAT} = \text{Hit Time} + \text{Miss Rate} \times \text{Miss Penalty} Here, Hit Time is the latency for a successful access, often 1-2 cycles in modern designs. This formula, introduced in seminal literature, allows architects to quantify trade-offs in cache design by balancing reductions in miss rate against potential increases in hit time or penalty. For instance, in a cache with a 95% hit rate, 1-cycle hit time, and 100-cycle miss penalty, the AMAT would be 6 cycles, highlighting the outsized impact of misses. To improve cache performance, techniques such as prefetching and victim caches address misses proactively or through auxiliary storage. Prefetching anticipates future needs by loading blocks into the cache before explicit requests, reducing compulsory and capacity misses; hardware prefetchers, for example, detect sequential access patterns and fetch subsequent lines, improving hit rates by up to 20-30% in streaming workloads. Victim caches, small fully associative buffers holding recently evicted lines from the main cache, mitigate conflict misses in direct-mapped caches by allowing quick recovery of useful blocks without a full main-memory access; evaluations show they can reduce miss rates by 10-20% with minimal hardware overhead. These methods, proposed in early optimization studies, enhance AMAT without significantly enlarging the primary cache. In multi-core systems, cache coherence ensures that all processors observe a consistent view of shared data, preventing stale or inconsistent reads/writes across private caches. Coherence protocols manage this through state transitions for cache lines, with the MESI protocol being a widely adopted invalidate-based approach using four states: Modified (dirty data unique to one cache), Exclusive (clean data unique to one cache), Shared (clean data possibly in multiple caches), and Invalid (data not present or invalid). Transitions, such as invalidating shared copies on a write, maintain consistency via bus snooping, where caches monitor transactions to update states. For scalability in larger systems, directory-based protocols replace snooping by maintaining a centralized directory tracking line locations and states, avoiding broadcast overhead; this reduces traffic in systems with over 16 cores, though it introduces directory storage costs. The MESI protocol, originating from early multiprocessor designs, underpins coherence in many commercial processors like Intel's. Multi-processor environments introduce challenges like thrashing and cache pollution, which degrade performance despite coherence mechanisms. Thrashing occurs when frequent evictions due to high contention or poor locality cause repeated misses, often in shared last-level caches under multiprogrammed workloads; this can inflate AMAT by factors of 2-5 in contended scenarios. Cache pollution arises when non-reusable data displaces useful lines, exacerbating misses in last-level caches; in multi-core setups, inter-thread interference amplifies this, with studies showing up to 50% performance loss from polluted shared resources. Mitigations, such as selective insertion policies, help but require careful tuning to avoid overheads.
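The MESI state machine described above can be summarized as a small transition function. The C sketch below models only the state changes for a single line in a single cache—no data movement, write-back, or bus arbitration—which is enough to trace the common read, write, and snoop sequences.

    #include <stdio.h>

    /* Simplified MESI transitions for one cache line, reacting to local
     * reads/writes and to bus events snooped from other caches. A local
     * write is shown as an immediate upgrade to Modified; the bus
     * invalidation it triggers in other caches appears as BUS_WRITE. */
    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
    typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } event_t;

    static mesi_t next_state(mesi_t s, event_t e, int other_caches_have_copy)
    {
        switch (e) {
        case LOCAL_READ:
            if (s == INVALID) return other_caches_have_copy ? SHARED : EXCLUSIVE;
            return s;                               /* read hit keeps the state   */
        case LOCAL_WRITE:
            return MODIFIED;                        /* other copies get invalidated */
        case BUS_READ:                              /* another core reads the line */
            if (s == MODIFIED || s == EXCLUSIVE) return SHARED;
            return s;
        case BUS_WRITE:                             /* another core writes the line */
            return INVALID;
        }
        return s;
    }

    int main(void)
    {
        const char *names[] = { "Invalid", "Shared", "Exclusive", "Modified" };
        mesi_t s = INVALID;
        s = next_state(s, LOCAL_READ, 0);  printf("after local read  : %s\n", names[s]);
        s = next_state(s, LOCAL_WRITE, 0); printf("after local write : %s\n", names[s]);
        s = next_state(s, BUS_READ, 1);    printf("after remote read : %s\n", names[s]);
        s = next_state(s, BUS_WRITE, 1);   printf("after remote write: %s\n", names[s]);
        return 0;
    }

The trace shows the typical lifecycle: a private read yields Exclusive, a write upgrades it to Modified, a remote read downgrades it to Shared, and a remote write invalidates the local copy.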

Virtual Memory Systems

Core Concepts and Advantages

Virtual memory serves as an abstraction in operating systems that allows programs to operate as if they have access to a large, contiguous block of memory, independent of the actual physical memory available. This is achieved through a mapping mechanism that translates virtual addresses—generated by a process—into physical addresses in the system's main memory or secondary storage. The core concept involves an address map function that dynamically associates virtual names with physical locations or indicates absence (null mapping), enabling the system to manage resources efficiently without requiring programmers to handle physical constraints directly. This abstraction decouples addressing from hardware limitations, providing the illusion of a vast, uniform space that can exceed the size of physical memory. A primary advantage of virtual memory is process isolation, which enhances multitasking by assigning each process its own private address space, preventing direct access to physical memory locations used by other processes. This isolation is enforced through protection mechanisms, such as associating read, write, and execute permissions with memory segments or pages, thereby safeguarding against unauthorized access and errors that could corrupt other programs or the operating system itself. In the Multics system, for instance, each process maintains a private descriptor segment that maps its virtual addresses while restricting access based on user-defined rights, allowing secure sharing of segments without duplication. Such protection not only improves system reliability but also supports concurrent execution of multiple programs, a foundational benefit for modern operating systems. Another key advantage lies in demand paging and loading, where memory pages are brought into physical RAM only upon reference, rather than preloading entire programs. This technique, known as demand fetch, reduces initial load time and minimizes unnecessary data transfers, as only actively used portions—often captured by the working set model—are retained in main memory. By supporting overcommitment, where the total memory allocated across processes can exceed physical capacity, virtual memory optimizes resource utilization; unused pages can be swapped to secondary storage, allowing more programs to run simultaneously without immediate resource exhaustion. However, careful management is required to avoid thrashing, a state of excessive paging that degrades performance when overcommitment leads to frequent page faults. Overall, these features enable efficient memory utilization and multiprogramming, making virtual memory indispensable for handling large applications and diverse workloads.

Implementation Techniques

Virtual memory is implemented through techniques that map virtual addresses to physical memory, primarily paging and segmentation, which enable efficient use of limited physical resources while providing isolation and flexibility. Paging divides the virtual address space into fixed-size blocks called pages, typically 4 KiB in size on many systems, allowing the operating system to allocate and manage memory in uniform units. Each page corresponds to a physical frame of the same size, and the mapping is maintained in data structures known as page tables. In paging, page tables consist of page table entries (PTEs), where each entry includes a valid bit to indicate whether the page is present in physical memory and a frame number specifying the physical address of the corresponding frame. If the valid bit is unset, a page fault occurs, triggering the operating system to load the page from secondary storage. To handle large address spaces efficiently, multi-level page tables are employed, such as the four-level hierarchy in x86-64 architectures, which uses a page map level 4 (PML4), page directory pointer table, page directory, and page table to index into the virtual address. This hierarchical structure reduces memory overhead by allocating only the necessary levels for sparsely populated address spaces, with each level fitting into a single page. Segmentation provides an alternative or complementary approach by dividing the virtual address space into variable-sized segments tailored to logical units, such as code, data, or stack sections, which simplifies sharing and protection. Each segment is defined by a base address, length, and access permissions, allowing processes to reference memory relative to segment boundaries. In systems like Multics, segments are named symbolically and mapped dynamically, supporting modular program design without fixed sizes. Modern architectures often combine segmentation with paging; for instance, in x86, segments define coarse-grained regions that are then subdivided into pages for fine-grained allocation and swapping. This hybrid model leverages segmentation for logical organization while using paging to handle fragmentation and enable demand loading. To accelerate address translation, the translation lookaside buffer (TLB) serves as a hardware cache that stores recent virtual-to-physical mappings from the page tables. The TLB performs associative searches on virtual page numbers, providing a hit in constant time for frequently accessed translations and avoiding full page table walks, which can span multiple memory accesses in multi-level schemes. On a TLB miss, the hardware or software walks the page tables to populate the entry, with typical TLB sizes ranging from 32 to 2048 entries depending on the processor. Swapping complements these mapping techniques by moving entire processes or individual pages between physical memory and secondary storage to manage overcommitment. Thrashing, where excessive paging leads to performance degradation due to frequent faults overwhelming the system, is mitigated using the working set model, which tracks the set of pages actively referenced by a process over a recent time window. By ensuring that a process's working set remains resident in memory—typically estimated via reference bits in PTEs—the operating system admits or suspends processes to prevent overcommitment, maintaining efficient multiprogramming levels. This approach, formalized in the late 1960s, balances memory allocation to sustain useful computation without collapse.
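The four-level x86-64 scheme just described splits a 48-bit virtual address into four 9-bit table indices plus a 12-bit page offset for 4 KiB pages. The C sketch below extracts those fields from an arbitrary example address.

    #include <stdio.h>
    #include <stdint.h>

    /* Field extraction for the 48-bit x86-64 virtual address layout:
     * 9 bits per table level plus a 12-bit page offset. */
    int main(void)
    {
        uint64_t vaddr  = 0x00007f1234567ABCULL;      /* arbitrary example address */
        uint64_t offset = vaddr & 0xFFF;              /* bits 0-11  */
        uint64_t pt     = (vaddr >> 12) & 0x1FF;      /* bits 12-20 */
        uint64_t pd     = (vaddr >> 21) & 0x1FF;      /* bits 21-29 */
        uint64_t pdpt   = (vaddr >> 30) & 0x1FF;      /* bits 30-38 */
        uint64_t pml4   = (vaddr >> 39) & 0x1FF;      /* bits 39-47 */

        printf("VA 0x%012llx -> PML4 %llu, PDPT %llu, PD %llu, PT %llu, offset 0x%03llx\n",
               (unsigned long long)vaddr,
               (unsigned long long)pml4, (unsigned long long)pdpt,
               (unsigned long long)pd,   (unsigned long long)pt,
               (unsigned long long)offset);
        return 0;
    }

Each extracted index selects one entry in the corresponding table during a page walk, so a miss at every level costs up to four dependent memory accesses before the final frame number is known.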

Advanced and Future Developments

Memory Management Units

The memory management unit (MMU) is a dedicated hardware component that performs real-time translation of virtual addresses to physical addresses, enabling efficient memory access in virtualized environments. It also enforces memory protection by checking access permissions and generating interrupts, known as page faults, when invalid accesses occur, such as referencing unmapped pages or violating protection rules. These interrupts allow the operating system to handle faults, such as loading missing pages from secondary storage. In contemporary processor architectures, the MMU is integrated directly into the CPU core, as seen in ARM-based systems where it forms part of the processor pipeline for seamless address translation. Conversely, in older computer systems from the 1970s and 1980s, such as those based on the PDP-11 or early VAX designs, the MMU was implemented as a separate chip or board to offload translation tasks from the main processor. This evolution reflects advancements in silicon integration, reducing latency and cost while supporting more complex memory hierarchies. Key features of the MMU include support for context switching in multi-process systems, achieved by reloading the active page table registers to switch between different virtual address spaces without altering physical memory mappings. It also accommodates large virtual address spaces; for instance, x86-64 processors utilize 48-bit virtual addresses, providing up to 256 terabytes of addressable virtual memory per process. These capabilities ensure isolation and efficient resource sharing among multiple processes running concurrently. A primary performance overhead in MMU operations stems from misses in the translation lookaside buffer (TLB), a small on-chip cache that stores recent address translations to avoid full page-table traversals. On a TLB miss, the MMU initiates a multi-level page walk through the page-table hierarchy, which can involve several memory accesses and introduce significant latency, potentially stalling the processor pipeline for hundreds of cycles. Techniques like larger TLBs or hardware page walkers mitigate this, but TLB misses remain a critical bottleneck in memory-intensive workloads.
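As a conceptual model of the hit/miss behavior described above, the following C sketch implements a tiny direct-mapped TLB for 4 KiB pages. The sizes and the stand-in fake_page_walk mapping are purely illustrative assumptions; real TLBs are set-associative and are refilled by hardware page walkers.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 64   /* toy size; the text cites 32 to 2048 entries */

    static struct { uint64_t vpn; uint64_t pfn; bool valid; } tlb[TLB_ENTRIES];

    static uint64_t fake_page_walk(uint64_t vpn)   /* stand-in for a real page walk */
    {
        return vpn ^ 0x1234;                       /* arbitrary deterministic mapping */
    }

    static uint64_t translate(uint64_t vaddr, int *hit)
    {
        uint64_t vpn  = vaddr >> 12;               /* virtual page number (4 KiB pages) */
        unsigned slot = vpn % TLB_ENTRIES;

        if (tlb[slot].valid && tlb[slot].vpn == vpn) {
            *hit = 1;
        } else {
            *hit = 0;                              /* TLB miss: walk and refill */
            tlb[slot].vpn = vpn;
            tlb[slot].pfn = fake_page_walk(vpn);
            tlb[slot].valid = true;
        }
        return (tlb[slot].pfn << 12) | (vaddr & 0xFFF);
    }

    int main(void)
    {
        uint64_t addrs[] = { 0x400123, 0x400FFF, 0x401000, 0x400123 };
        for (size_t i = 0; i < sizeof addrs / sizeof addrs[0]; i++) {
            int hit;
            uint64_t pa = translate(addrs[i], &hit);
            printf("VA 0x%06llx -> PA 0x%09llx (%s)\n",
                   (unsigned long long)addrs[i], (unsigned long long)pa,
                   hit ? "TLB hit" : "TLB miss");
        }
        return 0;
    }

Accesses within the same page reuse the cached translation, while crossing a page boundary forces a new walk; this is why workloads with poor page-level locality spend a large fraction of time in translation.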

Novel Architectures and Technologies

High Bandwidth Memory (HBM) represents a pivotal advancement in 3D stacking technologies, enabling vertical integration of multiple DRAM dies to achieve unprecedented bandwidth for graphics processing units (GPUs) and artificial intelligence (AI) applications. By employing through-silicon vias (TSVs) and wide interface buses, HBM stacks up to 12 layers of memory, delivering data rates exceeding 9.6 Gb/s per pin in HBM3E configurations, which translates to over 1.2 TB/s per stack. As of 2025, 16-high HBM3E stacks with 48 GB capacity are entering sampling, while JEDEC is standardizing HBM4 with speeds up to 6.4 Gbps per pin. This architecture, standardized in the 2020s by JEDEC, significantly reduces latency and power consumption compared to traditional planar DRAM by minimizing interconnect lengths and enhancing parallel access. Adopted widely in AI accelerators and GPUs from NVIDIA and AMD, HBM3E supports up to 36 GB per stack in 12-high configurations, addressing the memory bottlenecks in AI workloads. Non-volatile random-access memory (NVRAM) technologies have emerged as promising alternatives to conventional DRAM and flash, offering persistence without power while approaching CMOS compatibility. Magnetoresistive RAM (MRAM), particularly spin-transfer torque (STT) variants, utilizes magnetic tunnel junctions where data is stored via spin-polarized current-induced magnetization switching, enabling write speeds under 10 ns and endurance exceeding 10^12 cycles. Recent advancements in perpendicular STT-MRAM have improved thermal stability and scaled cell sizes to 6F^2, positioning it for applications in microcontrollers and last-level caches. Similarly, resistive RAM (ReRAM) relies on resistive switching in metal-oxide films, such as HfO_2, where conductive filament formation modulates between high- and low-resistance states, achieving sub-1 ns access times and densities up to 10 Gb/mm^2 in crossbar arrays. Phase-change memory (PCM) exploits the amorphous-to-crystalline phase transitions in chalcogenide materials like Ge_2Sb_2Te_5, induced by localized heating, to store data with multi-bit capability per cell and write latencies around 50 ns, as demonstrated in commercial storage-class memory products. These technologies collectively bridge the gap between volatile speed and non-volatile retention, with MRAM entering production for automotive and embedded devices by the mid-2020s. In-memory computing architectures, such as processing-in-memory (PIM), integrate computational logic directly within or near memory arrays to mitigate the von Neumann bottleneck by reducing data movement overhead, which can account for up to 60% of system energy in data-intensive tasks. PIM implementations, often leveraging 3D-stacked DRAM like HBM, embed simple accelerators for operations like matrix multiplications, achieving up to 10x bandwidth efficiency gains in machine-learning training and inference. For instance, near-data processing units in HBM2E prototypes perform bulk bitwise operations in place, slashing latency for memory-bound workloads by over 50% compared to CPU-GPU pipelines. These designs prioritize energy savings, with prototypes reporting 2-5x lower power for AI training through localized computation, paving the way for scalable exascale systems. Emerging quantum and optical memory paradigms push beyond electronic limits, targeting ultra-high-speed and secure storage for future quantum networks and photonic computing. Quantum memories, based on atomic ensembles or solid-state defects like nitrogen-vacancy centers in diamond, aim to store photonic qubits with fidelity above 90% for milliseconds, as shown in post-2020 demonstrations enabling entanglement distribution over 10 km. These systems, still in research phases, support quantum repeaters by mapping photons to long-lived spin states, though scalability remains challenged by decoherence. Optical memories, conversely, utilize photonic or magneto-optical materials for volatile storage, with a 2025 programmable photonic prototype demonstrating switching speeds 100 times faster than state-of-the-art photonic integrated memories and consuming one-tenth the power of current photonic memory units, using integrated optical elements for reconfigurable states. Plasmonic approaches have yielded hybrid nanoelectronic devices interfacing light and electrons for edge computing. Neither is commercialized as of 2025, but they hold potential for terahertz-bandwidth interconnects in neuromorphic and quantum-hybrid systems.
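The per-stack bandwidth figure quoted for HBM3E follows from multiplying the per-pin rate by the interface width. The sketch below assumes the conventional 1024-bit HBM interface per stack, which is an assumption since the width is not stated in the text above.

    #include <stdio.h>

    /* Back-of-the-envelope check of the HBM3E bandwidth figure cited above,
     * assuming a 1024-bit (1024-pin) interface per stack. */
    int main(void)
    {
        double gbits_per_pin = 9.6;      /* HBM3E per-pin data rate (Gb/s) */
        int    pins          = 1024;     /* assumed interface width per stack */

        double gbytes_per_s = gbits_per_pin * pins / 8.0;
        printf("Per-stack bandwidth: %.1f GB/s (~%.2f TB/s)\n",
               gbytes_per_s, gbytes_per_s / 1000.0);
        return 0;
    }

Under that assumption the arithmetic gives roughly 1.23 TB/s, consistent with the "over 1.2 TB/s per stack" figure cited above.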

  43. [43]
    Solid State Drives Transform Data Centers
    Jul 25, 2013 · SSDs are a convenient way to implement flash in a disk package so that it can be implemented in consumer as well as enterprise storage systems.
  44. [44]
    Physical Address Space - an overview | ScienceDirect Topics
    Physical Address Space refers to the actual memory locations where data is stored in a computer system, as opposed to virtual address spaces.Conclusion And Future... · 21.2 Memory Model Evolution · 5.4 Virtualizing Memory
  45. [45]
    [PDF] MEMORY MODELS
    Most hardware implementations of memory are still quite close to the flat model, despite the example of the Burroughs ( now Unisys ) segmented memory ( and a ...<|control11|><|separator|>
  46. [46]
    Virtual and physical addresses - Arm Developer
    This guide introduces the MMU, which is used to control virtual to physical address translation.
  47. [47]
    What is a 64-Bit Processor (64-Bit Computing)? - TechTarget
    Mar 9, 2022 · For an address bus, that is about 18 exabytes of potential addressable memory space. For a data bus, that is 18 quintillion different ...<|control11|><|separator|>
  48. [48]
    Big Endian vs. Little Endian | Baeldung on Computer Science
    May 15, 2024 · Big-endian keeps the most significant byte of a word at the smallest memory location and the least significant byte at the largest.
  49. [49]
    [PDF] Word alignment
    If the ints are aligned on word boundaries, there must be 3 bytes between the chars and the ints. This means that the size of the struct is 16 bytes, if ...
  50. [50]
    [PDF] Memory Hierarchy
    Principle of Locality: ◦ Programs tend to reuse data and instructions near those they have used recently. ◦ Temporal locality: recently referenced items are ...
  51. [51]
    [PDF] CS650 Computer Architecture Lecture 9 Memory Hierarchy - NJIT
    DRAM fast page mode (burst) allows repeated accesses to the row buffer without another row access time, called a page hit.
  52. [52]
    796-1983 - IEEE Standard Microcomputer System Bus
    The bus supports two independent address spaces: memory and I/O. During memory cycles, the bus allows direct addressability of up to 16 megabytes using 24-bit ...Missing: architecture | Show results with:architecture
  53. [53]
    sdram.html
    DDR allows data transfers on both the up and down tick. ... Memory 'word' is determined by the width of the data bus - current design 8 byte (64 bit).
  54. [54]
    [PDF] Synchronous DRAM Architectures, Organizations, and Alternative ...
    Dec 10, 2002 · In a synchronous. DRAM, operative steps internal to the DRAM happen in time with one or more edges of this clock. In an asynchronous DRAM, opera ...
  55. [55]
    Von Neumann Architecture - an overview | ScienceDirect Topics
    The separation of CPU and memory is often called the von Neumann bottleneck , since it limits the rate at which instructions can be executed.
  56. [56]
    Cache - Cornell: Computer Science
    I discuss the implementation and comparative advantages of direct mapped cache, N-way set associative cache, and fully-associative cache. Also included are ...
  57. [57]
    Basics of Cache Memory – Computer Architecture
    The commonly used algorithms are random, FIFO and LRU. Random replacement does a random choice of the block to be removed. FIFO removes the oldest block ...
  58. [58]
    On the inclusion properties for multi-level cache hierarchies
    Baer. An economical solution to the cache coherence problem. In Proc ... On the inclusion properties for multi-level cache hierarchies. ISCA '98: 25 ...
  59. [59]
    Cache Optimizations III – Computer Architecture
    AMAT can be written as hit time + (miss rate x miss penalty). Reducing any of these factors reduces AMAT. The previous modules discussed some optimizations that ...
  60. [60]
    [PDF] Average memory access time: Reducing Misses
    L2 Equations. AMAT = Hit TimeL1 + Miss RateL1 x Miss PenaltyL1. Miss PenaltyL1 = Hit TimeL2 + Miss RateL2 x Miss PenaltyL2. AMAT = Hit TimeL1 +. Miss RateL1 x ...Missing: formula | Show results with:formula
  61. [61]
    [PDF] In More Depth: Average Memory Access Time
    Average memory access time is the average time to access memory considering both hits and misses and the fre- quency of different accesses; it is equal to the ...
  62. [62]
    Improving direct-mapped cache performance by the addition of a ...
    Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim ...Missing: tuning | Show results with:tuning
  63. [63]
    [PDF] Lecture 18: Snooping vs. Directory Based Coherency
    Berkeley Protocol. – Clean exclusive state. (no miss for private data on write). MESI Protocol. – Cache supplies data when shared state. (no memory access).
  64. [64]
    [PDF] A Unified Mechanism to Address Both Cache Pollution and Thrashing
    We compare our EAF-based mechanism to five state-of-the-art mechanisms that address cache pollution or thrashing, and show that it provides significant ...
  65. [65]
    [PDF] Cooperative Caching for Chip Multiprocessors by Jichuan Chang
    For multiprogrammed workloads, threads also com- pete for cache capacity and associativity which can cause lowered performance (e.g., due to thrashing [35]),.
  66. [66]
    [PDF] One-Level Storage System
    1. T. Kilburn / D. B. G. Edwards / M. J. Lanigan / F. H. Sumner. Summary After a brief survey of the basic Atlas machine, the paper describes an automatic ...
  67. [67]
    [PDF] The Multics virtual memory: concepts and design
    This paper discusses the properties of an "idealized". Multics memory comprised entirely of segments referenced by symbolic name, and describes the simula ...
  68. [68]
    [PDF] THRASHING - the denning institute
    1. P. J. Denning, Thrashing: Its causes and prevention, Proc. AFIPS Fall Joint. Comput. Conf. 1968; 32: 915-922.
  69. [69]
    The Memory Management Unit - Arm Developer
    The MMU manages tasks in private virtual memory, translates virtual addresses to physical addresses, and controls memory access permissions.Missing: faults | Show results with:faults
  70. [70]
    Page Faults - Intel
    A page fault occurs when a program accesses a memory page not mapped to its virtual address space. The MMU handles mapping.
  71. [71]
    [PDF] Chapter 3 - Memory Management
    However, logically it could be a separate chip and was years ago. (Source ... • Part of the chip's memory-management unit (MMU);. • Stores the recent ...
  72. [72]
    L17: Virtualizing the Processor - Computation Structures
    When switching between programs, we'd perform a “context switch” to move to the appropriate MMU context. The ability to share the CPU between many programs ...Missing: multi- | Show results with:multi-
  73. [73]
    What Does the Number of Bits in Physical Address Extensions (PAE ...
    For example, if a processor supports 48 virtual address bits, it can address up to 2^48 bytes of virtual memory. Memory Address Calculation: For a processor ...Missing: x86- 64
  74. [74]
    Translation caching: skip, don't walk (the page table)
    This paper shows that the most effective MMU caches are translation caches, which store partial translations and allow the page walk hardware to skip one or ...
  75. [75]
    Performance analysis of the memory management unit under scale ...
    ... TLB miss rates and the interference between page walks and application data in the cache hierarchy. We find that decreasing the MMU overhead - with large ...
  76. [76]
    High Bandwidth Memory (HBM): Everything You Need to Know
    Oct 30, 2025 · For example, HBM3E runs at 9.6 Gb/s, enabling a 1229 GB/s of bandwidth per stack. That's impressive, but HBM4 takes things to an entirely new ...
  77. [77]
    HBM3E: All About Bandwidth - Semiconductor Engineering
    Aug 8, 2024 · The production version of our latest HBM3E PHY supports DRAM speeds of up to 10.4Gbps or 1.33TB/s per DRAM device. This speed represents a >1.6X ...
  78. [78]
  79. [79]
    Recent progress in spin-orbit torque magnetic random-access memory
    Oct 1, 2024 · We review recent advancements in perpendicular SOT-MRAM devices, emphasizing on material developments to enhance charge-spin conversion efficiency.
  80. [80]
    Spin-transfer torque magnetic random access memory (STT-MRAM)
    STT-MRAM features fast read and write times, small cell sizes of 6F2 and potentially even smaller, and compatibility with existing DRAM and SRAM architecture ...
  81. [81]
    Resistive Switching Random-Access Memory (RRAM): Applications ...
    This work addresses the RRAM concept from materials, device, circuit, and application viewpoints, focusing on the physical device properties and the ...
  82. [82]
    An overview of phase-change memory device physics - IOPscience
    Mar 26, 2020 · Phase-change memory (PCM) is a key enabling technology for non-volatile electrical data storage at the nanometer scale. A PCM device consists of ...
  83. [83]
    Spin-Transfer Torque Magnetoresistive Random Access Memory ...
    Nov 6, 2024 · STT-MRAM is a non-volatile memory with speed, endurance, density, and ease of fabrication, replacing embedded Flash in advanced applications.
  84. [84]
    A survey on processing-in-memory techniques: Advances and ...
    TUPIM selects PIM-friendly instructions that (1) reduce intermediate data movements between on-chip caches and main memories, (2) are not well served by the ...
  85. [85]
    GraphP: Reducing Communication for PIM-Based Graph Processing ...
    Processing-In-Memory (PIM) is an effective technique that reduces data movements by integrating processing units within memory. The recent advance of “big ...
  86. [86]
    DL-PIM: Improving Data Locality in Processing-in-Memory Systems
    Oct 9, 2025 · PIM architectures aim to reduce data transfer costs between processors and memory by integrating processing units within memory layers. Prior ...
  87. [87]
    Metropolitan-scale heralded entanglement of solid-state qubits
    Oct 30, 2024 · Here, we report on heralded entanglement between two independently operated quantum network nodes separated by 10 kilometers.
  88. [88]
    Recent progress in hybrid diamond photonics for quantum ... - Nature
    May 8, 2025 · This review discusses recent progress and challenges in the hybrid integration of diamond color centers on cutting-edge photonic platforms.
  89. [89]
    New optical memory unit poised to improve processing speed and ...
    Jan 23, 2025 · Researchers have developed a new type of optical memory called a programmable photonic latch that is fast and scalable.Missing: 2020 | Show results with:2020
  90. [90]
    A plasmon-electron addressable and CMOS compatible random ...
    May 9, 2025 · Our plasmon-addressable memory platform offers versatile functionality in both nanoelectronic and nanoplasmonic systems, demonstrating a robust hybrid ...
  91. [91]
    A New Optical Memory Platform for Super-fast Calculations
    Dec 4, 2024 · The new memory uses magneto-optical material, has 100x faster switching, 1/10th power consumption, and can be reprogrammed multiple times, with ...Missing: post- | Show results with:post-<|control11|><|separator|>