Modified Harvard architecture
Modified Harvard architecture is a variation of the classic Harvard computer architecture in which instructions and data are stored in physically separate memory spaces with dedicated buses, enabling simultaneous access, but with relaxed restrictions that permit the instruction memory to be accessed as data or both memories to share a common underlying address space.[1][2] This design addresses limitations of the pure Harvard model, such as the inability to load programs dynamically, while mitigating the von Neumann bottleneck of shared memory access.[3] The architecture typically features distinct instruction and data caches at the processor level, both mapping to a unified main memory, which allows for efficient pipelined execution and higher bandwidth compared to von Neumann systems.[2][1]

Key advantages include reduced contention for memory resources during instruction fetch and data operations, leading to improved performance in real-time applications, though the modifications can introduce complexity in cache coherence management.[3] It is widely implemented in modern microcontrollers and digital signal processors (DSPs), such as the Atmel AVR, PIC series, ARM processors, and x86 architectures, where the separation enhances speed for embedded systems while the flexibility supports general-purpose computing tasks like self-modifying code.[2]

Historically, the modified Harvard approach evolved from early Harvard designs like the Harvard Mark I in the 1940s, which used separate punched tape for instructions and electromechanical counters for data, but gained prominence in the 1980s with DSPs requiring high-throughput data processing.[2][3] Unlike pure Harvard architectures, which prohibit any overlap and are common in specialized DSPs like the Texas Instruments TMS32010, the modified variant balances isolation for performance with interoperability for software versatility.[2][4] This hybrid nature makes it a foundational element in contemporary computing, from mobile devices to high-performance embedded systems.

Foundational Architectures
Harvard Architecture
The Harvard architecture is a computer architecture that employs a strict separation of instruction memory and data memory into distinct physical address spaces, preventing any overlap between the two. This design utilizes two independent sets of buses, one dedicated to fetching instructions from the instruction memory and another for accessing data from the data memory, allowing the processor to perform simultaneous read operations on both types of memory during a single clock cycle. As a result, this separation enables parallel fetch and execution, which can increase throughput by avoiding the bottlenecks associated with shared memory pathways.[3][5]

The architecture originated with the Harvard Mark I, a relay-based computer developed by Howard Aiken and completed in 1944, which featured separate storage mechanisms: punched paper tapes for program instructions and dedicated registers and switches for data. In this electromechanical machine, the isolation of program and data storage ensured reliable operation in an era of limited electronic components, prioritizing mechanical precision over flexibility. The Harvard Mark I's design exemplified early efforts to build general-purpose calculators for complex computations, such as those needed for wartime ballistics tables.[6][7] The architecture later found typical application in specialized devices such as digital signal processors and microcontrollers, where the emphasis on predictable execution timing and high-speed parallel access outweighed the need for dynamic program modification.
In contrast to shared memory models like the Von Neumann architecture, the Harvard design's dual-bus structure provides inherent parallelism but at the cost of reduced address space efficiency.[3] Conceptually, the architecture can be illustrated as follows, with the processor connected to separate memory banks via dedicated buses for instructions and data:[5]

    +--------------------+   Instruction Bus   +-----------+
    | Instruction Memory | ------------------> |           |
    +--------------------+                     | Processor |
    +--------------------+      Data Bus       |           |
    |    Data Memory     | <-----------------> |           |
    +--------------------+                     +-----------+
                      Dual Memory Buses
Von Neumann Architecture
The Von Neumann architecture, proposed in John von Neumann's 1945 "First Draft of a Report on the EDVAC," describes a stored-program computer design where a single memory unit holds both program instructions and data, enabling flexible computation through a unified address space.[8] This model revolutionized computing by allowing programs to be loaded into the same memory as data, facilitating the creation of general-purpose machines that could execute arbitrary instructions without hardware reconfiguration.[9] The architecture's core components include a central processing unit (CPU) that fetches instructions and data from memory via a shared address bus and data bus, processes them in an arithmetic logic unit (ALU), and stores results back into the same memory space.[8] A key feature of this design is its support for self-modifying code, where programs can alter their own instructions during execution since they reside in the same modifiable memory as data, enabling dynamic adaptation but also introducing complexity in debugging and reliability.[9] However, the unified memory access creates the Von Neumann bottleneck, a limitation where instructions and data must compete for the same memory bandwidth through sequential fetches over the shared bus, restricting parallelism and overall system throughput as computational speeds outpace memory access rates.[10] This bottleneck, first articulated by John Backus in his 1978 Turing Award lecture, arises because the architecture's single bus cannot simultaneously handle instruction fetches and data operations without contention, leading to inefficiencies in high-performance scenarios.[10] Early implementations of the Von Neumann architecture included the Manchester Baby (Small-Scale Experimental Machine), which ran its first stored program in June 1948, demonstrating the feasibility of electronic memory for both instructions and data.[11] The EDSAC, completed in 1949 at the University of Cambridge, further exemplified the 
design by successfully executing complex programs using mercury delay-line memory for unified storage, influencing subsequent stored-program computers like the IAS machine at Princeton.[9] These systems, derived from concepts initially explored in ENIAC modifications for stored programming, established the Von Neumann model as the dominant paradigm for general-purpose computing, powering most digital computers from the mid-20th century onward.[11]

Overview of Modified Harvard Architecture
Definition and Key Principles
The modified Harvard architecture represents a hybrid compromise between the pure Harvard architecture's strict separation of instruction and data memories and the von Neumann architecture's unified memory model. It retains dedicated pathways for instructions and data at the cache level to enable concurrent access, while permitting a shared main memory backing store for greater flexibility.[1] A core principle is the use of a separate instruction cache (I-cache) for program code and a data cache (D-cache) for operands, which mitigates the single-bus contention inherent in von Neumann designs and supports parallel fetches during execution. This separation at the cache level preserves the performance advantages of Harvard-style access, but the relaxation of isolation allows both caches to draw from a common main memory address space on misses, enabling unified memory allocation and simplifying programming.[2][1]

The architecture emerged in the 1980s and 1990s as processors evolved to meet demands for both high-speed parallel operations and cost-effective shared memory in microcontrollers and RISC designs. In basic operation, the processor simultaneously retrieves instructions from the I-cache and data from the D-cache; upon a miss, the required content is loaded from the unified main memory into the relevant cache, maintaining efficiency without full memory duplication.[12][2]

Advantages and Trade-offs
The modified Harvard architecture offers several performance advantages stemming from its hybrid design, which incorporates separate caches for instructions and data while sharing a unified main memory. This separation reduces memory contention by enabling simultaneous fetches of instructions and data from distinct cache banks, alleviating the Von Neumann bottleneck where a single bus handles both operations. In practice, this parallelism yields modest speedups, such as up to 1.25 times the performance in ARM9 processors compared to unified cache designs, primarily through optimized cache access patterns that approach one instruction per clock cycle.[13] Additionally, the architecture provides greater flexibility than pure Harvard designs by allowing dynamic loading and modification of code in shared memory, facilitating applications like just-in-time compilation without the rigid isolation of separate memory spaces.[14] From a security perspective, the partial separation of instruction and data caches enhances protection against certain exploits, such as buffer overflows that attempt to inject and execute malicious code in data memory, as the program counter is typically restricted to instruction memory. This mitigates transient code injection attacks common in Von Neumann systems, where data and instructions share the same address space. However, the architecture is not fully immune, as modified Harvard implementations often permit program memory updates (e.g., via special instructions like SPM in AVR), enabling permanent code injection through techniques like return-oriented programming to copy payloads into instruction space.[15] Key trade-offs arise from the increased hardware complexity of maintaining dual caches, which demands more silicon area and specialized logic for cache management, potentially raising manufacturing costs and design effort compared to simpler unified cache systems. 
Cache coherence issues further complicate this, particularly in scenarios involving self-modifying code or dynamic code generation: modifications written through the data cache do not automatically propagate to the instruction cache, so software must intervene with explicit cache flushes and barriers, an overhead whose performance cost can vary by up to 12% across platforms.[14] While the separate caches allow tailored optimizations, such as treating instructions as immutable to reduce coherence traffic, balancing the two cache sizes for good hit rates remains a critical design consideration; a poorly balanced split suffers excessive misses to the unified memory, diminishing the benefits of parallelism.[13]

Variations of Modified Harvard Architecture
Split-Cache Architecture
The split-cache architecture represents the predominant implementation of modified Harvard architecture, characterized by independent level-1 instruction cache (I-cache) and data cache (D-cache) with distinct tag storage, associativity, and management policies, while both caches address the identical unified main memory space. This separation at the cache level enables parallel fetching of instructions and data, mitigating the structural hazards inherent in unified caches.[16][17] In operation, the processor loads executable instructions exclusively into the I-cache upon fetch requests, while operand data populates the D-cache during load or store operations; both caches employ dedicated control logic to handle misses by querying the shared higher-level memory hierarchy. Cache coherence is preserved through protocols such as MESI, which track line states (Modified, Exclusive, Shared, Invalid) to synchronize updates across caches—ensuring, for instance, that a write to main memory via the D-cache invalidates or updates corresponding lines in the I-cache if affected. This mechanism is critical in environments with self-modifying code or dynamic instruction generation, preventing stale data propagation.[18] Often described as "almost Von Neumann," the split-cache design emulates Von Neumann's unified addressing at the main memory interface for programming simplicity, yet adopts Harvard-style partitioning at the cache to accelerate access latencies and bandwidth utilization. 
Hardware implementations typically incorporate separate address and data buses or ports to each cache, minimizing contention during concurrent instruction execution and data manipulation; for example, early configurations in processors like the Intel Pentium featured an 8 KB I-cache and an 8 KB D-cache, balancing size constraints with performance gains.[17] This variation evolved during the 1980s as processor designers sought to alleviate Von Neumann bottlenecks, such as single-bus limitations on throughput, without the complexity and cost of fully segregated Harvard memory systems. Processors like the Motorola 68030, introduced in 1987, pioneered on-chip split caches with a 256-byte I-cache and a 256-byte D-cache to enable pipelined execution and burst modes, influencing subsequent general-purpose and embedded designs.

Instruction Memory Accessed as Data
In the modified Harvard architecture variant known as instruction memory accessed as data, the contents of the program memory, typically used for storing instructions, are made available for reading as data through dedicated hardware mechanisms. This flexibility is achieved via special instructions or operational modes that bridge the separate address spaces of instruction and data memories, allowing the processor to treat program memory locations as a readable data space without fully unifying the memories. For instance, in AVR microcontrollers, functions such as pgm_read_byte() enable the retrieval of byte values from Flash-based program memory into data registers, while compiler directives like the PROGMEM attribute direct constants to be stored in program memory rather than limited RAM.[19] Similarly, in digital signal processors (DSPs), enhanced Harvard designs incorporate auxiliary access paths, such as special load instructions, to fetch data from program memory, often leveraging multiported memory structures for concurrent instruction fetches and data reads.[20][21]
This access mechanism supports key use cases in resource-constrained environments, particularly where data memory is scarce compared to program memory capacity. A primary application is storing large constant datasets, such as lookup tables or filter coefficients, directly in program memory to conserve RAM for dynamic variables; in AVR systems with only 2 KB of RAM, this approach prevents unnecessary copying of constants at runtime, improving memory efficiency.[19] In DSPs, it facilitates efficient signal processing by allowing coefficients to reside in program memory, enabling parallel access during multiply-accumulate operations without bottlenecking the data bus.[20] Additionally, it enables scenarios like self-modifying code, where programs can read their own instructions as data for analysis or alteration, though this requires careful handling to maintain coherence between memory views.
Implementation often involves a unified address decoder that maps program memory addresses into the data space under specific conditions, such as privileged modes or explicit instruction opcodes, while preserving separate buses for performance. In the dsPIC DSC family from Microchip, for example, the modified Harvard bus structure uses dedicated program and data buses but provides special auxiliary instructions for reading from program memory into data space, integrated with the CPU's pipeline to minimize latency.[22] Historical examples trace back to early DSPs like the Analog Devices ADSP-21xx series in the 1980s, which employed this variant to store coefficients in program memory for real-time filtering tasks, balancing the architecture's parallelism with practical data needs.[23] When combined with split-cache designs, this access can introduce risks like instruction cache pollution, where data fetches inadvertently load into the I-cache, potentially evicting useful code and degrading fetch efficiency.[19]
Despite these benefits, the approach carries limitations, including added programming complexity from the need for custom functions—such as AVR's strcpy_PF() for string handling in program memory—and potential performance overhead from mode switches or special instruction decoding.[19] Security concerns arise as well, since exposing program memory contents as data could enable unintended leakage of proprietary code in multi-tenant or networked systems, though this is mitigated in isolated embedded contexts.[15] Overall, these trade-offs make the variant suitable for embedded and DSP applications where memory asymmetry is pronounced, but less ideal for general-purpose computing requiring seamless address unification.
Data Memory Accessed as Instructions
In modified Harvard architectures supporting data memory accessed as instructions, the mechanism involves hardware provisions for indirect instruction fetches from the data address space, such as jump instructions targeting data addresses or configurable memory partitions that designate RAM regions as executable. This allows content stored in data memory, typically RAM, to be treated as machine code and fed into the instruction pipeline. For example, in the Microchip PIC32MX microcontroller family, which employs a MIPS M4K core, data memory can be partitioned into kernel and user program spaces using Bus Matrix (BMX) control registers like BMXDKPBA and BMXDUDBA; once configured, these regions become executable, enabling jumps to data addresses via standard MIPS instructions like JR (jump register).[24] This capability finds applications in environments requiring dynamic code generation, including just-in-time (JIT) compilers, interpreters that generate bytecode at runtime, and dynamic loaders that relocate code segments into available memory. It supports scenarios with variable instruction placement, such as swapping code overlays to manage limited program memory by temporarily storing executable segments in data RAM. In the PIC32MX series, this facilitates runtime code modifications for embedded systems handling adaptive algorithms or script execution.[24][19] Hardware support generally includes a multiplexed or shared bus configuration that routes data memory outputs to the instruction fetch unit, often with mode bits to switch access types. 
In the PIC32MX, the Bus Matrix module coordinates this by allowing the CPU's instruction side (IS) to access partitioned data memory, while the data side (DS) handles normal operations; however, coherence issues arise when updates to data memory affect subsequent instruction fetches, necessitating pipeline flushes or invalidations to avoid executing outdated code.[24] A practical example is found in microcontrollers like the PIC32MX, where enabling executable data RAM supports code overlays, dynamically loading subroutine segments into RAM to augment fixed flash-based program memory without hardware reconfiguration. This is configured post-reset by setting BMX registers to allocate portions of the 32-bit addressable RAM (e.g., 5 KB for kernel program space in a 32 KB device). Drawbacks include heightened pipeline design complexity to manage overlapping memory uses and risks of runtime errors, such as bus exceptions from fetching invalid, misaligned, or unprotected code regions, which can lead to system instability if partitions are improperly set.[24][19] This approach extends Harvard architecture flexibility by permitting controlled breaches in the instruction-data separation for runtime adaptability.

Comparisons with Other Architectures
With Pure Harvard Architecture
The pure Harvard architecture maintains complete isolation between instruction and data memory spaces, utilizing separate address spaces, buses, and storage units to prevent any overlap or shared access. In contrast, the modified Harvard architecture introduces partial overlap, such as through shared main memory or mechanisms allowing instruction memory to be accessed as data, thereby enabling greater flexibility in memory utilization. Both architectures derive from the core principle of separated memory pathways to support concurrent instruction fetch and data access.[3] In terms of performance, the pure Harvard design achieves true simultaneity in memory operations, allowing the CPU to read instructions and access data in parallel without contention, which is particularly beneficial for applications requiring predictable timing. However, this isolation precludes self-modification of code, as instructions cannot be treated or altered as data. The modified variant trades this uncompromised parallelism for versatility, potentially introducing minor contention during shared access scenarios, though it supports code modification and dynamic loading.[25][26] Regarding complexity, the pure Harvard architecture offers simpler memory management due to its rigid separation, eliminating the need for additional hardware or logic to handle cross-access between instruction and data domains. The modified approach, while more adaptable, necessitates coherence mechanisms—such as cache flushing or synchronization protocols—to maintain consistency when memory spaces overlap, increasing design and implementation complexity.[3] Use cases for the pure Harvard architecture are typically limited to fixed-function devices where program code remains static, such as early electromechanical calculators like the Harvard Mark I or certain digital signal processors with unchanging firmware. 
Modified Harvard architectures, conversely, suit more adaptable systems requiring runtime code updates or efficient handling of variable workloads, like modern embedded controllers that balance performance with programmability.[25][26]

With Von Neumann Architecture
The Von Neumann architecture utilizes a single unified memory and bus system for both instructions and data, resulting in sequential access patterns where the processor must alternate between fetching instructions and loading data, thereby creating the well-known Von Neumann bottleneck that limits overall performance.[27] In comparison, the modified Harvard architecture addresses this limitation by incorporating separate instruction and data caches connected to the processor, allowing simultaneous access to instructions and data at the cache level even though the underlying main memory remains unified.[28] This dual-cache design enables parallel fetch operations, mitigating the sequential constraints of the Von Neumann model while preserving a shared address space for compatibility.[29]

One key benefit of the modified Harvard approach is its enhancement of memory bandwidth through independent cache pathways, which can provide greater aggregate throughput and more predictable access times compared to the shared bus in Von Neumann systems.[28] For instance, in scenarios with high cache hit rates, this separation reduces contention and effectively increases available bandwidth for instruction and data operations.[27] Both architectures support self-modifying code, as the modified Harvard design allows instructions to access and alter data memory (and vice versa) via the unified main memory, but the added cache parallelism in modified Harvard delivers performance gains without necessitating a complete architectural overhaul from the Von Neumann baseline.[28] The modified Harvard architecture has evolved as a practical enhancement to the Von Neumann model, which has long dominated personal computers and general-purpose processors due to its simplicity and flexibility; by integrating split caches, it offers bottleneck relief in modern implementations without sacrificing the unified memory model's advantages.[29]

| Aspect | Von Neumann Architecture | Modified Harvard Architecture |
|---|---|---|
| Memory Access Patterns | Sequential fetches over a single shared bus, leading to contention between instructions and data | Parallel access via separate instruction and data caches, despite shared main memory |
| Latency | Higher average latency due to bus arbitration and sequential queuing | Reduced latency on cache hits through independent cache ports and pipelined fetches |
| Scalability | Limited by unified bus bandwidth as core counts increase | Improved scalability with cache hierarchies and multi-core support via isolated paths |
| Bandwidth | Constrained by shared pathway, exacerbating the Von Neumann bottleneck | Higher effective bandwidth from concurrent cache operations, enabling better throughput |