Load–store architecture
A load–store architecture is an instruction set architecture in which arithmetic and logical operations are performed exclusively on data stored in registers, with memory access limited to dedicated load instructions that transfer data from memory to registers and store instructions that transfer data from registers to memory.[1][2] This separation ensures that computational instructions do not directly reference memory addresses, distinguishing it from register-memory or memory-memory architectures where operations can involve memory operands directly.[3]
Load–store architectures form the foundation of Reduced Instruction Set Computing (RISC) designs, which prioritize simplicity and efficiency in instruction execution.[3] Originating in the late 1970s and early 1980s through projects like the IBM 801, Berkeley RISC, and Stanford MIPS, these architectures aimed to optimize pipelined processors by minimizing memory access complexity and enabling better compiler optimizations.[3] Key characteristics include a large number of general-purpose registers (often 16 to 32) to hold operands and results, fixed-length instructions for uniform decoding, and the ability to schedule load/store operations in parallel with arithmetic instructions, which reduces overall execution latency.[1][3]
Prominent examples of load–store architectures include the ARM family, widely used in mobile and embedded systems; RISC-V, an open-source ISA gaining traction in academia and industry; MIPS, influential in early RISC implementations; SPARC, developed by Sun Microsystems for servers; and PowerPC, employed in high-performance computing.[1][4] In typical programs on these architectures, loads and stores account for roughly 15–35% of executed instructions, highlighting their role in balancing register usage with memory interactions.[3]
The advantages of load–store designs lie in their support for high-performance pipelining and superscalar execution, as the restricted memory access simplifies hardware design, shortens clock cycles, and lowers memory traffic by encouraging register reuse.[3][1] This approach has made them dominant in modern processors, particularly in energy-efficient and scalable systems, though they require sophisticated compilers to manage register allocation effectively.[3]
Core Concepts
Definition and Principles
A load–store architecture, also known as a register–register architecture, is an instruction set architecture in which memory access is strictly separated from computational operations. In this model, data must first be loaded from memory into registers using dedicated load instructions before any arithmetic or logical processing can occur, and results are subsequently stored back to memory via store instructions. All arithmetic logic unit (ALU) operations are performed exclusively between registers, prohibiting direct memory-to-memory or memory-operand computations.[5][6]
The core principles of load–store architectures emphasize this rigid separation to simplify instruction execution and enhance pipelining efficiency. Memory operations are confined to load and store instructions, ensuring that ALU operations remain register-bound and thus predictable in terms of timing and resource usage. This design mandates explicit data movement for all computations, fostering a uniform instruction format and reducing the complexity of decoding and execution stages. Unlike register-memory architectures, which allow ALU instructions to directly reference memory operands, load–store systems enforce a clear delineation that minimizes variable-latency memory accesses during computation.[7][5][6]
Central to this architecture is the register file, which serves as the primary hub for data manipulation and temporary storage during processing. The register file consists of a fixed set of general-purpose registers that hold operands and results for ALU instructions. Instruction formats in load–store architectures are typically fixed-length and include fields for the opcode, source and destination register specifiers, and immediate values or offsets for addressing in load and store operations. This structure supports efficient encoding and decoding, with load/store instructions often using base-plus-displacement addressing to compute memory locations relative to a register value.[7][6]
To illustrate, consider a simple addition operation followed by storage: an ADD instruction might compute the sum of two registers and place it in a third, as in ADD R1, R2, R3 (where R1 = R2 + R3), and a subsequent STORE instruction would write the result to memory, such as STORE R1, [address]. Direct operations like adding two memory locations without intermediate registers are not permitted, requiring explicit loads beforehand. This pseudocode exemplifies the separation:
LOAD R2, [mem_addr1] // Load first value into R2
LOAD R3, [mem_addr2] // Load second value into R3
ADD R1, R2, R3 // Compute sum in R1 (register-register)
STORE R1, [result_addr] // Store result to memory
[5][6]
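The four-instruction sequence above can be simulated with a few lines of Python (an illustrative sketch of load–store semantics; the register and memory names follow the pseudocode, not any real ISA):

```python
# Minimal load-store machine: memory is touched only by LOAD/STORE,
# while the ALU operation works exclusively on registers.
regs = {"R1": 0, "R2": 0, "R3": 0}
mem = {"mem_addr1": 5, "mem_addr2": 7, "result_addr": 0}

def load(rd, addr):          # LOAD Rd, [addr]
    regs[rd] = mem[addr]

def store(rs, addr):         # STORE Rs, [addr]
    mem[addr] = regs[rs]

def add(rd, rs1, rs2):       # ADD Rd, Rs1, Rs2 (register-register only)
    regs[rd] = regs[rs1] + regs[rs2]

load("R2", "mem_addr1")      # explicit data movement before computing
load("R3", "mem_addr2")
add("R1", "R2", "R3")        # compute entirely in registers
store("R1", "result_addr")   # explicit write-back to memory
print(mem["result_addr"])    # 12
```

Note that the ALU helper never reads or writes `mem`; the separation the text describes is enforced by the function signatures themselves.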
Register File and Operations
In load-store architectures, the register file serves as a small, high-speed array of general-purpose registers (GPRs) designed to hold operands and results for computations, minimizing memory access latency. Typically comprising 16 to 32 GPRs, each of fixed width such as 32 bits or 64 bits to match the processor's word size, the register file emphasizes rapid read and write operations.[7][8] For instance, the DLX architecture features 32 32-bit GPRs, with R0 hardwired to zero and R31 often used for return addresses.[9] In addition to GPRs, the register file includes specialized registers like the program counter (PC) for instruction addressing and status registers for flags such as overflow or zero conditions, though GPRs form the core for data manipulation.[7][10]
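The hardwired-zero convention and fixed register width described above can be modeled directly (a minimal Python sketch, not any particular processor's implementation):

```python
class RegisterFile:
    """32 general-purpose registers; register 0 reads as zero and ignores writes."""
    def __init__(self, count=32, width=32):
        self.mask = (1 << width) - 1   # wrap stored values to the register width
        self.regs = [0] * count

    def read(self, idx):
        return 0 if idx == 0 else self.regs[idx]

    def write(self, idx, value):
        if idx != 0:                   # writes to register 0 are silently discarded
            self.regs[idx] = value & self.mask

rf = RegisterFile()
rf.write(0, 123)                 # ignored: R0 stays hardwired to zero
rf.write(5, 0xDEADBEEF)
print(rf.read(0), hex(rf.read(5)))
```

A hardwired zero register is useful because common idioms (clearing a register, comparing against zero, synthesizing a move) become ordinary register-register instructions rather than special cases.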
Operations on the register file are restricted to register-to-register computations, ensuring that arithmetic, logical, and shift instructions operate exclusively on GPR operands without memory involvement. Arithmetic instructions include addition (e.g., ADD Rd, Rs1, Rs2), subtraction (SUB), and variants with immediate values (ADDI), all producing results in a destination register.[11][12] Logical operations encompass bitwise AND, OR, XOR (e.g., AND Rd, Rs1, Rs2), and their immediate forms (ANDI), enabling efficient bit-level manipulation.[11][9] Shift instructions, such as logical left shift (SLL) or arithmetic right shift (SRA), support both register and immediate shift amounts, facilitating data alignment and multiplication by powers of two.[12][9] These operations leverage the arithmetic logic unit (ALU) and are optimized for single-cycle execution in pipelined designs.[10]
Addressing modes in load-store architectures are intentionally simple to simplify hardware and enhance pipelining, limited primarily to register-indirect for memory accesses. Load instructions (e.g., LOAD Rd, [Rs + offset]) compute the effective address as the contents of a base register plus a sign-extended immediate offset, transferring data from memory to a destination register.[10][9] Store instructions (e.g., STORE Rs, [Rd + offset]) similarly use the base-plus-offset mode to write register data to memory, without allowing direct memory operands in computational instructions.[11] In ARM implementations, this extends to pre-indexed and post-indexed variants, where the base register is updated after address calculation, but complex modes like indirect through memory are avoided.[13] This restriction contrasts with more elaborate modes in non-RISC designs, prioritizing predictable timing over flexibility.[7]
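Base-plus-offset address calculation with a sign-extended immediate can be sketched as follows (illustrative Python; the 16-bit offset field is an assumption matching MIPS-style encodings):

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def effective_address(base_reg_value, imm16, word_mask=(1 << 32) - 1):
    """Base-plus-displacement: address = base register + sign-extended offset."""
    return (base_reg_value + sign_extend(imm16, 16)) & word_mask

print(hex(effective_address(0x1000, 0x0008)))   # base + 8  -> 0x1008
print(hex(effective_address(0x1000, 0xFFFC)))   # base - 4  -> 0xffc
```

Because the offset is sign-extended, a single encoding reaches both forward and backward from the base register, which is why stack frames and small structures can be addressed without extra arithmetic instructions.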
The data flow model in load-store architectures centers on the register file as an intermediary, enabling computations to bypass memory for reduced latency in pipelined execution. In a typical five-stage pipeline (instruction fetch, decode/register read, execute, memory access, write-back), operands are read from the register file during the decode stage, processed in the execute stage via the ALU, and written back to the register file in the write-back stage, with load/store instructions alone accessing memory in the dedicated stage.[10][7] This separation allows multiple register reads (often two or three ports) and writes (one or two ports) per cycle, with forwarding paths from execute or memory stages to resolve data hazards without stalling.[12] By confining arithmetic and logical operations to registers, the model exploits temporal locality, keeping active data on-chip and streamlining instruction dispatch.[7]
Comparisons with Other Architectures
Versus Register-Memory Architectures
Register-memory architectures, such as the Intel x86, permit arithmetic and logical operations to access one memory operand directly, allowing instructions such as ADD R1, [memory_address] where one operand is in a register and the other sourced from memory without explicit loading into a register first.[14] This contrasts with load-store architectures, which strictly separate memory access from computation by requiring data to be loaded into registers before any operations and stored back afterward.[4]
A primary difference lies in the number of instructions required for memory-involved operations: load-store architectures typically need three instructions (two loads followed by the compute) for tasks like adding two memory values into a register, whereas register-memory designs can accomplish the same in two (a load and a compute with a memory operand).[14] This results in higher instruction counts and potentially lower code density in load-store systems, though it enhances execution efficiency by minimizing memory traffic during computation.[4] In terms of instruction complexity, load-store architectures favor fixed-length, simple instructions with limited addressing modes, simplifying decoding and enabling straightforward pipelining, while register-memory architectures often employ variable-length instructions with richer addressing modes to support direct memory access, increasing hardware complexity for instruction fetch and decode.[15]
For illustration, consider computing the sum of two memory locations into a register:
Load-Store Pseudocode:
LOAD R1, mem1
LOAD R2, mem2
ADD R3, R1, R2
Register-Memory Pseudocode:
LOAD R3, mem1
ADD R3, [mem2]
This example highlights how load-store requires explicit data movement for both operands, leading to more instructions but allowing reuse of register values without repeated memory accesses.[14]
Hardware trade-offs further distinguish the approaches: load-store designs simplify the arithmetic logic unit (ALU) and pipeline stages by confining operations to registers, reducing control logic complexity and enabling higher clock speeds, but they necessitate a larger register file to accommodate temporary values and mitigate the increased instruction count.[4] Conversely, register-memory architectures demand more sophisticated hardware to handle memory operands inline, including additional addressing hardware and potential pipeline stalls from memory dependencies, though they can reduce overall register pressure by allowing direct memory use.[14] These choices reflect a balance between code compactness and execution predictability, with load-store prioritizing the latter for modern pipelined processors.[15]
Versus Stack-Based Architectures
Stack-based architectures utilize an operand stack as the primary mechanism for handling data, where arithmetic and logical operations implicitly pop the required operands from the top of the stack (TOS), perform the computation, and push the result back onto the stack.[16] This design contrasts with load–store architectures, which rely on a fixed set of explicit registers for computations, separating memory access into dedicated load and store instructions.[16] Representative examples include the Java Virtual Machine (JVM) bytecode, which employs a stack for operand management, and Hewlett-Packard calculators using Reverse Polish Notation (RPN), where user-entered values are pushed onto a stack for postfix evaluation.[17][18]
A fundamental difference lies in operand access and addressing: load–store architectures use named registers specified explicitly in instructions, enabling direct multi-operand operations without implicit positioning, whereas stack-based designs locate operands by their position on the stack, with the TOS serving as the implicit source and destination.[16] In stack machines, arithmetic instructions carry no explicit operand fields; programs instead build up the stack state through loads or pushes and then apply operations that manipulate the TOS.[16] This yields postfix instruction semantics in stack architectures—for example, the sequence PUSH A; PUSH B; ADD pops two values and pushes their sum—in contrast to the explicit three-operand form of load–store architectures, such as ADD R1, R2, R3, where registers R2 and R3 are added and the result placed in R1.[17]
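The postfix sequence PUSH A; PUSH B; ADD can be traced with a minimal stack-machine interpreter (an illustrative Python sketch, not JVM bytecode):

```python
# Stack machine: ADD carries no operand fields; it implicitly pops two
# values from the top of stack and pushes their sum (postfix evaluation).
def run_stack(program, env):
    stack = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(env[args[0]])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack

result = run_stack([("PUSH", "A"), ("PUSH", "B"), ("ADD",)], {"A": 2, "B": 3})
print(result)   # [5]
```

Note that the ADD tuple names no operands at all; the operands are wherever the preceding pushes left them, which is exactly the implicit-positioning property discussed above.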
Consider an example of accumulating a sum in a loop, such as computing the total of array elements. In a load–store architecture, this can leverage register indexing for efficiency: initialize a sum register to zero, load each array element into a temporary register using an address offset, add it to the sum register, and increment the index register. In a stack-based architecture, the process requires careful stack management, such as duplicating the sum value (via DUP) before loading the next element, adding, and then handling storage, which can lead to deeper stack usage and more complex sequencing to avoid overwriting intermediate results.[16]
These differences have notable implications for compilation. Load–store architectures facilitate straightforward register allocation during code generation, as compilers can assign variables to specific registers based on lifetime analysis, simplifying optimization for loops and data reuse.[17] Stack-based architectures, however, are particularly amenable to compiling expression trees via postfix traversal, enabling simple, uniform code generation for nested operations, but they complicate loop handling due to the need to track and manage stack depth to prevent spills or overflows.[17]
Historical Development
Origins in Early RISC Projects
The Reduced Instruction Set Computer (RISC) philosophy emerged in the late 1970s and early 1980s as a deliberate reaction to the growing complexity of Complex Instruction Set Computer (CISC) architectures, which featured intricate instructions, multiple addressing modes, and microcode implementations that complicated hardware design and limited performance scaling.[19] Proponents argued that simplifying the instruction set would reduce decoding overhead, enable deeper pipelining, and allow higher clock speeds by dedicating more transistors to execution rather than control logic.[19] This shift emphasized a load-store architecture, where arithmetic and logical operations occur exclusively between registers, with memory access restricted to dedicated load and store instructions, thereby separating computation from data movement to streamline hardware implementation.[20]
Pioneering academic projects at the University of California, Berkeley, and Stanford University formalized these ideas into the first load-store designs. At Berkeley, David Patterson initiated the RISC-I project in 1980, aiming to create a VLSI-compatible processor with a minimal instruction set focused on register operations.[20] The design incorporated 31 general-purpose registers (plus a program counter) and restricted memory interactions to load and store instructions, reflecting empirical studies of program behavior that revealed approximately 80% of executed instructions involved register-to-register operations without memory access.[19] Concurrently, at Stanford, John Hennessy launched the MIPS project in 1981, developing a load-store architecture with 32 registers and an emphasis on pipelined execution to achieve single-cycle throughput for most instructions.[21] These efforts were motivated by the need to minimize hardware complexity—such as eliminating variable-length instructions and complex addressing—to facilitate faster clock rates and more efficient VLSI fabrication.[21]
Early prototypes demonstrated the viability of these principles. The Berkeley RISC-I chip, fabricated in 1982 using a 5-micrometer NMOS process, featured exactly 31 instructions, all adhering to a load-store separation that ensured arithmetic operations remained register-bound while loads and stores handled two-cycle memory transfers.[20] This design achieved a clock speed of around 1 MHz and validated the RISC approach through benchmarks showing performance comparable to contemporary CISC machines but with simpler circuitry.[20] Although influenced by earlier experimental work, such as IBM's 801 minicomputer project in the 1970s—which introduced load-store separation and register-focused computation in a single-chip prototype—the Berkeley and Stanford efforts distinctly formalized these concepts within the RISC framework, prioritizing academic rigor and VLSI integration over proprietary constraints.[22]
Adoption in Commercial Processors
The commercialization of load-store architecture began in the mid-1980s with the founding of MIPS Computer Systems in 1984, culminating in the release of the R2000 processor in 1986.[23] This 32-bit implementation featured a pure load-store design, separating memory access from computation to enhance pipelining efficiency, and was later used in workstations such as the DECstation series starting in 1989.[24] Following this milestone, IBM introduced the RS/6000 line in February 1990, incorporating load-store elements into its POWER architecture for high-performance computing workstations and servers.[25] The POWER ISA's load-store model supported precise interrupts and efficient data handling, enabling over 800,000 RS/6000 systems to ship by 1999.[26]
In parallel, load-store principles expanded into embedded systems through ARM's development at Acorn Computers, where the ARM1 processor debuted in 1985 as a low-power RISC core for personal computing tasks like graphics and word processing.[27] Evolving to the ARMv2 core in 1987 with the ARM2 chip, this load-store architecture optimized register-based operations for battery-constrained devices, powering Acorn's Archimedes computers and laying the groundwork for widespread mobile adoption, including the Nokia 6110 in the 1990s.[27]
The architecture's influence extended to open standards with Sun Microsystems' SPARC in 1987, an open RISC specification featuring a load-store model with register windows to minimize memory traffic.[28] Standardized as IEEE 1754-1994, SPARC drove Unix-based workstations and servers, achieving over 450 record benchmarks by the 1990s and enabling scalable deployments in engineering and Internet infrastructure.[28] This openness facilitated adoption in supercomputing, where systems like Cray's T3D (1993) and T3E (1995) integrated RISC load-store processors such as the DEC Alpha for massively parallel processing in scientific simulations.[29]
The 1990s saw accelerated growth, highlighted by the 1991 AIM alliance forming PowerPC as a load-store RISC derivative of POWER for broader markets.[30] Its integration into Apple's Power Macintosh series from 1994 onward restored performance competitiveness, with millions sold annually in the mid-1990s, including around 4.5 million total Macintosh units in 1995.[31] Meanwhile, Intel's Itanium, announced in 1999 and released in 2001, adopted an explicit instruction-level parallel load-store model under the IA-64 umbrella, targeting enterprise servers despite later market challenges.[32] By 2000, load-store RISC architectures dominated commercial designs in embedded and high-performance segments.[33]
Notable Implementations
MIPS and RISC-I
The RISC-I, developed at the University of California, Berkeley in 1982 as part of the initial RISC project, employed a load-store architecture with 32 general-purpose registers (GPRs) numbered r0 through r31, where r0 was hardwired to zero and writes to it were ignored.[34] The design utilized a uniform 32-bit fixed-length instruction format for all 31 instructions, simplifying decoding and enabling efficient pipelining.[35] Memory operations were restricted to dedicated load and store instructions, including byte (LB/SB), halfword (LH/SH), and word (LW/SW) variants that supported sign extension or zero filling for sub-word loads.[34]
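The difference between sign extension and zero filling on a sub-word load can be shown with a short sketch (a hypothetical Python model of LB-style signed versus unsigned byte loads; the function names are illustrative, not RISC-I mnemonics):

```python
def lb(mem, addr):
    """Load byte with sign extension: bit 7 is treated as the sign."""
    b = mem[addr] & 0xFF
    return b - 0x100 if b & 0x80 else b

def lbu(mem, addr):
    """Load byte with zero filling: upper bits of the register become 0."""
    return mem[addr] & 0xFF

mem = {0x10: 0xF0}                     # byte value with the sign bit set
print(lb(mem, 0x10), lbu(mem, 0x10))   # -16 240
```

The same 8-bit pattern (0xF0) lands in a register as either -16 or 240 depending on which load variant the program chose, which is why the ISA must provide both.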
The MIPS project originated in 1981 at Stanford University under John Hennessy, initially featuring a prototype with 16 GPRs before evolving to 32 GPRs in commercial implementations like the R3000 released in 1988.[36] The R3000 maintained a load-store design, with load and store instructions using a 16-bit signed offset from a base register for addressing, limiting immediate displacements but promoting register-based computations.[36] It incorporated delayed branches, where the instruction immediately following a branch was always executed to fill pipeline bubbles, and provided interfaces for up to four coprocessors to handle tasks like floating-point arithmetic without interrupting the main pipeline.[36]
Key innovations in these designs included a consistent three-operand format for all arithmetic-logic unit (ALU) operations in both RISC-I and MIPS, allowing operations like ADD Rd, Rs, Rt to specify distinct destination and source registers independently.[34][36] Exception handling relied on dedicated registers rather than complex traps; for instance, MIPS used the Exception Program Counter (EPC) register to store the address of the interrupted instruction.[36] Early performance evaluations demonstrated the efficacy of these features, with the MIPS R3000 achieving up to 20 MIPS at 25 MHz in benchmark tests.[37]
The legacy of MIPS extends to its widespread adoption in embedded systems, powering networking equipment from vendors like Cisco and gaming consoles such as the original PlayStation, which utilized a customized R3000A core at 33.8 MHz.[38] MIPS Technologies open-sourced the architecture in 2018 under the MIPS Open initiative, but following the company's acquisition and subsequent bankruptcy, active development of the MIPS architecture was discontinued in 2021, with the company transitioning to RISC-V.[39]
ARM and RISC-V
The ARM architecture originated in 1985 as part of Acorn Computers' effort to develop a reduced instruction set processor for personal computing, establishing a load-store paradigm that separates memory access from computation.[40] It utilizes 16 general-purpose registers, R0 through R15, with R15 functioning as the program counter to hold the address of the next instruction.[41] Core memory operations rely on dedicated load (LDR) and store (STR) instructions, which support multiple addressing modes such as pre-indexed—where the offset is applied before memory access and the base register is updated—and post-indexed, where the update occurs after access, facilitating efficient data handling in resource-constrained environments.[42]
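The pre-indexed and post-indexed writeback behavior can be sketched as follows (an illustrative Python model of ARM-style LDR addressing; the register names and flat dictionary memory are assumptions for the example):

```python
# Pre-indexed  LDR Rd, [Rn, #off]! : address = Rn + off, then Rn <- Rn + off.
# Post-indexed LDR Rd, [Rn], #off  : address = Rn,       then Rn <- Rn + off.
def ldr_pre(regs, mem, rd, rn, off):
    addr = regs[rn] + off
    regs[rd] = mem[addr]
    regs[rn] = addr              # writeback of the already-offset address

def ldr_post(regs, mem, rd, rn, off):
    addr = regs[rn]
    regs[rd] = mem[addr]
    regs[rn] = addr + off        # writeback happens after the access

mem = {0x100: 11, 0x104: 22}
regs = {"R0": 0, "R1": 0x100}
ldr_post(regs, mem, "R0", "R1", 4)   # loads mem[0x100], then R1 -> 0x104
print(regs["R0"], hex(regs["R1"]))   # 11 0x104
```

Because the base register is updated as a side effect, a loop walking an array needs no separate increment instruction, which is the main appeal of these modes in tight embedded code.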
To address code density in embedded applications, ARM introduced Thumb mode in 1994, compressing a subset of instructions to 16 bits while maintaining compatibility with the 32-bit ARM instruction set, thereby reducing program size by up to 35% without sacrificing performance in narrow memory systems.[43] Distinctive features include conditional execution, allowing most instructions to be predicated on processor flags to minimize branch overhead, and the Jazelle extension, which enables hardware-accelerated execution of Java bytecode as a third mode alongside ARM and Thumb.[44]
RISC-V emerged as an open-standard instruction set architecture in 2010, developed by a team at the University of California, Berkeley, to provide a modular, extensible foundation for diverse computing needs, building on principles from earlier RISC designs like MIPS.[45] Its base integer ISA (RV32I or RV64I) includes 32 general-purpose registers denoted x0 through x31, with x0 hardwired to zero to serve as a constant source and simplify certain operations.[46] The architecture enforces a strict load-store model, where memory instructions like LB (load byte) and SB (store byte) handle all data movement, while arithmetic and logical operations act solely on registers.
Modularity is central to RISC-V, with standard extensions such as M for integer multiplication and division—adding instructions like MUL and DIV—and A for atomic memory operations, including load-reserved (LR) and store-conditional (SC) for thread-safe synchronization in multiprocessor systems.[47] The V vector extension, frozen in version 1.0 and ratified in 2021, introduces scalable vector registers and instructions for parallel data processing, complementing the base ISA's uncompressed 32-bit fixed-length format that avoids the complexity of variable-length decoding.[48]
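The LR/SC pair enables lock-free read-modify-write sequences via a retry loop. The following is a simplified single-hart Python model (the reservation tracking here is an assumption for illustration; real hardware invalidates the reservation on intervening writes, traps, or other harts' stores):

```python
class Memory:
    """Toy memory with a single LR reservation, RISC-V A-extension style."""
    def __init__(self):
        self.data = {}
        self.reservation = None

    def lr(self, addr):                 # load-reserved: read and take reservation
        self.reservation = addr
        return self.data.get(addr, 0)

    def sc(self, addr, value):          # store-conditional: returns 0 on success
        if self.reservation == addr:
            self.data[addr] = value
            self.reservation = None
            return 0
        return 1                        # reservation lost; caller must retry

def atomic_increment(mem, addr):
    while True:                         # classic LR/SC retry loop
        old = mem.lr(addr)
        if mem.sc(addr, old + 1) == 0:
            return old

mem = Memory()
mem.data[0x40] = 7
print(atomic_increment(mem, 0x40), mem.data[0x40])   # 7 8
```

Note that the update stays entirely within the load–store discipline: there is no combined memory-operand arithmetic, only a reserved load, a register-side add, and a conditional store.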
ARM's design emphasizes power efficiency, making it ideal for battery-constrained embedded devices through techniques like low-power states and optimized pipelines, while RISC-V prioritizes customizability via its royalty-free, open-source model, allowing implementers to add or modify extensions without licensing fees to suit specific IoT or specialized hardware needs.[43][49] By 2020, ARM-based processors dominated the smartphone market with approximately 95% share, powering billions of mobile devices annually; as of 2025, this has grown to over 99%.[50][51] In contrast, RISC-V has seen rapid adoption in IoT ecosystems, where its flexibility supports low-cost, tailored microcontrollers for sensors, edge computing, and connected devices; as of 2025, it is increasingly used in high-performance computing and AI accelerators by vendors like SiFive and Alibaba.[52][53]
Advantages and Limitations
Load–store architectures facilitate higher clock speeds by simplifying instruction decoding and execution pipelines: isolating computational operations from memory addressing reduces hardware complexity and enables deeper pipelining, free of the decoding overhead imposed by variable-length instructions and variable-latency memory operands in compute paths. For instance, early RISC implementations like the MIPS M/2000 achieved a 40 ns cycle time, comparable to the VAX 8700's 45 ns but with superior effective throughput due to streamlined pipelines.
In terms of code execution efficiency, load–store designs with large register files minimize memory accesses by keeping operands in registers, leading to fewer cache misses and higher instruction throughput. Studies on integer codes show that register promotion in such architectures reduces loads by 30%–60% and stores by 50%, cutting memory traffic and improving performance by 5%–20% through optimizations like global variable allocation.[54] Empirical evidence from 1980s benchmarks demonstrates 2–3× speedups for RISC load–store processors over CISC register-memory designs, attributed to fewer loads/stores per computation.[3]
Power efficiency benefits arise from the reduced transistor count for arithmetic logic units, as memory addressing is confined to dedicated load/store instructions, avoiding complex operand decoding in compute paths. ARM processors, exemplifying load–store RISC, exhibit superior energy efficiency in server workloads like SQL and static HTTP serving compared to x86 equivalents due to these design traits.[55] Quantitative metrics further highlight gains: load–store architectures achieve instructions per cycle (IPC) of 1–2 in pipelined implementations, versus 0.5 or less in complex ISAs, while predictable memory operations enhance cache hit rates.
Design Trade-offs and Challenges
One significant trade-off in load-store architectures is reduced code density compared to register-memory or complex instruction set computing (CISC) designs. These architectures require separate load and store instructions for all memory accesses, leading to more instructions per program and thus larger binaries; on average, RISC load-store code is about 25% larger than equivalent CISC code, with some benchmarks showing up to threefold increases due to fixed-length 32-bit instructions versus variable-length ones in CISC.[56] This increased size can pressure instruction caches and memory bandwidth, potentially offsetting some performance gains from simpler decoding. To mitigate this, techniques like instruction compression have been employed; for instance, ARM's Thumb mode uses 16-bit instructions for common operations, achieving approximately 30% better code density over standard 32-bit ARM instructions.[57]
Another challenge arises from register pressure, as load-store architectures mandate that all arithmetic and logical operations occur exclusively between registers, amplifying the demand for on-chip register resources. With typically 32 general-purpose registers, compilers often face scenarios where live variables exceed available registers, necessitating spilling to memory via additional load and store instructions, which introduces latency and complexity in code generation. Register allocation in these systems relies on sophisticated algorithms like graph coloring, where variables are nodes in an interference graph and colored to assign registers without conflicts, but high pressure can lead to suboptimal spilling decisions that degrade performance.[58] This issue is particularly pronounced in RISC designs with specialized registers (e.g., for immediate values or zero), further constraining the effective register pool and requiring integrated scheduling to minimize spills.[47]
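The spilling behavior under register pressure can be illustrated with a simplified greedy coloring pass (a Python sketch; production allocators in the Chaitin tradition add iterated simplification and spill-cost heuristics, which are omitted here):

```python
# Greedy coloring of an interference graph: variables live at the same
# time interfere and must receive different registers; when no register
# is free, the variable is spilled and accessed via loads and stores.
def allocate(interference, k):
    """interference: {var: set of interfering vars}; k = register count."""
    assignment, spilled = {}, []
    # Color high-degree (most-constrained) variables first.
    for var in sorted(interference, key=lambda v: len(interference[v]),
                      reverse=True):
        used = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(k) if r not in used]
        if free:
            assignment[var] = free[0]
        else:
            spilled.append(var)        # requires extra load/store spill code
    return assignment, spilled

# a, b, c are mutually live; d overlaps only with a. Two registers available.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
regs, spills = allocate(graph, 2)
print(regs, spills)
```

With only two registers, the mutually interfering triple a, b, c cannot all be colored, so one of them is spilled; this is precisely the case where a load–store ISA pays for register pressure with additional memory instructions.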
Hardware implementation costs also pose trade-offs, primarily from the need for a large, multi-ported register file to support parallel register accesses in pipelined execution. In early RISC processors like RISC II, the register file consumed 27.5% of the chip area, highlighting how scaling to 32 registers significantly increases die size, power consumption, and access latency, since register-file area grows roughly quadratically with the number of read/write ports. Additionally, branch delay slots—common in load-store RISC designs to hide pipeline hazards—complicate modern branch prediction, as the delay-slot instruction executes unconditionally regardless of the branch outcome, making it harder to speculate correctly in superscalar designs and often requiring no-operation (NOP) fillers that waste cycles.[59][60]
Compatibility with legacy software presents further hurdles, especially when emulating CISC binaries on load-store hardware. The Intel Itanium, an explicitly parallel instruction computing (EPIC) load-store processor produced from 2001 to 2021, exemplified these issues by relying on compilers to expose instruction-level parallelism, which proved challenging for optimizing existing x86 CISC code and led to inefficient emulation modes for complex instructions. Hardware accelerators for x86 compatibility existed but incurred high overhead due to the architectural mismatch, contributing to Itanium's market struggles as software ecosystems favored adaptable x86-64 extensions over full redesigns.[61][62]