
Out-of-order execution

Out-of-order execution is a technique that allows a processor to execute instructions in a sequence different from their original program order, provided that data dependencies and other constraints are satisfied, thereby improving performance by reducing idle time in execution units and exploiting instruction-level parallelism. This approach originated in the 1960s as part of efforts to overcome limitations in early pipelined processors, with the IBM System/360 Model 91, announced in 1966, becoming the first commercial machine to implement it using Tomasulo's algorithm for floating-point operations. Tomasulo's algorithm, developed by Robert M. Tomasulo, employs reservation stations to buffer instructions and operands, register renaming to eliminate false dependencies, and a common data bus to broadcast results, enabling dynamic scheduling. Later implementations added mechanisms like reorder buffers to maintain precise exceptions through in-order retirement. In modern processors, out-of-order execution has become a cornerstone of high-performance microarchitecture, featured in designs from Intel (starting with the Pentium Pro in 1995), AMD, and ARM-based chips like those in Apple silicon, where it sustains instruction throughput despite long-latency operations such as cache misses. Key benefits include tolerance for read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW) data hazards, and enhanced utilization of superscalar pipelines with multiple functional units, though it introduces complexity in hardware for dependency tracking and power consumption. Despite challenges such as vulnerability to transient-execution attacks in contemporary systems, out-of-order execution remains essential for achieving high instructions per cycle (IPC) in general-purpose CPUs.

Basic Concepts

In-Order Execution

In-order execution refers to the traditional model in which a processor processes instructions strictly in the sequential order specified by the program, ensuring that each instruction completes its fetch, decode, execute, memory access, and writeback stages before the next one advances fully through the pipeline. This approach maintains sequential semantics by preserving the original order, which simplifies exception handling since interrupts and faults occur in a predictable order corresponding to the source code. However, it inherently limits the exploitation of instruction-level parallelism (ILP) by requiring all prior instructions to complete before subsequent ones can proceed, even if later instructions are independent. A classic implementation of in-order execution is the five-stage RISC pipeline, which divides instruction processing into distinct phases to overlap execution where possible while adhering to program order. The stages are: (1) Instruction Fetch (IF), where the processor retrieves the instruction from memory using the program counter (PC); (2) Instruction Decode/Register Fetch (ID), where the instruction is decoded and source registers are read; (3) Execute/Effective Address (EX), where the ALU performs computations or calculates memory addresses; (4) Memory Access (MEM), where data is read from or written to memory if required (e.g., for loads or stores); and (5) Writeback (WB), where results are written back to the destination register. In this pipeline, each stage takes one clock cycle under ideal conditions, aiming for a throughput of one instruction per cycle, but hazards disrupt this flow by forcing stalls. Pipeline hazards in in-order execution arise primarily from three sources: structural hazards, data hazards, and control hazards, all of which cause the pipeline to stall until the conflict is resolved. Structural hazards occur when hardware resources are insufficient to handle simultaneous demands from multiple instructions, such as a single memory port being accessed for both instruction fetch and data operations in the same cycle, leading to a one-cycle stall.
Data hazards, more common in practice, occur when an instruction depends on the result of a prior instruction that has not yet completed, categorized as read-after-write (RAW), write-after-read (WAR), or write-after-write (WAW). A typical RAW hazard is a load-use dependency, where an instruction immediately following a load requires the loaded value; for example, in the sequence LW R1, 0(R2) (load word into R1) followed by ADD R3, R1, R4 (add R1 to R4), the ADD must stall for one cycle until the load completes its MEM stage, with the loaded value forwarded for use in the ADD's EX stage. Control hazards occur due to conditional branches or jumps, where the next instruction's address depends on a condition resolved in the EX stage. Without branch prediction, the processor fetches instructions sequentially after the branch, potentially flushing 2-3 incorrectly fetched instructions if the branch is taken, incurring a penalty of several cycles. These hazards impose significant performance limitations by blocking independent instructions behind dependent ones, reducing effective throughput and increasing cycles per instruction (CPI) beyond the ideal value of 1. Consider a simple assembly code snippet:
LW  R1, 0(R2)   # Load word into R1
SUB R4, R1, R5  # Dependent: R4 = R1 - R5 (RAW hazard on R1)
LW  R6, 0(R7)   # Independent load into R6
MUL R8, R9, R10 # Independent multiply R8 = R9 * R10
In an in-order pipeline with forwarding, the SUB instruction stalls the pipeline for one cycle due to the load-use hazard on R1, forcing the second LW and the MUL to wait idly despite their independence from the load, resulting in lost ILP opportunities and a CPI exceeding 1 for such sequences. Out-of-order execution addresses these limitations by dynamically reordering independent instructions to proceed during stalls, without altering program order for completion.
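The stall accounting above can be sketched in a few lines of Python. This is a deliberately minimal model (not a real simulator): the only hazard it tracks is the one-cycle load-use bubble under full forwarding, the instruction encoding is an assumption for this sketch, and the cycle count includes the four cycles needed to fill a five-stage pipeline, which inflates CPI for short sequences.

```python
# Toy model of a 5-stage in-order pipeline with full forwarding.
# Instructions are encoded as (opcode, dest, sources); only the
# one-cycle load-use RAW bubble is modeled.

def pipeline_cycles(program):
    cycles = len(program) + 4            # N instructions + 4 fill cycles
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op_p, dest_p, _ = prev
        _, _, srcs_c = curr
        if op_p == "LW" and dest_p in srcs_c:   # load-use RAW hazard
            stalls += 1                          # one bubble with forwarding
    return cycles + stalls

program = [
    ("LW",  "R1", ("R2",)),          # LW  R1, 0(R2)
    ("SUB", "R4", ("R1", "R5")),     # load-use hazard on R1 -> 1 stall
    ("LW",  "R6", ("R7",)),          # independent load
    ("MUL", "R8", ("R9", "R10")),    # independent multiply
]
total = pipeline_cycles(program)
print(total, total / len(program))   # 9 2.25
```

The second LW and the MUL contribute no stalls, yet in this in-order model they still wait behind the SUB's bubble, which is exactly the lost ILP the section describes.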

Out-of-Order Execution

Out-of-order execution is a technique in which a processor dynamically schedules and executes instructions based on their data dependencies rather than their sequential program order, enabling the overlap of independent operations to maximize the utilization of multiple execution units. This approach, pioneered by Tomasulo's algorithm, uses reservation stations to hold instructions until their operands are available, allowing subsequent non-dependent instructions to proceed without waiting for preceding ones that may be stalled on long-latency operations or unresolved branches. At its core, out-of-order execution exploits instruction-level parallelism (ILP), which measures the potential for concurrent execution of instructions within a program. In contrast to in-order execution, where a stall in one instruction blocks the entire pipeline, out-of-order mechanisms permit non-dependent instructions to advance, thereby hiding latencies from structural hazards, data dependencies, and long-latency memory operations. Key enablers include dynamic scheduling for runtime reordering, speculative execution to execute instructions before branch outcomes are known, and a commitment stage that retires results in original program order to preserve architectural correctness. For example, consider a sequence where an ADD instruction depends on the result of a prior long-latency operation, causing it to wait, while a subsequent LOAD operation is independent and ready; in an out-of-order processor, the LOAD can execute immediately, followed by the ADD once its inputs arrive, thus maintaining throughput. This separation between architectural state (visible registers and memory that must reflect in-order semantics) and microarchitectural state (internal buffers and queues that track reordered execution) ensures precise interrupts, where exceptions are handled as if instructions completed sequentially, allowing reliable recovery without corrupting the program state.
The primary performance benefit arises from reduced pipeline stalls and higher instruction throughput in superscalar processors, which can issue multiple instructions per cycle; conceptual analyses and simulations show that out-of-order execution can achieve 2-3x speedups over in-order counterparts in ILP-bound workloads by more effectively extracting parallelism from typical programs.
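The dataflow view described above, where an instruction's start time is bounded only by its producers, can be captured in a tiny scheduling sketch. The latencies and instruction names below are illustrative assumptions, and the model ignores functional-unit limits so that only data dependences constrain timing.

```python
# Minimal dataflow sketch: each instruction fires once its operands are
# ready, regardless of program order. Latencies are illustrative.
LAT = {"LOAD": 4, "ADD": 1, "MUL": 3}

def ooo_schedule(insts):
    """insts: list of (name, op, srcs) in program order.
    Returns {name: finish_cycle} assuming unlimited functional units."""
    finish = {}
    for name, op, srcs in insts:
        start = max((finish[s] for s in srcs), default=0)
        finish[name] = start + LAT[op]
    return finish

prog = [
    ("ld",  "LOAD", []),        # long-latency load
    ("add", "ADD",  ["ld"]),    # depends on the load
    ("ld2", "LOAD", []),        # independent: starts at cycle 0
]
print(ooo_schedule(prog))   # {'ld': 4, 'add': 5, 'ld2': 4}
```

An in-order machine would hold `ld2` behind the stalled `add`; here it finishes at cycle 4, overlapping entirely with the first load, which is the latency hiding the section describes.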

Historical Development

Early Implementations in Supercomputers

The pioneering implementations of out-of-order execution emerged in the 1960s within supercomputers designed for scientific computing, where high-latency floating-point operations posed significant bottlenecks in numerical simulations. The CDC 6600, introduced in 1964 by Seymour Cray at Control Data Corporation, was the first machine to employ dynamic scheduling through a technique known as scoreboarding, enabling out-of-order execution across its functional units. This approach allowed instructions to issue in order but execute out of sequence when dependencies permitted, addressing the long delays in floating-point arithmetic, such as 10 minor cycles (1,000 ns) for multiplication and 29 cycles for division, by overlapping computations with memory accesses. The scoreboard maintained status flags for 10 independent functional units (including two floating-point multipliers, one adder, and one divider) and detected data hazards by tracking operand availability and unit occupancy, preventing conflicts like double result assignments while permitting up to 10 parallel operations in theory. This marked a key milestone in demonstrating hardware-based dynamic scheduling to hide latency in custom architectures tailored for high-performance numerical workloads, such as hydrodynamic simulations. Building on this foundation, the IBM System/360 Model 91, released in 1967, incorporated a limited form of out-of-order execution specifically for its floating-point units using Tomasulo's algorithm, which decoupled instruction issue from execution to tolerate long latencies without requiring compiler optimizations. The algorithm utilized reservation stations to buffer operations and a common data bus (CDB) with tag-based broadcasting to resolve dependencies dynamically, allowing up to three add and two multiply/divide stations to proceed when operands were ready, even if prior instructions were stalled on memory accesses or divisions (up to 18 cycles).
For instance, in floating-point-intensive loops common to scientific computing, such as solving partial differential equations, this reduced execution time from 17 cycles to 11 per iteration by enabling concurrent use of multiple arithmetic units despite the system's four floating-point registers and pre-cache memory latencies. Scoreboarding-like dependency detection ensured that only independent operations advanced, providing a robust mechanism to sustain throughput in environments demanding precise control over high-latency operations. Early vector processors extended these concepts implicitly through deep pipelining and operation overlap, facilitating out-of-order-like execution in array-based computations prevalent in supercomputing. The ILLIAC IV, operational in 1972, was designed with 256 processing elements in a SIMD configuration (one 64-element quadrant was ultimately built) and deeply pipelined functional units for arithmetic operations, allowing elements of a vector to stream through arithmetic pipelines while subsequent instructions overlapped, effectively hiding startup latencies in parallel numerical tasks. Similarly, the Cray-1 supercomputer of 1976 advanced this with chaining, where results from one pipelined functional unit (e.g., an adder) could directly feed into another (e.g., a multiplier) without intermediate register writes, enabling dependent operations to commence before prior ones completed and reducing idle cycles in memory-bound simulations. This implicit out-of-order capability tolerated latencies up to 100 cycles with minimal performance loss (less than 6% degradation) in applications such as hydrodynamic codes, underscoring the evolution of latency-hiding techniques in specialized hardware for scientific workloads.

Innovations in Precise Exceptions and Decoupling

One major challenge in early out-of-order execution designs was the handling of imprecise exceptions, where instructions could complete in a non-sequential order, potentially allowing later instructions to finish before earlier ones that triggered traps or interrupts. This disorder complicated program restarts, as the processor might reflect partial execution of subsequent instructions, making it difficult for the operating system to restore a consistent architectural state and resume correctly from the faulting instruction. The concept of precise exceptions addressed this by ensuring that the saved processor state always corresponded exactly to a sequential execution point, with all prior instructions completed and no subsequent ones partially executed. In the 1970s, Henry M. Levy and Richard H. Eckhouse described mechanisms in the VAX architecture that supported precise handling of most interrupts and exceptions through structured vectors and serialization points, enabling reliable recovery without state corruption. Building on this, IBM's experimental 801 minicomputer, completed around 1980, incorporated history buffers to log register changes for rollback, allowing the system to revert to a precise state upon exceptions by undoing speculative updates in reverse order. Decoupling emerged as a complementary innovation, separating the fetch and decode stages from execute and writeback to permit out-of-order issue and completion while maintaining architectural correctness. This allowed instructions to be issued dynamically without strict adherence to completion order, buffering results until safe commitment, as demonstrated in the Multiflow system during the 1980s, where trace scheduling enabled compiler-directed parallelism across decoupled pipeline phases. James E. Smith's 1982 paper formalized decoupling for superscalar processors, introducing access/execute separation to tolerate memory latencies and boost performance through independent pipeline advancement.
Smith also advocated checkpointing, where periodic snapshots of register and memory states facilitate restoration after exceptions, ensuring precise recovery in decoupled designs without full rollback overhead. A notable prototype embodying these ideas was the Aurora VLSI test processor of the 1980s, which featured decoupled pipelines to enhance clock rate and performance in high-speed GaAs technology. By isolating front-end instruction processing from back-end execution, the Aurora design demonstrated improved throughput for vector and scalar workloads, validating decoupling's role in mitigating exception imprecision while supporting out-of-order completion.

Research Maturation and Commercial Adoption

During the 1980s, research efforts shifted toward validating scalable out-of-order execution in practical prototypes, building on earlier decoupling concepts to demonstrate feasibility in superscalar environments. The Berkeley HPSm project (1985) exemplified this maturation, implementing a superscalar processor with out-of-order execution, speculative execution, and in-order commit, achieving a cycle time of 100 ns compared to 330 ns for the contemporary RISC II processor, thus highlighting significant efficiency gains in instruction-level parallelism. Similarly, IBM's Advanced Computer Systems (ACS) project, though initiated in the 1960s and canceled in 1969, influenced 1980s designs through its pioneering dynamic instruction scheduling and out-of-order execution techniques, which informed later scalable implementations despite the project's early termination. The transition to commercial adoption began with limited implementations in the late 1980s, marking the first widespread availability of out-of-order capabilities beyond research labs. The Motorola MC88100 microprocessor, released in 1988, was among the first commercial CPUs to incorporate limited out-of-order execution, using scoreboarded pipelining that allowed out-of-order completion for certain operations to avoid stalls, as described in its programmer's reference manual. Full-scale adoption arrived with the Intel Pentium Pro in 1995, which introduced a comprehensive out-of-order engine using a reorder buffer to handle up to 40 instructions in flight, enabling precise exceptions while maximizing throughput in x86 environments. Drivers for wide adoption included the proliferation of superscalar architectures in both RISC and CISC domains during the 1990s, fueled by performance validations from standardized benchmarks.
For instance, the MIPS R4000 (1991) advanced RISC pipeline designs with deep pipelining that laid groundwork for later out-of-order enhancements, while the HP PA-8000 (1996) represented a milestone in RISC adoption, delivering 11.8 SPECint95 and 20.2 SPECfp95 at 180 MHz through its 56-entry reorder buffer and four-way out-of-order issue. Key events, such as SPEC benchmarks demonstrating 2-3x performance gains from out-of-order over in-order execution in multi-issue processors, and the integration of advanced branch prediction in the DEC Alpha 21264 (1996), which achieved over 30 SPECint95 and 50 SPECfp95 through speculative out-of-order execution supporting 80 instructions in flight, accelerated this trend. By the 2000s, out-of-order execution became standard across architectures, including x86 with the AMD K7 (1999), ARM with early out-of-order support in the ARM1136J(F)-S (2003), and embedded systems; recent extensions appear in RISC-V via SiFive's U8-Series (2019), a superscalar out-of-order core configurable for high-performance applications.

Core Mechanisms

Dispatch and Issue Decoupling

Dispatch refers to the process of fetching instructions from the instruction cache, decoding them, and allocating them to reservation stations or issue queues associated with functional units, typically in program order to maintain front-end flow. This stage ensures that instructions are prepared for potential execution by renaming registers and tracking dependencies, but it does not immediately send them to execution units. Issue decoupling separates the dispatch of instructions from their actual issuance to functional units, allowing instructions to be held in reservation stations until all source operands are available, at which point they can be issued out-of-order to available execution resources. In this mechanism, introduced in Tomasulo's algorithm, reservation stations buffer instructions and operands, enabling the front end to continue dispatching new instructions even if earlier ones are stalled on dependencies. Dependencies are tracked using tags that identify the producing functional unit for each operand; when a result is produced, it is broadcast over a common data bus along with its tag, waking up and updating any waiting dependent instructions in the reservation stations. For example, consider the instruction sequence: LOAD F0 from A (long latency), ADD F4, F2, F6 (independent of F0), MULTIPLY F0, F4. The ADD can issue and execute immediately after dispatch while the LOAD is pending, preventing idle functional units and reducing total execution time compared to in-order execution. This decoupling enables greater exploitation of instruction-level parallelism (ILP) by minimizing stalls in the front-end pipeline, where a single stalled instruction would otherwise halt subsequent dispatches. In Tomasulo's implementation on the IBM System/360 Model 91, it improved performance by approximately 33% on a representative loop, reducing execution cycles from 17 to 11 through better overlap of arithmetic operations. It also supports issuing speculative instructions beyond predicted branches, though recovery mechanisms are required for mispredictions (detailed in other sections).
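The reservation-station and tag-broadcast behavior described above can be sketched with a small Python model. This is a simplification, not Tomasulo's full algorithm: the tag names (`LD1`, `ADD1`, `MUL1`), register values, and the load's result are all illustrative assumptions, and execution is modeled as firing whichever entry becomes ready first rather than cycle by cycle.

```python
# Toy reservation stations: an entry issues when all of its operands
# carry values; a completing instruction broadcasts (tag, value), which
# here simply adds the result to the 'done' map and wakes up waiters.

def run(stations, regfile):
    """stations: dicts with 'tag', 'op', 'srcs' (register or tag names).
    Executes in dataflow order; returns (issue order, result map)."""
    done, order = dict(regfile), []
    pending = list(stations)
    while pending:
        ready = next(s for s in pending
                     if all(src in done for src in s["srcs"]))
        pending.remove(ready)
        result = ready["op"](*(done[s] for s in ready["srcs"]))
        done[ready["tag"]] = result          # CDB broadcast: tag -> value
        order.append(ready["tag"])
    return order, done

regfile = {"F2": 2.0, "F6": 6.0}             # F0 not yet available (load in flight)
stations = [
    {"tag": "ADD1", "op": lambda a, b: a + b, "srcs": ["F2", "F6"]},  # ADD F4,F2,F6
    {"tag": "LD1",  "op": lambda: 5.0,        "srcs": []},            # LOAD F0
    {"tag": "MUL1", "op": lambda a, b: a * b, "srcs": ["LD1", "ADD1"]},  # MUL F0,F4
]
order, vals = run(stations, regfile)
print(order)          # ['ADD1', 'LD1', 'MUL1'] -- the ADD need not wait on the load
```

The MUL's source list names producer tags (`LD1`, `ADD1`) rather than register values, mirroring how a real reservation station holds a tag until the matching broadcast arrives.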

Execution and Writeback Decoupling

In the execution phase of out-of-order processors, instructions complete their computations within functional units as soon as their input dependencies are resolved, often finishing out of program order and producing results that are temporarily stored rather than immediately updating the architectural state. This approach maximizes resource utilization by allowing subsequent instructions to proceed without waiting for unrelated prior operations to finish, thereby overlapping execution latencies. Writeback decoupling ensures that these execution results remain isolated in temporary storage until all preceding instructions in the program order have successfully committed, at which point the results are retired and applied to registers or memory in strict sequential order. This separation prevents premature modifications to the processor's visible state, maintaining the illusion of in-order execution from the programmer's perspective while enabling higher instruction-level parallelism. By holding results until the commit point, the mechanism supports precise handling of exceptions, where the state reflects completion of instructions up to the faulting one without interference from later operations. Restartability is achieved through checkpointing of the architectural state, including registers and memory, which captures a consistent snapshot at key program points for recovery after traps, interrupts, or other disruptions. In cases of mispredictions, instructions executed speculatively beyond the branch can be squashed entirely, rolling back the state to the checkpoint without corrupting earlier committed writes. For instance, if an instruction beyond a predicted branch completes its execution out of order but the branch is later resolved as mispredicted, the processor discards the speculative results and any dependent computations, restoring the pre-branch state seamlessly to resume correct execution.
A primary benefit of this decoupling is the ability to sustain high throughput by exploiting parallelism in computation while guaranteeing precise exceptions, which is essential for reliable software execution and debugging. It integrates with speculative execution by committing only non-speculative results to the architectural state, ensuring that branch outcomes and other uncertainties do not propagate errors until verified. This design builds on the dispatch and issue decoupling that initiates parallelism, and focuses on back-end ordering to enable robust recovery and maintain sequential semantics.
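The commit-versus-squash behavior can be sketched as a two-level state: an architectural register map that only in-order commit may touch, and a speculative buffer that absorbs out-of-order completions. The class and register names below are illustrative, and the sketch collapses the reorder machinery into a simple ordered list.

```python
# Sketch of writeback decoupling: results land in a speculative buffer
# and reach the architectural register file only at in-order commit; on
# a misprediction, everything uncommitted is squashed.

class Backend:
    def __init__(self):
        self.arch = {}            # architectural (committed) registers
        self.buffer = []          # (dest, value) in program order, uncommitted

    def execute(self, dest, value):
        self.buffer.append((dest, value))    # out-of-order completion lands here

    def commit(self, n=1):
        for dest, value in self.buffer[:n]:  # retire oldest first, in order
            self.arch[dest] = value
        del self.buffer[:n]

    def squash(self):
        self.buffer.clear()                  # discard all speculative results

be = Backend()
be.execute("R1", 10)      # pre-branch instruction completes
be.commit()               # R1 becomes architecturally visible
be.execute("R2", 99)      # speculative result beyond a predicted branch
be.squash()               # branch mispredicted: R2 never becomes visible
print(be.arch)            # {'R1': 10}
```

Because `squash` only clears the buffer, committed state such as R1 is never rolled back, which is the property that makes recovery precise.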

Microarchitectural Implementations

Register Renaming

Register renaming addresses false dependencies arising from register reuse in sequential code, specifically write-after-read (WAR) and write-after-write (WAW) hazards, which limit instruction-level parallelism (ILP) despite no true data dependence on prior computations. These false dependencies occur when subsequent instructions overwrite or read a register before the original value is fully utilized, stalling out-of-order execution. The renaming mechanism dynamically maps architectural registers specified in instructions to physical registers during the dispatch stage, using mapping tables to track bindings and enable parallel execution of independent operations. Source operands are resolved by consulting the map to fetch values from the correct physical registers, while destination registers receive new physical allocations from a pool, eliminating name conflicts. Two primary implementation types exist: the physical register file (PRF) approach, where a large dedicated file stores all renamed values and mappings point directly to it (as in the IBM ES/9000 or MIPS R10000), and the unified scheduler approach, where structures like reorder buffers hold results and serve as a combined rename and scheduling mechanism (as in the Intel Pentium Pro). In the PRF style, physical registers are allocated early at rename time; in unified designs, allocation may integrate with execution units for efficiency. A classic example illustrates WAW elimination in a loop: consider code like loop: ADD R1, R1, #1; JMP loop, where each iteration writes to R1, creating WAW hazards that serialize execution. Renaming assigns distinct physical registers per write, e.g., the first ADD to P5 and the second to P6, allowing iterations to proceed in parallel without serialization, as each read obtains the value from the latest mapping. Implementation relies on a free list to track and allocate available physical registers, typically managed as a FIFO queue or pointer structure to ensure rapid dispatch without stalls.
For misprediction recovery, checkpointing saves map table states and free list heads at branch points, enabling quick rollback to prior mappings upon misprediction by restoring the saved configuration in one cycle. Historically, register renaming originated in Robert Tomasulo's 1967 algorithm for the IBM System/360 Model 91, initially applied to floating-point units to tolerate long latencies. It was refined in subsequent designs, such as modern processors employing hundreds of physical registers for integer and floating-point operations. By providing more physical registers than architectural ones, renaming expands the effective ILP window, for instance enabling execution windows to grow from a few instructions to 128 or more, thus sustaining higher throughput in out-of-order pipelines. This integrates into the broader out-of-order flow by resolving dependencies early, facilitating subsequent scheduling.
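The map-table-plus-free-list mechanism can be sketched directly, using the P5/P6 allocations from the loop example above. The initial binding `R1 -> P1` and the physical register names are assumptions for illustration; a real renamer would also reclaim physical registers at retirement.

```python
# Sketch of register renaming: a map table binds architectural names to
# physical registers drawn from a free list, so repeated writes to R1
# get distinct physical destinations and the WAW hazard disappears.
from collections import deque

map_table = {"R1": "P1"}               # current architectural -> physical binding
free_list = deque(["P5", "P6", "P7"])  # available physical registers

def rename(dest, srcs):
    read = [map_table[s] for s in srcs]    # resolve sources via the current map
    new = free_list.popleft()              # allocate a fresh destination
    map_table[dest] = new                  # later readers see the new binding
    return new, read

# Two iterations of: ADD R1, R1, #1
print(rename("R1", ["R1"]))   # ('P5', ['P1'])  first write renamed to P5
print(rename("R1", ["R1"]))   # ('P6', ['P5'])  second write renamed to P6
```

Each iteration reads the physical register written by the previous one (the true RAW chain) while its write targets a fresh register, so the two writes to R1 no longer conflict.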

Reorder Buffers and Completion Logic

The reorder buffer (ROB) is a hardware structure that holds instructions from the point of dispatch until retirement, storing their results, destination registers, and completion status to enable out-of-order execution while ensuring in-order commitment to the architectural state. Introduced in early pipelined processor designs to support precise interrupts, the ROB allows instructions to complete execution out of program order but buffers them until all prior instructions have finished, preventing speculative results from corrupting the processor state. Completion logic in the ROB relies on head and tail pointers to maintain program order. The tail pointer indicates the position where new instructions are allocated upon dispatch, forming a circular queue that advances as instructions are dispatched. The head pointer tracks the oldest instruction; an instruction at the head retires only when it has completed execution (i.e., its result is available and all dependencies are resolved) and is confirmed non-speculative, such as after branch resolution. This mechanism ensures sequential retirement, with multiple instructions potentially retiring per cycle if consecutive head entries are ready. For exception handling, the ROB stalls retirement on traps or interrupts until the buffer drains to the faulting instruction, allowing precise state restoration by committing only instructions before the exception. On branch mispredictions, the tail pointer is rewound to discard all speculative instructions beyond the correct path, freeing ROB entries and preventing incorrect results from retiring. This flushing preserves architectural correctness but incurs recovery latency proportional to the speculation window. Consider a simple four-instruction sequence: ADD R1, R2, R3; LOAD R4, [R5]; SUB R6, R7, R8; MUL R9, R1, R4. Despite out-of-order completion (e.g., SUB finishes first, followed by ADD, LOAD, then MUL), the ROB ensures retirement in original order by holding results until the head advances.
ROB Entry | Instruction    | Dispatch Order | Completion Order | Status | Retire Order
0 (Head)  | ADD R1, R2, R3 | 1              | 2                | Ready  | 1
1         | LOAD R4, [R5]  | 2              | 3                | Ready  | 2
2         | SUB R6, R7, R8 | 3              | 1                | Ready  | 3
3 (Tail)  | MUL R9, R1, R4 | 4              | 4                | Ready  | 4
In this example, SUB's early completion is buffered until ADD and LOAD retire first, demonstrating how the ROB decouples execution from commitment. Design variants include unified designs, where the buffer stores both status and results (as in early implementations), versus separate structures paired with a physical register file (PRF) that holds results while the ROB tracks only tags and status for retirement. In the Pentium 4, for instance, the ROB comprises 126 entries focused on status tracking, integrated with a separate 128-entry PRF to manage renamed values without duplicating storage. ROB size directly influences the speculation window; larger buffers, like the 126 entries in the Pentium 4, allow tracking more in-flight instructions for higher parallelism but limit throughput if full. Larger ROBs enable greater speculation depth and performance gains from out-of-order execution but impose power and area penalties due to increased storage, pointer management, and wake-up logic complexity. Dynamic resizing techniques can mitigate these trade-offs by scaling ROB capacity based on workload, reducing energy dissipation in low-ILP scenarios while preserving peak performance.
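The head-pointer retirement rule from the table can be sketched with a queue whose entries record only a name and a completion flag. Modeling the ROB as a deque (the head is the front) is a simplification of the circular head/tail pointer scheme, and the instruction names match the four-instruction example.

```python
# Sketch of in-order retirement: completion may happen in any order, but
# an entry retires only once it reaches the head and is marked complete.
from collections import deque

rob = deque([["ADD", False], ["LOAD", False], ["SUB", False], ["MUL", False]])

def complete(name):
    for entry in rob:               # out-of-order completion: flip the flag
        if entry[0] == name:
            entry[1] = True

def retire():
    retired = []
    while rob and rob[0][1]:        # the head must be complete to retire
        retired.append(rob.popleft()[0])
    return retired

complete("SUB")
print(retire())          # [] -- SUB is done but not at the head
complete("ADD"); complete("LOAD")
print(retire())          # ['ADD', 'LOAD', 'SUB'] -- retired in program order
```

SUB's result sits buffered in the ROB until ADD and LOAD retire ahead of it, reproducing the dispatch-versus-retire ordering shown in the table.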

Instruction Scheduling and Resource Allocation

In out-of-order execution processors, instruction scheduling relies on dynamic structures such as issue queues or reservation stations to hold renamed instructions after dispatch, allowing them to wait until their dependencies are resolved and execution resources become available. These structures prioritize instructions based on factors like age (to preserve program order where possible) and dependency readiness, enabling the processor to issue operations out-of-order while minimizing stalls. Originating from Robert Tomasulo's 1967 algorithm for the IBM System/360 Model 91, reservation stations buffer instructions near functional units, tagging operands with their producing sources rather than immediate values to facilitate data forwarding. The core of dynamic scheduling is the wakeup-select mechanism, which dynamically identifies and dispatches ready instructions to execution units. In the wakeup phase, when an instruction completes execution, its result is broadcast on a common bus or its result tags are compared against pending operands in the scheduler; matching dependents are marked ready by updating their tags. The select phase then employs priority encoders or arbiters to choose instructions for issue, often favoring the oldest ready instruction to reduce stalls and improve overall throughput. This mechanism, refined in modern implementations, supports back-to-back execution of dependent chains by allowing immediate wakeup and selection in the same cycle. Resource allocation during scheduling tracks the availability of functional units such as arithmetic logic units (ALUs), floating-point units, and address generation units (AGUs), as well as associated ports and pipelines. Schedulers maintain status tables or counters for these resources, allocating them only to ready instructions that match the unit's capabilities, while separate queues handle memory operations like loads and stores to avoid contention with integer or floating-point execution.
For instance, in a typical design, a scheduler might issue up to four integer operations per cycle across multiple ALU ports while queuing memory instructions in a dedicated load/store queue, ensuring balanced utilization without overcommitting resources. Advanced scheduling incorporates out-of-order load/store disambiguation through mechanisms like store-forwarding, where loads check against pending stores in the store queue for address matches to forward data directly, bypassing memory access if possible. This requires content-addressable memory (CAM) structures in the load/store queues to compare addresses efficiently, resolving potential anti-dependencies and allowing loads to execute speculatively ahead of stores. Such features enhance memory-level parallelism but demand precise address speculation to avoid misforwards that could lead to recovery overhead. Trade-offs in scheduler design balance queue depth, which increases the window of exploitable parallelism, against added latency in wakeup and selection due to larger structures. Deeper queues enable holding more instructions for opportunistic issue but raise power and area costs from expanded arrays; modern processors like AMD's Zen architecture mitigate this with distributed schedulers, six queues of 14 entries each supporting a 6-wide issue width, optimizing for low latency while scaling throughput.
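The store-forwarding check described above can be sketched as a search of the store queue. The addresses, values, and dictionary-backed memory below are illustrative assumptions; the linear scan stands in for the CAM address match, and the youngest-first direction of the scan implements the rule that the most recent older store to the address supplies the data.

```python
# Sketch of store-to-load forwarding: a load searches the store queue
# (youngest first) for a pending store to the same address and takes
# its data directly, skipping the cache/memory access.

def forward_or_load(addr, store_queue, memory):
    # store_queue: pending stores in program order, oldest first,
    # as (address, value) pairs.
    for st_addr, st_val in reversed(store_queue):  # youngest match wins
        if st_addr == addr:
            return st_val, "forwarded"
    return memory.get(addr, 0), "from memory"

memory = {0x100: 7}
store_queue = [(0x200, 1), (0x100, 42)]    # pending, not yet committed
print(forward_or_load(0x100, store_queue, memory))  # (42, 'forwarded')
print(forward_or_load(0x300, store_queue, memory))  # (0, 'from memory')
```

Note the load to 0x100 receives 42 from the uncommitted store rather than the stale 7 in memory; a real pipeline must also verify such matches once store addresses are fully resolved, triggering recovery on a misforward.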

  18. [18]
    [PDF] Implementing out-of-order execution processors - UCSD CSE
    Feb 11, 2010 · Historical Context: IBM 360/91. • Tomasulo's Algorithm implemented for the Floating. Point operations. • It had only 2 functional units: 1 ...
  19. [19]
    The IBM ACS Project - ResearchGate
    Apr 6, 2025 · Although the project was canceled, it brought many talented engineers to California and contributed to several later developments at IBM and ...
  20. [20]
    [PDF] The HP PA-8000 RISC CPU - Hot Chips
    Aug 19, 1996 · Completely redesigned core/new microarchitecture. 56 Entry Instruction Reorder Buffer (IRB). Peak execution rate of 4 instructions/cycle.Missing: gains | Show results with:gains
  21. [21]
    [PDF] The Alpha 21264 Microprocessor: Out-of-Order Execution at 600 MHz
    Continued Alpha performance leadership. 600 MHz operation in 0.35u CMOS6, 6 metal layers, 2.2V. 15 Million transistors, 3.1 cm2, 587 pin PGA.Missing: ARM Cortex- A8 SiFive U74
  22. [22]
    Incredibly Scalable High-Performance RISC-V Core IP - SiFive
    Oct 24, 2019 · The SiFive U8-Series is the highest performance RISC-V ISA based Core IP available today, based on a superscalar out-of-order pipeline with configurable ...Missing: execution commercial 2018
  23. [23]
    Memory access ordering - an introduction - Arm Developer
    Sep 11, 2013 · The first Arm processor to support out-of-order execution was the Arm1136J(F)-S, which permitted non-dependent load and store operations to ...Missing: commercial 3x
  24. [24]
    [PDF] Description of Tomasulo Based Machine - UCSD CSE
    There are 4 basic stages to Tomasulo's Algorithm: 1. Dispatch (D): An instruction proceeds from dispatch to issue when it reaches the front of the instruction ...
  25. [25]
    [PDF] The design space of register renaming techniques
    Register renaming is a technique to remove false data dependencies—write after read (WAR) and write after write (WAW)— that occur in straight line code ...
  26. [26]
    Out of Order Execution and Register Renaming - UAF CS
    You can eliminate the "fake" dependencies WAW and WAR using register renaming (also called Tomasulo's Algorithm): a typical implementation is to hand out a new ...Missing: development | Show results with:development
  27. [27]
    [PDF] Dynamic Register Renaming Through Virtual-Physical Registers
    Register renaming is a key issue for the performance of out-of-order execution processors and therefore, it is extensively used.
  28. [28]
  29. [29]
    Measuring Reorder Buffer Capacity - Blog - Henry Wong
    May 14, 2013 · The Pentium 4 Northwood appears to have 115 speculative PRF entries. Since the PRF size is expected to be 128 entries, that means 13 ...
  30. [30]
    [PDF] The Microarchitecture of the Pentium 4 Processor - Washington
    The Allocator allocates a Reorder Buffer. (ROB) entry, which tracks the completion status of one of the 126 uops that could be in flight simultaneously in ...
  31. [31]
    An Efficient Algorithm for Exploiting Multiple Arithmetic Units
    An Efficient Algorithm for Exploiting Multiple Arithmetic Units. Abstract: This paper describes the methods employed in the floating-point area of the System/ ...
  32. [32]
    [PDF] On Pipelining Dynamic Instruction Scheduling Logic
    Each reservation station entry (RSE) has wakeup logic that wakes up any instruction stored in it. The select logic chooses instructions for execution from the ...
  33. [33]
    [PDF] Direct Instruction Wakeup for Out-Of-Order Processors
    The select logic chooses a subset of instructions flagged by the wakeup logic for execution. The operands may come from a register file or from a previously ...
  34. [34]
    Zen - Microarchitectures - AMD - WikiChip
    Instead of a large scheduler, Zen has 6 distributed scheduling queues, each 14 entries deep (4xALU, 2xAGU). Zen includes a number of enhancements such as ...
  35. [35]
    [PDF] Dynamic Memory Disambiguation in the Presence of Out-of-order ...
    Our scheme handles the problem of false memory dependencies effectively even when the predictor relies only on load/store program counter values and the store ...
  36. [36]
    Address-indexed memory disambiguation and store-to-load ...
    This paper describes a scalable, low-complexity alternative to the conventional load/store queue (LSQ) for superscalar processors that execute load and store ...