Out-of-order execution
Out-of-order execution is a computer architecture technique that allows a processor to execute instructions in a sequence different from their original program order, provided that data dependencies and other constraints are satisfied, thereby improving performance by reducing idle time in execution units and exploiting instruction-level parallelism.[1][2]
This approach originated in the 1960s as part of efforts to overcome limitations in early pipelined processors, with the IBM System/360 Model 91, announced in 1966, becoming the first commercial machine to implement it using Tomasulo's algorithm for floating-point operations.[3][2] Tomasulo's algorithm, developed by Robert M. Tomasulo, employs reservation stations to buffer instructions and operands, register renaming to eliminate false dependencies, and a common data bus to broadcast results, enabling dynamic scheduling. Later implementations added mechanisms like reorder buffers to maintain precise exceptions through in-order retirement.[2][1]
In modern processors, out-of-order execution has become a cornerstone of high-performance computing, featured in designs from Intel (starting with the Pentium Pro in 1995), IBM Power series, and ARM-based chips like those in Apple Silicon, where it sustains instruction throughput despite long-latency operations such as cache misses.[2][4] Key benefits include tolerance for data hazards—such as read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW)—and enhanced utilization of superscalar pipelines with multiple functional units, though it introduces complexity in hardware for dependency tracking and power consumption.[1][4] Despite challenges like vulnerability to speculative execution attacks in contemporary systems, out-of-order execution remains essential for achieving high instructions per cycle (IPC) in general-purpose CPUs.[4]
Basic Concepts
In-Order Execution
In-order execution refers to the traditional model in which a processor processes instructions strictly in the sequential order specified by the program, ensuring that each instruction completes its fetch, decode, execute, memory access, and writeback stages before the next one advances fully through the pipeline.[5] This approach maintains program semantics by preserving the original instruction sequence, which simplifies exception handling since interrupts and faults occur in a predictable order corresponding to the source code.[5] However, it inherently limits the exploitation of instruction-level parallelism (ILP) by requiring all prior instructions to complete before subsequent ones can proceed, even if later instructions are independent.[5]
A classic implementation of in-order execution is the five-stage RISC pipeline, which divides instruction processing into distinct phases to overlap execution where possible while adhering to program order.[5] The stages are: (1) Instruction Fetch (IF), where the processor retrieves the instruction from memory using the program counter (PC); (2) Instruction Decode/Register Fetch (ID), where the instruction is decoded and source registers are read; (3) Execute/Effective Address (EX), where the ALU performs computations or calculates memory addresses; (4) Memory Access (MEM), where data is read from or written to memory if required (e.g., for loads or stores); and (5) Writeback (WB), where results are written back to the destination register.[5] In this pipeline, each stage takes one clock cycle under ideal conditions, aiming for a throughput of one instruction per cycle, but hazards disrupt this flow by forcing stalls.[5]
Pipeline hazards in in-order execution arise primarily from three sources: structural hazards, data hazards, and control hazards, all of which cause the pipeline to stall until the conflict is resolved.[5] Structural hazards occur when hardware resources are insufficient to handle simultaneous demands from multiple instructions, such as a single memory port being accessed for both instruction fetch and data operations in the same cycle, leading to a one-cycle stall.[5] Data hazards, more common in practice, happen when an instruction depends on the result of a prior instruction that has not yet completed, and are categorized as read-after-write (RAW), write-after-read (WAR), or write-after-write (WAW).[5] A typical RAW hazard is a load-use dependency, where an instruction immediately following a load requires the loaded data; for example, in the sequence LW R1, 0(R2) (load word into R1) followed by ADD R3, R1, R4 (R3 = R1 + R4), the ADD must stall for one cycle until the load completes its MEM stage, with the data forwarded for use in the ADD's EX stage.[5]
Control hazards occur due to conditional branches or jumps, where the next instruction's address depends on a condition resolved in the EX stage. Without prediction, the pipeline fetches instructions sequentially after the branch, potentially flushing 2-3 incorrectly fetched instructions if the branch is taken, incurring a penalty of several cycles.[5]
These hazards impose significant performance limitations by blocking independent instructions behind dependent ones, reducing effective throughput and increasing cycles per instruction (CPI) beyond the ideal value of 1.[5] Consider a simple assembly code snippet:
LW R1, 0(R2) # Load word into R1
SUB R4, R1, R5 # Dependent: R4 = R1 - R5 (RAW hazard on R1)
LW R6, 0(R7) # Independent load into R6
MUL R8, R9, R10 # Independent multiply R8 = R9 * R10
In an in-order pipeline with forwarding, the SUB instruction stalls the pipeline for one cycle due to the load-use hazard on R1, forcing the subsequent LW and MUL to wait idly despite their independence from the load, resulting in lost ILP opportunities and a CPI exceeding 1 for such sequences.[5] Out-of-order execution addresses these limitations by dynamically reordering independent instructions to proceed during stalls, without altering program order for completion.[5]
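The cost of such stalls can be made concrete with a small model. The following Python sketch is illustrative only: the encoding of each instruction as a (destination, sources, is_load) tuple and the single-cycle load-use penalty are simplifying assumptions. It counts cycles for the snippet above in a five-stage in-order pipeline with forwarding.

# Minimal in-order pipeline model with forwarding: the only stall modeled is
# the one-cycle load-use bubble of the classic five-stage RISC pipeline.
program = [
    ("R1", ("R2",), True),           # LW  R1, 0(R2)
    ("R4", ("R1", "R5"), False),     # SUB R4, R1, R5  (load-use on R1)
    ("R6", ("R7",), True),           # LW  R6, 0(R7)
    ("R8", ("R9", "R10"), False),    # MUL R8, R9, R10
]

def in_order_cycles(prog, stages=5):
    cycles = stages + len(prog) - 1              # ideal overlapped pipeline
    for prev, curr in zip(prog, prog[1:]):
        prev_dest, _, prev_is_load = prev
        _, curr_srcs, _ = curr
        if prev_is_load and prev_dest in curr_srcs:
            cycles += 1                          # one bubble per load-use hazard
    return cycles

print(in_order_cycles(program))                  # 9 cycles instead of the ideal 8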
Out-of-Order Execution
Out-of-order execution is a technique in which a processor dynamically schedules and executes instructions based on their data dependencies rather than their sequential program order, enabling the overlap of independent operations to maximize the utilization of multiple execution units. This approach, pioneered by Tomasulo's algorithm, uses reservation stations to hold instructions until their operands are available, allowing subsequent non-dependent instructions to proceed without waiting for preceding ones that may be stalled on long-latency operations or unresolved branches.[6]
At its core, out-of-order execution exploits instruction-level parallelism (ILP), which measures the potential for concurrent execution of instructions within a program. In contrast to in-order execution, where a stall in one instruction blocks the entire pipeline, out-of-order mechanisms permit non-dependent instructions to advance, thereby hiding latencies from data hazards, control dependencies, and resource contention. Key enablers include dynamic instruction scheduling for runtime reordering, speculation to execute instructions before branch outcomes are known, and a commitment stage that retires results in original program order to preserve architectural correctness.[6]
For example, consider a sequence where an ADD instruction depends on a prior branch resolution, causing it to stall, while a subsequent LOAD operation is independent and ready; in an out-of-order processor, the LOAD can execute immediately, followed by the ADD once its inputs arrive, thus maintaining pipeline throughput. This separation between architectural state—visible registers and memory that must reflect in-order semantics—and microarchitectural state—internal buffers and queues that track reordered execution—ensures precise interrupts, where exceptions are handled as if instructions completed sequentially, allowing reliable recovery without corrupting the program state.[7]
The primary performance benefit arises from reduced pipeline stalls and higher instruction throughput in superscalar processors, which can issue multiple instructions per cycle; conceptual analyses and simulations show that out-of-order execution can achieve 2-3x speedup over in-order counterparts in ILP-bound workloads by more effectively extracting parallelism from typical programs.[8]
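A rough sense of where such speedups come from can be obtained by comparing a strictly serial schedule with a dataflow-limited one for a small dependency graph. The Python sketch below is a toy model rather than a measurement; the instruction names, latencies, and the assumption of unlimited functional units are all hypothetical.

# Toy ILP model: compare a strictly serial schedule against the dataflow-
# limited critical path, assuming unlimited functional units and perfect
# scheduling. Each entry: name -> (latency_in_cycles, list_of_producers).
instrs = {
    "loadA": (4, []),
    "loadB": (4, []),
    "addC":  (1, ["loadA", "loadB"]),
    "mulD":  (3, ["addC"]),
    "loadE": (4, []),            # independent of the chain above
    "addF":  (1, ["loadE"]),
}

serial = sum(lat for lat, _ in instrs.values())     # in-order, no overlap

finish = {}
def finish_time(name):
    if name not in finish:
        lat, deps = instrs[name]
        finish[name] = lat + max((finish_time(d) for d in deps), default=0)
    return finish[name]

dataflow = max(finish_time(n) for n in instrs)      # critical-path length
print(serial, dataflow)     # 17 8: the independent loadE/addF chain overlaps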
Historical Development
Early Implementations in Supercomputers
The pioneering implementations of out-of-order execution emerged in the 1960s within supercomputers designed for scientific computing, where high-latency floating-point operations posed significant bottlenecks in numerical simulations. The CDC 6600, introduced in 1964 by Seymour Cray at Control Data Corporation, was the first machine to employ dynamic scheduling through a technique known as scoreboarding, enabling out-of-order execution across its functional units.[9] This approach allowed instructions to issue in order but execute out of sequence when dependencies permitted, addressing the long delays in floating-point arithmetic—such as 10 minor cycles (1,000 ns) for multiplication and 29 cycles for division—by overlapping computations with memory accesses.[9] The scoreboard maintained status flags for 10 independent functional units (including two floating-point multipliers, one adder, and one divider) and detected data hazards by tracking operand availability and unit occupancy, preventing conflicts like double result assignments while permitting up to 10 parallel operations in theory.[9] This marked a key milestone in demonstrating hardware-based dynamic scheduling to hide latency in custom architectures tailored for high-performance numerical workloads, such as hydrodynamic simulations.
Building on this foundation, the IBM System/360 Model 91, released in 1967, incorporated a limited form of out-of-order execution specifically for its floating-point units using Tomasulo's algorithm, which decoupled instruction issue from execution to tolerate long latencies without requiring compiler optimizations.[6] The algorithm utilized reservation stations to buffer operations and a common data bus (CDB) with tag-based broadcasting to resolve dependencies dynamically, allowing up to three add and two multiply/divide stations to proceed when operands were ready, even if prior instructions were stalled on memory accesses or divisions (up to 18 cycles).[6] For instance, in floating-point-intensive loops common to scientific computing, such as solving partial differential equations, this reduced execution time from 17 cycles to 11 per iteration by enabling concurrent use of multiple arithmetic units despite the system's four floating-point registers and pre-cache memory latencies.[6] Tag-based dependency tracking ensured that only operations with available operands advanced, providing a robust mechanism to sustain throughput in environments demanding precise control over high-latency operations.
Early vector processors extended these concepts implicitly through pipelining and chaining, facilitating out-of-order-like overlap in array-based computations prevalent in 1970s supercomputing. The ILLIAC IV, operational in 1972, featured 256 processing elements in a SIMD array configuration with deeply pipelined functional units for vector operations, allowing elements of a vector to stream through arithmetic pipelines while subsequent instructions overlapped, effectively hiding startup latencies in parallel numerical tasks like fluid dynamics modeling.[10] Similarly, the Cray-1 supercomputer of 1976 advanced this with vector chaining, where results from one pipelined functional unit (e.g., an adder with 1/2 cycle throughput) could directly feed into another (e.g., a multiplier) without intermediate register writes, enabling dependent vector operations to commence before prior ones completed and reducing idle cycles in memory-bound simulations.[11] This implicit out-of-order capability tolerated latencies up to 100 cycles with minimal performance loss (less than 6% degradation) in applications such as hydrodynamic codes, underscoring the evolution of latency-hiding techniques in specialized hardware for scientific workloads.[11]
Innovations in Precise Exceptions and Decoupling
One major challenge in early out-of-order execution designs was the handling of imprecise exceptions, where instructions could complete in a non-sequential order, potentially allowing later instructions to finish before earlier ones that triggered traps or interrupts. This disorder complicated program restarts, as the processor state might reflect partial execution of subsequent instructions, making it difficult for the operating system to restore a consistent architectural state and resume correctly from the faulting instruction.[12]
The concept of precise exceptions addressed this by ensuring that the saved processor state always corresponded exactly to a sequential execution point, with all prior instructions completed and no subsequent ones partially executed. In the 1970s, Henry M. Levy and Richard H. Eckhouse described mechanisms in the VAX architecture that supported precise handling of most interrupts and exceptions through structured vectors and serialization points, enabling reliable recovery without state corruption. Building on this, IBM's 801 experimental RISC processor, completed around 1980, incorporated history buffers to log changes for rollback, allowing the system to revert to a precise state upon exceptions by undoing speculative updates in reverse order.[13][14]
Decoupling emerged as a complementary innovation, separating the fetch and decode stages from execute and writeback to permit out-of-order issue and completion while maintaining architectural correctness. This allowed instructions to be issued dynamically without strict adherence to completion order, buffering results until safe commitment, as demonstrated in the Multiflow TRACE system during the 1980s, where trace scheduling enabled compiler-directed parallelism across decoupled pipeline phases. James E. Smith's 1982 paper formalized decoupling for superscalar processors, introducing access/execute separation to tolerate memory latencies and boost instruction-level parallelism through independent pipeline advancement. Smith also advocated checkpointing, where periodic snapshots of register and memory states facilitate restoration after exceptions, ensuring precise recovery in decoupled designs without full rollback overhead.[15][16]
A notable prototype embodying these ideas was the Aurora VLSI Test Processor in the 1980s, developed by Honeywell, which featured decoupled pipelines to enhance fault tolerance and performance in high-speed GaAs technology. By isolating front-end instruction processing from back-end execution, the Aurora design demonstrated improved throughput for vector and scalar workloads, validating decoupling's role in mitigating exception imprecision while supporting out-of-order speculation.[17]
Research Maturation and Commercial Adoption
During the 1980s, research efforts shifted toward validating scalable out-of-order execution in practical prototypes, building on earlier decoupling concepts to demonstrate feasibility in superscalar environments. The Berkeley HPSm project (1985) exemplified this maturation, implementing a superscalar processor with out-of-order execution, speculative execution, and in-order commit, achieving a cycle time of 100 ns compared to 330 ns for the contemporary RISC II processor, thus highlighting significant efficiency gains in instruction-level parallelism.[18] Similarly, IBM's Advanced Computer Systems (ACS) project, though initiated in the 1960s and canceled in 1969, influenced 1980s designs through its pioneering dynamic instruction scheduling and out-of-order execution techniques, which informed later scalable implementations despite the project's early termination.[19]
The transition to commercial adoption began with limited implementations in the late 1980s, marking the first widespread availability of out-of-order capabilities beyond research labs and mainframes. The Intel i960, introduced in 1988, was among the first commercial microprocessors to incorporate limited out-of-order execution, featuring superscalar pipelining that allowed multiple instructions per cycle and out-of-order completion for certain operations to avoid stalls, as described in its programmer's reference manual. Full-scale adoption arrived with the Intel Pentium Pro in 1995, which introduced a comprehensive out-of-order engine using a reorder buffer to handle up to 40 instructions in flight, enabling precise exception handling while maximizing throughput in x86 environments.[18]
Drivers for wide adoption included the proliferation of superscalar architectures in both RISC and CISC domains during the 1990s, fueled by performance validations from standardized benchmarks. For instance, the MIPS R4000 (1991) advanced RISC designs with deep superpipelining that laid groundwork for later out-of-order enhancements, while the HP PA-8000 (1996) represented a milestone in RISC adoption, delivering 11.8 SPECint95 and 20.2 SPECfp95 at 180 MHz through its 56-entry reorder buffer and four-way out-of-order issue.[20] Key events, such as SPEC benchmarks demonstrating 2-3x performance gains from out-of-order over in-order execution in multi-issue processors, and the integration of advanced branch prediction in the Alpha 21264 (announced in 1996)—which achieved over 30 SPECint95 and 50 SPECfp95 through speculative out-of-order execution supporting 80 instructions in flight—accelerated this trend.[21] By the 2000s, out-of-order execution became standard across architectures, including x86 with the AMD Athlon K7 (1999), ARM with early out-of-order completion support in the ARM1136J(F)-S (2003), and embedded systems; recent extensions appear in RISC-V via SiFive's U8-Series (2019), a superscalar out-of-order core configurable for high-performance applications.[22][23]
Core Mechanisms
Dispatch and Issue Decoupling
Dispatch refers to the process of fetching instructions from the instruction cache, decoding them, and allocating them to reservation stations or issue queues associated with functional units, typically in program order to maintain front-end pipeline flow.[18] This stage ensures that instructions are prepared for potential execution by renaming registers and tracking dependencies, but it does not immediately send them to execution hardware.[24]
Issue decoupling separates the dispatch of instructions from their actual issuance to functional units, allowing instructions to be held in reservation stations until all source operands are available, at which point they can be issued out-of-order to available execution resources.[6] In this mechanism, introduced in Tomasulo's algorithm, reservation stations buffer instructions and operands, enabling the pipeline to continue dispatching new instructions even if earlier ones are stalled on dependencies.[6] Dependencies are tracked using tags that identify the producing functional unit for each operand; when a result is produced, it is broadcast over a common data bus along with its tag, waking up and updating any waiting dependent instructions in the reservation stations.[6]
For example, consider the instruction sequence: a LOAD of F0 from memory address A (long latency), an ADD F4, F2, F6 that is independent of the load, and a MUL F8, F0, F4 that consumes the loaded value. The ADD can issue and execute immediately after dispatch while the LOAD is pending, preventing idle functional units and reducing total execution time compared to in-order execution.[18]
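A minimal sketch of this tag-and-broadcast behavior follows. It is a didactic model of reservation stations rather than the Model 91 hardware, and the station layout and operand encoding are assumptions chosen for brevity.

# Didactic reservation-station model: each station holds an opcode plus source
# operands, where an operand is either ("val", value) or ("tag", producer).
# A broadcast on the common data bus wakes up stations waiting on that tag.
stations = {
    "LD1":  {"op": "LOAD", "src": [("val", 0xA)]},                  # address A known
    "ADD1": {"op": "ADD",  "src": [("val", 2.0), ("val", 6.0)]},    # operands ready
    "MUL1": {"op": "MUL",  "src": [("tag", "LD1"), ("tag", "ADD1")]},
}

def ready(station):
    return all(kind == "val" for kind, _ in station["src"])

def broadcast(tag, value):
    # Common data bus: every station compares the broadcast tag and, on a
    # match, replaces the tag with the produced value.
    for st in stations.values():
        st["src"] = [("val", value) if (kind == "tag" and src == tag) else (kind, src)
                     for kind, src in st["src"]]

# The ADD is ready at once and can issue while the LOAD still waits on memory.
print([name for name, st in stations.items() if ready(st)])    # ['LD1', 'ADD1']
broadcast("ADD1", 8.0)      # the ADD completes first and broadcasts F4
broadcast("LD1", 1.5)       # the long-latency LOAD completes later with F0
print(ready(stations["MUL1"]))                                 # True: MUL can now issue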
This decoupling enables greater exploitation of instruction-level parallelism (ILP) by minimizing head-of-line blocking in the front-end pipeline, where a single stalled instruction would otherwise halt subsequent dispatches.[18] In Tomasulo's implementation on the IBM System/360 Model 91, it reduced the execution time of a representative loop from 17 cycles to 11 per iteration (roughly a 35% reduction) through better overlap of arithmetic operations.[6] It also supports issuing speculative instructions beyond predicted branches, though recovery mechanisms are required for mispredictions (detailed in other sections).[18]
Execution and Writeback Decoupling
In the execution phase of out-of-order processors, instructions complete their computations within functional units as soon as their input dependencies are resolved, often finishing out of program order and producing results that are temporarily stored rather than immediately updating the architectural state.[7] This approach maximizes resource utilization by allowing subsequent instructions to proceed without waiting for unrelated prior operations to finish, thereby overlapping execution latencies.[7]
Writeback decoupling ensures that these execution results remain isolated in temporary storage until all preceding instructions in the program order have successfully committed, at which point the results are retired and applied to registers or memory in strict sequential order.[7] This separation prevents premature modifications to the processor's visible state, maintaining the illusion of in-order execution from the programmer's perspective while enabling higher instruction-level parallelism.[7] By holding results until the commit point, the mechanism supports precise handling of exceptions, where the state reflects completion of instructions up to the faulting one without interference from later operations.[7]
Restartability is achieved through checkpointing of the architectural state, including registers and memory, which captures a consistent snapshot at key program points for recovery after traps, interrupts, or other disruptions.[7] In cases of branch mispredictions, instructions executed speculatively beyond the branch can be squashed entirely, rolling back the state to the checkpoint without corrupting earlier committed writes.[25] For instance, if a branch instruction completes its execution out of order but is later resolved as mispredicted, the processor discards the speculative results and any dependent computations, restoring the pre-branch state seamlessly to resume correct execution.[25]
A primary benefit of this decoupling is the ability to sustain high throughput by exploiting parallelism in computation while guaranteeing precise exceptions, which is essential for reliable software execution and debugging.[7] It integrates with speculative execution by committing only non-speculative results to the architectural state, ensuring that branch outcomes and other uncertainties do not propagate errors until verified.[25] This design, building briefly on prior dispatch and issue decoupling for initiating parallelism, focuses on back-end ordering to enable robust recovery and maintain sequential semantics.[7]
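The effect of holding results until commit can be sketched as follows. The buffer layout and instruction encoding are hypothetical, and real processors squash wrong-path work by manipulating pointers rather than by scanning a list.

# Minimal model of writeback decoupling: completed results sit in an ordered
# buffer and are applied to the architectural registers only at commit time.
# Entries younger than a mispredicted branch are discarded (squashed).
arch_regs = {"R1": 0, "R2": 0, "R3": 0}

# In-flight results in program order: (destination, value, is_mispredicted_branch).
inflight = [
    ("R1", 10, False),       # completed out of order, waiting to commit
    (None, None, True),      # branch resolved as mispredicted
    ("R2", 99, False),       # wrong-path result, must never become visible
    ("R3", 7, False),        # wrong-path result, must never become visible
]

def commit(buffer, regs):
    for dest, value, mispredicted in buffer:
        if mispredicted:
            break                    # squash everything younger than the bad branch
        if dest is not None:
            regs[dest] = value       # architectural state is updated in order
    return regs

print(commit(inflight, arch_regs))   # {'R1': 10, 'R2': 0, 'R3': 0}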
Microarchitectural Implementations
Register Renaming
Register renaming addresses false dependencies arising from register reuse in sequential code, specifically write-after-read (WAR) and write-after-write (WAW) hazards, which limit instruction-level parallelism (ILP) despite no true data dependence on prior computations.[26] These false dependencies occur when subsequent instructions overwrite or read a register before the original value is fully utilized, stalling out-of-order execution.[27]
The renaming mechanism dynamically maps architectural registers specified in instructions to physical registers during the dispatch stage, using mapping tables to track bindings and enable parallel execution of independent operations.[26] Source operands are resolved by consulting the map to fetch values from the correct physical registers, while destination registers receive new physical allocations from a pool, eliminating name conflicts.[27]
Two primary implementation types exist: the physical register file (PRF) approach, where a large dedicated file stores all renamed values and mappings point directly to it (as in the IBM ES/9000 or Alpha 21264), and the unified approach, where a structure such as the reorder buffer holds results and serves as the rename storage (as in the Pentium Pro).[26] In the PRF style, a physical register is allocated at rename time and holds the value until it is reclaimed; in unified designs, the reorder buffer entry allocated at dispatch serves as the temporary destination, with the value copied into the architectural register file at retirement.
A classic example illustrates WAW elimination in a loop: consider a body such as loop: LW R1, 0(R2); ADD R3, R1, R4; ADDI R2, R2, #4; BNE R2, R5, loop, where every iteration writes a newly loaded value into R1, creating WAW and WAR hazards across iterations even though the individual loads are independent.[27] Renaming assigns a distinct physical register to each write (e.g., the first iteration's LW to P5, the second's to P6), allowing loads from different iterations to proceed in parallel without interference, as each read references the map entry that was current when it was renamed.
Implementation relies on a free list to track and allocate available physical registers, typically managed as a queue or pointer structure to ensure rapid dispatch without stalls.[28] For speculative execution, checkpointing saves map table states and free list heads at branch points, enabling quick rollback to prior mappings upon misprediction by restoring the saved configuration in one cycle.[26][28]
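A simplified model of how the map table, free list, and checkpoints interact is sketched below; the register names, the checkpoint-per-branch policy, and the omission of physical-register reclamation at retirement are simplifying assumptions.

# Simplified rename stage: a map table binds architectural to physical
# registers, a free list supplies a fresh physical register for every write,
# and checkpoints of the map allow one-step rollback after a misprediction.
from collections import deque

rename_map = {"R1": "P1", "R2": "P2"}
free_list = deque(["P5", "P6", "P7", "P8"])
checkpoints = []

def rename(dest, srcs):
    phys_srcs = [rename_map[s] for s in srcs]   # read the current bindings
    new_phys = free_list.popleft()              # allocate a fresh destination
    rename_map[dest] = new_phys                 # later readers see this mapping
    return new_phys, phys_srcs

def checkpoint():
    checkpoints.append(dict(rename_map))        # snapshot taken at a branch

def rollback():
    rename_map.clear()
    rename_map.update(checkpoints.pop())        # restore pre-branch mappings
    # (reclaiming physical registers allocated on the wrong path is omitted)

# Two writes to R1 from different iterations get distinct physical registers,
# removing the WAW/WAR hazards between them.
print(rename("R1", ["R2"]))     # ('P5', ['P2'])
checkpoint()
print(rename("R1", ["R2"]))     # ('P6', ['P2'])  speculative second iteration
rollback()
print(rename_map["R1"])         # 'P5' again after misprediction recovery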
Historically, register renaming originated in Robert Tomasulo's 1967 algorithm for the IBM System/360 Model 91, initially applied to floating-point units to tolerate long latencies.[26] It was refined in subsequent designs, such as modern Intel Core processors employing hundreds of physical registers for integer and floating-point operations.[27]
By providing more physical registers than architectural ones, renaming expands the effective ILP window—for instance, enabling execution windows to grow from a few instructions to 128 or more—thus sustaining higher throughput in out-of-order pipelines.[28] This integrates into the broader out-of-order flow by resolving dependencies early, facilitating subsequent scheduling.[26]
Reorder Buffers and Completion Logic
The reorder buffer (ROB) is a circular buffer that holds instructions from the point of dispatch until retirement, storing their results, destination registers, and completion status to enable out-of-order execution while ensuring in-order commitment to the architectural state.[29] Introduced in early pipelined processor designs to support precise interrupts, the ROB allows instructions to complete execution out of program order but buffers them until all prior instructions have finished, preventing speculative results from corrupting the processor state.[29]
Completion logic in the ROB relies on head and tail pointers to maintain program order. The tail pointer indicates the position where new instructions are allocated upon dispatch, forming a queue that advances as instructions are issued. The head pointer tracks the oldest instruction; an instruction at the head retires only when it has completed execution (i.e., its result is available and all dependencies are resolved) and is confirmed non-speculative, such as after branch resolution. This mechanism ensures sequential retirement, with multiple instructions potentially retiring per cycle if consecutive head entries are ready.[29]
For exception handling, the ROB stalls retirement on traps or interrupts until the buffer drains to the faulting instruction, allowing precise state restoration by committing only instructions before the exception. On branch mispredictions, the tail pointer is rolled back to squash all speculative instructions younger than the mispredicted branch, freeing ROB entries and preventing wrong-path results from retiring. This flushing preserves architectural correctness but incurs recovery latency proportional to the speculation window.[29]
Consider a simple four-instruction sequence: ADD R1, R2, R3; LOAD R4, [R5]; SUB R6, R7, R8; MUL R9, R1, R4. Despite out-of-order completion (e.g., SUB finishes first, followed by ADD, LOAD, then MUL), the ROB ensures retirement in original order by holding results until the head advances.
| ROB Entry | Instruction | Dispatch Order | Completion Order | Status | Retire Order |
|---|---|---|---|---|---|
| 0 (Head) | ADD R1, R2, R3 | 1 | 2 | Ready | 1 |
| 1 | LOAD R4, [R5] | 2 | 3 | Ready | 2 |
| 2 | SUB R6, R7, R8 | 3 | 1 | Ready | 3 |
| 3 (Tail) | MUL R9, R1, R4 | 4 | 4 | Ready | 4 |
In this example, SUB's early completion is buffered until ADD and LOAD retire first, demonstrating how the ROB decouples execution from commitment.[30]
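The retirement behavior shown in the table can be modeled in a few lines of Python. The sketch below is purely illustrative and omits result values, exception handling, and circular-buffer wraparound.

# Minimal reorder-buffer model for the four-instruction example above:
# entries complete in an arbitrary order but retire strictly from the head.
rob = [
    {"instr": "ADD R1, R2, R3", "done": False},    # entry 0 (head)
    {"instr": "LOAD R4, [R5]",  "done": False},    # entry 1
    {"instr": "SUB R6, R7, R8", "done": False},    # entry 2
    {"instr": "MUL R9, R1, R4", "done": False},    # entry 3 (tail)
]
head = 0

def complete(index):
    rob[index]["done"] = True    # out-of-order completion only sets a flag

def retire():
    global head
    retired = []
    while head < len(rob) and rob[head]["done"]:
        retired.append(rob[head]["instr"])         # commit in program order
        head += 1
    return retired

complete(2)                  # SUB finishes first ...
print(retire())              # [] : nothing retires, the head (ADD) is not done
complete(0); complete(1)     # ADD and LOAD finish
print(retire())              # ['ADD R1, R2, R3', 'LOAD R4, [R5]', 'SUB R6, R7, R8']
complete(3)
print(retire())              # ['MUL R9, R1, R4']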
ROB variants include unified designs, where the buffer stores both status and results (as in early implementations), versus separate structures paired with a physical register file (PRF) that holds results while the ROB tracks only tags and status for retirement. In the Intel Pentium 4, for instance, the ROB comprises 126 entries focused on status tracking, integrated with a separate 128-entry PRF to manage speculation without duplicating data storage.[31] ROB size directly determines the speculation window; larger buffers, like the 126 entries in the Pentium 4, allow more in-flight instructions and thus higher instruction-level parallelism, though dispatch stalls whenever the buffer fills.[31]
Larger ROBs enable greater speculation and performance gains from out-of-order execution but impose power and area penalties due to increased storage, pointer management, and wake-up logic complexity. Dynamic resizing techniques can mitigate these trade-offs by scaling ROB capacity based on workload, reducing energy dissipation in low-ILP scenarios while preserving peak performance.
Instruction Scheduling and Resource Allocation
In out-of-order execution processors, instruction scheduling relies on dynamic structures such as issue queues or reservation stations to hold renamed instructions after dispatch, allowing them to wait until their operand dependencies are resolved and execution resources become available.[6] These structures prioritize instructions based on factors like age (to preserve program order where possible) and dependency readiness, enabling the processor to issue operations out-of-order while minimizing stalls.[32] Originating from Robert Tomasulo's 1967 algorithm for the IBM System/360 Model 91, reservation stations buffer instructions near functional units, tagging operands with sources rather than immediate values to facilitate data forwarding.[6]
The core of instruction scheduling is the wakeup-select mechanism, which dynamically identifies and dispatches ready instructions to execution units.[33] In the wakeup phase, when an instruction completes execution, its result is broadcast on a common bus or result tags are compared against pending instructions in the scheduler; matching dependents are marked ready by updating operand tags.[34] The select phase then employs priority encoders or arbiters to choose instructions for issue, often favoring the oldest ready instruction to reduce head-of-line blocking and improve overall throughput.[33] This mechanism, refined in modern implementations, supports back-to-back execution of dependent chains by allowing immediate wakeup and selection in the same cycle.[34]
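A toy version of the wakeup-select loop is sketched below; the entry format, the age-based priority, and the two-wide issue limit are illustrative assumptions rather than a description of any particular scheduler.

# Toy wakeup-select loop: wakeup clears pending source tags when a result tag
# is broadcast; select issues the oldest ready entries up to the issue width.
issue_queue = [
    {"age": 0, "op": "MUL", "waiting_on": {"T1"}},          # oldest, needs T1
    {"age": 1, "op": "ADD", "waiting_on": set()},           # ready now
    {"age": 2, "op": "SUB", "waiting_on": set()},           # ready now
    {"age": 3, "op": "AND", "waiting_on": {"T1", "T2"}},
]

def wakeup(broadcast_tag):
    for entry in issue_queue:
        entry["waiting_on"].discard(broadcast_tag)          # operand arrived

def select(width=2):
    ready = [e for e in issue_queue if not e["waiting_on"]]
    chosen = sorted(ready, key=lambda e: e["age"])[:width]  # oldest first
    for e in chosen:
        issue_queue.remove(e)                               # issued this cycle
    return [e["op"] for e in chosen]

print(select())     # ['ADD', 'SUB'] : MUL still waits on tag T1
wakeup("T1")        # the producer of T1 broadcasts its result
print(select())     # ['MUL'] : AND still waits on T2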
Resource allocation during scheduling tracks the availability of functional units such as arithmetic logic units (ALUs), floating-point units, and address generation units (AGUs), as well as associated ports and pipelines.[35] Schedulers maintain status tables or counters for these resources, allocating them only to ready instructions that match the unit's capabilities, while separate queues handle memory operations like loads and stores to avoid contention with integer or floating-point execution.[32] For instance, in a typical design, a scheduler might issue up to four integer operations per cycle across multiple ALU ports while queuing memory instructions in a dedicated load/store queue, ensuring balanced utilization without overcommitting resources.[35]
Advanced scheduling incorporates out-of-order load/store disambiguation through mechanisms like store-forwarding, where loads check against pending stores in the store queue for address matches to forward data directly, bypassing memory access if possible.[36] This requires content-addressable memory (CAM) structures in the load/store queues to compare addresses efficiently, resolving potential anti-dependencies and allowing loads to execute speculatively ahead of stores.[37] Such features enhance memory-level parallelism but demand precise address speculation to avoid misforwards that could lead to recovery overhead.
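The address-matching idea behind store-to-load forwarding can be sketched as follows; the sequential scan stands in for the parallel CAM comparison performed in hardware, and the addresses and data values are hypothetical.

# Simplified store-to-load forwarding: a load scans older store-queue entries
# (youngest first) for a matching address and takes the data directly,
# falling back to the data cache when no older store matches.
store_queue = [                       # program order: index 0 is oldest
    {"addr": 0x100, "data": 7},
    {"addr": 0x200, "data": 42},
    {"addr": 0x100, "data": 9},       # a later store to the same address
]
memory = {0x100: 1, 0x200: 2, 0x300: 3}

def load(addr, older_stores):
    # Scan from youngest to oldest so the most recent matching store wins;
    # real hardware performs this comparison in parallel with CAM logic.
    for entry in reversed(older_stores):
        if entry["addr"] == addr:
            return entry["data"]      # forwarded without touching the cache
    return memory[addr]               # no match: the memory system supplies it

print(load(0x100, store_queue))       # 9 (forwarded from the youngest matching store)
print(load(0x300, store_queue))       # 3 (no pending store, value comes from memory)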
Trade-offs in scheduler design balance queue depth, which increases the window of exploitable parallelism, against added latency in wakeup and selection due to larger structures.[33] Deeper queues enable holding more instructions for opportunistic issue but raise power and area costs from expanded CAM arrays; modern processors like AMD's Zen architecture mitigate this with distributed schedulers—six queues of 14 entries each supporting a 6-wide issue width—optimizing for low latency while scaling resource allocation.[35]