Instruction pipelining
Instruction pipelining is a fundamental technique in computer architecture that enhances processor performance by dividing the execution of an instruction into sequential stages, typically fetch, decode, execute, memory access, and write-back. Because each stage processes a different instruction at the same time, multiple instructions overlap in execution, much like items moving down an assembly line.[1] This approach aims to achieve an ideal cycles per instruction (CPI) of 1, improving throughput without necessarily increasing clock speed, though it favors uniform instruction formats for best efficiency, as seen in reduced instruction set computing (RISC) designs.[1] Originating in the late 1950s, pipelining was first implemented in general-purpose computers such as the IBM 7030 Stretch, which introduced overlapped execution to meet ambitious performance goals and marked a shift away from strictly sequential instruction processing.[2]
The technique gained prominence in the 1980s with RISC architectures, such as the MIPS R3000 featuring a five-stage pipeline, which relied on compiler-scheduled operations to minimize stalls from data dependencies, control hazards like branches, and structural conflicts.[1] Key challenges include pipeline hazards that cause stalls or flushes, addressed through strategies like branch prediction (achieving up to 90% accuracy in dynamic schemes), instruction reordering, and delayed branching.[1] The approach later evolved into deeper pipelines (superpipelining) for higher frequencies, multiple parallel pipelines (superscalar execution), and out-of-order processing with reservation stations, enabling modern CPUs to sustain instruction-level parallelism beyond simple overlap.[1] Despite these benefits, deeper pipelines increase latency for individual instructions and complicate exception handling, requiring precise interrupt mechanisms to maintain program correctness.[3] Overall, instruction pipelining remains a cornerstone of high-performance computing, underpinning the efficiency of processors from embedded systems to supercomputers.
Fundamentals
Concept and Motivation
Instruction pipelining is a technique for implementing instruction-level parallelism within a processor by decomposing the execution of each instruction into a series of sequential stages, such as fetch, decode, execute, and write-back, which can operate concurrently on different instructions.[4] This approach allows overlapping of instruction processing, where the hardware resources dedicated to each stage are utilized more efficiently by handling portions of multiple instructions simultaneously.[4]
The concept draws an analogy to an industrial assembly line, where specialized units perform distinct tasks on successive items in an overlapped manner, preventing idle time and increasing overall production rate without requiring duplicate equipment for each item.[4] In a processor, this means that while one instruction completes its execute stage, the next may be in decode, and another in fetch, thereby sustaining continuous operation across the pipeline stages.[4]
The primary motivation for instruction pipelining is to enhance processor throughput, enabling the completion of approximately one instruction per clock cycle in an ideal steady-state scenario, which corresponds to instructions per cycle (IPC) of about 1.[4] Without pipelining, the time to execute an instruction is the sum of all stage times, T = \sum t_i; with pipelining, the effective time per instruction approaches the duration of the longest stage, t_{\max}, yielding a throughput improvement factor of up to n for n balanced stages.[4] This speedup can be quantified as the ratio of non-pipelined to pipelined execution time for a program, S = \frac{\text{CPI}_{\text{non}} \times T_{\text{clk,non}}}{\text{CPI}_{\text{pip}} \times T_{\text{clk,pip}}}, where \text{CPI}_{\text{pip}} \approx 1 in steady state and the pipelined clock period T_{\text{clk,pip}} shrinks to roughly the delay of a single stage.[4]
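As a worked illustration with assumed stage latencies (the numbers are hypothetical rather than taken from any particular processor): if the five stages require 200, 100, 150, 200, and 100 ps, a non-pipelined datapath needs T = 750 ps per instruction, while the pipelined clock period is set by t_{\max} = 200 ps, so S = \frac{1 \times 750\,\text{ps}}{1 \times 200\,\text{ps}} \approx 3.75, short of the ideal factor of 5 because the stages are not perfectly balanced.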
Pipeline Stages
The classic five-stage pipeline in Reduced Instruction Set Computer (RISC) architectures divides instruction execution into sequential phases to enable overlapping operations and improve throughput. These stages are Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB), with each stage handling a specific portion of the instruction lifecycle while instructions advance synchronously through the pipeline.[5]
In the IF stage, the processor retrieves the instruction from instruction memory using the program counter to generate the memory address. The ID stage decodes the fetched instruction to interpret the opcode and operands, reading the necessary values from the register file while generating control signals for subsequent stages. During the EX stage, the arithmetic logic unit (ALU) performs computations such as addition, subtraction, or address calculation based on the decoded operation. The MEM stage facilitates data memory operations, allowing load instructions to read from memory or store instructions to write to it. Finally, the WB stage writes the computed results—either from the ALU or memory—back to the register file for use by future instructions.[5]
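The stage-by-stage overlap can be sketched in a few lines of Python; this is a toy model of an ideal, hazard-free five-stage pipeline, not a description of any specific processor's hardware:

```python
# Toy model of an ideal, hazard-free five-stage pipeline: each instruction
# enters IF one cycle after its predecessor and advances one stage per cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    """Return (cycle, {instruction: stage}) pairs for independent instructions."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for cycle in range(1, total_cycles + 1):
        occupancy = {}
        for i, instr in enumerate(instructions):
            stage_index = cycle - 1 - i      # instruction i enters IF in cycle i + 1
            if 0 <= stage_index < len(STAGES):
                occupancy[instr] = STAGES[stage_index]
        rows.append((cycle, occupancy))
    return rows

for cycle, occupancy in pipeline_diagram(["ADD", "SUB", "AND", "OR"]):
    print(cycle, occupancy)
```

Running the model on four independent instructions reproduces the eight-cycle fill, steady-state, and drain pattern shown in the worked example later in this article.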
Pipeline designs vary in stage count to balance simplicity, power efficiency, and performance targets. Shallow pipelines with 3 to 5 stages are common in simple or embedded processors, such as early RISC implementations, where reduced complexity supports lower power consumption and easier design. In contrast, deeper pipelines with 10 to 30 stages appear in high-performance processors, like modern superscalar CPUs, to enable higher clock frequencies by shortening individual stage delays, though this increases latency for single instructions and complexity in hazard management.[6][7]
Achieving balanced latencies across stages is crucial, as the clock cycle time is dictated by the slowest stage, limiting overall frequency if imbalances exist. Uneven stage delays reduce efficiency by forcing the entire pipeline to operate at the pace of the bottleneck, potentially lowering achievable clock speeds and throughput.[8][9]
The MIPS R2000, released in 1986, exemplified this approach with a balanced five-stage pipeline that supported clock speeds up to 16.67 MHz, making it one of the fastest commercial microprocessors of its era and establishing a model for subsequent RISC designs.[10][11]
Historical Development
Early Implementations
The conceptual roots of instruction pipelining trace back to the 1950s era of vacuum tube computers, where designers began exploring techniques for overlapping instruction execution to improve throughput despite hardware limitations.[2] A pivotal milestone in this development was the IBM 7030 Stretch, delivered in 1961, which introduced overlapped instruction execution in a general-purpose computer to achieve high performance for scientific computing, though it fell short of initial speed goals due to technological challenges.[2]
Building on these advances, the CDC 6600, the first commercial supercomputer, released in 1964 and designed by Seymour Cray at Control Data Corporation, pushed overlapped execution further by allowing scalar operations to proceed concurrently across its functional units, greatly improving performance on numerical workloads. The central processor featured 10 independent functional units for arithmetic and logic operations, with input/output handled by 10 peripheral processors, and used scoreboarding to manage dependencies so that multiple instructions could be in flight simultaneously. This design achieved a throughput of approximately 3 million instructions per second (MIPS), a remarkable feat for the time, by sustaining high utilization across the units from a 10 MHz clock.[12]
The IBM System/360 Model 91, introduced in 1967, represented another early commercial adoption of instruction pipelining tailored for scientific applications. Optimized for floating-point-intensive tasks in fields like space exploration and physics simulations, it employed a multi-stage pipeline that overlapped instruction fetch, decode, operand addressing, operand fetch, and execution, with a base cycle time of 60 nanoseconds. The instruction unit could sustain an issue rate of up to 0.8 instructions per cycle and ran ahead of execution to buffer operations and maintain concurrency, yielding performance up to 100 times that of the earlier IBM 7090 on certain floating-point benchmarks. This pipelined approach was particularly effective for workloads requiring rapid handling of complex arithmetic, though it introduced challenges such as imprecise interrupts due to the depth of overlap.[13]
Early pipelined designs like these were inherently constrained by mid-20th-century technology, typically limited to 3-4 pipeline stages owing to slow core memory access times (around 1-2 microseconds) and the complexity of vacuum tube or early transistor logic. These limitations meant that pipelines could not be deepened without risking instability from propagation delays, and designers prioritized reliability over aggressive overlap, often resulting in simpler fetch-execute structures rather than the deeper pipelines of later decades. Despite these hurdles, such implementations demonstrated pipelining's potential to boost instruction throughput in high-performance computing environments.
Modern Evolution
The rise of Reduced Instruction Set Computing (RISC) architectures in the 1980s significantly advanced instruction pipelining by emphasizing simple, uniform instructions that facilitated efficient multi-stage designs. The MIPS R2000, introduced in 1985, and its successor the R3000 in 1988, popularized a clean five-stage pipeline consisting of instruction fetch, decode, execute, memory access, and writeback stages, which minimized interlocks and maximized throughput.[10] This design influenced subsequent RISC implementations, including the ARM architecture, by demonstrating how streamlined pipelines could achieve high performance with low complexity and power consumption.[14]
In parallel, complex instruction set computing (CISC) processors like those in the x86 family pursued deeper pipelines to attain gigahertz clock speeds, though at the cost of increased complexity. The Intel Pentium 4, launched in 2000, featured a 20-stage pipeline that expanded to 31 stages in the Prescott variant by 2004, enabling higher frequencies but amplifying branch misprediction penalties and power demands.[15][16] By the mid-2000s, Intel shifted toward balanced designs in the Core microarchitecture series, reducing pipeline depth to 14 stages to improve energy efficiency and recovery from pipeline flushes while integrating out-of-order execution.[17]
Contemporary trends through 2025 reflect a move away from extreme pipeline depths toward wider execution units and improved branch prediction, constrained by power walls that limit clock scaling. ARM's Cortex-A series, such as the A78 (2020) and subsequent A715 (2022), employs 13- to 15-stage pipelines, prioritizing balanced performance for mobile and edge devices over sheer depth.[18] Similarly, Apple's M1 processor (2020) utilizes an out-of-order pipeline in its high-performance Firestorm cores, focusing on efficiency and wide issue widths to deliver superior single-threaded performance per watt without excessive depth.[19] This evolution stems from power-performance models showing that deeper pipelines beyond 15-20 stages yield diminishing returns due to higher leakage and dynamic power, favoring superscalar widths instead.[20]
Historically, pipeline depths have grown from approximately four stages in 1970s designs to over 20 in the 2000s, driven by frequency pursuits, but now stabilize at 10-15 stages to optimize branch recovery times and overall efficiency in multicore environments.[21]
Pipeline Hazards
Types of Hazards
In pipelined processors, hazards represent situations where the pipeline's assumption of independent instruction execution is violated, potentially leading to incorrect results or stalls if not addressed. These disruptions arise because instructions in different stages may conflict in their resource usage or data dependencies, preventing the next instruction from proceeding in its scheduled clock cycle.[9]
Structural hazards occur when multiple instructions require the same hardware resource simultaneously, but the resource cannot support concurrent access. A classic example is a single memory unit shared between the instruction fetch (IF) stage and the memory access (MEM) stage, where one instruction is fetching while another is loading or storing data. This conflict forces the pipeline to stall until the resource becomes available.[9]
Data hazards stem from dependencies between instructions where the result of one instruction is needed by a subsequent one before it is fully available. They are classified into three types based on the order of read and write operations: read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW). RAW hazards, the most common, happen when an instruction reads a register or memory location after a previous instruction has written to it but before the write completes; for instance, in the sequence add $1, $2, $3 followed by sub $4, $1, $5, the execution (EX) stage of the subtraction requires the write-back (WB) result from the addition, which is still pending. WAR hazards arise when an instruction writes to a location before a prior instruction has read from it, potentially overwriting a value needed by the earlier instruction. WAW hazards occur when multiple instructions write to the same location, risking out-of-order updates that could alter the final value. In in-order pipelines, WAR and WAW hazards are rare or impossible due to the fixed sequential execution order, which ensures reads and writes to registers occur predictably without overtaking.[9][22]
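The three categories can be expressed compactly; the following Python sketch is purely illustrative (the dictionary encoding of read and written registers is an assumption of this example, not a hardware interface):

```python
# Classify the data dependency between an earlier and a later instruction.
# Each instruction is described by the register it writes and the registers it reads.
def classify_hazards(earlier, later):
    """Return the set of hazard types ('RAW', 'WAR', 'WAW') between two instructions."""
    hazards = set()
    if earlier["writes"] and earlier["writes"] in later["reads"]:
        hazards.add("RAW")   # later reads a value the earlier instruction produces
    if later["writes"] and later["writes"] in earlier["reads"]:
        hazards.add("WAR")   # later overwrites a register the earlier one still reads
    if earlier["writes"] and earlier["writes"] == later["writes"]:
        hazards.add("WAW")   # both write the same register, so write order matters
    return hazards

# add $1, $2, $3 followed by sub $4, $1, $5 exhibits a RAW hazard on $1
add_instr = {"writes": "$1", "reads": {"$2", "$3"}}
sub_instr = {"writes": "$4", "reads": {"$1", "$5"}}
print(classify_hazards(add_instr, sub_instr))   # {'RAW'}
```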
Control hazards emerge from instructions that alter the program counter (PC), such as branches or jumps, introducing uncertainty about the correct execution path. When a conditional branch is encountered, subsequent instructions are fetched assuming the branch is not taken, but if it is taken, those fetched instructions are incorrect, leading to a pipeline flush. This uncertainty delays resolution until the branch condition is evaluated, typically in later stages.[9]
Resolution Techniques
Hazard detection in pipelined processors primarily occurs in the instruction decode (ID) stage through dedicated hardware units that compare register specifiers. For read-after-write (RAW) hazards involving loads, the hazard detection unit checks whether the instruction ahead in the pipeline is a load whose destination register matches a source register of the instruction in ID, for example verifying that ID/EX.MemRead is asserted and ID/EX.RegisterRt equals IF/ID.RegisterRs or IF/ID.RegisterRt; analogous comparisons of the EX/MEM and MEM/WB destination registers against the source registers of the instruction entering execution drive the forwarding logic described below.[23] This early detection allows the pipeline control logic to initiate resolution mechanisms before the hazard propagates, simplifying overall pipeline management since stall decisions are made before the instruction enters execution.[24]
One common hardware resolution technique is stalling the pipeline, which inserts no-operation (NOP) instructions, or bubbles, to delay dependent instructions until the required data is available. When a hazard is detected—particularly load-use hazards where a load instruction's result is needed immediately in the next instruction—the hazard detection unit prevents the instruction fetch (IF) and ID stages from advancing by holding the program counter (PC) and flushing control signals to zero in the ID/EX pipeline register, effectively propagating a bubble through subsequent stages.[25] This approach ensures correctness but incurs a performance penalty, typically a one-cycle stall for load-use cases in a classic five-stage pipeline, as the dependent instruction remains in ID while the load completes memory access and write-back.[23]
Forwarding, also known as bypassing, addresses many RAW data hazards more efficiently by using multiplexers (muxes) to route intermediate results directly from later pipeline stages back to earlier ones, bypassing the register file. In a typical implementation, results from the execute (EX) stage are forwarded via muxes from the EX/MEM pipeline register to the EX stage inputs of dependent instructions, while memory (MEM) stage results are forwarded from the MEM/WB register; control logic selects these paths when register matches are detected, such as ForwardA = 10 for EX/MEM sourcing.[23] This technique resolves the majority of data hazards without stalling, though it requires additional hardware paths and comparator logic, and cannot handle all cases like immediate load-use dependencies.[25]
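A schematic of these selections, written with the textbook-style pipeline-register names used above; the Python form is purely illustrative of the comparisons a forwarding unit and hazard detection unit perform, not a hardware description:

```python
# Forwarding and hazard-detection comparisons, following the convention cited above:
# ForwardA = "10" selects the EX/MEM result, "01" the MEM/WB result, "00" the register file.
def forward_a(id_ex_rs, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Choose the source of the first ALU operand for the instruction in EX."""
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "10"      # forward the ALU result waiting in the EX/MEM register
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "01"      # forward the older result (or loaded value) from MEM/WB
    return "00"          # no dependence: read the operand from the register file

def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """Detect the one-cycle load-use hazard that forwarding alone cannot remove."""
    return id_ex_memread and id_ex_rt in (if_id_rs, if_id_rt)
```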
Software-based resolution through compiler scheduling complements hardware methods by reordering instructions at compile time to minimize hazard occurrences, particularly for non-critical dependencies. Compilers analyze data dependencies and rearrange code sequences—such as placing independent operations between a load and its use—to avoid stalls, assuming knowledge of the target pipeline structure; for instance, reordering can reduce clock cycles by eliminating multiple load-use stalls in a sequence.[23] This approach is especially effective for straight-line code but is limited by true dependencies that cannot be reordered without altering program semantics, and it relies on static analysis rather than runtime detection.[26]
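The idea can be sketched as a toy scheduling pass; this simplified Python example assumes register dependences only (it ignores memory dependences and the many other constraints a real compiler must honor):

```python
# Toy scheduling pass: if an instruction immediately uses the result of a load,
# hoist a later, independent instruction into the slot between them.
def independent(cand, other):
    """True if cand can be reordered across other without changing results."""
    return (
        other["writes"] not in cand["reads"]                               # no RAW
        and cand["writes"] not in other["reads"]                           # no WAR
        and (cand["writes"] is None or cand["writes"] != other["writes"])  # no WAW
    )

def schedule_load_use(instrs):
    instrs = list(instrs)
    i = 0
    while i + 1 < len(instrs):
        load, use = instrs[i], instrs[i + 1]
        if load["op"] == "lw" and load["writes"] in use["reads"]:
            for j in range(i + 2, len(instrs)):
                cand = instrs[j]
                # cand may fill the slot only if it is independent of every
                # instruction it would be hoisted over, including load and use.
                if all(independent(cand, instrs[k]) for k in range(i, j)):
                    instrs.insert(i + 1, instrs.pop(j))
                    break
        i += 1
    return instrs

# lw $t0,0($t1); add $t2,$t0,$t3; or $t4,$t5,$t6  ->  the or fills the load slot
prog = [
    {"op": "lw",  "writes": "$t0", "reads": {"$t1"}},
    {"op": "add", "writes": "$t2", "reads": {"$t0", "$t3"}},
    {"op": "or",  "writes": "$t4", "reads": {"$t5", "$t6"}},
]
print([ins["op"] for ins in schedule_load_use(prog)])   # ['lw', 'or', 'add']
```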
Branch Handling
Control Hazards
Control hazards in pipelined processors arise from conditional branches and jumps that disrupt the sequential execution of instructions by altering the program counter (PC). When a branch instruction enters the pipeline, the fetch stage continues to load subsequent instructions assuming sequential flow, but the branch outcome (taken or not taken) and target address are not resolved until later stages, typically the execute stage in a classic five-stage pipeline. This leads to the pipeline fetching and partially processing incorrect instructions, requiring the flushing of 1 to 3 pipeline stages to redirect fetch to the correct path.[27][28]
The impact of unresolved control hazards is significant, as they introduce stalls or flushes that degrade performance. In programs with frequent branches—comprising about 14-20% of instructions—unmitigated control hazards can increase the cycles per instruction (CPI) from an ideal 1 to 1.42, effectively reducing instructions per cycle (IPC) by approximately 30% in branch-heavy workloads.[27]
One early technique to mitigate control hazards is delayed branching, which schedules the branch resolution such that the instruction immediately following the branch executes unconditionally, irrespective of the branch outcome. This creates a branch delay slot that the compiler fills with independent instructions, such as those not dependent on the branch condition or target, thereby hiding the penalty without flushing. In the MIPS architecture, which features a single branch delay slot in its five-stage pipeline, compilers rearrange code to populate this slot with useful operations about 48-60% of the time, avoiding the full branch penalty when successful.[27][29]
In contrast to data hazards, which stem from dependencies on operand values and primarily affect register reads or writes, control hazards fundamentally alter the instruction stream path, often necessitating pipeline-wide recovery rather than targeted forwarding or localized stalls.[28][30]
Prediction Methods
Prediction methods for branch outcomes aim to anticipate whether a conditional branch will be taken or not taken, thereby reducing the pipeline stall associated with control hazards. Static prediction techniques, determined at compile time without runtime adaptation, include always-not-taken and always-taken strategies. The always-not-taken approach assumes all branches fall through to the sequential path, achieving typical accuracies of around 60-70% in benchmark workloads, but performs poorly on loops where backward branches are frequently taken. Conversely, always-taken prediction favors branch targets, yielding slightly higher accuracy in programs with more taken branches, yet it also struggles with non-loop forward branches that are often not taken. These methods are simple to implement, requiring no hardware tables, but their fixed nature limits effectiveness in diverse code patterns.[31]
Dynamic prediction, in contrast, adapts based on runtime branch history using hardware structures like the branch history table (BHT). A seminal dynamic scheme employs a 2-bit saturating counter per entry in the BHT, indexed by the branch's program counter (PC) bits, to track recent outcomes: the counter increments on taken branches and decrements on not taken, with the top bit determining the prediction. This design, introduced in early studies of pipelined processors, mitigates the oscillation issues of 1-bit predictors and achieves prediction accuracies of 82-99% depending on table size (e.g., 512 to 32K entries), significantly outperforming static methods by capturing local branch patterns. Aliasing from shared table entries can degrade performance, but the 2-bit mechanism provides hysteresis for stable predictions in repetitive code like loops.[32]
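A minimal sketch of such a predictor, with table size, index function, and initial counter state chosen for illustration rather than matching any particular design:

```python
# Two-bit saturating counter predictor with a direct-mapped branch history table.
# Counter values 0-1 predict not taken, 2-3 predict taken.
class TwoBitPredictor:
    def __init__(self, entries=512):
        self.entries = entries
        self.table = [1] * entries          # start weakly not-taken

    def _index(self, pc):
        return (pc >> 2) % self.entries     # low-order PC bits select the entry

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)   # saturate at strongly taken
        else:
            self.table[i] = max(0, self.table[i] - 1)   # saturate at strongly not taken

# A loop branch taken nine times, then not taken once on exit
bp, hits = TwoBitPredictor(), 0
for taken in [True] * 9 + [False]:
    hits += bp.predict(0x400100) == taken
    bp.update(0x400100, taken)
print(hits, "of 10 predicted correctly")    # 8 of 10 with this initial state
```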
To address target address resolution for taken branches, especially indirect ones, the branch target buffer (BTB) serves as a cache-like structure that stores recent branch PCs and their targets, indexed by the current PC during fetch. Proposed in foundational work on pipelined systems, the BTB enables early target fetching, reducing latency beyond direction prediction alone; hits provide the target immediately, while misses default to sequential execution. It complements BHT-based direction predictors, with set-associative designs minimizing conflicts, and is essential for high-performance pipelines where branch resolution occurs late.[33]
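A correspondingly minimal, direct-mapped BTB sketch; the entry count and index function are again illustrative assumptions:

```python
# Direct-mapped branch target buffer: maps a branch PC to its most recent target
# so the fetch stage can redirect immediately instead of waiting for decode.
class BranchTargetBuffer:
    def __init__(self, entries=1024):
        self.entries = entries
        self.tags = [None] * entries
        self.targets = [None] * entries

    def lookup(self, pc):
        i = (pc >> 2) % self.entries
        return self.targets[i] if self.tags[i] == pc else None   # None -> fall through

    def update(self, pc, target):
        i = (pc >> 2) % self.entries
        self.tags[i], self.targets[i] = pc, target
```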
Advanced dynamic predictors like TAGE (TAgged GEometric history length) combine multiple history lengths with tagging to exploit both local and global correlations, using a geometric increase in history table sizes for longer patterns. Developed as a high-accuracy solution, TAGE selects among components via a parallel lookup and override mechanism, achieving over 95% accuracy in modern benchmarks with reasonable hardware (e.g., 64KB), and is widely adopted in x86 and ARM processors for its balance of precision and complexity. For instance, Intel's Core i7 employs a hybrid predictor incorporating TAGE-like elements in its 14-stage pipeline, limiting misprediction penalties to 10-15 cycles through accurate foresight and recovery.[34][35]
Advanced Topics
Special Situations
In pipelined processors, exceptions such as arithmetic overflows or page faults must be handled precisely to maintain the illusion of sequential execution, meaning the processor state reflects completion of all prior instructions without side effects from subsequent ones.[36] Precise exceptions contrast with imprecise ones, where the state may reflect partially executed future instructions, complicating software recovery and debugging.[36] To achieve precision in out-of-order pipelines, mechanisms like history buffers store the original register values and memory states before speculative updates, allowing rollback to the exact faulting instruction upon exception detection.[37]
Interrupts, which are asynchronous signals from hardware devices, are classified as maskable—those that can be temporarily disabled by setting an interrupt mask bit—or non-maskable, which cannot be ignored and demand immediate response for critical events like power failure.[38] In pipelined designs, handling an interrupt typically involves flushing instructions after the current one from the pipeline to prevent interference, while saving processor state from pipeline registers at boundaries such as instruction decode (ID) to execute (EX), ensuring the restart address points to the interrupted instruction.[36]
Multi-cycle instructions, such as integer division operations that typically require 20–90 cycles or more depending on the architecture, introduce variable latency that can disrupt pipeline flow.[39][40] These are managed either by stalling the pipeline—inserting no-op bubbles until completion to resolve structural hazards—or by dedicating separate functional units that operate in parallel without blocking the main pipeline, as seen in early RISC designs where divide units feed results back via a dedicated latch.[39]
In ARM pipelines, exceptions leverage banked registers, separate sets of registers for modes like Fast Interrupt Request (FIQ), to minimize context-switch overhead; for instance, FIQ mode provides banked copies of R8–R14 and its own SPSR, avoiding the need to push these registers to the stack and saving several clock cycles compared to standard IRQ handling.[41]
For recovery in speculative execution, checkpointing establishes restore points by snapshotting the register rename map and architectural state just before branches or other speculative decisions, enabling efficient rollback on mispredictions or exceptions without full re-execution of the window.[42] This approach, as in checkpoint processing and recovery designs, uses a small buffer of checkpoints (e.g., 8 for a 2048-instruction window) to limit overhead to under 8% while scaling instruction windows.[42]
Integration with Other ILP Techniques
Instruction pipelining integrates seamlessly with superscalar architectures, which employ multiple parallel pipelines to issue and execute several instructions simultaneously in each clock cycle, typically ranging from 2 to 8 instructions per cycle depending on the design.[43] This approach exploits instruction-level parallelism (ILP) by dispatching independent instructions to distinct execution units, such as integer ALUs or floating-point units, thereby increasing throughput beyond the single-instruction-per-cycle limit of scalar pipelines. For instance, Intel Core processors, starting from the Core 2 generation, feature a 4-wide superscalar design, allowing up to four micro-operations to be issued per cycle to enhance overall performance in pipelined environments.[43]
Out-of-order execution further enhances pipelining by dynamically reordering instructions at runtime to bypass dependencies and hide latency in deep pipelines, using mechanisms like reservation stations and a reorder buffer to maintain architectural correctness.[44] Originating from Tomasulo's algorithm, this technique tracks instruction dependencies and dispatches ready operations to available execution units ahead of program order, mitigating stalls from data hazards in superscalar pipelines.[44] The reorder buffer ensures results are committed in original order, supporting precise exceptions while tolerating long latencies, such as those from memory accesses, in modern deep pipelines.[45]
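The in-order-commit role of the reorder buffer can be illustrated with a small sketch; this Python model is a conceptual simplification (no register renaming, exception handling, or store buffering) rather than a faithful Tomasulo implementation:

```python
# Conceptual reorder buffer: results may complete out of program order, but they
# are committed to architectural state strictly in order, from the buffer's head.
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()                   # program-ordered entries

    def dispatch(self, dest_reg):
        entry = {"dest": dest_reg, "value": None, "done": False}
        self.entries.append(entry)
        return entry                             # handle used when execution finishes

    def complete(self, entry, value):
        entry["value"], entry["done"] = value, True   # out-of-order completion

    def commit(self, regfile):
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()           # retire only from the head
            regfile[e["dest"]] = e["value"]

rob, regs = ReorderBuffer(), {}
older = rob.dispatch("r1")
younger = rob.dispatch("r2")
rob.complete(younger, 42)                        # the younger instruction finishes first...
rob.commit(regs)                                 # ...but nothing retires yet
rob.complete(older, 7)
rob.commit(regs)
print(regs)                                      # {'r1': 7, 'r2': 42}
```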
Speculative execution complements pipelining by allowing the processor to fetch, decode, and execute instructions along a predicted control flow path before resolving branches, with mispredicted work discarded via squashing to minimize penalties.[46] This ties closely to branch prediction methods, enabling the pipeline to continue processing without stalling on control hazards, as the speculated instructions can be rolled back if the prediction fails.[47] A practical integration is seen in AMD's Zen 4 architecture (introduced in 2022), which combines a 19-stage pipeline with out-of-order execution and speculative mechanisms to achieve over 5 instructions per cycle (IPC) in high-ILP workloads.[48][49]
However, widening superscalar issue rates beyond 4 instructions per cycle yields diminishing returns due to escalating hardware complexity, including larger dependency check logic, increased power consumption, and limited available ILP in typical programs.[43] Designs exceeding 4-6 wide often face bottlenecks in fetch, rename, and wakeup stages, making further scaling inefficient without proportional performance gains.[50]
Design Considerations
The clock cycles per instruction (CPI) serves as a fundamental metric for assessing pipeline efficiency, representing the average number of clock cycles required to execute one instruction. In an ideal scalar pipeline without hazards, CPI equals 1, as each instruction completes one stage per cycle in steady state. However, structural, data, and control hazards introduce stalls, increasing CPI above 1; in the presence of stalls, CPI = 1 + average stall cycles per instruction.[51][52]
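For example, with hazard frequencies assumed purely for illustration: if loads make up 30% of instructions and half of them are immediately followed by a dependent instruction that incurs a one-cycle stall, the stall contribution is 0.30 \times 0.5 \times 1 = 0.15 cycles per instruction, giving CPI = 1.15 and IPC \approx 0.87.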
The instructions per cycle (IPC), the reciprocal of CPI, measures the pipeline's ability to sustain instruction execution rates, providing a direct indicator of throughput. In scalar pipelines, IPC ideally approaches 1, but superscalar processors, which issue multiple instructions per cycle, target IPC values greater than 1 to exploit instruction-level parallelism. For instance, modern superscalar designs aim for IPC in the range of 2-4, depending on workload and hardware capabilities, enabling higher overall performance compared to non-superscalar pipelines.[53][54]
Pipeline throughput quantifies the steady-state rate at which instructions are completed, typically expressed as instructions per cycle once the pipeline is filled. In an ideal k-stage pipeline, throughput reaches 1 instruction per cycle, limited only by the slowest stage and assuming no bottlenecks from hazards. Real-world throughput is reduced by pipeline stalls and flushes, with deeper pipelines potentially increasing peak throughput but amplifying losses from disruptions.[55]
The branch misprediction penalty measures the performance cost of incorrect branch predictions, defined as the number of cycles lost to flushing incorrectly fetched instructions from the pipeline. This penalty is roughly the number of stages between instruction fetch and the stage in which the branch is resolved, since all work issued in those cycles must be discarded, so it grows with pipeline depth. For example, in a 5-stage pipeline resolving branches in the execute stage, the penalty is about 2 to 3 cycles, and it scales upward with pipeline complexity.[56][57]
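As an illustrative calculation with assumed figures: if 20% of instructions are branches, 10% of those branches are mispredicted, and each misprediction costs 3 flush cycles, the added term is 0.20 \times 0.10 \times 3 = 0.06 cycles per instruction, raising CPI from an ideal 1.00 to about 1.06.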
Overall, pipelining contributes roughly a 5-10x speedup over non-pipelined processors by overlapping instruction execution, though actual gains depend on hazard mitigation; combined with superscalar issue, typical 5- to 20-stage modern designs sustain effective IPC of 2-4 under balanced workloads.[58][59]
Implementation Trade-offs
Implementing deeper instruction pipelines enables higher clock frequencies by reducing the combinational logic delay per stage, allowing each stage to complete in less time. However, this comes at the cost of increased penalties for control hazards, such as branch mispredictions and exceptions, since recovery from errors requires flushing more stages, leading to greater performance degradation.[60][61]
Power consumption in pipelined processors rises with additional stages due to increased switching activity from more latches and wires, which elevate dynamic energy dissipation across the pipeline. Techniques like dynamic voltage scaling can mitigate this by adjusting supply voltage and frequency based on workload demands, reducing overall power without proportionally sacrificing performance.[20][62][63]
Forwarding logic and associated buffers, essential for resolving data hazards in deep pipelines, introduce significant area overhead on the die, often accounting for a notable portion of the processor's silicon resources.[64]
Post-2010, processor designs shifted toward shallower pipelines combined with wider issue widths to improve energy efficiency, as exemplified by Intel's move from the NetBurst architecture's 31 stages in the Pentium 4 Prescott to roughly 14 stages in the Core microarchitecture and its Nehalem and Sandy Bridge successors.[65][66]
The optimal pipeline depth represents a balancing act influenced by target application domains, with mobile processors favoring 10-15 stages to prioritize low power and quick recovery from hazards, while server processors often employ 20 or more stages to maximize throughput at higher frequencies.[64][20]
Illustrative Examples
Basic Pipeline Execution
In a classic five-stage instruction pipeline, as described in foundational computer architecture texts, each instruction progresses through the Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB) stages, with each stage ideally completing in one clock cycle. This design enables multiple instructions to overlap in execution, increasing throughput without reducing the latency for a single instruction.
To illustrate ideal pipeline behavior, consider the execution of four independent arithmetic-logic unit (ALU) instructions on a processor like the MIPS architecture: ADD R1, R2, R3 (adds R2 and R3, stores in R1); SUB R4, R5, R6 (subtracts R6 from R5, stores in R4); AND R7, R8, R9 (bitwise AND of R8 and R9, stores in R7); and OR R10, R11, R12 (bitwise OR of R11 and R12, stores in R10).[67] These instructions have no data dependencies or control transfers, allowing seamless overlap without disruptions. The first instruction enters the IF stage in clock cycle 1, the second in cycle 2, the third in cycle 3, and the fourth in cycle 4.
The following table depicts the cycle-by-cycle progression of these instructions through the pipeline stages, assuming balanced stage timings and no hazards. Each row represents a clock cycle, with the stage occupied by each instruction (I1 through I4) listed across the columns; a dash (-) indicates that the instruction has not yet entered the pipeline or has already completed all stages.

| Cycle | I1 (ADD) | I2 (SUB) | I3 (AND) | I4 (OR) |
|---|---|---|---|---|
| 1 | IF | - | - | - |
| 2 | ID | IF | - | - |
| 3 | EX | ID | IF | - |
| 4 | MEM | EX | ID | IF |
| 5 | WB | MEM | EX | ID |
| 6 | - | WB | MEM | EX |
| 7 | - | - | WB | MEM |
| 8 | - | - | - | WB |

By cycle 5, the pipeline reaches steady state: I1 is completing WB, I2 is in MEM, I3 in EX, and I4 in ID, while the IF stage is free to fetch a fifth instruction if one follows.[68] In this configuration, the processor completes one instruction per cycle, achieving a throughput approaching one instruction per clock cycle after the initial pipeline fill. This ideal overlap demonstrates the core benefit of pipelining for sequential instruction streams under no-hazard conditions.[67]
Stalls and Bubbles
In instruction pipelining, stalls occur when a hazard prevents an instruction from proceeding to the next stage, forcing the pipeline to pause earlier instructions to maintain correctness. Bubbles refer to the no-operation (NOP) instructions or idle cycles inserted during these stalls, which propagate through the pipeline stages without performing useful work. Data hazards, particularly read-after-write (RAW) dependencies, are a primary cause of such stalls, as a subsequent instruction attempts to read a register before the producing instruction writes its result.[9]
A classic example of a RAW data hazard is a load instruction followed immediately by a dependent arithmetic operation, such as:
lw $t0, 0($t1) # Load word into $t0
add $t2, $t0, $t3 # Add $t0 to $t3, store in $t2
In a standard five-stage pipeline (Instruction Fetch [IF], Instruction Decode [ID], Execute [EX], Memory [MEM], Write-Back [WB]), the load instruction reads data from memory during MEM but only writes it to the register file during WB. The add instruction needs that value when it reads its operands, so without forwarding it cannot proceed until the load's write-back completes (assuming the register file is written in the first half of a cycle and read in the second half). The pipeline control detects the hazard during the add's ID stage, holds the add in ID for two cycles, and inserts two NOP bubbles into EX, delaying fetch and decode of subsequent instructions.
The following table illustrates the pipeline execution with stalls (assuming no other hazards), showing instructions held in their stages during the two stall cycles:
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| lw | IF | ID | EX | MEM | WB | - | - | - | - |
| add | - | IF | ID | ID | ID | EX | MEM | WB | - |
| next | - | - | IF | IF | IF | ID | EX | MEM | WB |
Bubble insertion ensures correctness by effectively pushing back dependent instructions; the NOPs flow through EX, MEM, and WB without altering program state, allowing the load to complete before the add proceeds.[69]
Forwarding, or bypassing, mitigates this by routing the result directly from the MEM or WB stage to the EX stage input via multiplexers (muxes), bypassing the register file read. In the load-add example, the load's MEM-stage output can be mux-selected for the add's EX input after one stall cycle, reducing the penalty to one cycle instead of two. The hardware adds muxes at the ALU inputs, controlled by hazard detection logic, which selects between the register file value and forwarded paths from prior stages. This simple resolution maintains pipeline throughput closer to ideal while preserving sequential execution semantics.[23]
In unoptimized pipelines lacking forwarding, bubbles from RAW data hazards can reduce throughput by 20-50% in dependent code sequences, where instructions frequently rely on immediate predecessors, elevating cycles per instruction (CPI) significantly above the ideal value of 1.[70]