
Speedup

Speedup in computer architecture is a metric that quantifies the performance improvement of an architecture, algorithm, or system relative to a baseline, commonly defined as the ratio of the execution time of the original version to that of the improved version for the same problem. The concept is fundamental in evaluating efficiency gains from optimizations such as parallel computing, where speedup measures how much faster a parallel program solves a problem relative to its sequential counterpart.

In parallel computing, speedup is often analyzed through Amdahl's law, proposed by Gene Amdahl in 1967, which predicts the theoretical maximum speedup based on the fraction of a program that can be parallelized. According to Amdahl's law, if a fraction f of the workload is sequential and cannot be parallelized, the maximum speedup S with p processors is S = 1 / (f + (1 - f)/p), highlighting that even small sequential portions can severely limit overall gains. This law underscores the challenge of achieving linear speedup as the number of processors increases and emphasizes the need to minimize serial components. Complementing Amdahl's perspective, Gustafson's law, introduced by John L. Gustafson in 1988, addresses the limitations of fixed-size problem assumptions by considering scaled workloads that grow with the available processors. It posits that for a problem with serial fraction f, the scaled speedup is S = p - f(p - 1), where p is the number of processors, allowing near-linear improvements in scenarios such as scientific simulations where the problem size can expand. Together these laws provide a framework for assessing strong scaling (fixed problem size) and weak scaling (proportional problem growth), guiding the design of parallel systems.

Beyond theoretical models, speedup metrics are applied in diverse domains, including memory optimization and benchmark analysis, where they help quantify the benefits of hardware acceleration or software refinements. For instance, in multicore architectures, achieving superlinear speedup (beyond the number of processors) is possible under specific conditions such as cache effects, though it remains rare. Overall, speedup analysis remains crucial for advancing computational efficiency in an era of increasing parallelism.
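
To make the contrast concrete, the short Python sketch below (purely illustrative; the serial fraction f = 0.05 and the processor counts are assumed values, not figures from this article) evaluates both formulas side by side.

    def amdahl_speedup(f, p):
        """Fixed-size (Amdahl) speedup: S = 1 / (f + (1 - f) / p), f = serial fraction."""
        return 1.0 / (f + (1.0 - f) / p)

    def gustafson_speedup(f, p):
        """Scaled-workload (Gustafson) speedup: S = p - f * (p - 1)."""
        return p - f * (p - 1)

    f = 0.05  # hypothetical serial fraction
    for p in (2, 8, 64, 1024):
        print(f"p={p:5d}  Amdahl={amdahl_speedup(f, p):7.2f}  "
              f"Gustafson={gustafson_speedup(f, p):8.2f}")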

Definitions

Latency-Based Speedup

Latency-based speedup is a fundamental metric in performance analysis, emphasizing the reduction in execution time for a fixed workload. It is crucial for comparing sequential systems, in which tasks execute one after another, with parallel systems designed to shorten overall completion time without altering the problem size. In this context, latency represents the time required to process a unit of work, expressed as L = \frac{T}{W}, where T is the total execution time and W is the size of the workload. This definition highlights latency's role as a measure of time per computational unit, distinct from aggregate processing rates.

Latency-based speedup quantifies performance gains as the ratio of the original execution time to the improved execution time for the same fixed workload: S = \frac{T_{\text{original}}}{T_{\text{improved}}}. To connect this to latency, consider a constant workload W. The original time is T_{\text{original}} = L_1 \cdot W and the improved time is T_{\text{improved}} = L_2 \cdot W. Substituting yields S = \frac{L_1 \cdot W}{L_2 \cdot W} = \frac{L_1}{L_2}, demonstrating that speedup equals the ratio of original to reduced latency under fixed-workload conditions. This approach underpins theoretical limits in parallel computing, as explored in models such as Amdahl's law. Historically, early CPU benchmarking prioritized latency metrics: the inaugural SPEC benchmarks, introduced in 1989, evaluated performance via execution times for fixed integer and floating-point tasks relative to a reference machine.
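
The relationship between latency and speedup can be checked numerically. The following Python sketch uses hypothetical timings (the workload size and the 12 s and 4 s execution times are assumptions for illustration) to confirm that, for a fixed workload, the time ratio equals the latency ratio L_1 / L_2.

    def latency(total_time, work):
        """Latency per unit of work: L = T / W."""
        return total_time / work

    def latency_speedup(t_original, t_improved):
        """Latency-based speedup: S = T_original / T_improved (fixed workload)."""
        return t_original / t_improved

    W = 1_000_000       # fixed workload, arbitrary units (hypothetical)
    T_original = 12.0   # seconds on the baseline system (hypothetical)
    T_improved = 4.0    # seconds on the improved system (hypothetical)
    L1, L2 = latency(T_original, W), latency(T_improved, W)
    assert abs(latency_speedup(T_original, T_improved) - L1 / L2) < 1e-9
    print(f"S = {latency_speedup(T_original, T_improved):.2f} = L1/L2 = {L1 / L2:.2f}")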

Throughput-Based Speedup

Throughput-based speedup measures the improvement in a system's capacity to process work over a given time period, particularly in environments where the volume of work can scale with available resources. It is defined as the ratio of the improved throughput Q_{\text{improved}} to the original throughput Q_{\text{original}}: S = \frac{Q_{\text{improved}}}{Q_{\text{original}}}. This metric is essential for evaluating performance in scalable systems, such as servers or batch-processing setups, where the goal is to maximize the rate of task completion rather than minimize the time for a single task.

Throughput Q is the amount of work W completed per unit time T, given by Q = \frac{W}{T}. Higher throughput indicates superior performance in systems designed for high-volume processing, such as multi-core processors or distributed clusters, where increased resources allow greater overall output. In this context, throughput-based speedup S_{\text{throughput}} = \frac{Q_2}{Q_1} can be written as S_{\text{throughput}} = \frac{W_2 / T_2}{W_1 / T_1}. For a fixed workload (W_1 = W_2), this simplifies to S_{\text{throughput}} = \frac{T_1}{T_2}, which equals the latency-based speedup; the difference is one of emphasis, with the throughput form stressing rate improvements for sustained operation rather than the completion time of a single task.

This concept finds particular application in pipelined or multi-processor environments, where workloads can expand proportionally with added resources, enabling linear or near-linear gains in processing capacity. For instance, in multiprocessor systems configured for pipelining, throughput can approach the number of processors as tasks are streamed continuously, outperforming non-pipelined setups by distributing work across stages to maintain steady throughput. In such scenarios, the metric highlights how architectural choices, such as interconnect topologies, directly affect aggregate output without being constrained by single-task completion times.

In concurrent systems, throughput-based speedup relates to concurrency through Little's law, which states that the average number of items in the system L equals the arrival rate (throughput) \lambda times the average response time (latency) W, or L = \lambda W. Rearranged, this yields \lambda = \frac{L}{W}, illustrating how concurrency L enables high throughput by overlapping operations to mask latency in parallel environments such as memory hierarchies or multi-threaded processors. This connection underscores the metric's relevance for designing scalable parallel systems, where balanced concurrency directly boosts overall processing rates.
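
A minimal Python sketch of these definitions follows; the item counts, timings, and Little's-law figures are hypothetical values chosen for illustration, not measurements from the sources cited here.

    def throughput(work, time):
        """Throughput: Q = W / T (work items completed per unit time)."""
        return work / time

    def throughput_speedup(q_original, q_improved):
        """Throughput-based speedup: S = Q_improved / Q_original."""
        return q_improved / q_original

    def littles_law_throughput(concurrency, response_time):
        """Little's law rearranged: lambda = L / W."""
        return concurrency / response_time

    # Fixed workload completed in less time: throughput speedup equals latency speedup.
    q1 = throughput(1000, 10.0)   # 100 items/s on the baseline (hypothetical)
    q2 = throughput(1000, 2.5)    # 400 items/s on the improved system (hypothetical)
    print(f"throughput speedup = {throughput_speedup(q1, q2):.1f}x")

    # Little's law: 32 in-flight requests with a 4 ms average response time.
    print(f"lambda = {littles_law_throughput(32, 0.004):.0f} requests/s")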

Theoretical Models

Amdahl's Law

Amdahl's law establishes the theoretical maximum speedup attainable from parallelizing a computation on multiple processors, assuming the overall problem size remains fixed. Introduced by Gene Amdahl in his 1967 paper arguing for the viability of single-processor enhancements over multiprocessor systems, the law demonstrates that the non-parallelizable, serial fraction of the workload fundamentally caps performance improvements, even as the number of processors grows. The model applies to latency-based speedup, where the focus is on reducing the time to complete a single task.

The law rests on key assumptions: the total workload is invariant with respect to the number of processors, and the serial portion of the execution cannot be divided or accelerated by additional processing units. These premises imply that speedup is inherently limited by serial components, such as initialization, I/O operations, or control logic that must execute sequentially. As a result, even perfect parallelization of the remaining work yields diminishing returns beyond a certain number of processors.

To derive the law, begin with the execution time on a single processor, denoted T(1), which decomposes into serial and parallelizable components: T(1) = T_s + T_p, where T_s is the time for the serial portion and T_p is the time for the parallelizable portion. Let P represent the fraction of T(1) that is parallelizable, so T_s = (1 - P) T(1) and T_p = P T(1), with 0 \leq P \leq 1. With N processors, the serial time remains T_s = (1 - P) T(1), but the parallel time reduces to T_p / N = P T(1) / N, assuming ideal load balancing and no overhead. The total execution time then becomes T(N) = (1 - P) T(1) + \frac{P T(1)}{N}. The speedup S(N), defined as the ratio of single-processor time to multiprocessor time, is S(N) = \frac{T(1)}{T(N)} = \frac{T(1)}{(1 - P) T(1) + \frac{P T(1)}{N}} = \frac{1}{(1 - P) + \frac{P}{N}}. This yields the core formula of Amdahl's law, S(N) \leq \frac{1}{(1 - P) + \frac{P}{N}}, where the inequality accounts for potential inefficiencies in practice. The maximum possible speedup, as N \to \infty, approaches 1 / (1 - P), bounded solely by the serial fraction.

For illustration, consider a program where 80% of the execution time is parallelizable (P = 0.8). With 2 processors, the speedup is S(2) = 1 / (0.2 + 0.8/2) = 1 / 0.6 \approx 1.67. With 16 processors, S(16) = 1 / (0.2 + 0.8/16) = 1 / 0.25 = 4. As N \to \infty, S(\infty) = 1 / 0.2 = 5, revealing how returns diminish rapidly after the initial gains, as the serial portion dominates. A key limitation of Amdahl's law is its underestimation of speedup for problems amenable to weak scaling, where larger workloads can be tackled with more processors without proportionally increasing serial time, since the model enforces a fixed problem size.
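
The derivation can be verified numerically. The sketch below (illustrative only; the single-processor time of 100 time units is an assumed figure) computes the speedup both from the decomposed execution times and from the closed-form expression, reproducing the worked example with P = 0.8.

    def time_on_n(t1, P, n):
        """Execution time on n processors: serial part plus parallel part / n."""
        return (1 - P) * t1 + P * t1 / n

    def amdahl(P, n):
        """Closed-form Amdahl speedup: 1 / ((1 - P) + P / n)."""
        return 1.0 / ((1 - P) + P / n)

    t1, P = 100.0, 0.8   # hypothetical single-processor time, 80% parallelizable
    for n in (2, 16, 1024):
        s_from_times = t1 / time_on_n(t1, P, n)
        assert abs(s_from_times - amdahl(P, n)) < 1e-9   # the two forms agree
        print(f"N={n:5d}  S={amdahl(P, n):.2f}")
    print(f"limit as N -> infinity: {1 / (1 - P):.1f}")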

Gustafson's Law

Gustafson's law, proposed by John L. Gustafson in 1988, provides a theoretical framework for understanding speedup in parallel computing when the workload scales with the number of processors, offering a counterpoint to models that assume fixed problem sizes. Unlike approaches focused on fixed workloads, Gustafson's model assumes that additional processing resources enable tackling larger problems, keeping the execution time roughly constant while the serial fraction remains a small, fixed portion of the total scaled workload. This perspective highlights the potential for near-linear speedup in scenarios where parallelizable portions dominate and can expand with available compute power, making it particularly relevant for high-performance computing (HPC) environments that routinely scale simulations to leverage massive parallelism.

The law's core formula expresses the scaled speedup S for N processors as S = N - (1 - P)(N - 1), where P is the parallelizable fraction of the scaled workload, or equivalently S = (1 - P) + P \cdot N, emphasizing that the parallel time component scales linearly with N while the serial time stays fixed. This formulation arises from holding the total execution time on the parallel system constant (normalized to 1), with the serial time s fixed and the parallel time p = 1 - s distributed across N processors; the hypothetical single-processor time for the full scaled problem then becomes s + p \cdot N, and the scaled speedup is the ratio of that time to the parallel execution time. Gustafson's derivation underscores that as N increases, the parallel fraction P can approach 1 for well-designed scalable applications, allowing speedup to grow nearly linearly, which demonstrates why fixed-workload models may appear overly pessimistic for weak scaling in modern HPC.

For instance, in a hypothetical weather simulation, a single processor might model a coarse grid over a limited area in a fixed time; with N processors, the problem scales to a finer-resolution grid covering a larger region, where the serial setup remains fixed but the simulation of atmospheric dynamics distributes the expanded work evenly, achieving speedup close to N if the parallel fraction P is high.
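
The equivalence between the closed form and the time-ratio derivation can be illustrated with a short Python sketch; the normalized serial time s = 0.05 is an assumed value, not one taken from Gustafson's paper.

    def gustafson(serial_fraction, n):
        """Closed form: S = N - f * (N - 1), with f the serial fraction of the scaled run."""
        return n - serial_fraction * (n - 1)

    def scaled_speedup_from_times(serial_time, parallel_time, n):
        """Ratio of the hypothetical single-processor time to the fixed parallel time."""
        parallel_run = serial_time + parallel_time              # held constant
        hypothetical_serial = serial_time + parallel_time * n   # scaled problem on one CPU
        return hypothetical_serial / parallel_run

    s, p = 0.05, 0.95   # normalized serial and parallel portions of the parallel run
    for n in (8, 64, 1024):
        assert abs(gustafson(s, n) - scaled_speedup_from_times(s, p, n)) < 1e-9
        print(f"N={n:5d}  scaled speedup={gustafson(s, n):.1f}")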

Examples and Calculations

Basic Computational Examples

In latency-based speedup, the performance improvement is quantified as the ratio of the serial execution time to the parallel execution time, S = \frac{T_{\text{serial}}}{T_{\text{parallel}}}. A fundamental example is parallelizing a loop, such as one that sums the elements of a large array or estimates pi via Monte Carlo sampling. The serial version executes on a single processor. When the loop is parallelized across multiple processors using domain decomposition (e.g., dividing the iterations evenly among threads with no data dependencies), the parallel time ideally scales inversely with the number of processors. Real implementations, however, often achieve sub-linear speedup due to overheads such as synchronization, load imbalance, or communication costs, falling short of the ideal linear speedup S = p (where p is the processor count); the attainable speedup is further limited by factors such as those described by Amdahl's law.

Another illustrative case arises in microarchitectural optimization, such as improving branch prediction accuracy to reduce pipeline stalls. For instance, on SPEC CPU integer benchmarks, improving the predictor via code placement techniques can reduce mispredictions by up to 22%, boosting prediction accuracy to around 92% in cases like 176.gcc (the misprediction rate falling from 13.2% to 7.7%). This leads to execution time reductions of up to 16%, yielding a speedup of approximately 1.16×, assuming a fixed instruction count and clock rate with fewer stall cycles from reduced mispredictions.

Speedup concepts also extend to microarchitectural metrics like cycles per instruction (CPI). For a program with 10^9 instructions executing at a 1 GHz clock, a baseline CPI of 3 results in T_{\text{serial}} = 10^9 \times 3 / 10^9 = 3 seconds (since cycles = instructions × CPI, and time = cycles / frequency). Reducing the CPI to 2 through techniques like better instruction scheduling cuts the cycle count to 10^9 \times 2, yielding T_{\text{parallel}} = 2 seconds and S = 3 / 2 = 1.5\times. Equivalently, in terms of instructions per cycle (IPC = 1/CPI), increasing IPC from 0.333 to 0.5 achieves the same speedup, as higher IPC directly lowers execution time for a fixed instruction count and frequency. These computations underscore the gap between ideal linear speedup, where performance scales directly with resources, and practical outcomes constrained by factors like load imbalance or hardware limits.
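
The CPI example above can be reproduced directly; the sketch below uses only the figures quoted in the text (10^9 instructions, a 1 GHz clock, and CPI values of 3 and 2).

    def exec_time(instructions, cpi, clock_hz):
        """time = instructions * CPI / frequency (i.e., total cycles / frequency)."""
        return instructions * cpi / clock_hz

    instr, clock = 1e9, 1e9                    # 10^9 instructions at 1 GHz
    t_base = exec_time(instr, 3.0, clock)      # CPI = 3 -> 3 s
    t_opt = exec_time(instr, 2.0, clock)       # CPI = 2 -> 2 s
    print(f"baseline {t_base:.1f} s, optimized {t_opt:.1f} s, "
          f"speedup {t_base / t_opt:.2f}x")    # 1.50x
    # Equivalent IPC view: IPC rises from 1/3 to 1/2 for the same 1.5x gain.
    print(f"IPC: {1/3:.3f} -> {1/2:.3f}")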

Performance Efficiency Metrics

In parallel computing, performance efficiency, often denoted η, quantifies the fraction of computational resources that is effectively utilized to achieve a given speedup. It is formally defined as the ratio of the speedup S to the number of processors p, expressed as η = S / p. In the ideal case η = 1, corresponding to perfect linear scaling where the speedup equals the number of processors. In practice, η is typically less than 1 due to factors such as communication overhead, load imbalance, and synchronization costs, reflecting real-world hardware utilization.

An equivalent formulation uses execution times directly: η = (T₁ / Tₚ) / p, where T₁ is the execution time on a single processor and Tₚ is the execution time on p processors. This form highlights efficiency as speedup normalized by resource scaling, providing a measure that isolates performance gains from mere resource addition. Efficiency curves, often plotted as η versus p, graphically illustrate scaling behavior; for instance, a steady drop in η indicates increasing overhead dominance as processors are added, while a plateau near 1 suggests strong scalability.

Efficiency is particularly valuable over raw speedup for comparing parallel architectures because it incorporates a cost-benefit view, revealing whether additional resources yield proportional returns in a resource-constrained setting. For example, achieving a 4× speedup on 8 processors gives η = 0.5, or 50% efficiency, indicating that only half of the added resources contributed effectively to the performance gain, a scenario common in basic parallelizations. The metric thus guides architects in evaluating trade-offs, prioritizing systems where high η balances computational power against practical overheads.
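
Both formulations of efficiency can be computed in a few lines; in the sketch below, the 80 s and 20 s timings are hypothetical values chosen to reproduce the 4×-on-8-processors example from the text.

    def efficiency(speedup, processors):
        """eta = S / p; eta = 1 corresponds to perfect linear scaling."""
        return speedup / processors

    def efficiency_from_times(t1, tp, processors):
        """Equivalent form: eta = (T1 / Tp) / p."""
        return (t1 / tp) / processors

    # 4x speedup on 8 processors -> 50% efficiency.
    print(f"eta = {efficiency(4.0, 8):.2f}")
    # Same result from raw times: 80 s serial, 20 s on 8 processors (hypothetical).
    print(f"eta = {efficiency_from_times(80.0, 20.0, 8):.2f}")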

Advanced Concepts

Super-Linear Speedup

Super-linear speedup refers to the phenomenon in parallel computing where the observed speedup S exceeds the number of processors N, such that S > N. For instance, a system might achieve a 10× speedup using only 8 processors, surpassing the linear expectation of S = N. This effect arises from several hardware and algorithmic factors that enhance efficiency beyond the theoretical linear bound. One primary cause is cache effects: multiple processors together provide a larger aggregate cache, reducing contention and cache misses compared to a single-processor sequential execution; for example, data partitions fit more readily into per-processor caches, minimizing accesses to main memory. Another contributor is the increased total memory available in parallel systems, enabling the processing of larger datasets or more complex problem instances that exceed the memory limits of a sequential machine, thus avoiding paging or swapping overheads. Additionally, in search algorithms employing heuristics, distributing the search space across processors can lead to earlier discovery of solutions, as independent explorations reduce redundant computations inherent in sequential search orders.

Notable examples occur in optimization problems, particularly with SAT solvers, where parallel implementations exhibit super-linear speedup due to effective partitioning of the search space that uncovers satisfiable assignments more rapidly than sequential methods. In parallel SAT solving, this has been observed in divide-and-conquer approaches, yielding speedups greater than the processor count for certain instances by splitting the search space across threads. Super-linearity does not indicate flaws in foundational models like Amdahl's law or Gustafson's law, but rather deviations from their idealized assumptions, such as uniform memory access and a perfectly comparable sequential baseline; in practice, parallel executions often benefit from interactions like expanded caching that sequential runs cannot replicate. Early observations of the phenomenon appeared in parallel search and database systems during the late 1980s and early 1990s, with researchers noting "speedup anomalies" in search algorithms on distributed-memory architectures.
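
A toy model can illustrate the cache-effect mechanism. The Python sketch below is not a measurement of any real system; the data size, cache size, and miss penalty are invented parameters, chosen so that per-processor partitions fit into cache once enough processors are used, at which point the computed speedup exceeds N.

    def run_time(work_items, working_set_mb, cache_mb, processors=1):
        """Each processor handles work/N items; the cost per item depends on
        whether its partition (working_set/N) fits in its private cache."""
        per_proc_set = working_set_mb / processors
        cost_per_item = 1.0 if per_proc_set <= cache_mb else 2.0  # assumed miss penalty
        return (work_items / processors) * cost_per_item

    work, data, cache = 1_000_000, 64.0, 16.0   # 64 MB data, 16 MB cache per processor
    t1 = run_time(work, data, cache, processors=1)
    for n in (2, 4, 8):
        tn = run_time(work, data, cache, processors=n)
        print(f"N={n}  speedup={t1 / tn:.1f}  (linear would be {n})")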

Limitations and Scalability

Achieving speedup in parallel computing is constrained by several inherent limitations, including communication overhead, load imbalance, and the serial fraction of computations described by Amdahl's law. Communication overhead arises from the time required to exchange data between processors, which can dominate execution time in distributed systems and reduce overall efficiency, particularly as the number of processors increases. Load imbalance occurs when computational tasks are unevenly distributed across processors, leaving some units idle while others remain busy, thereby limiting parallel efficiency. In practice, Amdahl's serial fraction, the portion of a program that must execute sequentially, imposes a fundamental cap on speedup even with ideal parallelization of the remaining code; for instance, if 5% of the workload is serial, the maximum speedup is theoretically 20x regardless of processor count.

Scalability challenges compound these limitations. Strong scaling keeps the problem size fixed and adds processors to reduce time, while weak scaling grows the problem size in proportion to the processor count to maintain constant time. Strong scaling is often hindered by the overheads above, following Amdahl's law, while weak scaling can be limited by resource constraints such as memory capacity. The iso-efficiency function provides a metric for assessing scalability by quantifying how much the problem size must grow to offset parallel overheads and maintain a given efficiency; for memory-bound algorithms, this function reveals limits where communication or I/O costs prevent sustained efficiency as the processor count rises.

The slowdown of Moore's law since around 2020 has amplified these constraints on speedup potential, as transistor density improvements have decelerated from doubling every two years to roughly every three, shifting reliance toward architectural innovations like parallelism without alleviating the fundamental bottlenecks. As of 2025, hybrid integrations of GPUs and TPUs in machine learning workloads face specific challenges, such as compatibility issues across software frameworks and high transfer latencies between heterogeneous accelerators, which can erode expected speedups for large models. Looking ahead, quantum computing may redefine these limits by enabling exponential speedups for problems such as integer factorization via algorithms like Shor's, while neuromorphic systems could offer brain-like efficiency for certain workloads, though both remain in early stages with practical gains unproven. Super-linear speedups, though rare exceptions, do not broadly overcome these barriers.
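
A simple analytical model can make the strong- versus weak-scaling distinction concrete. In the sketch below, the logarithmic communication term and its coefficient are assumptions chosen for illustration, not properties of any particular machine.

    import math

    def parallel_time(work, n, comm=0.01):
        """Toy model: per-processor compute time work/N plus a log2(N) communication term."""
        overhead = comm * math.log2(n) if n > 1 else 0.0
        return work / n + overhead

    w = 1.0   # normalized baseline workload: one time unit on one processor
    for n in (2, 8, 64, 512):
        strong_speedup = w / parallel_time(w, n)        # fixed problem size
        weak_efficiency = w / parallel_time(w * n, n)   # problem grows with N
        print(f"N={n:4d}  strong-scaling speedup={strong_speedup:6.2f}  "
              f"weak-scaling efficiency={weak_efficiency:.2f}")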

Historical Development

Origins in Parallel Computing

The conceptual foundations of speedup in parallel computing originated in the 1940s amid the development of early electronic computers such as ENIAC, where John von Neumann explored ideas of concurrent operation to enhance computational efficiency. During modifications to ENIAC in 1948, von Neumann oversaw configurations that enabled arithmetic and transfer operations to execute concurrently, distinguishing the machine from purely sequential processing and foreshadowing parallel execution principles.

In the 1960s, the field advanced with the introduction of vector-processing influences and early multiprocessing architectures, prominently featured in the CDC 6600 released in 1964. Designed by Seymour Cray, the CDC 6600 used multiple functional units and pipelined operations to achieve up to three million floating-point operations per second, three times faster than contemporaries such as the IBM 7030 Stretch, thereby demonstrating initial practical gains in parallel speedup. The architecture's emphasis on overlapping operations reduced effective latency and influenced subsequent vector-based designs for supercomputing. The term "speedup" gained formal usage in ILLIAC IV project documentation during the mid-1960s, where it quantified efficiency improvements from massively parallel processing across 64 processing elements. Project reports applied "speedup" to describe performance factors, such as achieving up to 64x gains in simultaneous computations like sine function evaluations, directly linking the concept to parallel efficiency metrics. Seymour Cray's later innovations further advanced vector speedup ideas by integrating high-speed pipelines that enabled concurrent data handling in supercomputers.

These early developments were propelled by Cold War imperatives, as U.S. military needs for rapid simulations and ballistic calculations demanded reductions in computation time through parallel architectures to maintain technological superiority over the Soviet Union. Such demands, channeled through agencies such as DARPA, funded projects like ILLIAC IV to explore scalable systems for defense applications. Building on these foundations, Gene Amdahl's 1967 formulation of speedup limits marked a pivotal theoretical milestone in assessing parallelization potential.

Key Milestones and Evolutions

The publication of Gene Amdahl's paper "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities" at the 1967 AFIPS Spring Joint Computer Conference formalized the theoretical limits of speedup, establishing a foundational model for evaluating multiprocessor performance based on the fraction of serial work in a computation. In 1988, John L. Gustafson challenged Amdahl's fixed-problem-size assumption in his influential paper "Reevaluating Amdahl's Law," published in Communications of the ACM, by introducing scaled speedup that accounts for problem sizes growing with the processor count, thereby supporting the viability of massive parallelism for large-scale scientific applications.

During the 1990s and 2000s, the development of the Message Passing Interface (MPI) standard, first released by the MPI Forum in 1994, facilitated portable parallel programming across distributed systems, enabling widespread adoption in high-performance computing. Concurrently, Beowulf clusters, inexpensive networks of commodity PCs pioneered by NASA researchers in 1994, demonstrated practical parallel performance, including observed super-linear speedups in applications such as search algorithms due to cache locality effects and reduced contention. Super-linear speedup thus emerged as a notable milestone, highlighting deviations from theoretical bounds in real-world parallel implementations.

The 2010s marked a surge in GPU acceleration following NVIDIA's release of the CUDA programming model in 2006, which unlocked general-purpose computing on graphics processors and delivered speedups exceeding 100x over CPUs in machine learning tasks, such as training neural networks for image recognition. In 2024, El Capitan at Lawrence Livermore National Laboratory became the world's fastest supercomputer, achieving 1.742 exaflops and further advancing heterogeneous architectures for AI and scientific simulations. As of 2025, exascale computing, exemplified by leading systems such as El Capitan (1.742 exaflops) and Frontier at Oak Ridge National Laboratory (1.35 exaflops), has demonstrated Gustafson-like scaling in large-scale simulations, such as cosmology models resolving structures involving billions of galaxies, by efficiently using heterogeneous architectures to handle expanded problem sizes. Over this period, speedup paradigms evolved from CPU-centric designs to heterogeneous systems integrating CPUs, GPUs, and FPGAs, as seen in modern supercomputers like Frontier and El Capitan, which pair CPUs with GPUs to optimize diverse workloads spanning AI and scientific modeling.
