
Instructions per second

Instructions per second (IPS) is a fundamental metric in computing that quantifies the execution speed of a central processing unit (CPU) by counting the number of machine instructions it processes within one second. This measure provides an indication of raw computational throughput, though it varies based on factors such as the instruction set architecture (ISA), clock frequency, and cycles per instruction (CPI). Commonly scaled into units like millions of instructions per second (MIPS), billions (GIPS), or trillions (TIPS), IPS originated in the early days of computing to benchmark processor performance against reference systems, such as the VAX-11/780, defined as 1 MIPS in 1977. The formula for MIPS is typically expressed as MIPS = Instruction Count / (Execution Time × 10⁶), where execution time is in seconds, or alternatively as MIPS = Clock Rate / (CPI × 10⁶), highlighting its dependence on hardware clock speed and the average number of clock cycles required per instruction.

Historically, ratings were derived from synthetic benchmarks such as Whetstone or Dhrystone, which simulated instruction mixes to estimate typical performance, but these often favored simpler instructions and compiler optimizations. For instance, a 1994 Pentium-based PC achieved around 66 MIPS, while modern multi-core CPUs in 2024 can exceed billions of instructions per second through parallelism and advanced architectures.

Despite its utility in early comparisons, IPS has significant limitations as a standalone metric, earning the backronym "Meaningless Indicator of Processor Speed" due to inconsistencies across different ISAs and workloads: a RISC processor might execute more simple instructions per second than a CISC one, yet deliver comparable or inferior real-world results. It fails to account for instruction complexity, memory access latencies, or application-specific demands, making execution time or benchmarks like SPEC more reliable for comprehensive evaluations. Today, while IPS remains relevant for low-power embedded systems and historical analysis, it is often supplemented by metrics such as floating-point operations per second (FLOPS) for scientific computing and overall system throughput in high-performance contexts.

Fundamentals

Definition in Computing

Instructions per second (IPS) is a measure of a computer's processor speed, defined as the number of instructions that the central processing unit (CPU) can execute in one second. This metric originated from early performance evaluations of digital computers, which focused on the rate at which machines could process basic computational operations. In this historical context, IPS emerged as a fundamental performance metric for central processing units (CPUs) during the mainframe era, serving to quantify execution speed in a way distinct from clock speed, which measures the frequency of processor cycles, or throughput, which accounts for broader system output including input/output operations. It allowed engineers and researchers to assess and compare the raw computational capabilities of processors in isolation from other system components. Early computers like the UNIVAC I, delivered in 1951, exemplified this approach by achieving approximately 2,000 instructions per second, marking an initial benchmark for commercial systems. An instruction, in this metric, refers to a fundamental operation encoded in machine language that the CPU performs, such as arithmetic computations (e.g., addition or multiplication), data movement via load and store operations, or control-flow directives like conditional branches. These elemental commands form the core of any executable program, translating high-level software into hardware-executable actions. IPS plays a crucial role in assessing processor efficiency for general-purpose computing tasks, providing a standardized way to evaluate how effectively a CPU handles diverse workloads like scientific calculations or data processing. Its adoption in the 1960s facilitated direct comparisons between mainframes and emerging minicomputers; for instance, lower-end IBM System/360 models from 1964 executed about 75,000 instructions per second, while the CDC 6600 supercomputer of the same era reached 3 million instructions per second, highlighting rapid advancements in processor design.

Core Measurement Principles

Instructions per second (IPS) quantifies the raw rate at which a processor executes machine instructions under ideal conditions, focusing solely on computational throughput while assuming no delays from I/O operations, memory access stalls, or other system-level bottlenecks. This metric isolates the processor's intrinsic execution capability, providing a baseline for comparing architectural efficiency in controlled environments. The fundamental formula for IPS is derived from the total instructions executed divided by the elapsed execution time:

\text{IPS} = \frac{\text{Number of instructions executed}}{\text{Time in seconds}}

This approach is applied in simple benchmarks, such as Dhrystone, a synthetic benchmark consisting of a fixed loop of integer and string operations; for instance, on the VAX-11/780 baseline system using Berkeley Unix Pascal, approximately 483 instructions execute in 700 microseconds, yielding about 0.69 MIPS (millions of instructions per second). Such benchmarks emphasize straightforward counting of instruction completions over complex workloads to establish relative performance scales. IPS can also be expressed in terms of hardware parameters, incorporating the processor's clock rate (cycles per second) and the average cycles per instruction (CPI):

\text{IPS} = \frac{\text{Clock rate}}{\text{CPI}}

Here, CPI represents the mean number of clock cycles needed to complete one instruction, which varies by instruction type and implementation; lower CPI values, often achievable through optimized designs, directly boost IPS for a given clock rate. Measurements under this model assume sequential instruction execution without pipeline overlaps, multithreading, or other forms of parallelism, ensuring the metric reflects unadulterated single-threaded throughput. Despite its utility, IPS is a simplistic metric with inherent limitations, as it overlooks differences in instruction complexity across architectures: reduced instruction set computing (RISC) designs typically feature simpler instructions with lower CPI but may require more total instructions for equivalent functionality, while complex instruction set computing (CISC) approaches use multifaceted instructions that inflate CPI despite fewer overall executions. This disregard for semantic equivalence can lead to misleading comparisons, underscoring IPS's role as a narrow indicator rather than a comprehensive performance gauge.
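
Both routes to an IPS estimate can be shown in a short calculation. The sketch below reuses the Dhrystone-on-VAX figures from this section; the clock-rate and CPI values in the second route are illustrative assumptions, not measured values.

```python
# Two equivalent ways to estimate IPS, following the definitions in this section.

def ips_from_counts(instructions, seconds):
    """IPS = instructions executed / elapsed time in seconds."""
    return instructions / seconds

def ips_from_clock(clock_hz, cpi):
    """IPS = clock rate / average cycles per instruction."""
    return clock_hz / cpi

# Route 1: direct counting (Dhrystone loop on the VAX-11/780 baseline:
# 483 instructions in 700 microseconds).
ips = ips_from_counts(483, 700e-6)
print(f"{ips / 1e6:.2f} MIPS")      # ~0.69 MIPS

# Route 2: hardware parameters (assumed 5 MHz clock and average CPI of 7).
ips = ips_from_clock(5e6, 7)
print(f"{ips / 1e6:.2f} MIPS")      # ~0.71 MIPS, a similar order of magnitude
```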

Units and Scaling

Standard Units

The primary unit for measuring instructions per second (IPS) is simply IPS itself, representing the number of instructions a processor executes in one second. To denote larger scales, metric prefixes are applied, such as kIPS for thousands of instructions per second (1 kIPS = 1,000 IPS), MIPS for millions (1 MIPS = 1,000,000 IPS), and GIPS for billions (1 GIPS = 1,000,000,000 IPS). These prefixed units facilitate practical reporting of performance, particularly as computing power grew beyond basic IPS counts in the late 1960s and 1970s. The term MIPS originated in the 1970s as a marketing and comparative metric for mainframe and minicomputer performance, allowing vendors to quantify and advertise processing speeds in a standardized way. By the 1980s, MIPS became a widely adopted industry shorthand, despite criticisms of its limitations in accounting for instruction complexity across architectures. For instance, Digital Equipment Corporation's VAX-11/780, released in 1977 and a benchmark for early minicomputers, was rated at 1 MIPS based on its execution of typical workloads, serving as a reference point for subsequent systems. In industry standards, MIPS-like metrics influenced benchmark suites such as those from the Standard Performance Evaluation Corporation (SPEC), founded in 1988, where early scores were normalized relative to the VAX-11/780's 1 MIPS performance to provide comparable ratings across diverse hardware. This integration helped MIPS units gain traction in performance reporting for servers and workstations, though SPEC later evolved to more comprehensive integer and floating-point metrics to address MIPS's shortcomings. Today, while direct MIPS usage has declined in favor of workload-specific benchmarks, the unit remains a foundational concept for understanding processor throughput in historical and architectural contexts.
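
As a quick illustration of how the prefixed units relate, the sketch below converts a raw IPS figure into kIPS, MIPS, or GIPS; the sample values are chosen only to echo the systems mentioned in this article, and the helper function is hypothetical.

```python
# Scale a raw instructions-per-second figure into the nearest metric-prefixed unit.

def format_ips(ips):
    """Pick the largest prefix (GIPS, MIPS, kIPS) that keeps the value >= 1."""
    for factor, unit in ((1e9, "GIPS"), (1e6, "MIPS"), (1e3, "kIPS")):
        if ips >= factor:
            return f"{ips / factor:.2f} {unit}"
    return f"{ips:.0f} IPS"

print(format_ips(2_000))        # UNIVAC I-era rate: "2.00 kIPS"
print(format_ips(1_000_000))    # VAX-11/780 reference: "1.00 MIPS"
print(format_ips(3.5e10))       # hypothetical modern core: "35.00 GIPS"
```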

Scaling to Larger Metrics

As computing demands grew in high-performance systems, the million instructions per second (MIPS) unit proved insufficient, leading to scaled metrics such as giga instructions per second (GIPS) for systems processing billions of instructions and tera instructions per second (TIPS) for trillions, commonly applied to supercomputers and clustered environments. These larger units emerged to quantify aggregate performance in vector-based and massively parallel architectures, where individual processor speeds alone could not capture overall throughput. TIPS, however, is less commonly used in modern contexts, as high-performance computing has shifted toward floating-point operations per second (FLOPS) metrics.

In multiprocessor and multi-core systems, aggregate IPS is conceptually calculated as the product of the number of cores and the average IPS per core, assuming ideal scaling without overheads: Total IPS = Cores × Average IPS per core. However, this formula represents an upper bound, as real-world scaling faces significant challenges due to Amdahl's law, which demonstrates that non-parallelizable serial components limit overall speedup, reducing the practical meaning of summed IPS in highly parallel environments. For instance, even if 99% of a program is parallelizable, adding more processors yields no speedup beyond a factor of 100, rendering simple aggregation misleading for performance evaluation. To address these limitations, modern adaptations like effective MIPS incorporate workload-specific adjustments, accounting for factors such as instruction complexity and execution efficiency to yield a more realistic performance metric beyond raw counts.

In the 1980s and 1990s, this progression manifested in vector processors, such as the Soviet Union's PS-2100 system achieving 1.5 GIPS in 1990, highlighting the shift to GIPS for capturing vectorized throughput in supercomputing. By the 2020s, while aggregate IPS concepts can theoretically scale to zetta (10^21) levels in massive clusters, practical measurements in high-performance computing emphasize FLOPS and workload-adjusted variants to mitigate Amdahl's constraints in distributed environments.
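
The gap between the ideal aggregate and the Amdahl-limited result can be made concrete with a short calculation. The sketch below assumes a hypothetical 1 GIPS per core and the 99%-parallel workload used in the example above.

```python
# Ideal aggregate IPS versus the speedup ceiling imposed by Amdahl's law.

def ideal_aggregate_ips(cores, ips_per_core):
    """Upper bound: every core contributes its full rating."""
    return cores * ips_per_core

def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

cores = 1024
ips_per_core = 1e9      # hypothetical 1 GIPS per core
p = 0.99                # 99% of the program parallelizes

ideal = ideal_aggregate_ips(cores, ips_per_core)
effective = ips_per_core * amdahl_speedup(p, cores)

print(f"Ideal aggregate: {ideal / 1e9:.0f} GIPS")        # 1024 GIPS
print(f"Amdahl-limited:  {effective / 1e9:.0f} GIPS")    # ~91 GIPS; never exceeds 100x one core
```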

Instruction Mixes

The Gibson Mix (1959)

The Gibson Mix was developed in 1959 by Jack C. Gibson, an IBM engineer, based on traces from 17 programs run on the IBM 704 and 650 computers, totaling approximately 9 million instructions. This mix aimed to provide a representative sample of instruction frequencies in scientific computing workloads, enabling more realistic evaluations of processor performance beyond simplistic single-instruction benchmarks. The mix categorized instructions into 13 classes, emphasizing data movement and arithmetic operations typical of early scientific applications on mainframes. The following table details the percentage distribution for each class:
Instruction class                     Percentage
Load and store                        31.2
Indexing                              18.0
Branches                              16.6
Floating add and subtract              6.9
Fixed-point add and subtract           6.1
Instructions not using registers       5.3
Shifting                               4.4
Compares                               3.8
Floating multiply                      3.8
Logical (and, or, etc.)                1.6
Floating divide                        1.5
Fixed-point multiply                   0.6
Fixed-point divide                     0.2
These weights highlighted the dominance of load/store and indexing operations (collectively 49.2%), reflecting register-limited architectures, alongside floating-point arithmetic (12.2%) for numerical computations. As the first widely adopted instruction mix, it significantly influenced early instructions-per-second ratings for systems like the IBM 7090 and later informed the design of the System/360 series by providing a standardized basis for comparing processor speeds across diverse workloads. Its legacy endures as a foundational model for subsequent benchmarks, such as those for VAX systems, though it became outdated for modern software due to evolving instruction sets and application patterns.
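
Instruction mixes like Gibson's were used to weight per-class execution times into a single speed rating. The sketch below applies the percentages from the table to a set of hypothetical per-class instruction times (not Gibson's original timings) to show how the weighted average and the resulting kIPS figure are computed.

```python
# Weighted-average instruction time and implied kIPS rating from the Gibson Mix.
# Percentages come from the table above; the per-class times (in microseconds)
# are invented for illustration and do not describe any specific machine.

gibson_mix = {  # class: weight in percent (sums to 100.0)
    "load/store": 31.2, "indexing": 18.0, "branches": 16.6,
    "float add/sub": 6.9, "fixed add/sub": 6.1, "no-register": 5.3,
    "shifting": 4.4, "compares": 3.8, "float multiply": 3.8,
    "logical": 1.6, "float divide": 1.5, "fixed multiply": 0.6,
    "fixed divide": 0.2,
}

hypothetical_times_us = {  # class: assumed execution time in microseconds
    "load/store": 4.0, "indexing": 3.0, "branches": 3.0,
    "float add/sub": 8.0, "fixed add/sub": 4.0, "no-register": 2.0,
    "shifting": 3.0, "compares": 3.0, "float multiply": 12.0,
    "logical": 2.0, "float divide": 30.0, "fixed multiply": 10.0,
    "fixed divide": 25.0,
}

avg_time_us = sum(gibson_mix[c] / 100 * hypothetical_times_us[c] for c in gibson_mix)
kips = 1_000 / avg_time_us  # 1e6 us per second / avg time, expressed in thousands of IPS

print(f"Weighted average instruction time: {avg_time_us:.2f} microseconds")
print(f"Implied rating: {kips:.0f} kIPS")   # ~223 kIPS for these assumed timings
```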

VAX MIPS Variations

VAX MIPS emerged in the late 1970s as a performance metric for Digital Equipment Corporation's VAX computer systems, calibrating the VAX-11/780 as the reference machine rated at 1 MIPS based on its execution of a mix of simple instructions. One variant relied on synthetic benchmarks emphasizing integer and string operations with straightforward instructions, yielding the nominal 1 MIPS rating for the VAX-11/780 under ideal conditions. In contrast, another variant incorporated an OS-like instruction mix, featuring higher proportions of system calls, subroutine linkages, and complex operations such as character string moves (e.g., MOVC3) and conversions (e.g., CVTTP), which were prevalent in commercial workloads; this resulted in 20-30% lower effective ratings due to increased overhead from cache misses and longer execution times per instruction. These variations highlighted how the benchmark-based approach overestimated performance by neglecting real-world OS interactions and workload complexities, since the complex operations raised the average cycles per instruction (CPI). Building on earlier concepts like the Gibson Mix for scientific computing, VAX MIPS variations became a standard for commercial performance reporting through the 1980s, ultimately revealing fundamental inconsistencies in IPS measurements that prompted the shift to more comprehensive benchmark suites like SPEC.
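
The effect of the heavier OS-like mix can be approximated with the clock-rate/CPI form of the MIPS formula. The clock rate and both CPI values below are illustrative assumptions, chosen only to reproduce a drop in roughly the 20-30% range described above, not measurements of the VAX-11/780.

```python
# Approximate effect of an OS-like instruction mix on an effective MIPS rating.
# All parameters are assumed for illustration.

clock_hz = 5e6          # assumed 5 MHz clock
cpi_simple_mix = 5.0    # assumed average CPI for a simple benchmark mix
cpi_os_mix = 6.7        # assumed higher CPI once string moves, system calls, etc. are included

mips_simple = clock_hz / cpi_simple_mix / 1e6
mips_os = clock_hz / cpi_os_mix / 1e6

print(f"Simple-mix rating: {mips_simple:.2f} MIPS")          # 1.00 MIPS
print(f"OS-like mix:       {mips_os:.2f} MIPS "
      f"({(1 - mips_os / mips_simple) * 100:.0f}% lower)")    # ~0.75 MIPS, ~25% lower
```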

Modern Instruction Mixes

The evolution of instruction mixes for evaluating instructions per second (IPS) has shifted toward standardized benchmarks that better reflect contemporary computing demands, beginning with the establishment of the Standard Performance Evaluation Corporation (SPEC) in 1988. SPEC CPU benchmarks, first released in 1989, introduced suites like SPECint and SPECfp, which incorporate a balanced mix of integer and floating-point instructions derived from real-world applications, such as scientific simulations and compression tasks. These mixes emphasize compute-intensive operations, with SPEC CPU 2017 featuring 43 benchmarks across integer and floating-point categories to provide a more comprehensive assessment of processor performance under mixed workloads. This approach marked a departure from earlier, less diverse mixes by prioritizing portability and relevance to modern software ecosystems.

In the realm of machine learning, modern instruction mixes have adapted to prioritize tensor and matrix operations critical for training and inference. The MLPerf benchmark suite, developed by MLCommons since 2018, focuses on end-to-end workloads where matrix multiplications and convolutions dominate, often comprising the bulk of computational instructions in models like BERT and ResNet-50. Microbenchmark suites such as DeepBench evaluate granular operations such as dense matrix multiplications, which form a substantial portion of the instruction stream in deep learning tasks, enabling fair comparisons across hardware accelerators. These mixes highlight the growing importance of vectorized tensor instructions, adjusting metrics to account for data-level parallelism in training and inference pipelines.

Cloud computing workloads necessitate instruction mixes that integrate significant I/O operations alongside computational tasks, as seen in Transaction Processing Performance Council (TPC) benchmarks. TPC-DS and TPC-H, updated through the 2020s, model decision support systems with query mixes that emphasize I/O operations, simulating data ingestion, storage access, and analytics in cloud environments. These benchmarks maintain a transaction mix emphasizing read-heavy operations, reflecting real-world cloud database behaviors where I/O latency impacts overall IPS.

In the 2020s, instruction mixes for architectures like ARM and x86 have evolved to incorporate vector extensions, enhancing IPS evaluations for data-parallel workloads. For x86 processors, mixes in SPEC and MLPerf adjust IPS by weighting AVX-512 instructions, which process 512-bit vectors equivalent to multiple scalar operations, boosting throughput in floating-point heavy workloads by up to 2x compared to AVX2. Similarly, ARM's Scalable Vector Extension (SVE), as used in benchmarks like MLPerf on processors such as the AWS Graviton3, supports vector lengths up to 2048 bits, allowing dynamic IPS adjustments for workloads involving AI and scientific computing.

Advancements in post-2010 benchmarks address energy efficiency by accounting for low-power operation in their mixes, responding to the demands of sustainable computing. SPEC CPU 2017 introduced an optional energy metric, incorporating power-management behaviors such as processor idle states and dynamic voltage scaling into its integer and floating-point evaluations. MLPerf Power, launched in 2024, extends this by measuring energy per sample in AI workloads, emphasizing instructions that optimize tensor operations for reduced wattage, such as mixed-precision arithmetic on GPUs and CPUs. These inclusions fill gaps in earlier benchmarks, providing efficiency metrics alongside power consumption for data-center and edge deployments.
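
One simplified way that vector extensions are accounted for is to weight each vector instruction by the number of scalar operations it replaces. The sketch below illustrates that idea under assumed instruction counts and vector widths; it is a toy model, not the weighting scheme of any particular benchmark suite.

```python
# Scalar-equivalent throughput for a mix containing wide vector instructions.
# Counts, widths, and runtime are hypothetical; real benchmark weightings are more involved.

def scalar_equivalent_ops(mix, element_bits=64):
    """Weight each instruction group by how many 64-bit scalar operations it performs."""
    total = 0
    for count, vector_bits in mix:
        lanes = max(1, vector_bits // element_bits)  # width 0 means a scalar instruction
        total += count * lanes
    return total

# (instruction count, vector width in bits)
workload = [
    (8_000_000, 0),      # scalar integer / control instructions
    (1_000_000, 256),    # AVX2-style 256-bit instructions (4 x 64-bit lanes)
    (1_000_000, 512),    # AVX-512-style 512-bit instructions (8 x 64-bit lanes)
]

runtime_s = 0.002
raw_ips = sum(count for count, _ in workload) / runtime_s
effective_ips = scalar_equivalent_ops(workload) / runtime_s

print(f"Raw IPS:               {raw_ips / 1e9:.1f} GIPS")        # 5.0 GIPS
print(f"Scalar-equivalent IPS: {effective_ips / 1e9:.1f} GIPS")  # 10.0 GIPS
```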

Performance Factors

Hardware Influences

Pipelining is a fundamental hardware technique that overlaps the execution stages of multiple instructions, such as fetch, decode, execute, memory access, and write-back, to increase instruction throughput without reducing individual instruction latency. In a non-pipelined processor, each instruction must complete all stages before the next begins, so the time per instruction equals the sum of the stage delays; pipelining divides this work into balanced stages, allowing a new instruction to enter the pipeline each cycle in ideal conditions. For a classic 5-stage MIPS pipeline with stage times of approximately 200 ps (register operations at 100 ps), the effective time per instruction drops from 800 ps in non-pipelined execution to 200 ps, yielding up to a 4-fold theoretical increase in instructions per second when hazards are minimized.

Cache memory hierarchies, consisting of multiple levels (L1, L2, and L3), serve as high-speed buffers between the CPU and main memory to mitigate access latencies that can stall instruction execution. L1 caches, closest to the CPU cores, provide the fastest access but smallest capacity, while deeper levels offer larger storage at slightly higher latencies; high hit rates (typically over 95% for L1) ensure most data accesses complete quickly, adding minimal cycles to the overall CPI. In benchmarks, a split instruction/data cache configuration can reduce the memory component of CPI to 0.45 compared to 0.69 for a unified cache, directly boosting IPS by limiting the proportion of cycles lost to memory penalties, which can otherwise inflate execution time by 20-50% in memory-intensive workloads.

Superscalar architectures extend pipelining by incorporating multiple execution units, enabling the processor to issue and complete several instructions simultaneously per clock cycle, thus increasing instructions per cycle (IPC) beyond 1. Out-of-order execution complements this by dynamically scheduling instructions based on data dependencies rather than program order, using mechanisms like reservation stations and reorder buffers to maximize functional unit utilization while preserving precise exceptions. The overall IPS is calculated as the product of clock rate and IPC, where superscalar designs like the MIPS R10000 can issue up to 4 instructions per cycle, potentially doubling or tripling throughput over scalar processors in parallelizable code.

Branch prediction hardware anticipates control-flow decisions to avoid pipeline stalls from conditional branches, which occur in 10-20% of instructions in typical mixes, by speculatively fetching subsequent instructions based on historical patterns. Accurate prediction, often exceeding 90% in modern predictors, minimizes misprediction penalties (where the pipeline must flush and refill, costing 10-20 cycles), thereby sustaining higher IPC in branch-heavy applications. Reducing branch predictor latency by even one cycle can improve overall performance by 2-5%, underscoring its role in maintaining steady IPS gains.

RISC architectures, with their simplified, fixed-length instructions, facilitate higher clock rates and easier pipelining compared to CISC designs featuring variable-length, complex instructions that demand more decoding resources. This leads to lower average CPI in RISC processors, enabling superior IPS in compute-bound tasks; for instance, early comparisons on SPEC benchmarks showed RISC implementations achieving approximately 2-4 times the performance of VAX CISC systems with similar hardware organization. Modern examples like ARM (RISC) versus x86 (CISC) continue to highlight RISC's efficiency advantages in power-constrained environments.
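
Several of these hardware factors can be folded into one back-of-the-envelope model: effective CPI is the base pipeline CPI plus memory-stall and branch-misprediction cycles per instruction, and IPS is the clock rate divided by that total. The parameters below are assumptions chosen for illustration, not measurements of any particular processor.

```python
# Back-of-the-envelope IPS model combining pipeline CPI, cache misses, and branch mispredictions.
# All parameters are illustrative assumptions.

def effective_ips(clock_hz, base_cpi, mem_refs_per_instr, miss_rate,
                  miss_penalty_cycles, branch_fraction, mispredict_rate,
                  mispredict_penalty_cycles):
    stall_mem = mem_refs_per_instr * miss_rate * miss_penalty_cycles
    stall_branch = branch_fraction * mispredict_rate * mispredict_penalty_cycles
    cpi = base_cpi + stall_mem + stall_branch
    return clock_hz / cpi, cpi

ips, cpi = effective_ips(
    clock_hz=3e9,                  # 3 GHz clock
    base_cpi=0.5,                  # superscalar core: 2 instructions per cycle when not stalled
    mem_refs_per_instr=0.3,        # 30% of instructions access memory
    miss_rate=0.03,                # 3% cache miss rate
    miss_penalty_cycles=100,       # main-memory penalty on a miss
    branch_fraction=0.15,          # 15% of instructions are branches
    mispredict_rate=0.05,          # 5% of branches mispredicted
    mispredict_penalty_cycles=15,  # pipeline flush cost
)
print(f"Effective CPI: {cpi:.2f}")             # ~1.51 instead of the 0.50 peak
print(f"Effective IPS: {ips / 1e9:.2f} GIPS")  # ~1.98 GIPS versus a 6 GIPS peak
```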

Software and Workload Effects

Compiler optimizations play a crucial role in enhancing effective instructions per second (IPS) by reducing the number of instructions executed or improving their parallelism. Techniques such as loop unrolling expose more opportunities for instruction-level parallelism, allowing the processor to execute multiple iterations simultaneously and thereby increasing throughput. Vectorization, which packs multiple data elements into SIMD registers, further amplifies this effect by processing arrays in parallel, often yielding speedups in the range of 2-5x for compute-intensive loops in applications like scientific simulations.

Operating system overheads in multitasking environments diminish effective IPS through mechanisms like context switching, where the OS saves and restores process states to enable multitasking. In scenarios with frequent switches, such as running multiple interactive applications, this can consume 5-15% of CPU cycles, directly reducing the time available for user instructions and lowering overall IPS. The impact scales with the number of active processes and switch frequency, emphasizing the need for efficient scheduler designs to minimize this penalty.

Workload variability significantly alters effective IPS, as tasks differ in their balance between computation and external dependencies. CPU-bound workloads, such as numerical simulations, can approach peak IPS by fully utilizing processing resources, whereas I/O-bound tasks like database queries spend much of their time waiting for disk or network operations, dropping CPU utilization, and thus IPS, to as low as 10% of peak in extreme cases. This contrast highlights how application demands dictate realized throughput, with I/O-intensive queries in databases often yielding far lower IPS than pure computational simulations despite identical hardware.

Virtualization introduces additional layers that impact IPS via hypervisor management of resources across virtual machines. Hypervisor overheads, including instruction trapping and resource partitioning, typically add 10-20% to execution costs for typical enterprise workloads, effectively reducing IPS. The effective IPS in virtualized environments can be modeled as

\text{Effective IPS} = \frac{\text{Raw IPS}}{\text{Overhead factor}}

where the overhead factor ranges from about 1.1 to 1.2 for moderate loads. In modern environments, containerization offers a lighter alternative to full virtualization, with minimal CPU overhead, often under 5%, due to shared-kernel execution, preserving higher effective IPS for microservices and scalable applications. This efficiency addresses gaps in traditional virtualization by enabling denser deployments without substantial performance penalties, though storage and networking aspects may introduce isolated bottlenecks.
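
The overhead relation above can be chained across layers. The sketch below applies assumed context-switch, hypervisor, and container overheads, drawn from the ranges mentioned in this section, to a hypothetical bare-metal IPS figure.

```python
# Effective IPS after operating-system, hypervisor, or container overheads.
# The overhead percentages are assumptions taken from the ranges discussed above.

def ips_after_overheads(raw_ips, *overhead_fractions):
    """Each overhead fraction removes that share of useful execution time."""
    ips = raw_ips
    for f in overhead_fractions:
        ips *= (1.0 - f)
    return ips

raw = 10e9  # hypothetical 10 GIPS bare-metal rating

print(f"Bare metal:         {raw / 1e9:.1f} GIPS")
print(f"+ context switches: {ips_after_overheads(raw, 0.10) / 1e9:.1f} GIPS")        # 10% OS overhead
print(f"+ hypervisor:       {ips_after_overheads(raw, 0.10, 0.15) / 1e9:.2f} GIPS")  # plus 15% virtualization
print(f"+ container only:   {ips_after_overheads(raw, 0.10, 0.03) / 1e9:.2f} GIPS")  # containers: ~3% overhead
```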

Historical Timeline

Single CPU Milestones

The development of single-processor performance, measured in instructions per second (IPS), began modestly in the 1960s with mainframe systems that laid the foundation for compatible architectures. The IBM System/360 family, introduced in 1964, represented a pivotal advancement in unified instruction sets across models, with higher-end configurations like the Model 65 and Model 75 achieving approximately 0.1 to 1 MIPS, enabling reliable execution for business and scientific workloads of the era. These early machines prioritized compatibility over raw speed, processing basic arithmetic and data movement instructions at rates that supported the transition from vacuum-tube to transistor-based designs.

By the 1970s and 1980s, minicomputers brought more accessible performance benchmarks, exemplified by Digital Equipment Corporation's VAX-11/780, released in 1977, which became the reference standard at 1 MIPS based on the VAX MIPS benchmark. This CISC-based processor handled complex addressing modes and multitasking operations efficiently, influencing performance metrics for decades as the "VAX Unit of Performance" (VUP). The Intel 80486, introduced in 1989, marked a leap in personal computing with an integrated floating-point unit and pipelining, delivering 20-50 MIPS at clock speeds up to 50 MHz, which powered early desktop applications and established x86 as a dominant architecture.

The 1990s saw rapid escalation driven by superscalar designs and the push toward gigahertz clock rates, with the Intel Pentium Pro (1995) achieving 200-300 MIPS at 200 MHz through out-of-order execution and deep pipelines. This processor's dual integer execution units allowed it to sustain higher throughput on integer workloads, bridging the gap between desktop and server capabilities while foreshadowing the clock speed wars that pushed frequencies beyond 1 GHz by the decade's end.

Entering the 2000s, multi-core architectures tempered raw clock increases but boosted effective IPS through parallelism within a single chip. The Intel Core i7 series, debuting in 2008 with the Nehalem microarchitecture, delivered 10-20 GIPS per core effectively on typical workloads, as seen in models like the i7-920 at 2.66 GHz sustaining around 4-5 instructions per cycle in mixed benchmarks. This represented a focus on power efficiency alongside performance, enabling consumer desktops to handle multimedia and productivity tasks at scales previously reserved for servers.

In the 2010s and 2020s, ARM-based designs emphasized integrated efficiency, with Apple's M1 (2020) exceeding 100 GIPS across its 8-core CPU configuration, where high-performance cores achieved up to 25 GIPS individually through advanced branch prediction and wide execution units. By 2025, emerging quantum-assisted processors integrated classical cores with quantum accelerators, as demonstrated in IBM-AMD collaborative architectures that pair quantum co-processors with classical hosts for speedups in hybrid workflows, such as over 4x in chemistry simulations. These milestones reflect a progression from monolithic mainframes to sophisticated, efficiency-driven single chips capable of exascale potential in specialized domains.

Parallel and Cluster Developments

In the 1980s, early symmetric multiprocessing (SMP) systems pioneered aggregate IPS growth through shared-memory architectures. Sequent Computer Systems' multiprocessor series, with models like the S81 featuring up to 30 processors at approximately 3 MIPS each, delivered 10-20 MIPS in total for initial configurations, enabling modest parallel execution for database and scientific workloads. By the late 1980s, advancements in bus design and cache coherence allowed systems like the Symmetry S81/20 with 20 CPUs at 20 MHz to reach 100 MIPS aggregate, demonstrating early scalability despite bottlenecks in memory contention.

The 1990s saw distributed-memory clusters, exemplified by Beowulf-style systems, achieve giga instructions per second (GIPS) using commodity off-the-shelf hardware. NASA's Beowulf project, initiated in 1994, connected standard PCs via Ethernet to form cost-effective parallel environments, with early prototypes like the 16-node 486DX4 cluster at 100 MHz providing foundational scalability for scientific computing. Larger systems, such as the Intel Paragon XP/S with 4,000 i860 CPUs at 50 MHz, attained 160 GIPS peak in 1992, while the Thinking Machines CM-5 scaled to 16,000 processors for 352 GIPS, underscoring how clustering democratized high-IPS performance beyond proprietary hardware. These developments reduced costs dramatically, with clusters offering supercomputing capabilities at fractions of traditional prices.

Entering the 2000s, Top500 supercomputers pushed toward tera instructions per second (TIPS) equivalents through massive parallelism. The Earth Simulator, operational from 2002, integrated 5,120 vector processors, establishing a benchmark for distributed systems in climate modeling through its vector-parallel design that amplified performance for compute-intensive tasks.

The 2010s and 2020s advanced to exascale systems, with the Frontier supercomputer achieving exascale deployment in 2022 using over 8.7 million cores across 9,472 nodes powered by AMD EPYC processors and Instinct accelerators, enabling massive parallelism for simulations and AI. By 2025, AI-focused clusters, such as xAI's Colossus with 100,000 NVIDIA H100 GPUs and Oracle's expansions targeting up to 800,000 GPUs, have scaled aggregate performance through GPU parallelism for training large models and hyperscale AI inference.

Scalability in these parallel and cluster systems faces inherent challenges, particularly communication overhead that prevents ideal linear summation of individual node IPS. Gustafson's law addresses this by emphasizing scaled speedup, where larger problem sizes on more processors keep wall-clock time roughly constant, allowing efficient utilization of large processor counts in practice while still highlighting limits from serial fractions and interconnect latency.
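
The contrast with Amdahl's law can be made explicit. The sketch below compares Amdahl's fixed-size speedup with Gustafson's scaled speedup for the same serial fraction; the 5% serial fraction and the processor counts are arbitrary illustrative values.

```python
# Amdahl's fixed-size speedup versus Gustafson's scaled speedup.

def amdahl(parallel_fraction, n):
    """Fixed problem size: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

def gustafson(parallel_fraction, n):
    """Scaled problem size: speedup = (1 - p) + p * n."""
    return (1.0 - parallel_fraction) + parallel_fraction * n

p = 0.95  # 5% serial fraction
for n in (16, 256, 4096):
    print(f"n={n:5d}  Amdahl {amdahl(p, n):6.1f}x   Gustafson {gustafson(p, n):8.1f}x")
# Amdahl plateaus near 20x, while the scaled-speedup view keeps growing with n.
```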
