
Coprocessor

A coprocessor is a specialized processor that operates in conjunction with the central processing unit (CPU) to perform specific tasks more efficiently, such as floating-point arithmetic or multimedia processing, often running in parallel with its own control unit. These auxiliary processors can be implemented as separate chips or integrated on the same die as the CPU, communicating via shared buses, dedicated instructions, or memory-mapped interfaces to offload workload from the main processor. The concept of coprocessors emerged in the 1960s with early supercomputers like the CDC 6600, which used dedicated functional units for operations such as multiplication and division to enable parallel execution alongside the CPU. By the 1970s, coprocessors became common in minicomputers and microcomputers, exemplified by the FP11 floating-point processor in the DEC PDP-11/45, which handled complex numerical computations independently. A pivotal milestone occurred in 1980 with the introduction of the Intel 8087 math coprocessor, the first dedicated floating-point unit for the x86 architecture, which implemented a draft of the IEEE 754 standard for accurate and reliable numerical processing, including 80-bit extended precision and transcendental functions such as logarithms. This innovation significantly influenced personal computing, as the IBM PC shipped with a socket for the 8087, and it set the foundation for standardized floating-point operations in subsequent hardware. Coprocessors are categorized by their specialized functions, including mathematical units for high-precision calculations, graphics processing units (GPUs) for rendering and parallel computations, and cryptographic accelerators for secure data handling. Other types encompass multimedia coprocessors for streaming acceleration, such as the PipeRench reconfigurable coprocessor, which uses reconfigurable hardware for efficient stream processing via low-latency connections to the CPU. Secure coprocessors, such as those deployed in server environments, provide isolated execution for privacy-sensitive applications, allowing dynamic software updates while protecting against tampering.
In instruction path coprocessors, programmable on-chip units execute mini-instruction sets to optimize dynamic code paths, enhancing overall system performance. In contemporary computing, coprocessors have evolved from discrete components into tightly integrated elements, such as the floating-point units embedded in modern CPUs and many-core accelerators like GPUs or neural processing units (NPUs), which support tasks including scientific simulations and machine-learning workloads. This integration allows for scalable parallelism, as seen in systems where coprocessors handle vectorized media instructions or hardware-accelerated functions, reducing latency and boosting efficiency in domains like artificial intelligence and data analytics. The ongoing development of coprocessors continues to address the growing demand for specialized acceleration in modern computing environments.

Definition and Functionality

Core Concept

A coprocessor is a specialized auxiliary processor designed to supplement the primary central processing unit (CPU) by performing specific computational tasks that the CPU handles less efficiently, such as complex arithmetic or data-intensive operations. This collaboration enables the coprocessor to execute targeted instructions in parallel with the CPU, thereby enhancing overall system capabilities without requiring the main processor to manage every type of workload. The primary benefits of a coprocessor lie in its ability to offload computationally intensive tasks from the CPU, reducing processing latency, boosting throughput for specialized operations, and facilitating parallel execution that exploits the coprocessor's optimized architecture rather than the CPU's general-purpose design. By delegating such tasks, the CPU remains free to focus on control flow and other core functions, leading to improved efficiency in applications requiring high numerical precision or rapid data manipulation. This division of labor is particularly valuable in domains like scientific computing and signal processing, where the coprocessor's dedicated hardware can deliver orders-of-magnitude gains over software emulation on the CPU alone. In its basic operational model, a coprocessor interfaces with the CPU through shared buses or dedicated communication channels, monitoring the CPU's instruction stream to identify and execute relevant coprocessor-specific commands while ignoring others. Upon receiving an applicable instruction, the coprocessor fetches operands via the bus, performs processing using its own registers and a tailored instruction set—often an extension of the CPU's instruction set architecture—and signals completion before returning results for the CPU to integrate into the main execution flow. This synchronized yet autonomous operation ensures seamless data exchange and minimizes synchronization overhead, allowing the system to maintain coherent processing across both units.
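The dispatch model described above can be sketched as a toy simulation. The class and opcode names below are hypothetical illustrations of the pattern (a coprocessor claiming instructions it recognizes while the CPU keeps the rest), not a real hardware interface.

```python
import math

class ToyCoprocessor:
    """Hypothetical floating-point coprocessor that claims two opcodes."""
    OPCODES = {"FSQRT": math.sqrt, "FSIN": math.sin}

    def claims(self, opcode):
        # The real hardware analogue: recognizing its own instructions
        # in the shared instruction stream and ignoring all others.
        return opcode in self.OPCODES

    def execute(self, opcode, operand):
        return self.OPCODES[opcode](operand)

def run(program, coproc):
    """CPU loop: handle its own opcodes, offload coprocessor-specific ones."""
    results = []
    for opcode, operand in program:
        if coproc.claims(opcode):          # delegated to the coprocessor
            results.append(coproc.execute(opcode, operand))
        elif opcode == "ADD1":             # handled by the CPU itself
            results.append(operand + 1)
    return results

print(run([("ADD1", 41), ("FSQRT", 16.0)], ToyCoprocessor()))  # [42, 4.0]
```

In real systems the "claims" step is implemented in hardware (dedicated opcode ranges or coprocessor IDs), but the division of labor is the same.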

Integration with Main Processors

Coprocessors integrate with main processors through various communication protocols that enable efficient data exchange and coordination. Common approaches include shared-memory systems, where both the CPU and coprocessor access a common memory space to transfer data without explicit copying, reducing latency in tightly coupled designs. For instance, in heterogeneous systems, data structures are mapped directly to the coprocessor's local memory via a collector mechanism integrated with the CPU's memory hierarchy, preserving original layouts and minimizing overhead. Dedicated buses, such as local coprocessor interface buses, provide a direct, high-speed pathway for instruction and data transfer, often synchronized with the CPU's clock to support real-time operation in system-on-chip (SoC) environments. Interrupt signals further facilitate asynchronous communication, allowing the coprocessor to notify the CPU of task completion or errors, ensuring seamless coordination in multi-component systems. Instruction handling between the CPU and coprocessor typically involves the CPU issuing specialized opcodes that the coprocessor recognizes and executes. In architectures like x86, escape (ESC) instructions occupy a dedicated opcode range (D8 to DF), which the CPU passes to the coprocessor for decoding and processing while continuing to handle other computations itself. The coprocessor traps these opcodes, executes the corresponding operations—such as floating-point arithmetic—and signals readiness for the next instruction, allowing the CPU either to wait idly or to proceed with independent tasks to optimize overall throughput. This mechanism ensures that coprocessor-specific computations are offloaded transparently, with the CPU maintaining control over the instruction stream by snooping the bus or prefetch queue. Synchronization methods are essential to maintain data consistency and coordinate execution between the CPU and coprocessor. Busy-wait polling, where the CPU repeatedly checks a status flag until the coprocessor completes its task, provides simple but resource-intensive synchronization, often used in low-latency scenarios.
Handshaking signals, such as request/grant (RQ/GT) pins, enable more efficient bus arbitration by allowing the coprocessor to seize bus control temporarily, preventing conflicts during memory accesses without requiring explicit waits. Event-driven interrupts offer asynchronous coordination, where the coprocessor asserts an interrupt upon finishing, prompting the CPU to resume dependent operations; this approach minimizes idle time and supports interruptible busy-waiting to handle higher-priority events. In practice, combinations of these methods, like the BUSY signal paired with the WAIT instruction in early 8087 designs, ensure the CPU pauses only when necessary, tracking coprocessor state to avoid race conditions. Coprocessors typically share the main processor's clock and power infrastructure to form a cohesive unit, with integrated power management and adaptive clocking managing overall consumption in modern SoCs. This sharing extends to thermal management, where coprocessors contribute to on-chip heat generation, necessitating system-level techniques like dynamic throttling to distribute heat and prevent hotspots across the die. Bandwidth allocation in the memory hierarchy prioritizes critical data flows, often using hybrid arbitration policies to balance access efficiency between CPU and coprocessor tasks, ensuring sustained performance without excessive contention.
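Two of the mechanisms above—the dedicated ESC opcode range and busy-wait polling—can be sketched as a toy model. The `ToyStatus` class is illustrative, not a real hardware status register.

```python
# x86 ESC (escape) instructions occupy opcodes 0xD8-0xDF; the CPU
# forwards bytes in this range to the x87 coprocessor.
def is_escape_opcode(byte):
    return 0xD8 <= byte <= 0xDF

# Busy-wait polling: the CPU spins on a status flag (as with the x87
# BUSY bit checked via WAIT/FWAIT) until the coprocessor finishes.
class ToyStatus:
    """Hypothetical status register: busy for a few polls, then done."""
    def __init__(self, cycles_remaining):
        self.cycles_remaining = cycles_remaining

    def busy(self):
        self.cycles_remaining -= 1
        return self.cycles_remaining > 0

def busy_wait(status):
    polls = 0
    while status.busy():   # resource-intensive: the CPU does no other work
        polls += 1
    return polls

assert is_escape_opcode(0xD9) and not is_escape_opcode(0x90)
print(busy_wait(ToyStatus(5)))  # 4 polls before the busy bit clears
```

The interrupt-driven alternative replaces the polling loop with a callback the coprocessor triggers on completion, freeing the CPU in the meantime.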

Historical Development

Early Innovations (1970s–1980s)

Before the emergence of microprocessor-based coprocessors, mainframe systems incorporated specialized hardware for floating-point arithmetic as an analogous means of accelerating numerical computations. The IBM System/360 family, announced in 1964, offered optional floating-point features that supported both short (32-bit) and long (64-bit) precision operations, significantly enhancing the speed and efficiency of scientific and engineering calculations on its integrated processing units. These capabilities demonstrated the value of dedicated arithmetic hardware for tasks beyond the core integer processing of early computers, setting the stage for discrete coprocessors in microcomputer architectures. Coprocessors as separate chips emerged in the late 1970s, with early examples like AMD's Am9511 arithmetic processing unit, introduced in 1977 for microprocessors such as the 8080. For the x86 architecture, Intel's 8087 Numeric Data Processor, released in 1980, became a seminal example, designed specifically to complement the 8086 by offloading floating-point operations. This coprocessor used an 80-bit extended format for internal temporaries, allowing for greater accuracy in real-number arithmetic compared to the 8086's 16-bit integer focus, and it implemented a stack-based register architecture with eight 80-bit registers. A key innovation of the 8087 was its hardware support for transcendental functions, including sine, cosine, tangent, arctangent, logarithm, and exponential operations, which were computed using algorithms like CORDIC for efficiency. Operating at clock speeds up to 5 MHz in its standard variant—synchronized with the host CPU—the 8087 executed floating-point additions, multiplications, divisions, and square roots at rates far exceeding software emulation, thereby minimizing CPU overhead for math-intensive tasks.
The adoption of early coprocessors like the 8087 was propelled by the inherent limitations of integer-only microprocessors such as the 8086, which struggled with the decimal, floating-point, and vector mathematics required in engineering simulations, scientific modeling, and data processing. By enabling high-speed and precise floating-point arithmetic, the 8087 addressed these gaps, fostering broader use of microcomputers in technical applications and influencing the development of standardized numerical practices.
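The precision benefit of the 8087's 80-bit temporaries (a 64-bit mantissa versus the 53 bits of a standard double) can be illustrated with exact rational arithmetic. The `round_to_mantissa` function below is a simplified round-to-nearest model, not the full IEEE rounding rule.

```python
from fractions import Fraction

def round_to_mantissa(x, bits):
    """Round a positive rational to `bits` significant binary digits
    (simplified round-to-nearest; ignores exponent range limits)."""
    e = 0
    while x >= 1:                 # normalize x into [1/2, 1)
        x /= 2
        e += 1
    while x < Fraction(1, 2):
        x *= 2
        e -= 1
    mantissa = Fraction(round(x * 2**bits), 2**bits)
    return mantissa * Fraction(2) ** e

exact = Fraction(1, 3)            # non-terminating in binary
err_double   = abs(round_to_mantissa(exact, 53) - exact)  # double mantissa
err_extended = abs(round_to_mantissa(exact, 64) - exact)  # x87 temporaries
assert err_extended < err_double  # 11 extra mantissa bits round far closer
```

Keeping intermediates in the wider format and rounding only the final result is exactly why the 8087's internal 80-bit temporaries improved the accuracy of chained computations.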

Advancements in the 1990s

The Intel 80287, introduced as a successor to the 8087, served as the numeric coprocessor for the 80286 microprocessor, offering enhanced compatibility through a shared bus architecture and synchronized operation modes. It supported clock speeds ranging from 5 MHz to 12 MHz, enabling more efficient handling of floating-point operations in systems like the IBM PC/AT and its derivatives. The 80387 further advanced this lineage for the 80386 CPU, achieving clock speeds up to 25 MHz in standard configurations and incorporating parallel functional units for improved throughput, including support for pipelined bus cycles that reduced latency in data transfers. These coprocessors maintained full backward compatibility with earlier x87 designs while extending the instruction set to over 70 numeric operations, facilitating seamless upgrades in 32-bit protected-mode environments. During the 1990s, coprocessor technology began transitioning from discrete components toward on-chip integration, driven by the need for reduced latency and simplified system design. Early efforts embedded floating-point units (FPUs) directly onto the main processor die, with Intel's Pentium—launched in 1993—marking a pivotal milestone by incorporating a fully integrated FPU capable of superscalar execution. This integration eliminated the requirement for separate coprocessor chips like the 80387, cutting costs and improving performance by allowing simultaneous integer and floating-point operations, with the FPU pipeline handling one floating-point instruction per clock cycle. Subsequent Pentium variants refined this approach, boosting overall system efficiency and paving the way for broader adoption in personal computing. Performance advancements in coprocessors were quantified through metrics like MFLOPS, with the 80387 achieving around 0.3 MFLOPS in typical double-precision workloads at 25 MHz on 80386 systems, a significant leap that supported simulations in scientific and engineering applications.
This throughput, derived from optimized execution of basic arithmetic instructions (e.g., add and multiply taking 10–30 cycles), enabled practical use in fields like computer-aided design, where prior discrete designs struggled with bottlenecks. MIPS ratings for coprocessor-specific instructions also improved, reflecting pipelined enhancements that increased effective instructions per second by up to 20% over the 80287. These developments profoundly influenced software ecosystems, as compilers began incorporating optimizations tailored to coprocessor presence. For instance, Microsoft C version 6.0 (1990, with updates through the decade) provided flags like /FPc87 to generate code assuming an 80x87-series coprocessor, enabling intrinsic functions for direct FPU access and accelerating floating-point computations in numeric applications. Such tools encouraged developers to leverage coprocessor capabilities, fostering growth in numerically intensive software for Windows platforms and reducing reliance on software emulation.

Key Manufacturers and Examples

Intel Implementations

Intel's coprocessor lineup began with the 8087 Numeric Data Processor, introduced in 1980 as the first floating-point coprocessor for the 8086 microprocessor, featuring 40,000 transistors and capable of approximately 50,000 floating-point operations per second (0.05 MFLOPS). Designed in HMOS with a 40-pin package, it extended the 8086's capabilities for scientific and engineering applications by handling complex arithmetic independently. This was followed by the 80287 in 1982, a high-performance extension for the 80286, built on HMOS III in a 40-pin DIP package, supporting integer, floating-point, and BCD formats while maintaining object-code compatibility with the 8087. The 80387, released in 1987 for the 80386, advanced the series with full compliance with the IEEE 754 standard, including support for denormals, NaNs, and rounding modes, in a 68-pin grid array package operating synchronously with the host CPU. Complementing these x87 math coprocessors, Intel introduced the i860 in 1989, a 64-bit RISC processor with over 1 million transistors, optimized for vector and graphics processing in supercomputing environments through pipelined floating-point units achieving up to 80 MFLOPS in single precision at 40 MHz. The x87 family, encompassing the 8087, 80287, and 80387, employed a stack-based register file with eight 80-bit registers organized as a last-in, first-out (LIFO) structure, where operations reference positions relative to the top-of-stack (ST(0)) pointer in the status word. This model supported extended precision (64-bit mantissa) for internal computations, alongside single (32-bit), double (64-bit), and packed BCD formats, enabling precise handling of transcendental functions like sine and logarithm while adhering to IEEE 754 for arithmetic accuracy, including gradual underflow and exception flags for invalid operations, overflow, and inexact results. The x87 units integrated seamlessly via ESCAPE instructions, allowing the coprocessor to execute in parallel with the main CPU without stalling the host's instruction stream.
Compatibility was a core design principle, with pin-compatible variants like the 80387DX for the 80386DX and 80387SX for the 80386SX, enabling straightforward upgrades within the 386 family via the dedicated math coprocessor socket. The 80387 preserved full object-code compatibility with 8087 and 80287 software, running existing programs unchanged in real-address mode and supporting virtual-8086 mode for multitasking, though it introduced refinements such as the distinction between signaling and quiet NaNs absent in prior models. This compatibility extended to earlier CPUs like the 80286 through shared socket designs in some systems, facilitating incremental enhancements without redesigning motherboards or rewriting code. By the mid-1990s, Intel's coprocessors had achieved widespread adoption, powering personal computers through their integration as standard components in x86 systems and contributing to Intel's dominance in the microprocessor market. This prevalence paved the way for fully integrated successors, exemplified by the MMX instruction set extensions introduced in 1996, which repurposed the existing x87 registers for multimedia vector operations, bridging the gap to the integrated floating-point units of later CPUs.
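The stack-based register model shared by these parts can be mimicked in a few lines. This `X87Stack` class is a simplified sketch (no tag word, no exception flags), with FLD/FADDP modeled loosely after the real mnemonics.

```python
class X87Stack:
    """Minimal sketch of the x87's eight-register LIFO stack. Real
    hardware tracks a 3-bit top-of-stack pointer in the status word;
    here a Python list stands in for the register file."""
    DEPTH = 8

    def __init__(self):
        self.regs = []

    def fld(self, value):               # FLD: push a value onto ST(0)
        if len(self.regs) == self.DEPTH:
            raise OverflowError("x87 stack overflow")
        self.regs.append(value)

    def st(self, i):                    # ST(i): i-th register from the top
        return self.regs[-1 - i]

    def faddp(self):                    # FADDP: ST(1) += ST(0), then pop
        top = self.regs.pop()
        self.regs[-1] += top

fpu = X87Stack()
fpu.fld(2.0)
fpu.fld(3.0)          # now ST(0) = 3.0, ST(1) = 2.0
fpu.faddp()
assert fpu.st(0) == 5.0
```

The LIFO discipline is why compilers of the era had to schedule floating-point code around the stack top rather than addressing registers freely.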

Motorola and Other Early Designs

The Motorola 68881 and 68882 floating-point coprocessors, introduced in the 1980s, served as dedicated arithmetic units for the MC68020 and MC68030 microprocessors in the 68000 family. These devices extended the host CPU's capabilities by handling IEEE 754-compliant floating-point operations, including addition, multiplication, division, and transcendental functions like sine and logarithm, all performed internally at 80-bit extended precision (64-bit mantissa and 15-bit exponent) using a 67-bit arithmetic unit and barrel shifter. The 68881 featured a basic architecture divided into a Bus Interface Unit (BIU) for communication and an Arithmetic Processing Unit (APU) for execution, while the 68882 improved upon this with dual-ported registers, an added Conversion Unit (CU), and enhanced concurrency, achieving over twice the performance of its predecessor through parallel instruction handling. Both supported eight 80-bit floating-point data registers (FP0–FP7) and three 32-bit control registers for status, precision control, and instruction addressing. Architecturally, these coprocessors integrated via the M68000 family's coprocessor interface, which allowed up to eight such devices per system and used standard asynchronous bus cycles for communication. A key feature was support for dynamic bus sizing, enabling seamless operation over 8-, 16-, or 32-bit buses without additional glue logic, which provided flexibility in system design compared to more rigid interfaces in contemporary x86 alternatives. The 68881 employed a sequential execution model where the host CPU waited for APU availability, but the 68882 introduced pipelining across the BIU, CU, and APU to overlap instruction fetch, decode, and execution, supporting clock speeds up to 33 MHz (with later variants reaching 40 MHz). This design emphasized high-precision computation for scientific and engineering applications, with all operations rounded to at most one unit in the last place (ulp) for accuracy up to about 19 decimal digits.
These coprocessors found adoption in niche ecosystems outside the dominant x86 market, particularly in Apple's Macintosh II series and Commodore's Amiga 3000 workstations, where they accelerated graphics, simulations, and CAD tasks. By 1990, 68000-based systems such as those from Apple and Commodore held a significant share of alternative computing platforms. Other early vendors offered alternatives to Intel's dominance in x86 coprocessors. AMD produced the Am287 and Am387 as pin- and software-compatible equivalents of the Intel 80287 and 80387, respectively, providing second-sourcing for floating-point acceleration in 286 and 386 systems with similar 80-bit register stacks and IEEE 754 support, often at lower cost for OEMs. Meanwhile, Cyrix produced compatible coprocessors like the FasMath 83D87, offering similar functionality at lower cost for 386 systems. Weitek's WTL 3167 targeted high-speed vector mathematics in professional workstations paired with the Intel 80386, delivering two to three times the performance of the 80387 through a pipelined architecture optimized for single- and double-precision operations in scientific computing and graphics rendering. These designs contrasted with Motorola's by focusing on x86 compatibility and vector extensions, yet they shared the goal of offloading complex math from the main CPU to serve embedded and workstation applications in the late 1980s.

Modern Coprocessors

Graphics and Compute Accelerators

Graphics processing units (GPUs) have evolved into versatile coprocessors, extending beyond visual rendering to general-purpose computing on graphics processing units (GPGPU). NVIDIA's Compute Unified Device Architecture (CUDA), introduced in 2006, marked a pivotal advancement by enabling developers to program GPUs for non-graphics workloads such as scientific simulations and data processing. Similarly, AMD's Stream SDK, launched around 2008 as part of the FireStream initiative, facilitated GPGPU programming on Radeon-based hardware, allowing parallel execution of compute-intensive tasks. These frameworks leverage the GPU's massively parallel architecture, which incorporates thousands of simpler cores optimized for single instruction, multiple data (SIMD) operations, enabling massive parallelism for tasks that benefit from concurrent thread execution. Key to their efficacy as coprocessors are features that streamline integration with host CPUs. NVIDIA's Unified Virtual Addressing (UVA), introduced in CUDA 4.0, establishes a single virtual address space across CPU and GPU memory, simplifying data sharing and pointer manipulation without explicit copies. Complementing this, the OpenCL (Open Computing Language) framework, developed by the Khronos Group, provides a cross-vendor standard for orchestrating heterogeneous computation, allowing kernels to run on diverse accelerators including GPUs while abstracting hardware differences. These mechanisms enable seamless CPU-GPU collaboration, where the CPU handles sequential control flow and the GPU accelerates vectorized computations. Modern GPUs in the RTX 50 series (as of 2025) exemplify high-performance coprocessing, with the flagship RTX 5090 delivering 104.8 teraflops (TFLOPS) in single-precision floating-point (FP32) operations, significantly offloading complex workloads like ray tracing and inference from the CPU. This capability stems from architectures with up to 21,760 CUDA cores, which process graphics pipelines and compute shaders in parallel.
In applications such as scientific computing, GPUs accelerate molecular dynamics simulations by performing force calculations on vast particle sets, achieving speedups of orders of magnitude over CPU-only implementations due to their high arithmetic throughput. Likewise, in cryptocurrency mining—particularly for algorithms like Ethash—GPUs outperform CPUs by factors of 10 to 100, as their numerous arithmetic units efficiently hash data streams.
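Peak FP32 figures like the one quoted follow from a simple formula: cores × 2 FLOPs per fused multiply-add per cycle × clock. The boost-clock value below is inferred by working that arithmetic backward, not an official specification.

```python
def peak_fp32_tflops(cores, clock_ghz, flops_per_core_cycle=2):
    """Peak FP32 throughput assuming one fused multiply-add
    (2 FLOPs) per core per cycle; clock in GHz, result in TFLOPS."""
    return cores * flops_per_core_cycle * clock_ghz / 1e3

# Working backward from 104.8 TFLOPS and 21,760 cores:
implied_clock_ghz = 104.8e3 / (21760 * 2)
print(round(implied_clock_ghz, 2))   # ~2.41 GHz boost clock (inferred)
assert abs(peak_fp32_tflops(21760, implied_clock_ghz) - 104.8) < 1e-9
```

Sustained throughput is lower in practice, since memory bandwidth and occupancy limits keep real workloads below the peak the formula predicts.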

Specialized Domain Accelerators

Specialized domain accelerators are coprocessors designed for specific, high-performance workloads in fields like networking, cryptography, and machine learning, offloading tasks from general-purpose CPUs to achieve superior efficiency and throughput. These devices leverage custom architectures to handle domain-specific operations, such as packet manipulation in networking or tensor computations in machine learning, enabling scalable performance in data centers and embedded systems. Network processing units (NPUs) exemplify this specialization, with Broadcom's StrataDNX series providing programmable packet processors capable of terabit-scale routing. For instance, the Jericho2 (BCM88690) delivers 9.6 Tb/s of packet processing capacity, supporting up to twelve 400 GbE ports while integrating accelerators for tasks like checksum calculations and quality-of-service (QoS) enforcement. These features allow the NPU to perform on-the-fly packet modifications, including header updates and checksum computations, without CPU involvement, ensuring low-latency handling of ingress and egress traffic in high-density Ethernet fabrics. Similarly, the Jericho3-AI extends this to 28.8 Tb/s with deep buffering and hierarchical QoS scheduling, optimizing for congestion management in carrier-grade networks. By distributing scheduling across packet buffers, NPUs maintain state-of-the-art traffic prioritization, reducing jitter for real-time applications like video streaming or VoIP. Cryptographic coprocessors further illustrate domain-specific acceleration, particularly for secure data handling. IBM's PCIe Cryptographic Coprocessors, operating under the Common Cryptographic Architecture (CCA), provide hardware-accelerated encryption and key-management services compliant with standards like FIPS 140-2 Level 4. These devices support encryption at high speeds, enabling secure storage and processing of sensitive keys without exposing them to the host system. The 4769 model, for example, integrates multiple cryptographic engines per coprocessor, facilitating operations like data encryption and digital signing in enterprise environments such as financial transactions or cloud security.
This offload minimizes latency for bulk encryption workloads, with battery-backed memory ensuring key persistence during power events. In machine learning, Google's Tensor Processing Unit (TPU), introduced in 2015, serves as a pioneering application-specific integrated circuit (ASIC) optimized for neural network inference. The TPU employs a systolic array architecture—a 256×256 grid of 65,536 multiply-accumulate units—for efficient matrix multiplications central to tensor operations in deep learning. This design processes 8-bit integer operations at 92 tera-operations per second (TOPS), tailored for convolutions and fully connected layers in models like those used for image recognition. Unlike general-purpose GPUs, the TPU's focus on inference yields significantly higher efficiency, achieving 15–30 times the performance and 30–80 times the performance-per-watt of contemporary CPUs and GPUs for inference tasks. Later iterations build on this, such as the 2025 Ironwood generation, which delivers roughly twice the performance per watt of its predecessor through architectural improvements, underscoring the advantage of such specialization in power-constrained deployments over the broader but less targeted coverage of GPU coprocessors.
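The systolic-array idea—every output built purely from local multiply-accumulate steps—can be sketched in miniature. This simulation captures the MAC wavefront but not the physical pipelining of operands between neighboring cells, which real hardware uses to keep every unit busy each cycle.

```python
def systolic_matmul(A, B):
    """Matrix multiply expressed as a sequence of multiply-accumulate
    (MAC) wavefronts, the only operation a TPU's MAC grid performs.
    A is n x k, B is k x m; returns the n x m product."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for t in range(k):                 # one "wavefront" per time step
        for i in range(n):
            for j in range(m):
                C[i][j] += A[i][t] * B[t][j]   # MAC update at cell (i, j)
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert systolic_matmul(A, B) == [[19, 22], [43, 50]]
```

Because each cell only ever multiplies and accumulates, a 256×256 grid completes a full matrix product with no instruction fetch or cache traffic per operation, which is the source of the efficiency figures quoted above.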

Integration in System-on-Chips

In system-on-chip (SoC) architectures, coprocessors such as neural processing units (NPUs) and graphics processing units (GPUs) are tightly integrated onto a single die alongside central processing units (CPUs), enabling seamless data sharing and minimized inter-component communication delays. For instance, Apple's A-series chips, like the A16 Bionic, embed a 16-core Neural Engine and a multi-core GPU within the same die, allowing unified memory access that reduces latency for machine learning and graphics tasks by eliminating off-chip data transfers. This monolithic approach contrasts with discrete designs, fostering efficient heterogeneous computing in mobile and embedded systems where space and speed are paramount. The benefits of such integrated designs stem from the elimination of external interfaces, leading to substantial reductions in power consumption and enhancements in throughput. Mobile SoCs typically operate within a 5–10 W envelope, where on-chip coprocessor integration cuts energy use by avoiding the overhead of board-level signaling, as seen in ARM-based platforms. Furthermore, interconnects like Arm's CoreLink Cache Coherent Interconnect (CCI-550) provide high-bandwidth pathways—supporting up to six coherent interfaces with doubled snoop efficiency—facilitating rapid data exchange between heterogeneous cores while maintaining cache coherency and low overhead. A prominent example is Qualcomm's Snapdragon series, where the Hexagon digital signal processor (DSP) serves as an integrated coprocessor for audio processing, offloading tasks from the CPU to achieve up to 32x the efficiency of general-purpose software implementations. In devices like smartphones, this enables real-time audio enhancement with minimal battery drain, as the DSP handles voice and sensor data directly via streaming modes that bypass main memory, improving overall system responsiveness. Despite these advantages, integrating coprocessors into dense SoCs introduces challenges, particularly in thermal management and programmability.
High-performance heterogeneous cores generate concentrated heat, complicating dissipation in compact form factors and risking thermal throttling or reliability issues under sustained loads. Additionally, the diverse instruction sets of specialized accelerators versus general-purpose CPUs create programmability trade-offs, requiring complex toolchains for workload partitioning and optimization to fully leverage the architecture without excessive developer overhead.

Future Directions in AI and Edge Computing

The evolution of coprocessors in AI is increasingly driven by neuromorphic designs that emulate biological neural processes for efficient, sparse computing. Intel's Loihi, introduced in 2017, represents a seminal example, featuring 128 neuromorphic cores that mimic brain synapses through spiking neural networks and on-chip learning capabilities, enabling asynchronous, event-driven computation with minimal energy use suitable for edge inference tasks. Subsequent iterations, such as Loihi 2, further optimize for sparse activity and reduced data movement, targeting applications in robotics and sensor processing where power constraints are critical. In edge computing, coprocessors are poised to enable on-device intelligence in resource-constrained environments, particularly within the burgeoning Internet of Things (IoT) ecosystem. Projections indicate that the number of connected IoT devices will reach approximately 40 to 50 billion by 2030, necessitating on-device processing to handle latency-sensitive tasks such as inference and sensor fusion without relying on cloud infrastructure. Coprocessors integrated into these devices, such as dedicated AI accelerators, will facilitate efficient execution of machine-learning models directly at the edge, reducing bandwidth demands and enhancing privacy. Looking ahead, technological advancements in coprocessors include hybrid quantum-classical systems and advanced packaging techniques to achieve unprecedented performance in compact formats. IBM's Qiskit framework supports extensions for hybrid workflows, where quantum processors act as coprocessors alongside classical hardware to tackle optimization and simulation problems intractable for traditional systems. Complementing this, densely stacked designs are emerging to deliver over 1 petaFLOPS of compute in small-form-factor devices, as demonstrated by systems like NVIDIA's DGX Spark, which leverages tight integration for high-density acceleration while improving energy efficiency. Industry forecasts underscore the growing dominance of coprocessors in AI workloads, with a strong emphasis on inference.
By 2029, more than 65% of spending on AI-optimized infrastructure-as-a-service (IaaS) is expected to focus on inference tasks, where specialized coprocessors play a central role in scaling deployments efficiently. This shift aligns with broader goals to mitigate AI's energy footprint, as energy-efficient computing emerges as a top trend, potentially reducing power demands amid rising sustainability pressures.
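The event-driven, spiking model used by neuromorphic chips like Loihi can be illustrated with a leaky integrate-and-fire neuron. The threshold and leak constants below are illustrative, not Loihi's actual model parameters.

```python
def lif_neuron(inputs, threshold=1.0, leak=0.9):
    """Leaky integrate-and-fire neuron, the basic unit of spiking
    neural networks: membrane potential decays each step, accumulates
    input current, and emits a spike (then resets) past threshold."""
    v, spikes = 0.0, []
    for current in inputs:
        v = v * leak + current      # leak, then integrate
        if v >= threshold:
            spikes.append(1)        # event: a spike fires
            v = 0.0                 # reset after firing
        else:
            spikes.append(0)        # no event, so no downstream work
    return spikes

assert lif_neuron([0.6, 0.6, 0.0, 0.6, 0.6]) == [0, 1, 0, 0, 1]
```

Because downstream computation happens only when a spike occurs, activity is sparse, which is the source of the energy savings claimed for neuromorphic coprocessors.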

References

  1. [1]
    The Hawk Floating Point Coprocessor - University of Iowa
    A coprocessor is a special purpose processor that operates in conjunction with the central processor. Coprocessors may be physically separate from the central ...
  2. [2]
    21. Coprocessors - University of Iowa
    A coprocessor is a system component that runs in parallel with the CPU and has its own control unit, so that it may perform computations while the CPU is ...Missing: definition | Show results with:definition
  3. [3]
    [PDF] Media Instructions, Coprocessors, and Hardware Accelerators
    performance goals. • Highly flexible. • Parallelism from SIMD structures only. • Coprocessors are preferred when well-defined interfaces are available ...Missing: computer architecture
  4. [4]
    [PDF] Practical server privacy with secure coprocessors
    The coprocessor5 features a software architecture that permits application developers to install and up- date their applications onto these devices at customer.
  5. [5]
    [PDF] Instruction Path Coprocessors - Carnegie Mellon University
    This paper presents the concept of an Instruction Path. Coprocessor (I-COP), which is a programmable on-chip coprocessor, with its own mini-instruction set, ...
  6. [6]
    [PDF] White Paper Intel® Xeon Phi™ Coprocessor
    Coprocessor. ISA – Instruction Set Architecture – part of the computer architecture related to programming, including the native data types, instructions ...
  7. [7]
    [PDF] Introduction to Many Integrated Core (MIC) Coprocessors on ...
    Oct 23, 2013 · – Works well for interactive jobs. – Re-compile and run! • Use Symmetric to run existing MPI code on MIC only, or Host +MIC. – MIC coprocessor ...
  8. [8]
    Definition: coprocessor - ComputerLanguage.com
    A secondary processor in a computer that is used to speed up operations by taking some of the workload that the main CPU would otherwise have to handle.
  9. [9]
    A numeric data processor
    **Summary of Numeric Data Processor (Coprocessor) from IEEE Document (1156144):**
  10. [10]
    About coprocessors - Arm Developer
    A coprocessor is connected to the same data bus as the ARM720T processor in the system, and tracks the pipeline in the ARM720T core.
  11. [11]
    [PDF] An Architecture for Efficient CPU/Co-processor Data Communication
    This paper presents CUBA, an architecture model where co-processors encapsulated as function calls can efficiently access their input and output data ...
  12. [12]
    Lecture 11 - The On-chip Bus environment - Patrick Schaumont
    A bus write transaction moves data from a master to a slave, and a bus read transaction moves data from a slave to a master. When two masters share the same bus ...
  13. [13]
    Busy-waiting and interrupts - Arm Developer
    For interrupt latency reasons the coprocessor might be interrupted while busy-waiting, causing the instruction to be abandoned using CPPASS. The coprocessor ...
  14. [14]
    Milestones:Intel 8087 Math Coprocessor, 1980
    Sep 29, 2025 · Coprocessors work in tandem with their host CPU, tracking the instruction stream, executing instructions intended for them, ignoring any other ...
  15. [15]
    Learn Something Old Every Day, Part VII: 8087 Intricacies
    Jan 23, 2023 · The 8087 clearly has two synchronization mechanisms. The BUSY signal is used for “long term” synchronization, and it is meant to allow the 8087 ...
  16. [16]
    [PDF] Recent Thermal Management Techniques for Microprocessors
    In the "For microprocessor cores" sub-section, we introduce recent outstanding microarchitectural thermal management techniques for microprocessor cores. The next ...
  17. [17]
    [PDF] Systems Reference Library IBM System/360 System Summary
    This publication provides basic information about the IBM System/360, with the objective of helping readers to achieve a general understanding of this new.
  18. [18]
    [PDF] Intel 8087 Math CoProcessor
    The CPU does, however, distinguish between ESC instructions that reference memory and those that do not. If the instruction refers to a memory operand, the CPU ...
  19. [19]
    [PDF] Intel 80287 Math CoProcessor - Ardent Tool of Capitalism
    For operation with the CPU clock (CKM 0), the 80287 works at one-third the frequency of the system clock (i.e., for an 8 MHz 80286, the 16 MHz system clock is ...
  20. [20]
    Intel 80287 family - CPU-World
    The Intel 80287 was produced at speeds ranging from 5 to 12 MHz. Other companies produced 16 MHz and 20 MHz versions of the FPU.
  21. [21]
    [PDF] MILITARY i387™ MATH COPROCESSOR - Ardent Tool of Capitalism
    The Intel i387 is a high-performance numerics processor extension that extends the i386 microprocessor architecture with floating point, extended integer and ...
  22. [22]
    The Birth of Pentium - Explore Intel's history
    Intel released Pentium, its fifth-generation x86 chip and the first Intel processor to be named with a word instead of a number.
  23. [23]
    The Pentium: An Architectural History of the World's Most Famous ...
    Jul 11, 2004 · It had two five-stage integer pipelines, which Intel designated U and V, and one six-stage floating-point pipeline. The chip's front-end could ...
  24. [24]
    [PDF] Intel: from the i386 to today - MJR's Computing Page
    Aug 18, 2021 · The floating point unit was an optional separate chip, the 80387. It ... issued in two halves, and its peak performance is 8 MFLOPS/MHz.
  25. [25]
    [PDF] Microsoft® C - Advanced Programming Techniques
    ... Optimization from the Command Line. 1.3. Controlling Optimization with Pragmas ... Coprocessor Option (/FPc87). 4.4.5. Use Alternate Math Option (/FPa). 4.5.
  26. [26]
    [PDF] Floating Point Case Study - Intel
    Intel 8087 Floating-Point Co-Processor Die. The double precision format requires two 4-byte storage locations in computer memory, at address and ...
  27. [27]
    Do the Math - Explore Intel's history
    The 8087 was called a "coprocessor" because it complemented rather than supplanted, and took a load off of a primary processor, improving system performance ...
  28. [28]
    [PDF] 231917-001_80387_Programmers_Reference_Manual_1987.pdf
    This manual describes the 80387 Numeric Processor Extension (NPX) for the 80386 microprocessor. Understanding the 80387 requires an understanding of the 80386 ...
  29. [29]
    The First Million-Transistor Chip: the Engineers' Story - IEEE Spectrum
    The Intel i860—called the N10 by its designers—is a 64-bit CMOS microprocessor measuring 488 square mils. It contains more than 1 million transistors. The ...
  30. [30]
    [PDF] Intel Corporation Annual Report 1987
    The 80386 was also the basis of several new Intel modules and systems introduced in 1987. We are supporting both the PC bus architecture that is standard in ...
  31. [31]
    [PDF] MC68882 .“. - NXP Semiconductors
    The M68000 Family coprocessor interface is an integral part of the MC68882 and MC68020 or MC68030 designs. The interface partitions MPU and coprocessor ...
  32. [32]
    [PDF] FAMILY - Bitsavers.org
    All functions are calculated to 80 bits of precision in hardware. The enhanced MC68882 has dual-ported registers and an advanced pipeline that allows ...
  33. [33]
    Goals and tradeoffs in the design of the MC68881 floating point ...
    The format on the MC68881 consists of 96 bits, 3 long words, with an explicit most significant mantissa bit. Only 80 bits are actually used, the other 16 bits ...
  34. [34]
    Total share: 30 years of personal computer market share figures
    Dec 14, 2005 · The PC kept soldiering on relentlessly, rising from 84% marketshare in 1990 to over 90% in 1994. However, there was still a chance for ...
  35. [35]
    AMD 80287 floating-point unit family - CPU-World
    80287 family » AMD Type: floating-point unit Introduction: 1990 Frequency (MHz): 10, 12 Sockets: DIP40 PLCC44
  36. [36]
    [PDF] WEITEK ~ - Bitsavers.org
    The WTL 3167 is pin-for-pin compatible with the WTL 1167 coprocessor daughter board. C, FORTRAN, and Pascal compilers fully support the WTL 3167, allowing ...
  37. [37]
    About CUDA | NVIDIA Developer
    Since its introduction in 2006, CUDA has been widely deployed through thousands of applications and published research papers, and supported by an installed ...
  38. [38]
    AMD Lights a Fire Under GPU Computing - HPCwire
    Apr 4, 2008 · This month AMD is preparing to make its FireStream stream computing boards and software development kit (SDK) generally available to customers.
  39. [39]
    CUDA C++ Programming Guide
    What Is the CUDA C Programming Guide? 3. Introduction. 3.1. The Benefits of Using GPUs; 3.2. CUDA®: A General-Purpose Parallel Computing Platform and ...
  40. [40]
    Unified Memory in CUDA 6 | NVIDIA Technical Blog
    Nov 18, 2013 · UVA provides a single virtual memory address space for all memory in the system, and enables pointers to be accessed from GPU code no matter ...
  41. [41]
    OpenCL for Parallel Programming of Heterogeneous Systems
    Unlike 'GPU-only' APIs, such as Vulkan, OpenCL enables use of a diverse range of accelerators including multi-core CPUs, GPUs, DSPs, FPGAs and dedicated ...
  42. [42]
    Compare Current and Previous GeForce Series of Graphics Cards
    Compare current RTX 30 series of graphics cards against former RTX 20 series, GTX 10 and 900 series. Find specs, features, supported technologies, and more.
  43. [43]
    Molecular dynamics simulations through GPU video games ...
    The application of parallel programming using GPUs in MD simulations has the significant benefit of a time cost significantly reduced by many times, as compared ...
  44. [44]
    GPU Usage in Cryptocurrency Mining - Investopedia
    Oct 31, 2024 · GPU-based mining offered the benefit of processing simple instructions in parallel with more cores, which made them much more efficient than CPUs.
  45. [45]
    BCM88690 - Broadcom Inc.
    Jericho2 is the world's first to provide 10 Tb/s packet processing per device at a high interface bandwidth while integrating a scalable multi-terabit switch ...
  46. [46]
    [PDF] BCM88480 800-Gb/s Integrated Packet Processor and Traffic ...
    One-step clock features: – On-the-fly egress packet modification including UDP checksum update and CRC update. – All modifications to the correction field ...
  47. [47]
    Ethernet Switches | Network Chips | Merchant Silicon | Jericho
    Jericho3 is a 28.8 Tb/s scalable router with high port density, a programmable packet processor, and a multi-terabit switch fabric, supporting 100-800 Gb/s ...
  48. [48]
    [PDF] BCM88690 9.6-Tb/s Integrated Packet Processor, Traffic Manager ...
    The BCM88690 device (also known as Jericho2) processes 4.8-Tb/s traffic at packet sizes above 284B and supports up to twelve 400GbE full-duplex ports ...
  49. [49]
    IBM PCIe Cryptographic Coprocessors
    Delivers high-speed cryptographic functions for data encryption and digital signing, secure storage of signing keys or custom cryptographic applications.
  50. [50]
    [PDF] IBM Power E1080 Technical Overview and Introduction
    Nov 15, 2024 · Power10 processor technology is engineered to achieve faster encryption performance with quadruple the number of AES encryption engines. In ...
  51. [51]
    CEX7S / 4769 Overview - IBM
    The IBM 4769 Cryptographic Coprocessor is the latest generation and fastest of IBM's PCIe hardware security modules (HSMs).
  52. [52]
    An in-depth look at Google's first Tensor Processing Unit (TPU)
    May 12, 2017 · The TPU Matrix Multiplication Unit has a systolic array mechanism that contains 256 × 256 = 65,536 ALUs in total. That means a TPU can process ...
  53. [53]
    In-Datacenter Performance Analysis of a Tensor Processing Unit
    Apr 16, 2017 · Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
  54. [54]
    Performance per dollar of GPUs and TPUs for AI inference
    Sep 12, 2023 · Each TPU v5e chip provides up to 393 trillion int8 operations per second (TOPS), allowing fast predictions for the most complex models.
  55. [55]
    [PDF] Designing Power-Efficient Systems-on-Chip (SoCs) for AI-Driven ...
    Jul 29, 2025 · As an example, the Apple A16 Bionic with its 16-core Neural Engine exhibited a steady-state AI throughput of about 17 TOPS at less than 1W in ...
  56. [56]
    [PDF] Profiling Apple Silicon Performance for ML Training - arXiv
    Jan 28, 2025 · This design not only simplifies data processing but also supports shared memory between processing units, significantly reducing latency and ...
  57. [57]
    [PDF] The Benefits of Multiple CPU Cores in Mobile Devices | NVIDIA
    By using multiple cores the CPUs of today can complete more work faster, and at lower power, than their single core predecessors. Mobile processors are ...
  58. [58]
    The Arm CoreLink CCI-550 Cache Coherent Interconnect
    Hardware coherency enables shared virtual memory and removes the need for time-consuming software-managed cache maintenance.
  59. [59]
    Qualcomm Hexagon SDK 3.0 – DSP power and efficiency
    Sep 15, 2016 · 1. Increased processing power: Up to 1024 bits of data per clock cycle, simultaneous processing · 2. Improved compute efficiency: Streaming that ...
  60. [60]
    2024 irds executive packaging tutorial—part 1
    Heterogeneous Integration—The future of SoC packaging is bound to ... dissipation pathways, and potential reliability issues from thermal stress.
  61. [61]
    OASIS: A Commercial High Performance Terminal AI Processor ...
    Oct 17, 2025 · A key architectural challenge in heterogeneous SoCs is that when a high-throughput AI accelerator is integrated with general-purpose CPU cores, ...
  62. [62]
    Intel Editorial: Intel's New Self-Learning Chip Promises to Accelerate ...
    Sep 25, 2017 · Intel introduces the Loihi test chip, a first-of-its-kind self-learning neuromorphic chip that mimics how the brain functions by learning to ...
  63. [63]
    [PDF] Loihi: A Neuromorphic Manycore Processor with On-Chip Learning
    Loihi is a 60 mm² chip fabricated in Intel's 14-nm process that advances the state-of-the-art modeling of spiking neural networks in silicon.
  64. [64]
    Neuromorphic Computing and Engineering with AI | Intel®
    Loihi 2 neuromorphic processors focus on sparse event-driven computation that minimizes activity and data movement. The processors apply brain-inspired ...
  65. [65]
    Global IoT connections to reach 50 billion by 2030: study
    May 20, 2019 · The number of devices connected to the internet is expected to reach 50 billion worldwide at the end of 2030, according to the latest research from Strategy ...
  66. [66]
    Number of connected IoT devices growing 14% to 21.1 billion globally
    Oct 28, 2025 · Number of connected IoT devices growing 14% to 21.1 billion globally in 2025. Estimated to reach 39 billion in 2030, a CAGR of 13.2% [...]
  67. [67]
  68. [68]
    What Is Quantum Computing? | IBM
    Quantum computing is a rapidly-emerging technology that harnesses the laws of quantum mechanics to solve problems too complex for classical computers.
  69. [69]
    MIT engineers grow “high-rise” 3D chips
    Dec 18, 2024 · MIT engineers have developed a method to seamlessly stack electronic layers to create faster, denser, more powerful computer chips.
  70. [70]
    3D-Stacked Processor Market Size, Report by 2034
    Oct 7, 2025 · 3D stacking delivers significantly higher on-package bandwidth, lower inter-die latency, improved performance-per-watt, and denser form factors ...
  71. [71]
    Gartner Says AI-Optimized IaaS Is Poised to Become the Next ...
    Oct 15, 2025 · In 2026, 55% of AI-optimized IaaS spending will support inference workloads and it is projected to reach more than 65% in 2029.
  72. [72]
    Gartner's Technology Trends for 2025 – Energy-Efficient Computing
    In October 2024, Gartner released its review of the top ten technology trends for 2025. For the first time, energy-efficient computing appeared on the list, ...