
Heterogeneous computing

Heterogeneous computing is a computational paradigm that integrates diverse processing units, such as central processing units (CPUs), graphics processing units (GPUs), and field-programmable gate arrays (FPGAs), within a unified system to execute applications by assigning subtasks to the most appropriate hardware for optimal efficiency. This approach leverages the unique strengths of each component, such as the general-purpose sequential processing of CPUs, the parallel throughput of GPUs, and the customizable logic of FPGAs, to address varied workload requirements that exceed the capabilities of homogeneous systems. The paradigm has evolved significantly since the 1990s, driven by advances in interconnect technologies and the demand for high performance-per-watt in domains like high-performance computing (HPC), mobile devices, and cloud infrastructure.

In HPC environments, heterogeneous systems accelerate complex simulations in fields such as scientific modeling and data analytics by combining distributed clusters with accelerators. Similarly, modern mobile system-on-chip (SoC) designs incorporate heterogeneous cores to balance energy efficiency for everyday tasks with bursts of performance for demanding workloads. Benefits include up to orders-of-magnitude improvements in execution speed and resource utilization compared to single-architecture setups, though these gains depend on effective task scheduling and load balancing.

Central to heterogeneous computing are programming models and tools that enable seamless task offloading and data movement across disparate hardware, such as hybrid combinations of the Message Passing Interface (MPI) for distributed coordination and CUDA for GPU acceleration. Other frameworks like OpenCL provide cross-platform portability for accelerators. Key challenges include optimizing load balancing to account for varying processor speeds, minimizing communication overhead in data transfers, and ensuring reliability in large-scale deployments. Ongoing research focuses on automated refactoring tools and unified instruction sets to simplify development and enhance scalability in emerging applications like edge computing and AI inference.

Fundamentals

Definition and Motivation

Heterogeneous computing encompasses systems that integrate multiple distinct types of processors or cores, such as central processing units (CPUs), graphics processing units (GPUs), and field-programmable gate arrays (FPGAs), each featuring differing architectures, instruction sets, or optimization focuses to deliver superior overall system performance or energy efficiency. This paradigm shifts away from uniform processing environments by leveraging specialized hardware to handle diverse computational tasks more effectively, allowing workloads to be distributed across the components best suited for specific operations. The primary motivation for adopting heterogeneous computing arises from the limitations of homogeneous systems, particularly the breakdown of Dennard scaling, which had maintained constant power density as transistors shrank, and the deceleration of Moore's Law, which has curtailed exponential performance improvements through transistor density alone. These constraints have necessitated alternative strategies to sustain computational growth, enabling task-specific acceleration where, for instance, GPUs excel at floating-point computations while CPUs manage sequential control flows. Key benefits include enhanced throughput for mixed workloads that combine serial and parallel elements, reduced power consumption critical for battery-constrained devices such as smartphones, and improved scalability to address varying computational demands across applications like scientific simulations and machine learning. Unlike homogeneous computing, which relies on identical processors for uniformity and simplicity, heterogeneous systems emphasize specialization to optimize real-world efficiency and performance.

Historical Development

The roots of heterogeneous computing trace back to the 1940s with the development of early electronic computers like ENIAC, completed in 1945, which featured specialized hardware units dedicated to arithmetic operations and control functions to handle diverse workloads efficiently. This concept evolved through the postwar era, as computing systems incorporated varied processing elements to optimize performance for scientific calculations. By the 1970s and 1980s, supercomputers exemplified this progression with the introduction of vector processors, such as the CDC STAR-100 in 1974 and the Cray-1 in 1976, which combined scalar and vector processing units to accelerate numerical computations in high-performance environments.

The 1990s saw the formal emergence of heterogeneous computing systems, driven by advances in networked machines and research into workload partitioning across diverse architectures. A seminal work in this period was the 1994 report by Siegel et al. from Purdue University, which defined heterogeneous computing as the orchestrated use of varied processors and networks to maximize application performance, laying foundational principles for mixed-mode machines. This era focused on integrating disparate hardware suites, including early distributed systems, to address the limitations of homogeneous setups in handling complex, parallel tasks.

Adoption accelerated in the 2000s through the rise of graphics processing units (GPUs) for general-purpose tasks, with NVIDIA's introduction of CUDA in 2006 enabling programmable GPGPU computing and unlocking massive parallelism for non-graphics applications across thousands of cores. Concurrently, multi-core CPUs began incorporating integrated graphics, as seen in Intel and AMD designs from the late 2000s and early 2010s, blending CPU and GPU capabilities on single chips to improve power and computational efficiency. In the 2010s, standardization efforts solidified heterogeneous paradigms, including ARM's big.LITTLE architecture announced in October 2011, which paired high-performance "big" cores with energy-efficient "LITTLE" cores for mobile devices to balance power and performance. Similarly, the Heterogeneous System Architecture (HSA) Foundation was formed in June 2012 by AMD, ARM, and others, promoting unified memory and programming models for CPU-GPU integration.

The 2020s have been shaped by artificial intelligence and machine learning demands, with widespread adoption of specialized accelerators like Google's Tensor Processing Units (TPUs), first deployed internally in 2015 and publicly available via Google Cloud in 2018, but seeing explosive growth post-2020 for training large neural networks. Chiplet-based designs have further advanced heterogeneity, as in AMD's EPYC processors starting with the first generation in 2017 and expanding through multi-chiplet configurations by 2024, alongside Intel's adoption of tile-based chiplets in Meteor Lake (Core Ultra) in 2023 and subsequent generations up to 2025, enabling scalable integration of diverse compute tiles. These trends, amplified by the growth of deep learning since 2015, reflect external pressures from AI/ML workloads requiring efficient, distributed processing across heterogeneous hardware.

Types of Heterogeneity

Processor Heterogeneity

Processor heterogeneity refers to the diversity in the design and capabilities of individual processing units within a system, enabling specialized handling of workloads by leveraging different architectural strengths. This variation at the compute element level allows systems to optimize for specific tasks, such as sequential control flow or data-parallel computation, without relying on uniform cores.

Processors in heterogeneous computing are classified into several types based on their design and optimization focus. General-purpose processors, such as central processing units (CPUs) based on x86 or ARM architectures, handle sequential and control-intensive tasks efficiently. Accelerators include graphics processing units (GPUs), which excel in single instruction, multiple data (SIMD) parallelism for tasks like graphics rendering and scientific simulations, and digital signal processors (DSPs), optimized for real-time signal processing in applications such as audio and video. Reconfigurable processors, like field-programmable gate arrays (FPGAs), allow custom logic implementation post-manufacturing to adapt to varying computational needs. Domain-specific processors encompass application-specific integrated circuits (ASICs) and neural processing units (NPUs), tailored for particular domains such as machine learning inference, where NPUs accelerate matrix operations in deep learning models.

Key characteristics of these processors stem from differences in their instruction set architectures (ISAs) and specialized extensions. ISAs vary between complex instruction set computing (CISC), as in x86 CPUs, which support variable-length instructions for denser code, and reduced instruction set computing (RISC), as in ARM CPUs, emphasizing fixed-length, simpler instructions for faster execution. Vector extensions further highlight heterogeneity; for instance, Intel's Advanced Vector Extensions (AVX) in x86 CPUs enable 256-bit or wider SIMD operations for data-parallel tasks on general-purpose cores, while NVIDIA GPUs incorporate tensor cores for accelerated mixed-precision matrix multiply-accumulate operations critical to AI workloads.

Heterogeneity also manifests in core topologies, where chips integrate diverse core designs to balance performance and efficiency. Asymmetric multi-core architectures, such as ARM's big.LITTLE, combine high-performance "big" cores (e.g., Cortex-A78) for demanding tasks with energy-efficient "little" cores (e.g., Cortex-A55) for lighter workloads, allowing dynamic task migration to optimize power usage. Heterogeneous multi-threading extends this by enabling threads to execute across cores with varying capabilities, improving resource utilization in multi-core environments.

Metrics for evaluating processor heterogeneity emphasize trade-offs in efficiency and performance. Compute density, often measured as floating-point operations per second (FLOPS) per watt, quantifies energy efficiency; for example, GPUs achieve higher FLOPS/W than CPUs due to their massively parallel design, making them suitable for throughput-oriented tasks. Latency versus throughput trade-offs are another key metric, with CPUs prioritizing low-latency single-thread execution and GPUs favoring high-throughput parallel execution. Synchronization primitives, such as barriers and atomics tailored to each processor type (e.g., GPU-specific events versus CPU mutexes), address coordination challenges but introduce overheads unique to heterogeneous setups.
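To make the role of CPU vector extensions concrete, the sketch below uses AVX intrinsics to apply a scale-and-add operation eight floats at a time, with a scalar loop handling the remainder. It is a minimal illustration, assuming an x86 processor with AVX support and a compiler flag such as -mavx; the function name and array sizes are arbitrary, and the code is not tuned for production use.

```cpp
#include <immintrin.h>  // AVX intrinsics (x86 only)
#include <cstdio>

// y[i] = a * x[i] + y[i], processing 8 floats per AVX instruction.
void saxpy_avx(float a, const float* x, float* y, int n) {
    __m256 va = _mm256_set1_ps(a);              // broadcast the scalar into 8 lanes
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);     // unaligned load of 8 floats
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_add_ps(_mm256_mul_ps(va, vx), vy);
        _mm256_storeu_ps(y + i, vy);
    }
    for (; i < n; ++i)                          // scalar tail for leftover elements
        y[i] = a * x[i] + y[i];
}

int main() {
    float x[20], y[20];
    for (int i = 0; i < 20; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy_avx(3.0f, x, y, 20);
    std::printf("y[0] = %.1f, y[19] = %.1f\n", y[0], y[19]);  // both print 5.0
}
```

The same loop targeted at a GPU would instead be expressed as thousands of independent threads rather than 8-wide vector lanes, which is the architectural distinction the FLOPS-per-watt and latency-versus-throughput comparisons above capture.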

System-Level Heterogeneity

System-level heterogeneity in heterogeneous computing systems extends beyond individual processors to encompass the diverse interactions among memory subsystems, interconnect fabrics, and input/output (I/O) peripherals, which collectively influence data movement, latency, and overall efficiency. In such systems, components with varying architectures must interoperate seamlessly to avoid performance degradation, yet their differences often introduce complexities in resource sharing and communication.

Memory heterogeneity manifests in disparate access models that range from unified, coherent shared memory to isolated address spaces necessitating explicit data transfers. The Heterogeneous System Architecture (HSA) enables a unified memory model where CPUs and GPUs share a single virtual address space with hardware-enforced coherence, allowing transparent data access without manual copying and reducing programming overhead. In contrast, traditional discrete setups rely on separate memory spaces, requiring explicit transfers over interconnects like PCIe, where bandwidth limitations (such as PCIe Gen 3's theoretical maximum of approximately 32 GB/s bidirectional, or about 16 GB/s per direction, for x16 lanes) can bottleneck data-intensive workloads by imposing significant latency and throughput constraints. These models highlight the trade-offs: coherent shared memory simplifies development but demands sophisticated hardware support, while discrete address spaces offer flexibility at the cost of developer-managed data movement.

Interconnect variations further amplify system-level diversity, spanning on-chip buses to high-speed off-chip links tailored for heterogeneous integration. On-chip interconnects like the Advanced Microcontroller Bus Architecture (AMBA) in ARM-based systems-on-chip (SoCs) facilitate efficient communication among heterogeneous IP blocks, such as CPUs, GPUs, and accelerators, by providing scalable protocols like AXI for high-bandwidth bursts and APB for low-power peripherals. Off-chip links, such as NVIDIA's NVLink, deliver up to 900 GB/s of bidirectional bandwidth between GPUs and CPUs, enabling low-latency data sharing in multi-GPU configurations far exceeding PCIe capabilities. Emerging standards like Compute Express Link (CXL), introduced post-2020, extend PCIe with cache-coherent protocols for memory expansion and accelerator attachment, supporting pooled memory resources across devices with latencies typically around 100-250 ns.

I/O and peripheral diversity introduces additional heterogeneity, particularly in embedded systems where components like USB controllers, network interfaces, and sensors integrate via varied interfaces, creating non-uniform data flow paths. In distributed networks, peripherals such as USB for high-speed connectivity and Ethernet for networking must bridge heterogeneous microcontrollers, often requiring bridges to manage differing voltage levels, timing, and protocol needs, which can lead to synchronization challenges in real-time applications.

These elements culminate in system-wide implications, including bandwidth bottlenecks that arise from mismatched interconnect capacities, such as PCIe limitations constraining GPU utilization in heterogeneous clusters, and coherence protocols extended for diverse caches. Protocols like MESI, augmented with directory-based mechanisms for heterogeneous systems, track cache states (Modified, Exclusive, Shared, Invalid) across topologies to maintain consistency, though they incur overhead from snoop traffic in large-scale setups. Power delivery differences across components, where accelerators demand peak currents up to hundreds of amperes versus CPUs' more stable profiles, necessitate adaptive voltage regulators and dynamic allocation to prevent thermal throttling and ensure reliability in integrated heterogeneous platforms.
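As a rough illustration of how interconnect latency and bandwidth shape offloading decisions, the sketch below estimates whether moving a task to a discrete accelerator pays off. The cost model, the link parameters, and the runtimes are illustrative assumptions rather than measurements of any particular system.

```cpp
#include <cstdio>

// Simple cost model: offloading wins only if the accelerator's compute-time
// savings outweigh the time spent moving data across the interconnect.
struct Link { double latency_s; double bandwidth_GBps; };

double transfer_time(double bytes, const Link& link) {
    return link.latency_s + bytes / (link.bandwidth_GBps * 1e9);
}

int main() {
    Link pcie_gen3{5e-6, 16.0};   // assumed ~5 us latency, ~16 GB/s per direction
    double bytes_in  = 256e6;     // 256 MB of input data
    double bytes_out = 256e6;     // 256 MB of results
    double cpu_time  = 2.0;       // hypothetical host-only runtime (s)
    double gpu_time  = 0.2;       // hypothetical kernel runtime on the accelerator (s)

    double offload = transfer_time(bytes_in, pcie_gen3)
                   + gpu_time
                   + transfer_time(bytes_out, pcie_gen3);

    std::printf("host: %.3f s, offload: %.3f s -> %s\n",
                cpu_time, offload, offload < cpu_time ? "offload" : "stay on CPU");
}
```

For large, compute-heavy kernels the transfer terms are negligible, but for small or bandwidth-bound tasks they dominate, which is why coherent shared-memory designs avoid the copies altogether.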

Architectures and Hardware

Integrated Architectures

Integrated architectures in heterogeneous computing combine diverse processing elements, such as CPUs, GPUs, and specialized accelerators, onto a single chip or tightly coupled package to enable efficient resource sharing and low-latency communication. These designs, often realized through system-on-chip (SoC) methodologies, prioritize power efficiency and seamless data movement, contrasting with modular discrete systems by minimizing interconnect overhead. By colocating components, integrated architectures facilitate unified memory access and optimized task scheduling, making them well suited to mobile, embedded, and high-performance applications where power and latency constraints are critical.

A prominent example of processor heterogeneity in integrated designs is ARM's big.LITTLE architecture, which combines high-performance "big" cores with energy-efficient "LITTLE" cores on the same die to dynamically balance workloads. Qualcomm adopted this approach in its Snapdragon series starting with the Snapdragon S4 in 2012, enabling adaptive power management for mobile devices by switching between core types based on demand. Similarly, AMD's Accelerated Processing Units (APUs), introduced in 2011 with the Fusion lineup, fuse x86 CPU cores and GPU cores on a single die, allowing shared execution of compute-intensive tasks like graphics and general-purpose computing without external data transfers.

Advanced packaging techniques, such as chiplet-based integration, extend these principles to scale heterogeneous components across multiple dies within a single package, enhancing modularity while preserving tight coupling. AMD pioneered this in its first-generation EPYC processors launched in 2017, employing multiple CPU core chiplets linked by Infinity Fabric interconnects to deliver up to 32 cores, with later generations connecting the core chiplets to a central I/O die for high-bandwidth access. Intel advanced this further with the Ponte Vecchio GPU, released in 2023 as part of its Data Center GPU Max series, which comprises 47 tiles, including compute, cache, and I/O tiles, fabricated on multiple process nodes and stacked using advanced 3D packaging for HPC and AI workloads.

Unified memory and high-speed interconnects are hallmarks of these architectures, enabling processors to share a common address space and reduce data copying overhead. AMD's Vega architecture, introduced in 2017, complies with the Heterogeneous System Architecture (HSA) standard, allowing CPUs and GPUs to access a unified virtual address space coherently and supporting pointer-based data sharing across heterogeneous elements. This integration yields significant latency reductions compared to discrete CPU-GPU setups reliant on PCIe transfers.

Domain-specific integrations further tailor these architectures for targeted efficiency, as seen in Apple's M-series chips debuting with the M1 in 2020. These SoCs unify ARM-based CPU cores, GPU cores, and a dedicated Neural Engine for machine learning on a single die, leveraging a unified memory architecture to streamline graphics, machine learning, and general-purpose tasks. In mobile contexts, the M-series delivers 2-5x better power efficiency than comparable discrete GPU solutions, with the M1 Pro achieving up to 70% lower power consumption at equivalent performance.

Discrete Component Systems

Discrete component systems in heterogeneous computing involve modular hardware configurations where distinct processors, such as central processing units (CPUs) and accelerators, are connected via external interconnects, enabling scalability and independent upgrades without replacing the entire system. These setups prioritize flexibility for high-performance computing (HPC) environments, allowing users to pair general-purpose CPUs with specialized accelerators like graphics processing units (GPUs) or field-programmable gate arrays (FPGAs) to handle diverse workloads efficiently. Common configurations feature CPU motherboards augmented with add-in GPUs via Peripheral Component Interconnect Express (PCIe) slots, exemplified by the NVIDIA A100 Tensor Core GPU, which operates as a PCIe Gen4 card providing roughly 31.5 GB/s of bandwidth in each direction over an x16 connection for AI and HPC tasks. Another influential example is the Intel Xeon Phi coprocessor, based on the Many Integrated Core (MIC) architecture, which integrated up to 61 x86 cores on a single card connected via PCIe to accelerate parallel workloads, though it was discontinued in 2020 due to market shifts toward GPU dominance. These discrete additions allow systems to offload compute-intensive operations from the host CPU while maintaining modularity for future enhancements.

Interconnects play a critical role in these systems, with PCIe standards enabling intra-node communication and higher-speed fabrics like InfiniBand supporting inter-node scaling in HPC clusters. PCIe Gen5, finalized in 2019 and widely adopted by 2021, delivers approximately 64 GB/s per direction for an x16 link at 32 GT/s per lane, facilitating faster data transfer between CPUs and accelerators compared to prior generations. For larger-scale deployments, InfiniBand provides low-latency, high-throughput networking, often exceeding 200 Gb/s per port, as seen in GPU clusters within supercomputers like Summit, which combines 9,216 IBM Power9 CPUs with 27,648 NVIDIA V100 GPUs across 4,608 nodes using a custom high-speed interconnect for over 200 petaflops of performance since its 2018 deployment. Hybrid setups extend this modularity by integrating CPUs with discrete FPGA cards for reconfigurable acceleration, such as Xilinx's Alveo series introduced in 2018, which leverages UltraScale+ FPGAs on PCIe cards to customize hardware logic for specific algorithms, offering advantages in adaptability over fixed-function GPUs. However, these configurations face challenges from interconnect bottlenecks, with PCIe x16 links typically limited to about 16 GB/s per direction for Gen3 or 32 GB/s for Gen4, potentially constraining data movement in bandwidth-sensitive applications.

Recent advancements highlight discrete accelerators in AI servers, including Google's Trillium (sixth-generation TPU) pods, generally available since December 2024, which deploy tensor processing units as modular components scalable to thousands of chips for efficient training via high-bandwidth interconnects. Similarly, Intel's Habana Gaudi 3 processors, generally available since 2024 with PCIe Gen5 cards available since May 2025, provide AI acceleration with up to 1,835 teraflops of FP8 matrix performance per card, emphasizing cost-effective scaling in heterogeneous server environments. NVIDIA's Blackwell GPUs, released in 2024, offer enhanced discrete acceleration for AI and HPC with up to 20 petaflops of FP4 performance per GPU in PCIe form factors.
In contrast to integrated architectures that prioritize power efficiency through on-package integration, discrete systems excel in upgradability for evolving computational demands.
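As a back-of-the-envelope check on the link rates quoted above, the snippet below computes theoretical per-direction PCIe bandwidth from lane count, transfer rate, and the 128b/130b encoding overhead used from Gen3 onward. The results are theoretical maxima, not delivered throughput, and real links lose additional capacity to protocol overhead.

```cpp
#include <cstdio>

// Theoretical per-direction PCIe bandwidth in GB/s:
//   lanes * transfer rate (GT/s) * encoding efficiency / 8 bits per byte
double pcie_GBps(int lanes, double gt_per_s, double encoding_efficiency) {
    return lanes * gt_per_s * encoding_efficiency / 8.0;
}

int main() {
    // Gen3 runs at 8 GT/s with 128b/130b encoding; Gen4 doubles the rate to
    // 16 GT/s and Gen5 doubles it again to 32 GT/s with the same encoding.
    const double enc = 128.0 / 130.0;
    std::printf("PCIe Gen3 x16: %.1f GB/s per direction\n", pcie_GBps(16,  8.0, enc));
    std::printf("PCIe Gen4 x16: %.1f GB/s per direction\n", pcie_GBps(16, 16.0, enc));
    std::printf("PCIe Gen5 x16: %.1f GB/s per direction\n", pcie_GBps(16, 32.0, enc));
}
```

Running this yields roughly 15.8, 31.5, and 63.0 GB/s per direction, matching the generation-over-generation doubling described in the text.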

Programming Models

Vendor-Specific Approaches

NVIDIA introduced CUDA in November 2006 as a proprietary extension to C/C++ that enables developers to write parallel kernels for execution on GPUs, providing a vendor-optimized model for heterogeneous computing by abstracting GPU hardware complexities. Key features include a thread hierarchy organized into blocks and grids, where threads within a block can synchronize and share data efficiently, allowing scalable parallelism tailored to NVIDIA's streaming multiprocessor architecture. CUDA's memory model distinguishes between global memory for large-scale data access across the GPU and shared memory for fast, low-latency communication within blocks, optimizing data locality in heterogeneous workloads. The performance model relies on warp scheduling, where groups of 32 threads execute in lockstep on the GPU, enabling vendor-specific tuning for high-throughput computations like simulations and graphics rendering.

AMD launched ROCm in 2016 as an open-source software stack designed for programming AMD GPUs and accelerated processing units (APUs) in heterogeneous environments, emphasizing portability within AMD ecosystems through layered components like runtime libraries and compilers. Central to ROCm is the Heterogeneous-compute Interface for Portability (HIP), a CUDA-like API whose accompanying translation tools convert CUDA code to AMD targets, facilitating migration of GPU-accelerated applications while preserving vendor-specific optimizations for AMD hardware. The stack includes domain-specific libraries such as MIOpen, which provides primitives for machine learning operations like convolutions and matrix multiplications, accelerated for AMD's compute architectures to deliver high performance in AI training and inference.

Intel's oneAPI, evolving from the SYCL and Data Parallel C++ (DPC++) initiatives, offers a unified programming model based on ISO C++ standards extended for heterogeneous execution across CPUs, GPUs, and FPGAs, with a focus on single-source code that avoids maintaining separate implementations per platform. It incorporates Unified Shared Memory (USM) to simplify memory management by allowing pointers to address memory coherently across host and device, reducing explicit data transfers in heterogeneous applications. Offloading constructs in DPC++ enable selective execution on accelerators via simple annotations, like parallel_for for data-parallel kernels, optimizing for Intel's diverse hardware without requiring separate code paths.

The Arm Compute Library provides a collection of optimized functions for machine learning and computer vision, tailored for Arm-based heterogeneous systems including Mali GPUs and big.LITTLE CPU configurations, prioritizing efficiency in power-constrained mobile and embedded devices. It leverages NEON intrinsics for SIMD operations on CPUs, enabling fine-grained optimizations like fused multiply-add instructions to accelerate tensor manipulations and image processing in heterogeneous workloads. For GPUs, the library includes OpenCL-based kernels that exploit Mali's tile-based architecture, delivering vendor-specific performance gains in real-time mobile applications.

Vendor-specific extensions further enhance these models for specialized tasks; for instance, NVIDIA's Tensor Cores, introduced in the 2017 Volta architecture, accelerate matrix computations through dedicated hardware for FP16 matrix multiply-accumulate operations, performing 4x4 matrix multiplications with FP32 accumulation to achieve up to 125 TFLOPS on V100 GPUs. These extensions integrate seamlessly with CUDA, allowing developers to invoke mixed-precision kernels via PTX instructions for optimized heterogeneous training of neural networks.
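The single-source style that oneAPI's DPC++ promotes can be illustrated with a small SYCL sketch using Unified Shared Memory. This is a minimal example assuming a SYCL 2020 compiler such as Intel's icpx with the -fsycl flag; the vector size and kernel are arbitrary illustrations rather than a recommended pattern.

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    // Default queue: picks a GPU if present, otherwise falls back to the CPU.
    sycl::queue q;

    constexpr size_t n = 1 << 20;

    // Unified Shared Memory: one allocation visible to both host and device code.
    float* x = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) x[i] = 1.0f;

    // Offload a data-parallel kernel; no explicit host<->device copies are written.
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> idx) {
        const size_t i = idx[0];
        x[i] = 2.0f * x[i] + 1.0f;
    }).wait();

    std::printf("x[0] = %.1f, device = %s\n", x[0],
                q.get_device().get_info<sycl::info::device::name>().c_str());
    sycl::free(x, q);
}
```

The same source compiles for CPUs, GPUs, or FPGAs supported by the toolchain, with the runtime handling page migration for the shared allocation.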

Cross-Platform Standards

Cross-platform standards in heterogeneous computing provide open, portable programming models that enable developers to write code once and deploy it across diverse hardware from multiple vendors, abstracting away low-level differences in processors like CPUs, GPUs, and FPGAs. These standards promote portability and interoperability by defining common APIs, memory models, and execution semantics, reducing the need for vendor-specific optimizations while maintaining reasonable performance portability. Key examples include OpenCL, OpenMP offloading extensions, and the Heterogeneous System Architecture (HSA) runtime, alongside emerging APIs like Vulkan Compute and WebGPU.

OpenCL, developed by the Khronos Group, is an open, royalty-free standard introduced in 2009 for parallel programming on heterogeneous platforms including CPUs, GPUs, and FPGAs. It employs a kernel-based model where developers write parallel compute kernels in an extension of C or C++, which are executed on accelerators via command queues that manage asynchronous operations. Host-device data transfer is handled through buffers and images, allowing efficient data sharing without explicit copying in some cases. Implementations from vendors such as Intel, AMD, and NVIDIA ensure broad support, enabling kernels to run across their respective hardware with minimal modifications.

OpenMP 5.0 and later versions, released in November 2018 by the OpenMP Architecture Review Board, extend the directive-based model to support heterogeneous offloading to accelerators like GPUs. Key features include the target directive for offloading code regions and data to devices, along with target data constructs for managing memory mappings and transfers. These extensions incorporate tasking constructs for asynchronous execution and reduction clauses to aggregate results efficiently across host and device. By standardizing these mechanisms, OpenMP facilitates portable code that compiles and runs on diverse architectures without vendor-specific rewrites.

The HSA Runtime, specified by the HSA Foundation starting in 2012, defines a unified programming interface for coherent heterogeneous systems, emphasizing seamless integration between CPUs and GPUs. It provides a unified virtual address space for memory access, eliminating much of the explicit data copying required in other models. Lightweight messaging enables low-latency communication between agents, while pipe constructs support streaming data flows for producer-consumer patterns in compute pipelines. This runtime promotes efficient resource sharing across heterogeneous components from multiple vendors.

Emerging standards build on these foundations for specialized environments. Vulkan Compute, part of the Vulkan API released by the Khronos Group in 2016, offers a low-level interface for GPU compute shaders, allowing explicit control over memory and execution for high-performance parallel tasks. WebGPU, which reached Candidate Recommendation status with the W3C in December 2024, enables browser-based heterogeneous computing by mapping to native APIs like Vulkan, Metal, and Direct3D 12, supporting GPU acceleration for web applications including machine learning and graphics.

These standards deliver significant portability benefits, such as writing a single code base that deploys across Intel, AMD, and NVIDIA hardware with only minor tweaks for optimal performance, as evidenced by OpenCL's cross-vendor conformance. For instance, OpenMP offloading directives allow scientific codes to target accelerators from different vendors without rewriting core logic. Overall, they abstract hardware heterogeneity, fostering ecosystem-wide adoption while vendor-specific approaches handle deeper optimizations where needed.
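A minimal sketch of OpenMP's directive-based offloading model follows. It assumes a compiler built with offloading support (for example, Clang or NVIDIA's nvc++ with the appropriate target flags); when no device is present, the region simply executes on the host. The array names and sizes are arbitrary.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
    float *pa = a.data(), *pb = b.data(), *pc = c.data();

    // Offload the loop to an attached accelerator if one is available.
    // The map clauses describe host-to-device and device-to-host data movement.
    #pragma omp target teams distribute parallel for \
            map(to: pa[0:n], pb[0:n]) map(from: pc[0:n])
    for (int i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];

    std::printf("c[0] = %.1f\n", pc[0]);   // prints 3.0
}
```

Because the directives are standardized, the same source can target GPUs from different vendors by switching compiler back ends rather than rewriting the loop.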

Challenges and Solutions

Performance and Resource Management

Achieving efficient performance in heterogeneous computing systems requires careful workload distribution and resource management across diverse processing units, such as CPUs and GPUs, to maximize throughput while minimizing overheads. Load balancing addresses the challenge of tasks being distributed unevenly due to varying processor capabilities and workloads. Task partitioning algorithms aim to minimize makespan, defined as the time from task initiation to overall completion, by dividing workloads optimally among heterogeneous resources. Static scheduling pre-allocates tasks based on prior knowledge of system characteristics, offering low overhead but risking load imbalances if runtime conditions deviate from estimates. In contrast, dynamic scheduling uses runtime feedback to monitor execution times and adjust task assignments on the fly, potentially outperforming static methods by adapting to variability, with reported improvements of up to 9.6% in execution speed compared to optimal static partitions. These algorithms often integrate heuristics such as min-min selection to prioritize critical paths, ensuring balanced utilization without excessive reconfiguration costs.

Data movement between heterogeneous components introduces significant overheads, particularly in non-coherent systems where explicit transfers are required. In setups relying on interconnects like PCIe, bandwidth limitations and latency, typically in the range of hundreds of cycles for small packets, can dominate transfer time, especially when transfer volumes are low relative to processing needs. For instance, PCIe copy operations for under 512 KB often fail to saturate available bandwidth, leading to stalls that degrade overall performance. To mitigate this, prefetching techniques anticipate data requirements and initiate transfers early, reducing effective latency by overlapping data movement with computation. Complementary caching hierarchies, spanning local accelerator memory to shared system memory, further alleviate overheads by localizing data access and minimizing cross-component traversals through tiered storage policies.

Power and thermal management in heterogeneous systems leverage techniques like dynamic voltage and frequency scaling (DVFS) to adapt core operating points based on workload demands, trading performance for energy efficiency across diverse architectures. DVFS enables fine-grained adjustments to voltage and frequency on individual cores or clusters, reducing power consumption quadratically with voltage while preserving throughput for lighter tasks. In architectures such as ARM's big.LITTLE, which pairs high-performance "big" cores with energy-efficient "little" ones, task migration policies switch execution between clusters to optimize for varying loads, achieving energy savings of 20-30% in mobile workloads without substantial performance loss. These policies monitor thermal constraints and utilization to prevent hotspots, ensuring sustained operation in power-limited environments like embedded devices.

Performance modeling tools and frameworks provide insights into resource utilization by quantifying bottlenecks in heterogeneous setups. The Roofline model visualizes attainable performance as a function of arithmetic intensity, the ratio of computational operations to memory accesses, plotted against peak floating-point operations per second (FLOPS) per unit, revealing whether applications are compute- or memory-bound. Adaptations for heterogeneity extend this by incorporating separate rooflines for each accelerator type, such as GPUs with high peak FLOPS but limited bandwidth, to guide optimizations like increasing data reuse.
Profiling tools like NVIDIA Nsight Systems complement these models by tracing CPU-GPU interactions, measuring metrics such as kernel launch latencies and memory transfers to identify inefficiencies in real-time executions.
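The core arithmetic of the Roofline model is small enough to sketch directly: attainable throughput is the lesser of a device's peak compute rate and its memory bandwidth multiplied by the kernel's arithmetic intensity. In the sketch below, the peak FLOP rates and bandwidths are placeholder values chosen only for illustration, not vendor specifications.

```cpp
#include <algorithm>
#include <cstdio>

// Roofline bound: performance is limited either by peak compute or by
// memory bandwidth times the kernel's arithmetic intensity (FLOP per byte).
double attainable_gflops(double peak_gflops, double mem_bw_GBps, double flops_per_byte) {
    return std::min(peak_gflops, mem_bw_GBps * flops_per_byte);
}

int main() {
    // Placeholder device characteristics (illustrative only).
    const double cpu_peak = 1000.0,  cpu_bw = 100.0;    // GFLOP/s, GB/s
    const double gpu_peak = 20000.0, gpu_bw = 1500.0;

    for (double ai : {0.25, 1.0, 4.0, 16.0}) {           // arithmetic intensity
        std::printf("AI %5.2f FLOP/B: CPU %8.1f GFLOP/s, GPU %8.1f GFLOP/s\n",
                    ai,
                    attainable_gflops(cpu_peak, cpu_bw, ai),
                    attainable_gflops(gpu_peak, gpu_bw, ai));
    }
}
```

Low-intensity kernels sit on the bandwidth-limited slope for both devices, while high-intensity kernels hit the compute ceiling, which is where the GPU's advantage in peak FLOPS becomes visible.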

Programming and Integration Issues

Programming heterogeneous computing systems presents significant challenges due to the diverse architectures involved, such as CPUs, GPUs, and accelerators, each with distinct instruction set architectures (ISAs) and execution models. Developers must manage code that spans these disparate components, often requiring specialized tools and techniques to ensure correct functionality and performance. These issues are exacerbated by the need for seamless integration, where errors in one component can propagate unpredictably across the system.

Debugging in heterogeneous environments is particularly complex, as traditional CPU debuggers like GDB cannot directly inspect GPU execution, rendering errors in accelerator code invisible to standard tools. For instance, GPU faults, such as out-of-bounds accesses, may only manifest as runtime crashes without detailed traces unless specialized extensions are employed. Tools like the Intel Distribution for GDB extend the debugger to offloaded kernels, allowing step-by-step execution and variable inspection of accelerator code. Similarly, NVIDIA's CUDA-GDB extends GDB to support simultaneous debugging of CPU and GPU code, enabling breakpoints in kernels and monitoring across multiple GPUs. These extensions mitigate ISA-specific debugging gaps but require vendor-specific setups, complicating cross-platform development.

Portability challenges arise from vendor divergences in APIs, notably differences in memory semantics between CUDA and OpenCL, where CUDA's unified addressing contrasts with OpenCL's explicit host-device buffer management, leading to non-portable code that must be rewritten for each platform. For example, CUDA's implicit memory coalescing optimizations are not directly replicable in OpenCL without manual adjustments, potentially causing performance discrepancies of up to 30% across implementations. To address this, abstraction layers like Kokkos and RAJA enable performance-portable programming, allowing developers to write code using high-level C++ templates that map to underlying APIs such as CUDA, HIP, or OpenMP. Kokkos provides multidimensional array abstractions and execution policies that abstract hardware details, while RAJA focuses on loop and kernel constructs for parallel execution, facilitating single-source applications across heterogeneous backends.

Integration hurdles include runtime selection of accelerators, where decisions on offloading computations, such as using OpenMP's target directive with conditional clauses, must dynamically choose devices based on availability and workload, often leading to suboptimal performance if not tuned properly. Additionally, shared memory spaces in heterogeneous systems introduce security concerns, particularly side-channel attacks that exploit timing or contention to leak information between isolated processes. For instance, microarchitectural side channels in shared caches or memory buses allow adversaries to infer sensitive information from co-located accelerators, as demonstrated in attacks on GPU shared memory hierarchies.

Mitigation strategies include auto-tuning frameworks like OpenTuner, which systematically explore configuration spaces for kernel parameters to optimize performance across heterogeneous hardware, using techniques such as genetic algorithms and other search heuristics to reduce tuning time. Unified APIs, such as Intel's oneAPI, further alleviate integration issues by providing a single programming model based on SYCL and DPC++ that abstracts vendor-specific details, enabling portable code across CPUs, GPUs, and FPGAs while minimizing low-level boilerplate for memory management and offloading.
These approaches collectively enhance developer productivity by streamlining the development process in diverse environments.
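As an example of the portability-layer approach described above, the following Kokkos sketch expresses a parallel dot product once and lets the library map it to whichever backend (CUDA, HIP, OpenMP, or serial) Kokkos was built for. It is a minimal sketch assuming Kokkos is installed and configured for a target backend; the labels and problem size are arbitrary.

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;

        // Views allocate in the default execution space's memory (device memory
        // when a GPU backend is enabled, host memory otherwise).
        Kokkos::View<float*> x("x", n), y("y", n);

        Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(const int i) {
            x(i) = 1.0f;
            y(i) = 2.0f;
        });

        float sum = 0.0f;
        // The same source compiles unchanged for CPU and GPU backends.
        Kokkos::parallel_reduce("dot", n, KOKKOS_LAMBDA(const int i, float& acc) {
            acc += x(i) * y(i);
        }, sum);

        std::printf("dot = %.1f\n", sum);   // 2.0 * n
    }
    Kokkos::finalize();
}
```

The inner scope ensures the Views are destroyed before Kokkos::finalize(), which is the idiomatic structure for Kokkos programs.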

Applications

High-Performance and Scientific Computing

Heterogeneous computing plays a pivotal role in high-performance computing (HPC) by integrating diverse processors, such as CPUs and GPUs, to achieve unprecedented computational throughput in supercomputers and clusters. The Frontier supercomputer, deployed at Oak Ridge National Laboratory in 2022 and powered by AMD EPYC CPUs and AMD Instinct MI250X GPUs within an HPE Cray EX architecture, exemplifies this approach, delivering 1.102 exaflops on the High-Performance Linpack (HPL) benchmark and securing the top position on the TOP500 list. This heterogeneous design enables efficient handling of compute-intensive tasks, including climate modeling simulations that require modeling complex atmospheric dynamics over global scales. In heterogeneous clusters, GPUs accelerate the HPL benchmark by leveraging their parallel processing capabilities, achieving speedups of 10 to 100 times compared to CPU-only systems through optimized CUDA implementations that distribute matrix operations across accelerators.

In scientific computing, heterogeneous systems enhance simulations in domains like molecular dynamics and astrophysics by offloading parallelizable workloads to GPUs. The GROMACS software package, widely used for biomolecular simulations, has incorporated heterogeneous parallelization over the past decade, allowing GPU clusters to accelerate non-bonded interaction calculations and achieve several-fold performance gains in large-scale studies. Similarly, in astrophysics, GPU-accelerated N-body simulations model gravitational interactions among particles representing stars or dark matter, with heterogeneous CPU-GPU frameworks optimizing force computations to enable simulations of millions of particles that would otherwise be infeasible on homogeneous systems.

For artificial intelligence and machine learning workloads, heterogeneous computing facilitates training and inference of deep neural networks on specialized hardware. Leading deep learning frameworks support multi-GPU training on such systems, where multiple A100 or H100 GPUs in a single node distribute work across accelerators, reducing training times for large models like transformers by exploiting the heterogeneous CPU-GPU division of labor for data loading and computation. Inference acceleration benefits from tensor processing units (TPUs), with Google Cloud benchmarks from 2023 demonstrating 2-4x performance improvements over prior GPU-based systems for serving large language models, achieved through optimized matrix multiplications on TPU pods in heterogeneous cloud environments.

Scaling heterogeneous computing to exascale systems further amplifies these gains in distributed environments. The Aurora supercomputer at Argonne National Laboratory, which became initially operational in 2024 and fully available to researchers in early 2025, features Intel Xeon Max CPUs paired with Intel Data Center GPU Max Series accelerators in a heterogeneous node design and targets over 1 exaflop of performance for scientific discovery, including materials science and energy simulations. As of June 2025, Aurora ranks third on the TOP500 list with 1.012 exaFLOPS. Such systems pursue energy efficiency goals aligned with the original exascale target of around 50 gigaflops per watt under 20 megawatts, though actual deployments like Aurora operate at higher power levels of approximately 40-60 megawatts while achieving over 20 gigaflops per watt, balancing high throughput with sustainable operation.

Embedded and Edge Systems

Heterogeneous computing plays a pivotal role in embedded and edge systems, where resource constraints demand optimized energy efficiency and processing capabilities. These systems integrate diverse processing units such as CPUs, GPUs, NPUs, DSPs, and FPGAs on a single chip or board to handle latency-sensitive tasks while adhering to strict power budgets, often below 10 watts. In contrast to data center environments that prioritize massive parallelism, embedded and edge applications emphasize low-power operation for prolonged battery life and reliable performance in constrained settings like mobile devices and sensors.

In mobile devices, such as smartphones, system-on-chip (SoC) designs exemplify heterogeneous computing through the integration of CPUs, GPUs, and NPUs to enable on-device AI processing. Samsung's Exynos SoCs, for instance, incorporate these elements to support real-time AI tasks like text generation and video enhancement without relying on cloud resources, a trend that has become prominent in the 2020s. The neural processing unit (NPU) in Exynos processors, evolving to its sixth generation by 2022, handles AI computations efficiently, collaborating with the CPU for general tasks and the GPU for graphics-intensive AI workloads. Task offloading in these heterogeneous mobile systems, which shifts compute-intensive workloads to edge servers or specialized accelerators, can extend battery life through reduced local energy demands, a trade-off sketched at the end of this section.

For IoT and industrial applications, heterogeneous architectures combine microcontrollers (MCUs), DSPs, and other accelerators to manage real-time data from sensors in industrial settings. Texas Instruments' Sitara processors, such as the AM64x MPU and AM243x MCU, integrate Arm cores with DSPs and programmable real-time units (PRUs) for low-latency industrial control, supporting protocols like EtherCAT and enabling cycle times as low as 31.25 μs. The NVIDIA Jetson Nano module further illustrates this in edge video analytics, leveraging a quad-core Arm CPU and a 128-core GPU to process multiple neural networks in parallel for tasks like object detection, all within a 5-10 watt power envelope. These setups ensure efficient handling of streaming data, such as video feeds, directly at the edge.

In automotive embedded systems, particularly advanced driver-assistance systems (ADAS), heterogeneous computing employs CPU-FPGA combinations to meet stringent latency and safety requirements for autonomous driving features. Xilinx's Zynq-7000 integrates dual-core Arm Cortex-A9 processors with programmable FPGA logic, facilitating customizable hardware acceleration for image processing and sensor fusion in ADAS applications, while maintaining low power consumption suitable for vehicle constraints under 10 watts. Heterogeneous scheduling in these systems dynamically allocates tasks across the CPU and FPGA to optimize performance within tight power budgets, enabling reliable operation in safety-critical scenarios. Single-board computers paired with GPU or other add-on accelerators for edge inference demonstrate broader adoption, enhancing inference speed for AI tasks on low-power hardware. The proliferation of 5G since 2020 has further boosted distributed edge computing in heterogeneous setups, with the global 5G edge market growing from USD 4.7 billion in 2024 to a projected USD 51.6 billion by 2030, facilitating low-latency processing across devices.
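As a simple illustration of the offloading trade-off mentioned above, the sketch below compares the energy of executing a task locally on a battery-powered device against the energy of transmitting its input to an edge server and idling while awaiting the result. All power, timing, and throughput figures are illustrative assumptions, and the model ignores factors such as downlink traffic and radio wake-up costs.

```cpp
#include <cstdio>

// Toy energy model for an offloading decision on a battery-powered device.
struct Device {
    double p_compute_W;  // power draw while computing locally
    double p_radio_W;    // power draw while transmitting over the radio
    double p_idle_W;     // power draw while waiting for the remote result
};

int main() {
    Device dev{2.0, 1.2, 0.3};       // illustrative values, not measurements
    double t_local_s   = 4.0;        // assumed local execution time
    double data_MB     = 20.0;       // input data to upload
    double uplink_MBps = 10.0;       // assumed uplink throughput
    double t_remote_s  = 0.5;        // assumed server-side execution time

    double t_tx = data_MB / uplink_MBps;
    double e_local   = dev.p_compute_W * t_local_s;
    double e_offload = dev.p_radio_W * t_tx + dev.p_idle_W * (t_tx + t_remote_s);

    std::printf("local: %.2f J, offload: %.2f J -> %s\n",
                e_local, e_offload, e_offload < e_local ? "offload" : "run locally");
}
```

With these placeholder numbers offloading wins comfortably, but shrinking the local runtime or the uplink throughput quickly flips the decision, which is why edge schedulers evaluate it per task.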

References

  1. [1]
    [PDF] A Gentle Introduction to Heterogeneous Computing for CS1 Students
    Abstract—Heterogeneous architectures have emerged as a dom- inant platform, not only in high-performance computing but also in mobile processing, ...
  2. [2]
    [PDF] HeteroRefactor: Refactoring for Heterogeneous Computing with FPGA
    Heterogeneous computing with field-programmable gate-arrays. (FPGAs) has demonstrated orders of magnitude improvement in computing efficiency for many ...
  3. [3]
    [PDF] Heterogeneous Computing - Purdue e-Pubs
    Dec 1, 1994 · A heterogeneous computing (HC) system provides a variety of architectural capabilities, orches- trated to perform an application whose ...
  4. [4]
    [PDF] Virtual Instruction Set Computing for Heterogeneous Systems ∗
    Heterogeneous parallel computing systems, including both mobile System-on-Chip (SOC) designs such as. Qualcomm's Snapdragon and nVidia's Tesla, or high- end ...
  5. [5]
    [PDF] an overview of heterogeneous high performance and grid computing
    Heterogeneous distributed computing is a means to overcome the limitations of single computing systems. Key words. heterogeneous, parallel, grid, high ...
  6. [6]
    [PDF] Heterogeneous Computing Fundamentals - Selkie
    Jul 26, 2013 · Heterogenous computing models are increasing in importance in parallel and distributed computing. This module.
  7. [7]
    Heterogeneous Computing for Signal and Data Processing
    Using programming languages such as OpenCL and CUDA for computational speedup in audio, image and video processing and computational data analysis. Significant ...
  8. [8]
    Heterogeneous vs. Homogeneous Computing Environments - Intel
    On the other hand, heterogeneous computing involves a system that combines different types of processing units, such as Central Processing Units (CPUs), ...
  9. [9]
    Heterogeneous Computing: Here to Stay
    Mar 1, 2017 · Let's start with the easy questions. What is heterogeneous computing? In a nutshell, it is a scheme in which the different computing nodes have ...
  10. [10]
    What Is Accelerated Computing? - NVIDIA Blog
    Sep 1, 2021 · Accelerated computers blend CPUs and other kinds of processors together as equals in an architecture sometimes called heterogeneous computing.
  11. [11]
    An initial performance review of software components for a ...
    Sep 7, 2015 · The end of Moore's Law and Dennard scaling has driven the proliferation of heterogeneous systems with accelerators, including CPUs, GPUs ...
  12. [12]
    potential for energy efficient multi-core mobile devices - IEEE Xplore
    By adopting heterogeneous computing, the total energy required for executing the applications can be significantly reduced. Based on the benchmarking results, ...
  13. [13]
    ENIAC | History, Computer, Stands For, Machine, & Facts | Britannica
    Oct 18, 2025 · ENIAC, the first programmable general-purpose electronic digital computer, built during World War II by the United States.
  14. [14]
    A Brief History of Computer Technology
    The CDC 7600, with its pipelined functional units, is considered to be the first vector processor and was capable of executing at 10 Mflops. The IBM 360/91 ...
  15. [15]
    About CUDA | NVIDIA Developer
    Since its introduction in 2006, CUDA has been widely deployed through thousands of applications and published research papers, and supported by an installed ...More Than A Programming... · Widely Used By Researchers · Acceleration For All Domains
  16. [16]
    Where does big.LITTLE fit in the world of DynamIQ? - Arm Developer
    Apr 6, 2017 · When big.LITTLE was launched in October 2011, it became the world's first heterogeneous processing technology to enter the mobile market. The ...
  17. [17]
    AMD, ARM, Imagination, MediaTek and Texas Instruments Unleash ...
    Jun 12, 2012 · "One year ago, AMD boldly announced a roadmap for making HSA a reality, starting with combining the CPU and GPU as a unified processing engine ...
  18. [18]
    An in-depth look at Google's first Tensor Processing Unit (TPU)
    May 12, 2017 · In this post, we'll take an in-depth look at the technology inside the Google TPU and discuss how it delivers such outstanding performance.<|control11|><|separator|>
  19. [19]
    Pioneering chiplet technology and design for the AMD EPYC™ and ...
    This paper details the technology challenges that motivated AMD to use chiplets, the technical solutions we developed for our products, and how we expanded the ...
  20. [20]
    Intel Meteor Lake "Core Ultra" CPUs Launched - Wccftech
    Dec 14, 2023 · Intel Meteor Lake “Core Ultra” CPUs Launched: The First Chiplet Design With Next-Gen CPU Cores, Arc GPU & NPU For The AI PC Revolution.
  21. [21]
    TPU transformation: A look back at 10 years of our AI-specialized chips
    Jul 31, 2024 · Google's Tensor Processing Units were created in response to the growing demand for AI compute and have evolved over years to meet that ...
  22. [22]
    Heterogeneous Computation - an overview | ScienceDirect Topics
    A heterogeneous computing system refers to a system that contains different types of computational units, such as multicore CPUs, GPUs, DSPs, FPGAs, and ASICs.
  23. [23]
    Heterogeneous Computing Platform for data processing - IEEE Xplore
    Jan 19, 2017 · The Heterogeneous Computing Platform (HCP) contains the multiple types of processing elements which generally are CPUs, GPUs, and DSPs or FPGAs.
  24. [24]
    Query Processing on Heterogeneous CPU/GPU Systems
    Jan 17, 2022 · Whereas CPUs focus on single-thread latency, GPUs are optimized for data-parallel and throughput-oriented applications. To use each processor ...<|separator|>
  25. [25]
    Complex Mix Of Processors At The Edge - Semiconductor Engineering
    Aug 18, 2025 · It also provides an NPU fallback and offload mechanism, acting as an AI co-processor in many cases. ASICs: These deliver maximum efficiency and ...
  26. [26]
    A Survey on Deep Learning Hardware Accelerators for ...
    The purpose of an integrated NPU is to accelerate the performance and improve the energy efficiency of specific AI-tasks offloaded from the CPU [187]. In ...
  27. [27]
    RISC vs. CISC: Harnessing ARM and x86 Computing Solutions for ...
    Jul 11, 2024 · ARM adopts a RISC (Reduced Instruction Set Computing) philosophy, whereas x86 is based on a CISC (Complex Instruction Set Computing) approach.
  28. [28]
    Vectorization Basics for Intel® Architecture Processors
    Oct 30, 2018 · Intel processors use SIMD instruction sets like SSE, AVX, and AVX2 to process multiple data elements in a single instruction, enabling data ...
  29. [29]
    NVIDIA Tensor Cores: Versatility for HPC & AI
    Tensor Cores are the advanced NVIDIA technology that enables mixed-precision computing. This technology expands the full range of workload across AI & HPC.
  30. [30]
    Heterogeneous multi-processing - Arm Developer
    In a big.LITTLE system energy efficient LITTLE cores are coherently coupled with high performance big cores to form a system that can accomplish both high ...Missing: threading | Show results with:threading
  31. [31]
    Dealing With Performance Bottlenecks In SoCs
    Feb 23, 2023 · His analysis showed that peak FLOPS per socket were increasing 50% to 60% per year, while memory bandwidth only increased at about 23% per year.
  32. [32]
    [PDF] Synchronization and Coordination in Heterogeneous Processors
    Dec 14, 2016 · 4.2.1 Synchronization Primitives in Heterogeneous Processors ... coordinate computation across CPU and GPU cores in heterogeneous processors.
  33. [33]
    Understanding PCIe Configuration for Maximum Performance
    Maximum PCIe Bandwidth = SPEED * WIDTH * (1 - ENCODING) - 1Gb/s. For example, a gen 3 PCIe device with x8 width will be limited to: Maximum PCIe Bandwidth = 8G ...
  34. [34]
    AMBA - Arm
    The Advanced Microcontroller Bus Architecture (AMBA) is a freely available, open standard to connect and manage functional blocks in a system-on-chip (SoC).
  35. [35]
    NVLink & NVSwitch: Fastest HPC Data Center Platform | NVIDIA
    The NVLink Switch interconnects every GPU pair at an incredible 1,800GB/s. It supports full all-to-all communication. The 72 GPUs in the NVIDIA GB300 NVL72 can ...Maximize System Throughput... · Raise Reasoning Throughput... · Nvidia Nvlink Fusion
  36. [36]
    Understanding Compute Express Link: A Cache-coherent Interconnect
    Sep 22, 2020 · Compute Express Link (CXL) is an industry-supported cache-coherent interconnect for processors, memory expansion, and accelerators.
  37. [37]
    Networking Heterogeneous Microcontroller based Systems through ...
    Aug 7, 2025 · This paper presents an approach that addresses various issues related to networking distributed embedded systems through use of universal serial ...
  38. [38]
    [PDF] NoC-Based Support of Heterogeneous Cache-Coherence Models ...
    We proposed an extension to the MESI directory-based cache coherence protocol over NoC to support LLC-coherent accelerators. We presented the first NoC-based ...
  39. [39]
    Flexible on-chip power delivery for energy efficient heterogeneous ...
    May 29, 2013 · Heterogeneous systems-on-chip pose a challenge for power delivery given the variety of needs for different components.
  40. [40]
    Developers: Heterogeneous computing for your demanding apps
    Mar 4, 2020 · Qualcomm Kryo CPU: an ARM-based CPU featuring multiple cores configured in a big.LITTLE architecture. Our Kryo CPU supports multiple task ...
  41. [41]
    AMD Fusion APU Era Begins
    Jan 4, 2011 · These APUs feature the new x86 CPU core codenamed "Bobcat." "Bobcat" is AMD's first new x86 core since 2003 and was designed from the ground up ...
  42. [42]
    Intel® Xe GPU Architecture
    An Xe-HPC 2-stack Data Center GPU Max, previously code named Ponte Vecchio or PVC, consists of up to 2 stacks:: 8 slices, 128 Xe-cores, 128 ray tracing ...Missing: chiplets | Show results with:chiplets
  43. [43]
    Vega: AMD's New Graphics Architecture for Virtually Unlimited ...
    Jan 5, 2017 · The world's most advanced GPU memory architecture: The Vega architecture enables a new memory hierarchy for GPUs. This radical new approach ...Missing: HSA unified
  44. [44]
    [PDF] Harnessing Integrated CPU-GPU System Memory for HPC - arXiv
    Jul 10, 2024 · In this work, we provide the first in-depth study of the system memory management on the Grace Hopper Superchip, in both in-memory and memory.
  45. [45]
    Apple unleashes M1
    Nov 10, 2020 · Featuring Apple's most advanced 16-core architecture capable of 11 trillion operations per second, the Neural Engine in M1 enables up to 15x ...Apple (CA) · Apple (AU) · Apple (UK) · Apple (NZ)
  46. [46]
    NVIDIA A100 Tensor Core GPU
    The A100 provides up to 20X higher performance, 2TB/s memory bandwidth, and can be partitioned into seven GPU instances for AI, data analytics, and HPC.
  47. [47]
    ORNL Launches Summit Supercomputer
    Jun 8, 2018 · The IBM AC922 system consists of 4,608 compute servers, each containing two 22-core IBM Power9 processors and six NVIDIA Tesla V100 graphics ...
  48. [48]
    [PDF] NVIDIA A100 Tensor Core GPU Architecture
    The NVIDIA A100 Tensor Core GPU delivers the greatest generational leap in NVIDIA GPU accelerated computing ever. Page 11. NVIDIA A100 Tensor Core GPU Overview.
  49. [49]
    Intel Quietly Kills Off Xeon Phi - ExtremeTech
    May 8, 2019 · Intel has quietly notified customers that the Xeon Phi 7295, 7285, and 7235 will be end-of-life'd July 31, 2020, with no further orders for KML ...
  50. [50]
    What is PCIe 5.0? Everything You Need to Know - Trenton Systems
    Sep 8, 2022 · PCIe 5.0 is the next generation of PCIe, which is a widely-used, high-speed interface that can connect components such as graphics processing units (GPUs).
  51. [51]
    Accelerated InfiniBand Solutions for HPC - NVIDIA
    The NVIDIA Quantum InfiniBand Platform bring end-to-end high-performance networking to scientific computing, AI, and cloud data centers.InfiniBand Adapters · InfiniBand Switch Systems · Quantum-X800
  52. [52]
    [PDF] Accelerating DNNs with Xilinx Alveo Accelerator Cards (WP504)
    Oct 14, 2018 · Xilinx's reconfigurable FPGA silicon allows users to continue receiving new improvements and features through xDNN updates. This allows the ...
  53. [53]
    PCIe Gen 4 vs. Gen 3 Slots, Speeds - Trenton Systems
    Sep 25, 2023 · In addition, each PCIe 4.0 lane configuration supports double the bandwidth of PCIe 3.0, maxing out at 32 GB/s in a 16-lane slot, or 64 GB/s ...
  54. [54]
    Introducing Cloud TPU v5p and AI Hypercomputer - Google Cloud
    Dec 6, 2023 · The new TPU v5p is a core element of AI Hypercomputer, which is tuned, managed, and orchestrated specifically for gen AI training and ...Missing: discrete | Show results with:discrete
  55. [55]
    Intel® Gaudi® AI Accelerator Products
    Built on the high-efficiency Intel® Gaudi® architecture, the new Intel® Gaudi® 3 PCIe card (HL-338) delivers AI acceleration in a standard PCIe Gen5 form factor ...
  56. [56]
    Programming Guide :: CUDA Toolkit Documentation - NVIDIA Docs
    In November 2006, NVIDIA introduced CUDA®, a general purpose parallel computing platform and programming model that leverages the parallel compute engine in ...
  57. [57]
    CUDA Refresher: The CUDA Programming Model - NVIDIA Developer
    Jun 26, 2020 · A group of threads is called a CUDA block. CUDA blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads (Figure 2).Missing: 2006 scheduling 32 documentation
  58. [58]
    CUDA C++ Programming Guide
    The programming guide to the CUDA model and interface.
  59. [59]
    Introduction to CUDA: tutorial and use of Warp - Damavis Blog
    Aug 1, 2024 · CUDA is NVIDIA's GPU programming language, vital for many computing tasks. In this post we look at it to understand the special paradigm it ...
  60. [60]
    AMD ROCm™ Software for AI
    AMD ROCm is an open software stack offering a suite of optimizations for AI workloads and supporting the broader AI software ecosystem.Missing: 2016 HIP
  61. [61]
    Data Parallel C++: oneAPI's Implementation of SYCL - Intel
    SYCL is an open alternative to single-architecture proprietary languages. It allows developers to reuse code across hardware targets (CPUs and accelerators ...
  62. [62]
    Compute Library – Arm®
    Arm Compute Library contains a comprehensive collection of software functions specifically optimized for Arm Cortex-A CPUs and Arm Mali GPUs.Missing: big. LITTLE NEON intrinsics
  63. [63]
    NEON Intrinsics - Arm Developer
    This book provides a guide for programmers to effectively use NEON technology, the ARM Advanced SIMD architecture extension. The book provides information ...Missing: Library Mali GPUs big. LITTLE
  64. [64]
    ARM-software/ComputeLibrary: The Compute Library is a ... - GitHub
    The Compute Library is a collection of low-level machine learning functions optimized for Arm® Cortex®-A, Arm® Neoverse™ and Arm® Mali™ GPUs architectures.Missing: mobile intrinsics
  65. [65]
    [PDF] NVIDIA TESLA V100 GPU ARCHITECTURE
    Tensor Core 4x4 Matrix Multiply and Accumulate. Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply results in a full precision ...
  66. [66]
    Programming Tensor Cores in CUDA 9 | NVIDIA Technical Blog
    Oct 17, 2017 · Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply results in a full-precision result that is accumulated in FP32 ...
  67. [67]
    OpenCL for Parallel Programming of Heterogeneous Systems
    OpenCL (Open Computing Language) is an open, royalty-free standard for cross-platform, parallel programming of diverse accelerators.
  68. [68]
    The OpenCL™ Specification - Khronos Registry
    Jul 10, 2025 · OpenCL(TM) is an open, royalty-free standard for cross-platform parallel programming of diverse accelerators. This document describes the ...
  69. [69]
    Version 5.0 - OpenMP
    OpenMP API Specification: Version 5.0, November 2018. This HTML version of the specification is a translation from the official PDF specification.
  70. [70]
    Standards – Heterogeneous System Architecture Foundation
    The HSA Foundation is driving development of a new standard for the advancement of heterogeneous computing. Published documents include standards, recommended ...
  71. [71]
    [PDF] HSA Platform System Architecture Specification Version 1.2
    A compliant HSA system shall allow agents to access shared system memory through the common HSA unified virtual address space. The minimum virtual address width ...
  72. [72]
  73. [73]
    Home | Vulkan | Cross platform 3D Graphics
    Vulkan is a next generation graphics and compute API that provides high-efficiency, cross-platform access to modern GPUs used in PCs, consoles, ...
  74. [74]
    WebGPU - W3C
    Oct 28, 2025 · WebGPU is an API that exposes the capabilities of GPU hardware for the Web. The API is designed from the ground up to efficiently map to (post-2014) native GPU ...
  75. [75]
    [PDF] Bi-objective Scheduling Algorithms for Optimizing Makespan and ...
    In this paper we take on the problem of scheduling an application modeled by a task graph on a set of heterogeneous resources. The objectives are to minimize ...
  76. [76]
    [PDF] Chapter 1 Introduction to Scheduling and Load Balancing
    The advantage of dynamic load balancing over static scheduling is that the ... Dynamic algorithms have the potential to outperform static algorithms by ...
  77. [77]
    [PDF] Load Balancing in a Changing World: Dealing with Heterogeneity ...
    We show that our dynamic approach provides consistently good performance: compared to the best possible static partition, it is on average 9.6% faster in ...
  78. [78]
    [PDF] Analysis of data movements over the PCIe bus in heterogeneous ...
    This thesis analyzes data movements over the PCIe bus in heterogeneous systems, specifically between CPU and GPU, using CUDA-based tools.
  79. [79]
    [PDF] Understanding Routable PCIe Performance for Composable ...
    Apr 18, 2024 · ... latency penalties to the D2H and H2D cases due to 3 one-way PCIe ... With larger data movement sizes, such overheads diminish considerably.
  80. [80]
    [PDF] Programming Heterogeneous Computers and Improving Inter-Node ...
    Communication bandwidth between accelerators over the PCIe interconnect is much slower than internal memory bandwidth. This project examines the inter-node ...
  81. [81]
    Data Prefetching on Processors with Heterogeneous Memory
    Dec 11, 2024 · Our technique enables a prefetcher to dynamically determine the optimal prefetch degree and distance based on memory type.
  82. [82]
    [PDF] Efficient Unified Caching for Accelerating Heterogeneous AI ... - arXiv
    Jun 14, 2025 · called hierarchical prefetching to support data prefetching at any arbitrary granularity. In the horizontal direction, we first apply the ...
  83. [83]
    Heterogeneous microarchitectures trump voltage scaling for low ...
    Two common approaches are dynamic voltage/frequency scaling (DVFS) and heterogeneous microarchitectures (HMs).
  84. [84]
    [PDF] Rethinking Energy-Performance Trade-Off in Mobile Web Page ...
    Experimental results show that our techniques are able to achieve a 24.4% average system energy saving for Chromium on a latest-generation big.LITTLE ...
  85. [85]
    [PDF] Roofline: An Insightful Visual Performance Model for Floating-Point ...
    We propose an easy-to-understand, visual performance model that offers insights to programmers and architects on improving parallel software and hardware for ...
  86. [86]
    [PDF] Gables: A Roofline Model for Mobile SoCs - Computer Sciences Dept.
    The model can be used to determine the critical limitation for performance in a mobile SoC and can be visualized via multiple rooflines on a single plot.
  87. [87]
    Introduction to NVIDIA Nsight Systems – A Performance Analysis Tool
    When optimizing for heterogeneous systems with multiple CPUs and GPUs such as NVIDIA DGX or workstations, independent CPU profilers and GPU profilers are ...
  88. [88]
    1. Introduction — CUDA-GDB 13.0 documentation - NVIDIA Docs
    CUDA-GDB is the NVIDIA tool for debugging CUDA applications running on Linux and QNX. CUDA-GDB is an extension to GDB, the GNU Project debugger.
  89. [89]
    [PDF] Tools for GPU Computing – Debugging and Performance Analysis ...
    CUDA-GDB is, as the name indicates, an extension to gdb, the Unix debugger. Simultaneous debugging on the CPU and multiple GPUs is possible. The user can set ...
  90. [90]
    [PDF] From CUDA to OpenCL: Towards a Performance-portable Solution ...
    Aug 31, 2010 · Abstract. In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the ...
  91. [91]
    A comprehensive performance comparison of CUDA and OpenCL
    Sep 13, 2011 · Our results show that, for most applications, CUDA performs at most 30% better than OpenCL. We also show that this difference is due to unfair ...
  92. [92]
    [PDF] RAJA: Portable Performance for Large-Scale Scientific Applications
    Mar 27, 2018 · In this paper, we describe RAJA, a C++ portability layer that enables single-source applications to exploit multiple programming models, and ...
  93. [93]
    C/C++ or Fortran with OpenMP* Offload Programming Model - Intel
    OpenMP directives offload work to Intel accelerators using the `target` construct, transferring control and mapping variables between host and target.
  94. [94]
    [PDF] OpenMP Offload Features and Strategies for High Performance ...
    Runtime selection of architectural target is also possible in OpenMP. Using the if clause on target regions, we can specify whether to execute on host or ...
  95. [95]
    Microarchitectural Attacks in Heterogeneous Systems: A Survey
    Microarchitectural covert and side channel attacks exploit unintended leakage that occurs when different applications compete for shared microarchitecture ...
  96. [96]
    [PDF] OpenTuner: An Extensible Framework for Program Autotuning
    Abstract. Program autotuning has been shown to achieve better or more portable performance in a number of domains. However, autotuners themselves are rarely ...
  97. [97]
    oneAPI: A New Era of Heterogeneous Computing
  98. [98]
    June 2022 - TOP500
    The 59th edition of the TOP500 revealed the Frontier system to be the first true exascale machine with an HPL score of 1.102 Exaflop/s.
  99. [99]
    Frontier - Oak Ridge Leadership Computing Facility
    The Frontier supercomputer at the Department of Energy's Oak Ridge National Laboratory earned the top ranking today as the world's fastest on the 59th TOP500 ...
  100. [100]
    Heterogeneous parallelization and acceleration of molecular ...
    Here, we present the heterogeneous parallelization and acceleration design of molecular dynamics implemented in the GROMACS codebase over the last decade.
  101. [101]
    [PDF] Astrophysical Particle Simulations on Heterogeneous CPU-GPU ...
    We propose optimal task split between CPU and GPU where GPU is only used to compute the calculation of the particle force. Also, we describe optimization ...
  102. [102]
    Multi-GPU and distributed training | TensorFlow Core
    Jul 19, 2023 · This guide teaches you how to use the tf.distribute API to train Keras models on multiple GPUs, with minimal changes to your code, in the following two setups.
  103. [103]
    Performance per dollar of GPUs and TPUs for AI inference
    Sep 12, 2023 · Google Cloud inference systems deliver between a factor of 2-4x performance improvement and more than a factor of 2x cost-efficiency improvement over existing ...
  104. [104]
    Aurora Exascale Supercomputer - Argonne National Laboratory
    Aurora is one of the world's first exascale supercomputers, able to perform over a quintillion calculations per second.
  105. [105]
    15 Years Later, the Green500 Continues Its Push for Energy ...
    Jul 15, 2021 · Trends in the systems themselves are promising, as well: systems are broadly approaching 30-gigaflops-per-watt efficiency, with heterogeneous ...
  106. [106]
    Jetson Nano Brings the Power of Modern AI to Edge Devices - NVIDIA
    The Jetson Nano module is a small AI computer that gives you the performance and power efficiency to take on modern AI workloads.
  107. [107]
    On-device AI | Technologies | Samsung Semiconductor Global
    Samsung has continuously enhanced the mobile NPU performance of Exynos, paving the way for the on-device AI era. This mobile NPU technology has ...
  108. [108]
    The Important Role of CPU and NPU in Smartphones | Samsung ...
    Aug 31, 2022 · These days, the NPU is mainly in charge of AI computation, and it can process data more efficiently in mobile devices as well. It's optimized ...
  109. [109]
    Saving Energy in Mobile Devices Using Mobile Device Cloudlet in ...
    The simulation results demonstrate that the proposed framework extends battery life and reduces delays compared to the traditional MEC paradigm.
  110. [110]
    [PDF] Utilizing Sitara Processors and Microcontrollers for Industry 4.0 ...
    This article explores how Sitara microcontrollers (MCUs) and processors (MPUs) address servo drive market trends and new requirements of Industry 4.0 and smart ...
  111. [111]
    Xilinx Zynq-based Development Platform for ADAS
    Oct 26, 2016 · Aldec provides an FPGA-based development platform powered by Xilinx Zynq-7000 SoC/FPGA heterogeneous technology, as well as a set of ADAS-class ...
  112. [112]
    5G Edge Computing Market Size | Industry Report, 2030
    The global 5G edge computing market size was estimated at USD 4,743.2 million in 2024 and is projected to reach USD 51,574.3 million by 2030, ...
  113. [113]
    Real-Time Edge Computing vs. GPU-Accelerated Pipelines for Low ...
    Both approaches use the OpenFlexure Microscope and Raspberry Pi devices. The first performs real-time inference with a Raspberry Pi 5 and Hailo-8L accelerator, ...