
GPU cluster

A GPU cluster is a high-performance computing system consisting of multiple interconnected nodes, each equipped with one or more graphics processing units (GPUs), along with central processing units (CPUs), memory, and storage, designed to distribute and execute complex parallel workloads efficiently. These clusters leverage the massive parallel processing capabilities of GPUs, which feature thousands of cores optimized for simultaneous computations, to accelerate tasks that would be prohibitively slow on traditional CPU-based systems. The architecture typically includes high-speed interconnects such as NVLink, InfiniBand, or Ethernet to enable rapid data transfer between nodes, minimizing bottlenecks in distributed processing. Each node functions as an independent server, but software frameworks like CUDA or MPI orchestrate the coordination, allowing the cluster to behave as a unified supercomputer for scalable performance. GPU clusters are pivotal in advancing fields requiring intensive computation, including machine learning and AI model training, where they enable faster iteration on large datasets; scientific simulations such as molecular dynamics, climate modeling, and fluid dynamics; and analytics for real-time insights in industries such as healthcare. Their benefits include near-linear scalability—adding nodes proportionally boosts throughput—enhanced redundancy for fault tolerance, and support for deployments across on-premises, cloud, or hybrid environments, though they demand significant investments in power, cooling, and maintenance. Leading implementations, such as NVIDIA-powered systems from companies like HPE and Dell, underscore their role in high-performance computing and AI innovation.

Overview

Definition and Purpose

A GPU cluster is a network of interconnected computer nodes, each equipped with one or more graphics processing units (GPUs), along with central processing units (CPUs), memory, and storage, designed to perform general-purpose computing on GPUs (GPGPU) for highly parallel tasks. These systems distribute computational workloads across multiple nodes to enable high-throughput processing for data-intensive applications, such as scientific simulations and machine learning model training, achieving supercomputing-level performance at lower costs than traditional CPU-only clusters. Key benefits include scalability through the addition of nodes, improved energy efficiency relative to CPU-only systems, and significant speedups in floating-point operations, often reaching teraflops or petaflops in aggregate.

GPUs differ fundamentally from CPUs in architecture: GPUs feature thousands of simpler cores optimized for single-instruction, multiple-thread (SIMT) execution, enabling simultaneous processing of vast numbers of threads in parallel via hardware warp scheduling, in contrast to CPUs' fewer, more complex cores focused on sequential, low-latency tasks. This design, rooted in streaming multiprocessors that prioritize throughput over caching and control logic, makes GPUs ideal for workloads involving repetitive, independent operations like matrix multiplications. Unlike single-node GPU setups, which are limited by per-node memory and processing capacity, GPU clusters distribute work across nodes to achieve cluster-scale parallelism, allowing the handling of datasets and models that exceed the capabilities of isolated systems. These nodes communicate via high-speed interconnects to coordinate distributed workloads efficiently.
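The SIMT execution style described above can be made concrete with a minimal CUDA C++ sketch (a generic illustration, not tied to any specific cluster), in which a single kernel launch spawns roughly a million lightweight threads, each handling one array element:

cpp
#include <cuda_runtime.h>
#include <cstdio>

// One lightweight thread per element: the SIMT model lets thousands of
// threads execute this same function concurrently on different data.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                       // one million elements
    size_t bytes = n * sizeof(float);

    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                           // threads per block
    int blocks = (n + threads - 1) / threads;    // enough blocks to cover n
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f (expected 3.0)\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
In a cluster, many copies of this kind of kernel run concurrently on the GPUs of every node, with the inter-node coordination handled by the frameworks discussed later.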

Historical Development

The emergence of GPU clusters in the mid-2000s was propelled by NVIDIA's release of the CUDA programming model in November 2006, which enabled general-purpose computing on GPUs (GPGPU) by allowing developers to leverage the parallel processing capabilities of graphics hardware for non-graphics workloads. This breakthrough shifted GPUs from specialized graphics accelerators to versatile compute engines, fostering early experiments in clustering multiple GPUs for scientific simulations and data processing. Initial clusters, such as those built around consumer-grade GeForce GPUs in 2008, demonstrated significant speedups in applications such as dense linear algebra, often outperforming CPU clusters by orders of magnitude in floating-point operations. A pivotal technological shift occurred in 2007 with the introduction of NVIDIA's Tesla series, the first GPUs optimized for data-center environments rather than consumer graphics, featuring enhanced reliability and higher double-precision performance for scientific computing. Supported by funding from agencies such as DARPA and the U.S. Department of Energy (DOE) in the late 2000s—including DARPA's High-Productivity Computing Systems (HPCS) program, which invested in accelerated architectures—these developments laid the groundwork for scalable clusters.

By 2010, hybrid CPU-GPU systems began dominating the TOP500 list of supercomputers, exemplified by China's Tianhe-1A, which claimed the top spot with Intel Xeon CPUs paired with NVIDIA Fermi GPUs, marking the onset of widespread GPU adoption in high-performance computing (HPC). GPUs first appeared on the TOP500 list in 2008, representing a small fraction of systems initially but growing rapidly thereafter. The 2012 debut of the Titan supercomputer at Oak Ridge National Laboratory (ORNL) represented a landmark milestone for GPU-accelerated petascale computing, integrating 18,688 NVIDIA Tesla K20X GPUs based on the Kepler architecture to achieve 27 petaflops of peak performance. Funded by the DOE, Titan's hybrid design—combining AMD Opteron CPUs with NVIDIA GPUs—validated GPU clusters for production HPC workloads such as climate modeling and materials simulations, influencing a broader transition on the TOP500, where over half of new performance came from GPU-accelerated systems. The deep learning boom, ignited by AlexNet's victory in the 2012 ImageNet competition—trained on two NVIDIA GTX 580 GPUs and achieving a top-5 error rate of 15.3%—further accelerated cluster adoption by highlighting GPUs' efficiency in parallel training.

Entering the 2020s, GPU clusters solidified their dominance in AI-driven workloads with the rollout of NVIDIA's Ampere (A100, 2020) and Hopper (H100, 2022) architectures, enabling exaflop-scale performance in specialized precisions like FP8; for instance, NVIDIA's DGX H100 systems delivered up to 1 exaflop of FP8 AI compute per DGX SuperPOD. The rise of cloud-based GPU clusters, such as AWS's EC2 P3dn instances launched in 2018 with NVIDIA V100 GPUs, democratized access to these resources for machine learning training. By 2025, the Blackwell architecture—featuring dual-die GPUs with 208 billion transistors—powered massive AI training clusters, such as those using GB300 NVL72 configurations exceeding 4,600 GPUs, achieving record-breaking results in MLPerf benchmarks and supporting trillion-parameter models.

Hardware Components

GPU Configurations

GPU configurations in clusters are categorized as homogeneous or heterogeneous based on the uniformity of GPU hardware across nodes. Homogeneous setups employ identical GPUs in all nodes, such as NVIDIA H100 Tensor Core GPUs equipped with 80 GB of HBM3 memory, enabling seamless execution of uniform workloads. This uniformity simplifies load balancing and reduces synchronization overhead, as all GPUs share consistent architectural features and performance characteristics, leading to more predictable scheduling and higher overall efficiency in shared environments. Heterogeneous configurations, by contrast, integrate diverse GPU models across nodes to accommodate a range of tasks—for instance, deploying NVIDIA A100 GPUs, which excel in deep learning workloads due to their optimized Tensor Cores, alongside AMD Instinct MI300 accelerators with 192 GB of HBM3 memory tailored for memory-intensive training workloads. While this approach offers flexibility for mixed computational demands, it introduces challenges in software compatibility, scheduling complexity, and resource utilization, often requiring specialized frameworks to mitigate performance inconsistencies.

At the node level, multi-GPU arrangements enhance intra-node parallelism, typically connecting 4 to 8 GPUs per server using high-bandwidth interconnects like NVLink, which delivers up to 900 GB/s of bidirectional throughput between GPUs—far surpassing PCIe Gen5's 128 GB/s maximum—for data-intensive applications. In contrast, PCIe-based setups provide a more economical alternative for less demanding interconnect needs, though with higher latency for GPU-to-GPU communication. Systems like the NVIDIA DGX H100 integrate eight GPUs via NVLink and NVSwitch, but configurations must account for factors such as memory bandwidth (up to 3.35 TB/s per GPU), power draw (700 W TDP per GPU), and robust cooling to prevent thermal throttling under sustained loads.

Cluster sizing spans small-scale deployments with dozens of nodes for targeted research to expansive systems with thousands of nodes for exascale computing, where GPU density per node directly scales aggregate performance. For example, increasing from 4 to 8 GPUs per node can double throughput while optimizing power and space efficiency. The total floating-point operations per second (FLOPS) for the cluster is approximated by the equation: \text{Total FLOPS} = \text{Number of GPUs} \times \text{Single GPU TFLOPS} \times \text{Utilization Factor} This metric underscores the impact of density; with each H100 delivering 67 TFLOPS in FP32 precision and typical utilization around 70-90%, a 1,000-node cluster with 8 GPUs per node yields aggregate FP32 performance in the hundreds of petaflops.
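As an illustration of this formula, the following minimal C++ sketch estimates aggregate throughput for a hypothetical cluster; the node count, GPUs per node, per-GPU TFLOPS, and utilization are assumed example values, not measurements of any specific system:

cpp
#include <cstdio>

int main() {
    // Assumed example values for illustration only
    const int nodes = 1000;
    const int gpus_per_node = 8;
    const double tflops_per_gpu = 67.0;   // e.g., H100 FP32 peak
    const double utilization = 0.8;       // typical sustained fraction

    double total_tflops = nodes * gpus_per_node * tflops_per_gpu * utilization;
    printf("Estimated aggregate throughput: %.0f TFLOPS (%.1f PFLOPS)\n",
           total_tflops, total_tflops / 1000.0);
    return 0;
}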

Interconnects and Supporting Infrastructure

GPU clusters rely on high-speed interconnects to facilitate efficient data transfer between nodes, minimizing bottlenecks in parallel processing workloads. In the early 2000s, Gigabit Ethernet served as the primary cluster interconnect, offering modest bandwidth of up to 1 Gb/s but suffering from high CPU overhead due to protocol processing. By the 2020s, the shift to Remote Direct Memory Access (RDMA)-enabled fabrics, such as InfiniBand and RoCE v2 over Ethernet, enabled GPU-direct communications that bypass the CPU, reducing latency and overhead for large-scale AI and HPC tasks. This evolution supports the demands of modern GPU clusters, where inter-node communication can account for up to 50% of training time in distributed machine learning.

High-performance interconnects in GPU clusters prioritize low latency and high bandwidth to handle the massive data exchanges required for GPU synchronization. InfiniBand, a leading option, provides ultra-low latency of 3-5 microseconds and bandwidth of up to 400 Gb/s per port with NVIDIA's Quantum-2 platform, widely deployed for AI clusters by 2025. Ethernet-based RoCE v2 offers a cost-effective alternative, supporting 100-800 Gb/s bandwidth with latencies around 5-10 microseconds, making it suitable for scalable AI fabrics in hyperscale environments. For intra-node connectivity, NVIDIA's NVSwitch enables direct GPU-to-GPU linking at up to 900 GB/s of bidirectional bandwidth per GPU, forming a unified fabric across multiple GPUs within a node. InfiniBand excels in latency-sensitive scenarios like real-time simulations, while RoCE v2 provides broader compatibility with existing Ethernet infrastructure, though it may require additional tuning for lossless operation in GPU-direct transfers.

Supporting infrastructure includes host CPUs, storage, power, cooling, and chassis designs that ensure reliable operation across cluster nodes. CPUs such as AMD EPYC or Intel Xeon processors manage orchestration and I/O, with EPYC models offering up to 128 PCIe Gen 5 lanes per socket for enhanced GPU connectivity in high-density setups. Storage solutions feature NVMe SSDs for low-latency local caching and parallel file systems like Lustre, which scale to petabytes and support thousands of clients for shared data access in HPC environments. Power and cooling systems address the high thermal loads from GPUs, with liquid cooling enabling up to 30% better power utilization and supporting rack densities exceeding 50 kW, as seen in direct-to-chip implementations for AI clusters. Chassis designs, such as 4U rackmount servers, accommodate up to 8 GPUs with modular layouts for airflow or liquid manifolds, optimizing space in rack-scale deployments.

Infrastructure considerations emphasize rack-scale integration, fault tolerance, and scalability to maintain cluster efficiency. Rack-scale designs integrate multiple nodes with unified cabling and shared cooling loops, reducing deployment complexity for thousands of GPUs. Fault tolerance is achieved through redundant power supplies and multi-path routing in interconnect fabrics, ensuring uptime during failures in large-scale operations. Scalability is evaluated using bisection bandwidth, a measure of network capacity calculated as the minimum bandwidth across a cut dividing the cluster into two equal halves, approximated for a non-blocking fabric by the formula: \text{Bisection Bandwidth} = \frac{\text{Total Ports}}{2} \times \text{Port Speed} This provides context for cluster performance, where higher values support balanced all-to-all communications in expansive GPU fabrics.
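The approximation can be applied directly, as in the short C++ sketch below; the port count and per-port speed are assumed example values for a hypothetical non-blocking fabric:

cpp
#include <cstdio>

// Approximate bisection bandwidth of a non-blocking fabric:
// half of the ports sit on each side of the cut, each at full port speed.
double bisection_bandwidth_gbps(int total_ports, double port_speed_gbps) {
    return (total_ports / 2.0) * port_speed_gbps;
}

int main() {
    // Assumed example: 1,024 ports at 400 Gb/s each
    int total_ports = 1024;
    double port_speed = 400.0;  // Gb/s
    double bb = bisection_bandwidth_gbps(total_ports, port_speed);
    printf("Approximate bisection bandwidth: %.0f Gb/s (%.1f Tb/s)\n",
           bb, bb / 1000.0);
    return 0;
}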

Software Ecosystem

System-Level Software

GPU clusters primarily rely on Linux-based operating systems for their stability, extensive hardware support, and compatibility with high-performance computing (HPC) environments. Distributions such as Ubuntu and Rocky Linux (a community-driven successor to CentOS) are widely adopted due to their optimized kernels that include modules for GPU acceleration and cluster management. These systems support essential kernel features such as integration of NVIDIA's proprietary drivers via DKMS (Dynamic Kernel Module Support) for seamless updates across kernel versions.

Containerization plays a crucial role in isolating workloads and ensuring portability across nodes in GPU clusters. Tools like Docker and Singularity (now Apptainer) enable the packaging of GPU-dependent applications, with Singularity particularly favored in HPC settings for its rootless operation and native support for MPI and GPU passthrough without requiring privileged access. Docker, when used with NVIDIA's container toolkit, allows GPU resource allocation via runtime flags, facilitating multi-tenant environments.

GPU drivers form the foundational layer for hardware interaction, with NVIDIA's CUDA drivers—version 13.0 as of November 2025—providing the core runtime for GPU computing on datacenter GPUs such as those based on the Blackwell architecture. These drivers include support for memory management and direct GPU-to-GPU communication, essential for cluster-scale operations. For AMD GPUs, the ROCm platform (version 7.1.0 as of October 2025) offers analogous open-source drivers and APIs optimized for HPC and AI workloads, supporting heterogeneous cluster configurations.

Clustering middleware orchestrates resource allocation and communication in GPU environments. Job schedulers like Slurm and PBS Professional handle GPU-specific requests through generic resource (GRES) configurations, allowing users to specify GPU counts, types, and sharing modes in job submissions for efficient workload distribution. Message Passing Interface (MPI) implementations, such as Open MPI with GPU-aware extensions, support direct data transfer from GPU memory (via GPUDirect RDMA) to reduce latency in multi-node communications, bypassing CPU involvement for better throughput. Monitoring tools including Prometheus (integrated with DCGM exporters) and Ganglia provide cluster-wide visibility into resource utilization, enabling proactive fault detection and scaling decisions (a minimal monitoring sketch appears at the end of this subsection).

Installation and dependency management in GPU clusters require careful handling to ensure compatibility. The CUDA toolkit installation process involves selecting distribution-specific packages, verifying driver versions, and resolving dependencies like compilers and kernel headers, often using package managers such as yum or apt for automated resolution. For multi-version support, cluster environments manage toolkit paths via environment modules or environment variables to avoid conflicts in shared clusters. Security features such as SELinux enhance isolation in multi-tenant setups by enforcing mandatory access controls on GPU devices, though custom policies may be needed to permit driver operations without disabling enforcement.
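Node-level GPU monitoring typically builds on NVIDIA's NVML library, which underlies DCGM and many Prometheus exporters. The following minimal C++ sketch (illustrative only, assuming the NVML header and driver library are installed on the node) queries per-GPU utilization and memory usage:

cpp
#include <nvml.h>
#include <cstdio>

// Query utilization of every GPU visible on this node via NVML,
// the same library that backs DCGM and common Prometheus exporters.
int main() {
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "Failed to initialize NVML\n");
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS) continue;

        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        nvmlDeviceGetName(dev, name, sizeof(name));

        nvmlUtilization_t util;            // percent busy for compute and memory
        nvmlDeviceGetUtilizationRates(dev, &util);

        nvmlMemory_t mem;                  // framebuffer memory in bytes
        nvmlDeviceGetMemoryInfo(dev, &mem);

        printf("GPU %u (%s): compute %u%%, memory %u%%, %llu/%llu MiB used\n",
               i, name, util.gpu, util.memory,
               (unsigned long long)(mem.used >> 20),
               (unsigned long long)(mem.total >> 20));
    }

    nvmlShutdown();
    return 0;
}
A probe of this kind would typically be compiled against libnvidia-ml and invoked by an exporter or scheduler health check rather than run manually.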

Programming and Runtime Frameworks

Programming models for GPU clusters provide the foundational abstractions for developers to express parallelism and manage resources across multiple accelerators. NVIDIA's CUDA (Compute Unified Device Architecture) is a widely adopted model that enables explicit control over GPU execution through kernel launches, where computational functions are invoked on the GPU as threads organized in grids and blocks. CUDA also supports unified memory addressing, which allows a single address space accessible from both CPU and GPU, simplifying data transfer and management in multi-GPU environments without explicit copies in many cases. This model is optimized for NVIDIA hardware and has been instrumental in accelerating applications from scientific simulations to deep learning. For cross-vendor portability, OpenCL (Open Computing Language) offers an open standard maintained by the Khronos Group, allowing developers to write platform-agnostic kernels that execute on GPUs, CPUs, and other processors from multiple vendors such as NVIDIA, AMD, and Intel. Complementing this, AMD's HIP (Heterogeneous-compute Interface for Portability) serves as a C++ runtime API and kernel language that allows CUDA-style code to run on AMD GPUs via ROCm, facilitating easier migration while maintaining similar syntax for kernel launches and memory operations. Emerging standards like SYCL, adopted in Intel's oneAPI initiative, extend C++ to support single-source programming for heterogeneous systems, including GPUs from various vendors, with extensions for device-specific optimizations. SYCL promotes portability by abstracting hardware differences, enabling code reuse across NVIDIA, AMD, and Intel ecosystems without vendor lock-in.

Runtime frameworks build on these models to handle distributed execution, synchronization, and communication in GPU clusters. PyTorch Distributed provides tools like DistributedDataParallel (DDP), which wraps models for efficient multi-GPU training by replicating the model across devices and synchronizing gradients via all-reduce operations during the backward pass, scaling seamlessly to multiple nodes. Similarly, TensorFlow integrates with Horovod, an open-source framework that extends single-GPU scripts to distributed settings using ring-allreduce for gradient aggregation, supporting frameworks like TensorFlow and PyTorch and enabling training on hundreds of GPUs with minimal code changes. NVIDIA's Collective Communications Library (NCCL) underpins many of these frameworks by optimizing collective operations such as all-reduce, which combines data from all GPUs (e.g., summing gradients) and distributes the result; NCCL achieves high performance through topology-aware algorithms that maximize interconnect utilization, with effective bandwidth approximated as \text{Effective Bandwidth} = \text{Raw Bandwidth} \times (1 - \text{Overhead Fraction}), where the overhead fraction accounts for protocol and synchronization costs in ring or tree reductions (a minimal all-reduce sketch appears at the end of this subsection).

Distributed computing tools offer higher-level abstractions for orchestrating complex workflows on GPU clusters. Dask integrates with GPU-accelerated libraries like CuPy and cuDF, allowing users to build task graphs for parallel execution across nodes, with dynamic scheduling to optimize resource allocation for data-intensive computations. Ray provides a unified API for scaling Python applications, supporting actor-based task distribution and integration with GPU runtimes, while incorporating fault tolerance through task reconstruction and retries. Both tools support checkpointing mechanisms to save intermediate states in long-running jobs, enabling recovery from node failures without restarting from scratch—Dask via its scheduler's persistence options and Ray through built-in job recovery APIs.
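To make the all-reduce pattern concrete, the sketch below shows a minimal single-process NCCL example (assuming two local GPUs and the NCCL library; buffer contents are placeholders) that sums a buffer across devices, mirroring how DDP and Horovod synchronize gradients:

cpp
#include <nccl.h>
#include <cuda_runtime.h>
#include <cstdio>

// Single-process, multi-GPU all-reduce: each GPU contributes a buffer of
// "gradients" and receives the element-wise sum, as in data-parallel training.
int main() {
    const int nDev = 2;                 // assumed number of local GPUs
    const size_t count = 1 << 20;       // elements per GPU
    int devs[nDev] = {0, 1};

    float* sendbuff[nDev];
    float* recvbuff[nDev];
    cudaStream_t streams[nDev];

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc(&sendbuff[i], count * sizeof(float));
        cudaMalloc(&recvbuff[i], count * sizeof(float));
        cudaMemset(sendbuff[i], 1, count * sizeof(float));   // placeholder data
        cudaStreamCreate(&streams[i]);
    }

    // One NCCL communicator per device, all within this process
    ncclComm_t comms[nDev];
    ncclCommInitAll(comms, nDev, devs);

    // Group the calls so NCCL can launch them together without deadlock
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
    }

    for (int i = 0; i < nDev; ++i) {
        ncclCommDestroy(comms[i]);
        cudaFree(sendbuff[i]);
        cudaFree(recvbuff[i]);
    }
    printf("All-reduce complete on %d GPUs\n", nDev);
    return 0;
}
In multi-node settings, frameworks create one NCCL communicator per rank (typically bootstrapped over MPI or a rendezvous service) rather than using ncclCommInitAll within a single process.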
Best practices in GPU cluster programming emphasize selecting appropriate parallelism strategies based on workload characteristics. Data parallelism, suitable for models that fit on a single GPU, replicates the full model across devices and partitions the input data (e.g., minibatches), with all-reduce synchronization of gradients after the backward pass to maintain consistency; this approach scales well with more GPUs for larger effective batch sizes but can become communication-bound in clusters. In contrast, model parallelism divides the model itself across GPUs—either by layers (pipeline parallelism) or tensors (intra-layer sharding)—which is ideal for massive models exceeding single-GPU memory, though it requires careful partitioning to balance load and minimize inter-GPU transfers. Hybrid strategies combining both are common for large-scale training. For multi-node setups, integrating the Message Passing Interface (MPI) with CUDA enables coordination of processes across cluster nodes, where each process manages one or more local GPUs. A basic example involves initializing MPI, selecting devices, and launching kernels with inter-node communication via MPI calls wrapped around CUDA operations:
cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Simple kernel: each thread writes this process's rank into one element.
__global__ void fill_with_rank(float* data, int n, int rank) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = (float)rank;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Bind each MPI process to a local GPU (assumes 4 GPUs per node and
    // ranks placed contiguously on each node)
    int device = rank % 4;
    cudaSetDevice(device);

    // Allocate and initialize data on the GPU
    const int N = 1024;
    float *d_data;
    cudaMalloc(&d_data, sizeof(float) * N);

    // Launch kernel: 256 threads per block, enough blocks to cover N elements
    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    fill_with_rank<<<blocks, threads>>>(d_data, N, rank);
    cudaDeviceSynchronize();

    // Gather results via MPI (e.g., all-reduce); with a CUDA-aware MPI build,
    // the device buffer could be passed to MPI_Allreduce directly instead.
    float *h_data = (float*)malloc(sizeof(float) * N);
    cudaMemcpy(h_data, d_data, sizeof(float) * N, cudaMemcpyDeviceToHost);
    MPI_Allreduce(MPI_IN_PLACE, h_data, N, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("h_data[0] after all-reduce: %f (sum of all ranks)\n", h_data[0]);

    cudaFree(d_data);
    free(h_data);
    MPI_Finalize();
    return 0;
}
This snippet demonstrates process-to-GPU binding and basic synchronization; it is typically compiled with nvcc in combination with an MPI compiler wrapper such as mpicxx for cluster deployment.

Applications and Workload Mapping

High-Performance Computing

GPU clusters have become integral to high-performance computing (HPC) for accelerating traditional scientific and engineering simulations, particularly those involving complex physics-based models. In molecular dynamics, software like GROMACS leverages GPU parallelism to simulate biomolecular systems at scale, achieving significant speedups through optimized GPU kernels for non-bonded interactions and integration steps. For climate modeling, the Community Earth System Model (CESM) employs GPU acceleration for radiation calculations and atmospheric dynamics, enabling faster simulations of global climate patterns by offloading compute-intensive components to GPUs. In fluid dynamics, OpenFOAM solvers ported to CUDA facilitate large-scale computational fluid dynamics (CFD) simulations, such as turbulent flows, by parallelizing finite volume methods on GPU architectures.

Algorithm mapping strategies in GPU clusters emphasize spatial parallelism and efficient data handling to exploit the massive thread counts of GPUs. Domain decomposition divides simulation domains, such as grids in CFD or particle systems, across multiple nodes and GPUs, allowing independent computation on subdomains with periodic boundary exchanges. Spectral methods, common in wave propagation and turbulence simulations, utilize libraries like cuFFT for fast Fourier transforms (FFTs) on GPUs, enabling efficient transformation between physical and spectral spaces while minimizing data transfers. Load balancing techniques dynamically redistribute workloads to equalize utilization across GPUs, reducing communication overhead from inter-node exchanges over high-speed interconnects such as InfiniBand.

Performance in HPC workloads on GPU clusters is often evaluated through metrics like weak scaling efficiency, defined as \eta = \frac{T_1}{T_p} \times 100\%, where T_1 is the runtime on one processor and T_p is the runtime on p processors for a proportionally scaled problem; ideal efficiency approaches 100% as resources and problem size grow together. The Frontier supercomputer, powered by AMD GPUs, demonstrated this in 2022 by achieving 1.102 exaflops on the TOP500 Linpack benchmark, showcasing near-ideal scaling for HPC applications through optimized GPU-node configurations.

Despite these advances, GPU clusters face HPC-specific challenges, including handling irregular workloads where varying computational demands per subdomain lead to load imbalances and underutilized GPUs. I/O bottlenecks also persist in managing large datasets for simulations, as even high-throughput storage systems struggle to feed data to thousands of GPUs without stalling computations, necessitating techniques like burst buffers for staging data.
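As a simple illustration of the weak-scaling metric above, the C++ sketch below computes efficiency from hypothetical timing measurements (the processor counts and runtimes are made-up example values):

cpp
#include <cstdio>

// Weak-scaling efficiency: runtime on 1 processor divided by runtime on p
// processors, for a problem whose size grows proportionally with p.
double weak_scaling_efficiency(double t1, double tp) {
    return (t1 / tp) * 100.0;
}

int main() {
    // Assumed example timings (seconds) for proportionally scaled problems
    const int    procs[]    = {1, 8, 64, 512};
    const double runtimes[] = {100.0, 104.0, 112.0, 131.0};

    for (int i = 0; i < 4; ++i)
        printf("p = %4d: efficiency = %.1f%%\n",
               procs[i], weak_scaling_efficiency(runtimes[0], runtimes[i]));
    return 0;
}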

Machine Learning and AI

GPU clusters have become essential for training large-scale models in machine learning and artificial intelligence, enabling the processing of massive datasets and complex architectures that exceed the capabilities of single GPUs. Key applications include training transformer-based models like those in the GPT series, which used clusters of thousands of GPUs—reportedly around 25,000 NVIDIA A100 GPUs for GPT-4—to handle the computational demands of hundreds of billions of parameters. In computer vision, distributed training of convolutional neural networks like ResNet employs data parallelism, where the dataset is partitioned across multiple GPUs to compute independent forward and backward passes, synchronizing gradients at each step to scale training efficiently on clusters. Reinforcement learning also leverages GPU clusters for distributed policy optimization, as demonstrated by IMPALA, which uses a centralized learner on GPUs to process experiences from distributed actors, achieving scalable training across thousands of environments.

Workload mapping in GPU clusters for machine learning involves sophisticated distributed strategies to manage memory and compute constraints in large neural networks. Pipeline parallelism addresses models too large for single GPUs by splitting layers across multiple nodes, allowing sequential processing of micro-batches to overlap computation and reduce idle time, as pioneered in frameworks like GPipe for training models with over a billion parameters. For gradient synchronization in stochastic gradient descent (SGD), all-reduce operations aggregate gradients from all workers before updating model weights, ensuring consistent optimization across the cluster. The SGD update rule is given by: \mathbf{w}_{t+1} = \mathbf{w}_t - \eta \cdot \frac{1}{B} \sum_{i=1}^B \nabla \mathcal{L}(\mathbf{w}_t; \mathbf{x}_i, y_i) where \mathbf{w}_t are the weights at step t, \eta is the learning rate, B is the batch size, and \nabla \mathcal{L} is the gradient of the loss function. This approach, implemented in libraries like Horovod, enables efficient parallelization of SGD for distributed training on GPU clusters.

At massive scales, GPU clusters demonstrate their impact through real-world deployments, such as Meta's GenAI clusters announced in 2024, each comprising roughly 24,000 NVIDIA H100 GPUs and used to train the Llama 3 family of models, achieving high throughput for foundation models with hundreds of billions of parameters. For inference, optimizations like tensor parallelism in NVIDIA's Triton Inference Server distribute model tensors across GPUs, enabling low-latency serving of large language models by parallelizing matrix operations. AI-specific optimizations further enhance efficiency, including mixed-precision training with FP16 or FP8 formats to increase throughput by up to 3x while preserving accuracy through selective FP32 computations, supported natively on modern GPUs. Data loading bottlenecks are mitigated using frameworks like NVIDIA DALI, which performs GPU-accelerated preprocessing to overlap I/O with computation, reducing end-to-end training time in distributed setups.
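The weight-update step of the SGD rule above maps naturally onto a GPU kernel. The following minimal CUDA C++ sketch is illustrative only: it assumes the gradients have already been averaged across workers (e.g., by an all-reduce) and applies the update element-wise to a parameter vector:

cpp
#include <cuda_runtime.h>

// Apply w <- w - lr * g element-wise, where g already holds the gradient
// averaged over the global batch (e.g., after an all-reduce).
__global__ void sgd_update(float* weights, const float* grads,
                           float lr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) weights[i] -= lr * grads[i];
}

// Host-side helper: launch one thread per parameter.
void apply_sgd_step(float* d_weights, const float* d_grads,
                    float lr, int n, cudaStream_t stream = 0) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    sgd_update<<<blocks, threads, 0, stream>>>(d_weights, d_grads, lr, n);
}
Production frameworks fuse such updates with optimizer state (momentum, Adam moments) and often overlap them with communication, but the underlying per-element arithmetic is the same.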

Vendors and Deployments

Major Vendors

NVIDIA dominates the GPU cluster market, holding approximately 80-90% share in AI accelerators as of 2025, primarily through its integrated hardware solutions like the DGX and HGX systems, which combine multiple GPUs with high-speed interconnects for scalable AI and HPC deployments. The Grace CPU, introduced in 2023, pairs an Arm-based processor with NVIDIA GPUs via NVLink-C2C interconnects to form superchips like the GH200 (Grace + Hopper), enabling energy-efficient, high-bandwidth processing in data centers. NVIDIA also offers full-stack solutions such as DGX Cloud, a managed service for AI development that leverages its GPU infrastructure across partner clouds.

AMD provides competitive alternatives through its Instinct MI-series accelerators, emphasizing cost-effective options for HPC and AI workloads. The MI300X, released in late 2023, features 192 GB of HBM3 memory and pairs with EPYC CPUs in Instinct platforms to deliver peak memory bandwidth of up to 5.3 TB/s, targeting large-scale simulations and training at lower cost compared to rivals.

Other vendors contribute specialized hardware to the GPU cluster landscape. Intel's Gaudi 3 AI accelerators, launched in 2024, focus on scalable training and inference with up to 3.7 TB/s of memory bandwidth per card, available in PCIe and OAM form factors for integration into enterprise clusters. Google offers TPU pods as cluster-scale equivalents, with the 2025 Ironwood (TPU v7) generation providing up to 4.6 petaFLOPS of FP8 performance per chip in pods of thousands of units optimized for inference. ARM-based options, such as NVIDIA's Grace Superchip, further enable diverse architectures in clusters.

System integrators like Dell Technologies and HPE deliver turnkey GPU clusters incorporating these components. Dell's PowerEdge servers, such as the XE8712 with GB200 Grace Blackwell Superchips, support up to four GPUs per node for AI factories. HPE's Cray EX systems integrate NVIDIA Grace Hopper Superchips for exascale HPC, featuring direct liquid cooling for dense deployments. The vendor ecosystem thrives on partnerships and certifications to ensure interoperability; for instance, NVIDIA's HGX platforms are built with partners such as Supermicro in modular designs certified for enterprise AI workloads, alongside support services from multiple OEMs.

Notable Implementations

One prominent example in supercomputing is the Frontier system at Oak Ridge National Laboratory (ORNL), deployed in 2022, which features 37,632 AMD Instinct MI250X GPUs across 9,408 nodes and achieved 1.1 exaFLOPS on the High Performance Linpack (HPL) benchmark, marking the first exascale supercomputer. Another key implementation is El Capitan at Lawrence Livermore National Laboratory (LLNL), deployed in 2024 and fully operational by 2025, powered by AMD Instinct MI300A APUs in an HPE Cray EX255a architecture, delivering a peak performance of 2.79 exaFLOPS and an HPL score of 1.742 exaFLOPS. In 2025, Europe's JUPITER Booster at the EuroHPC facility in Jülich, Germany, became the continent's first exascale supercomputer, utilizing NVIDIA GH200 Grace Hopper Superchips across thousands of nodes to achieve over 1 exaFLOPS on the HPL benchmark, ranking fourth on the TOP500 list as of November 2025.

In commercial deployments, NVIDIA's Eos cluster, operational since 2023, incorporates 4,608 H100 Tensor Core GPUs across 576 DGX H100 systems, enabling advanced AI research with capabilities up to 18.4 exaFLOPS in FP8 precision. Similarly, Meta's AI Research SuperCluster (RSC), completed in 2022, utilizes 16,000 NVIDIA A100 GPUs to accelerate model training, including large-scale language models. Cloud-based GPU clusters provide scalable access through virtual instances, such as Amazon Web Services (AWS) EC2 P5 instances, generally available since 2023, which offer up to eight NVIDIA H100 GPUs per instance with 640 GB of high-bandwidth GPU memory for generative AI and HPC workloads. Microsoft's Azure ND H100 v5 series, launched in 2023, features up to eight H100 GPUs per virtual machine, supporting deployments from single VMs to thousands for AI training tasks. These cloud options enhance accessibility for organizations without on-premises infrastructure, though they involve higher operational costs compared to dedicated on-premises setups, which offer better long-term customization and efficiency for sustained high-volume workloads.

Notable achievements from these clusters include enabling the training of trillion-parameter AI models, as demonstrated by NVIDIA's Hopper- and Blackwell-enabled systems, which support unprecedented scale in generative AI development. Energy-efficiency metrics, such as Frontier's design achieving approximately 52.7 gigaFLOPS per watt on HPL, highlight progress in sustainable supercomputing.
