CUDA
CUDA (Compute Unified Device Architecture) is a proprietary parallel computing platform and programming model developed by NVIDIA for general-purpose computing on graphics processing units (GPUs).[1] It enables dramatic increases in computing performance by allowing developers to harness the massively parallel processing capabilities of NVIDIA GPUs for tasks beyond traditional graphics rendering, such as scientific simulations, data analysis, and artificial intelligence.[2] Introduced in 2006, CUDA has become the foundation for GPU-accelerated computing, with hundreds of millions of CUDA-enabled GPUs installed worldwide across desktops, workstations, servers, and supercomputers.[3]
The platform originated from NVIDIA's efforts to extend GPU utility beyond graphics, led by engineer Ian Buck, who spearheaded its launch as the world's first solution for general computing on GPUs.[1] Since its debut, CUDA has evolved through regular updates to the CUDA Toolkit, which provides compilers, libraries, and tools for developing high-performance applications; the latest version as of 2025, CUDA 13.0, includes advancements for the newest NVIDIA architectures like Hopper and Blackwell, supporting features such as heterogeneous memory management and enhanced multi-GPU scalability.[4][5] CUDA's programming model uses extensions to languages like C and C++ (primarily CUDA C++), where developers write kernels—functions executed in parallel across thousands of threads on the GPU—launched using a unique syntax like <<<blocks, threads>>>.[6] It also supports higher-level abstractions, including drop-in libraries for linear algebra (cuBLAS), fast Fourier transforms (cuFFT), and deep learning (cuDNN), as well as directives like OpenACC for easier parallelization without rewriting code.[7]
CUDA's impact spans diverse domains, accelerating applications in high-performance computing (HPC), such as climate modeling and molecular dynamics; machine learning and AI, powering frameworks like TensorFlow and PyTorch for training neural networks with speedups up to thousands of times over CPU-only implementations; and industries including finance for risk analysis, healthcare for genomics, and autonomous vehicles for real-time processing.[4][8] By 2025, it underpins thousands of published research papers and commercial software, fostering an ecosystem that includes support for Python via libraries like Numba and CuPy, making GPU acceleration accessible to a broad range of developers.[4] This widespread adoption has solidified CUDA as the de facto standard for GPU computing, driving innovations in fields requiring massive parallelism and high bandwidth.[3]
Introduction
Background
CUDA (Compute Unified Device Architecture) was developed by NVIDIA to enable general-purpose computing on graphics processing units (GPUs), marking a significant advancement in parallel computing. The platform was unveiled by NVIDIA on November 8, 2006, with the first public release of CUDA 1.0 occurring in June 2007.[9][10] This introduction coincided with the launch of NVIDIA's Tesla architecture, which unified graphics and compute capabilities on the same hardware.
Prior to CUDA, general-purpose GPU (GPGPU) computing relied on mapping non-graphics algorithms to graphics primitives, such as pixel shaders in APIs like OpenGL or DirectX, which imposed significant limitations on flexibility, memory access, and programming efficiency for scientific and engineering workloads. CUDA addressed these challenges by providing a direct, hardware-optimized platform for parallel computing, shifting the focus from graphics-centric programming to scalable general-purpose applications in fields like simulations, data analysis, and high-performance computing. This motivation stemmed from the need to leverage the massive parallelism of GPUs—hundreds of cores working cooperatively—beyond rendering tasks, as highlighted in NVIDIA's foundational architecture design.[11]
The initial hardware support for CUDA was the GeForce 8 series GPUs, based on the Tesla architecture, such as the GeForce 8800 released in November 2006, featuring 128 streaming processor cores organized into 16 multiprocessors. Key early milestones included the integration of CUDA with C/C++ extensions, allowing developers to write serial host code on CPUs that invoked parallel kernels on GPUs using familiar syntax. Over time, CUDA evolved from a primarily proprietary ecosystem to incorporate open-source elements, notably through the CUDA-X libraries—a suite of GPU-accelerated tools for domains like AI and data science, many of which are now available under open-source licenses to foster broader adoption and collaboration.[11][12][13]
Core Concepts
CUDA operates as a heterogeneous computing model that integrates the CPU, referred to as the host, with the GPU, known as the device, to enable parallel processing where the GPU serves as a coprocessor for computationally intensive tasks.[14] This architecture leverages the GPU's massively parallel structure alongside the CPU's sequential processing capabilities, allowing developers to offload parallel workloads from the host to the device while maintaining separate memory spaces for each.[14]
At the heart of CUDA's parallelism are kernels, which are special functions executed on the GPU and annotated with the __global__ qualifier, launched from the host using a syntax that specifies execution configuration.[14] These kernels are invoked multiple times in parallel by threads, the fundamental units of execution, organized into blocks of up to 1024 threads that execute cooperatively on a single multiprocessor.[14] Blocks are further grouped into grids, enabling scalable parallelism across the entire GPU, with thread and block indices accessible via built-in variables like threadIdx and blockIdx to differentiate computations.[14] This hierarchy facilitates fine-grained control over workload distribution, supporting the Single Instruction, Multiple Threads (SIMT) execution model where groups of 32 threads, called warps, execute in lockstep to maximize throughput, though branch divergence within a warp can lead to serialized execution of divergent paths.[14]
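A short sketch can make the hierarchy concrete; the kernel and function names (scaleMatrix, launchScale, d_m) are hypothetical and chosen for illustration rather than taken from the cited documentation. A two-dimensional grid of 16x16-thread blocks is launched with the <<<grid, block>>> syntax, and each thread combines blockIdx, blockDim, and threadIdx to claim one matrix element.
```cpp
// Illustrative kernel: a 2D grid of 16x16-thread blocks scales a matrix in place.
// Each thread derives its (row, col) position from blockIdx, blockDim, and threadIdx.
__global__ void scaleMatrix(float *m, int rows, int cols, float alpha) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {
        m[row * cols + col] *= alpha;            // one element per thread
    }
}

// Host-side launch: round the grid up so every element is covered.
void launchScale(float *d_m, int rows, int cols, float alpha) {
    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    scaleMatrix<<<grid, block>>>(d_m, rows, cols, alpha);
}
```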
CUDA's memory hierarchy is designed to optimize data access patterns in parallel environments, featuring distinct types with varying scopes, latencies, and caching behaviors. Global memory is accessible to all threads across the device and persists across kernel launches, but incurs high latency unless accesses are coalesced into contiguous transactions.[14] Shared memory, visible only within a block, provides low-latency access for threads to cooperate on shared data, often partitioned into banks to enable concurrent reads.[14] Constant memory offers read-only access to all threads with caching for broadcast data, while texture memory supports read-only operations optimized for spatial locality in 2D data formats.[14] Introduced in CUDA 6.0, unified memory establishes a single address space accessible from both host and device, automatically handling data migration to simplify programming without explicit transfers.[15]
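As a brief illustration of unified memory, the following sketch (an assumption-based example, not taken from the CUDA samples; the kernel name incrementAll and the buffer name data are hypothetical) allocates a managed buffer with cudaMallocManaged, initializes it on the host without an explicit copy, runs a kernel over it, and synchronizes before the host reads the results.
```cpp
#include <cuda_runtime.h>

__global__ void incrementAll(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;                       // device writes the managed buffer directly
}

int main() {
    const int n = 1 << 20;
    int *data = NULL;
    cudaMallocManaged(&data, n * sizeof(int));     // single pointer valid on host and device
    for (int i = 0; i < n; ++i) data[i] = i;       // host initializes in place, no cudaMemcpy
    incrementAll<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                       // wait for the GPU before the host reads results
    cudaFree(data);
    return 0;
}
```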
Host-device interaction in CUDA relies on runtime APIs to manage data movement and resource allocation, ensuring efficient communication between the CPU and GPU. Functions like cudaMalloc allocate linear memory on the device, while cudaMemcpy transfers data between host and device memory spaces, supporting synchronous or asynchronous operations to minimize overhead.[14] These APIs form the foundation for initializing the GPU environment and synchronizing execution, allowing the host to orchestrate kernel launches and monitor device completion.[14]
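A minimal sketch of asynchronous host-device interaction follows; it is illustrative rather than a canonical Toolkit pattern, and the names stagedTransfer, h_buf, d_buf, and the commented-out kernel myKernel are hypothetical. Pinned (page-locked) host memory from cudaMallocHost is what allows cudaMemcpyAsync on a stream to overlap with other work, and cudaStreamSynchronize lets the host wait for completion.
```cpp
#include <cuda_runtime.h>

// Stage data to and from the device asynchronously on a single stream.
void stagedTransfer(size_t n) {
    size_t bytes = n * sizeof(float);
    float *h_buf = NULL, *d_buf = NULL;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMallocHost(&h_buf, bytes);                 // pinned host allocation
    cudaMalloc(&d_buf, bytes);

    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
    // myKernel<<<grid, block, 0, stream>>>(d_buf);   // hypothetical kernel queued on the same stream
    cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);                 // host blocks until all queued work finishes

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    cudaStreamDestroy(stream);
}
```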
Architecture
Hardware Support
CUDA supports a range of NVIDIA GPU architectures, evolving from the initial Tesla microarchitecture introduced in 2006 to the latest Blackwell architecture released in 2024.[16][17] The supported architectures include Tesla (2006, compute capability 1.0–1.3), Fermi (2010, 2.0–2.1), Kepler (2012, 3.0–3.7), Maxwell (2014, 5.0–5.3), Pascal (2016, 6.0–6.2), Volta (2017, 7.0), Turing (2018, 7.5), Ampere (2020, 8.0–8.6), Ada Lovelace (2022, 8.9), Hopper (2022, 9.0), and Blackwell (2024, 10.0–12.0).[16][18][17] These architectures represent progressive advancements in parallel processing capabilities tailored for CUDA execution, with each generation introducing enhancements in core count, memory bandwidth, and specialized units for compute-intensive tasks.[16]
Compute capability levels, ranging from 1.0 (basic features in early Tesla GPUs) to 12.0 and higher (in Blackwell as of 2025), define the specific hardware features and instruction sets available for CUDA programs on a given GPU.[16][17] For instance, double-precision floating-point support was introduced with compute capability 1.3 in later Tesla GPUs and became more robust in subsequent architectures like Fermi (2.0), enabling high-performance scientific computing.[18][17] Higher levels, such as 9.0 in Hopper and 12.0 in Blackwell, unlock advanced features like improved tensor operations and enhanced AI acceleration, ensuring CUDA applications can leverage the full potential of modern hardware.[16]
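Applications commonly query the compute capability at runtime to select appropriate code paths. The short sketch below is illustrative only: it uses the runtime API's cudaGetDeviceProperties to print each device's name, major/minor compute capability, and multiprocessor count.
```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major and prop.minor encode the compute capability, e.g. 9.0 for Hopper.
        printf("Device %d: %s, compute capability %d.%d, %d SMs\n",
               dev, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}
```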
The minimum hardware requirements for CUDA include any NVIDIA GPU with compute capability 1.0 or higher, which inherently features at least one Streaming Multiprocessor (SM) for executing parallel threads.[17] Additionally, a compatible NVIDIA driver is required, such as version 580 or later for CUDA 13.x toolkits, to enable runtime support and API access.[19][20]
CUDA maintains backward compatibility, allowing newer toolkit versions to run applications on older GPUs as long as the code targets a supported compute capability level, preventing the need for recompilation in many cases.[19] This mechanism ensures that legacy hardware, from Tesla to current generations, remains viable for development and deployment without immediate upgrades.[19]
Multiprocessor Design
The Streaming Multiprocessor (SM) serves as the fundamental processing unit within NVIDIA GPUs for executing CUDA threads, comprising multiple CUDA cores, instruction schedulers, and warp scheduling hardware to enable massively parallel computation.[14] Each SM operates as an independent core capable of handling thousands of threads concurrently through its register file and shared memory resources, optimizing for the Single Instruction, Multiple Thread (SIMT) execution model.[14]
At the heart of each SM are the CUDA cores, which perform the arithmetic and logical operations for threads; for instance, the Volta architecture features 64 FP32 CUDA cores per SM, enabling high-throughput floating-point processing.[21] Complementing these are warp schedulers—typically four per SM in Volta—that manage the execution of thread groups by issuing instructions to available processing resources, ensuring efficient utilization even under varying workloads.[21] Each SM is partitioned into processing blocks that pair a scheduler with their own execution units, and SIMT parallelism is realized by issuing a warp's instruction across these units in lockstep.[14]
Warp execution forms the basic unit of scheduling in the SM, where threads are grouped into warps of 32 threads that execute the same instruction simultaneously in lockstep under the SIMT model, promoting high occupancy and throughput when paths converge.[14] Divergence within a warp—where threads take different execution paths—leads to serialized processing for divergent branches, but the SM's hardware mitigates this by masking inactive threads.[14]
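The effect of divergence can be seen in a small comparison sketch (an illustrative example with hypothetical kernel names, not a benchmark from the cited sources): in the first kernel the branch splits even and odd lanes inside every warp, forcing both paths to execute serially with inactive lanes masked, whereas in the second the condition is uniform across each 32-thread warp, so no serialization occurs.
```cpp
// Divergent: even and odd lanes within the same warp take different branches,
// so the SM executes both paths one after the other with lanes masked off.
__global__ void divergentBranch(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}

// Warp-uniform: the condition is constant across each 32-thread warp
// (i / 32 alternates per warp), so every warp follows a single path.
__global__ void uniformBranch(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)
        out[i] = sinf((float)i);
    else
        out[i] = cosf((float)i);
}
```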
The scale of SM deployment varies across GPU models to match computational demands; the A100 GPU, based on the Ampere architecture, incorporates 108 SMs to supply the aggregate throughput demanded by large-scale HPC and AI systems.[22] Per-SM resources include a 256 KB register file for fast thread-local storage and configurable shared memory up to 164 KB in Ampere, which supports low-latency data sharing among threads in the same block while balancing with L1 cache allocation.[22][14]
Architectural evolution in SM design has focused on enhancing concurrency and divergence handling; the Volta SM introduced independent thread scheduling, allowing each thread in a warp to maintain its own program counter and call stack, enabling finer-grained execution and automatic reconvergence without explicit barriers.[21][14] This advancement over prior unified warp scheduling improves resource utilization for irregular workloads, with subsequent architectures like Ampere building upon it by increasing per-SM throughput and memory capacity.[22]
Programming Model
Key Capabilities
CUDA enables developers to leverage NVIDIA GPUs for general-purpose computing through a set of extensions to the C/C++ programming language, allowing the definition of functions that execute on both host (CPU) and device (GPU). The __global__ qualifier declares kernel functions that run on the GPU and are invoked from the host, executing asynchronously across multiple threads. Functions marked with __device__ are compiled for execution solely on the device and can only be called from other device functions, while __host__ specifies host-only functions, though it can combine with __device__ for functions compilable for both environments. These extensions facilitate a heterogeneous programming model where developers manage parallelism explicitly without altering core application logic significantly.[14]
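The qualifiers can be combined as in the following sketch, which uses hypothetical function names to show a device-only helper, a helper compiled for both host and device, and a kernel that calls both.
```cpp
// __device__: compiled for the GPU, callable only from device code.
__device__ float squareOnDevice(float x) { return x * x; }

// __host__ __device__: compiled for both CPU and GPU callers.
__host__ __device__ float scaleValue(float x, float a) { return a * x; }

// __global__: a kernel, launched from the host and executed on the device.
__global__ void squareAndScale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scaleValue(squareOnDevice(data[i]), 2.0f);
}
```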
At the API level, CUDA provides the Runtime API for high-level operations such as memory allocation, data transfer, and kernel launches, exemplified by functions like cudaLaunchKernel which configures and executes kernels on the GPU. In contrast, the Driver API offers low-level control over GPU contexts, modules, and devices, enabling advanced features like dynamic loading of code and finer-grained resource management, though it requires more explicit error handling. Complementing these APIs, CUDA includes optimized libraries such as cuBLAS for basic linear algebra subroutines (BLAS) on GPUs, accelerating matrix operations essential for scientific computing, and cuDNN for deep neural network primitives, providing high-performance implementations of convolutions and other operations critical for machine learning workflows.[14]
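As an example of the drop-in library approach, the sketch below performs a SAXPY (y = alpha*x + y) on device-resident vectors through cuBLAS rather than a hand-written kernel; the wrapper name saxpyOnDevice is hypothetical, and the program must be linked against cuBLAS (for example with -lcublas).
```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Compute y = alpha * x + y on vectors already resident in device memory.
void saxpyOnDevice(const float *d_x, float *d_y, int n, float alpha) {
    cublasHandle_t handle;
    cublasCreate(&handle);                           // set up the cuBLAS context
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // strides of 1 over both vectors
    cublasDestroy(handle);
}
```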
Parallelism in CUDA is managed through hierarchical constructs that allow threads to index their positions within blocks and grids. The built-in variables threadIdx and blockIdx provide unique identifiers for threads within a block and blocks within a grid, respectively, enabling developers to partition data and computations across thousands of threads for scalable parallelism. Synchronization is achieved via intrinsics like __syncthreads(), which acts as a barrier for all threads in a block to ensure ordered execution and prevent race conditions when accessing shared resources. These mechanisms support efficient mapping of algorithms to the GPU's SIMT architecture, promoting data-level parallelism without explicit thread creation.[14]
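A common pattern that relies on both shared memory and __syncthreads() is a block-level reduction. The sketch below is illustrative (the kernel name blockSum is hypothetical and a block size of 256 threads, a power of two, is assumed): each block sums 256 inputs in shared memory and writes one partial result.
```cpp
// Block-level sum reduction: each block reduces 256 elements held in shared memory.
// __syncthreads() orders the phases so no thread reads a partial sum too early.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();                        // barrier is outside the if, so all threads reach it
    }
    if (tid == 0) out[blockIdx.x] = tile[0];    // one partial sum per block
}
```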
For development and optimization, CUDA integrates tools like Nsight Compute, which profiles kernel performance metrics such as occupancy, memory throughput, and instruction execution to identify bottlenecks and suggest improvements. Nsight Compute supports guided analysis workflows and integrates with the CUDA Runtime API to capture traces during application execution, aiding in the tuning of GPU-accelerated code for maximum efficiency. These profiling capabilities are essential for achieving high performance in production environments.[14]
Code Example
A representative example of basic CUDA programming is the vector addition operation, where two input vectors are added element-wise on the GPU to produce an output vector.[23] This demonstrates memory allocation on the device, data transfer between host and device, kernel launch with a specified grid and block configuration, and result verification.[24]
The following code implements vector addition for 50,000 floating-point elements, using random initialization for the input vectors and thorough error checking after each CUDA runtime API call.[23]
```cpp
/* Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. */
/**
 * Vector addition: C = A + B.
 * This sample implements element by element vector addition.
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int numElements) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        C[i] = A[i] + B[i] + 0.0f;
    }
}

int main(void) {
    cudaError_t err = cudaSuccess;
    int numElements = 50000;
    size_t size = numElements * sizeof(float);
    printf("[Vector addition of %d elements]\n", numElements);

    float *h_A = (float *)malloc(size);
    float *h_B = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    if (h_A == NULL || h_B == NULL || h_C == NULL) {
        fprintf(stderr, "Failed to allocate host vectors!\n");
        exit(EXIT_FAILURE);
    }
    for (int i = 0; i < numElements; ++i) {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }

    float *d_A = NULL;
    err = cudaMalloc((void **)&d_A, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    float *d_B = NULL;
    err = cudaMalloc((void **)&d_B, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    float *d_C = NULL;
    err = cudaMalloc((void **)&d_C, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    printf("Copy input data from the host memory to the CUDA device\n");
    err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    int threadsPerBlock = 256;
    int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
    printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    printf("Copy output data from the CUDA device to the host memory\n");
    err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    for (int i = 0; i < numElements; ++i) {
        if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5) {
            fprintf(stderr, "Result verification failed at element %d!\n", i);
            exit(EXIT_FAILURE);
        }
    }
    printf("Test PASSED\n");

    err = cudaFree(d_A);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaFree(d_B);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaFree(d_C);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    free(h_A);
    free(h_B);
    free(h_C);
    printf("Done\n");
    return 0;
}
```
In the kernel function, each thread computes a single element of the output vector using the global index i = blockDim.x * blockIdx.x + threadIdx.x, which maps the thread hierarchy to array positions for parallel addition while avoiding out-of-bounds access via an if condition.[25] The host code allocates device memory with cudaMalloc, transfers data using cudaMemcpy, launches the kernel with a grid of blocks (each containing 256 threads) calculated to cover all elements, and checks for launch errors using cudaGetLastError().[23] Results are copied back to the host for verification against a tolerance of 1e-5, and memory is freed with cudaFree.[23]
To compile this program, use the NVIDIA CUDA compiler nvcc from the CUDA Toolkit and specify the target architecture with the -arch flag; for a Hopper GPU (compute capability 9.0), for example: nvcc vectorAdd.cu -o vectorAdd -arch=sm_90.[17]
Features and Specifications
Version History
CUDA's development began with its initial release in June 2007 as version 1.0, introducing the foundational programming model for executing basic kernels on NVIDIA GPUs with compute capability 1.0, including the nvcc compiler for C/C++ extensions, runtime API, and initial libraries like cuBLAS for linear algebra operations.[10]
Version 2.0, released in August 2008, expanded hardware support to compute capability 1.3 on GT200 GPUs, adding double-precision floating-point arithmetic alongside device emulation for debugging without hardware and enhancements to libraries and sample codes.[10][3]
Subsequent releases built incrementally until CUDA 5.0 in October 2012, which introduced dynamic parallelism allowing GPU threads to launch child kernels dynamically, alongside improvements to libraries such as cuFFT for fast Fourier transforms and cuRAND for random number generation, enabling more complex adaptive algorithms.[10][26]
CUDA 8.0, launched in September 2016, enhanced memory management with full Unified Memory support for automatic data migration between host and device, optimized for Pascal architecture GPUs, and included updates to the nvcc compiler and libraries for better performance in multi-GPU setups.[10][27]
In June 2020, CUDA 11.0 brought advancements in multi-GPU programming through the Multi-Process Service for secure sharing of GPUs in virtualized environments and MIG (Multi-Instance GPU) support, along with independent versioning for toolkit components to simplify updates.[10][8]
CUDA 12.0, released in December 2022, added FP8 data type support in libraries like cuBLAS for efficient low-precision computations on Hopper and Ada architectures, while dropping support for the aging Kepler architecture to streamline the toolkit.[10][28]
The latest major version, CUDA 13.0, debuted in August 2025 with optimizations for Hopper (H100) and initial Blackwell GPU integration, including enhanced security features via updated driver compatibility.[10][20]
| Version | Release Date | Key Features |
|---|---|---|
| 1.0 | June 2007 | Basic kernel execution, nvcc compiler, cuBLAS library |
| 2.0 | August 2008 | Double-precision support, device emulation |
| 5.0 | October 2012 | Dynamic parallelism, cuFFT and cuRAND libraries |
| 8.0 | September 2016 | Unified Memory, Pascal architecture support |
| 11.0 | June 2020 | Multi-Process Service, MIG for multi-GPU |
| 12.0 | December 2022 | FP8 in cuBLAS, Kepler support dropped |
| 13.0 | August 2025 | Hopper/Blackwell optimizations, security enhancements |
CUDA 13.0 Update 2, released in November 2025, incorporates performance improvements for Blackwell-based systems such as DGX Spark and Jetson Thor, alongside Python accelerations via improved Numba and CuPy integrations for seamless GPU kernel development in Python, and features announced at GTC 2025 such as enhanced Arm platform unification for cross-architecture portability.[20][5][29][30]
The CUDA Toolkit consistently comprises core components such as the nvcc compiler for device code, runtime and driver APIs, drop-in libraries (e.g., cuBLAS, cuDNN, Thrust), and sample applications hosted on GitHub for demonstrating usage patterns.[31][32]
Data Types and Precision
CUDA supports a range of built-in integer data types for device code, including signed and unsigned variants of char (1 byte), short (2 bytes), int (4 bytes), long (4 or 8 bytes, matching the size used by the host compiler), and long long (8 bytes).[33] These types are available across all compute capabilities since CUDA 1.0, enabling basic arithmetic and memory operations on the GPU.[17] Vector variants, such as int2, int4, and uint3, aggregate 1 to 4 components with specific alignment requirements (e.g., 8 bytes for int2, 16 bytes for int4) to optimize memory access.[34]
Floating-point data types in CUDA prioritize precision trade-offs for performance in compute-intensive tasks. The 32-bit float type, conforming to IEEE 754 single-precision, has been fully supported since CUDA 1.0 across all compute capabilities, offering high throughput in arithmetic operations.[35] The 64-bit double type, adhering to IEEE 754 double-precision, requires compute capability 1.3 or higher (introduced with the GT200 architecture) for native support, enabling accurate simulations in scientific computing. Lower-precision options include the 16-bit half type (IEEE 754 binary16 format), available in device code since compute capability 5.3 (Maxwell architecture) with full hardware acceleration from Pascal (6.0+), useful for machine learning due to reduced memory bandwidth.[36] The bfloat16 type, a 16-bit format with 8-bit exponent for extended dynamic range in AI training, is supported starting from compute capability 8.0 (Ampere architecture). Additionally, 8-bit floating-point (FP8) types, including E4M3 and E5M2 formats, were introduced in CUDA 12 for Hopper GPUs (compute capability 9.0+), targeting efficiency in large-scale AI models by halving storage compared to FP16.
The following table summarizes key floating-point type support by compute capability, highlighting availability for native arithmetic:
| Compute Capability | float (32-bit) | double (64-bit) | half (16-bit) | bfloat16 (16-bit) | FP8 (8-bit) |
|---|---|---|---|---|---|
| 1.0–1.2 | Yes | No | No | No | No |
| 1.3+ | Yes | Yes | No | No | No |
| 5.3+ | Yes | Yes | Yes | No | No |
| 6.0+ (Pascal) | Yes | Yes | Full accel. | No | No |
| 8.0+ (Ampere) | Yes | Yes | Full accel. | Yes | No |
| 9.0+ (Hopper) | Yes | Yes | Full accel. | Yes | Yes (CUDA 12+) |
This table is derived from hardware feature tables in the CUDA documentation, where higher capabilities build on prior support.[17][36]
Atomic operations in CUDA, essential for thread-safe updates in parallel kernels, are supported for integer types (int, unsigned int, long long) and floating-point types (float, double) in global and shared memory, with availability scaling by compute capability (e.g., 32-bit global-memory atomics from 1.1 and 64-bit atomics from 1.2).[37] For lower precisions, atomic adds for half and bfloat16 are available from compute capability 6.0+ and 8.0+, respectively, aiding reductions in AI workloads.[37] Intrinsics facilitate type reinterpretation and conversions without data movement; for example, __float_as_int(float x) reinterprets the bits of a 32-bit float as a signed integer, useful for bit-level manipulations, and is supported across all compute capabilities. Similar intrinsics exist for half (e.g., __half2float) from 5.3+ and FP8 conversions in CUDA 12+.
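A short device-code sketch combining these features follows; it is illustrative only (the kernel and buffer names are hypothetical), accumulating a running sum with atomicAdd on a float and using __float_as_int to test each value's sign bit.
```cpp
// Accumulate a running float sum and count negative inputs (by sign bit) in one pass.
__global__ void sumAndCountNegatives(const float *in, int n, float *sum, unsigned int *negatives) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(sum, in[i]);                  // float atomicAdd in global memory
        int bits = __float_as_int(in[i]);       // reinterpret the value's bit pattern
        if (bits < 0) {                         // sign bit set (also counts -0.0f)
            atomicAdd(negatives, 1u);
        }
    }
}
```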
Advanced Components
Tensor Cores represent a specialized class of processing units integrated into CUDA-enabled GPUs starting with the Volta architecture, designed to accelerate mixed-precision matrix multiply-accumulate (MMA) operations critical for deep learning workloads. Introduced in 2017 with the Tesla V100 GPU, these cores each perform a 4x4x4 matrix multiply-accumulate per clock cycle, delivering up to 125 tensor TFLOPS at FP16 precision across the full V100 GPU, which significantly boosts throughput for neural network training and inference compared to traditional CUDA cores.[21][38]
The Warp Matrix Multiply-Accumulate (WMMA) API in CUDA provides developers with a means to directly program Tensor Cores at the warp level, enabling fragment operations on 16x16x16 matrices using FP16 inputs and FP32 accumulation for maintained numerical stability. This API facilitates efficient mapping of tensor operations to hardware, supporting MMA instructions that accumulate results in higher precision to mitigate precision loss in low-precision computations. Subsequent extensions to the API accommodate integer formats like INT8 for broader inference applications.[38]
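The sketch below shows the general shape of WMMA usage for a single warp computing one 16x16x16 FP16 tile with FP32 accumulation; it is illustrative rather than taken from a specific sample, and assumes compute capability 7.0 or higher, a launch with at least one full warp, and the row-major/column-major layouts chosen here.
```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A*B + C for a single 16x16x16 tile on Tensor Cores,
// reading FP16 inputs and accumulating in FP32.
// Launch with at least one full warp, e.g. wmmaTile<<<1, 32>>>(dA, dB, dC);
__global__ void wmmaTile(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);                        // zero the accumulator
    wmma::load_matrix_sync(aFrag, A, 16);                    // leading dimension 16
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);              // matrix multiply-accumulate on Tensor Cores
    wmma::store_matrix_sync(C, cFrag, 16, wmma::mem_row_major);
}
```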
Tensor Cores have evolved across GPU architectures to support an expanding range of precisions and optimizations. The Turing architecture (2018) introduced second-generation Tensor Cores with added INT8 and INT4 support, enabling up to 130 tensor TOPS for integer inference tasks, though without native sparsity acceleration at the time. The Ampere architecture (2020) advanced to third-generation cores, incorporating BF16 for robust training with reduced dynamic range issues, TF32 for drop-in FP32-like accuracy at higher speeds, and FP64 for high-precision scientific computing, alongside structured sparsity for 2x effective throughput in supported INT8 and INT4 operations on pruned neural networks.[39][22]
The Hopper architecture (2022), powering the H100 GPU, brought fourth-generation Tensor Cores with native FP8 support via the Transformer Engine, a software-hardware co-design that dynamically scales precisions to optimize large language model (LLM) training, achieving up to 6x higher performance over FP16 for transformer-based models by leveraging FP8's reduced memory footprint and compute latency. The Blackwell architecture (2024) further refines fifth-generation Tensor Cores with FP4 capabilities and doubled FP8 throughput, targeting ultra-efficient inference for massive AI deployments; for instance, the GB200 configuration delivers up to 20 petaFLOPS in FP8 tensor operations, underscoring the scale for next-generation AI factories.[40][41]
RT Cores, which debuted in the Turing architecture, are dedicated hardware accelerators for real-time ray tracing, performing bounding volume hierarchy (BVH) traversals and ray-triangle intersections up to 10x faster than software implementations on CUDA cores alone. These cores integrate seamlessly with CUDA through the OptiX ray tracing API, which exposes a programmable pipeline for custom shaders while automatically dispatching ray tracing tasks to RT hardware; CUDA interoperability allows shared memory and data transfers between ray tracing and general compute kernels, enabling hybrid workflows in graphics and simulation applications. Direct access to RT Cores via pure CUDA kernels is not available, as their functionality is encapsulated within OptiX for optimized hardware utilization.[39][42]
Supporting these accelerators, the cuTENSOR library offers a high-performance CUDA interface for tensor primitives, including contractions (generalized matrix multiplies), reductions, and element-wise operations, all tuned to exploit Tensor Cores for maximal throughput across supported precisions like FP16, BF16, and INT8. This library abstracts complex indexing and layout transformations, allowing developers to achieve near-peak hardware performance without manual kernel optimization. In Blackwell-based systems, cuTENSOR enables tensor operations at scales exceeding 1,000 TFLOPS in mixed-precision modes, establishing a foundation for efficient deployment of large-scale tensor computations in AI and HPC.[43][44][41]
Advantages
CUDA leverages the massive parallelism inherent in NVIDIA GPUs, which feature thousands of cores designed for data-parallel tasks. This architecture enables significant performance gains over traditional CPU computing for operations like matrix multiplication, where GPUs can achieve 10-100x speedups compared to multi-core CPUs for large-scale computations. For instance, in benchmarks with matrices up to 4096x4096, CUDA implementations deliver up to 45x speedup over parallel CPU versions and over 500x versus sequential CPU execution, highlighting the scalability for high-throughput workloads.[45]
The CUDA ecosystem provides a rich set of optimized libraries that accelerate development and performance. Libraries such as cuFFT for fast Fourier transforms and cuSPARSE for sparse matrix operations offer GPU-accelerated implementations that significantly outperform CPU equivalents, allowing developers to integrate high-performance routines without building them from scratch and thereby reducing development time.[46] Furthermore, seamless integration with popular machine learning frameworks like TensorFlow and PyTorch enables effortless GPU acceleration for AI tasks, leveraging CUDA's backend for tensor operations and model training.[47]
CUDA's unified programming model simplifies development by allowing a single language, such as C++ or Fortran, to express both host (CPU) and device (GPU) code, and features like Unified Memory remove explicit data transfers in many cases. Tools like CUDA Graphs further optimize pipelines by capturing sequences of kernel launches into executable graphs, reducing CPU-GPU launch overhead and improving throughput for iterative or recurrent workloads. As of 2025, CUDA 13.0 introduces enhancements such as improved asynchronous memory operations and better multi-GPU scalability, further boosting performance in distributed environments.[24][5]
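A minimal sketch of stream capture into a CUDA graph follows; the kernels are hypothetical placeholders shown as comments, the wrapper name runWithGraph is an assumption, and the cudaGraphInstantiate call uses the three-argument signature of recent CUDA releases.
```cpp
#include <cuda_runtime.h>

// Record a fixed launch sequence once with stream capture, then replay it many times.
void runWithGraph(cudaStream_t stream) {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    // stepOne<<<grid, block, 0, stream>>>(d_data);   // hypothetical kernels are recorded,
    // stepTwo<<<grid, block, 0, stream>>>(d_data);   // not executed, during capture
    cudaStreamEndCapture(stream, &graph);

    cudaGraphInstantiate(&graphExec, graph, 0);       // build the executable graph once (CUDA 12+ form)
    for (int iter = 0; iter < 1000; ++iter) {
        cudaGraphLaunch(graphExec, stream);           // one cheap launch replays the whole sequence
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}
```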
In terms of energy efficiency, CUDA-enabled GPUs deliver high floating-point operations per watt (FLOPS/W), making them ideal for high-performance computing (HPC) and AI applications. GPU-accelerated systems can provide up to 5x better energy efficiency than CPU-only setups at equivalent performance levels, contributing to substantial power savings—over 40 terawatt-hours annually across global HPC and AI workloads.[48][49]
Limitations
CUDA's exclusive compatibility with NVIDIA hardware imposes significant vendor lock-in, restricting portability to non-NVIDIA GPUs and requiring code rewrites or alternative frameworks for cross-vendor deployment.[14] This specificity leverages NVIDIA's proprietary architecture, such as Streaming Multiprocessors, but precludes seamless integration with hardware from competitors like AMD or Intel without substantial modifications.[14]
Data transfer between host and device memory introduces notable overhead, primarily due to PCIe interconnect limitations, where peak bandwidth reaches up to 64 GB/s on PCIe 5.0 systems, though effective throughput for on-demand Unified Memory migrations can be lower due to page fault handling and latency, varying by hardware configuration.[50] While NVLink interconnects mitigate this by providing up to 1.8 TB/s bandwidth per GPU in current generations (NVLink 5.0) in supported systems, bottlenecks persist in PCIe-based configurations, particularly for large datasets requiring frequent host-device synchronization.[51][52]
The programming model demands explicit manual memory management, using functions like cudaMalloc() and cudaMemcpy() for device allocation and data transfer, which increases developer burden compared to automatic host-side handling and risks errors such as memory leaks if not properly deallocated.[14] Thread divergence in the Single Instruction, Multiple Threads (SIMT) execution model penalizes performance when threads within a warp (group of 32 threads) follow divergent control paths, as the hardware serializes execution for active lanes, reducing overall throughput.[14] This SIMT paradigm, while enabling massive parallelism, presents a steep learning curve for developers accustomed to scalar programming, necessitating careful kernel design to minimize divergence and optimize warp-level synchronization.[14]
Scalability in CUDA applications is constrained by Amdahl's law, which limits overall speedup to S = 1 / ((1 - P) + P/N), where P is the parallelizable fraction of the workload and N is the number of processors; even with P = 0.95, for example, the speedup can never exceed 20x regardless of how many GPU cores are available, so mixed workloads with substantial serial components that cannot be offloaded to the GPU see limited benefit.[53] In dense deployments, such as GPU clusters, high power consumption—often exceeding 10 kW per system as of 2025—and resultant heat generation necessitate advanced cooling solutions like liquid cooling to prevent thermal throttling and maintain performance.[54]
Real-World Applications
CUDA has been widely adopted in scientific computing for accelerating complex simulations, particularly in physics and environmental modeling. In molecular dynamics, the GROMACS software package leverages CUDA to perform high-performance simulations of biomolecular systems, enabling researchers to model the behavior of proteins and other macromolecules at scales involving millions of particles.[55] Similarly, CUDA accelerates weather prediction models such as the Weather Research and Forecasting (WRF) system, where GPU implementations achieve up to 10x speedups in numerical computations for atmospheric simulations.[56]
In artificial intelligence and machine learning, CUDA underpins training and inference for large-scale deep learning models through libraries like cuDNN, which provides optimized primitives for convolutional and recurrent neural networks.[57] For instance, models on the scale of ChatGPT rely on NVIDIA GPUs powered by CUDA for efficient processing of vast datasets during training and real-time inference.[58] Additionally, CUDA-enabled Python libraries such as CuPy offer drop-in replacements for NumPy arrays on GPUs, with recent enhancements supporting CUDA 12.x and 13.x for improved scalability in distributed computing environments as of 2025.[59]
The finance sector utilizes CUDA for high-throughput risk modeling and Monte Carlo simulations, which generate numerous probabilistic scenarios to assess portfolio risks and price derivatives.[60] These GPU-accelerated methods enable real-time evaluation of complex financial instruments, such as barrier options, by parallelizing path simulations across thousands of threads.[61]
In graphics and media production, CUDA facilitates hardware-accelerated video encoding via NVENC, an on-chip encoder integrated into NVIDIA GPUs that supports efficient compression for H.264, HEVC, and AV1 formats in streaming and broadcasting applications.[62] Beyond traditional graphics, CUDA enables real-time rendering in simulations and virtual environments, processing complex scene computations for interactive visualizations.[31]
Emerging applications of CUDA include drug discovery, where tools like AlphaFold2 employ GPU acceleration for predicting protein structures, speeding up the identification of potential therapeutic targets through multiple sequence alignments.[63] In autonomous vehicles, CUDA supports sensor fusion within the NVIDIA DRIVE platform, integrating data from cameras, LiDAR, and radar in real time to enable perception and decision-making for safe navigation.[64]
Comparisons
With OpenCL
OpenCL, developed by the Khronos Group as an open, royalty-free standard for parallel programming of heterogeneous systems, was first released in version 1.0 on December 8, 2008. This framework enables developers to target a wide range of devices, including GPUs from NVIDIA, AMD, and Intel, as well as CPUs, DSPs, and FPGAs, promoting cross-vendor portability without reliance on proprietary technologies.[65] In comparison, CUDA is NVIDIA's proprietary parallel computing platform and application programming interface (API), introduced in 2006 and exclusively optimized for NVIDIA GPUs, which restricts its use to a single vendor's hardware ecosystem.[31] This fundamental difference in standardization—OpenCL's vendor-agnostic openness versus CUDA's closed, hardware-specific model—has shaped their respective roles in GPU computing, with OpenCL emphasizing interoperability and CUDA prioritizing deep integration with NVIDIA architectures.
Performance-wise, CUDA generally delivers superior results on NVIDIA GPUs, often achieving 10-30% higher throughput in compute-heavy workloads compared to equivalent OpenCL implementations, owing to its direct access to hardware-specific features like optimized memory hierarchies and instruction sets.[66] For instance, benchmarks on NVIDIA GeForce GTX-260 hardware for Monte Carlo simulations showed CUDA outperforming OpenCL by 13% to 63% in kernel execution times, with the gap widening for larger problem sizes due to CUDA's streamlined data transfer and execution model.[67] OpenCL, while capable of comparable peak performance when tuned, tends to require more verbose code for similar tasks; developers must explicitly handle aspects like context creation, command queues, and event synchronization, which can introduce overhead and complicate optimization across devices.[68]
The ecosystem surrounding CUDA is notably more mature and cohesive, featuring a suite of high-performance, NVIDIA-optimized libraries such as cuBLAS for basic linear algebra subprograms (BLAS) and cuDNN for deep neural network primitives, which accelerate development in domains like machine learning and scientific computing. These libraries are tightly integrated with CUDA's runtime, enabling seamless scaling and reducing the need for low-level tuning. In contrast, OpenCL's ecosystem is more fragmented, as implementations are provided by multiple vendors (e.g., NVIDIA's OpenCL driver, AMD's ROCm OpenCL, Intel's oneAPI), leading to variations in feature support, driver quality, and optimization levels that can hinder portability and reliability.[69] This vendor-specific divergence often results in developers facing inconsistent behaviors, such as differing extension availability or suboptimal code generation, despite OpenCL's standardized core API.[70]
Adoption patterns reflect these strengths: CUDA has become the de facto standard in artificial intelligence (AI) and high-performance computing (HPC), used by millions of developers (more than 6 million registered in the NVIDIA Developer Program as of 2024) and backing the majority of deep learning frameworks like TensorFlow and PyTorch due to its performance edge and extensive tooling.[71][72] For example, the 2012 AlexNet breakthrough in image recognition relied on CUDA-accelerated GPUs, solidifying its dominance in AI training and inference pipelines.[73] OpenCL, however, sees stronger uptake in embedded systems and multi-vendor scenarios, such as mobile devices, automotive computing, and heterogeneous edge deployments, where its portability allows code to run across diverse hardware without vendor lock-in.[74] Conformant implementations from vendors like Arm and Imagination Technologies further bolster its role in resource-constrained, cross-platform applications.[65]
With Intel oneAPI
Intel's oneAPI, introduced in 2020, provides a unified programming model for heterogeneous computing across CPUs, GPUs, and FPGAs, leveraging Data Parallel C++ (DPC++) based on the SYCL standard to enable single-source code development for diverse accelerator architectures.[75] In contrast, CUDA remains focused exclusively on NVIDIA GPUs, offering a proprietary platform optimized for parallel computing on those devices without native support for other hardware types like CPUs or FPGAs.[76] This broader scope in oneAPI facilitates application development that can target multiple Intel architectures seamlessly, while CUDA's GPU-centric design excels in scenarios limited to NVIDIA hardware.[77]
oneAPI emphasizes portability through its adherence to open standards like SYCL, aiming for vendor-neutral code that can run across various hardware without vendor lock-in, including support for NVIDIA GPUs via interoperability extensions.[75] CUDA, however, is inherently tied to NVIDIA ecosystems, requiring code rewrites or specialized tools for migration to other platforms, though it delivers superior performance on NVIDIA hardware due to deep optimizations.[78] This standards-based approach in oneAPI promotes long-term flexibility for developers working in high-performance computing (HPC) and AI, reducing dependency on a single vendor.[79]
In terms of libraries, oneAPI includes components like oneDPL for parallel algorithms and oneMKL for mathematical kernels, which provide SYCL-based interfaces comparable to CUDA's cuBLAS for linear algebra operations, with both supporting GPU acceleration.[80] These libraries are built on the Unified Acceleration eXecution (UXL) Foundation standards to enhance interoperability, allowing SYCL code to invoke CUDA APIs directly when needed for hybrid workflows.[81] For instance, oneMKL's SYCL backend can replace cuBLAS calls in migrated applications, maintaining functionality while enabling cross-architecture execution.[82]
CUDA's maturity, with over 18 years of development since its 2006 launch, has fostered a robust ecosystem particularly dominant in deep learning frameworks like TensorFlow and PyTorch, where NVIDIA-optimized tools drive widespread adoption.[75] oneAPI's ecosystem, while rapidly evolving since 2020, remains newer and less entrenched in AI workflows, though migration tools like the DPC++ Compatibility Tool aid transitions from CUDA codebases.[83] This experience gap contributes to CUDA's edge in production-scale deep learning deployments, even as oneAPI gains traction for its inclusive hardware support.[78]
With AMD ROCm
ROCm, introduced by AMD in 2016, is a fully open-source software platform designed for GPU-accelerated computing on AMD hardware, in contrast to CUDA, which relies on proprietary binary drivers from NVIDIA. While CUDA remains largely proprietary, NVIDIA has open-sourced some GPU kernel modules since 2022 and added RISC-V support in 2025, narrowing the openness gap slightly.[84][85] This open-source approach enables greater community contributions and customization in ROCm, while CUDA's closed ecosystem provides optimized, vendor-controlled performance but limits direct modifications. A key feature of ROCm is its support for the Heterogeneous-compute Interface for Portability (HIP), which facilitates the migration of CUDA code to AMD GPUs by mapping CUDA APIs to HIP equivalents, allowing developers to port applications with minimal changes and achieve comparable performance on both NVIDIA and AMD platforms.[86]
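The flavor of such a port can be conveyed with a small sketch (illustrative only, not the output of an actual hipify run; the kernel and function names are hypothetical): the kernel body and triple-chevron launch are unchanged, while runtime calls follow the usual cuda* to hip* renaming.
```cpp
#include <hip/hip_runtime.h>

// Kernel code is unchanged by the port; only the runtime API prefix differs.
__global__ void scaleVector(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

void scaleOnGpu(float *h_x, int n) {
    float *d_x = NULL;
    size_t bytes = n * sizeof(float);
    hipMalloc((void **)&d_x, bytes);                        // was: cudaMalloc
    hipMemcpy(d_x, h_x, bytes, hipMemcpyHostToDevice);      // was: cudaMemcpy(..., cudaMemcpyHostToDevice)
    scaleVector<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);    // triple-chevron launches are supported by hipcc
    hipMemcpy(h_x, d_x, bytes, hipMemcpyDeviceToHost);
    hipFree(d_x);                                           // was: cudaFree
}
```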
In terms of hardware support, ROCm is tailored primarily to AMD's CDNA architecture and Instinct MI-series GPUs, such as the MI200, MI300, and MI350 series, which are optimized for data center and high-performance computing workloads.[87] CUDA, however, supports a broader spectrum of NVIDIA GPUs across consumer, professional, and data center segments, including architectures from Pascal to Hopper and beyond, offering wider accessibility for diverse applications. Feature-wise, ROCm includes libraries like MIOpen, which serves as an analog to NVIDIA's cuDNN for deep neural network primitives such as convolutions, though it trails in maturity and optimization within the AI ecosystem, where CUDA benefits from extensive third-party integrations. Additionally, CUDA leverages NVIDIA's Tensor Cores for accelerated mixed-precision matrix operations in AI training and inference, while ROCm utilizes AMD's Matrix Cores in CDNA GPUs to perform similar tensor computations, albeit with differences in precision support and throughput efficiency.[88]
Adoption patterns highlight CUDA's dominant position in machine learning, where NVIDIA holds over 90% market share as of 2024 due to its mature ecosystem and ease of use in frameworks like TensorFlow and PyTorch.[89] In contrast, ROCm is gaining traction in high-performance computing (HPC), powering systems like the Frontier supercomputer at Oak Ridge National Laboratory (the first exascale system) and El Capitan (the current fastest as of November 2025), both using AMD Instinct GPUs and ROCm to achieve exascale performance in scientific simulations.[90][91] This positions ROCm as a strong contender in HPC environments seeking open-source alternatives, though it continues to address gaps in broader AI developer adoption.