
Nvidia CUDA Compiler

The Nvidia CUDA Compiler, commonly referred to as NVCC, is a proprietary compiler driver developed by Nvidia for processing CUDA C++ source code, enabling developers to create high-performance applications that leverage the parallel processing capabilities of Nvidia GPUs. Released in June 2007 as part of the initial CUDA Toolkit 1.0, NVCC simplifies GPU programming by automatically separating and compiling host code (executed on the CPU) using standard C++ compilers such as GCC or Microsoft Visual C++, while transforming device code (intended for GPU execution) into intermediate PTX assembly or final SASS binaries tailored to specific GPU architectures. This dual-compilation approach hides the complexities of heterogeneous compilation, allowing seamless integration of GPU acceleration into conventional C++ programs without requiring developers to manage low-level GPU details.

NVCC operates through multiple phases, including preprocessing, host and device compilation, and linking, producing a "fatbinary" that embeds GPU code within the host executable for runtime loading via the CUDA runtime API. It supports targeting virtual architectures (e.g., compute_75) for forward compatibility and real architectures (e.g., sm_80 for Ampere GPUs), with options for optimization levels, debugging, and resource usage reporting to aid performance tuning. Since CUDA 5.0 in 2012, NVCC has enabled separate compilation of device code, facilitating modular development and link-time optimizations through tools like the nvlink device linker, which enhances code reuse across projects and supports advanced features such as dynamic parallelism. NVCC supports integration with modern host compilers, including LLVM-based ones like Clang. Runtime link-time optimization is provided via the nvJitLink library since CUDA 12.0 (December 2022). As of CUDA 13.0 (August 2025), NVCC supports architectures up to Blackwell, with the current release being CUDA Toolkit 13.0 Update 2 (October 2025).

Overview

Purpose and Functionality

The CUDA Compiler Driver, known as nvcc, is a proprietary tool developed by Nvidia for compiling CUDA applications, which extend C/C++ with GPU-specific programming constructs. It serves as the primary interface for developers to build programs that leverage Nvidia GPUs for parallel computing, abstracting the underlying complexities of heterogeneous CPU-GPU execution. At its core, nvcc facilitates the compilation of mixed host-device code by separating the host code, intended for execution on the CPU, from the device code, which runs on the GPU. The host code is processed using a standard C++ compiler (such as g++ or cl.exe) to produce conventional object files, while the device code undergoes compilation via Nvidia's specialized front-end and back-end tools, resulting in embeddable GPU binaries. This separation enables seamless integration of CPU and GPU workloads within a single application, allowing developers to focus on algorithm design without managing low-level GPU assembly details.

nvcc generates several output formats to support flexible deployment and runtime compatibility. For GPU binaries, it produces PTX (Parallel Thread Execution) as an intermediate assembly representation for virtual architectures, or CUBIN files as executable binaries tailored to specific GPU compute capabilities. These are often bundled into fatbinaries, which encapsulate multiple PTX or CUBIN variants for broad hardware support, embedded directly into the host object files for whole-program execution. Additionally, nvcc supports relocatable device code, which allows separate compilation of device functions into linkable objects, enabling modular builds and link-time optimizations before final embedding. By handling these processes, nvcc hides the intricacies of GPU-specific compilation, empowering developers to write single-program, multiple-data (SPMD) parallel code that executes efficiently across thousands of GPU threads on Nvidia hardware.
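
The following minimal sketch illustrates the host/device split that nvcc manages; the file name, kernel, and sizes are illustrative rather than drawn from NVIDIA's documentation.

```cuda
// vector_add.cu — compile with, e.g.: nvcc vector_add.cu -o vector_add
#include <cstdio>
#include <cuda_runtime.h>

// Device code: nvcc's internal toolchain lowers this to PTX and/or SASS.
__global__ void addVectors(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host code: forwarded by nvcc to the system C++ compiler (g++, clang++, or cl.exe).
int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // The <<<...>>> launch syntax is a CUDA extension translated by nvcc.
    addVectors<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expected: 3.000000
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```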

Role in CUDA Ecosystem

The CUDA Compiler Driver, known as nvcc, serves as the primary entry point within the CUDA Toolkit for compiling source files with the .cu extension, which contain both host and device code. By processing these files, nvcc orchestrates the separation and compilation of CPU-executable code and GPU-executable code, ultimately producing binaries that integrate seamlessly with the CUDA runtime to enable kernel launches and GPU execution. This positioning allows developers to leverage GPU acceleration without managing low-level compilation intricacies, forming a foundational component of the broader CUDA ecosystem.

nvcc interacts closely with host compilers such as GCC on Linux or MSVC on Windows to handle the non-CUDA portions of the code, forwarding these compilation steps while embedding device code into the resulting host objects. Its outputs are designed for compatibility with essential libraries like cudart, which manage memory allocation, kernel dispatching, and synchronization, as well as the underlying GPU drivers that provide direct GPU access. These interactions ensure that compiled applications can link against CUDA libraries and execute across supported hardware without compatibility mismatches.

Within the CUDA ecosystem, nvcc enhances portability by generating Parallel Thread Execution (PTX) code as an intermediate representation, allowing code to run on diverse GPU architectures through just-in-time (JIT) compilation at runtime. This mechanism adapts binaries to the target device's compute capability, optimizing performance while accommodating variations in hardware features across generations. Additionally, nvcc supports dynamic optimization by enabling the runtime to JIT-compile PTX code, which benefits from ongoing compiler improvements without requiring recompilation. A distinctive aspect of nvcc's output is the fatbinary format, which bundles multiple compiled variants, including PTX and architecture-specific binaries, into a single file, permitting the CUDA runtime to select the most appropriate GPU code based on the GPU detected at execution time. This approach ensures forward compatibility, as applications compiled with earlier toolkit versions can leverage PTX for JIT compilation on newer GPUs, maintaining functionality across evolving architectures.
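
As a sketch of how fatbinary selection relates to the hardware, the program below queries the device's compute capability at runtime; the compile line in the comment, embedding both SASS and PTX, is one common arrangement, and the file name is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Example compile line embedding sm_80 SASS plus compute_80 PTX in one fatbinary:
//   nvcc -gencode arch=compute_80,code=sm_80 \
//        -gencode arch=compute_80,code=compute_80 query.cu -o query
// On an sm_80 GPU the runtime loads the SASS directly; on newer GPUs it
// JIT-compiles the embedded PTX for the detected architecture.
int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}
```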

History

Initial Development

The Nvidia CUDA Compiler, commonly referred to as nvcc, originated as a core component of Nvidia's CUDA project, initiated in the mid-2000s to pioneer general-purpose computing on graphics processing units (GPUs) and transcend the confines of graphics shader workloads. CUDA was first announced in November 2006. This effort addressed the growing demand for harnessing GPU parallelism in scientific and computational applications, where traditional CPU-based processing fell short in scalability. Prior to CUDA, general-purpose GPU (GPGPU) programming was hampered by reliance on graphics-oriented frameworks, such as programmable shaders or the Brook stream programming language developed at Stanford University around 2003. These approaches necessitated mapping non-graphics algorithms onto fixed graphics pipelines, imposing constraints like limited data access patterns (e.g., gather-only operations in Brook), restricted integer arithmetic, and inefficient memory access tied to texture units. By contrast, nvcc enabled a more intuitive C-based programming model, granting developers direct control over GPU threads, memory hierarchies, and parallel execution models, thereby streamlining the development of high-performance parallel code without graphics reformulation.

The initial public release of nvcc occurred in June 2007, bundled with CUDA Toolkit 1.0, which represented the first stable platform for compiling and deploying CUDA C extensions on compatible GPUs. This launch coincided with the broader debut of Nvidia's CUDA ecosystem, providing essential tools for hybrid CPU-GPU programming. Developed in-house at Nvidia as proprietary software, nvcc leveraged a customized variant of the open-source Open64 compiler for parsing, analyzing, and generating intermediate representations of device code, ensuring compatibility with GPU-specific optimizations from the outset.

Major Milestones and Releases

The CUDA compiler, nvcc, was first released in June 2007 as part of the initial CUDA Toolkit 1.0, enabling the compilation of CUDA C code for GPU acceleration. A significant advancement came with CUDA Toolkit 5.0 in October 2012, which introduced support for relocatable device code through the --relocatable-device-code (or -rdc=true) flag, allowing separate compilation and modular linking of device code objects without requiring full recompilation for each module. This feature marked a shift toward more flexible build processes for large-scale applications. In 2016, with CUDA Toolkit 8.0, nvcc gained improved compatibility with modern C++ dialects on the host side.

CUDA Toolkit 11.0, released in May 2020, began full support for the C++17 dialect in nvcc and upgraded the compiler's LLVM-based infrastructure for improved C++ dialect support and greater compatibility with open-source tools, enhancing the compiler's ability to handle modern host code while maintaining CUDA-specific extensions. This architectural change laid the groundwork for better integration with LLVM-based ecosystem tools. Building on this, ongoing enhancements for cross-compilation have been integrated, particularly for Arm platforms, with CUDA 11.7 Update 1 providing ready-to-use cross-compilation packages for Arm64 SBSA targets.

CUDA Toolkit 12.0 in December 2022 introduced full support for the C++20 dialect in nvcc, including features such as concepts for both host and device code (modules are not supported), and link-time optimization (LTO) via the new nvJitLink library, supporting just-in-time LTO for relocatable device code and offering performance improvements of up to 27% in offline scenarios and 3x for specific kernels like those in cuFFT. The latest major release, CUDA Toolkit 13.0 in August 2025, further extended nvcc capabilities with enhanced PTX support for the Hopper (compute capability 9.0) and Blackwell (compute capability 10.0) architectures, including optimizations for Tensor Cores and unified memory on platforms like Jetson Thor, while unifying the toolkit for Arm64 SBSA environments to streamline cross-platform development.

Compilation Process

Code Separation and Preprocessing

The NVIDIA CUDA compiler driver, nvcc, primarily accepts input files in the .cu format, which integrate conventional C++ host code executable on the CPU with GPU device functions annotated using CUDA-specific qualifiers such as __global__, __device__, and __host__. These files may also incorporate preprocessor directives, enabling features like macro definitions and conditional inclusions that apply across both host and device contexts. In the initial workflow stage, nvcc parses the source code to identify and separate host and device components based on these qualifiers: host code, intended for CPU execution, is extracted and forwarded to an external compiler such as GCC/G++ on Linux, Clang on macOS, or Microsoft Visual C++ (MSVC) on Windows; device code, marked for GPU execution, is retained for internal processing by NVIDIA's device-side toolchain. This separation ensures that CPU-specific constructs are handled appropriately while isolating GPU kernels and device definitions.

Preprocessing occurs before separation and is managed through a C-style preprocessor that resolves directives such as #include for essential headers like cuda_runtime.h, which provides runtime declarations accessible to both host and device code. Additionally, nvcc supports macro definitions via the -D flag and conditional compilation constructs (e.g., #ifdef or #if based on architecture-specific defines like __CUDA_ARCH__), allowing developers to tailor code for particular GPU compute capabilities without duplicating files. Include paths can be specified using the -I option to locate custom or system headers.

By default, nvcc employs whole-program mode, in which the device code from a single .cu file is compiled and embedded directly as a fatbinary image, a container for GPU binaries, into the corresponding host object file, facilitating seamless linkage during the final build. For scenarios requiring inspection of the preprocessed source, the --preprocess (or -E) option directs nvcc to output the expanded source to standard output or a file, bypassing further compilation stages.
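
A short sketch of architecture-conditional device code using __CUDA_ARCH__, with the preprocessing-related flags from above shown as comments; the function, threshold, and file names are illustrative.

```cuda
#include <cuda_runtime.h>

// __CUDA_ARCH__ is defined only during device-side compilation passes,
// with a value such as 800 when targeting compute capability 8.0.
__device__ float scaled(float x) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
    // Path compiled only when targeting compute capability 8.0 or newer.
    return __fmaf_rn(x, 2.0f, 1.0f);
#else
    // Fallback for older device targets; ignored during host compilation.
    return x * 2.0f + 1.0f;
#endif
}

__global__ void apply(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scaled(data[i]);
}

// Inspect the expanded source without compiling further, with a macro and
// include path supplied on the command line:
//   nvcc -E -DUSE_FAST_PATH -I./include kernels.cu
```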

Device Code Generation

The device code generation phase in the NVIDIA CUDA Compiler (NVCC) transforms CUDA device code, separated during preprocessing, into executable formats suitable for GPU execution. This process begins with compilation to Parallel Thread Execution (PTX), a low-level virtual instruction set architecture (ISA) that provides a portable intermediate representation independent of specific GPU hardware. PTX is generated as text-based assembly, enabling cross-architecture compatibility and serving as input for just-in-time (JIT) compilation at runtime. For instance, the compilation option -gencode arch=compute_80,code=sm_80 directs NVCC to target the compute capability 8.0 virtual architecture while producing code for the real sm_80 architecture.

From PTX, NVCC invokes the PTX assembler (ptxas) to generate Streaming Multiprocessor Assembly (SASS), the native binary format for specific GPU architectures. This ahead-of-time (AOT) compilation produces cubin files, which are relocatable object binaries optimized for particular streaming multiprocessor (SM) versions, such as sm_80 for Ampere GPUs. In contrast, retaining PTX allows the CUDA driver to perform JIT compilation to SASS at runtime, adapting to the executing GPU's architecture without pre-generating binaries for every possible target. This dual approach balances portability and performance, with AOT preferred for production deployments targeting known hardware.

In separate compilation mode, enabled by --relocatable-device-code=true (or -dc), NVCC compiles individual device code units to relocatable cubin objects, which are then linked using the nvlink device linker via the --device-link option to form a single executable device image. This linked image is subsequently embedded into the host executable during the final host-device linking step. For multi-architecture support, NVCC allows specification of multiple targets in a single invocation, such as --gpu-architecture=compute_75 --gpu-code=sm_75,sm_86, generating code for both Turing (sm_75) and Ampere (sm_86) architectures. The resulting outputs are packaged into a fatbinary, a container format that bundles multiple PTX and cubin images, enabling runtime selection of the appropriate variant based on the GPU's compute capability.
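
The command lines below sketch both the multi-architecture fatbinary and separate-compilation workflows described above; the kernel and file names are hypothetical, and exact flag spellings may vary by toolkit version.

```cuda
// kernel.cu — an illustrative unit for multi-architecture builds.
__global__ void scale(float* v, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

// Ahead-of-time SASS for Turing and Ampere plus compute_75 PTX for JIT on
// newer GPUs, bundled into one fatbinary inside the object file:
//   nvcc --gpu-architecture=compute_75 --gpu-code=sm_75,sm_86,compute_75 \
//        -c kernel.cu -o kernel.o
//
// Separate-compilation variant: relocatable device code, then a device link:
//   nvcc -dc kernel.cu -o kernel.o
//   nvcc -dlink kernel.o -o device_link.o
```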

Internal Architecture

Front-End Components

The front-end of NVCC employs the Edison Design Group (EDG) C++ front-end to perform lexical analysis and parsing of source code, tokenizing standard C++ elements alongside CUDA-specific extensions such as intrinsics like __threadfence() and execution space attributes like __device__, __host__, and __global__. This lexer and parser process mixed host and device code in .cu files, constructing an abstract syntax tree that captures CUDA constructs for further analysis while forwarding host code to an external C++ compiler after separation. Semantic analysis follows parsing, conducting type checking to validate device and host qualifiers, ensuring that functions, variables, and types are correctly specified for their intended execution spaces and compatible across host-device boundaries. This phase enforces key restrictions, such as prohibiting recursion in device functions and kernels on architectures without dynamic call stacks, conforming to the GPU's SIMT execution model. The analysis also verifies semantic correctness of CUDA extensions, like memory fence operations and thread synchronization primitives, preventing invalid usages that could lead to undefined behavior on the device.

The front-end includes a dedicated preprocessor that expands CUDA-specific macros, such as __CUDA_ARCH__ for conditional compilation based on target GPU compute capabilities, alongside standard C++ preprocessing directives. For intermediate representation, the semantic analyzer generates LLVM IR, specifically NVIDIA's NVVM variant, for device code, which serves as a platform-independent form prior to PTX back-end processing and supports whole-program analysis in non-separate compilation mode for enhanced cross-module optimizations. This EDG-based front-end, retained from early NVCC versions that paired it with an Open64-derived device code generator, has been integrated with the LLVM-based back-end since CUDA 4.1 in 2011, enabling robust support for modern C++ dialects up to C++20 in both host and device code.
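
A brief sketch of the execution-space rules the front-end enforces; the names are illustrative. A __host__ __device__ function is compiled once per execution space, while the commented-out call shows the kind of cross-space violation that semantic analysis rejects.

```cuda
#include <cstdio>

// Compiled for both spaces; the front-end type-checks each instantiation.
__host__ __device__ inline float clamp01(float x) {
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

void hostOnly() { printf("host\n"); }  // implicitly __host__ only

__global__ void kernel(float* v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = clamp01(v[i]);
    // hostOnly();  // error: cannot call a __host__ function from a __global__ function
}
```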

Back-End and Optimization

The back-end of the CUDA compiler (nvcc) processes device code starting from NVVM IR (NVIDIA's variant of LLVM IR), leveraging the NVPTX back-end within the LLVM framework to generate Parallel Thread Execution (PTX) assembly, a virtual instruction set architecture (ISA) designed for NVIDIA GPUs. This conversion maps a subset of LLVM IR constructs, such as address spaces for global, shared, and local memory, to corresponding PTX directives and instructions, ensuring compatibility with GPU programming models like kernel launches and thread synchronization. The NVPTX back-end supports optimizations at this stage, including inlining and constant propagation, but defers more architecture-specific transformations to subsequent phases.

Following PTX generation, nvcc invokes the proprietary PTX assembler (ptxas) as the core optimizer for device code, which applies advanced passes to produce Streaming Multiprocessor Assembly (SASS), the native, hardware-specific executable format for target GPU architectures (e.g., sm_80 for Ampere GPUs). Ptxas performs critical optimizations such as register allocation, which assigns virtual registers from PTX to the limited physical registers available per thread (typically a maximum of 255 per thread on modern architectures), and instruction scheduling, which reorders operations to minimize latency from dependencies and memory accesses while respecting GPU execution models like warp scheduling. These passes aim to maximize occupancy and throughput; for instance, aggressive register allocation can spill excess register usage to local memory, trading faster register access for increased parallelism across threads.

Optimization levels in nvcc control the aggressiveness of these transformations, with flags like -O3 enabling comprehensive device code enhancements such as function inlining, loop unrolling, and dead code elimination to reduce instruction count and improve execution efficiency. Device-specific options further tune performance; for example, --maxrregcount limits the maximum registers per thread in a kernel (e.g., to 32), forcing the compiler to prioritize higher thread occupancy over per-thread speed, which is particularly useful for memory-bound workloads on GPUs with constrained register files. Ptxas options can be forwarded via -Xptxas, with ptxas defaulting to optimization level 3 for balanced performance.

Since CUDA 11.2, nvcc supports link-time optimization (LTO) for device code during offline compilation, enabled via -dlto, which preserves intermediate representations across compilation units to enable cross-module inlining and global analyses that were previously limited in separate compilation scenarios. This feature, building on a preview in CUDA 11.0, allows optimizations like eliminating redundant computations between kernels from different object files. CUDA 12.0 introduced runtime LTO support via the nvJitLink library for just-in-time (JIT) scenarios, which links LTO fragments at execution time, supporting inputs from nvcc or the NVIDIA Runtime Compilation (NVRTC) library, to dynamically inline library and application code, yielding up to 3x speedups in selective kernel finalization as demonstrated in libraries like cuFFT. In CUDA 13.0, NVCC introduced changes to ELF visibility and linkage for device code, improving binary compatibility and reducing relocation overhead in relocatable device code.
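
The sketch below shows how these tuning knobs are commonly applied; the kernel is illustrative, the register cap of 32 is an arbitrary example, and actual occupancy effects depend on the target GPU.

```cuda
// tune.cu — a register-hungry kernel used to illustrate tuning flags.
__global__ void axpyUnrolled(const float* x, float* y, float a, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    // Manual unrolling raises register pressure, which ptxas must allocate.
    if (i + 3 < n) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}

// Cap registers per thread at 32 to raise occupancy (may introduce spills):
//   nvcc -O3 --maxrregcount=32 -c tune.cu
// Report per-kernel register and memory usage from ptxas:
//   nvcc -O3 -Xptxas -v -c tune.cu
// Offline device LTO across separately compiled units (CUDA 11.2+):
//   nvcc -dc -dlto tune.cu other.cu
//   nvcc -dlto -dlink tune.o other.o -o device_link.o
```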

Features

Language Support and Extensions

The NVIDIA CUDA compiler, nvcc, primarily supports CUDA C/C++ as an extension of the ISO C++ standard, encompassing dialects up to C++20 as of CUDA 13.0 for both host and device code, with certain restrictions on device-side features such as dynamic polymorphism. Limited Fortran support is available indirectly through wrappers and interoperability with CUDA Fortran via the NVIDIA HPC SDK's nvfortran compiler, allowing Fortran applications to invoke CUDA kernels compiled by nvcc without direct compilation of Fortran source by nvcc itself.

CUDA C/C++ introduces a heterogeneous execution model centered on kernels, functions executed across thousands of threads on the GPU and denoted by the __global__ execution qualifier, enabling asynchronous launches from host code via the <<<...>>> syntax. Memory management is enhanced with qualifiers such as __shared__ for fast on-chip shared memory accessible within a thread block, __constant__ for read-only data broadcast to all threads, and __device__ for GPU-resident variables, facilitating efficient data locality and access patterns in parallel computations. Built-in intrinsics provide low-level control, exemplified by __syncthreads(), which enforces barrier synchronization across threads in a block to coordinate shared memory accesses and computations.

Advanced language features include support for C++ templates in device code, allowing generic kernel implementations that can be instantiated for different data types at compile time to promote code reuse and type safety. Lambda expressions have been supported in device code since CUDA 7.0 (released in 2015), enabling concise function definitions for callbacks and reductions within kernels. The Cooperative Groups API, introduced in CUDA 9.0, extends synchronization beyond basic block-level barriers by allowing developers to define and manage arbitrary groups of threads, such as tiles, clusters, or entire grids, for collective operations like synchronization and communication, improving expressiveness in complex parallel algorithms. A notable capability is the honoring of extern "C" linkage in device code starting with CUDA 11.0, which facilitates name-mangling avoidance and interoperability with C-based libraries or external tools in GPU contexts. Additionally, nvcc supports inline PTX assembly through .ptx files or embedded asm() statements, permitting direct insertion of low-level Parallel Thread Execution (PTX) instructions for fine-grained optimization where high-level constructs are insufficient.
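
The following sketch combines several of these extensions: a templated kernel, __shared__ staging, __syncthreads() barriers, and an embedded asm() statement; the block size, names, and reduction pattern are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Templated device code: instantiated per element type at compile time.
template <typename T, int BLOCK>
__global__ void blockSum(const T* in, T* out, int n) {
    __shared__ T tile[BLOCK];                 // fast on-chip shared memory
    int i = blockIdx.x * BLOCK + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : T(0);
    __syncthreads();                          // barrier across the block

    // Tree reduction in shared memory.
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        // Inline PTX: a global memory fence, equivalent to __threadfence().
        asm volatile("membar.gl;");
        out[blockIdx.x] = tile[0];
    }
}

int main() {
    const int n = 1024, block = 256;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, (n / block) * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    blockSum<float, block><<<n / block, block>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("partial sum 0 = %f\n", out[0]);   // expected: 256.000000
    cudaFree(in); cudaFree(out);
    return 0;
}
```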

Compilation Options and Tools

The NVIDIA CUDA compiler driver, nvcc, provides options for targeting specific GPU architectures to generate compatible code for various NVIDIA hardware generations. The --gpu-architecture flag (also known as -arch) specifies the virtual architecture for generating Parallel Thread Execution (PTX) intermediate code, using compute capability values such as compute_100 for the Blackwell architecture, with supported targets historically ranging from compute capability 1.0 (Tesla) to 10.0 and beyond in recent CUDA Toolkit versions. In contrast, the --gpu-code flag directs the generation of real architecture binaries in SASS format for specific hardware, denoted as sm_XX, enabling JIT compilation of PTX on devices with higher capabilities than the targeted binary. This dual approach allows forward compatibility, where PTX for a lower compute capability can be assembled on-device for newer GPUs.

For debugging CUDA applications, nvcc includes flags that embed symbolic information and source mapping without fully compromising performance. The --device-debug flag (equivalent to -G) generates full debug information for device code, including variables and line numbers, but it disables most optimizations to preserve accurate debugging, making it suitable for development but not production. Complementing this, the --generate-line-info flag (or -lineinfo) adds only line-number information to the generated code, allowing source-level debugging and profiling in tools like NVIDIA Nsight without altering the optimization level, which is recommended for performance analysis. These options integrate with debuggers such as cuda-gdb to facilitate breakpoint setting and variable inspection on the GPU.

Optimization flags in nvcc balance code performance and development needs, primarily affecting device code compilation. The -G flag enforces debug mode by setting the optimization level to -O0, suppressing inlining, loop unrolling, and other transformations to ensure predictable behavior during debugging. For release builds, -O2 or -O3 enable aggressive optimizations, including inlining and loop unrolling, with -O3 providing the highest level for maximum throughput on supported architectures. Additionally, the --use_fast_math flag activates approximations in floating-point operations, such as fused multiply-add (FMAD) contraction and reduced-precision division and square roots, which can improve performance in compute-intensive kernels at the cost of slight accuracy loss, particularly beneficial for non-critical numerical computations.

nvcc integrates auxiliary tools to handle low-level assembly and analysis of CUDA binaries. The PTX assembler, ptxas, is automatically invoked during the compilation pipeline to convert PTX to SASS, with user-configurable options passed via --ptxas-options (or -Xptxas) for fine-tuning aspects like register usage limits or instruction scheduling. For binary inspection, nvdisasm serves as the disassembler, converting cubin or fatbin files back to readable SASS, aiding in performance tuning and verification of compiler output. Since CUDA Toolkit 12.8, nvcc supports compression modes through the --compress-mode flag (options: default, size, speed, balance, none), which applies binary compression to generated artifacts like PTX and cubin files, reducing object sizes, for example by about 17% in default mode for certain libraries, while maintaining compatibility with loaders. In CUDA 13.0, NVCC introduced changes to ELF visibility and linkage for global functions to improve compatibility and reduce symbol visibility issues.
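
A sketch of typical debug, profiling, and release invocations for a single file; the file name is hypothetical and the flag combinations are examples rather than prescribed defaults.

```cuda
// saxpy.cu — used to illustrate debug and optimization flags.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];  // eligible for FMAD contraction
}

// Debug build: full device debug info, optimizations disabled (-G implies -O0):
//   nvcc -G -g saxpy.cu -o saxpy_debug
// Profiling build: line info for Nsight/cuda-gdb without changing optimization:
//   nvcc -O3 -lineinfo saxpy.cu -o saxpy_profile
// Fast-math release build, forwarding verbose resource reporting to ptxas:
//   nvcc -O3 --use_fast_math -Xptxas -v saxpy.cu -o saxpy
// Disassemble embedded SASS for inspection (nvdisasm works on extracted .cubin):
//   cuobjdump -sass saxpy
```
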
A key capability of nvcc is its support for cross-compilation, enabling development on x86 hosts for non-x86 targets such as ARM-based systems. This is facilitated by flags like --target-os=linux and --target-arch=aarch64, which configure the compiler for platforms like Jetson devices, ensuring the host code is built with compatible toolchains (e.g., cross toolchains for AArch64) while generating device code for the target GPU architecture. Cross-compilation requires installing target-specific libraries, such as cuda-cross-aarch64 packages, and verifies compatibility through the same targeting options, allowing seamless deployment to embedded environments without native build hardware.

Usage and Integration

Command-Line Operation

The NVIDIA CUDA compiler driver, nvcc, operates from the command line with the basic syntax nvcc [options] file.cu -o output, where file.cu represents one or more input source files and -o specifies the output executable name. This invocation compiles both host and device code, embedding the device code into the host object, and by default links against the static CUDA runtime library (cudart) for runtime support. Options are prefixed with a single hyphen for short forms (e.g., -o) or double hyphens for long forms (e.g., --output-file), and they control aspects like architecture targeting, optimization levels, and preprocessing.

A typical workflow for compiling a single-file CUDA program targets a specific GPU architecture, as in nvcc -arch=sm_80 example.cu -o example, which generates code for compute capability 8.0 (corresponding to Ampere GPUs) and produces an executable named example. The -arch option, detailed further in the compilation options documentation, specifies both the virtual architecture used for PTX generation and the real architecture used for ahead-of-time binary generation.

Output management is handled through dedicated flags: --output-file (or -o) sets the name and path for the primary output file, overriding the default a.out for executables. To preserve intermediate artifacts for inspection, such as PTX assembly in .ptx files or device binaries in .cubin and fatbinary files, the --keep option retains these files in the current directory instead of deleting them after compilation.

For error diagnosis and validation, nvcc provides utilities like --verbose (or -v), which logs detailed information on each compilation phase, including invoked sub-tools and file transformations. The --resource-usage flag generates a report on kernel resource consumption, detailing metrics such as register usage per thread, local memory allocation, and shared memory footprint to identify potential bottlenecks. Additionally, --dry-run simulates the entire compilation process by listing all sub-commands and options without executing them, aiding in option validation and workflow testing. Version information is accessible via nvcc --version (or -V), which reports the compiler release, such as CUDA 13.0 in the toolkit version current as of late 2025.
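
The example below gathers the invocations discussed in this section around a trivial program; the file name is hypothetical.

```cuda
// example.cu — minimal program for exercising common nvcc invocations.
#include <cstdio>

__global__ void hello() {
    printf("hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}

// Typical invocations:
//   nvcc -arch=sm_80 example.cu -o example     # target compute capability 8.0
//   nvcc --keep example.cu -o example          # retain .ptx/.cubin intermediates
//   nvcc --resource-usage -c example.cu        # per-kernel register/memory report
//   nvcc --dry-run example.cu -o example       # list sub-commands without running
//   nvcc --version                             # report toolkit/compiler version
```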

Build System Compatibility

The NVIDIA CUDA compiler, nvcc, integrates with traditional build systems like Makefiles through custom compilation rules that invoke nvcc for processing CUDA source files (.cu) into object files (.o or .obj). Developers typically define rules such as %.o: %.cu to compile files using nvcc, often passing options via variables like NVCCFLAGS for customization, while ensuring dependency tracking with flags like -M or -MD to generate Makefile-compatible dependency files. This approach allows nvcc to handle device code while forwarding host code to the system's C++ compiler, such as g++ on Linux or cl.exe on Windows, enabling incremental builds in existing projects.

CMake support for nvcc has evolved from the deprecated FindCUDA module, introduced in CMake 2.8, which provided macros like cuda_add_executable for CUDA-specific targets, to the modern FindCUDAToolkit module available since CMake 3.17. The latter enables native CUDA language support with commands like enable_language(CUDA), allowing seamless addition of CUDA sources to standard targets via add_library or add_executable, and supports separable compilation through the CUDA_SEPARABLE_COMPILATION property set to ON for relocatable device code. This facilitates hybrid projects where CUDA device code is compiled separately and linked with host C++ code using standard CMake workflows, as exemplified in sample CMakeLists.txt files that mix .cu and .cpp files.

Integration with integrated development environments (IDEs) is provided through dedicated plugins and extensions that automate nvcc invocation. In Visual Studio, NVIDIA Nsight Visual Studio Edition offers project templates and build configurations that set nvcc flags and handle CUDA-specific properties, such as generating hybrid object files with the --compile option. Eclipse users leverage plugins for the Eclipse CDT to create CUDA projects, which set up build paths and invoke nvcc for Linux targets; the standalone Nsight Eclipse Edition was discontinued as of CUDA 11.0 in 2020. For Apple ecosystems, prior to CUDA's deprecation on macOS in version 10.2 in 2019, integration was available via command-line tools or scripts calling nvcc; modern alternatives like CLion support CUDA projects through CMake-based build management.

nvcc supports cross-platform development across Linux, Windows, and historically macOS, with host compiler selection via --compiler-bindir for non-default setups such as specific GCC or Clang installations. For embedded and ARM-based systems, such as Jetson platforms, cross-compilation is enabled with options like --target-dir aarch64-linux to generate code for specific architectures without altering the host build environment. Since CUDA 11.0 in 2020, enhanced hybrid compilation capabilities allow nvcc to produce standard object files containing both host and relocatable device code, which can be linked using conventional tools like g++ or MSVC linkers and integrated with dependency managers and CI pipelines for streamlined package resolution. A minimal sketch of this separable-compilation workflow appears below.
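
The sketch shows two illustrative translation units with the corresponding commands as comments; a Makefile would typically encapsulate these in a %.o: %.cu rule, and the library path is an assumption about a default Linux install.

```cuda
// ---- device_lib.cu: device function compiled separately ----
__device__ float square(float x) { return x * x; }

// ---- main.cu: kernel and host code using the external device function ----
extern __device__ float square(float x);

__global__ void apply(float* v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = square(v[i]);
}

int main() {
    float* v;
    cudaMallocManaged(&v, 256 * sizeof(float));
    for (int i = 0; i < 256; ++i) v[i] = 2.0f;
    apply<<<1, 256>>>(v, 256);
    cudaDeviceSynchronize();
    cudaFree(v);
    return 0;
}

// Build with relocatable device code, then link with a conventional toolchain:
//   nvcc -rdc=true -c device_lib.cu -o device_lib.o
//   nvcc -rdc=true -c main.cu -o main.o
//   nvcc -dlink device_lib.o main.o -o gpu_link.o
//   g++ device_lib.o main.o gpu_link.o -L/usr/local/cuda/lib64 -lcudart -o app
```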
