
Nvidia CUDA Compiler

The Nvidia CUDA Compiler, commonly referred to as NVCC, is a proprietary compiler driver developed by Nvidia for processing CUDA C++ source code, enabling developers to create high-performance applications that leverage the parallel processing capabilities of Nvidia GPUs. Released in June 2007 as part of the initial CUDA Toolkit 1.0, NVCC simplifies GPU programming by automatically separating and compiling host code (executed on the CPU) using standard C++ compilers such as GCC or Microsoft Visual C++, while transforming device code (intended for GPU execution) into intermediate PTX assembly or final SASS binaries tailored to specific GPU architectures. This dual-compilation approach hides the complexities of heterogeneous compilation, allowing seamless integration of GPU acceleration into conventional C++ programs without requiring developers to manage low-level GPU details.

NVCC operates through multiple phases, including preprocessing, host and device compilation, and linking, producing a "fatbinary" that embeds GPU code within the host executable for runtime loading via the CUDA runtime API. It supports targeting virtual architectures (e.g., compute_75) for forward compatibility and real architectures (e.g., sm_80 for Ampere GPUs), with options for optimization levels, debugging, and resource usage reporting to aid performance tuning. Since CUDA 5.0 in 2012, NVCC has enabled separate compilation of device code, facilitating modular development and link-time optimizations through tools like the nvlink device linker, which enhances code reuse across projects and supports advanced features such as dynamic parallelism. NVCC supports integration with modern host compilers, including LLVM-based ones like Clang. Runtime link-time optimization is provided via the nvJitLink library since CUDA 12.0 (December 2022). As of CUDA 13.0 (August 2025), NVCC supports architectures up to Blackwell, with the current release being CUDA Toolkit 13.0 Update 2 (October 2025).

Overview

Purpose and Functionality

The CUDA Compiler Driver, known as nvcc, is a proprietary tool developed by Nvidia for compiling CUDA applications, which extend C/C++ with GPU-specific programming constructs. It serves as the primary interface for developers to build programs that leverage Nvidia GPUs for parallel computing, abstracting the underlying complexities of heterogeneous CPU-GPU execution. At its core, nvcc facilitates the compilation of mixed host-device code by separating the host code, intended for execution on the CPU, from the device code, which runs on the GPU. The host code is processed using a standard C++ compiler (such as g++ or cl.exe) to produce conventional object files, while the device code undergoes compilation via Nvidia's specialized front-end and back-end tools, resulting in embeddable GPU binaries. This separation enables seamless integration of CPU and GPU workloads within a single application, allowing developers to focus on algorithm design without managing low-level GPU assembly details.

nvcc generates several output formats to support flexible deployment and runtime compatibility. For GPU binaries, it produces PTX (Parallel Thread Execution) as an intermediate assembly representation for virtual architectures, or CUBIN files as executable binaries tailored to specific GPU compute capabilities. These are often bundled into fatbinaries, which encapsulate multiple PTX or CUBIN variants for broad hardware support, embedded directly into the host object files for whole-program execution. Additionally, nvcc supports relocatable device code, which allows separate compilation of device functions into linkable objects, enabling modular builds and link-time optimizations before final embedding. By handling these processes, nvcc hides the intricacies of GPU-specific compilation, empowering developers to write single-program, multiple-data (SPMD) parallel code that executes efficiently across thousands of GPU threads on Nvidia hardware.
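
The following minimal sketch illustrates the host/device split that nvcc manages; the file name, kernel, and sizes are illustrative rather than drawn from NVIDIA's documentation.

```cuda
// vector_add.cu — compile with, e.g.: nvcc vector_add.cu -o vector_add
#include <cstdio>
#include <cuda_runtime.h>

// Device code: nvcc's internal toolchain lowers this to PTX and/or SASS.
__global__ void addVectors(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Host code: forwarded by nvcc to the system C++ compiler (g++, clang++, or cl.exe).
int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // The <<<...>>> launch syntax is a CUDA extension translated by nvcc.
    addVectors<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expected: 3.000000
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```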

Role in CUDA Ecosystem

The CUDA Compiler Driver, known as nvcc, serves as the primary entry point within the CUDA Toolkit for compiling source files with the .cu extension, which contain both host and device code. By processing these files, nvcc orchestrates the separation and compilation of CPU-executable code and GPU-executable code, ultimately producing binaries that integrate seamlessly with the CUDA runtime to enable kernel launches and GPU execution. This positioning allows developers to leverage GPU acceleration without managing low-level compilation intricacies, forming a foundational component of the broader CUDA ecosystem.

nvcc interacts closely with host compilers such as GCC on Linux or MSVC on Windows to handle the non-CUDA portions of the code, forwarding these compilation steps while embedding device code into the resulting host objects. Its outputs are designed for compatibility with essential libraries like cudart, which manage memory allocation, kernel dispatching, and synchronization, as well as the underlying GPU drivers that provide direct GPU access. These interactions ensure that compiled applications can link against CUDA libraries and execute across supported hardware without compatibility mismatches.

Within the CUDA ecosystem, nvcc enhances portability by generating Parallel Thread Execution (PTX) code as an intermediate representation, allowing code to run on diverse GPU architectures through just-in-time (JIT) compilation at runtime. This mechanism adapts binaries to the target device's compute capability, optimizing performance while accommodating variations in hardware features across generations. Additionally, nvcc supports dynamic optimization by enabling the runtime to JIT-compile PTX code, which benefits from ongoing compiler improvements without requiring recompilation. A distinctive aspect of nvcc's output is the fatbinary format, which bundles multiple compiled variants, including PTX and architecture-specific binaries, into a single file, permitting the CUDA runtime to select the most appropriate GPU code based on the GPU detected at execution time. This approach ensures forward compatibility, as applications compiled with earlier toolkit versions can leverage PTX for JIT compilation on newer GPUs, maintaining functionality across evolving architectures.
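
As a sketch of how fatbinary selection relates to the hardware, the program below queries the device's compute capability at runtime; the compile line in the comment, embedding both SASS and PTX, is one common arrangement, and the file name is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Example compile line embedding sm_80 SASS plus compute_80 PTX in one fatbinary:
//   nvcc -gencode arch=compute_80,code=sm_80 \
//        -gencode arch=compute_80,code=compute_80 query.cu -o query
// On an sm_80 GPU the runtime loads the SASS directly; on newer GPUs it
// JIT-compiles the embedded PTX for the detected architecture.
int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}
```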

History

Initial Development

The Nvidia CUDA Compiler, commonly referred to as nvcc, originated as a core component of Nvidia's CUDA project, initiated in the mid-2000s to pioneer general-purpose computing on graphics processing units (GPUs) and transcend the confines of graphics shader workloads. CUDA was first announced in November 2006. This effort addressed the growing demand for harnessing GPU parallelism in scientific and computational applications, where traditional CPU-based processing fell short in scalability. Prior to CUDA, general-purpose GPU (GPGPU) programming was hampered by reliance on graphics-oriented frameworks, such as programmable shaders or the Brook stream programming language developed at Stanford University around 2003. These approaches necessitated mapping non-graphics algorithms onto fixed graphics pipelines, imposing constraints like limited data access patterns (e.g., gather-only operations in Brook), restricted integer arithmetic, and inefficient memory access tied to texture units. By contrast, nvcc enabled a more intuitive C-based programming model, granting developers direct control over GPU threads, memory hierarchies, and parallel execution models, thereby streamlining the development of high-performance parallel code without graphics reformulation.

The initial public release of nvcc occurred in June 2007, bundled with CUDA Toolkit 1.0, which represented the first stable platform for compiling and deploying CUDA C extensions on compatible GPUs. This launch coincided with the broader debut of Nvidia's CUDA ecosystem, providing essential tools for hybrid CPU-GPU programming. Developed in-house at Nvidia as proprietary software, nvcc leveraged a customized variant of the open-source Open64 compiler for parsing, analyzing, and generating intermediate representations of device code, ensuring compatibility with GPU-specific optimizations from the outset.

Major Milestones and Releases

The CUDA compiler, nvcc, was first released in June 2007 as part of the initial CUDA Toolkit 1.0, enabling the compilation of CUDA C code for GPU acceleration. A significant advancement came with CUDA Toolkit 5.0 in October 2012, which introduced support for relocatable device code through the --relocatable-device-code (or -rdc=true) flag, allowing separate compilation and modular linking of device code objects without requiring full recompilation for each module. This feature marked a shift toward more flexible build processes for large-scale applications. In 2016, with CUDA Toolkit 8.0, nvcc gained improved compatibility with modern C++ dialects on the host side.

CUDA Toolkit 11.0, released in May 2020, began full support for the C++17 dialect in nvcc and upgraded the compiler's LLVM-based infrastructure for improved C++ dialect support and greater compatibility with open-source tools, enhancing the compiler's ability to handle modern host code while maintaining CUDA-specific extensions. This architectural change laid the groundwork for better integration with LLVM-based ecosystem tools. Building on this, ongoing enhancements for cross-compilation have been integrated, particularly for Arm platforms, with CUDA 11.7 Update 1 providing ready-to-use cross-compilation packages for Arm64 SBSA targets.

CUDA Toolkit 12.0 in December 2022 introduced full support for the C++20 dialect in nvcc, including features such as concepts for both host and device code (modules are not supported), and link-time optimization (LTO) via the new nvJitLink library, supporting just-in-time LTO for relocatable device code and offering performance improvements of up to 27% in offline scenarios and 3x for specific kernels like those in cuFFT. The latest major release, CUDA Toolkit 13.0 in August 2025, further extended nvcc capabilities with enhanced PTX support for the Hopper (compute capability 9.0) and Blackwell (compute capability 10.0) architectures, including optimizations for Tensor Cores and unified memory on platforms like Jetson Thor, while unifying the toolkit for Arm64 SBSA environments to streamline cross-platform development.

Compilation Process

Code Separation and Preprocessing

The NVIDIA CUDA compiler driver, nvcc, primarily accepts input files in the .cu format, which integrate conventional C++ host code executable on the CPU with GPU device functions annotated using CUDA-specific qualifiers such as __global__, __device__, and __host__. These files may also incorporate preprocessor directives, enabling features like macro definitions and conditional inclusions that apply across both host and device contexts. In the initial workflow stage, nvcc parses the source code to identify and separate host and device components based on these qualifiers: host code, intended for CPU execution, is extracted and forwarded to an external compiler such as GCC/G++ on Linux, Clang on macOS, or Microsoft Visual C++ (MSVC) on Windows; device code, marked for GPU execution, is retained for internal processing by NVIDIA's device-side toolchain. This separation ensures that CPU-specific constructs are handled appropriately while isolating GPU kernels and device definitions.

Preprocessing occurs before separation and is managed through a C-style preprocessor that resolves directives such as #include for essential headers like cuda_runtime.h, which provides runtime declarations accessible to both host and device code. Additionally, nvcc supports macro definitions via the -D flag and conditional compilation constructs (e.g., #ifdef or #if based on architecture-specific defines like __CUDA_ARCH__), allowing developers to tailor code for particular GPU compute capabilities without duplicating files. Include paths can be specified using the -I option to locate custom or system headers.

By default, nvcc employs whole-program mode, in which the device code from a single .cu file is compiled and embedded directly as a fatbinary image, a container for GPU binaries, into the corresponding host object file, facilitating seamless linkage during the final build. For scenarios requiring inspection of the preprocessed source, the --preprocess (or -E) option directs nvcc to output the expanded source to standard output or a file, bypassing further compilation stages.
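
A short sketch of architecture-conditional device code using __CUDA_ARCH__, with the preprocessing-related flags from above shown as comments; the function, threshold, and file names are illustrative.

```cuda
#include <cuda_runtime.h>

// __CUDA_ARCH__ is defined only during device-side compilation passes,
// with a value such as 800 when targeting compute capability 8.0.
__device__ float scaled(float x) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
    // Path compiled only when targeting compute capability 8.0 or newer.
    return __fmaf_rn(x, 2.0f, 1.0f);
#else
    // Fallback for older device targets; ignored during host compilation.
    return x * 2.0f + 1.0f;
#endif
}

__global__ void apply(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = scaled(data[i]);
}

// Inspect the expanded source without compiling further, with a macro and
// include path supplied on the command line:
//   nvcc -E -DUSE_FAST_PATH -I./include kernels.cu
```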

Device Code Generation

The device code generation phase in the NVIDIA CUDA Compiler (NVCC) transforms CUDA device code, separated during preprocessing, into executable formats suitable for GPU execution. This process begins with compilation to Parallel Thread Execution (PTX), a low-level virtual instruction set architecture (ISA) that provides a portable intermediate representation independent of specific GPU hardware. PTX is generated as text-based assembly, enabling cross-architecture compatibility and serving as input for just-in-time (JIT) compilation at runtime. For instance, the compilation option -gencode arch=compute_80,code=sm_80 directs NVCC to target the compute capability 8.0 virtual architecture while producing code for the real sm_80 architecture.

From PTX, NVCC invokes the PTX assembler (ptxas) to generate Streaming Multiprocessor Assembly (SASS), the native binary format for specific GPU architectures. This ahead-of-time (AOT) compilation produces cubin files, which are relocatable object binaries optimized for particular streaming multiprocessor (SM) versions, such as sm_80 for Ampere GPUs. In contrast, retaining PTX allows the CUDA driver to perform JIT compilation to SASS at runtime, adapting to the executing GPU's architecture without pre-generating binaries for every possible target. This dual approach balances portability and performance, with AOT preferred for production deployments targeting known hardware.

In separate compilation mode, enabled by --relocatable-device-code=true (or -dc), NVCC compiles individual device code units to relocatable cubin objects, which are then linked using the nvlink device linker via the --device-link option to form a single executable device image. This linked image is subsequently embedded into the host executable during the final host-device linking step. For multi-architecture support, NVCC allows specification of multiple targets in a single invocation, such as --gpu-architecture=compute_75 --gpu-code=sm_75,sm_86, generating code for both Turing (sm_75) and Ampere (sm_86) architectures. The resulting outputs are packaged into a fatbinary, a container format that bundles multiple PTX and cubin images, enabling runtime selection of the appropriate variant based on the GPU's compute capability.
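
The command lines below sketch both the multi-architecture fatbinary and separate-compilation workflows described above; the kernel and file names are hypothetical, and exact flag spellings may vary by toolkit version.

```cuda
// kernel.cu — an illustrative unit for multi-architecture builds.
__global__ void scale(float* v, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

// Ahead-of-time SASS for Turing and Ampere plus compute_75 PTX for JIT on
// newer GPUs, bundled into one fatbinary inside the object file:
//   nvcc --gpu-architecture=compute_75 --gpu-code=sm_75,sm_86,compute_75 \
//        -c kernel.cu -o kernel.o
//
// Separate-compilation variant: relocatable device code, then a device link:
//   nvcc -dc kernel.cu -o kernel.o
//   nvcc -dlink kernel.o -o device_link.o
```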

Internal Architecture

Front-End Components

The front-end of NVCC employs the Edison Design Group (EDG) C++ front-end to perform lexical analysis and parsing of source code, tokenizing standard C++ elements alongside CUDA-specific extensions such as intrinsics like __threadfence() and execution space attributes like __device__, __host__, and __global__. This lexer and parser process mixed host and device code in .cu files, constructing an abstract syntax tree that captures CUDA constructs for further analysis while forwarding host code to an external C++ compiler after separation. Semantic analysis follows parsing, conducting type checking to validate device and host qualifiers, ensuring that functions, variables, and types are correctly specified for their intended execution spaces and compatible across host-device boundaries. This phase enforces key restrictions, such as prohibiting recursion in device functions and kernels on architectures without dynamic call stacks, conforming to the GPU's SIMT execution model. The analysis also verifies semantic correctness of CUDA extensions, like memory fence operations and thread synchronization primitives, preventing invalid usages that could lead to undefined behavior on the device.

The front-end includes a dedicated preprocessor that expands CUDA-specific macros, such as __CUDA_ARCH__ for conditional compilation based on target GPU compute capabilities, alongside standard C++ preprocessing directives. For intermediate representation, the semantic analyzer generates LLVM IR, specifically NVIDIA's NVVM variant, for device code, which serves as a platform-independent form prior to PTX back-end processing and supports whole-program analysis in non-separate compilation mode for enhanced cross-module optimizations. This EDG-based front-end, retained from early NVCC versions that paired it with an Open64-derived device code generator, has been integrated with the LLVM-based back-end since CUDA 4.1 in 2011, enabling robust support for modern C++ dialects up to C++20 in both host and device code.
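
A brief sketch of the execution-space rules the front-end enforces; the names are illustrative. A __host__ __device__ function is compiled once per execution space, while the commented-out call shows the kind of cross-space violation that semantic analysis rejects.

```cuda
#include <cstdio>

// Compiled for both spaces; the front-end type-checks each instantiation.
__host__ __device__ inline float clamp01(float x) {
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

void hostOnly() { printf("host\n"); }  // implicitly __host__ only

__global__ void kernel(float* v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = clamp01(v[i]);
    // hostOnly();  // error: cannot call a __host__ function from a __global__ function
}
```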

Back-End and Optimization

The back-end of the CUDA compiler (nvcc) processes device code starting from NVVM IR (NVIDIA's variant of LLVM IR), leveraging the NVPTX back-end within the LLVM framework to generate Parallel Thread Execution (PTX) assembly, a virtual instruction set architecture (ISA) designed for NVIDIA GPUs. This conversion maps a subset of LLVM IR constructs, such as address spaces for global, shared, and local memory, to corresponding PTX directives and instructions, ensuring compatibility with GPU programming models like kernel launches and thread synchronization. The NVPTX back-end supports optimizations at this stage, including inlining and constant propagation, but defers more architecture-specific transformations to subsequent phases.

Following PTX generation, nvcc invokes the proprietary PTX assembler (ptxas) as the core optimizer for device code, which applies advanced passes to produce Streaming Multiprocessor Assembly (SASS), the native, hardware-specific executable format for target GPU architectures (e.g., sm_80 for Ampere GPUs). Ptxas performs critical optimizations such as register allocation, which assigns virtual registers from PTX to the limited physical registers available per thread (typically a maximum of 255 per thread on modern architectures), and instruction scheduling, which reorders operations to minimize latency from dependencies and memory accesses while respecting GPU execution models like warp scheduling. These passes aim to maximize occupancy and throughput; for instance, aggressive register allocation can spill excess register usage to local memory, trading faster register access for increased parallelism across threads.

Optimization levels in nvcc control the aggressiveness of these transformations, with flags like -O3 enabling comprehensive device code enhancements such as function inlining, loop unrolling, and dead code elimination to reduce instruction count and improve execution efficiency. Device-specific options further tune performance; for example, --maxrregcount limits the maximum registers per thread in a kernel (e.g., to 32), forcing the compiler to prioritize higher thread occupancy over per-thread speed, which is particularly useful for memory-bound workloads on GPUs with constrained register files. Ptxas options can be forwarded via -Xptxas, with ptxas defaulting to optimization level 3 for balanced performance.

Since CUDA 11.2, nvcc supports link-time optimization (LTO) for device code during offline compilation, enabled via -dlto, which preserves intermediate representations across compilation units to enable cross-module inlining and global analyses that were previously limited in separate compilation scenarios. This feature, building on a preview in CUDA 11.0, allows optimizations like eliminating redundant computations between kernels from different object files. CUDA 12.0 introduced runtime LTO support via the nvJitLink library for just-in-time (JIT) scenarios, which links LTO fragments at execution time, supporting inputs from nvcc or the NVIDIA Runtime Compilation (NVRTC) library, to dynamically inline library and application code, yielding up to 3x speedups in selective kernel finalization as demonstrated in libraries like cuFFT. In CUDA 13.0, NVCC introduced changes to ELF visibility and linkage for device code, improving binary compatibility and reducing relocation overhead in relocatable device code.
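
The sketch below shows how these tuning knobs are commonly applied; the kernel is illustrative, the register cap of 32 is an arbitrary example, and actual occupancy effects depend on the target GPU.

```cuda
// tune.cu — a register-hungry kernel used to illustrate tuning flags.
__global__ void axpyUnrolled(const float* x, float* y, float a, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    // Manual unrolling raises register pressure, which ptxas must allocate.
    if (i + 3 < n) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}

// Cap registers per thread at 32 to raise occupancy (may introduce spills):
//   nvcc -O3 --maxrregcount=32 -c tune.cu
// Report per-kernel register and memory usage from ptxas:
//   nvcc -O3 -Xptxas -v -c tune.cu
// Offline device LTO across separately compiled units (CUDA 11.2+):
//   nvcc -dc -dlto tune.cu other.cu
//   nvcc -dlto -dlink tune.o other.o -o device_link.o
```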

Features

Language Support and Extensions

The NVIDIA CUDA compiler, nvcc, primarily supports CUDA C/C++ as an extension of the ISO C++ standard, encompassing dialects up to C++20 as of CUDA 13.0 for both host and device code, with certain restrictions on device-side features such as dynamic polymorphism. Limited Fortran support is available indirectly through wrappers and interoperability with CUDA Fortran via the NVIDIA HPC SDK's nvfortran compiler, allowing Fortran applications to invoke CUDA kernels compiled by nvcc without direct compilation of Fortran source by nvcc itself.

CUDA C/C++ introduces a heterogeneous execution model centered on kernels, functions executed across thousands of threads on the GPU and denoted by the __global__ execution qualifier, enabling asynchronous launches from host code via the <<<...>>> syntax. Memory management is enhanced with qualifiers such as __shared__ for fast on-chip shared memory accessible within a thread block, __constant__ for read-only data broadcast to all threads, and __device__ for GPU-resident variables, facilitating efficient data locality and access patterns in parallel computations. Built-in intrinsics provide low-level control, exemplified by __syncthreads(), which enforces barrier synchronization across threads in a block to coordinate shared memory accesses and computations.

Advanced language features include support for C++ templates in device code, allowing generic kernel implementations that can be instantiated for different data types at compile time to promote code reuse and type safety. Lambda expressions have been supported in device code since CUDA 7.0 (released in 2015), enabling concise function definitions for callbacks and reductions within kernels. The Cooperative Groups API, introduced in CUDA 9.0, extends synchronization beyond basic block-level barriers by allowing developers to define and manage arbitrary groups of threads, such as tiles, clusters, or entire grids, for collective operations like synchronization and communication, improving expressiveness in complex parallel algorithms. A notable capability is the honoring of extern "C" linkage in device code starting with CUDA 11.0, which facilitates name-mangling avoidance and interoperability with C-based libraries or external tools in GPU contexts. Additionally, nvcc supports inline PTX assembly through .ptx files or embedded asm() statements, permitting direct insertion of low-level Parallel Thread Execution (PTX) instructions for fine-grained optimization where high-level constructs are insufficient.
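
The following sketch combines several of these extensions: a templated kernel, __shared__ staging, __syncthreads() barriers, and an embedded asm() statement; the block size, names, and reduction pattern are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Templated device code: instantiated per element type at compile time.
template <typename T, int BLOCK>
__global__ void blockSum(const T* in, T* out, int n) {
    __shared__ T tile[BLOCK];                 // fast on-chip shared memory
    int i = blockIdx.x * BLOCK + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : T(0);
    __syncthreads();                          // barrier across the block

    // Tree reduction in shared memory.
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        // Inline PTX: a global memory fence, equivalent to __threadfence().
        asm volatile("membar.gl;");
        out[blockIdx.x] = tile[0];
    }
}

int main() {
    const int n = 1024, block = 256;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, (n / block) * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    blockSum<float, block><<<n / block, block>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("partial sum 0 = %f\n", out[0]);   // expected: 256.000000
    cudaFree(in); cudaFree(out);
    return 0;
}
```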

Compilation Options and Tools

The NVIDIA CUDA compiler driver, nvcc, provides options for targeting specific GPU architectures to generate compatible code for various NVIDIA hardware generations. The --gpu-architecture flag (also known as -arch) specifies the virtual architecture for generating Parallel Thread Execution (PTX) intermediate code, using compute capability values such as compute_100 for the Blackwell architecture, with supported targets historically ranging from compute capability 1.0 (Tesla) to 10.0 and beyond in recent CUDA Toolkit versions. In contrast, the --gpu-code flag directs the generation of real architecture binaries in SASS format for specific hardware, denoted as sm_XX, enabling JIT compilation of PTX on devices with higher capabilities than the targeted binary. This dual approach allows forward compatibility, where PTX for a lower compute capability can be assembled on-device for newer GPUs.

For debugging CUDA applications, nvcc includes flags that embed symbolic information and source mapping without fully compromising performance. The --device-debug flag (equivalent to -G) generates full debug information for device code, including variables and line numbers, but it disables most optimizations to preserve accurate debugging, making it suitable for development but not production. Complementing this, the --generate-line-info flag (or -lineinfo) adds only line-number information to the generated code, allowing source-level debugging and profiling in tools like NVIDIA Nsight without altering the optimization level, which is recommended for performance analysis. These options integrate with debuggers such as cuda-gdb to facilitate breakpoint setting and variable inspection on the GPU.

Optimization flags in nvcc balance code performance and development needs, primarily affecting device code compilation. The -G flag enforces debug mode by setting the optimization level to -O0, suppressing inlining, loop unrolling, and other transformations to ensure predictable behavior during debugging. For release builds, -O2 or -O3 enable aggressive optimizations, including inlining and loop unrolling, with -O3 providing the highest level for maximum throughput on supported architectures. Additionally, the --use_fast_math flag activates approximations in floating-point operations, such as fused multiply-add (FMAD) contraction and reduced-precision division and square roots, which can improve performance in compute-intensive kernels at the cost of slight accuracy loss, particularly beneficial for non-critical numerical computations.

nvcc integrates auxiliary tools to handle low-level assembly and analysis of CUDA binaries. The PTX assembler, ptxas, is automatically invoked during the compilation pipeline to convert PTX to SASS, with user-configurable options passed via --ptxas-options (or -Xptxas) for fine-tuning aspects like register usage limits or instruction scheduling. For binary inspection, nvdisasm serves as the disassembler, converting cubin or fatbin files back to readable SASS, aiding in performance tuning and verification of compiler output. Since CUDA Toolkit 12.8, nvcc supports compression modes through the --compress-mode flag (options: default, size, speed, balance, none), which applies binary compression to generated artifacts like PTX and cubin files, reducing object sizes, for example by about 17% in default mode for certain libraries, while maintaining compatibility with loaders. In CUDA 13.0, NVCC introduced changes to ELF visibility and linkage for global functions to improve compatibility and reduce symbol visibility issues.
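
A sketch of typical debug, profiling, and release invocations for a single file; the file name is hypothetical and the flag combinations are examples rather than prescribed defaults.

```cuda
// saxpy.cu — used to illustrate debug and optimization flags.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];  // eligible for FMAD contraction
}

// Debug build: full device debug info, optimizations disabled (-G implies -O0):
//   nvcc -G -g saxpy.cu -o saxpy_debug
// Profiling build: line info for Nsight/cuda-gdb without changing optimization:
//   nvcc -O3 -lineinfo saxpy.cu -o saxpy_profile
// Fast-math release build, forwarding verbose resource reporting to ptxas:
//   nvcc -O3 --use_fast_math -Xptxas -v saxpy.cu -o saxpy
// Disassemble embedded SASS for inspection (nvdisasm works on extracted .cubin):
//   cuobjdump -sass saxpy
```
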
A key capability of nvcc is its support for cross-compilation, enabling development on x86 hosts for non-x86 targets such as ARM-based systems. This is facilitated by flags like --target-os=linux and --target-arch=aarch64, which configure the compiler for platforms like Jetson devices, ensuring the host code is built with compatible toolchains (e.g., cross toolchains for AArch64) while generating device code for the target GPU architecture. Cross-compilation requires installing target-specific libraries, such as cuda-cross-aarch64 packages, and verifies compatibility through the same targeting options, allowing seamless deployment to embedded environments without native build hardware.

Usage and Integration

Command-Line Operation

The NVIDIA CUDA compiler driver, nvcc, operates from the command line with the basic syntax nvcc [options] file.cu -o output, where file.cu represents one or more input source files and -o specifies the output executable name. This invocation compiles both host and device code, embedding the device code into the host object, and by default links against the static CUDA runtime library (cudart) for runtime support. Options are prefixed with a single hyphen for short forms (e.g., -o) or double hyphens for long forms (e.g., --output-file), and they control aspects like architecture targeting, optimization levels, and preprocessing.

A typical workflow for compiling a single-file CUDA program targets a specific GPU architecture, as in nvcc -arch=sm_80 example.cu -o example, which generates code for compute capability 8.0 (corresponding to Ampere GPUs) and produces an executable named example. The -arch option, detailed further in the compilation options documentation, specifies both the virtual architecture used for PTX generation and the real architecture used for ahead-of-time binary generation.

Output management is handled through dedicated flags: --output-file (or -o) sets the name and path for the primary output file, overriding the default a.out for executables. To preserve intermediate artifacts for inspection, such as PTX assembly in .ptx files or device binaries in .cubin and fatbinary files, the --keep option retains these files in the current directory instead of deleting them after compilation.

For error diagnosis and validation, nvcc provides utilities like --verbose (or -v), which logs detailed information on each compilation phase, including invoked sub-tools and file transformations. The --resource-usage flag generates a report on kernel resource consumption, detailing metrics such as register usage per thread, local memory allocation, and shared memory footprint to identify potential bottlenecks. Additionally, --dry-run simulates the entire compilation process by listing all sub-commands and options without executing them, aiding in option validation and workflow testing. Version information is accessible via nvcc --version (or -V), which reports the compiler release, such as CUDA 13.0 in the toolkit version current as of late 2025.
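
The example below gathers the invocations discussed in this section around a trivial program; the file name is hypothetical.

```cuda
// example.cu — minimal program for exercising common nvcc invocations.
#include <cstdio>

__global__ void hello() {
    printf("hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}

// Typical invocations:
//   nvcc -arch=sm_80 example.cu -o example     # target compute capability 8.0
//   nvcc --keep example.cu -o example          # retain .ptx/.cubin intermediates
//   nvcc --resource-usage -c example.cu        # per-kernel register/memory report
//   nvcc --dry-run example.cu -o example       # list sub-commands without running
//   nvcc --version                             # report toolkit/compiler version
```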

Build System Compatibility

The NVIDIA CUDA compiler, nvcc, integrates with traditional build systems like Makefiles through custom compilation rules that invoke nvcc for processing CUDA source files (.cu) into object files (.o or .obj). Developers typically define rules such as %.o: %.cu to compile files using nvcc, often passing options via variables like NVCCFLAGS for customization, while ensuring dependency tracking with flags like -M or -MD to generate Makefile-compatible dependency files. This approach allows nvcc to handle device code while forwarding host code to the system's C++ compiler, such as g++ on Linux or cl.exe on Windows, enabling incremental builds in existing projects.

CMake support for nvcc has evolved from the deprecated FindCUDA module, introduced in CMake 2.8, which provided macros like cuda_add_executable for CUDA-specific targets, to the modern FindCUDAToolkit module available since CMake 3.17. The latter enables native CUDA language support with commands like enable_language(CUDA), allowing seamless addition of CUDA sources to standard targets via add_library or add_executable, and supports separable compilation through the CUDA_SEPARABLE_COMPILATION property set to ON for relocatable device code. This facilitates hybrid projects where CUDA device code is compiled separately and linked with host C++ code using standard CMake workflows, as exemplified in sample CMakeLists.txt files that mix .cu and .cpp files.

Integration with integrated development environments (IDEs) is provided through dedicated plugins and extensions that automate nvcc invocation. In Visual Studio, NVIDIA Nsight Visual Studio Edition offers project templates and build configurations that set nvcc flags and handle CUDA-specific properties, such as generating hybrid object files with the --compile option. Eclipse users leverage plugins for the Eclipse CDT to create CUDA projects, which set up build paths and invoke nvcc for Linux targets; the standalone Nsight Eclipse Edition was discontinued as of CUDA 11.0 in 2020. For Apple ecosystems, prior to CUDA's deprecation on macOS in version 10.2 in 2019, integration was available via command-line tools or scripts calling nvcc; modern alternatives like CLion support CUDA projects through CMake-based build management.

nvcc supports cross-platform development across Linux, Windows, and historically macOS, with host compiler selection via --compiler-bindir for non-default setups such as specific GCC or Clang installations. For embedded and ARM-based systems, such as Jetson platforms, cross-compilation is enabled with options like --target-dir aarch64-linux to generate code for specific architectures without altering the host build environment. Since CUDA 11.0 in 2020, enhanced hybrid compilation capabilities allow nvcc to produce standard object files containing both host and relocatable device code, which can be linked using conventional tools like g++ or MSVC linkers and integrated with dependency managers and CI pipelines for streamlined package resolution. A minimal sketch of this separable-compilation workflow appears below.
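
The sketch shows two illustrative translation units with the corresponding commands as comments; a Makefile would typically encapsulate these in a %.o: %.cu rule, and the library path is an assumption about a default Linux install.

```cuda
// ---- device_lib.cu: device function compiled separately ----
__device__ float square(float x) { return x * x; }

// ---- main.cu: kernel and host code using the external device function ----
extern __device__ float square(float x);

__global__ void apply(float* v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = square(v[i]);
}

int main() {
    float* v;
    cudaMallocManaged(&v, 256 * sizeof(float));
    for (int i = 0; i < 256; ++i) v[i] = 2.0f;
    apply<<<1, 256>>>(v, 256);
    cudaDeviceSynchronize();
    cudaFree(v);
    return 0;
}

// Build with relocatable device code, then link with a conventional toolchain:
//   nvcc -rdc=true -c device_lib.cu -o device_lib.o
//   nvcc -rdc=true -c main.cu -o main.o
//   nvcc -dlink device_lib.o main.o -o gpu_link.o
//   g++ device_lib.o main.o gpu_link.o -L/usr/local/cuda/lib64 -lcudart -o app
```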
