ROCm

ROCm (Radeon Open Compute) is an open-source software stack developed by Advanced Micro Devices (AMD) that enables GPU-accelerated computing for high-performance computing (HPC), artificial intelligence (AI), and heterogeneous workloads on AMD Graphics Processing Units (GPUs). It provides a comprehensive ecosystem including drivers, runtime libraries, development tools, and APIs, allowing developers to program GPUs from low-level kernels to high-level applications while supporting multiple programming models such as HIP (Heterogeneous-compute Interface for Portability), OpenCL, and OpenMP. Designed primarily for Linux and Windows operating systems, ROCm optimizes performance on AMD Instinct accelerators for data center use and extends support to AMD Radeon GPUs and Ryzen APUs for consumer and workstation applications. Originally released in 2016 with version 1.0, ROCm has evolved over nearly a decade to address the growing demands of AI and HPC, with leading enterprises and research institutions adopting it for scalable GPU computing. Key components include specialized libraries such as MIOpen for deep learning, rocBLAS for linear algebra, and RCCL for collective communications, alongside tools like the ROCm Compute Profiler for performance analysis and HIPIFY for porting CUDA code to HIP. Compilers like HIPCC and ROCmCC, combined with runtimes such as ROCR-Runtime, form the core architecture that ensures portability and compatibility with industry-standard frameworks. As of November 2025, the latest stable release is ROCm 7.1.0, which introduces enhancements in hardware monitoring via the AMD System Management Interface (AMD SMI), improved resiliency for Instinct MI300X GPUs, and broader support for AI workloads through integrations with popular frameworks. This version builds on prior releases like ROCm 7.0 from September 2025, emphasizing developer productivity, enterprise scalability, and open innovation in GPU programming. ROCm's open-source nature, hosted on GitHub, fosters community contributions and customization, positioning it as a competitive alternative to proprietary platforms in the GPU computing landscape.

Overview

Definition and Purpose

ROCm (Radeon Open Compute) is an open-source software platform developed by AMD for GPU-accelerated computing, comprising a comprehensive stack that includes drivers, runtimes, application programming interfaces (APIs), and libraries to enable heterogeneous computing on AMD GPUs. Heterogeneous computing in this context refers to the integration of central processing units (CPUs) and graphics processing units (GPUs) to perform parallel processing tasks, allowing applications to offload compute-intensive operations from the host CPU to the GPU device for improved efficiency in data-parallel workloads. This stack supports programming from low-level kernels to high-level end-user applications, fostering an ecosystem for developers to leverage AMD hardware in diverse computational scenarios. The primary purpose of ROCm is to offer an open-source alternative to proprietary GPU computing platforms, such as NVIDIA's CUDA, by providing portability and compatibility across GPUs for high-performance computing (HPC), artificial intelligence (AI), machine learning, and graphics workloads. By emphasizing open-source development, ROCm enables community contributions and reduces vendor lock-in, allowing developers to migrate code more easily between CUDA and other ecosystems through tools like the Heterogeneous-compute Interface for Portability (HIP). Its design prioritizes extracting optimal performance from HPC and AI applications, including large-scale model training and inference, while maintaining compatibility with standard deep learning frameworks. Key features of ROCm include its modular architecture, which allows independent development and integration of components, and its predominantly open-source nature under permissive licenses such as MIT for most repositories, promoting widespread adoption and customization. The platform primarily targets Linux operating systems like Ubuntu for full functionality, with growing support for Windows, including ROCm components and AI framework integrations as of 2025. Furthermore, ROCm integrates seamlessly with popular frameworks such as PyTorch and TensorFlow, enabling mixed-precision training and scalable AI workflows through optimized libraries like MIOpen and RCCL.

History and Versions

ROCm originated in 2016 as an open-source platform developed by AMD to enable GPU-accelerated computing on its Radeon GPUs, initially targeting high-performance computing (HPC) workloads on Polaris architecture hardware, such as the Radeon RX 480. The platform was first released on November 14, 2016, providing foundational GPU compute support and introducing the Heterogeneous-compute Interface for Portability (HIP) to facilitate code portability from NVIDIA's CUDA ecosystem. Early releases emphasized integration with the Heterogeneous System Architecture (HSA) standard for unified CPU-GPU programming. Subsequent milestones included the open-sourcing of additional components, such as the OpenCL runtime in May 2017, broadening community contributions and ecosystem development. In December 2020, ROCm 4.0 introduced support for the CDNA architecture on Instinct MI100 GPUs and enhanced features like cooperative groups, improving compatibility and expanding to more diverse workloads. This version also marked initial steps toward broader GPU integration, though primarily focused on professional hardware. Version progression continued with ROCm 5.0 in February 2022, which delivered improved stability through bug fixes and better driver integration, alongside preliminary support for consumer GPUs like the RDNA 2-based Radeon RX 6000 series for compute tasks. ROCm 6.0, released in December 2023, enhanced capabilities with optimizations for FP8 data types in PyTorch, full support for Instinct MI300 GPUs, and expanded library compatibility for AI frameworks. These updates reflected growing emphasis on AI alongside HPC, with performance gains in transformer models and broader OS support including Windows previews. In September 2025, ROCm 7.0 represented a pivotal shift toward an AI-HPC hybrid ecosystem, delivering up to 3.8x performance uplifts in inference for large language models like DeepSeek compared to ROCm 6.0, full enablement of Instinct MI350 GPUs based on the CDNA 4 architecture, integration of Retrieval-Augmented Generation (RAG) tools for AI pipelines, and advanced enterprise features such as distributed inference and improved multi-GPU scaling. This release underscored AMD's commitment to open innovation, with enhanced developer tools and ecosystem partnerships to compete in AI deployments while maintaining HPC roots. ROCm 7.1.0, released on October 30, 2025, introduced enhancements in hardware monitoring via the AMD System Management Interface (AMD SMI), improved resiliency for AMD Instinct MI300X GPUs, and broader support for AI workloads through integrations with popular deep learning frameworks.

Foundations

Heterogeneous System Architecture

Heterogeneous System Architecture (HSA) is an open industry standard developed to enable seamless integration of CPUs, GPUs, and other compute devices as peer processors within a unified computing environment. It defines a system architecture where heterogeneous components share a single coherent memory space, allowing applications to treat diverse hardware as a cohesive system without the traditional barriers of separate address spaces. This architecture addresses key challenges in heterogeneous computing by promoting interoperability across devices from different vendors, thereby simplifying programming and enhancing overall system efficiency. Central to HSA are several key concepts that facilitate efficient resource utilization. Unified virtual addressing provides a consistent memory view across all agents, enabling pointers to reference data regardless of the hosting device and eliminating the need for explicit data transfers between CPU and GPU memory. Fine-grained memory management allows for precise control over memory allocation and access permissions at the page level, supporting features like coherent regions with atomic operations and barriers to maintain data consistency during concurrent execution. The agent-based execution model treats each compute unit—such as a CPU core or GPU compute unit—as an independent agent capable of initiating and managing workloads, which promotes scalable parallelism by dispatching tasks to the most suitable hardware with minimal overhead. In ROCm, HSA serves as the foundational layer for device interaction and kernel execution. The platform leverages the HSA Intermediate Language (HSAIL), a portable intermediate representation for compute kernels, which allows source code written in higher-level languages to be compiled into a device-agnostic form before finalization for specific hardware targets. The HSA runtime, implemented in ROCm through the ROCr library, manages device enumeration, queue creation, and signal handling, providing low-level APIs for applications to dispatch kernels and synchronize operations across agents. This integration ensures that ROCm applications can interact with GPUs as HSA-compliant agents, inheriting the standard's queuing and signaling protocols for robust heterogeneous execution. The adoption of HSA in ROCm yields significant benefits for heterogeneous workloads, particularly in enabling seamless collaboration between CPU and GPU without requiring explicit data copies. By utilizing unified address spaces, developers can allocate memory accessible by both processors, reducing latency and overhead associated with traditional data movement, which is especially advantageous for memory-intensive applications like machine learning and scientific simulations. Furthermore, HSA's support for scalable parallelism allows ROCm to efficiently distribute computations across multiple agents, improving throughput and power efficiency in diverse computing scenarios.
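
The ROCr runtime's role in device enumeration can be made concrete with the core HSA API. The following minimal sketch, assuming a system with the ROCm HSA runtime installed and linked (-lhsa-runtime64), iterates over all HSA agents and prints each one's name and device type:

#include <hsa/hsa.h>
#include <cstdio>

// Callback invoked once per HSA agent (CPU or GPU) discovered by the runtime.
static hsa_status_t print_agent(hsa_agent_t agent, void* /*data*/) {
    char name[64] = {0};
    hsa_agent_get_info(agent, HSA_AGENT_INFO_NAME, name);

    hsa_device_type_t type;
    hsa_agent_get_info(agent, HSA_AGENT_INFO_DEVICE, &type);

    std::printf("Agent: %-20s type: %s\n", name,
                type == HSA_DEVICE_TYPE_GPU ? "GPU" :
                type == HSA_DEVICE_TYPE_CPU ? "CPU" : "other");
    return HSA_STATUS_SUCCESS;  // returning success continues the iteration
}

int main() {
    hsa_init();                               // bring up the HSA runtime (ROCr)
    hsa_iterate_agents(print_agent, nullptr); // enumerate all agents in the system
    hsa_shut_down();                          // release runtime resources
    return 0;
}

On an HSA-compliant system this lists both CPU agents and any ROCm-visible GPU agents, reflecting the peer-processor model described above.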

Programming Paradigms

ROCm supports the Single Instruction, Multiple Threads (SIMT) execution model, which enables efficient data parallelism on GPU architectures by executing the same instruction across multiple threads simultaneously, allowing data-parallel algorithms to map onto massively parallel hardware. In this paradigm, developers launch kernels—functions that run on the GPU—as parallel tasks organized in a hierarchical structure: individual threads execute computations, grouped into thread blocks (or workgroups) that share resources, and multiple blocks form a grid for large-scale parallelism. This model draws from established GPU computing concepts but is optimized for AMD hardware, where wavefronts—AMD's co-scheduled groups of threads—typically consist of 64 threads to align with the architecture's wavefront size, differing from the 32-thread warps in some other ecosystems. A key aspect of ROCm's heterogeneous focus is its support for asynchronous execution, which allows non-blocking operations between CPU and GPU devices, enabling overlap of computation, data transfer, and communication to maximize throughput in heterogeneous environments. Stream-based parallelism further enhances this by organizing tasks into independent streams, where multiple kernels or memory operations can execute concurrently across devices without interference, facilitating efficient multi-device setups. Error handling in such configurations involves runtime checks and events to detect and recover from issues like out-of-memory conditions or device failures, ensuring robust operation in heterogeneous systems that integrate CPUs, GPUs, and other accelerators. The evolution of ROCm's programming paradigms has progressed from low-level, assembly-like interfaces that provided fine-grained control over GPU resources to higher-level abstractions that prioritize developer productivity and code portability across hardware vendors. This shift emphasizes avoiding vendor lock-in through standards-based models, such as those built on the Heterogeneous System Architecture (HSA), which unify memory and execution across CPU and GPU without explicit data copies. Early ROCm versions focused on direct hardware access for performance tuning, while recent developments introduce portable layers that abstract hardware differences, enabling seamless migration of code between AMD and compatible platforms. Users approaching ROCm programming require familiarity with foundational concepts, including thread blocks for local collaboration and wavefronts for efficient instruction dispatch, adapted to AMD's optimizations like larger wavefronts for better utilization of compute units. Understanding memory hierarchies is also essential: global memory offers high-capacity but higher-latency access shared across all threads, local (or group) memory provides faster shared access within thread blocks for reducing global traffic, and private memory per thread ensures isolation for scalar variables. These elements form the prerequisites for leveraging ROCm's paradigms effectively, promoting scalable and efficient GPU-accelerated applications.
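
Stream-based overlap can be illustrated with the standard HIP runtime API. The following sketch, using a hypothetical scale kernel, splits a buffer into two halves and queues each half's copy-in, kernel, and copy-out on its own stream, so transfers for one half can overlap with computation on the other:

#include <hip/hip_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int N = 1 << 20, half = N / 2;
    float *h;
    // Pinned host memory is required for truly asynchronous copies.
    hipHostMalloc(reinterpret_cast<void**>(&h), N * sizeof(float), hipHostMallocDefault);
    float *d;
    hipMalloc(&d, N * sizeof(float));

    hipStream_t s[2];
    for (int k = 0; k < 2; ++k) hipStreamCreate(&s[k]);

    // Each chunk's work is enqueued on its own stream; the runtime may
    // overlap stream 0's kernel with stream 1's transfers, and vice versa.
    for (int k = 0; k < 2; ++k) {
        float *hp = h + k * half, *dp = d + k * half;
        hipMemcpyAsync(dp, hp, half * sizeof(float), hipMemcpyHostToDevice, s[k]);
        hipLaunchKernelGGL(scale, dim3((half + 255) / 256), dim3(256), 0, s[k],
                           dp, half, 2.0f);
        hipMemcpyAsync(hp, dp, half * sizeof(float), hipMemcpyDeviceToHost, s[k]);
    }
    for (int k = 0; k < 2; ++k) {
        hipStreamSynchronize(s[k]);  // block until all work on this stream finishes
        hipStreamDestroy(s[k]);
    }
    hipFree(d);
    hipHostFree(h);
    return 0;
}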

Hardware Support

Professional GPUs

ROCm provides comprehensive support for AMD's Instinct MI series GPUs, which are designed for datacenter and high-performance computing (HPC) environments, particularly in artificial intelligence (AI) and large-scale simulations. The supported families include the MI300 series, such as the MI300X and MI325X based on the CDNA 3 architecture, and the MI350 series, including models like the MI350X and MI355X utilizing the advanced CDNA 4 architecture. These GPUs are optimized for high-bandwidth memory (HBM) configurations, with the MI350 series featuring up to 288 GB of HBM3E memory to handle massive datasets in AI training and inference workloads. Additionally, they incorporate specialized matrix cores for accelerated tensor operations, enabling efficient processing of deep learning models and scientific computations. Key features of ROCm on these professional GPUs include full integration of the software stack, supporting high-precision floating-point operations such as FP64 for demanding HPC applications like climate modeling and molecular dynamics. Multi-GPU scaling is facilitated through AMD Infinity Fabric technology, which provides high-speed, low-latency interconnects between GPUs, allowing seamless memory sharing and load balancing across multiple accelerators in a single node or cluster. This enables configurations like eight-GPU systems with coherent memory access, enhancing scalability for distributed training. In 2025, ROCm 7.0 introduced full enablement for the MI350 series, marking a significant advancement in AI infrastructure support. Released in September 2025, this version delivers up to 3.5x faster inference performance compared to ROCm 6.0 on models like Llama 3.1 and DeepSeek R1, achieved through optimizations in inference engines such as vLLM and SGLang. ROCm 7.1, released in October 2025, builds on these advancements with improved resiliency for AMD Instinct MI300X GPUs and enhancements in hardware monitoring. ROCm deployment on Instinct GPUs is limited to enterprise Linux distributions, including Ubuntu 24.04, Red Hat Enterprise Linux 9, and SUSE Linux Enterprise Server 15, to ensure stability in production environments. It does not support interoperability with consumer graphics cards, focusing exclusively on compute-oriented datacenter hardware.
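
Multi-GPU topologies like these are exposed to applications through HIP's peer-to-peer API. The sketch below is a minimal illustration rather than a production pattern: it checks whether device 0 can directly access device 1's memory and enables the mapping if so.

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    hipGetDeviceCount(&count);
    if (count < 2) { std::printf("Need at least two GPUs\n"); return 0; }

    int canAccess = 0;
    hipDeviceCanAccessPeer(&canAccess, 0, 1);  // can device 0 map device 1's memory?
    std::printf("P2P 0 -> 1: %s\n", canAccess ? "yes" : "no");

    if (canAccess) {
        hipSetDevice(0);
        hipDeviceEnablePeerAccess(1, 0);  // flags must be 0; enables direct access
        // Kernels on device 0 may now dereference pointers allocated on device 1,
        // and hipMemcpyPeer can use the direct link (e.g., Infinity Fabric).
    }
    return 0;
}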

Consumer GPUs

ROCm provides experimental and preview-level support for AMD's consumer GPUs based on the RDNA architectures, enabling compute workloads on desktop systems at a lower cost compared to Instinct series hardware. Supported architectures include RDNA 2 (gfx1030, such as the Radeon RX 6900 XT), RDNA 3 (gfx1100 and gfx1101, such as the Radeon RX 7900 XTX), and partial support for RDNA 4 (gfx1200 and gfx1201, such as select Radeon RX 9000 series models starting with ROCm 6.4.1 and expanded in ROCm 7.0). This support focuses on compute-only operations, excluding graphics interoperability or rendering during execution, which limits configurations where the GPU is attached to a display for simultaneous visual output. Key features on these consumer GPUs include basic HIP (Heterogeneous-compute Interface for Portability) support for porting CUDA code and OpenCL for general-purpose compute, allowing developers to run applications without full enterprise-level optimization. However, precision support is reduced; for instance, double-precision floating-point (FP64) operations are available but perform at a significantly lower rate (approximately 1/32 of FP32 throughput on RDNA architectures), making them unsuitable for high-precision scientific simulations that demand full-rate FP64 as found in professional GPUs. Multi-GPU configurations are in preview status with limited validation, supporting up to two simultaneous compute workloads but prone to errors like GPU resets or out-of-memory issues in demanding scenarios, contrasting with the robust scalability of Instinct accelerators. Primary use cases for ROCm on consumer Radeon GPUs involve entry-level AI and machine learning tasks on desktops, such as local inference for large language models (e.g., via PyTorch or TensorFlow integrations) and lightweight training for personal development workflows. These enable accessible experimentation with generative AI, like running Hugging Face models for content creation or basic scientific computing, though performance caveats include intermittent crashes during extended runs and no backward pass support for ML training on Windows. In 2025, developments like ROCm 7.0 expanded RDNA 4 compatibility and added Windows preview support for Radeon GPUs, enhancing broader accessibility for AI enthusiasts while maintaining a secondary focus relative to the more mature Instinct ecosystem for production-scale deployments. ROCm 7.1 further introduces initial support for select Ryzen APUs.
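
Because consumer support is gated by architecture (the gfx target), applications often verify the device at runtime before relying on preview-level features. A small sketch using the standard hipGetDeviceProperties API, with the gfx11 check shown purely as an example heuristic:

#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstring>

int main() {
    int count = 0;
    hipGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, d);
        // gcnArchName holds the gfx target, e.g. "gfx1100" on a Radeon RX 7900 XTX.
        std::printf("Device %d: %s (%s), %zu MiB VRAM, wavefront size %d\n",
                    d, prop.name, prop.gcnArchName,
                    prop.totalGlobalMem >> 20, prop.warpSize);
        if (std::strncmp(prop.gcnArchName, "gfx11", 5) == 0)
            std::printf("  RDNA 3 class GPU: consumer-level ROCm support expected\n");
    }
    return 0;
}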

System Requirements

ROCm primarily supports Linux operating systems, with official compatibility for distributions including Ubuntu 24.04.3 and 22.04.5, Red Hat Enterprise Linux (RHEL) 10.0, 9.6, 9.4, and 8.10, SUSE Linux Enterprise Server (SLES) 15 SP7, Debian 13 and 12, Oracle Linux 9, Azure Linux 3.0, and Rocky Linux 10, 9, and 8. Limited support is available on Windows through the Windows Subsystem for Linux (WSL2), enabling ROCm development on compatible GPUs and APUs, though it is not as comprehensive as native Linux support. ROCm does not support macOS. The software requires the open-source amdgpu kernel driver, version 5.15 or later, along with ROCm-specific kernel modules such as kfd and amdgpu for GPU management and compute dispatch. These drivers handle device initialization, memory management, and PCIe communication, ensuring compatibility with supported GPUs. Supported kernel versions vary by distribution; for example, Ubuntu 24.04.3 uses kernel 6.8 or higher, while RHEL 8.10 supports kernel 4.18. Beyond GPUs, ROCm runs on x86_64 architectures with CPUs that support PCIe atomics, such as AMD Zen-based processors (first generation and later) or Intel Haswell and subsequent generations. Limited ARM64 support is available in experimental configurations for select Instinct accelerators. For AI and machine learning workloads, a minimum of 16 GB system RAM is recommended to handle data loading and model training efficiently, while GPUs require PCIe 4.0 or higher interfaces for optimal bandwidth and performance in datacenter environments. As of November 2025, ROCm 7.1 offers enhanced container support through compatibility with Docker and Podman for streamlined cloud and edge deployments, including advanced features like improved multi-GPU scaling.

Programming Model

HIP Interface

HIP (Heterogeneous-compute Interface for Portability) is a C++ runtime API and kernel language developed by AMD as part of the ROCm platform, enabling developers to create portable applications that run on both AMD GPUs via ROCm and NVIDIA GPUs via CUDA from a single source codebase. This interface targets heterogeneous systems, supporting CPU and GPU execution while minimizing performance overhead compared to native CUDA or ROCm coding. HIP's design emphasizes familiarity for CUDA programmers, with API calls and kernel syntax that closely mirror CUDA's, allowing straightforward porting of applications without major rewrites. Central to HIP are its kernel definition, memory management, and execution mechanisms. Kernels are defined using attributes like __global__, similar to CUDA, and launched either with the familiar triple-chevron syntax kernel<<<blocks, threads>>>(args) or the explicit hipLaunchKernelGGL macro for greater portability and template support. Memory operations include hipMalloc for device memory allocation, hipMemcpy for host-device data transfers (supporting synchronous and asynchronous variants), and hipFree for deallocation, providing direct analogs to CUDA's memory API. Execution control is handled through hipLaunchKernelGGL(kernel, dim3 grid, dim3 block, size_t sharedMem, hipStream_t stream, args...), which specifies grid and block dimensions, shared memory size, and an optional stream for concurrency. HIP ensures portability by compiling code to either AMD's ROCm backend using the HIP-Clang compiler or NVIDIA's CUDA backend using NVCC, orchestrated by the hipcc driver utility that automatically sets include paths, libraries, and target-specific options. It supports asynchronous operations via streams, created with hipStreamCreate and synchronized using hipStreamSynchronize or hipStreamWaitEvent, allowing overlapping computation and data transfers for improved throughput. Events, managed through hipEventCreate, hipEventRecord, and hipEventSynchronize, provide fine-grained timing and synchronization points within streams. Advanced features include unified memory support via hipMallocManaged, which allocates memory accessible from both host and device without explicit copies, leveraging the Heterogeneous System Architecture (HSA) for unified addressing as detailed in the Foundations section. For multi-GPU environments, HIP enables device enumeration with hipGetDeviceCount to query available GPUs and hipSetDevice to select a target, facilitating workload distribution across multiple accelerators. The following code snippet illustrates a basic HIP kernel launch and memory management:
#include <hip/hip_runtime.h>

__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

int main() {
    int N = 1000;
    size_t size = N * sizeof(float);
    float *h_A, *h_B, *h_C;
    float *d_A, *d_B, *d_C;

    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    h_C = (float*)malloc(size);

    hipMalloc(&d_A, size);
    hipMalloc(&d_B, size);
    hipMalloc(&d_C, size);

    // Initialize host arrays (omitted for brevity)

    hipMemcpy(d_A, h_A, size, hipMemcpyHostToDevice);
    hipMemcpy(d_B, h_B, size, hipMemcpyHostToDevice);

    hipLaunchKernelGGL(vectorAdd, dim3((N + 255) / 256), dim3(256, 1, 1), 0, 0,
                       d_A, d_B, d_C, N);  // enough blocks to cover all N elements

    hipMemcpy(h_C, d_C, size, hipMemcpyDeviceToHost);

    hipFree(d_A);
    hipFree(d_B);
    hipFree(d_C);
    free(h_A);
    free(h_B);
    free(h_C);

    return 0;
}
This example demonstrates allocation, data transfer, kernel execution, and cleanup, highlighting HIP's CUDA-like workflow.
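
The event and unified memory APIs described above can be combined for simple timing of a managed-memory workload. The following hedged sketch reuses the vectorAdd kernel from the listing above (repeated so the sketch is self-contained):

#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

int main() {
    const int N = 1 << 20;
    float *A, *B, *C;
    // Managed allocations are visible to both host and device; no explicit copies.
    hipMallocManaged(&A, N * sizeof(float));
    hipMallocManaged(&B, N * sizeof(float));
    hipMallocManaged(&C, N * sizeof(float));
    for (int i = 0; i < N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    hipEventRecord(start, 0);  // timestamp on the default stream
    hipLaunchKernelGGL(vectorAdd, dim3((N + 255) / 256), dim3(256), 0, 0, A, B, C, N);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);  // wait until the stop event has been reached

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);
    std::printf("vectorAdd on %d elements: %.3f ms, C[0] = %.1f\n", N, ms, C[0]);

    hipEventDestroy(start);
    hipEventDestroy(stop);
    hipFree(A); hipFree(B); hipFree(C);
    return 0;
}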

OpenCL and OpenMP Support

ROCm provides support for OpenCL, enabling developers to write portable parallel computing kernels that can execute on AMD GPUs as well as other hardware platforms. The implementation is handled through the ROCm Compute Language Runtime (ROCclr), which serves as a virtual device interface within the broader AMD Compute Language Runtimes (CLR) framework, facilitating the execution of OpenCL programs on AMD hardware. ROCclr integrates with the OpenCL runtime to manage device interactions, memory allocation, and kernel dispatching, allowing the standard OpenCL C kernel language to define compute-intensive tasks such as vector operations or image processing. Kernels are compiled using Clang with support for OpenCL C versions up to 2.0, where the -cl-std=CL2.0 flag enables full conformance, though higher versions like 3.0 remain experimental and not fully roadmap-integrated as of ROCm 7.1. Execution occurs via core OpenCL APIs, including clEnqueueNDRangeKernel for launching multi-dimensional work-groups on the GPU, ensuring efficient parallel task distribution across compute units. This OpenCL support is particularly suited for legacy applications or vendor-agnostic codebases requiring cross-platform compatibility, though it may incur overhead when mixed with ROCm's HIP interface due to separate runtime layers. Unlike HIP, which offers AMD-specific optimizations, OpenCL prioritizes standardization but lacks some performance enhancements tailored to ROCm's architecture, such as direct integration with AMD's memory hierarchy. ROCm also incorporates OpenMP support for directive-based heterogeneous programming, allowing incremental offloading of CPU code to AMD GPUs without full rewrites. The implementation relies on an LLVM-based toolchain, including Clang, which fully adheres to the OpenMP 4.5 standard and partially supports features from OpenMP 5.0, 5.1, and 5.2, such as device constructs for data mapping and task dependencies. As of ROCm 7.1, OpenMP offloading support for Fortran applications has been added, including integration with AMD's Fortran compiler and runtime libraries. Key directives include #pragma omp target for marking regions to offload from host to device, enabling automatic code movement and execution on the GPU, along with associated clauses like map for data transfer and teams for controlling parallelism granularity. This offloading model leverages the ROCm runtime to handle synchronization and resource allocation, making it accessible for scientific computing workloads like simulations or linear algebra routines. While effective for straightforward offloads, OpenMP in ROCm remains experimental for more complex scenarios, such as dynamic task graphs involving irregular dependencies or nested parallelism, where full feature parity with CPU-only execution is not yet achieved due to ongoing LLVM developments. Interoperability with other ROCm components, like HIP, is possible but limited by directive overhead, positioning OpenMP as a bridge for standards-compliant portability rather than peak performance tuning.
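
A minimal OpenMP offload example in C++, compiled under the assumptions above with something like amdclang++ -fopenmp --offload-arch=gfx90a, might look as follows. This is a sketch, not a tuned implementation:

#include <cstdio>

int main() {
    const int N = 1 << 20;
    float *x = new float[N], *y = new float[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // map clauses move data to/from the device; teams + parallel for spreads
    // iterations across GPU workgroups and their threads.
    #pragma omp target teams distribute parallel for map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; ++i)
        y[i] += 2.0f * x[i];   // SAXPY-style update executed on the GPU

    std::printf("y[0] = %.1f\n", y[0]);  // expect 4.0
    delete[] x;
    delete[] y;
    return 0;
}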

Core Software Stack

Runtimes and Drivers

The ROCm software stack relies on low-level kernel drivers and runtimes to interface directly with AMD GPU hardware, enabling efficient execution of compute workloads. The primary kernel driver is ROCk, an amdgpu-based component that manages GPU initialization, interrupt handling, and power management for discrete AMD GPUs. ROCk integrates with the Linux kernel's AMDGPU module and Kernel Fusion Driver (KFD) to provide the foundational hardware abstraction necessary for heterogeneous computing. This driver ensures stable operation by handling device discovery, resource allocation at the kernel level, and coordination between CPU and GPU for tasks like memory mapping and event processing. At the runtime layer, ROCr serves as AMD's implementation of the Heterogeneous System Architecture (HSA) runtime, acting as a thin user-mode API that bridges applications to the underlying hardware. ROCr facilitates queue management through HSA's architected queuing model, allowing asynchronous dispatch of compute packets to GPU queues with low latency. It also handles signal-based synchronization, where HSA signals enable fine-grained coordination between host and device operations, such as waiting for kernel completion or barrier dependencies. Complementing ROCr is ROCt, the HSA thunk interface, which provides a lightweight user-space bridge to the ROCk kernel driver, managing ioctl communications for direct hardware access without heavy overhead. Core functionalities of these components include command queue submission via HSA's Architected Queuing Language (AQL) packets, which encapsulate kernel dispatches, barriers, and memory operations for execution on AMD GPUs. Memory allocation is exposed through HSA APIs like hsa_memory_allocate, supporting fine-grained and coarse-grained regions with immediate visibility for coherent data sharing across agents. Synchronization mechanisms, such as barrier packets (HSA_PACKET_TYPE_BARRIER_AND and HSA_PACKET_TYPE_BARRIER_OR) and fence scopes (HSA_FENCE_SCOPE_SYSTEM), ensure ordered execution and data consistency without busy-waiting on the host. These elements collectively support scalable, low-level control over GPU resources, forming the execution backbone for higher-level ROCm components. In 2025, ROCm 7.0 introduced significant enhancements to runtimes and drivers, particularly for scalability and reliability on advanced hardware. ROCr was updated to version 1.18.0, adding support for the Instinct MI350 series (based on CDNA 4) with optimized P2P memory copies utilizing all available SDMA engines for improved multi-GPU throughput. The AMDGPU driver (version 30.10) was modularized for independent updates, enhancing compatibility and error resilience through better reporting via hipGetLastError and new event notifications for migration and thermal events. These changes enable production-grade scalability for MI350 deployments, achieving up to 3.8x performance uplifts in key workloads compared to ROCm 6.0 while bolstering fault tolerance in large-scale systems. ROCm 7.1.0, released on October 30, 2025, further improved the runtime layer with enhancements to HIP runtime compatibility with NVIDIA CUDA, including new APIs for memory management (e.g., hipExtMallocAsync, hipExtMemPool*), cooperative groups, and nested tile partitioning. These updates enhance cross-platform portability and efficiency for heterogeneous workloads, building on the HSA foundation provided by ROCr.
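
Signal-based synchronization, central to the HSA model described above, is exposed directly by ROCr. The following hedged sketch (linking against the HSA runtime is assumed) creates a signal, decrements it from a worker thread standing in for a completed GPU dispatch, and waits on it from the main thread:

#include <hsa/hsa.h>
#include <thread>
#include <cstdint>
#include <cstdio>

int main() {
    hsa_init();

    // Create a signal with initial value 1 and no consumer restriction.
    hsa_signal_t sig;
    hsa_signal_create(1, 0, nullptr, &sig);

    // A host thread decrements the signal, mimicking a completion notification.
    std::thread worker([&] { hsa_signal_subtract_screlease(sig, 1); });

    // Block until the signal value reaches 0, much as the packet processor does
    // when resolving an AQL barrier packet's dependencies.
    hsa_signal_wait_scacquire(sig, HSA_SIGNAL_CONDITION_EQ, 0,
                              UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
    std::printf("signal reached zero\n");

    worker.join();
    hsa_signal_destroy(sig);
    hsa_shut_down();
    return 0;
}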

Compilers and Tools

ROCm's compilation infrastructure relies on LLVM-based tools optimized for heterogeneous computing on AMD GPUs. The primary compiler is ROCmCC, a Clang/LLVM-based frontend designed for high-performance computing across AMD GPUs and CPUs, supporting models like HIP, OpenMP, and OpenCL. It integrates with the AMDGPU backend in LLVM to generate intermediate representations such as HSAIL (Heterogeneous System Architecture Intermediate Language) for GPU kernels. ROCm-CompilerSupport provides the necessary extensions and libraries within the LLVM project, including the AMD Code Object Manager (comgr) for handling GPU code objects, ensuring seamless integration for ROCm applications. HIPCC serves as the compiler driver for HIP code, acting as a wrapper around Clang (specifically amdclang++) to automate the compilation process. It handles HIP source files by invoking the underlying pipeline to produce executable binaries, setting default include paths and linking against ROCm libraries. For offloading computations to AMD GPUs, developers use Clang with flags such as --offload-arch=<target-id> (e.g., --offload-arch=gfx908) to specify a GPU architecture like GFX9 or GFX11, or -mcpu=<target-id> to target specific processors, enabling single-source C++ code to run on both CPU and GPU. Key tools facilitate development and porting. HIPIFY automates the migration of CUDA applications to HIP by translating CUDA source code, replacing CUDA APIs with HIP equivalents, and adjusting kernel syntax—using either the Clang-based hipify-clang for comprehensive translation or the Perl-based hipify-perl for simpler substitutions. It supports common runtime calls, device qualifiers like __global__, and standard libraries but requires manual review for unsupported features or third-party dependencies. Similarly, GPUFORT is a source-to-source translator for Fortran codes, converting CUDA Fortran or OpenACC directives to Fortran+HIP or Fortran+OpenMP 4.5+, aiding legacy HPC applications in adopting ROCm without full rewrites. At the mid-level, ROCclr (now integrated into the AMD Compute Language Runtimes, or CLR) acts as a common runtime layer for dispatching HIP and OpenCL kernels, providing a unified abstraction for heterogeneous execution while abstracting hardware specifics. It includes implementations for HIP (hipamd) and OpenCL (opencl) subcomponents, built atop HIP-Clang for runtime APIs like streams and events. Debugging workflows leverage ROCgdb, the ROCm source-level debugger based on GDB, which supports heterogeneous debugging of applications across x86 hosts and AMD GPUs. It enables setting breakpoints in GPU kernels, single-stepping through device code, and inspecting memory or variables, though it currently focuses on source-line accuracy without full symbolic support for variables.
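
To illustrate the kind of translation HIPIFY performs, the snippet below shows HIP code with the original CUDA calls noted in adjacent comments. It is a simplified, hedged illustration of typical hipify-perl or hipify-clang output, not a verbatim transcript of the tools:

#include <hip/hip_runtime.h>   // was: #include <cuda_runtime.h>

__global__ void square(float *v, int n) {   // __global__ is unchanged by hipify
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= v[i];
}

int main() {
    const int n = 256;
    float *d = nullptr;
    hipMalloc(&d, n * sizeof(float));        // was: cudaMalloc(&d, ...)
    hipMemset(d, 0, n * sizeof(float));      // was: cudaMemset(d, 0, ...)
    square<<<dim3(1), dim3(n)>>>(d, n);      // triple-chevron launch is preserved
    hipDeviceSynchronize();                  // was: cudaDeviceSynchronize()
    hipFree(d);                              // was: cudaFree(d)
    return 0;
}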

Libraries

Basic Linear Algebra

rocBLAS serves as the primary Basic Linear Algebra Subprograms (BLAS) library within the ROCm ecosystem, providing implementations for levels 1, 2, and 3 operations optimized for AMD GPUs. It is implemented in C++ and leverages the ROCm runtime to execute vector, matrix-vector, and matrix-matrix computations on the GPU. hipBLAS, a companion library, offers compatibility by porting the cuBLAS API to HIP, enabling developers to adapt NVIDIA-focused code to ROCm with minimal changes while maintaining access to rocBLAS's underlying functionality. A core routine of rocBLAS is the general matrix multiply (GEMM) operation, defined as C = \alpha A B + \beta C, where A and B are input matrices, C is the output matrix, and \alpha and \beta are scalar parameters. This routine, along with other level-3 BLAS functions, incorporates optimizations tailored to AMD's matrix core instructions, such as the Matrix Fused Multiply-Add (MFMA) operations available on Instinct MI100 and MI200 series GPUs. These enhancements exploit hardware-specific capabilities like matrix cores for accelerated dense linear algebra, ensuring efficient handling of large-scale computations in AI and HPC workloads. Key features of rocBLAS include support for half-precision (FP16), which reduces memory usage and boosts throughput for compatible operations, and batched variants of routines like GEMM for processing multiple independent problems simultaneously. Integration with the HIP runtime allows seamless kernel fusion through libraries like hipBLASLt, where multiple operations can be combined into a single GPU kernel launch to minimize data transfers and improve overall efficiency. The library is particularly tuned for Instinct accelerators, delivering high-performance implementations that scale with GPU architecture advancements in ROCm 7.0 and later releases, including ROCm 7.1.0 (October 2025), which adds support for the gfx1150 and gfx1151 architectures. In practice, developers invoke rocBLAS functions via a host-side API initialized with a rocblas_handle. For example, the single-precision GEMM can be performed using rocblas_sgemm, which computes C = \alpha A B + \beta C on the GPU by passing matrix dimensions, pointers to device memory, and scalars to the function. Asynchronous execution is supported through HIP streams, allowing overlapping computation with data movement for further performance gains.
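
A hedged sketch of the host-side workflow just described, multiplying two square single-precision matrices (column-major, as in standard BLAS) and omitting error checking for brevity:

#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <vector>

int main() {
    const int N = 512;                      // square matrices for simplicity
    const float alpha = 1.0f, beta = 0.0f;
    std::vector<float> hA(N * N, 1.0f), hB(N * N, 2.0f), hC(N * N, 0.0f);

    float *dA, *dB, *dC;
    hipMalloc(&dA, N * N * sizeof(float));
    hipMalloc(&dB, N * N * sizeof(float));
    hipMalloc(&dC, N * N * sizeof(float));
    hipMemcpy(dA, hA.data(), N * N * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dB, hB.data(), N * N * sizeof(float), hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    // C = alpha * A * B + beta * C (single precision, no transposition)
    rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none,
                  N, N, N, &alpha, dA, N, dB, N, &beta, dC, N);

    hipMemcpy(hC.data(), dC, N * N * sizeof(float), hipMemcpyDeviceToHost);

    rocblas_destroy_handle(handle);
    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}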

Advanced Solvers and FFT

The ROCm platform provides advanced linear algebra solvers through rocSOLVER and its HIP-portable counterpart hipSOLVER, which implement a subset of LAPACK routines optimized for AMD GPUs. rocSOLVER supports key decompositions such as LU factorization via rocsolver_getrf and QR factorization via rocsolver_geqrf, enabling efficient solution of linear systems and least-squares problems in scientific computing workflows. Additionally, it includes eigenvalue solvers like rocsolver_syev for symmetric matrices and rocsolver_heev for Hermitian matrices, as well as singular value decomposition (SVD) through rocsolver_gesvd, which computes the decomposition A = U \Sigma V^H for general matrices A. hipSOLVER acts as a marshalling layer, supporting rocSOLVER as a backend alongside NVIDIA's cuSOLVER, and exposes an API closely aligned with cuSOLVER's dense linear algebra interface, such as hipsolverDnCreate for handle management and hipsolverDnGesvd for SVD, ensuring portability across GPU vendors without code changes. For frequency-domain computations, rocFFT and hipFFT deliver high-performance discrete Fourier transforms (DFTs) tailored to GPU architectures. rocFFT supports 1D, 2D, and 3D FFT plans created via rocfft_plan_create, accommodating real-to-complex, complex-to-real, and complex-to-complex transforms across data types like single- and double-precision floating-point. Batched operations are handled efficiently by specifying the number_of_transforms parameter in plan creation, allowing simultaneous execution of multiple independent FFTs to exploit GPU parallelism for large-scale tasks. hipFFT provides a cuFFT-compatible API, including functions like hipfftExecC2C for executing complex-to-complex transforms on plans, which maps seamlessly to rocFFT on AMD hardware while supporting cuFFT backends on NVIDIA GPUs. These libraries incorporate optimizations to enhance throughput and resource utilization, particularly for compute-intensive applications. In rocSOLVER, internal implementations bypass rocBLAS calls for small- and medium-sized matrices when optimizations are enabled, reducing overhead and improving performance for decompositions and solvers. rocFFT leverages batched execution and user-managed work buffers to minimize memory transfers, enabling memory-efficient processing of large datasets by auto-allocating temporary storage only when needed during rocfft_execute. Building on basic linear algebra operations from rocBLAS, these solvers and FFT routines facilitate advanced numerical methods in high-performance computing (HPC). ROCm 7.0 (September 2025) introduced significant enhancements, including hybrid CPU-GPU execution modes in rocSOLVER using Cuppen's divide-and-conquer algorithm for better scalability, performance gains in routines like rocsolver_bdsqr for bidiagonal SVD, rocsolver_syev/rocsolver_heev for eigenvalues, and rocsolver_geqr2/rocsolver_geqrf for QR factorization, as well as reduced memory footprint for eigensolvers such as rocsolver_stedc and generalized variants. hipSOLVER improved compatibility for workflows across both backends. For FFT, rocFFT gained new single-precision kernels and optimized execution plans for large 1D transforms, boosting throughput in simulation-heavy workloads. These updates collectively enhanced efficiency on Instinct MI350 GPUs.
ROCm 7.1.0 (October 2025) further optimized rocSOLVER performance for the LARF, LARFT, GEQR2, GEQRF, and STEDC routines and eigensolvers, and improved rocFFT with single-kernel plans for certain sizes and better performance for specific FFTs and MPI decompositions, supporting larger-scale HPC applications with improved precision and reduced resource demands.
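
The plan-based rocFFT workflow reads concretely as follows. This is a minimal hedged sketch of an in-place, single-precision 1D complex-to-complex forward transform; passing a null execution info lets the library auto-allocate any work buffer it needs, per the note above:

#include <hip/hip_runtime.h>
#include <rocfft/rocfft.h>
#include <vector>

int main() {
    rocfft_setup();                              // initialize the rocFFT library

    const size_t length = 1024;                  // 1D transform size
    std::vector<float2> host(length, make_float2(1.0f, 0.0f));

    float2 *dev;
    hipMalloc(&dev, length * sizeof(float2));
    hipMemcpy(dev, host.data(), length * sizeof(float2), hipMemcpyHostToDevice);

    // In-place, single-precision, complex-to-complex forward plan, one batch.
    rocfft_plan plan;
    rocfft_plan_create(&plan, rocfft_placement_inplace,
                       rocfft_transform_type_complex_forward,
                       rocfft_precision_single,
                       1 /*dimensions*/, &length,
                       1 /*number_of_transforms*/, nullptr);

    void *buffers[] = {dev};
    rocfft_execute(plan, buffers, nullptr, nullptr);  // run the FFT
    hipDeviceSynchronize();

    hipMemcpy(host.data(), dev, length * sizeof(float2), hipMemcpyDeviceToHost);
    // host[0] now holds the DC component: (length, 0) for this constant input.

    rocfft_plan_destroy(plan);
    hipFree(dev);
    rocfft_cleanup();
    return 0;
}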

Machine Learning Libraries

ROCm provides a suite of specialized libraries optimized for machine learning workloads on AMD GPUs, focusing on deep learning primitives, tensor operations, and sparse computations essential for neural network models. These libraries leverage the HIP programming model to ensure portability and compatibility with CUDA-based code, allowing developers to adapt existing machine learning applications with minimal changes. Central to ROCm's capabilities is MIOpen, AMD's open-source deep learning primitives library. MIOpen delivers high-performance implementations of key operations for convolutional neural networks (CNNs), including convolutions, activations, and pooling layers, with optimizations such as kernel fusion to reduce memory usage and GPU kernel launch overheads. It supports advanced data types like bfloat16 for efficient training of large models, making it a foundational component for accelerating workloads on AMD Instinct and Radeon GPUs. Complementing MIOpen, hipTensor is a high-performance HIP C++ library designed for tensor primitives, particularly tensor contractions critical for transformer-based architectures and other deep learning models. It exploits specialized matrix cores in modern AMD GPUs, such as those in the CDNA architecture, to achieve efficient computation of multi-dimensional tensor operations, enabling scalable performance in machine learning pipelines. For sparse matrix operations prevalent in machine learning, such as those in recommendation systems and sparse neural networks, rocSPARSE provides optimized routines for sparse linear algebra subprograms using the HIP language. This library handles sparse matrix-vector multiplications and other sparse formats, supporting efficient processing of data-sparse models on ROCm-enabled hardware. ROCm integrates with ONNX Runtime through a dedicated execution provider, enabling accelerated inference and training of ONNX models on AMD GPUs. This support facilitates deployment of diverse models, including transformers, with optimizations for low-precision formats like INT8 and INT4 to enhance efficiency. In ROCm 7.0 (2025), enhancements included support for retrieval-augmented generation (RAG) pipelines, demonstrated through tutorials integrating tools like LlamaIndex and Ollama for building AI applications on AMD GPUs. Additionally, optimized kernels for large language models delivered up to 3x speedup in inference performance compared to ROCm 6.0, as shown in benchmarks on MI300X platforms, boosting productivity for large-scale AI development. ROCm 7.1.0 (October 2025) added further improvements, such as MIOpen's trust verify find mode and a HIP kernel for backward layer normalization, along with bfloat16/half-float mixed-precision support in rocSPARSE for multiple routines.
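
As a flavor of MIOpen's descriptor-based API, the hedged sketch below sets up tensor and convolution descriptors and queries the output shape of a forward convolution. Actually running the convolution additionally requires workspace allocation and an algorithm search, omitted here for brevity:

#include <miopen/miopen.h>
#include <cstdio>

int main() {
    miopenHandle_t handle;
    miopenCreate(&handle);

    // Input: NCHW float tensor, batch 1, 3 channels, 224x224.
    miopenTensorDescriptor_t in, filt;
    miopenCreateTensorDescriptor(&in);
    miopenSet4dTensorDescriptor(in, miopenFloat, 1, 3, 224, 224);

    // Filter: 64 output channels, 3 input channels, 3x3 kernel.
    miopenCreateTensorDescriptor(&filt);
    miopenSet4dTensorDescriptor(filt, miopenFloat, 64, 3, 3, 3);

    // 3x3 convolution with padding 1, stride 1, dilation 1.
    miopenConvolutionDescriptor_t conv;
    miopenCreateConvolutionDescriptor(&conv);
    miopenInitConvolutionDescriptor(conv, miopenConvolution, 1, 1, 1, 1, 1, 1);

    int n, c, h, w;
    miopenGetConvolutionForwardOutputDim(conv, in, filt, &n, &c, &h, &w);
    std::printf("output tensor: %d x %d x %d x %d\n", n, c, h, w);  // 1 x 64 x 224 x 224

    miopenDestroyConvolutionDescriptor(conv);
    miopenDestroyTensorDescriptor(filt);
    miopenDestroyTensorDescriptor(in);
    miopenDestroy(handle);
    return 0;
}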

Ecosystem

Third-Party Integrations

ROCm integrates seamlessly with major deep learning frameworks, enabling GPU acceleration on AMD hardware. PyTorch offers native support through ROCm-specific wheels, allowing developers to run workloads directly on Instinct accelerators and Radeon GPUs without code modifications. TensorFlow utilizes an AMD-maintained plugin for ROCm compatibility, facilitating the execution of training and inference tasks. Similarly, JAX provides built-in ROCm backend support, optimizing compilation and autodifferentiation for workloads in scientific simulations and research. In ROCm 7.0, these integrations achieve comparable performance to CUDA in many workloads, particularly in memory-bound inference scenarios with large language models, demonstrating near parity through optimized libraries like MIOpen and hipRTC. In HPC, ROCm enables GPU acceleration for several key scientific applications. OpenFOAM, a popular open-source toolbox for computational fluid dynamics (CFD), leverages ROCm via OpenMP target offloading and HIP ports to accelerate simulations such as heat transfer and fluid flow on GPUs, achieving significant speedups in solver performance. GROMACS, used for molecular dynamics simulations in biochemistry, supports ROCm through its HIP backend, allowing efficient GPU offloading of its compute kernels on platforms like the exascale Frontier supercomputer. ABINIT, an electronic structure package for density functional theory calculations, incorporates ROCm-compatible GPU acceleration via OpenMP offload directives, enabling faster ground-state calculations and related computations on AMD hardware. ROCm facilitates interoperability with graphics and provides language bindings for broader adoption. Through Vulkan interoperability extensions, ROCm supports resource sharing between compute kernels and Vulkan graphics pipelines, enabling hybrid applications in rendering and visualization by mapping buffers and textures across APIs. For Python developers, hip-python offers low-level bindings to the HIP runtime and ROCm libraries like rocBLAS and RCCL, simplifying GPU programming in Python applications and scripts. Fortran users benefit from hipfort, which exposes the HIP runtime and accelerated math libraries, allowing legacy HPC codes to offload computations to GPUs without extensive rewrites. In 2025, ROCm expanded its ecosystem with enhanced support for retrieval-augmented generation (RAG) in AI applications, providing tools and workflows to build end-to-end pipelines on AMD GPUs for improved generative accuracy using external knowledge bases. Additionally, AMD announced an expanded partnership with Oracle to integrate Instinct GPUs and ROCm into its cloud infrastructure, enabling large-scale AI and HPC workloads through Oracle Cloud Infrastructure superclusters powered by up to 50,000 AMD Instinct MI450 Series GPUs, planned for availability starting in Q3 2026.

Distribution and Installation

ROCm is distributed primarily through official AMD repositories, providing binary packages for supported Linux distributions such as Ubuntu and Red Hat Enterprise Linux (RHEL). For Ubuntu 22.04 (Jammy) and 24.04 (Noble), users add the AMD repository by downloading the GPG key and creating a sources list file, followed by updating the package index with apt update. Installation then proceeds via apt install rocm, which pulls in the core runtime, or specialized metapackages like rocm-dev for the full development stack including compilers, libraries, and tools. On RHEL 8.10 and 9.4, a similar process uses dnf after enabling the repository, installing packages like rocm for runtime components. Binary packages are available for ROCm 7.0 and later versions, ensuring compatibility with AMD Instinct accelerators and Radeon GPUs meeting system requirements. Docker containers offer a containerized alternative for isolated environments, with official ROCm images hosted on Docker Hub under the rocm namespace, such as rocm/pytorch for PyTorch workflows. These images include pre-built ROCm stacks and can be run with GPU access by mounting the host's device files using options like --device /dev/kfd --device /dev/dri. For custom builds, source compilation is supported via TheRock, AMD's open-source build system introduced with the ROCm 7.9 preview, which uses CMake to assemble the ROCm core SDK from source repositories, bundling dependencies for platforms like Ubuntu 24.04. Third-party distributions extend accessibility for specific use cases. Conda-forge provides ROCm packages tailored for Python and scientific computing environments, such as rocm-device-libs and rocm-smi, installable via conda install -c conda-forge rocm-device-libs, allowing integration without full system package management. Spack, a popular package manager in high-performance computing (HPC) clusters, supports ROCm installation and source builds through its ROCm-specific recipes, enabling variant configurations for multi-version deployments across supercomputers. Cloud providers offer pre-configured images; for instance, Microsoft Azure provides AMD GPU instances with ROCm-enabled virtual machines for AI and HPC workloads, while AWS supports ROCm on AMD-powered EC2 instances via standard installation methods. The installation process typically involves adding the repository, installing the base rocm package, and verifying functionality with the rocminfo utility, which queries GPU details and ROCm version. Common troubleshooting includes resolving driver conflicts by ensuring the latest kernel is installed and blacklisting conflicting modules like Nouveau, as well as checking compatibility matrices for user-space and kernel versions. Users should reboot after installation and add their account to the render and video groups for proper GPU access.

Learning and Community Resources

The official documentation for ROCm is hosted at rocm.docs.amd.com, providing comprehensive guides for installation, programming, and optimization on AMD GPUs. This resource includes the HIP programming guide, which details the C++ runtime API and kernel language for creating portable applications across AMD and NVIDIA hardware, emphasizing heterogeneous environments. Additionally, the ROCm AI Developer Hub offers tutorials in Jupyter Notebook format, covering inference, fine-tuning, pretraining, and GPU development, such as deploying models with vLLM and fine-tuning with Transformers. These materials support hands-on learning for HIP basics through example repositories and porting workflows from CUDA using tools like HIPIFY. ROCm's GitHub organization, under ROCm/ROCm, maintains over 350 open-source repositories as of 2025, serving as a central hub for developers to explore code examples and contribute to the ecosystem. Key learning resources include the rocm-examples repository, which provides introductory and advanced samples for HIP programming, and the HIP-Examples depot for kernel-level demonstrations. Contributions occur via pull requests and issue discussions on these repositories, fostering collaborative improvements to ROCm components like libraries and tools. For 2025 updates, official ROCm blogs highlight optimizations for the MI350 series GPUs, including enhanced performance in distributed inference and enterprise workloads. Community support for ROCm is facilitated through the AMD Developer Hub, which includes forums, webinars, and best practices for troubleshooting and sharing experiences. Developers can engage in discussions on GitHub and participate in AMD-hosted events like the Advancing AI conference series, where ROCm advancements are showcased annually. Recent guides address emerging needs, such as building Retrieval-Augmented Generation (RAG) pipelines for enterprise use with vLLM and related open-source tools on ROCm, enabling scalable, fact-grounded applications. These resources bridge installation with practical application, supporting users in AI and HPC development.

Comparisons

With NVIDIA CUDA

ROCm and NVIDIA CUDA share several architectural similarities that facilitate developer transition and code portability. The Heterogeneous-compute Interface for Portability (HIP) in ROCm is designed to closely mirror CUDA's syntax and API, allowing developers to port CUDA applications to ROCm with minimal changes, often through automated tools like hipify. Both platforms support Single Instruction, Multiple Threads (SIMT) execution models for parallel processing on GPUs and stream-based asynchronous operations for overlapping computation and data transfer, enabling efficient workload management. This HIP-CUDA alignment promotes dual-vendor portability, where a single codebase can target both AMD and NVIDIA hardware without extensive rewrites. Key differences lie in their foundational approaches and openness. ROCm is an open-source platform built on the Heterogeneous System Architecture (HSA), which provides a unified memory model that allows seamless sharing of memory between CPU and GPU without explicit transfers in many scenarios, simplifying programming for heterogeneous systems. In contrast, CUDA is a proprietary ecosystem requiring more explicit management, such as manual allocations and copies via cudaMalloc and cudaMemcpy, though it supports optional unified memory since CUDA 6.0. CUDA's closed nature limits customization, while ROCm's open-source model fosters community contributions and integration with Linux distributions. Regarding ecosystem scale, CUDA benefits from a larger, more mature library of third-party tools and frameworks optimized over nearly two decades, whereas ROCm's ecosystem, while smaller, is rapidly expanding in AI and high-performance computing (HPC) domains through partnerships and framework support. In terms of performance, ROCm 7.1 achieves competitive results relative to CUDA on equivalent hardware, particularly for AI workloads. In the MLPerf Inference v5.1 benchmarks from September 2025, Instinct MI325X GPUs with ROCm demonstrated near parity or outperformance against NVIDIA H200 systems with CUDA; for instance, Mixtral-8x7B offline throughput improved 23% over prior submissions and exceeded H200 averages, while Llama2-70B and SD-XL scenarios showed results competitive with H200 in offline, server, and interactive modes. Overall, ROCm delivers 80-95% of CUDA's performance in optimized ML tasks on equivalent hardware, though it may require additional tuning and lags in some mature tools due to CUDA's longer development history. Adoption patterns highlight CUDA's dominance in academic research and commercial AI, driven by its extensive tooling and NVIDIA's market leadership, with over 4 million developers using it as of 2025. ROCm is gaining traction in open-source HPC environments, powering systems like the exascale Frontier supercomputer at Oak Ridge National Laboratory, which leverages ROCm for its MI250X GPUs to achieve world-leading performance in scientific simulations. This growth positions ROCm as a viable alternative for cost-sensitive, open ecosystems, especially as AMD invests in AI optimizations.

With Intel oneAPI

ROCm and Intel's oneAPI share several foundational similarities as open-source platforms designed for heterogeneous computing. Both emphasize portability across accelerators, leveraging standards such as SYCL for single-source C++ programming models that enable code to target diverse hardware without vendor-specific rewrites. They also support offload directives for GPU acceleration, allowing developers to use familiar parallel programming constructs for compute-intensive tasks. Additionally, both incorporate standards-based interoperability, facilitating legacy code migration and cross-platform execution through intermediate representations like SPIR-V. Key differences arise in their scope and programming paradigms. ROCm is tailored specifically for AMD GPUs, utilizing the Heterogeneous-compute Interface for Portability (HIP) as its core language, which mirrors CUDA syntax for easier porting from NVIDIA ecosystems while optimizing for AMD's architecture. In contrast, oneAPI targets a multi-vendor landscape encompassing CPUs, GPUs, and FPGAs from Intel, NVIDIA, AMD, and others, primarily through Data Parallel C++ (DPC++), an extension of C++ and SYCL that promotes unified codebases across architectures. This broader ambition is advanced by the Unified Acceleration (UXL) Foundation, an open consortium evolving oneAPI standards to foster industry-wide adoption. Performance characteristics reflect these hardware focuses. On AMD Instinct accelerators, ROCm delivers significant uplifts for AI workloads, such as up to 3.5 times faster inference compared to prior versions in ROCm 7.0, leveraging deep hardware-specific optimizations for training and inference. Conversely, oneAPI achieves superior efficiency on Intel GPUs, with tailored libraries like oneDNN providing up to 2x throughput gains in deep learning operations due to integrated compilation and vector extensions. Interoperability via SPIR-V enables hybrid deployments, allowing SYCL/DPC++ code to execute on AMD hardware through ROCm's runtime. In terms of ecosystem, oneAPI offers expansive hardware coverage and tooling, including comprehensive libraries for AI, HPC, and data analytics that span Intel's full portfolio, making it ideal for diverse deployments. ROCm, however, provides deeper, AMD-centric optimizations, such as specialized kernels for the Instinct series in MIOpen. Both platforms integrate with PyTorch, ROCm via native backends for AMD GPUs and oneAPI through the Intel Extension for PyTorch (IPEX) using SYCL, but differ in development tools, with ROCm emphasizing ROCprof for profiling and oneAPI focusing on the DPC++ compiler suite for cross-vendor debugging.

References

  1. [1]
    What is ROCm? - AMD ROCm documentation
    ROCm is a software stack, composed primarily of open-source software, that provides the tools for programming AMD Graphics Processing Units (GPUs).
  2. [2]
    Use ROCm on Radeon and Ryzen
    Unlock Local AI Development on Your AMD Hardware. Transform your AMD-powered system into a powerful and private machine learning workstation.Use Rocm On Radeon And Ryzen · Expanded Platform Support · Rocmtm Key Capabilities
  3. [3]
    AMD ROCm™ Software
    AMD ROCm is an open software stack including drivers, development tools, and APIs that enable GPU programming from low-level kernel to end-user applications.What's New in ROCm 7 · Discover ROCm for AI · AMD Infinity Hub
  4. [4]
    ROCm release history
    ROCm release history# ; 7.0.1. September 17, 2025 ; 7.0.0. September 16, 2025 ; 6.4.3. August 7, 2025 ; 6.4.2. July 21, 2025.
  5. [5]
    ROCm 7.1.0 release notes
    The ROCm Data Center tool (RDC) hardware monitoring capabilities have been expanded by integrating the new AMDSMI API. This enhancement enables more ...
  6. [6]
    AMD ROCm 7.0: Built for Developers, Advancing Open Innovation
    Sep 16, 2025 · ROCm 7.0 empowers both developers and enterprises to move faster, scale smarter, and deploy AI with confidence.
  7. [7]
    AMD ROCm™ Software - GitHub Home
    With ROCm, you can customize your GPU software to meet your specific needs. You can develop, collaborate, test, and deploy your applications in a free, open ...Popular repositories - Loading · Issues 175 · ROCm/TheRock · Rocm-smi
  8. [8]
    Programming guide - AMD ROCm documentation
    ROCm provides a robust environment for heterogeneous programs running on CPUs and AMD GPUs. ROCm supports various programming languages and frameworks.
  9. [9]
    What is ROCm? - AMD ROCm documentation
    ROCm is an open-source stack, composed primarily of open-source software, designed for graphics processing unit (GPU) computation.
  10. [10]
    HIP porting guide - AMD ROCm documentation
    Library Equivalents​​ ROCm provides libraries to ease porting of code relying on CUDA libraries. Most CUDA libraries have a corresponding HIP library. There are ...Hip Porting Guide · Porting A Cuda Project · Identifying Device...Missing: alternative | Show results with:alternative<|separator|>
  11. [11]
    Use ROCm for AI
    ROCm is an open-source software platform that enables high-performance computing and machine learning applications. It features the ability to accelerate ...Installing ROCm and deep... · Tutorials for AI developers · Training · Inference<|control11|><|separator|>
  12. [12]
    ROCm license - AMD ROCm documentation
    ROCm is released by Advanced Micro Devices, Inc. (AMD) and is licensed per component separately. The following table is a list of ROCm components with links to ...<|control11|><|separator|>
  13. [13]
    Compatibility matrix - AMD ROCm documentation
    Use this matrix to view the ROCm compatibility and system requirements across successive major and minor releases. You can also refer to the past versions ...
  14. [14]
    PyTorch compatibility - AMD ROCm documentation
    PyTorch on ROCm provides mixed-precision and large-scale training using MIOpen and RCCL libraries. PyTorch provides two high-level features: Tensor computation ...
  15. [15]
    TensorFlow compatibility - AMD ROCm documentation
    The official TensorFlow repository includes full ROCm support. AMD maintains a TensorFlow ROCm repository in order to quickly add bug fixes, updates, and ...
  16. [16]
    AMD Releases New Version of ROCm, the Most Versatile Open ...
    Nov 14, 2016 · Upcoming releases of ROCm are expected to support AMD "Zen"-based x86 CPUs, ARM AArch64 CPU architecture starting with Cavium ThunderX ...
  17. [17]
    Everything You Need to Know About Why AMD Open Sourced the ...
    Oct 18, 2017 · Last May, AMD open sourced the OpenCL driver stack for ROCm. With this they kept their promise to open source (almost) everything.
  18. [18]
    Radeon ROCm 5.0 Released With Some RDNA2 GPU Support
    Feb 10, 2022 · Overnight AMD quietly released ROCm 5.0 for improving the Radeon Open eCosystem. Most exciting with ROCm 5.0 is having some level of Navi 2x / ...
  19. [19]
    AMD ROCm 6.0 Now Available To Download With MI300 ... - Phoronix
    Dec 15, 2023 · AMD ROCm 6.0 Now Available To Download With MI300 Support, PyTorch FP8 & More AI. Written by Michael Larabel in Radeon on 15 December 2023 at ...
  20. [20]
    AMD ROCm 7 Announced: MI350 Support, New Algorithms, Models ...
    Jun 12, 2025 · Breaking down the performance uplifts, we can see up to a 3.2x increase in Llama 3.1 70B, a 3.4x increase in Qwen2-72B, and up to 3.8x in Deep ...
  21. [21]
    AMD ROCm 7.0 Officially Released With Many Significant ... - Phoronix
    Sep 16, 2025 · ROCm 7.0.0 is officially out and all of the ROCm 7.0 documentation has also been published along with the binaries being available via the AMD ...
  22. [22]
    ROCm 7.0: An AI-Ready Powerhouse for Performance, Efficiency ...
    Sep 16, 2025 · ROCm 7.0 raises the bar for end-to-end AI enablement. With breakthrough training and inference performance on the AMD Instinct™ MI350 series ...
  23. [23]
    What is Heterogeneous System Architecture (HSA)?
    Aug 31, 2012 · HSA is all about delivering new, improved user experiences through advances in computing architectures that deliver improvements across all four key vectors.
  24. [24]
    HSA Announces Publication of New Guide to Heterogeneous ...
    Dec 17, 2015 · “Heterogeneous computing is a key enabler of the next generation of compute environments, wherein entire systems will interconnect autonomously ...
  25. [25]
    ROCR 1.18.0 Documentation - AMD ROCm documentation
    The ROCm runtime (ROCR) is AMD's implementation of HSA runtime, which is a thin, user-mode API that exposes the necessary interfaces to access and interact ...
  26. [26]
    Unified memory — HIP 6.2.41133 Documentation
    It is particularly useful in heterogeneous computing environments with heavy memory usage with both a CPU and a GPU, which would require large memory transfers.
  27. [27]
    Introduction to the HIP programming model
    In heterogeneous programming, the CPU is available for processing operations but the host application has the additional task of managing data and ...
  28. [28]
    HIP programming model - AMD ROCm documentation
    The SIMT programming model behind the HIP device-side execution is a middle-ground between SMT (Simultaneous Multi-Threading) programming known from multicore ...
  29. [29]
    Multi-device management — HIP 7.1.0 Documentation
    Streams enable asynchronous task execution, allowing multiple devices to process data concurrently without blocking one another. Events provide a mechanism for ...
  30. [30]
    Asynchronous concurrent execution — HIP 7.1.0 Documentation
    Asynchronous concurrent execution is important for efficient parallelism and resource utilization, with techniques such as overlapping computation and data ...
  31. [31]
    ROCm Revisited: Evolution of the High-Performance GPU ...
    Jun 9, 2025 · In this blog post, we aim to highlight AMD's ROCm ecosystem and the evolution of the software stack throughout the years.
  32. [32]
    Application portability with HIP - ROCm™ Blogs - AMD
    Apr 26, 2024 · HIP enables platform-independent GPU programs, allowing CUDA code to run on both AMD and NVIDIA GPUs with a portable build system.
  33. [33]
    Performance guidelines — HIP 6.2.41134 Documentation
    This chapter describes a set of best practices designed to help developers optimize the performance of HIP-capable GPU architectures.
  34. [34]
  35. [35]
    AMD Instinct | Solution - GIGABYTE Global
    The AMD Instinct™ MI350 Series GPUs, launched in June 2025, represent a ... Expanded Hardware & Platform Support: ROCm 7 is fully compatible with AMD Instinct ...
  36. [36]
    AMD Instinct GPU Validated System | MiTAC Computing Technology
    With features like double-precision (FP64) performance and high inter-GPU bandwidth through AMD Infinity Fabric™, Instinct accelerators empower researchers ...
  37. [37]
    AMD Instinct™ MI300 Series microarchitecture
    The GPUs use seven high-bandwidth, low-latency AMD Infinity Fabric™ links to form a fully connected 8-GPU system.
  38. [38]
    AMD Unveils ROCm 7: AI Inference Acceleration Up to 3.8x and Full ...
    Jun 12, 2025 · The main performance gain was recorded in inference tasks: up to 3.5× faster than ROCm 6, with a maximum 3.8× in DeepSeek R1, 3.2× in Llama 3.1 ...
  39. [39]
    System requirements (Linux) - AMD ROCm documentation
    The following table shows the supported AMD Instinct™ GPUs, and Radeon™ PRO and Radeon GPUs. ... AMD Instinct MI200 Series GPUs only support Ubuntu 24.04.
  40. [40]
    AMD Releases ROCm 6.4.1 With RDNA4 GPU Support - Phoronix
    May 21, 2025 · AMD ROCm 6.4.1 is now officially released. With ROCm 6.4.1 there is formal support for RDNA4 GPUs, including the Radeon RX 9000 series consumer graphics cards.
  41. [41]
    Radeon Limitations and recommended settings
    AMD has identified common errors when running ROCm™ on Radeon™ multi-GPU configurations at this time, along with the applicable recommendations. See mGPU ...
  42. [42]
    AMD Seeking Feedback Around What Radeon GPUs You Would ...
    Jan 22, 2025 · Even RDNA2 can do AI upscaling reasonably well. RDNA3 should be more than capable of running something better. These are their current high end ...
  43. [43]
    Getting Started Guide: Using AMD ROCm™ Software on Radeon ...
    Support for Hugging Face models and tools on Radeon GPUs using ROCm, allowing users to unlock the full potential of LLMs on their desktop systems. Radeon ...
  44. [44]
    AMD unveils ROCm 7 — new platform boosts AI performance up to ...
    Jun 13, 2025 · The biggest change brought by ROCm 7 for client PCs is the extension of ROCm to Windows and Radeon GPUs, which allows the use of discrete and ...
  45. [45]
    ROCm 7.0.0 release notes
    Sep 16, 2025 · Virtualization support: ROCm 7.0.0 introduces support for KVM Passthrough for AMD Instinct MI350X and MI355X GPUs. All KVM-based SR-IOV ...
  46. [46]
    System requirements (Linux) - AMD ROCm documentation
    Sep 17, 2025 · ROCm requires CPUs that support PCIe™ atomics. Modern CPUs after the release of 1st generation AMD Zen CPU and Intel™ Haswell support PCIe ...
  47. [47]
    WSL support matrices by ROCm version
    This section provides information on the compatibility of ROCm™ components, Radeon™ GPUs, and the Radeon Software for Windows Subsystem for Linux® (WSL). To ...
  48. [48]
    AMD ROCm 7.0 Software: Supercharging AI and HPC Infrastructure ...
    Oct 1, 2025 · The MI350 GPU leverages advanced FP4 and FP6 datatype support, offering outstanding compute density and memory efficiency for transformer ...
  49. [49]
    Prerequisites to use ROCm on Radeon desktop GPUs for machine ...
    ROCm is an extension of HSA platform architecture, and shares queuing model, memory model, signaling and synchronization protocols. Platform atomics are ...
  50. [50]
    How ROCm uses PCIe atomics — AMD GPU Driver (amdgpu) 30.20.0
    PCIe for atomic operations: ROCm requires CPUs that support PCIe atomics. Similarly, all connected I/O devices should also support PCIe atomics for optimum ...
  51. [51]
    ROCm installation for Linux
    Oct 9, 2025 · While the package manager is the recommended method, you can still install ROCm using the AMDGPU installer by following the legacy process.
  52. [52]
    What is HIP? - AMD ROCm documentation
    HIP is a thin API with little or no performance impact over coding directly in NVIDIA CUDA or AMD ROCm. HIP enables coding in a single-source C++ programming ...
  53. [53]
    [PDF] HIP Documentation
    Sep 13, 2024 · The Heterogeneous-computing Interface for Portability (HIP) API is a C++ runtime API and kernel language that lets developers create ...
  54. [54]
    Kernel Language Syntax — HIP Documentation
    The hipLaunchKernelGGL macro always starts with the five parameters specified above, followed by the kernel arguments. HIPIFY tools optionally convert Cuda ...
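    As a hedged illustration (a sketch not drawn from the cited page; the kernel name scale_kernel and its arguments are hypothetical), a minimal HIP launch using this macro looks like the following:
    ```cpp
    // Minimal sketch (assumed example): the five fixed leading arguments of
    // hipLaunchKernelGGL are kernel symbol, grid dim, block dim, dynamic
    // shared-memory bytes, and stream; the kernel's own arguments follow.
    #include <hip/hip_runtime.h>

    __global__ void scale_kernel(float* data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1024;
        float* d = nullptr;
        hipMalloc(&d, n * sizeof(float));
        hipLaunchKernelGGL(scale_kernel, dim3(n / 256), dim3(256), 0, 0,
                           d, 2.0f, n);  // grid, block, shared mem, stream, args
        hipDeviceSynchronize();
        hipFree(d);
        return 0;
    }
    ```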
  55. [55]
    HIP compilers — HIP 6.2.41134 Documentation
    ROCm provides the compiler driver hipcc , that can be used on AMD ROCm and NVIDIA CUDA platforms. On ROCm, hipcc takes care of the following: Setting the ...
  56. [56]
    Event management — HIP 7.1.0 Documentation
    Record an event in the specified stream. hipEventQuery() or hipEventSynchronize() must be used to determine when the event transitions from “recording”
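    As a hedged illustration (not taken from the cited page), the event APIs named in this entry are typically combined to time asynchronous GPU work:
    ```cpp
    // Minimal sketch (assumed example): hipEventRecord marks a point in a
    // stream; hipEventSynchronize blocks until that point has completed,
    // after which the elapsed time between two events can be queried.
    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        hipEvent_t start, stop;
        hipEventCreate(&start);
        hipEventCreate(&stop);

        hipEventRecord(start, 0);      // record in the default stream
        // ... enqueue kernels or async copies here ...
        hipEventRecord(stop, 0);
        hipEventSynchronize(stop);     // wait for 'stop' to complete

        float ms = 0.0f;
        hipEventElapsedTime(&ms, start, stop);
        std::printf("elapsed: %.3f ms\n", ms);

        hipEventDestroy(start);
        hipEventDestroy(stop);
        return 0;
    }
    ```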
  57. [57]
    Unified memory management — HIP 6.2.41134 Documentation
    HIP managed memory allocation API: The hipMallocManaged() is a dynamic memory allocator available on all GPUs with unified memory support. For more details, ...
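    As a hedged illustration (a sketch assuming a GPU with unified memory support; not from the cited page), hipMallocManaged lets host and device share a single pointer:
    ```cpp
    // Minimal sketch (assumed example): memory from hipMallocManaged is
    // addressable from both CPU and GPU, so no explicit hipMemcpy is needed.
    #include <hip/hip_runtime.h>
    #include <cstdio>

    __global__ void increment(int* v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] += 1;
    }

    int main() {
        const int n = 256;
        int* v = nullptr;
        hipMallocManaged(reinterpret_cast<void**>(&v), n * sizeof(int));
        for (int i = 0; i < n; ++i) v[i] = i;  // initialize on the host

        increment<<<1, n>>>(v, n);             // launch on the device
        hipDeviceSynchronize();                // wait before touching v on host

        std::printf("v[0] = %d\n", v[0]);      // prints 1
        hipFree(v);
        return 0;
    }
    ```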
  58. [58]
    Device management — HIP 7.1.52801 Documentation
    A summary of multi-GPU support and related device-management functions.
  59. [59]
    AMD compute language runtimes (CLR)
    opencl - contains implementation of OpenCL™ on AMD platform. It is hosted at clr/opencl. rocclr - contains ROCm compute runtime used in HIP and OpenCL™.
  60. [60]
    ROCm/clr - GitHub
    AMD CLR (Compute Language Runtime) contains source code for AMD's compute language runtimes: HIP and OpenCL™.
  61. [61]
    OpenCL Programming Guide — ROCm 4.5.0 documentation
    If there is a kernel compilation error, the error code is CL_BUILD_PROGRAM_FAILURE, in which case it is necessary to print out the build log.
  62. [62]
    ROCm OpenMP support — llvm-project 20.0.0 Documentation
    The ROCm installation includes an LLVM-based implementation that fully supports the OpenMP 4.5 standard and a subset of OpenMP 5.0, 5.1, and 5.2 standards.
  63. [63]
    OpenMP support in ROCm
    This document briefly describes the installation location of the OpenMP toolchain, example usage of device offloading, and usage of rocprof with OpenMP ...
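    As a hedged illustration (not drawn from the cited document; the exact --offload-arch value depends on the target GPU and ROCm release), OpenMP device offloading of the kind this toolchain compiles looks like:
    ```cpp
    // Minimal sketch (assumed example): a SAXPY loop offloaded to an AMD GPU
    // via OpenMP target directives. Compile with something like:
    //   clang++ -fopenmp --offload-arch=gfx90a saxpy.cpp
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        const float a = 3.0f;
        float* xp = x.data();
        float* yp = y.data();

        // Map x to the device, run the loop there, and copy y back.
        #pragma omp target teams distribute parallel for \
            map(to: xp[0:n]) map(tofrom: yp[0:n])
        for (int i = 0; i < n; ++i)
            yp[i] = a * xp[i] + yp[i];

        std::printf("y[0] = %f\n", yp[0]);  // expect 5.0
        return 0;
    }
    ```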
  64. [64]
    Support, Getting Involved, and FAQ - LLVM/OpenMP
    The OpenMP AMDGPU offloading support depends on the ROCm math libraries and the HSA ROCr / ROCt runtimes. These are normally provided by a standard ROCm ...
  65. [65]
  66. [66]
    What is ROCR? - AMD ROCm documentation
    The ROCm runtime (ROCR) is AMD's implementation of HSA runtime, which is a thin, user-mode API that exposes the necessary interfaces to access and interact ...
  67. [67]
  68. [68]
    API — ROCR 1.18.0 Documentation
    The HSA runtime passes three arguments to the callback: the allocation size, the application data, and a pointer to a memory location where the application ...
  69. [69]
  70. [70]
    AMD Launches ROCm 7.0, Up to 3.8x Performance Uplift Over ...
    Sep 17, 2025 · AMD today unveiled ROCm 7.0, a massive update to its open GPU software platform for AI workloads across datacenter racks and even client ...
  71. [71]
    ROCm LLVM compiler infrastructure — llvm-project 20.0.0 ...
    Learn more about the AMD ROCm LLVM compiler infrastructure and its various components and tools, including the open-source ROCm LLVM fork and associated ...
  72. [72]
    User Guide for AMDGPU Backend — LLVM 20.0.0git documentation
    Sep 30, 2025 · Use the Clang options -mcpu=<target-id> or --offload-arch=<target-id> to specify the AMDGPU processor together with optional target features.
  73. [73]
    Rocm-CompilerSupport has moved! - GitHub
    May 14, 2024 · Rocm-CompilerSupport has moved! This project is now located in the AMD Fork of the LLVM Project, under the "amd/comgr" directory.
  74. [74]
    HIPIFY documentation
    HIPIFY is a ROCm tool to help developers migrate GPU programming from NVIDIA's CUDA language to AMD's HIP C++ programming language for use on AMD GPUs.
  75. [75]
    GPUFORT: S2S translation tool for CUDA Fortran and ... - GitHub
    GPUFORT was developed to translate a number of HPC apps to code formats that are well supported by AMD's ROCm ecosystem.
  76. [76]
    ROCgdb 16.3 Documentation - AMD ROCm documentation
    ROCgdb is the AMD source-level debugger for Linux, based on the GNU Debugger (GDB). ROCgdb enables heterogeneous debugging on the ROCm software.
  77. [77]
    rocBLAS design and usage notes - AMD ROCm documentation
    The rocBLAS library uses Tensile and hipBLASLt internally, which supply high-performance implementations of GEMM. Tensile is installed as part of the ...
  78. [78]
    rocSOLVER LAPACK-like functions - AMD ROCm documentation
    An optimized internal implementation without rocBLAS calls could be executed with small and mid-size matrices if optimizations are enabled (default option). For ...
  79. [79]
    rocSOLVER 3.31.0 Documentation - AMD ROCm documentation
    rocSOLVER implements LAPACK routines on top of the AMD ROCm platform. rocSOLVER is implemented in the HIP programming language and optimized for AMD GPUs.
  80. [80]
    hipSOLVER 3.1.0 Documentation - AMD ROCm documentation
    hipSOLVER is a LAPACK marshalling library, with multiple supported backends. It sits between the application and a 'worker' LAPACK library.
  81. [81]
    rocFFT 1.0.35 Documentation - AMD ROCm documentation
    The rocFFT library provides a fast and accurate implementation of the discrete Fast Fourier Transform (FFT) written in HIP for GPU devices.
  82. [82]
    hipFFT API usage - AMD ROCm documentation
    This section describes how to use the hipFFT library API. The hipFFT API follows the NVIDIA CUDA cuFFT API.
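    As a hedged illustration (assumed example, not from the cited page; the header path may differ across ROCm releases), a 1D complex-to-complex transform through this cuFFT-style interface follows the plan/execute/destroy pattern:
    ```cpp
    // Minimal sketch (assumed example): create a 1D C2C plan, run an
    // in-place forward FFT on device data, then release the plan.
    #include <hipfft/hipfft.h>
    #include <hip/hip_runtime.h>

    int main() {
        const int nx = 1024;
        hipfftComplex* data = nullptr;
        hipMalloc(reinterpret_cast<void**>(&data), nx * sizeof(hipfftComplex));
        // ... fill 'data' on the device or copy it from the host ...

        hipfftHandle plan;
        hipfftPlan1d(&plan, nx, HIPFFT_C2C, 1);           // batch of 1
        hipfftExecC2C(plan, data, data, HIPFFT_FORWARD);  // in-place forward
        hipDeviceSynchronize();

        hipfftDestroy(plan);
        hipFree(data);
        return 0;
    }
    ```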
  83. [83]
    Working with rocFFT - AMD ROCm documentation
    This topic describes how to use rocFFT, including how to structure the workflow, set up and clean up the library, and use plans, ...
  84. [84]
    ROCm libraries - AMD ROCm documentation
    Covers the ROCm library catalog for Linux and Windows, spanning machine learning and computer vision, primitives, communication, and math.
  85. [85]
    MIOpen 3.5.1 Documentation - AMD ROCm documentation
    MIOpen is one of the first libraries to publicly support the bfloat16 data type for convolutions, which allows for efficient training at ...
  86. [86]
    What is MIOpen? - AMD ROCm documentation
    MIOpen is AMD's open-source, deep-learning primitives library for GPUs. It implements fusion to optimize for memory bandwidth and GPU launch overheads.
  87. [87]
    hipTensor 2.0.0 Documentation - AMD ROCm documentation
    hipTensor is a high-performance HIP C++ library for accelerating tensor primitives. It leverages specialized GPU matrix cores on the latest AMD discrete GPUs.
  88. [88]
    What is hipTensor? - AMD ROCm documentation
    hipTensor is a high-performance HIP library for tensor primitives. It's the AMD C++ library for accelerating tensor primitives, leveraging specialized GPU ...
  89. [89]
    rocSPARSE 4.1.0 Documentation
    rocSPARSE is a library that provides basic linear algebra subroutines for sparse matrices and vectors. It's created using the HIP programming language, ...
  90. [90]
    AMD - ROCm | onnxruntime
    The intent is to get users up and running with their custom workloads in Python, providing an environment of prebuilt ROCm, ONNX Runtime, and MIGraphX packages ...
  91. [91]
    Constructing a RAG system using LlamaIndex and Ollama ...
    This tutorial demonstrates how to construct a RAG pipeline using LlamaIndex and Ollama on AMD Radeon GPUs with ROCm. For further details, see the ...
  92. [92]
    [PDF] AMD ROCM™ 7 SOFTWARE SOLUTION GUIDE FOR AMD ...
    A preview version of ROCm 7 software showed an average 3.5× higher inference throughput on an AMD Instinct™ MI300X 8-GPU platform and up to 3× ...
  93. [93]
    Deep learning frameworks for ROCm
    It also provides ROCm-compatible versions of popular frameworks and libraries, such as PyTorch, TensorFlow, JAX, and others. The AMD ROCm organization ...
  94. [94]
    TensorFlow on ROCm installation
    This topic covers setup instructions and the necessary files to build, test, and run TensorFlow with ROCm support in a Docker environment.
  95. [95]
    ROCm vs CUDA: A Performance Showdown for Modern AI Workloads
    Aug 7, 2025 · ROCm + AMD MI325X is ready for prime time. See benchmarks vs CUDA and why more teams are switching to ROCm for AI performance and cost.
  96. [96]
    Building an Accelerated OpenFOAM Proof-of-Concept Application ...
    Jul 24, 2025 · RapidCFD is an open-source OpenFOAM implementation that runs almost all simulations on NVIDIA GPUs. ... https://github.com/ROCm/roc-stdpar.
  97. [97]
    GROMACS on AMD GPU-Based HPC Platforms - arXiv
    Here, we share the results of our work on readying GROMACS for AMD GPU platforms using SYCL, and demonstrate performance on Cray EX235a machines with MI250X ...
  98. [98]
    Frontier User Guide - OLCF User Documentation
    ... RocBLAS has over 1,000 device libraries that may be `dlopen`'d by RocBLAS ... Use an alternative BLAS library such as Magma (for GPU) or cray-libsci or Openblas ( ...
  99. [99]
    Parallelism - abinit
    This page gives hints on how to set parameters for a parallel calculation with the ABINIT package.
  100. [100]
    A collection of examples for the ROCm software stack - GitHub
    This repository is a collection of examples to enable new users to start using ROCm, as well as provide more advanced examples for experienced users.
  101. [101]
    HIP Python - AMD ROCm documentation
    Jun 23, 2023 · HIP Python provides low-level Cython and Python® bindings for the HIP runtime, HIPRTC, multiple math libraries and the communication library RCCL.
  102. [102]
    ROCm/hipfort: Fortran interfaces for ROCm libraries - GitHub
    The current batch of HIPFORT interfaces is derived from ROCm 4.5.0. The following tables list the supported API: HIP · hipBLAS · hipFFT · hipRAND · hipSOLVER ...
  103. [103]
    From Ingestion to Inference: RAG Pipelines on AMD GPUs
    Oct 2, 2025 · Build a RAG enhanced GenAI application that improves the quality of model responses by incorporating data that is missing in the model ...
  104. [104]
    Oracle and AMD Expand Partnership to Help Customers Achieve ...
    Oct 14, 2025 · Beginning in calendar Q3 2026, Oracle will be the first hyperscaler to offer a publicly available AI supercluster powered by 50,000 AMD ...
  105. [105]
    Ubuntu native installation - AMD ROCm documentation
    Covers native ROCm installation on Ubuntu: system requirements, the AMD GPU driver (amdgpu), quick start and detailed installation guides, prerequisites, and installation methods.
  106. [106]
    Red Hat Enterprise Linux native installation
    For information about the AMDGPU driver installation, see the Red Hat Enterprise Linux native installation in the AMD Instinct Data Center GPU Documentation.
  107. [107]
    Running ROCm Docker containers — ROCm installation (Linux)
    To grant a Docker container access to the host's AMD GPUs, run your container with the following options. See the Docker documentation to learn more about the ...
  108. [108]
    Build the ROCm Core SDK from source — AMD ROCm 7.9.0 preview
    Oct 23, 2025 · Learn how to build the ROCm Core SDK from source using TheRock. Includes references to environment setup guides for Ubuntu 24.04 and Windows ...
  109. [109]
    ROCm/TheRock: The HIP Environment and ROCm Kit - GitHub
    TheRock (The HIP Environment and ROCm Kit) is a lightweight open source build platform for HIP and ROCm. The project is currently in an early preview state.
  110. [110]
    Rocm Device Libs - conda install - Anaconda
    To install this package run one of the following: conda install conda-forge::rocm-device-libs conda install conda-forge/label/cf202003::rocm-device-libs ...
  111. [111]
    Using Spack to install ROCm packages
    Spack is a package management tool designed to support multiple software versions and configurations on a wide variety of platforms and environments.
  112. [112]
    ROCm/rocm-spack: A flexible package manager that ... - GitHub
    It covers basic to advanced usage, packaging, developer features, and large HPC deployments. You can do all of the exercises on your own laptop using a Docker ...
  113. [113]
    AMD and Microsoft Bring Cloud-to-Client Power to Developers
    May 19, 2025 · With ROCm support for both Cloud and Windows ... ROCm integrates seamlessly with Microsoft Azure, enabling powerful AI and HPC workloads.
  114. [114]
  115. [115]
    User and AMD GPU Driver (amdgpu) support matrix
    Starting from ROCm™ 6.4.0, forward and backward compatibility between the AMD GPU Driver (amdgpu) and its user space software is provided up to a year apart ( ...
  116. [116]
    AMD ROCm documentation — ROCm Documentation
    ROCm is an open-source software platform optimized to extract HPC and AI workload performance from AMD Instinct GPUs and AMD Radeon GPUs while maintaining ...
  117. [117]
    HIP 7.1.52801 Documentation - AMD ROCm documentation
    HIP is a C++ runtime API and kernel language that lets you create portable applications for AMD and NVIDIA GPUs from a single source code.
  118. [118]
    ROCm™ AI Developer Hub - AMD
    Access ROCm software platforms, tutorials, blogs, open source projects, and other resources for AI development on AMD GPUs.
  119. [119]
    Tutorials for AI developers - AMD ROCm documentation
    Tutorials for AI developers 7.0 ... RAG with LlamaIndex and Ollama · OCR with vision-language models with vLLM · Building AI pipelines for voice assistants.
  120. [120]
    AMD ROCm™ Software
    AMD ROCm™ Software is an open-source stack for GPU computation by AMD.
  121. [121]
    ROCm ROCm · Discussions - GitHub
    Explore the GitHub Discussions forum for ROCm ROCm. Discuss code, ask questions & collaborate with the developer community.
  122. [122]
    Posted in 2025 - ROCm™ Blogs - AMD
    In this blog from the AMD Silo AI Programs, we build a simple Retrieval‑Augmented Generation (RAG) pipeline. While pretrained models are powerful, they lack ...
  123. [123]
  124. [124]
    Retrieval Augmented Generation (RAG) with vLLM, LangChain and ...
    Learn AI-powered knowledge retrieval that enriches prompts with proprietary data to deliver accurate and context-aware answers.
  125. [125]
    Application portability with HIP - AMD GPUOpen
    Apr 30, 2024 · The hipify tools can scan code to identify any unsupported CUDA functions. A list of supported CUDA APIs can be found in ROCm's HIPIFY ...
  126. [126]
    ROCm vs CUDA: GPU Computing Comparison (October 2025)
    Oct 22, 2025 · CUDA now typically outperforms ROCm by 10% to 30%, down from 40% to 50% gaps in previous years. · ROCm costs 15% to 40% less (depending on the ...
  127. [127]
    AMD ROCm™ Software for HPC
    AMD ROCm™ software empowers developers to optimize HPC and Supercomputing applications on AMD Instinct™ accelerators.
  128. [128]
    invexed/hipSYCL: Implementation of SYCL for CPUs, AMD ... - GitHub
    Hardware and OS requirements. We support CPUs, NVIDIA CUDA GPUs and AMD GPUs that are supported by ROCm; hipSYCL currently does not support other operating ...
  129. [129]
    OpenMP* Support - Intel
    The Intel oneAPI DPC++/C++ Compiler supports OpenMP C++ pragmas that comply with OpenMP C++ Application Program Interface (API) specification 5.0.
  130. [130]
    OpenCL™ Code Interoperability - Intel
    OpenCL™ Code Interoperability. The oneAPI programming model enables developers to continue using all OpenCL code features via different parts of the SYCL* API.
  131. [131]
    UXL Foundation: Unified Acceleration
    oneAPI is designed to enable developers to use a single code base across multiple accelerators and architectures, supporting artificial intelligence, high ...
  132. [132]
    AMD GPU's boosting ROCm 7.0 software libraries are here
    Sep 17, 2025 · AMD closed the performance gap with Nvidia's Blackwell accelerators with the launch of the MI355X this spring.
  133. [133]
    Compiling SYCL with Different GPUs - Intel
    May 11, 2022 · This document demonstrates how a SYCL application can be compiled and executed on different graphics processing units (GPUs) from Intel, AMD, NVIDIA, etc.
  134. [134]
    PyTorch on ROCm installation
    PyTorch on ROCm provides mixed-precision and large-scale training using AMD MIOpen and RCCL libraries. This topic covers setup instructions and the necessary ...
  135. [135]
    Accelerate Your AI: PyTorch 2.4 Now Supports Intel GPUs for Faster ...
    Aug 29, 2024 · PyTorch 2.4 now supports Intel Data Center GPU Max Series and the SYCL software stack, making it easier to speed up your AI workflows for both training and ...