Remote direct memory access
Remote direct memory access (RDMA) is a networking technology that enables the direct transfer of data between the memory of two computers over a network, bypassing the CPU and operating system kernel on both endpoints and avoiding intermediate data copies and context switches to achieve low-latency, high-throughput communication.[1] This is facilitated by specialized network interface controllers (NICs), known as RNICs, which handle data placement directly into application buffers using protocols like the Remote Direct Memory Access Protocol (RDMAP).[1] At its core, RDMA operates through operations such as RDMA Write (one-sided transfer to remote memory), RDMA Read (remote fetch of data), and Send (two-sided message passing), all of which ensure reliable, ordered delivery over underlying transports while providing memory protection via steering tags (STags) to prevent unauthorized access.[1] The technology minimizes data copies and CPU involvement by leveraging direct data placement (DDP), allowing applications to specify exact memory locations for transfers without intermediate buffering.[1]

RDMA is implemented across several standards: InfiniBand, a channel-based fabric architecture designed for high-performance computing (HPC) that natively supports RDMA semantics for low-latency interconnects between servers, storage, and GPUs;[2] RDMA over Converged Ethernet (RoCE), which extends RDMA capabilities to standard Ethernet networks at Layers 2 and 3 for scalable data center deployments;[3] and iWARP (Internet Wide Area RDMA Protocol), which maps RDMA over TCP/IP for compatibility with existing Ethernet infrastructure.[1] These implementations, governed by bodies like the InfiniBand Trade Association (IBTA) and the Internet Engineering Task Force (IETF), ensure interoperability and evolving features such as enhanced telemetry and higher port densities in recent specifications.[4]

RDMA's key advantages include reduced latency (often sub-microsecond), high bandwidth (up to hundreds of Gbps), and
near-zero CPU overhead, making it essential for demanding applications like AI training, big data analytics, distributed storage (e.g., NVMe-oF), and cloud-scale clustering.[2] By offloading network processing to hardware, it enhances scalability in modern data centers, where InfiniBand and RoCE together power 73% of the TOP500 supercomputers as of November 2024.[5]
Fundamentals
Definition and Core Concepts
Remote Direct Memory Access (RDMA) is a networking technology that enables direct data transfers between the main memory of networked computers without involving the central processing unit (CPU), operating system (OS), cache, or traditional network stack on either endpoint.[6] This approach offloads data movement to the network interface hardware, allowing applications to access remote memory as if it were local, thereby achieving low-latency and high-throughput communication essential for high-performance computing and data centers.[7] At its core, RDMA incorporates zero-copy networking, where data is transferred directly from the virtual memory of one node to the virtual memory of another without intermediate buffering or copying in the kernel or user space.[8] It also features kernel bypass, permitting user-level applications to interact directly with the network hardware, eliminating OS overhead during transfers.[9] RDMA operations are categorized into single-sided and two-sided types: single-sided operations, such as RDMA Write and RDMA Read, allow the initiator to specify both local and remote memory buffers while bypassing the remote CPU entirely for completion notification; in contrast, two-sided operations, like Send and Receive, require both endpoints to post buffers and involve explicit coordination, resembling traditional message passing.[10] RDMA extends the principles of traditional local Direct Memory Access (DMA), where peripheral devices access host memory independently of the CPU, to remote scenarios across a network, enabling similar efficiency over distances.[11] Unlike conventional TCP/IP networking, which relies on multiple data copies through the kernel and incurs significant CPU involvement for processing packets, RDMA minimizes these bottlenecks to deliver superior performance in bandwidth-intensive applications.[12] The basic architecture relies on RDMA-enabled Network Interface Cards (RNICs), specialized hardware that independently 
manages memory registration, queue processing, and data transfers without host intervention.[7]
Key Operational Principles
Remote Direct Memory Access (RDMA) enables efficient data transfer by placing incoming or outgoing data directly into the memory buffers of user-space applications on remote hosts, bypassing the operating system kernel to eliminate the overhead of data copying through kernel space. This direct placement is achieved through hardware support in RDMA-capable network interface controllers (RNICs), which manage transfers independently of the CPU.[13] A key enabler of this process is the avoidance of kernel involvement, which prevents the costly context switches and interrupts that characterize traditional TCP/IP networking; instead, the RNIC handles packet processing, error detection, and retransmissions at the hardware level.[14]

To support secure and predictable direct access, applications must register specific memory regions with the RNIC prior to use in RDMA operations. This registration process pins the buffers in physical memory, mapping virtual addresses to physical ones and preventing paging or swapping that could disrupt hardware access, while also establishing protection domains to enforce access permissions.[13] Pinning ensures that the RNIC can translate and validate addresses without software intervention, maintaining the zero-copy nature of transfers.

At the core of RDMA's asynchronous operation model are queue pairs (QPs), each comprising a send queue (SQ) for outgoing work and a receive queue (RQ) for incoming work. Applications post work requests (WRs) to the SQ to initiate sends, RDMA writes, or RDMA reads, and to the RQ to prepare buffers for incoming messages, with the RNIC dequeuing and executing these requests in hardware.[15] This queue-based mechanism allows for efficient batching and pipelining of operations, decoupling application logic from low-level network handling.
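The queue-based flow described above can be sketched as a toy model (plain Python with no RDMA hardware; the class and method names here are illustrative stand-ins, not the Verbs API): the application posts work requests to a send queue, a simulated RNIC drains that queue in order, and completions accumulate on a completion queue that the application later polls.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class WorkRequest:           # simplified stand-in for a send-queue WR
    wr_id: int
    opcode: str              # e.g. "SEND" or "RDMA_WRITE"
    length: int

@dataclass
class Completion:            # simplified stand-in for a CQE
    wr_id: int
    opcode: str
    status: str
    byte_len: int

class ToyQueuePair:
    """Models one QP's send queue plus its associated completion queue."""
    def __init__(self):
        self.send_queue = deque()
        self.completion_queue = deque()

    def post_send(self, wr):         # analogous to ibv_post_send()
        self.send_queue.append(wr)   # posting is asynchronous and returns at once

    def rnic_process(self):
        """Simulate the RNIC draining the SQ and generating CQEs in order."""
        while self.send_queue:
            wr = self.send_queue.popleft()
            # a real RNIC would DMA the payload onto the wire here
            self.completion_queue.append(
                Completion(wr.wr_id, wr.opcode, "SUCCESS", wr.length))

    def poll_cq(self, max_entries):  # analogous to ibv_poll_cq()
        out = []
        while self.completion_queue and len(out) < max_entries:
            out.append(self.completion_queue.popleft())
        return out

qp = ToyQueuePair()
qp.post_send(WorkRequest(1, "RDMA_WRITE", 4096))
qp.post_send(WorkRequest(2, "SEND", 64))
qp.rnic_process()
completions = qp.poll_cq(16)
```

The point of the decoupling is visible in the shape of the code: posting and completion reaping are separate steps, which is what allows applications to batch many work requests and harvest their completions without per-operation interrupts.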
Work request completion is signaled through completion queues (CQs), which the application monitors via polling or event notification to retrieve completion queue elements (CQEs) containing status, opcode, and byte count details.[16] CQs enable scalable, low-overhead notification without relying on interrupts, supporting high-throughput scenarios by allowing multiple QPs to share a single CQ. RDMA defines transport semantics to balance reliability, ordering, and overhead. The Reliable Connected (RC) service establishes a dedicated connection between QPs, guaranteeing in-order delivery, exactly-once semantics, and flow control through hardware acknowledgments and retransmissions.[17] In contrast, the Unreliable Datagram (UD) service provides connectionless, best-effort delivery akin to UDP, with no ordering or reliability guarantees but minimal setup overhead, ideal for fire-and-forget messaging.[18] This hardware-centric design yields significant latency reductions, approximated as

RDMA latency ≈ RNIC processing time + network transit time

with end-to-end latency typically under 5 μs for small messages in local clusters, versus over 100 μs for equivalent TCP/IP transfers involving kernel traversal and buffering.[16][19]
History and Development
Origins and Early Standards
Remote direct memory access (RDMA) emerged in the 1990s as a response to performance bottlenecks in high-performance computing (HPC) environments, particularly in cluster-based supercomputing systems where traditional network interfaces incurred high latency due to operating system kernel involvement in data transfers. These bottlenecks limited scalability in parallel applications, such as scientific simulations and large-scale data processing, by introducing overheads from context switches and data copying between user and kernel spaces. Early research in user-level networking, including projects like U-Net at Cornell University, highlighted the need for direct hardware access to memory without CPU intervention to achieve low-latency, high-bandwidth communication in distributed systems. A pivotal early standardization effort was the Virtual Interface Architecture (VIA), a software specification developed by Compaq, Intel, and Microsoft to enable protected, user-level networking over system area networks (SANs). Released in version 1.0 on December 16, 1997, VIA provided abstractions for zero-copy data transfers and remote memory operations, aiming to reduce communication latency for HPC clusters and enterprise applications like transaction processing.[20] By allowing applications to directly manage network interfaces via virtual interfaces and completion queues, VIA addressed key limitations of kernel-mediated networking, influencing subsequent RDMA designs.[21] Building on VIA's concepts, the InfiniBand Trade Association (IBTA) was formed in August 1999 by industry leaders including Intel, Microsoft, Dell, Hewlett-Packard, IBM, and Mellanox to develop a unified architecture for high-speed interconnects in HPC and data centers.
The InfiniBand Architecture (IBA) specification version 1.0 was released in October 2000, defining a switched fabric protocol with native support for RDMA operations like send/receive and direct memory writes/reads to bypass CPU and OS involvement.[22] Initial hardware implementations followed shortly, with Mellanox shipping the first InfiniBand devices, such as the InfiniBridge MT21108 host channel adapter, in January 2001, enabling practical deployment in supercomputing clusters.[23] Intel contributed significantly to early InfiniBand development through its involvement in the IBTA and silicon design efforts.[24]
Evolution and Adoption Milestones
The standardization of iWARP by the Internet Engineering Task Force (IETF) in 2007 marked an early milestone in extending RDMA capabilities over standard TCP/IP networks, with RFC 5040 defining the core Remote Direct Memory Access Protocol (RDMAP) and related specifications (RFC 5041–5044) enabling direct data placement and framing over reliable transports.[25] This laid the groundwork for Ethernet-based RDMA implementations, broadening accessibility beyond proprietary fabrics. In 2010, the InfiniBand Trade Association (IBTA) released the initial RoCE specification (v1), integrating RDMA semantics directly into Ethernet frames to leverage existing data center infrastructure without requiring specialized hardware. This was followed by RoCE v2 in September 2014, which added routable IP/UDP encapsulation to support Layer 3 network traversal, enhancing scalability in multi-subnet environments.[26] Operating system adoption accelerated RDMA's integration into mainstream computing. Linux kernels began supporting RDMA features in the late 2000s, with initial NFS/RDMA client implementation in version 2.6.24 (December 2007) and server support in 2.6.25 (April 2008), enabling efficient file system operations over RDMA fabrics.[27] Microsoft introduced native RDMA support in Windows Server 2012 via SMB Direct, allowing low-CPU file sharing over RDMA-capable adapters for storage and clustering workloads.[28] Virtualization platforms followed suit, with VMware integrating paravirtual RDMA (PVRDMA) in vSphere 6.5 (October 2016), permitting virtual machines to access RDMA hardware for high-throughput networking.[29] By the 2010s, RDMA had achieved widespread adoption in high-performance computing (HPC), powering a majority of Top500 supercomputers through InfiniBand and emerging Ethernet variants, driven by demands for low-latency interconnects in scientific simulations and big data processing.[30] Market momentum surged in 2018 with the proliferation of 100 Gbit/s RDMA hardware 
from vendors like Broadcom and Supermicro, enabling cost-effective scaling for enterprise clusters and reducing latency bottlenecks in bandwidth-intensive applications.[31] In April 2020, NVIDIA completed its $7 billion acquisition of Mellanox, enhancing RDMA and InfiniBand integration with GPU technologies for AI and HPC workloads.[32] Post-2020, RDMA adoption boomed in data centers, fueled by AI/ML workloads requiring ultra-low latency data movement, with the RDMA networking market expanding from approximately $1 billion prior to 2021 to over $6 billion in 2023.[33] As of June 2024, RDMA-based networks powered over 90% of TOP500 supercomputers, and the market is projected to exceed $22 billion by 2028, driven by continuing AI/ML demand.[30] Intel's Omni-Path Architecture, announced in November 2014 and commercially released in 2015, emerged as a cost-competitive alternative to InfiniBand, offering 100 Gbit/s throughput with lower latency and power consumption for HPC fabrics, and remained in use through the 2020s despite the eventual discontinuation of further generations.[34]
Protocols and Implementations
InfiniBand and RoCE
InfiniBand serves as a foundational protocol for remote direct memory access (RDMA), defined as a channel-based interconnect architecture that employs a switched fabric topology to enable high-speed connectivity between servers and storage systems.[3][35] This topology facilitates scalable, point-to-point communication with minimal latency, supporting data rates up to 800 Gbit/s via the eXtended Data Rate (XDR) standard as of 2025.[36] InfiniBand ensures lossless data transmission through credit-based flow control, where receivers grant credits to senders to manage buffer usage and prevent packet drops.[37] The architecture is governed by specifications from the InfiniBand Trade Association (IBTA), with ongoing revisions such as Volume 1 Release 2.0 in 2025, which enhance switch density, scalability, and memory placement for reduced latency.[38] RoCE, or RDMA over Converged Ethernet, adapts InfiniBand's RDMA capabilities to standard Ethernet networks, allowing efficient, low-latency data transfers over converged infrastructure.[39] It exists in two versions: RoCE v1, which operates solely at Ethernet Layer 2 within a single broadcast domain, and RoCE v2, which is routable across Layer 3 networks using IP and UDP encapsulation for broader scalability.[40] Like InfiniBand, RoCE requires lossless Ethernet environments, achieved through mechanisms such as Priority-based Flow Control (PFC) to avoid packet loss and maintain performance.[39] The primary differences between InfiniBand and RoCE lie in their underlying hardware and deployment: InfiniBand utilizes dedicated native hardware for its fabric, providing optimized, purpose-built performance, whereas RoCE overlays RDMA functionality onto existing Ethernet infrastructure, leveraging commodity switches and NICs for cost-effective integration.[41] Despite these distinctions, both protocols share the IBTA Verbs API, a standardized interface for managing RDMA operations like queue pairs and work requests.[42] RoCE 
standards emerged as IBTA supplements: the initial RoCE specification was released in 2010, and RoCE v2 was formalized in 2014 to address the routing limitations of the earlier version.[43]
iWARP, Omni-Path, and Emerging Standards
iWARP, or Internet Wide Area RDMA Protocol, is a standards-based implementation of RDMA that operates over TCP/IP, enabling direct memory access across standard Ethernet networks without requiring specialized lossless fabrics.[1] Defined by the Internet Engineering Task Force (IETF) in 2007, iWARP consists of a layered protocol stack including the Remote Direct Memory Access Protocol (RDMAP) for RDMA operations, the Direct Data Placement (DDP) protocol for efficient data transfer into application buffers, and the Marker PDU Aligned Framing (MPA) for TCP framing to ensure reliable delivery.[1][44] This approach leverages existing Ethernet infrastructure, avoiding the need for priority flow control or other enhancements mandated by protocols like RoCE, but introduces higher protocol overhead due to TCP's reliability mechanisms, such as acknowledgments and retransmissions.[1] iWARP's design prioritizes compatibility with conventional IP networks, making it suitable for enterprise and wide-area deployments where hardware modifications are impractical. Omni-Path Architecture (OPA), introduced by Intel in 2014, is a proprietary high-performance interconnect designed specifically for scalability in high-performance computing (HPC) environments, offering RDMA capabilities with low latency and high bandwidth. OPA supports data rates up to 100 Gbps per port in its initial generation, with optimizations for message rates exceeding 10 million per second and end-to-end latencies under 1 microsecond, positioning it as a cost-effective alternative to InfiniBand for large-scale clusters. The architecture employs the Omni-Path Interface (OPI) specification, which defines a standardized electrical and protocol interface for host fabric adapters and switches, facilitating interoperability among components while emphasizing power efficiency and fabric manageability. 
Although plans for a 200 Gbps second-generation OPA were announced, Intel discontinued development in 2019, shifting focus to other interconnect technologies. OPA's fabric supports up to thousands of nodes with features like adaptive routing and congestion control, enhancing reliability in HPC workloads. Emerging standards are extending RDMA's reach into Ethernet-centric and software-defined environments, particularly for AI and cloud-scale applications. The Ultra Ethernet Consortium (UEC), formed in 2023 by industry leaders including Intel, AMD, and Broadcom, is developing an open Ethernet-based specification optimized for AI and HPC, featuring the Ultra Ethernet Transport (UET) protocol as a modern RDMA alternative to RoCE.[45] UEC 1.0, released in June 2025, introduces RDMA enhancements with intelligent low-latency transport, congestion control tailored for high-throughput AI training, and IP-routable packet structures to support massive-scale clusters without proprietary fabrics.[46] Complementing hardware advancements, Soft-RoCE provides a software-emulated RDMA implementation over standard Ethernet, allowing systems without dedicated RDMA hardware to perform direct memory transfers via kernel drivers like those in Linux.[47] This emulation layer maps RDMA verbs onto UDP transport, enabling testing and deployment in virtualized or legacy environments, with performance approaching hardware solutions for smaller-scale use cases.[48] These developments reflect ongoing IETF and industry efforts to evolve RDMA standards for broader interoperability and efficiency in diverse networking ecosystems.[49]
Technical Mechanisms
Memory Access and Data Transfer
Remote Direct Memory Access (RDMA) supports several core operations for efficient data transfer between nodes, categorized into one-sided and two-sided semantics. One-sided operations, such as RDMA Read and RDMA Write, enable direct access to remote memory without involving the remote CPU, allowing the initiator to pull data (RDMA Read) from or push data (RDMA Write) to a specified remote memory region using a provided remote key (rkey).[50][51] Two-sided operations, including Send and Receive, function like message passing, where the sender posts a Send work request to deliver data to a pre-posted Receive buffer on the remote side, requiring coordination and CPU involvement on both ends for completion signaling.[50][51] Additionally, atomic operations, such as Compare-and-Swap or Fetch-and-Add, provide one-sided mechanisms for synchronized remote memory updates, where the RNIC performs the operation atomically and returns the result to the initiator without remote CPU intervention.[51] Before performing RDMA operations, applications must register memory regions to enable safe direct access by the RDMA Network Interface Card (RNIC). 
This registration process pins the specified virtual memory pages in physical memory to prevent paging, maps virtual addresses to physical ones for the RNIC, and assigns permissions such as local read/write or remote access types (e.g., remote read, write, or atomic).[50] Upon successful registration, the RNIC generates a local key (lkey) for the application to reference the region in local operations and a remote key (rkey) to share with remote peers, which serves as an access control token to validate and authorize incoming remote requests.[50] This pinning ensures data integrity during transfers but consumes system resources, as registered regions remain fixed in physical memory until deregistered.[50] Data transfer in RDMA begins when the initiator application posts a work request (WR) describing the operation—such as the remote address, length, and rkey for one-sided verbs—to a send queue within a queue pair (QP), a paired set of queues for communication between endpoints.[52] The RNIC then autonomously processes the WR by initiating Direct Memory Access (DMA) engines to transfer data directly between the local and remote memory regions, bypassing the host CPU for both data movement and protocol processing in one-sided operations.[53] For two-sided operations, the remote side must have a corresponding Receive WR posted, after which the RNIC signals completion via completion queues for error handling and synchronization.[51] This flow achieves low-latency transfers by offloading all network and memory operations to hardware. 
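The registration and key-check logic above can be illustrated with a small toy model (plain Python; names such as register_region and remote_write are invented for this sketch, and a real RNIC performs these checks in hardware): registration records a region's bounds and permissions and hands back lkey/rkey handles, and an incoming one-sided write is applied only if its rkey matches a registered region that permits remote writes and the target range stays inside that region.

```python
import secrets

class ToyRNIC:
    """Tracks registered memory regions and validates remote access by rkey."""
    def __init__(self):
        self.regions = {}                 # rkey -> (start, length, permissions)
        self.memory = bytearray(1 << 16)  # stand-in for pinned host memory

    def register_region(self, start, length, permissions):
        # a real ibv_reg_mr() would also pin the pages and program the NIC's
        # address-translation tables; here we only record bounds and rights
        lkey = secrets.randbits(32)
        rkey = secrets.randbits(32)
        self.regions[rkey] = (start, length, frozenset(permissions))
        return lkey, rkey

    def remote_write(self, rkey, addr, payload):
        """Validate and apply an incoming one-sided RDMA Write."""
        region = self.regions.get(rkey)
        if region is None:
            return "INVALID_RKEY"         # unknown access token
        start, length, perms = region
        if "REMOTE_WRITE" not in perms:
            return "ACCESS_ERROR"         # region not opened for remote writes
        if addr < start or addr + len(payload) > start + length:
            return "OUT_OF_BOUNDS"        # range escapes the registered region
        self.memory[addr:addr + len(payload)] = payload
        return "SUCCESS"

rnic = ToyRNIC()
lkey, rkey = rnic.register_region(0x1000, 4096, {"LOCAL_READ", "REMOTE_WRITE"})
ok = rnic.remote_write(rkey, 0x1000, b"hello")
bad_key = rnic.remote_write(rkey ^ 1, 0x1000, b"hello")
bad_range = rnic.remote_write(rkey, 0x1000 + 4093, b"hello")
```

The rkey thus acts as a capability: possession of the correct key, together with in-bounds addressing and matching permissions, is all the remote peer needs, which is why the checks can be executed entirely in NIC hardware without consulting the host CPU.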
RDMA's efficiency stems from minimal protocol overhead: theoretical maximum throughput is the link speed multiplied by the ratio of payload size to total frame size, where the frame adds headers and encapsulation on top of the payload.[54] For instance, on 100 Gbit/s links, RDMA protocols commonly achieve approximately 95% efficiency under optimal conditions with large payloads and lossless networks, approaching line rate while reducing CPU utilization to near zero.[54][55]
APIs and Queue Management
The Verbs API, standardized by the InfiniBand Trade Association (IBTA) in the InfiniBand Architecture Specification, provides a user-space programming interface for RDMA operations on InfiniBand and RoCE networks. It enables applications to directly manage hardware resources and initiate transfers without kernel involvement, supporting functions for resource allocation, work request posting, and event polling. This API is foundational for high-performance networking, allowing developers to implement efficient data movement semantics.[56] Central to the Verbs API are functions for posting and completing work requests, such as ibv_post_send() to enqueue send operations on a queue pair's send queue and ibv_post_recv() for receive operations on the receive queue. Completion events are retrieved via ibv_poll_cq(), which dequeues work completion structures containing status, opcode, and byte counts from a completion queue. These mechanisms ensure asynchronous, non-blocking operation, with signaled work requests generating completion entries that can optionally raise event notifications.[57]
Queue management encompasses the creation and configuration of core RDMA resources: queue pairs (QPs), completion queues (CQs), and protection domains (PDs). Queue pairs, created with ibv_create_qp(), represent bidirectional communication endpoints comprising a send queue for outgoing work requests and a receive queue for incoming ones; they support transport types like reliable connection (RC) or unreliable datagram (UD). Completion queues, allocated via ibv_create_cq(), hold entries for finished work requests from one or more QPs, with polling removing entries to track progress. Protection domains, obtained through ibv_alloc_pd(), enforce isolation by grouping QPs, memory regions, and address handles, restricting network access to authorized host memory and preventing unauthorized reads or writes. These resources are destroyed with corresponding deallocation functions like ibv_destroy_qp() and ibv_destroy_cq() upon completion.[50]
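The isolation role of protection domains can be shown with a short Python sketch (a toy model with invented names, not the C Verbs API): every QP and memory region is created within exactly one PD, and a work request is rejected unless the memory region it references belongs to the same PD as the QP, mirroring the local protection error a real RNIC would report.

```python
import itertools

_ids = itertools.count(1)   # toy id/key generator; real keys come from the RNIC

class ProtectionDomain:
    """Groups QPs and memory regions; access across PDs is disallowed."""
    def __init__(self):
        self.pd_id = next(_ids)

class MemoryRegion:
    def __init__(self, pd, length):
        self.pd = pd
        self.length = length
        self.lkey = next(_ids)

class QueuePair:
    def __init__(self, pd):
        self.pd = pd

    def post_send(self, mr, offset, length):
        """Accept the WR only if the MR is in this QP's PD and in bounds."""
        if mr.pd is not self.pd:
            return "LOCAL_PROTECTION_ERROR"   # cf. IBV_WC_LOC_PROT_ERR
        if offset + length > mr.length:
            return "LOCAL_LENGTH_ERROR"
        return "POSTED"

pd_a, pd_b = ProtectionDomain(), ProtectionDomain()
qp = QueuePair(pd_a)                  # QP lives in pd_a
mr_same = MemoryRegion(pd_a, 4096)    # registered in the same PD
mr_other = MemoryRegion(pd_b, 4096)   # registered in a different PD
```

Because the PD check is a simple identity comparison performed at post time, the hardware can enforce it on every work request without measurable cost, which is how PDs restrict network access to authorized host memory.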
Work completions (WCs) are handled by applications polling CQs, where each WC includes fields for status, vendor error codes, and completion flags to indicate success or failure. Multiple QPs can share a CQ for efficiency, but overflow events trigger IBV_EVENT_CQ_ERR if the queue fills without polling. Protection domains integrate with memory registration to validate access rights during operations, ensuring faults like invalid keys result in controlled errors rather than crashes.[58]
The primary library for Verbs API implementation on Linux is libibverbs, part of the OpenFabrics Enterprise Distribution (OFED), which abstracts hardware-specific drivers for InfiniBand, RoCE, and iWARP. It supports user-space direct access via the ib_uverbs kernel module, enabling low-latency operations. On Windows, the Network Direct Kernel Provider Interface (NDKPI), an NDIS extension, delivers a Verbs-compatible API for RDMA, allowing independent hardware vendors to implement kernel-mode support for protocols like RoCE and iWARP. Verbs extensions for Ethernet, such as those in libibverbs-rocee, facilitate RDMA over converged networks by mapping InfiniBand semantics to Ethernet transports.[59][60][61]
Error handling in the Verbs API centers on completion status codes within WCs, with IBV_WC_SUCCESS denoting successful completion and codes like IBV_WC_LOC_QP_OP_ERR or IBV_WC_REM_INV_REQ_ERR signaling local or remote errors such as queue underflow or invalid requests. On reliable connections, negative acknowledgments (NAKs) from the remote side, such as for sequence errors or receiver not ready (RNR), are retried by the hardware up to configured limits; once those limits are exhausted, the work request completes with a status such as IBV_WC_RETRY_EXC_ERR or IBV_WC_RNR_RETRY_EXC_ERR, and recovery falls to the application. Events like IBV_EVENT_QP_FATAL indicate irrecoverable QP errors, requiring resource recreation.[62][63]
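The RNR case in particular can be sketched as follows (a plain Python toy; in real hardware the retry loop runs inside the RNIC, governed by the QP's rnr_retry attribute, and the names below are illustrative): a send targeting a peer with no posted receive buffer is retried a bounded number of times, and only after the limit is exhausted does the requester see a retry-exceeded completion status.

```python
from collections import deque

class ToyReceiver:
    def __init__(self):
        self.receive_queue = deque()   # posted receive buffers

    def deliver(self, message):
        """Return an ACK if a receive buffer is posted, else an RNR NAK."""
        if not self.receive_queue:
            return "RNR_NAK"
        buf = self.receive_queue.popleft()
        buf.append(message)            # place the message in the posted buffer
        return "ACK"

def send_with_rnr_retry(receiver, message, rnr_retry=3):
    """Model the RNIC's bounded RNR retry loop on a reliable connection."""
    for _ in range(1 + rnr_retry):     # first attempt plus rnr_retry retries
        if receiver.deliver(message) == "ACK":
            return "SUCCESS"                 # cf. IBV_WC_SUCCESS
    return "RNR_RETRY_EXCEEDED"              # cf. IBV_WC_RNR_RETRY_EXC_ERR

peer = ToyReceiver()
first = send_with_rnr_retry(peer, "msg-1")   # no receive buffer posted yet
peer.receive_queue.append([])                # application posts a receive
second = send_with_rnr_retry(peer, "msg-2")
```

This is why two-sided RDMA applications pre-post receive work requests before their peers start sending: the hardware absorbs transient RNR conditions, but a persistently empty receive queue surfaces as an error the application must handle.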