Remote direct memory access
Remote direct memory access (RDMA) is a networking technology that enables the direct transfer of data between the memory of two computers over a network, bypassing the CPU and operating system kernel on both endpoints and avoiding intermediate data copies and context switches to achieve low-latency, high-throughput communication.[1] This is facilitated by specialized network interface controllers (NICs), known as RNICs, which handle data placement directly into application buffers using protocols like the Remote Direct Memory Access Protocol (RDMAP).[1] At its core, RDMA operates through operations such as RDMA Write (one-sided transfer to remote memory), RDMA Read (remote fetch of data), and Send (two-sided message passing), all of which ensure reliable, ordered delivery over underlying transports while providing memory protection via steering tags (STags) to prevent unauthorized access.[1] The technology minimizes data copies and CPU involvement by leveraging direct data placement (DDP), allowing applications to specify exact memory locations for transfers without intermediate buffering.[1]

RDMA is implemented across several standards: InfiniBand, a channel-based fabric architecture designed for high-performance computing (HPC) that natively supports RDMA semantics for low-latency interconnects between servers, storage, and GPUs;[2] RDMA over Converged Ethernet (RoCE), which extends RDMA capabilities to standard Ethernet networks at Layers 2 and 3 for scalable data center deployments;[3] and iWARP (Internet Wide Area RDMA Protocol), which maps RDMA over TCP/IP for compatibility with existing Ethernet infrastructure.[1] These implementations, governed by bodies like the InfiniBand Trade Association (IBTA) and the Internet Engineering Task Force (IETF), ensure interoperability and evolving features such as enhanced telemetry and higher port densities in recent specifications.[4]

RDMA's key advantages include reduced latency (often sub-microsecond), high bandwidth (up to hundreds of Gbps), and
near-zero CPU overhead, making it essential for demanding applications like AI training, big data analytics, distributed storage (e.g., NVMe-oF), and cloud-scale clustering.[2] By offloading network processing to hardware, it enhances scalability in modern data centers, where InfiniBand and RoCE together power 73% of the TOP500 supercomputers as of November 2024.[5]
Fundamentals
Definition and Core Concepts
Remote Direct Memory Access (RDMA) is a networking technology that enables direct data transfers between the main memory of networked computers without involving the central processing unit (CPU), operating system (OS), cache, or traditional network stack on either endpoint.[6] This approach offloads data movement to the network interface hardware, allowing applications to access remote memory as if it were local, thereby achieving low-latency and high-throughput communication essential for high-performance computing and data centers.[7] At its core, RDMA incorporates zero-copy networking, where data is transferred directly from the virtual memory of one node to the virtual memory of another without intermediate buffering or copying in the kernel or user space.[8] It also features kernel bypass, permitting user-level applications to interact directly with the network hardware, eliminating OS overhead during transfers.[9] RDMA operations are categorized into single-sided and two-sided types: single-sided operations, such as RDMA Write and RDMA Read, allow the initiator to specify both local and remote memory buffers while bypassing the remote CPU entirely for completion notification; in contrast, two-sided operations, like Send and Receive, require both endpoints to post buffers and involve explicit coordination, resembling traditional message passing.[10] RDMA extends the principles of traditional local Direct Memory Access (DMA), where peripheral devices access host memory independently of the CPU, to remote scenarios across a network, enabling similar efficiency over distances.[11] Unlike conventional TCP/IP networking, which relies on multiple data copies through the kernel and incurs significant CPU involvement for processing packets, RDMA minimizes these bottlenecks to deliver superior performance in bandwidth-intensive applications.[12] The basic architecture relies on RDMA-enabled Network Interface Cards (RNICs), specialized hardware that independently 
manages memory registration, queue processing, and data transfers without host intervention.[7]
Key Operational Principles
Remote Direct Memory Access (RDMA) enables efficient data transfer by placing incoming or outgoing data directly into the memory buffers of user-space applications on remote hosts, bypassing the operating system kernel to eliminate the overhead of data copying through kernel space. This direct placement is achieved through hardware support in RDMA-capable network interface controllers (RNICs), which manage transfers independently of the CPU.[13] A key enabler of this process is the avoidance of kernel involvement, which prevents the costly context switches and interrupts that characterize traditional TCP/IP networking; instead, the RNIC handles packet processing, error detection, and retransmissions at the hardware level.[14]

To support secure and predictable direct access, applications must register specific memory regions with the RNIC prior to use in RDMA operations. This registration process pins the buffers in physical memory, mapping virtual addresses to physical ones and preventing paging or swapping that could disrupt hardware access, while also establishing protection domains to enforce access permissions.[13] Pinning ensures that the RNIC can translate and validate addresses without software intervention, maintaining the zero-copy nature of transfers.

At the core of RDMA's asynchronous operation model are queue pairs (QPs), each comprising a send queue (SQ) for outgoing work and a receive queue (RQ) for incoming work. Applications post work requests (WRs) to the SQ to initiate sends, RDMA writes, or RDMA reads, and to the RQ to prepare buffers for incoming messages, with the RNIC dequeuing and executing these requests in hardware.[15] This queue-based mechanism allows for efficient batching and pipelining of operations, decoupling application logic from low-level network handling.
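The queue-based flow described above can be sketched as a toy model (plain Python with no RDMA hardware; the class and method names here are illustrative stand-ins, not the Verbs API): the application posts work requests to a send queue, a simulated RNIC drains that queue in order, and completions accumulate on a completion queue that the application later polls.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class WorkRequest:           # simplified stand-in for a send-queue WR
    wr_id: int
    opcode: str              # e.g. "SEND" or "RDMA_WRITE"
    length: int

@dataclass
class Completion:            # simplified stand-in for a CQE
    wr_id: int
    opcode: str
    status: str
    byte_len: int

class ToyQueuePair:
    """Models one QP's send queue plus its associated completion queue."""
    def __init__(self):
        self.send_queue = deque()
        self.completion_queue = deque()

    def post_send(self, wr):         # analogous to ibv_post_send()
        self.send_queue.append(wr)   # posting is asynchronous and returns at once

    def rnic_process(self):
        """Simulate the RNIC draining the SQ and generating CQEs in order."""
        while self.send_queue:
            wr = self.send_queue.popleft()
            # a real RNIC would DMA the payload onto the wire here
            self.completion_queue.append(
                Completion(wr.wr_id, wr.opcode, "SUCCESS", wr.length))

    def poll_cq(self, max_entries):  # analogous to ibv_poll_cq()
        out = []
        while self.completion_queue and len(out) < max_entries:
            out.append(self.completion_queue.popleft())
        return out

qp = ToyQueuePair()
qp.post_send(WorkRequest(1, "RDMA_WRITE", 4096))
qp.post_send(WorkRequest(2, "SEND", 64))
qp.rnic_process()
completions = qp.poll_cq(16)
```

The point of the decoupling is visible in the shape of the code: posting and completion reaping are separate steps, which is what allows applications to batch many work requests and harvest their completions without per-operation interrupts.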
Work request completion is signaled through completion queues (CQs), which the application monitors via polling or event notification to retrieve completion queue elements (CQEs) containing status, opcode, and byte count details.[16] CQs enable scalable, low-overhead notification without relying on interrupts, supporting high-throughput scenarios by allowing multiple QPs to share a single CQ. RDMA defines transport semantics to balance reliability, ordering, and overhead. The Reliable Connected (RC) service establishes a dedicated connection between QPs, guaranteeing in-order delivery, exactly-once semantics, and flow control through hardware acknowledgments and retransmissions.[17] In contrast, the Unreliable Datagram (UD) service provides connectionless, best-effort delivery akin to UDP, with no ordering or reliability guarantees but minimal setup overhead, ideal for fire-and-forget messaging.[18] This hardware-centric design yields significant latency reductions, approximated as

RDMA latency ≈ RNIC processing time + network transit time

with end-to-end latency typically under 5 μs for small messages in local clusters, versus over 100 μs for equivalent TCP/IP transfers involving kernel traversal and buffering.[16][19]
History and Development
Origins and Early Standards
Remote direct memory access (RDMA) emerged in the 1990s as a response to performance bottlenecks in high-performance computing (HPC) environments, particularly in cluster-based supercomputing systems where traditional network interfaces incurred high latency due to operating system kernel involvement in data transfers. These bottlenecks limited scalability in parallel applications, such as scientific simulations and large-scale data processing, by introducing overheads from context switches and data copying between user and kernel spaces. Early research in user-level networking, including projects like U-Net at Cornell University, highlighted the need for direct hardware access to memory without CPU intervention to achieve low-latency, high-bandwidth communication in distributed systems. A pivotal early standardization effort was the Virtual Interface Architecture (VIA), a software specification developed by Compaq, Intel, and Microsoft to enable protected, user-level networking over system area networks (SANs). Released in version 1.0 on December 16, 1997, VIA provided abstractions for zero-copy data transfers and remote memory operations, aiming to reduce communication latency for HPC clusters and enterprise applications like transaction processing.[20] By allowing applications to directly manage network interfaces via virtual interfaces and completion queues, VIA addressed key limitations of kernel-mediated networking, influencing subsequent RDMA designs.[21] Building on VIA's concepts, the InfiniBand Trade Association (IBTA) was formed in August 1999 by industry leaders including Intel, Microsoft, Dell, Hewlett-Packard, IBM, and Mellanox to develop a unified architecture for high-speed interconnects in HPC and data centers.
The InfiniBand Architecture (IBA) specification version 1.0 was released in October 2000, defining a switched fabric protocol with native support for RDMA operations like send/receive and direct memory writes/reads to bypass CPU and OS involvement.[22] Initial hardware implementations followed shortly, with Mellanox shipping the first InfiniBand devices, such as the InfiniBridge MT21108 host channel adapter, in January 2001, enabling practical deployment in supercomputing clusters.[23] Intel contributed significantly to early InfiniBand development through its involvement in the IBTA and silicon design efforts.[24]
Evolution and Adoption Milestones
The standardization of iWARP by the Internet Engineering Task Force (IETF) in 2007 marked an early milestone in extending RDMA capabilities over standard TCP/IP networks, with RFC 5040 defining the core Remote Direct Memory Access Protocol (RDMAP) and related specifications (RFC 5041–5044) enabling direct data placement and framing over reliable transports.[25] This laid the groundwork for Ethernet-based RDMA implementations, broadening accessibility beyond proprietary fabrics. In 2010, the InfiniBand Trade Association (IBTA) released the initial RoCE specification (v1), integrating RDMA semantics directly into Ethernet frames to leverage existing data center infrastructure without requiring specialized hardware. This was followed by RoCE v2 in September 2014, which added routable IP/UDP encapsulation to support Layer 3 network traversal, enhancing scalability in multi-subnet environments.[26] Operating system adoption accelerated RDMA's integration into mainstream computing. Linux kernels began supporting RDMA features in the late 2000s, with initial NFS/RDMA client implementation in version 2.6.24 (December 2007) and server support in 2.6.25 (April 2008), enabling efficient file system operations over RDMA fabrics.[27] Microsoft introduced native RDMA support in Windows Server 2012 via SMB Direct, allowing low-CPU file sharing over RDMA-capable adapters for storage and clustering workloads.[28] Virtualization platforms followed suit, with VMware integrating paravirtual RDMA (PVRDMA) in vSphere 6.5 (October 2016), permitting virtual machines to access RDMA hardware for high-throughput networking.[29] By the 2010s, RDMA had achieved widespread adoption in high-performance computing (HPC), powering a majority of Top500 supercomputers through InfiniBand and emerging Ethernet variants, driven by demands for low-latency interconnects in scientific simulations and big data processing.[30] Market momentum surged in 2018 with the proliferation of 100 Gbit/s RDMA hardware 
from vendors like Broadcom and Supermicro, enabling cost-effective scaling for enterprise clusters and reducing latency bottlenecks in bandwidth-intensive applications.[31] In April 2020, NVIDIA completed its $7 billion acquisition of Mellanox, enhancing RDMA and InfiniBand integration with GPU technologies for AI and HPC workloads.[32] Post-2020, RDMA adoption boomed in data centers, fueled by AI/ML workloads requiring ultra-low latency data movement, with the RDMA networking market expanding from approximately $1 billion prior to 2021 to over $6 billion in 2023.[33] As of June 2024, RDMA-based networks powered over 90% of TOP500 supercomputers, and the market is projected to exceed $22 billion by 2028, driven by continuing AI/ML demand.[30] Intel's Omni-Path Architecture, announced in November 2014 and commercially released in 2015, emerged as a cost-competitive alternative to InfiniBand, offering 100 Gbit/s throughput with lower latency and power consumption for HPC fabrics, and remained in use through the 2020s despite the eventual discontinuation of further generations.[34]
Protocols and Implementations
InfiniBand and RoCE
InfiniBand serves as a foundational protocol for remote direct memory access (RDMA), defined as a channel-based interconnect architecture that employs a switched fabric topology to enable high-speed connectivity between servers and storage systems.[3][35] This topology facilitates scalable, point-to-point communication with minimal latency, supporting data rates up to 800 Gbit/s via the eXtended Data Rate (XDR) standard as of 2025.[36] InfiniBand ensures lossless data transmission through credit-based flow control, where receivers grant credits to senders to manage buffer usage and prevent packet drops.[37] The architecture is governed by specifications from the InfiniBand Trade Association (IBTA), with ongoing revisions such as Volume 1 Release 2.0 in 2025, which enhance switch density, scalability, and memory placement for reduced latency.[38] RoCE, or RDMA over Converged Ethernet, adapts InfiniBand's RDMA capabilities to standard Ethernet networks, allowing efficient, low-latency data transfers over converged infrastructure.[39] It exists in two versions: RoCE v1, which operates solely at Ethernet Layer 2 within a single broadcast domain, and RoCE v2, which is routable across Layer 3 networks using IP and UDP encapsulation for broader scalability.[40] Like InfiniBand, RoCE requires lossless Ethernet environments, achieved through mechanisms such as Priority-based Flow Control (PFC) to avoid packet loss and maintain performance.[39] The primary differences between InfiniBand and RoCE lie in their underlying hardware and deployment: InfiniBand utilizes dedicated native hardware for its fabric, providing optimized, purpose-built performance, whereas RoCE overlays RDMA functionality onto existing Ethernet infrastructure, leveraging commodity switches and NICs for cost-effective integration.[41] Despite these distinctions, both protocols share the IBTA Verbs API, a standardized interface for managing RDMA operations like queue pairs and work requests.[42] RoCE 
standards emerged as IBTA supplements: the initial RoCE specification was released in 2010, and RoCE v2 was formalized in 2014 to address the routing limitations of the earlier version.[43]
iWARP, Omni-Path, and Emerging Standards
iWARP, or Internet Wide Area RDMA Protocol, is a standards-based implementation of RDMA that operates over TCP/IP, enabling direct memory access across standard Ethernet networks without requiring specialized lossless fabrics.[1] Defined by the Internet Engineering Task Force (IETF) in 2007, iWARP consists of a layered protocol stack including the Remote Direct Memory Access Protocol (RDMAP) for RDMA operations, the Direct Data Placement (DDP) protocol for efficient data transfer into application buffers, and the Marker PDU Aligned Framing (MPA) for TCP framing to ensure reliable delivery.[1][44] This approach leverages existing Ethernet infrastructure, avoiding the need for priority flow control or other enhancements mandated by protocols like RoCE, but introduces higher protocol overhead due to TCP's reliability mechanisms, such as acknowledgments and retransmissions.[1] iWARP's design prioritizes compatibility with conventional IP networks, making it suitable for enterprise and wide-area deployments where hardware modifications are impractical. Omni-Path Architecture (OPA), introduced by Intel in 2014, is a proprietary high-performance interconnect designed specifically for scalability in high-performance computing (HPC) environments, offering RDMA capabilities with low latency and high bandwidth. OPA supports data rates up to 100 Gbps per port in its initial generation, with optimizations for message rates exceeding 10 million per second and end-to-end latencies under 1 microsecond, positioning it as a cost-effective alternative to InfiniBand for large-scale clusters. The architecture employs the Omni-Path Interface (OPI) specification, which defines a standardized electrical and protocol interface for host fabric adapters and switches, facilitating interoperability among components while emphasizing power efficiency and fabric manageability. 
Although plans for a 200 Gbps second-generation OPA were announced, Intel discontinued development in 2019, shifting focus to other interconnect technologies. OPA's fabric supports up to thousands of nodes with features like adaptive routing and congestion control, enhancing reliability in HPC workloads. Emerging standards are extending RDMA's reach into Ethernet-centric and software-defined environments, particularly for AI and cloud-scale applications. The Ultra Ethernet Consortium (UEC), formed in 2023 by industry leaders including Intel, AMD, and Broadcom, is developing an open Ethernet-based specification optimized for AI and HPC, featuring the Ultra Ethernet Transport (UET) protocol as a modern RDMA alternative to RoCE.[45] UEC 1.0, released in June 2025, introduces RDMA enhancements with intelligent low-latency transport, congestion control tailored for high-throughput AI training, and IP-routable packet structures to support massive-scale clusters without proprietary fabrics.[46] Complementing hardware advancements, Soft-RoCE provides a software-emulated RDMA implementation over standard Ethernet, allowing systems without dedicated RDMA hardware to perform direct memory transfers via kernel drivers like those in Linux.[47] This emulation layer maps RDMA verbs onto UDP transport, enabling testing and deployment in virtualized or legacy environments, with performance approaching hardware solutions for smaller-scale use cases.[48] These developments reflect ongoing IETF and industry efforts to evolve RDMA standards for broader interoperability and efficiency in diverse networking ecosystems.[49]
Technical Mechanisms
Memory Access and Data Transfer
Remote Direct Memory Access (RDMA) supports several core operations for efficient data transfer between nodes, categorized into one-sided and two-sided semantics. One-sided operations, such as RDMA Read and RDMA Write, enable direct access to remote memory without involving the remote CPU, allowing the initiator to pull data (RDMA Read) from or push data (RDMA Write) to a specified remote memory region using a provided remote key (rkey).[50][51] Two-sided operations, including Send and Receive, function like message passing, where the sender posts a Send work request to deliver data to a pre-posted Receive buffer on the remote side, requiring coordination and CPU involvement on both ends for completion signaling.[50][51] Additionally, atomic operations, such as Compare-and-Swap or Fetch-and-Add, provide one-sided mechanisms for synchronized remote memory updates, where the RNIC performs the operation atomically and returns the result to the initiator without remote CPU intervention.[51] Before performing RDMA operations, applications must register memory regions to enable safe direct access by the RDMA Network Interface Card (RNIC). 
This registration process pins the specified virtual memory pages in physical memory to prevent paging, maps virtual addresses to physical ones for the RNIC, and assigns permissions such as local read/write or remote access types (e.g., remote read, write, or atomic).[50] Upon successful registration, the RNIC generates a local key (lkey) for the application to reference the region in local operations and a remote key (rkey) to share with remote peers, which serves as an access control token to validate and authorize incoming remote requests.[50] This pinning ensures data integrity during transfers but consumes system resources, as registered regions remain fixed in physical memory until deregistered.[50] Data transfer in RDMA begins when the initiator application posts a work request (WR) describing the operation—such as the remote address, length, and rkey for one-sided verbs—to a send queue within a queue pair (QP), a paired set of queues for communication between endpoints.[52] The RNIC then autonomously processes the WR by initiating Direct Memory Access (DMA) engines to transfer data directly between the local and remote memory regions, bypassing the host CPU for both data movement and protocol processing in one-sided operations.[53] For two-sided operations, the remote side must have a corresponding Receive WR posted, after which the RNIC signals completion via completion queues for error handling and synchronization.[51] This flow achieves low-latency transfers by offloading all network and memory operations to hardware. 
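The registration and key-check logic above can be illustrated with a small toy model (plain Python; names such as register_region and remote_write are invented for this sketch, and a real RNIC performs these checks in hardware): registration records a region's bounds and permissions and hands back lkey/rkey handles, and an incoming one-sided write is applied only if its rkey matches a registered region that permits remote writes and the target range stays inside that region.

```python
import secrets

class ToyRNIC:
    """Tracks registered memory regions and validates remote access by rkey."""
    def __init__(self):
        self.regions = {}                 # rkey -> (start, length, permissions)
        self.memory = bytearray(1 << 16)  # stand-in for pinned host memory

    def register_region(self, start, length, permissions):
        # a real ibv_reg_mr() would also pin the pages and program the NIC's
        # address-translation tables; here we only record bounds and rights
        lkey = secrets.randbits(32)
        rkey = secrets.randbits(32)
        self.regions[rkey] = (start, length, frozenset(permissions))
        return lkey, rkey

    def remote_write(self, rkey, addr, payload):
        """Validate and apply an incoming one-sided RDMA Write."""
        region = self.regions.get(rkey)
        if region is None:
            return "INVALID_RKEY"         # unknown access token
        start, length, perms = region
        if "REMOTE_WRITE" not in perms:
            return "ACCESS_ERROR"         # region not opened for remote writes
        if addr < start or addr + len(payload) > start + length:
            return "OUT_OF_BOUNDS"        # range escapes the registered region
        self.memory[addr:addr + len(payload)] = payload
        return "SUCCESS"

rnic = ToyRNIC()
lkey, rkey = rnic.register_region(0x1000, 4096, {"LOCAL_READ", "REMOTE_WRITE"})
ok = rnic.remote_write(rkey, 0x1000, b"hello")
bad_key = rnic.remote_write(rkey ^ 1, 0x1000, b"hello")
bad_range = rnic.remote_write(rkey, 0x1000 + 4093, b"hello")
```

The rkey thus acts as a capability: possession of the correct key, together with in-bounds addressing and matching permissions, is all the remote peer needs, which is why the checks can be executed entirely in NIC hardware without consulting the host CPU.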
RDMA's efficiency stems from minimal protocol overhead: theoretical maximum throughput is the link speed multiplied by the ratio of payload size to total frame size, where the frame adds headers and encapsulation on top of the payload.[54] For instance, on 100 Gbit/s links, RDMA protocols commonly achieve approximately 95% efficiency under optimal conditions with large payloads and lossless networks, approaching line rate while reducing CPU utilization to near zero.[54][55]
APIs and Queue Management
The Verbs API, standardized by the InfiniBand Trade Association (IBTA) in the InfiniBand Architecture Specification, provides a user-space programming interface for RDMA operations on InfiniBand and RoCE networks. It enables applications to directly manage hardware resources and initiate transfers without kernel involvement, supporting functions for resource allocation, work request posting, and event polling. This API is foundational for high-performance networking, allowing developers to implement efficient data movement semantics.[56] Central to the Verbs API are functions for posting and completing work requests, such as ibv_post_send() to enqueue send operations on a queue pair's send queue and ibv_post_recv() for receive operations on the receive queue. Completion events are retrieved via ibv_poll_cq(), which dequeues work completion structures containing status, opcode, and byte counts from a completion queue. These mechanisms ensure asynchronous, non-blocking operation, with signaled work requests generating completion entries that can optionally raise event notifications.[57]
Queue management encompasses the creation and configuration of core RDMA resources: queue pairs (QPs), completion queues (CQs), and protection domains (PDs). Queue pairs, created with ibv_create_qp(), represent bidirectional communication endpoints comprising a send queue for outgoing work requests and a receive queue for incoming ones; they support transport types like reliable connection (RC) or unreliable datagram (UD). Completion queues, allocated via ibv_create_cq(), hold entries for finished work requests from one or more QPs, with polling removing entries to track progress. Protection domains, obtained through ibv_alloc_pd(), enforce isolation by grouping QPs, memory regions, and address handles, restricting network access to authorized host memory and preventing unauthorized reads or writes. These resources are destroyed with corresponding deallocation functions like ibv_destroy_qp() and ibv_destroy_cq() upon completion.[50]
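The isolation role of protection domains can be shown with a short Python sketch (a toy model with invented names, not the C Verbs API): every QP and memory region is created within exactly one PD, and a work request is rejected unless the memory region it references belongs to the same PD as the QP, mirroring the local protection error a real RNIC would report.

```python
import itertools

_ids = itertools.count(1)   # toy id/key generator; real keys come from the RNIC

class ProtectionDomain:
    """Groups QPs and memory regions; access across PDs is disallowed."""
    def __init__(self):
        self.pd_id = next(_ids)

class MemoryRegion:
    def __init__(self, pd, length):
        self.pd = pd
        self.length = length
        self.lkey = next(_ids)

class QueuePair:
    def __init__(self, pd):
        self.pd = pd

    def post_send(self, mr, offset, length):
        """Accept the WR only if the MR is in this QP's PD and in bounds."""
        if mr.pd is not self.pd:
            return "LOCAL_PROTECTION_ERROR"   # cf. IBV_WC_LOC_PROT_ERR
        if offset + length > mr.length:
            return "LOCAL_LENGTH_ERROR"
        return "POSTED"

pd_a, pd_b = ProtectionDomain(), ProtectionDomain()
qp = QueuePair(pd_a)                  # QP lives in pd_a
mr_same = MemoryRegion(pd_a, 4096)    # registered in the same PD
mr_other = MemoryRegion(pd_b, 4096)   # registered in a different PD
```

Because the PD check is a simple identity comparison performed at post time, the hardware can enforce it on every work request without measurable cost, which is how PDs restrict network access to authorized host memory.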
Work completions (WCs) are handled by applications polling CQs, where each WC includes fields for status, vendor error codes, and completion flags to indicate success or failure. Multiple QPs can share a CQ for efficiency, but overflow events trigger IBV_EVENT_CQ_ERR if the queue fills without polling. Protection domains integrate with memory registration to validate access rights during operations, ensuring faults like invalid keys result in controlled errors rather than crashes.[58]
The primary library for Verbs API implementation on Linux is libibverbs, part of the OpenFabrics Enterprise Distribution (OFED), which abstracts hardware-specific drivers for InfiniBand, RoCE, and iWARP. It supports user-space direct access via the ib_uverbs kernel module, enabling low-latency operations. On Windows, the Network Direct Kernel Provider Interface (NDKPI), an NDIS extension, delivers a Verbs-compatible API for RDMA, allowing independent hardware vendors to implement kernel-mode support for protocols like RoCE and iWARP. Verbs extensions for Ethernet, such as those in libibverbs-rocee, facilitate RDMA over converged networks by mapping InfiniBand semantics to Ethernet transports.[59][60][61]
Error handling in the Verbs API centers on completion status codes within WCs, with IBV_WC_SUCCESS denoting successful completion and codes like IBV_WC_LOC_QP_OP_ERR or IBV_WC_REM_INV_REQ_ERR signaling local or remote errors such as queue underflow or invalid requests. On reliable connections, negative acknowledgments (NAKs) from the remote side, such as for sequence errors or receiver not ready (RNR), are retried by the hardware up to configured limits; once those limits are exhausted, the work request completes with a status such as IBV_WC_RETRY_EXC_ERR or IBV_WC_RNR_RETRY_EXC_ERR, and recovery falls to the application. Events like IBV_EVENT_QP_FATAL indicate irrecoverable QP errors, requiring resource recreation.[62][63]
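The RNR case in particular can be sketched as follows (a plain Python toy; in real hardware the retry loop runs inside the RNIC, governed by the QP's rnr_retry attribute, and the names below are illustrative): a send targeting a peer with no posted receive buffer is retried a bounded number of times, and only after the limit is exhausted does the requester see a retry-exceeded completion status.

```python
from collections import deque

class ToyReceiver:
    def __init__(self):
        self.receive_queue = deque()   # posted receive buffers

    def deliver(self, message):
        """Return an ACK if a receive buffer is posted, else an RNR NAK."""
        if not self.receive_queue:
            return "RNR_NAK"
        buf = self.receive_queue.popleft()
        buf.append(message)            # place the message in the posted buffer
        return "ACK"

def send_with_rnr_retry(receiver, message, rnr_retry=3):
    """Model the RNIC's bounded RNR retry loop on a reliable connection."""
    for _ in range(1 + rnr_retry):     # first attempt plus rnr_retry retries
        if receiver.deliver(message) == "ACK":
            return "SUCCESS"                 # cf. IBV_WC_SUCCESS
    return "RNR_RETRY_EXCEEDED"              # cf. IBV_WC_RNR_RETRY_EXC_ERR

peer = ToyReceiver()
first = send_with_rnr_retry(peer, "msg-1")   # no receive buffer posted yet
peer.receive_queue.append([])                # application posts a receive
second = send_with_rnr_retry(peer, "msg-2")
```

This is why two-sided RDMA applications pre-post receive work requests before their peers start sending: the hardware absorbs transient RNR conditions, but a persistently empty receive queue surfaces as an error the application must handle.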