io_uring
io_uring is a Linux-specific asynchronous input/output (I/O) interface designed to enable efficient, scalable handling of I/O operations by applications through shared ring buffers between user space and kernel space.[1] It allows users to submit multiple I/O requests asynchronously via a submission queue (SQ) and retrieve completions from a completion queue (CQ), minimizing system call overhead and supporting operations on files, sockets, and other descriptors.[1] Developed by kernel developer Jens Axboe to address the limitations of prior asynchronous I/O mechanisms like Linux AIO—which suffered from high overhead, restricted support for buffered I/O, and scalability issues—io_uring was first integrated into the Linux kernel with version 5.1, released in May 2019.[2] The interface is set up using the io_uring_setup() system call, which creates a file descriptor for mapping the SQ and CQ rings into user space via mmap(), enabling direct access without repeated context switches.[2]
Key advantages of io_uring include its ability to batch submissions for reduced latency, support for polling modes to avoid blocking, and compatibility with buffered I/O when data resides in the page cache, outperforming traditional AIO in scenarios requiring high throughput.[2] It provides a growing set of operation codes (opcodes) for tasks such as reading/writing vectors (IORING_OP_READV, IORING_OP_WRITEV), file synchronization (IORING_OP_FSYNC), network messaging (IORING_OP_SENDMSG, IORING_OP_RECVMSG), and more advanced features like timeouts (IORING_OP_TIMEOUT) and operation linking for dependencies, with expansions continuing in kernels up to version 6.x.[3] Additional capabilities, such as registered buffers for zero-copy transfers, further enhance its performance for demanding workloads in databases, web servers, and storage systems.[1] Since its introduction, io_uring has gained widespread adoption, evidenced by its use in projects like MariaDB's InnoDB storage engine via the liburing library,[4] and ongoing kernel enhancements reflect its role as a foundational technology for modern Linux I/O.[3]
Fundamentals
Overview
io_uring is a scalable asynchronous I/O API for the Linux kernel, designed to handle file and network operations efficiently.[1] Introduced in Linux kernel version 5.1 in May 2019, it provides a modern interface for submitting and completing I/O requests without blocking the calling process.[5] Developed by Jens Axboe, io_uring addresses limitations in prior asynchronous I/O mechanisms by enabling high-throughput operations in performance-critical applications.[2]

The primary purpose of io_uring is to improve scalability in multi-threaded environments, particularly for workloads involving frequent I/O, by reducing the number of system calls and associated context switches.[1] Traditional synchronous I/O requires per-operation syscalls, which become bottlenecks under high concurrency, while earlier asynchronous options like POSIX AIO suffer from overhead in completion notification. io_uring mitigates these issues through batched submissions and efficient polling mechanisms, allowing applications to process thousands of I/O requests with minimal kernel-user transitions.[2]

At its core, io_uring relies on two shared ring buffers mapped into user space: the submission queue (SQ) for enqueueing I/O requests and the completion queue (CQ) for retrieving results.[1] Users populate submission queue entries (SQEs) in the SQ ring, which the kernel consumes asynchronously; upon completion, the kernel appends completion queue entries (CQEs) to the CQ ring for user-side polling.[2] This workflow enables zero-copy communication between user space and the kernel, supporting both buffered and direct I/O operations across filesystems and sockets.[6]
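The submit-and-complete cycle can be illustrated with the liburing helper library, which wraps the raw system calls. The following minimal sketch (built with -luring; the input file name, buffer size, and user_data tag are illustrative assumptions) queues one read, submits it with a single system call, and waits for its completion:

```c
#include <fcntl.h>
#include <stdio.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;
    char buf[4096];

    if (io_uring_queue_init(8, &ring, 0) < 0)            /* create SQ/CQ rings */
        return 1;

    int fd = open("/etc/hostname", O_RDONLY);             /* illustrative input file */
    if (fd < 0)
        return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);   /* grab a free SQE */
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);     /* describe the read */
    sqe->user_data = 42;                                   /* tag echoed in the CQE */

    io_uring_submit(&ring);                                /* one syscall submits it */

    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) == 0) {             /* block until completion */
        if (cqe->res >= 0)
            printf("request %llu read %d bytes\n",
                   (unsigned long long)cqe->user_data, cqe->res);
        io_uring_cqe_seen(&ring, cqe);                     /* advance the CQ head */
    }

    io_uring_queue_exit(&ring);
    return 0;
}
```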
Advantages over Prior APIs
Previous asynchronous and event-driven I/O mechanisms in Linux, such as epoll, POSIX AIO, and the kernel's native AIO interface, impose significant limitations on scalability and efficiency in high-throughput scenarios. Epoll offers scalable event notification for monitoring file descriptor readiness, particularly for sockets and network I/O, but relies on a readiness-notification model that necessitates separate system calls like readv or writev for actual data transfer after notifications. This decoupled approach incurs repeated context switches and syscall overhead, especially burdensome when managing numerous file descriptors.[2]

Linux's native AIO, accessed through the libaio library wrapping the kernel's kaio interface, enables asynchronous file I/O but suffers from per-operation system calls for submissions and completions, with limited native batching capabilities. These design choices result in high overhead and poor scalability, as each I/O request typically requires two syscalls, constraining performance to approximately 151 KIOPS on 2 cores with a queue depth of 128 in controlled benchmarks. Additionally, libaio provides inadequate support for buffered I/O and lacks efficient multi-queue handling in threaded environments, often necessitating fixed buffer allocations that increase memory pressure.[7][2]

io_uring overcomes these constraints through its ring buffer architecture, which shares submission and completion queues between user space and kernel via memory established with mmap, allowing batched submissions of up to 4096 entries in one io_uring_enter(2) syscall to minimize invocations. This batching, combined with support for both blocking and non-blocking modes, enables handling thousands of concurrent I/Os without proportional increases in threads or syscalls, achieving up to 182 KIOPS under benchmark conditions similar to those used for libaio.[1][7] Moreover, io_uring facilitates zero-copy operations using registered buffers and fixed files, reducing data movement between user and kernel spaces, and performs buffered I/O without context switches when data resides in the page cache. These features provide a unified interface for file and network I/O, surpassing epoll's notification-only scope and native AIO's file-centric limitations, while offering better multi-queue support for threaded scalability and lower overall memory usage compared to libaio's fixed buffers.[2][1]
Architecture
Ring Buffers
io_uring utilizes two shared circular ring buffers to facilitate communication between user space and the kernel: the submission queue (SQ), which holds user-submitted I/O requests in the form of submission queue entries (SQEs) defined by struct io_uring_sqe, and the completion queue (CQ), which stores kernel-generated completions via completion queue entries (CQEs) defined by struct io_uring_cqe.[1][2]
The SQ layout encompasses an array of SQEs, each containing fields such as opcode (specifying the I/O operation), fd (file descriptor), off (offset), addr (buffer address), len (length), user_data (application-specific identifier), and flags (operation modifiers), along with atomic head and tail pointers for synchronization and an indirection array of indices to the actual SQEs for efficient access.[2][1] The CQ layout includes an array of CQEs, where each entry features user_data (echoing the SQE's identifier), res (result code or byte count), and flags (completion indicators), managed similarly with atomic head and tail pointers to track processed and pending events.[2][1]
These ring buffers are mapped into user space memory using mmap(2) on the file descriptor returned by io_uring_setup(2), with offsets like IORING_OFF_SQ_RING for the SQ ring, IORING_OFF_SQES for the SQE array, and IORING_OFF_CQ_RING for the CQ; a single mmap call suffices when the IORING_FEAT_SINGLE_MMAP feature is available (Linux kernel 5.4+), allowing the kernel to write completions directly into user space without data copying.[1][2]
Ring sizes are set via the entries parameter in io_uring_setup(2), specifying the SQ capacity (with the CQ typically twice that size); the value is rounded up to the next power of two, has a practical minimum of 8 entries, and is capped at 32,768 (IORING_MAX_ENTRIES), with the IORING_SETUP_CLAMP flag causing oversized requests to be clamped to the maximum rather than rejected.[8][2] When submission queue polling (IORING_SETUP_SQPOLL) is in use, the application must check the IORING_SQ_NEED_WAKEUP flag in the SQ ring, which signals that the kernel polling thread has gone idle and requires explicit waking via io_uring_enter(2).[1]
Head and tail management ensures lock-free operation: the user increments the SQ tail after adding SQEs, the kernel advances the SQ head upon consumption and the CQ tail upon completion posting, while the user progresses the CQ head after retrieving events; to poll for completions, the user invokes io_uring_enter(2) with the IORING_ENTER_GETEVENTS flag to harvest available CQEs without blocking.[1][9]
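The head/tail protocol can be sketched directly against the raw interface. In the fragment below, the pointers are assumed to have been derived from the cq_off offsets of a CQ ring previously mapped with mmap(2) (see the setup example in the Interface section); the struct and function names are illustrative rather than part of any library, and GCC-style atomic builtins stand in for the required memory barriers:

```c
#include <stdio.h>
#include <linux/io_uring.h>

/* Pointers into the mmap()ed CQ ring, computed from io_uring_params.cq_off
 * (head, tail, ring_mask, cqes) after io_uring_setup(2).  Names illustrative. */
struct cq_view {
    unsigned            *khead;      /* advanced by the consumer (user space) */
    unsigned            *ktail;      /* advanced by the producer (the kernel) */
    unsigned            *kring_mask; /* ring entries minus one (power of two) */
    struct io_uring_cqe *cqes;       /* the CQE array itself                  */
};

/* Drain every completion currently visible, without blocking. */
static unsigned drain_cq(struct cq_view *cq)
{
    unsigned head = *cq->khead;
    /* Acquire pairs with the kernel's release store of the tail, so the CQE
     * contents are visible before they are read. */
    unsigned tail = __atomic_load_n(cq->ktail, __ATOMIC_ACQUIRE);
    unsigned seen = 0;

    while (head != tail) {
        struct io_uring_cqe *cqe = &cq->cqes[head & *cq->kring_mask];
        printf("user_data=%llu res=%d\n",
               (unsigned long long)cqe->user_data, cqe->res);
        head++;
        seen++;
    }
    /* Release store publishes the new head so the kernel may reuse the slots. */
    __atomic_store_n(cq->khead, head, __ATOMIC_RELEASE);
    return seen;
}
```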
Submission and Completion Mechanics
In io_uring, the submission process begins when the user application populates a submission queue entry (SQE) in the shared submission queue ring buffer, specifying the desired operation through fields such as the opcode (e.g., IORING_OP_READV for vectored reads), the file descriptor (fd), buffer addresses (addr and len), and an optional user-defined user_data value for tracking the request.[1] The application then advances the submission queue tail pointer atomically using a release memory order to signal availability to the kernel, ensuring visibility of the new SQE without immediate system calls in polled modes.[1] To trigger kernel processing of pending SQEs, the application invokes the io_uring_enter system call, passing the ring file descriptor, the number of SQEs to submit, and optional flags like IORING_ENTER_GETEVENTS to also harvest completions in a single call.[1] This design enables batching multiple submissions efficiently, reducing context switches compared to traditional asynchronous I/O interfaces.[10]
Upon completion of the requested operation, the kernel appends a completion queue entry (CQE) to the shared completion queue ring buffer, which includes the original user_data for request identification, a res field indicating the result (positive for bytes transferred in success cases, or negative as -errno for errors), and flags for additional metadata.[1] The user application retrieves CQEs by polling the completion queue head pointer, processing entries as they become available (completions may arrive out of submission order), and advancing the head atomically after consumption to acknowledge readiness for more.[1] This polling can occur via io_uring_wait_cqe for blocking waits or through io_uring_enter with event flags for non-blocking level-triggered notifications, allowing the application to handle completions asynchronously without dedicated threads in many scenarios.[1]
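Using the liburing helpers, this batching and out-of-order completion handling might be sketched as follows; the open file descriptor, request count, and block size are assumptions, and io_uring_submit() issues the single io_uring_enter(2) call for all queued SQEs:

```c
#include <stdio.h>
#include <liburing.h>

#define NREQS 4
#define BLK   4096

/* Queue NREQS reads at consecutive offsets, submit them together, then reap
 * completions that may arrive in any order.  'fd' is an already-open file. */
static int batched_reads(struct io_uring *ring, int fd)
{
    static char bufs[NREQS][BLK];

    for (int i = 0; i < NREQS; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (!sqe)
            return -1;                        /* SQ ring is full */
        io_uring_prep_read(sqe, fd, bufs[i], BLK, (__u64)i * BLK);
        sqe->user_data = i;                   /* tag identifying this request */
    }

    io_uring_submit(ring);                    /* one syscall for all NREQS SQEs */

    for (int done = 0; done < NREQS; done++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(ring, &cqe) < 0)
            return -1;
        if (cqe->res < 0)                     /* res carries -errno on failure */
            fprintf(stderr, "request %llu failed: %d\n",
                    (unsigned long long)cqe->user_data, cqe->res);
        else
            printf("request %llu: %d bytes\n",
                   (unsigned long long)cqe->user_data, cqe->res);
        io_uring_cqe_seen(ring, cqe);         /* advance the CQ head */
    }
    return 0;
}
```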
io_uring supports multishot operations for certain opcodes, such as IORING_OP_ACCEPT for repeated socket accepts, where a single SQE can generate multiple CQEs until explicitly canceled, with the IORING_CQE_F_MORE flag in subsequent CQEs signaling ongoing activity.[1] Error handling is integrated into the CQE res field, where values less than zero denote failures (e.g., -EINVAL for invalid arguments), and the absence of a separate errno mechanism aligns with the asynchronous model by embedding all status directly in the queue.[1] For buffer management in fixed buffer registrations, the IORING_CQE_F_BUFFER flag in the CQE provides a buffer ID in its upper bits, enabling efficient reuse without per-completion lookups.[1]
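Multishot behaviour can be sketched with liburing's io_uring_prep_multishot_accept() helper, which requires a reasonably recent kernel (5.19 or later) and liburing release; the listening socket and the user_data tag below are assumptions:

```c
#include <stdio.h>
#include <liburing.h>

/* Arm one multishot accept on a listening socket: the single SQE keeps
 * producing one CQE per accepted connection until it is cancelled or fails.
 * 'listen_fd' is assumed to be a bound, listening socket. */
static void arm_multishot_accept(struct io_uring *ring, int listen_fd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
    sqe->user_data = 1;                        /* tag shared by all its CQEs */
    io_uring_submit(ring);
}

static void reap_connections(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    while (io_uring_wait_cqe(ring, &cqe) == 0) {
        int more = cqe->flags & IORING_CQE_F_MORE;  /* check before releasing */
        if (cqe->res >= 0)
            printf("accepted fd %d%s\n", cqe->res, more ? " (more to come)" : "");
        else
            fprintf(stderr, "accept failed: %d\n", cqe->res);
        io_uring_cqe_seen(ring, cqe);
        if (!more)
            break;                             /* multishot ended; re-arm if needed */
    }
}
```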
Request cancellation is facilitated by submitting a dedicated SQE with the IORING_OP_ASYNC_CANCEL opcode, targeting either a specific user_data value or all in-flight requests via flags like IORING_ASYNC_CANCEL_ALL, allowing applications to abort operations dynamically during runtime flows.[1] The kernel processes cancellations asynchronously, posting a CQE with res set to zero on success, -ENOENT if the target was not found, or -EALREADY if already completed.[1] This mechanism ensures safe interruption without blocking, maintaining the overall non-blocking nature of io_uring's submission-completion cycle.[10]
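A sketch of runtime cancellation with liburing: io_uring_prep_cancel() emits an IORING_OP_ASYNC_CANCEL SQE aimed at the user_data tag of an earlier request. The tag handling and reporting here are illustrative:

```c
#include <stdint.h>
#include <stdio.h>
#include <liburing.h>

/* Cancel an in-flight request that was tagged with 'target_tag' when it was
 * submitted.  The cancel is itself asynchronous and produces its own CQE
 * (res 0, -ENOENT or -EALREADY); if the target is aborted, it also completes,
 * typically with res == -ECANCELED. */
static void cancel_request(struct io_uring *ring, void *target_tag)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    io_uring_prep_cancel(sqe, target_tag, 0);   /* IORING_OP_ASYNC_CANCEL;
                                                   IORING_ASYNC_CANCEL_ALL would
                                                   target every matching request */
    sqe->user_data = (uintptr_t)sqe;            /* arbitrary tag for the cancel CQE */
    io_uring_submit(ring);

    /* The next completion may belong to the cancel request or to the cancelled
     * target; match on user_data to tell them apart. */
    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(ring, &cqe) == 0) {
        printf("user_data=%llu res=%d\n",
               (unsigned long long)cqe->user_data, cqe->res);
        io_uring_cqe_seen(ring, cqe);
    }
}
```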
Interface
Setup and Initialization
The io_uring instance is initialized using the io_uring_setup() system call, which creates a submission queue (SQ) and a completion queue (CQ) with at least the specified number of entries and returns a file descriptor referencing the instance.[8] This file descriptor serves as the handle for subsequent operations, such as mapping the queues into user space and registering resources. The call takes two arguments: an unsigned 32-bit integer entries denoting the minimum queue size (typically a power of two, up to IORING_MAX_ENTRIES (32768 since Linux 5.4)[11]), and a pointer to a struct io_uring_params for configuration options and kernel-provided feedback.[8]
The struct io_uring_params allows fine-tuning of the instance's behavior through its fields and flags. Key flags in the flags field include IORING_SETUP_IOPOLL, which enables busy-polling for I/O completion on supported devices (requiring direct I/O and device-side polling support), and IORING_SETUP_SQPOLL, which dedicates a kernel thread to poll the SQ, eliminating the need for user-space system calls to submit requests.[8] For SQ polling configurations, fields like sq_thread_cpu specify the preferred CPU for the polling thread (or UINT_MAX for kernel selection, added in Linux 5.16), while sq_thread_idle sets the idle timeout in milliseconds before the thread sleeps (0 disables idling, also added in Linux 5.16).[8] Users can also request a custom CQ size via the IORING_SETUP_CQSIZE flag combined with setting cq_entries (which must exceed the SQ size and be rounded to the next power of two). Upon return, the kernel populates fields like sq_entries and cq_entries with the actual allocated sizes, features indicating supported capabilities (e.g., IORING_FEAT_SINGLE_MMAP for combined ring mapping since Linux 5.4), and offset structures sq_off and cq_off for memory mapping.[8]
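With liburing, these parameters can be supplied through io_uring_queue_init_params(); the sketch below enables SQ polling and a custom CQ size and then inspects the kernel's feedback. The specific sizes, idle time, and flag combination are illustrative, and SQPOLL may require elevated privileges on older kernels:

```c
#include <stdio.h>
#include <string.h>
#include <liburing.h>

/* Create a ring tuned through io_uring_params: a CQ larger than the SQ and,
 * if permitted, a kernel SQ-polling thread that idles after 2000 ms. */
static int make_ring(struct io_uring *ring)
{
    struct io_uring_params p;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_CQSIZE;
    p.sq_thread_idle = 2000;        /* ms before the SQPOLL thread sleeps      */
    p.cq_entries = 256;             /* >= SQ entries; rounded up by the kernel */

    int ret = io_uring_queue_init_params(64, ring, &p);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init_params: %d\n", ret);
        return ret;                 /* e.g. -EPERM if SQPOLL is not permitted  */
    }

    /* The kernel reports what it actually allocated and supports. */
    printf("sq_entries=%u cq_entries=%u single_mmap=%s\n",
           p.sq_entries, p.cq_entries,
           (p.features & IORING_FEAT_SINGLE_MMAP) ? "yes" : "no");
    return 0;
}
```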
After setup, the SQ ring, SQ entries buffer, and CQ ring must be mapped into the application's address space using mmap() on the file descriptor, with offsets derived from the params structure to avoid fixed assumptions about layout. The SQ ring is mapped as a shared, read-write memory region of size sq_off.array + sq_entries * sizeof(__u32) at offset IORING_OFF_SQ_RING, the SQ submission queue entries (SQEs) as another shared region of size sq_entries * sizeof(struct io_uring_sqe) at IORING_OFF_SQES, and the CQ ring similarly at IORING_OFF_CQ_RING.[8] Since Linux 5.4, if IORING_FEAT_SINGLE_MMAP is supported, a single mmap() call can map both SQ and CQ rings contiguously by using the larger of the two sizes at offset 0, simplifying allocation. For advanced configurations, such as registering buffers or files for reuse, the io_uring_register() system call is invoked on the file descriptor post-setup.[8]
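Performed against the raw interface, the same mapping follows the size and offset calculations described above; the sketch below abbreviates error handling and omits the IORING_FEAT_SINGLE_MMAP single-mapping case:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Create an instance with io_uring_setup(2), then map the SQ ring, the SQE
 * array and the CQ ring using the offsets reported in io_uring_params. */
static int setup_rings(unsigned entries)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));

    int ring_fd = (int)syscall(__NR_io_uring_setup, entries, &p);
    if (ring_fd < 0)
        return -1;

    size_t sq_sz  = p.sq_off.array + p.sq_entries * sizeof(unsigned);
    size_t cq_sz  = p.cq_off.cqes  + p.cq_entries * sizeof(struct io_uring_cqe);
    size_t sqe_sz = p.sq_entries * sizeof(struct io_uring_sqe);

    void *sq_ring = mmap(NULL, sq_sz, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQ_RING);
    void *sqes    = mmap(NULL, sqe_sz, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQES);
    void *cq_ring = mmap(NULL, cq_sz, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_CQ_RING);
    if (sq_ring == MAP_FAILED || sqes == MAP_FAILED || cq_ring == MAP_FAILED)
        return -1;

    /* SQ head/tail/array now live at sq_ring + p.sq_off.*, the CQEs at
     * cq_ring + p.cq_off.cqes. */
    printf("sq_entries=%u cq_entries=%u\n", p.sq_entries, p.cq_entries);
    return ring_fd;
}
```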
To release an io_uring instance, the application closes the file descriptor returned by io_uring_setup(), which frees kernel-allocated resources including the queues and any associated threads; pending operations may complete asynchronously after close.[8] io_uring requires Linux kernel version 5.1 or later; support can be probed by attempting the system call (expecting -ENOSYS on older kernels) or checking the kernel version via uname(2).[8]
Core Operations
The core operations of io_uring are defined by submission queue entries (SQEs), which encapsulate asynchronous I/O requests submitted to the kernel via shared ring buffers, as detailed in the submission and completion mechanics.[1] Each SQE includes common fields such as the opcode to specify the operation type, fd for the file descriptor, addr pointing to buffer or structure data, off for file offsets, len for buffer lengths or counts, user_data to track requests on completion, and flags including IOSQE_IO_DRAIN to enforce ordering by ensuring prior operations complete before this one starts.[1]

For file I/O, io_uring supports scatter-gather operations through IORING_OP_READV, which performs vectored reads into multiple buffers specified by an iovec array at addr with len indicating the number of vectors, starting from offset off.[1] Similarly, IORING_OP_WRITEV enables vectored writes from iovecs, using the same parameter layout to gather data from user-provided buffers.[1] For zero-copy reads with pre-registered buffers, IORING_OP_READ_FIXED targets fixed buffer indices instead of user pointers, requiring prior buffer registration via io_uring_register(2) to avoid address translation overhead.[1]

Socket operations include IORING_OP_ACCEPT, which accepts incoming connections on a listening socket specified by fd, storing the new descriptor in the completion queue entry (CQE) and optionally providing addr and addr_len for peer address details along with accept_flags.[1] IORING_OP_CONNECT initiates a connection on fd to a destination given by addr and addr_len.[1] Message-based transfers use IORING_OP_SENDMSG and IORING_OP_RECVMSG, both relying on an msghdr structure at addr for ancillary data and control messages, with msg_flags to control behavior like non-blocking sends.[1]

Basic file management and metadata operations encompass IORING_OP_OPENAT, which opens a file relative to directory fd using pathname at addr and flags in open_flags, returning the new descriptor in the CQE.[1] IORING_OP_CLOSE simply closes the file at fd.[1] For synchronization, IORING_OP_FSYNC flushes metadata or data to disk on fd, with fsync_flags specifying the scope such as data or metadata only.[1]

Multishot support enhances efficiency for polling operations, as seen in IORING_OP_POLL_ADD, which monitors fd for events in poll_events; when the IORING_POLL_ADD_MULTI flag is set, it generates multiple CQEs for recurring events without resubmitting the SQE, until explicitly removed.[1]
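A few of these opcodes, prepared through the corresponding liburing helpers, could be queued as follows; descriptors, buffers and tags are assumptions, and nothing reaches the kernel until io_uring_submit() is called:

```c
#include <sys/uio.h>
#include <liburing.h>

/* Queue one instance each of a vectored read, a data-only fsync and an
 * accept.  'file_fd' and 'listen_fd' are assumed to be open descriptors and
 * 'iov' a prepared iovec array. */
static void queue_examples(struct io_uring *ring, int file_fd, int listen_fd,
                           struct iovec *iov, unsigned nr_iov)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);                        /* IORING_OP_READV  */
    io_uring_prep_readv(sqe, file_fd, iov, nr_iov, 0);
    sqe->user_data = 1;

    sqe = io_uring_get_sqe(ring);                        /* IORING_OP_FSYNC  */
    io_uring_prep_fsync(sqe, file_fd, IORING_FSYNC_DATASYNC);
    sqe->user_data = 2;
    sqe->flags |= IOSQE_IO_DRAIN;    /* wait for everything queued before it */

    sqe = io_uring_get_sqe(ring);                        /* IORING_OP_ACCEPT */
    io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);
    sqe->user_data = 3;

    io_uring_submit(ring);           /* all three ride one io_uring_enter(2) */
}
```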
Advanced Features
Synchronization Options
io_uring provides several mechanisms to control the ordering and dependencies of I/O operations, enabling developers to enforce synchronization without relying on external locking primitives. These features are particularly useful for applications requiring sequential execution of dependent requests, such as file operations where a write must follow a successful open. The primary tools for this are flags set on submission queue entries (SQEs) and specific opcodes that build dependency structures.[1]

Ordering is achieved through flags like IOSQE_IO_DRAIN and IOSQE_IO_LINK. The IOSQE_IO_DRAIN flag acts as a pipeline barrier, ensuring that the flagged SQE is not processed until all previously submitted SQEs have completed; additionally, no subsequent SQEs will start until the flagged operation finishes. This enforces strict serialization across the submission queue, preventing interleaving of operations.[12] In contrast, IOSQE_IO_LINK links the current SQE to the next one in the ring, delaying the execution of the linked SQE until the current one completes successfully; if an error occurs, the chain breaks, and subsequent linked SQEs are canceled (a liburing sketch of such a chain appears at the end of this subsection). Chains formed by IOSQE_IO_LINK terminate at the first unlinked SQE or at submission boundaries, allowing multiple independent chains to execute in parallel for improved concurrency. A variant, IOSQE_IO_HARDLINK, maintains the chain even on errors, providing more robust dependency handling. These linking mechanisms enable the construction of dependency graphs, primarily in the form of linear chains, where operations form ordered sequences that can run concurrently if not interdependent.

For timed dependencies, the IORING_OP_LINK_TIMEOUT opcode introduces a timeout linked specifically to a prior operation in the chain; if the operation remains outstanding when the timeout expires, it is canceled, or the timeout itself is canceled upon operation completion. This allows precise control over chained sequences with failure safety nets, differing from the ring-wide IORING_OP_TIMEOUT by targeting individual linked requests rather than the entire completion queue.

Barriers in io_uring are primarily implicit through the IOSQE_IO_DRAIN flag, which serializes operations across the submission pipeline. Explicit synchronization can leverage registered capabilities via IORING_REGISTER_PROBE to verify supported features before using advanced ordering, ensuring compatibility. Additionally, poll operations support user-side filtering through prepoll semantics in IORING_OP_POLL_ADD, where applications can pre-check readiness conditions before kernel submission to optimize event handling.

Multishot synchronization is facilitated by IORING_OP_POLL_ADD with the IORING_POLL_ADD_MULTI flag (available since kernel 5.13), which arms a persistent poll request that generates multiple completion queue entries (CQEs) for repeated events without resubmission; each CQE includes the IORING_CQE_F_MORE flag if more events are pending. To terminate such a multishot poll, applications submit IORING_OP_POLL_REMOVE, specifying the original poll request's address to disarm it and prevent further notifications.

The threading model of io_uring supports flexible concurrency: the submission queue (SQ) is designed as a multi-producer structure, using atomic updates to the tail pointer for lock-free insertions from multiple threads. The completion queue (CQ), however, is intended for single-consumer access, relying on sequence counts in the ring head and tail pointers, along with per-CQE user_data for matching, to allow safe, lock-free consumption without traditional synchronization primitives like mutexes. Memory barriers ensure visibility of updates across threads, maintaining coherence in shared ring buffers.
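The linked-chain mechanism referenced above can be illustrated with a minimal liburing sketch; assuming an open file descriptor and a prepared iovec array, it links a write to a following fsync so that the flush only runs after the write has completed successfully:

```c
#include <sys/uio.h>
#include <liburing.h>

/* Chain a write and an fsync with IOSQE_IO_LINK: the fsync does not start
 * until the write completes; if the write fails, the fsync is terminated
 * with -ECANCELED. */
static void write_then_sync(struct io_uring *ring, int fd,
                            struct iovec *iov, unsigned nr_iov)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_writev(sqe, fd, iov, nr_iov, 0);
    sqe->flags |= IOSQE_IO_LINK;          /* link this SQE to the next one     */
    sqe->user_data = 1;

    sqe = io_uring_get_sqe(ring);         /* runs only after the write         */
    io_uring_prep_fsync(sqe, fd, 0);
    sqe->user_data = 2;                   /* no link flag: the chain ends here */

    io_uring_submit(ring);
}
```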
Registered Buffers and Fixed Files
io_uring provides mechanisms for registering user-space buffers and file descriptors to optimize repeated I/O operations by avoiding per-operation setup costs. Registered buffers allow applications to pin specific memory regions in kernel address space, enabling efficient access without repeated mapping or copying for high-frequency I/O workloads.[13] To register buffers, applications use the io_uring_register system call with the IORING_REGISTER_BUFFERS opcode, passing an array of struct iovec describing the buffers and the number of entries. Upon successful registration, the kernel maps these buffers into its address space and assigns them indices, which can then be referenced in submission queue entries (SQEs) using operations like IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED by specifying the buffer index in the buf_index field. This approach is particularly useful for O_DIRECT I/O, where buffers must align with page boundaries and remain stable during operations. Buffers must be anonymous memory, such as from malloc(3), and are charged against the process's RLIMIT_MEMLOCK limit. There is a maximum size of 1 GiB per buffer.[13]
Fixed files extend this optimization to file descriptors, allowing registration of an array of open files to reduce the overhead of file descriptor lookups and reference counting on each I/O submission. Registration occurs via io_uring_register with IORING_REGISTER_FILES, providing an array of file descriptors and their count; the kernel stores internal references indexed from 0. In SQEs, applications set the IOSQE_FIXED_FILE flag and use the index instead of the raw file descriptor, streamlining operations like reads or writes on frequently accessed files. Updates to the registered set can be made with IORING_REGISTER_FILES_UPDATE. This is beneficial for scenarios involving many concurrent file handles, as it minimizes syscalls and context switches per operation. The number of fixed files is limited by the system's RLIMIT_NOFILE per process.[13]
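With liburing's wrappers around io_uring_register(2), both registrations and a fixed-buffer read that references them by index can be sketched as follows; the single buffer and file and the chosen indices are illustrative:

```c
#include <sys/uio.h>
#include <liburing.h>

/* Register one I/O buffer and one file up front, then issue a fixed-buffer
 * read that refers to both by index rather than by pointer or descriptor. */
static int read_with_registered_resources(struct io_uring *ring, int fd)
{
    static char buf[4096];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

    if (io_uring_register_buffers(ring, &iov, 1) < 0)  /* IORING_REGISTER_BUFFERS */
        return -1;
    if (io_uring_register_files(ring, &fd, 1) < 0)     /* IORING_REGISTER_FILES   */
        return -1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    /* The last argument selects registered buffer 0; with IOSQE_FIXED_FILE
     * set, the fd argument is the index into the registered file table. */
    io_uring_prep_read_fixed(sqe, 0, buf, sizeof(buf), 0, 0);
    sqe->flags |= IOSQE_FIXED_FILE;
    sqe->user_data = 7;

    return io_uring_submit(ring);
}
```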
For dynamic feature detection, io_uring supports probe registration through io_uring_register with the IORING_REGISTER_PROBE opcode, which populates an io_uring_probe structure detailing the opcodes and flags supported by the current kernel version. This enables applications to query capabilities at runtime without assuming a specific kernel feature set, facilitating portability across different Linux distributions and versions. The probe includes fields like last_op for the highest supported opcode and an array of io_uring_probe_op entries indicating availability and flags for each opcode.[13]
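In recent liburing releases the probe is exposed through io_uring_get_probe() and io_uring_opcode_supported(); a sketch of a runtime capability check (the opcode being tested is arbitrary):

```c
#include <liburing.h>

/* Ask the running kernel whether IORING_OP_SENDMSG is available before
 * relying on it; io_uring_get_probe() wraps IORING_REGISTER_PROBE. */
static int sendmsg_is_supported(void)
{
    struct io_uring_probe *probe = io_uring_get_probe();
    if (!probe)
        return 0;                               /* probing itself unsupported */

    int ok = io_uring_opcode_supported(probe, IORING_OP_SENDMSG);
    io_uring_free_probe(probe);
    return ok;
}
```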
De-registration releases these resources using the same io_uring_register call with unregistration opcodes: IORING_UNREGISTER_BUFFERS for buffers and IORING_UNREGISTER_FILES for files, passing NULL arguments. This synchronously frees kernel mappings and references, though automatic cleanup occurs when the io_uring instance is torn down via close(2). Applications may unregister to reallocate resources or switch sets, but frequent changes should be minimized to preserve performance gains.[13]
These registration features reduce kernel pinning and lookup overhead for high-frequency I/O, enabling zero-copy paths by allowing direct kernel access to user buffers without per-operation validation or copying. By front-loading the setup costs during registration—typically done once per buffer or file set—they support efficient, scalable asynchronous I/O in performance-critical applications.[13]
History and Development
Origins and Initial Release
io_uring was primarily developed by Jens Axboe, a prominent Linux kernel contributor who worked at Facebook (now Meta) and previously at Red Hat, to address scalability challenges in asynchronous I/O for high-performance storage engines such as RocksDB.[14] The project stemmed from longstanding limitations in the Linux AIO interface, which suffered from inefficiencies in submission and completion handling, particularly under high-load scenarios involving numerous concurrent operations.[2]

In late 2018, Axboe initiated discussions on the Linux Kernel Mailing List (LKML) regarding enhancements to AIO, including a prototype for polled AIO that laid groundwork for io_uring's ring-based design. This prototype was shared via kernel patches and early GitHub repositories, focusing on reducing system call overhead and improving throughput for block I/O workloads.[2] The interface was merged into the Linux kernel version 5.1, released on May 5, 2019, introducing basic submission queue (SQ) and completion queue (CQ) ring buffers supporting core operations like read, write, and poll.[5] Early adoption followed swiftly with the release of the liburing userspace library in 2019 by Axboe, which simplified io_uring integration for applications and emphasized block I/O use cases.[15]

io_uring drew inspiration from predecessors such as the Storage Performance Development Kit (SPDK), a user-space library achieving high I/O performance through polling, as well as earlier asynchronous I/O patches dating back to 2007 discussions on syslets and fibrils. These influences guided io_uring's emphasis on efficient, low-overhead queuing without kernel-user context switches for every operation.[2][7]
Evolution Across Kernel Versions
io_uring has undergone significant enhancements since its initial introduction in Linux kernel 5.1, with major feature additions in subsequent releases addressing limitations such as limited support for fixed resources and polling mechanisms. In kernel 5.5, released in January 2020, key additions included support for fixed buffers to reduce the overhead of buffer registration for repeated I/O operations, multishot poll operations allowing a single submission to handle multiple poll events without resubmission, and timeout operations for scheduling asynchronous timeouts integrated into the ring buffer workflow.[3]

Kernel 5.10, released in December 2020, expanded io_uring's capabilities for resource management and networking by introducing registered files to pin file descriptors for reuse across submissions, reducing setup costs, and adding probe support to query available operations at runtime. It also enhanced socket operations, including zero-copy send (send_zc) for efficient network transmission without data copying. These changes began addressing early gaps in network support, which were initially absent in io_uring's core design.[16][17]

By kernel 5.19 in July 2022, io_uring gained more advanced file handling with asynchronous open and close of files directly through the interface, and deferred cancellation support to allow non-immediate cancellation of pending requests for better control in complex workflows. Concurrently, the liburing user-space library reached version 2.0, providing improved APIs for these features and better integration tools for developers.[18][17]

Starting with kernel 6.0 in October 2022 and continuing through subsequent releases up to 6.12 (November 2024) and into the 6.x series as of November 2025, io_uring saw further maturation with SQPOLL enhancements for more efficient submission queue polling in kernel threads, true asynchronous support extended to all filesystems including buffered writes on XFS without blocking, and the io_uring_cmd interface for direct passthrough commands to device drivers, enabling custom I/O operations. Ongoing enhancements in later 6.x kernels, such as improved zero-copy networking support in 6.15, continue to broaden its applicability.[19][20][21] Development remains driven by maintainer Jens Axboe through contributions on the Linux Kernel Mailing List (LKML), focusing on performance optimizations and broader applicability.[22]
Security Considerations
Known Vulnerabilities
One notable vulnerability in io_uring is CVE-2022-29582, a use-after-free flaw in the timeout handling mechanism caused by a race condition in fs/io_uring.c, which could allow a local attacker to escalate privileges. This issue was fixed in Linux kernel version 5.17.3. Another significant issue is CVE-2023-1872, a use-after-free vulnerability in the io_file_get_fixed function within the io_uring subsystem, potentially leading to memory corruption, particularly in multi-threaded environments where handling of submission queue (SQ) overflow is inadequate; it affects kernels from 5.10 to 6.2.[23] The flaw enables local privilege escalation and was addressed in subsequent stable kernel releases. Earlier concerns include CVE-2021-41073, involving races during buffer registration via IORING_OP_PROVIDE_BUFFERS in loop_rw_iter of fs/io_uring.c, which permits local users to gain elevated privileges and cause denial-of-service (DoS) through pinned memory exhaustion by repeatedly registering buffers without proper limits.[24] This was patched in kernel 5.14.7.

More recent vulnerabilities include CVE-2024-0582, a use-after-free in io_uring buffer handling that allows local privilege escalation via page manipulation, publicly disclosed and exploited in early 2024, with fixes backported to kernels 5.15+ and the 6.x series.[25] Another is CVE-2024-53187, an integer overflow in io_pin_pages during buffer pinning, leading to potential memory corruption, resolved in kernel commits around December 2024.[26] In 2025, CVE-2025-39698 affects io_uring futex operations, enabling local denial-of-service or escalation, patched in September 2025 stable releases.[27]

These vulnerabilities often arise from races in shared submission and completion queue memory between user space and the kernel, as well as improper reference counting in asynchronous execution paths, which are amplified by io_uring's model of direct access to kernel structures without traditional syscall mediation.[23][24] The impacts of these flaws are confined to local attackers, facilitating privilege escalation or system denial-of-service through resource exhaustion, with no reported remote exploits as of November 2025.[23][24] Security audits continue through kernel self-tests and community reviews to identify potential weaknesses in io_uring's complex concurrency model.
Best Practices for Safe Usage
When configuring an io_uring instance, select the smallest power-of-2 ring size that accommodates the expected workload to minimize memory usage while ensuring sufficient capacity for concurrent operations.[8] Enabling the IORING_SETUP_SQPOLL flag, which spawns a dedicated kernel thread for polling the submission queue, should be restricted to trusted code due to its privileged nature and potential for resource exhaustion if misused.[1] To limit exposure to denial-of-service attacks from excessive pinned memory, configure the RLIMIT_MEMLOCK resource limit conservatively, as registered buffers and fixed files consume locked memory charged against this quota.[13] Robust error handling is essential for reliable io_uring usage; always inspect the res field and flags in each completion queue entry (CQE) to determine success or failure, where negative values indicate errno equivalents such as -EINVAL for invalid arguments.[1] For operations returning -EAGAIN, indicating that a resource was temporarily unavailable, implement retry logic that resubmits the request after a brief delay or upon notification via polling, avoiding busy-waiting loops that could degrade system performance.[1] As referenced in submission and completion mechanics, thorough CQE validation prevents silent failures in asynchronous workflows.
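A sketch of such completion handling with liburing follows; the resubmit_request() helper is hypothetical and application-specific, standing in for whatever re-queuing logic the application uses:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <liburing.h>

/* Hypothetical, application-specific: re-prepare and queue the request
 * identified by 'tag' at a later point.  Stubbed out for this sketch. */
static void resubmit_request(struct io_uring *ring, __u64 tag)
{
    (void)ring;
    (void)tag;
}

/* Drain available completions without blocking, treating cqe->res as a
 * -errno value and retrying -EAGAIN instead of busy-waiting. */
static void handle_completions(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    while (io_uring_peek_cqe(ring, &cqe) == 0) {
        __u64 tag = cqe->user_data;
        int   res = cqe->res;
        io_uring_cqe_seen(ring, cqe);          /* release the CQ slot first */

        if (res >= 0) {
            /* success: res is a byte count or opcode-specific value */
        } else if (res == -EAGAIN) {
            resubmit_request(ring, tag);       /* retry later, no busy loop */
        } else {
            fprintf(stderr, "request %llu failed: %s\n",
                    (unsigned long long)tag, strerror(-res));
        }
    }
}
```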
In multithreaded environments, confine submission queue (SQ) operations to a single thread to eliminate race conditions during entry enqueuing, aligning with the lock-free single-producer design of the SQ ring.[28] For completion queue (CQ) consumption in multi-consumer scenarios, adhere to the ring's sequence-locking protocol by verifying the sequence counters in head and tail pointers before accessing entries, ensuring consistency without explicit locks.[1]
To enhance auditability and portability across kernel versions, probe supported opcodes at runtime using the IORING_REGISTER_PROBE operation via io_uring_register(2), allowing applications to adapt to available features dynamically.[13] Prevent file descriptor leaks by avoiding direct use of legacy close(2) on registered files; instead, unregister them explicitly with io_uring_register(2) before closure to ensure proper release by the kernel.[13]
In containerized deployments, apply the PR_SET_NO_NEW_PRIVS prctl(2) flag to the process using io_uring to block privilege escalations via execve(2), complementing seccomp or AppArmor profiles that may restrict io_uring syscalls for security. Continuously monitor pinned memory consumption from registered resources to avert out-of-memory conditions, particularly in resource-constrained environments.[29]
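A small hardening sketch along these lines uses prctl(2) and getrlimit(2); the reporting is illustrative, and seccomp or LSM policy would be layered on separately:

```c
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/resource.h>

/* Forbid gaining new privileges across execve(2) and report the locked-memory
 * budget that registered buffers will be charged against. */
static int harden_process(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl(PR_SET_NO_NEW_PRIVS)");
        return -1;
    }

    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
        fprintf(stderr, "RLIMIT_MEMLOCK: soft=%llu hard=%llu bytes\n",
                (unsigned long long)rl.rlim_cur,
                (unsigned long long)rl.rlim_max);
    return 0;
}
```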
Leverage the liburing library's high-level wrappers, such as io_uring_queue_init and io_uring_submit, which abstract low-level syscalls and incorporate safety checks for common pitfalls like buffer alignment and ring synchronization.[30] Enable io_uring in the kernel by setting CONFIG_IO_URING=y during compilation, and activate debug facilities like ftrace or dynamic debug (via pr_debug in io_uring code) for troubleshooting submission and completion behaviors in development.[31]
Performance and Adoption
Benchmark Results
Benchmarking io_uring has demonstrated significant performance advantages in high-throughput scenarios, particularly for storage and network I/O on modern hardware. Using the fio tool on NVMe SSDs, io_uring achieves up to 481,000 IOPS for 4K random reads on consumer drives like the Samsung 980.[32] On Intel Optane SSD 900P, single-threaded random read IOPS can reach up to 200,000 at higher queue depths, with sequential reads up to 1.8 million, highlighting its scalability for queue depths (QD) greater than 4.[33] Comparisons with traditional asynchronous I/O (AIO) show io_uring delivering up to 15% higher IOPS at elevated queue depths due to reduced lock contention and efficient submission-completion mechanics.[33]

Latency metrics further underscore io_uring's efficiency, with single-digit microsecond p95 latencies observed for batched random reads on SSDs in kernel versions around 6.3.[33] In database workloads like RocksDB, io_uring reduces tail latency by 16.5% compared to synchronous alternatives, benefiting from features like submission queue polling (SQ_POLL).[33] For network I/O, zero-copy operations via io_uring_prep_send_zc yield up to 2.44 times the throughput of MSG_ZEROCOPY in UDP benchmarks, reaching over 116 GB/s on dummy devices, which supports line-rate performance on multi-queue NICs exceeding 100 Gbps in aggregate.[34] These gains stem from batched submissions that minimize system call overhead, scaling linearly across up to 1024 threads without significant contention.[33]

Standard test setups include fio for storage IOPS and latency on NVMe/ext4 filesystems, liburing examples for API-specific evaluations, and will-it-scale for concurrency tests up to 1 million connections, often on multi-core systems like Intel Xeon with Ubuntu kernels.[33] Recent analyses from 2023-2024, including academic evaluations, confirm io_uring's edge in mixed workloads, though results vary by configuration—e.g., IOSQE_ASYNC flags boost IOPS by up to 15% at higher queue depths in evaluations including RocksDB.[33]

Despite these strengths, io_uring exhibits overhead for small I/O operations under 4KB, where batching introduces up to 15% lower IOPS at low queue depths (QD 1-2) due to worker thread scheduling.[33] Polling modes like SQ_POLL, while reducing latency by 50% at high QD, increase CPU utilization by dedicating cores, potentially raising power consumption in latency-sensitive environments.[33]

In Linux kernel 6.12, enhancements to asynchronous filesystem support, including async discard operations, deliver approximately 20% IOPS gains on slower storage devices by avoiding blocking calls, as measured on basic NVMe setups transitioning from synchronous to fully async modes.[35] This update builds on prior versions, enabling better integration for discard-heavy workloads without performance regressions.[20]

| Benchmark | Setup | io_uring result | Comparison (e.g., AIO) | Source |
|---|---|---|---|---|
| 4K Random Read | Samsung 980 NVMe, fio, kernel 5.x | 481,000 | N/A (rated ~500k) | Phoronix (2021)[32] |
| Random Read | Intel Optane 900P, fio, single-threaded QD>4, kernel 6.3 | ~200,000 | Up to +15% vs. non-async at QD≥4 | Thesis (2024)[33] |
| RocksDB p95 Tail Latency Reduction | Intel Xeon, ext4, SQ_POLL, kernel 6.3 | 16.5% vs. sync | N/A | Thesis (2024)[33] |
| UDP Zero-Copy Send | Dummy NIC, 64KB packets | 116 GB/s | 2.44x vs. MSG_ZEROCOPY | Phoronix (2021)[34] |
| Async Discard | Basic NVMe, kernel 6.12 | ~20% gain | vs. sync discard | Phoronix/Liburing Wiki (2024)[35][20] |