io_uring
io_uring is a Linux-specific asynchronous input/output (I/O) interface designed to enable efficient, scalable handling of I/O operations by applications through shared ring buffers between user space and kernel space.[1] It allows users to submit multiple I/O requests asynchronously via a submission queue (SQ) and retrieve completions from a completion queue (CQ), minimizing system call overhead and supporting operations on files, sockets, and other descriptors.[1] Developed by kernel developer Jens Axboe to address the limitations of prior asynchronous I/O mechanisms like Linux AIO—which suffered from high overhead, restricted support for buffered I/O, and scalability issues—io_uring was first integrated into the Linux kernel with version 5.1, released in May 2019.[2] The interface is set up using the io_uring_setup() system call, which creates a file descriptor for mapping the SQ and CQ rings into user space via mmap(), enabling direct access without repeated context switches.[2]
Key advantages of io_uring include its ability to batch submissions for reduced latency, support for polling modes to avoid blocking, and compatibility with buffered I/O when data resides in the page cache, outperforming traditional AIO in scenarios requiring high throughput.[2] It provides a growing set of operation codes (opcodes) for tasks such as reading/writing vectors (IORING_OP_READV, IORING_OP_WRITEV), file synchronization (IORING_OP_FSYNC), network messaging (IORING_OP_SENDMSG, IORING_OP_RECVMSG), and more advanced features like timeouts (IORING_OP_TIMEOUT) and operation linking for dependencies, with expansions continuing in kernels up to version 6.x.[3] Additional capabilities, such as registered buffers for zero-copy transfers, further enhance its performance for demanding workloads in databases, web servers, and storage systems.[1] Since its introduction, io_uring has gained widespread adoption, evidenced by its use in projects like MariaDB's InnoDB storage engine via the liburing library,[4] and ongoing kernel enhancements reflect its role as a foundational technology for modern Linux I/O.[3]
Fundamentals
Overview
io_uring is a scalable asynchronous I/O API for the Linux kernel, designed to handle file and network operations efficiently.[1] Introduced in Linux kernel version 5.1 in May 2019, it provides a modern interface for submitting and completing I/O requests without blocking the calling process.[5] Developed by Jens Axboe, io_uring addresses limitations in prior asynchronous I/O mechanisms by enabling high-throughput operations in performance-critical applications.[2]

The primary purpose of io_uring is to improve scalability in multi-threaded environments, particularly for workloads involving frequent I/O, by reducing the number of system calls and associated context switches.[1] Traditional synchronous I/O requires per-operation syscalls, which become bottlenecks under high concurrency, while earlier asynchronous options like POSIX AIO suffer from overhead in completion notification. io_uring mitigates these issues through batched submissions and efficient polling mechanisms, allowing applications to process thousands of I/O requests with minimal kernel-user transitions.[2]

At its core, io_uring relies on two shared ring buffers mapped into user space: the submission queue (SQ) for enqueueing I/O requests and the completion queue (CQ) for retrieving results.[1] Users populate submission queue entries (SQEs) in the SQ ring, which the kernel consumes asynchronously; upon completion, the kernel appends completion queue entries (CQEs) to the CQ ring for user-side polling.[2] This workflow enables zero-copy communication between user space and the kernel, supporting both buffered and direct I/O operations across filesystems and sockets.[6]
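The submit-and-complete cycle can be illustrated with the liburing helper library, which wraps the raw system calls. The following minimal sketch (built with -luring; the input file name, buffer size, and user_data tag are illustrative assumptions) queues one read, submits it with a single system call, and waits for its completion:

```c
#include <fcntl.h>
#include <stdio.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;
    char buf[4096];

    if (io_uring_queue_init(8, &ring, 0) < 0)            /* create SQ/CQ rings */
        return 1;

    int fd = open("/etc/hostname", O_RDONLY);             /* illustrative input file */
    if (fd < 0)
        return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);   /* grab a free SQE */
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);     /* describe the read */
    sqe->user_data = 42;                                   /* tag echoed in the CQE */

    io_uring_submit(&ring);                                /* one syscall submits it */

    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) == 0) {             /* block until completion */
        if (cqe->res >= 0)
            printf("request %llu read %d bytes\n",
                   (unsigned long long)cqe->user_data, cqe->res);
        io_uring_cqe_seen(&ring, cqe);                     /* advance the CQ head */
    }

    io_uring_queue_exit(&ring);
    return 0;
}
```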
Advantages over Prior APIs
Previous asynchronous and event-driven I/O mechanisms in Linux, such as epoll, POSIX AIO, and the kernel's native AIO interface, impose significant limitations on scalability and efficiency in high-throughput scenarios. Epoll offers scalable event notification for monitoring file descriptor readiness, particularly for sockets and network I/O, but relies on a readiness-notification model that necessitates separate system calls like readv or writev for actual data transfer after notifications. This decoupled approach incurs repeated context switches and syscall overhead, especially burdensome when managing numerous file descriptors.[2]

Linux's native AIO, accessed through the libaio library wrapping the kernel's kaio interface, enables asynchronous file I/O but suffers from per-operation system calls for submissions and completions, with limited native batching capabilities. These design choices result in high overhead and poor scalability, as each I/O request typically requires two syscalls, constraining performance to approximately 151 KIOPS on 2 cores with a queue depth of 128 in controlled benchmarks. Additionally, libaio provides inadequate support for buffered I/O and lacks efficient multi-queue handling in threaded environments, often necessitating fixed buffer allocations that increase memory pressure.[7][2]

io_uring overcomes these constraints through its ring buffer architecture, which shares submission and completion queues between user space and kernel via memory established with mmap, allowing batched submissions of up to 4096 entries in one io_uring_enter(2) syscall to minimize invocations. This batching, combined with support for both blocking and non-blocking modes, enables handling thousands of concurrent I/Os without proportional increases in threads or syscalls, achieving up to 182 KIOPS under benchmark conditions similar to those used for libaio.[1][7] Moreover, io_uring facilitates zero-copy operations using registered buffers and fixed files, reducing data movement between user and kernel spaces, and performs buffered I/O without context switches when data resides in the page cache. These features provide a unified interface for file and network I/O, surpassing epoll's notification-only scope and native AIO's file-centric limitations, while offering better multi-queue support for threaded scalability and lower overall memory usage compared to libaio's fixed buffers.[2][1]
Architecture
Ring Buffers
io_uring utilizes two shared circular ring buffers to facilitate communication between user space and the kernel: the submission queue (SQ), which holds user-submitted I/O requests in the form of submission queue entries (SQEs) defined by struct io_uring_sqe, and the completion queue (CQ), which stores kernel-generated completions via completion queue entries (CQEs) defined by struct io_uring_cqe.[1][2]
The SQ layout encompasses an array of SQEs, each containing fields such as opcode (specifying the I/O operation), fd (file descriptor), off (offset), addr (buffer address), len (length), user_data (application-specific identifier), and flags (operation modifiers), along with atomic head and tail pointers for synchronization and an indirection array of indices to the actual SQEs for efficient access.[2][1] The CQ layout includes an array of CQEs, where each entry features user_data (echoing the SQE's identifier), res (result code or byte count), and flags (completion indicators), managed similarly with atomic head and tail pointers to track processed and pending events.[2][1]
These ring buffers are mapped into user space memory using mmap(2) on the file descriptor returned by io_uring_setup(2), with offsets like IORING_OFF_SQ_RING for the SQ ring, IORING_OFF_SQES for the SQE array, and IORING_OFF_CQ_RING for the CQ; a single mmap call suffices when the IORING_FEAT_SINGLE_MMAP feature is available (Linux kernel 5.4+), allowing the kernel to write completions directly into user space without data copying.[1][2]
Ring sizes are set via the entries parameter in io_uring_setup(2), specifying the SQ capacity (with the CQ typically twice that size); the value is rounded up to the next power of two, has a practical minimum of 8 entries, and is capped at 32,768 (IORING_MAX_ENTRIES), with the IORING_SETUP_CLAMP flag causing oversized requests to be clamped to the maximum rather than rejected.[8][2] When submission queue polling (IORING_SETUP_SQPOLL) is in use, the application must check the IORING_SQ_NEED_WAKEUP flag in the SQ ring, which signals that the kernel polling thread has gone idle and requires explicit waking via io_uring_enter(2).[1]
Head and tail management ensures lock-free operation: the user increments the SQ tail after adding SQEs, the kernel advances the SQ head upon consumption and the CQ tail upon completion posting, while the user progresses the CQ head after retrieving events; to poll for completions, the user invokes io_uring_enter(2) with the IORING_ENTER_GETEVENTS flag to harvest available CQEs without blocking.[1][9]
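The head/tail protocol can be sketched directly against the raw interface. In the fragment below, the pointers are assumed to have been derived from the cq_off offsets of a CQ ring previously mapped with mmap(2) (see the setup example in the Interface section); the struct and function names are illustrative rather than part of any library, and GCC-style atomic builtins stand in for the required memory barriers:

```c
#include <stdio.h>
#include <linux/io_uring.h>

/* Pointers into the mmap()ed CQ ring, computed from io_uring_params.cq_off
 * (head, tail, ring_mask, cqes) after io_uring_setup(2).  Names illustrative. */
struct cq_view {
    unsigned            *khead;      /* advanced by the consumer (user space) */
    unsigned            *ktail;      /* advanced by the producer (the kernel) */
    unsigned            *kring_mask; /* ring entries minus one (power of two) */
    struct io_uring_cqe *cqes;       /* the CQE array itself                  */
};

/* Drain every completion currently visible, without blocking. */
static unsigned drain_cq(struct cq_view *cq)
{
    unsigned head = *cq->khead;
    /* Acquire pairs with the kernel's release store of the tail, so the CQE
     * contents are visible before they are read. */
    unsigned tail = __atomic_load_n(cq->ktail, __ATOMIC_ACQUIRE);
    unsigned seen = 0;

    while (head != tail) {
        struct io_uring_cqe *cqe = &cq->cqes[head & *cq->kring_mask];
        printf("user_data=%llu res=%d\n",
               (unsigned long long)cqe->user_data, cqe->res);
        head++;
        seen++;
    }
    /* Release store publishes the new head so the kernel may reuse the slots. */
    __atomic_store_n(cq->khead, head, __ATOMIC_RELEASE);
    return seen;
}
```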
Submission and Completion Mechanics
In io_uring, the submission process begins when the user application populates a submission queue entry (SQE) in the shared submission queue ring buffer, specifying the desired operation through fields such as the opcode (e.g., IORING_OP_READV for vectored reads), the file descriptor (fd), buffer addresses (addr and len), and an optional user-defined user_data value for tracking the request.[1] The application then advances the submission queue tail pointer atomically using a release memory order to signal availability to the kernel, ensuring visibility of the new SQE without immediate system calls in polled modes.[1] To trigger kernel processing of pending SQEs, the application invokes the io_uring_enter system call, passing the ring file descriptor, the number of SQEs to submit, and optional flags like IORING_ENTER_GETEVENTS to also harvest completions in a single call.[1] This design enables batching multiple submissions efficiently, reducing context switches compared to traditional asynchronous I/O interfaces.[10]
Upon completion of the requested operation, the kernel appends a completion queue entry (CQE) to the shared completion queue ring buffer, which includes the original user_data for request identification, a res field indicating the result (positive for bytes transferred in success cases, or negative as -errno for errors), and flags for additional metadata.[1] The user application retrieves CQEs by polling the completion queue head pointer, processing entries as they become available (completions may arrive out of submission order), and advancing the head atomically after consumption to acknowledge readiness for more.[1] This polling can occur via io_uring_wait_cqe for blocking waits or through io_uring_enter with event flags for non-blocking level-triggered notifications, allowing the application to handle completions asynchronously without dedicated threads in many scenarios.[1]
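Using the liburing helpers, this batching and out-of-order completion handling might be sketched as follows; the open file descriptor, request count, and block size are assumptions, and io_uring_submit() issues the single io_uring_enter(2) call for all queued SQEs:

```c
#include <stdio.h>
#include <liburing.h>

#define NREQS 4
#define BLK   4096

/* Queue NREQS reads at consecutive offsets, submit them together, then reap
 * completions that may arrive in any order.  'fd' is an already-open file. */
static int batched_reads(struct io_uring *ring, int fd)
{
    static char bufs[NREQS][BLK];

    for (int i = 0; i < NREQS; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (!sqe)
            return -1;                        /* SQ ring is full */
        io_uring_prep_read(sqe, fd, bufs[i], BLK, (__u64)i * BLK);
        sqe->user_data = i;                   /* tag identifying this request */
    }

    io_uring_submit(ring);                    /* one syscall for all NREQS SQEs */

    for (int done = 0; done < NREQS; done++) {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(ring, &cqe) < 0)
            return -1;
        if (cqe->res < 0)                     /* res carries -errno on failure */
            fprintf(stderr, "request %llu failed: %d\n",
                    (unsigned long long)cqe->user_data, cqe->res);
        else
            printf("request %llu: %d bytes\n",
                   (unsigned long long)cqe->user_data, cqe->res);
        io_uring_cqe_seen(ring, cqe);         /* advance the CQ head */
    }
    return 0;
}
```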
io_uring supports multishot operations for certain opcodes, such as IORING_OP_ACCEPT for repeated socket accepts, where a single SQE can generate multiple CQEs until explicitly canceled, with the IORING_CQE_F_MORE flag in subsequent CQEs signaling ongoing activity.[1] Error handling is integrated into the CQE res field, where values less than zero denote failures (e.g., -EINVAL for invalid arguments), and the absence of a separate errno mechanism aligns with the asynchronous model by embedding all status directly in the queue.[1] For buffer management in fixed buffer registrations, the IORING_CQE_F_BUFFER flag in the CQE provides a buffer ID in its upper bits, enabling efficient reuse without per-completion lookups.[1]
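Multishot behaviour can be sketched with liburing's io_uring_prep_multishot_accept() helper, which requires a reasonably recent kernel (5.19 or later) and liburing release; the listening socket and the user_data tag below are assumptions:

```c
#include <stdio.h>
#include <liburing.h>

/* Arm one multishot accept on a listening socket: the single SQE keeps
 * producing one CQE per accepted connection until it is cancelled or fails.
 * 'listen_fd' is assumed to be a bound, listening socket. */
static void arm_multishot_accept(struct io_uring *ring, int listen_fd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
    sqe->user_data = 1;                        /* tag shared by all its CQEs */
    io_uring_submit(ring);
}

static void reap_connections(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    while (io_uring_wait_cqe(ring, &cqe) == 0) {
        int more = cqe->flags & IORING_CQE_F_MORE;  /* check before releasing */
        if (cqe->res >= 0)
            printf("accepted fd %d%s\n", cqe->res, more ? " (more to come)" : "");
        else
            fprintf(stderr, "accept failed: %d\n", cqe->res);
        io_uring_cqe_seen(ring, cqe);
        if (!more)
            break;                             /* multishot ended; re-arm if needed */
    }
}
```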
Request cancellation is facilitated by submitting a dedicated SQE with the IORING_OP_ASYNC_CANCEL opcode, targeting either a specific user_data value or all in-flight requests via flags like IORING_ASYNC_CANCEL_ALL, allowing applications to abort operations dynamically during runtime flows.[1] The kernel processes cancellations asynchronously, posting a CQE with res set to zero on success, -ENOENT if the target was not found, or -EALREADY if already completed.[1] This mechanism ensures safe interruption without blocking, maintaining the overall non-blocking nature of io_uring's submission-completion cycle.[10]
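A sketch of runtime cancellation with liburing: io_uring_prep_cancel() emits an IORING_OP_ASYNC_CANCEL SQE aimed at the user_data tag of an earlier request. The tag handling and reporting here are illustrative:

```c
#include <stdint.h>
#include <stdio.h>
#include <liburing.h>

/* Cancel an in-flight request that was tagged with 'target_tag' when it was
 * submitted.  The cancel is itself asynchronous and produces its own CQE
 * (res 0, -ENOENT or -EALREADY); if the target is aborted, it also completes,
 * typically with res == -ECANCELED. */
static void cancel_request(struct io_uring *ring, void *target_tag)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    io_uring_prep_cancel(sqe, target_tag, 0);   /* IORING_OP_ASYNC_CANCEL;
                                                   IORING_ASYNC_CANCEL_ALL would
                                                   target every matching request */
    sqe->user_data = (uintptr_t)sqe;            /* arbitrary tag for the cancel CQE */
    io_uring_submit(ring);

    /* The next completion may belong to the cancel request or to the cancelled
     * target; match on user_data to tell them apart. */
    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(ring, &cqe) == 0) {
        printf("user_data=%llu res=%d\n",
               (unsigned long long)cqe->user_data, cqe->res);
        io_uring_cqe_seen(ring, cqe);
    }
}
```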
Interface
Setup and Initialization
The io_uring instance is initialized using the io_uring_setup() system call, which creates a submission queue (SQ) and a completion queue (CQ) with at least the specified number of entries and returns a file descriptor referencing the instance.[8] This file descriptor serves as the handle for subsequent operations, such as mapping the queues into user space and registering resources. The call takes two arguments: an unsigned 32-bit integer entries denoting the minimum queue size (typically a power of two, up to IORING_MAX_ENTRIES (32768 since Linux 5.4)[11]), and a pointer to a struct io_uring_params for configuration options and kernel-provided feedback.[8]
The struct io_uring_params allows fine-tuning of the instance's behavior through its fields and flags. Key flags in the flags field include IORING_SETUP_IOPOLL, which enables busy-polling for I/O completion on supported devices (requiring direct I/O and device-side polling support), and IORING_SETUP_SQPOLL, which dedicates a kernel thread to poll the SQ, eliminating the need for user-space system calls to submit requests.[8] For SQ polling configurations, fields like sq_thread_cpu specify the preferred CPU for the polling thread (or UINT_MAX for kernel selection, added in Linux 5.16), while sq_thread_idle sets the idle timeout in milliseconds before the thread sleeps (0 disables idling, also added in Linux 5.16).[8] Users can also request a custom CQ size via the IORING_SETUP_CQSIZE flag combined with setting cq_entries (which must exceed the SQ size and be rounded to the next power of two). Upon return, the kernel populates fields like sq_entries and cq_entries with the actual allocated sizes, features indicating supported capabilities (e.g., IORING_FEAT_SINGLE_MMAP for combined ring mapping since Linux 5.4), and offset structures sq_off and cq_off for memory mapping.[8]
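With liburing, these parameters can be supplied through io_uring_queue_init_params(); the sketch below enables SQ polling and a custom CQ size and then inspects the kernel's feedback. The specific sizes, idle time, and flag combination are illustrative, and SQPOLL may require elevated privileges on older kernels:

```c
#include <stdio.h>
#include <string.h>
#include <liburing.h>

/* Create a ring tuned through io_uring_params: a CQ larger than the SQ and,
 * if permitted, a kernel SQ-polling thread that idles after 2000 ms. */
static int make_ring(struct io_uring *ring)
{
    struct io_uring_params p;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_CQSIZE;
    p.sq_thread_idle = 2000;        /* ms before the SQPOLL thread sleeps      */
    p.cq_entries = 256;             /* >= SQ entries; rounded up by the kernel */

    int ret = io_uring_queue_init_params(64, ring, &p);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init_params: %d\n", ret);
        return ret;                 /* e.g. -EPERM if SQPOLL is not permitted  */
    }

    /* The kernel reports what it actually allocated and supports. */
    printf("sq_entries=%u cq_entries=%u single_mmap=%s\n",
           p.sq_entries, p.cq_entries,
           (p.features & IORING_FEAT_SINGLE_MMAP) ? "yes" : "no");
    return 0;
}
```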
After setup, the SQ ring, SQ entries buffer, and CQ ring must be mapped into the application's address space using mmap() on the file descriptor, with offsets derived from the params structure to avoid fixed assumptions about layout. The SQ ring is mapped as a shared, read-write memory region of size sq_off.array + sq_entries * sizeof(__u32) at offset IORING_OFF_SQ_RING, the SQ submission queue entries (SQEs) as another shared region of size sq_entries * sizeof(struct io_uring_sqe) at IORING_OFF_SQES, and the CQ ring similarly at IORING_OFF_CQ_RING.[8] Since Linux 5.4, if IORING_FEAT_SINGLE_MMAP is supported, a single mmap() call can map both SQ and CQ rings contiguously by using the larger of the two sizes at offset 0, simplifying allocation. For advanced configurations, such as registering buffers or files for reuse, the io_uring_register() system call is invoked on the file descriptor post-setup.[8]
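Performed against the raw interface, the same mapping follows the size and offset calculations described above; the sketch below abbreviates error handling and omits the IORING_FEAT_SINGLE_MMAP single-mapping case:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Create an instance with io_uring_setup(2), then map the SQ ring, the SQE
 * array and the CQ ring using the offsets reported in io_uring_params. */
static int setup_rings(unsigned entries)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));

    int ring_fd = (int)syscall(__NR_io_uring_setup, entries, &p);
    if (ring_fd < 0)
        return -1;

    size_t sq_sz  = p.sq_off.array + p.sq_entries * sizeof(unsigned);
    size_t cq_sz  = p.cq_off.cqes  + p.cq_entries * sizeof(struct io_uring_cqe);
    size_t sqe_sz = p.sq_entries * sizeof(struct io_uring_sqe);

    void *sq_ring = mmap(NULL, sq_sz, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQ_RING);
    void *sqes    = mmap(NULL, sqe_sz, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_SQES);
    void *cq_ring = mmap(NULL, cq_sz, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_POPULATE, ring_fd, IORING_OFF_CQ_RING);
    if (sq_ring == MAP_FAILED || sqes == MAP_FAILED || cq_ring == MAP_FAILED)
        return -1;

    /* SQ head/tail/array now live at sq_ring + p.sq_off.*, the CQEs at
     * cq_ring + p.cq_off.cqes. */
    printf("sq_entries=%u cq_entries=%u\n", p.sq_entries, p.cq_entries);
    return ring_fd;
}
```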
To release an io_uring instance, the application closes the file descriptor returned by io_uring_setup(), which frees kernel-allocated resources including the queues and any associated threads; pending operations may complete asynchronously after close.[8] io_uring requires Linux kernel version 5.1 or later; support can be probed by attempting the system call (expecting -ENOSYS on older kernels) or checking the kernel version via uname(2).[8]
Core Operations
The core operations of io_uring are defined by submission queue entries (SQEs), which encapsulate asynchronous I/O requests submitted to the kernel via shared ring buffers, as detailed in the submission and completion mechanics.[1] Each SQE includes common fields such as the opcode to specify the operation type, fd for the file descriptor, addr pointing to buffer or structure data, off for file offsets, len for buffer lengths or counts, user_data to track requests on completion, and flags including IOSQE_IO_DRAIN to enforce ordering by ensuring prior operations complete before this one starts.[1]

For file I/O, io_uring supports scatter-gather operations through IORING_OP_READV, which performs vectored reads into multiple buffers specified by an iovec array at addr with len indicating the number of vectors, starting from offset off.[1] Similarly, IORING_OP_WRITEV enables vectored writes from iovecs, using the same parameter layout to gather data from user-provided buffers.[1] For zero-copy reads with pre-registered buffers, IORING_OP_READ_FIXED targets fixed buffer indices instead of user pointers, requiring prior buffer registration via io_uring_register(2) to avoid address translation overhead.[1]

Socket operations include IORING_OP_ACCEPT, which accepts incoming connections on a listening socket specified by fd, storing the new descriptor in the completion queue entry (CQE) and optionally providing addr and addr_len for peer address details along with accept_flags.[1] IORING_OP_CONNECT initiates a connection on fd to a destination given by addr and addr_len.[1] Message-based transfers use IORING_OP_SENDMSG and IORING_OP_RECVMSG, both relying on an msghdr structure at addr for ancillary data and control messages, with msg_flags to control behavior like non-blocking sends.[1]

Basic file management and metadata operations encompass IORING_OP_OPENAT, which opens a file relative to directory fd using pathname at addr and flags in open_flags, returning the new descriptor in the CQE.[1] IORING_OP_CLOSE simply closes the file at fd.[1] For synchronization, IORING_OP_FSYNC flushes metadata or data to disk on fd, with fsync_flags specifying the scope such as data or metadata only.[1]

Multishot support enhances efficiency for polling operations, as seen in IORING_OP_POLL_ADD, which monitors fd for events in poll_events; when the IORING_POLL_ADD_MULTI flag is set, it generates multiple CQEs for recurring events without resubmitting the SQE, until explicitly removed.[1]
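A few of these opcodes, prepared through the corresponding liburing helpers, could be queued as follows; descriptors, buffers and tags are assumptions, and nothing reaches the kernel until io_uring_submit() is called:

```c
#include <sys/uio.h>
#include <liburing.h>

/* Queue one instance each of a vectored read, a data-only fsync and an
 * accept.  'file_fd' and 'listen_fd' are assumed to be open descriptors and
 * 'iov' a prepared iovec array. */
static void queue_examples(struct io_uring *ring, int file_fd, int listen_fd,
                           struct iovec *iov, unsigned nr_iov)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);                        /* IORING_OP_READV  */
    io_uring_prep_readv(sqe, file_fd, iov, nr_iov, 0);
    sqe->user_data = 1;

    sqe = io_uring_get_sqe(ring);                        /* IORING_OP_FSYNC  */
    io_uring_prep_fsync(sqe, file_fd, IORING_FSYNC_DATASYNC);
    sqe->user_data = 2;
    sqe->flags |= IOSQE_IO_DRAIN;    /* wait for everything queued before it */

    sqe = io_uring_get_sqe(ring);                        /* IORING_OP_ACCEPT */
    io_uring_prep_accept(sqe, listen_fd, NULL, NULL, 0);
    sqe->user_data = 3;

    io_uring_submit(ring);           /* all three ride one io_uring_enter(2) */
}
```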
Advanced Features
Synchronization Options
io_uring provides several mechanisms to control the ordering and dependencies of I/O operations, enabling developers to enforce synchronization without relying on external locking primitives. These features are particularly useful for applications requiring sequential execution of dependent requests, such as file operations where a write must follow a successful open. The primary tools for this are flags set on submission queue entries (SQEs) and specific opcodes that build dependency structures.[1]

Ordering is achieved through flags like IOSQE_IO_DRAIN and IOSQE_IO_LINK. The IOSQE_IO_DRAIN flag acts as a pipeline barrier, ensuring that the flagged SQE is not processed until all previously submitted SQEs have completed; additionally, no subsequent SQEs will start until the flagged operation finishes. This enforces strict serialization across the submission queue, preventing interleaving of operations.[12] In contrast, IOSQE_IO_LINK links the current SQE to the next one in the ring, delaying the execution of the linked SQE until the current one completes successfully; if an error occurs, the chain breaks, and subsequent linked SQEs are canceled (a liburing sketch of such a chain appears at the end of this subsection). Chains formed by IOSQE_IO_LINK terminate at the first unlinked SQE or at submission boundaries, allowing multiple independent chains to execute in parallel for improved concurrency. A variant, IOSQE_IO_HARDLINK, maintains the chain even on errors, providing more robust dependency handling. These linking mechanisms enable the construction of dependency graphs, primarily in the form of linear chains, where operations form ordered sequences that can run concurrently if not interdependent.

For timed dependencies, the IORING_OP_LINK_TIMEOUT opcode introduces a timeout linked specifically to a prior operation in the chain; if the operation remains outstanding when the timeout expires, it is canceled, or the timeout itself is canceled upon operation completion. This allows precise control over chained sequences with failure safety nets, differing from the ring-wide IORING_OP_TIMEOUT by targeting individual linked requests rather than the entire completion queue.

Barriers in io_uring are primarily implicit through the IOSQE_IO_DRAIN flag, which serializes operations across the submission pipeline. Explicit synchronization can leverage registered capabilities via IORING_REGISTER_PROBE to verify supported features before using advanced ordering, ensuring compatibility. Additionally, poll operations support user-side filtering through prepoll semantics in IORING_OP_POLL_ADD, where applications can pre-check readiness conditions before kernel submission to optimize event handling.

Multishot synchronization is facilitated by IORING_OP_POLL_ADD with the IORING_POLL_ADD_MULTI flag (available since kernel 5.13), which arms a persistent poll request that generates multiple completion queue entries (CQEs) for repeated events without resubmission; each CQE includes the IORING_CQE_F_MORE flag if more events are pending. To terminate such a multishot poll, applications submit IORING_OP_POLL_REMOVE, specifying the original poll request's address to disarm it and prevent further notifications.

The threading model of io_uring supports flexible concurrency: the submission queue (SQ) is designed as a multi-producer structure, using atomic updates to the tail pointer for lock-free insertions from multiple threads. The completion queue (CQ), however, is intended for single-consumer access, relying on sequence counts in the ring head and tail pointers, along with per-CQE user_data for matching, to allow safe, lock-free consumption without traditional synchronization primitives like mutexes. Memory barriers ensure visibility of updates across threads, maintaining coherence in shared ring buffers.
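The linked-chain mechanism referenced above can be illustrated with a minimal liburing sketch; assuming an open file descriptor and a prepared iovec array, it links a write to a following fsync so that the flush only runs after the write has completed successfully:

```c
#include <sys/uio.h>
#include <liburing.h>

/* Chain a write and an fsync with IOSQE_IO_LINK: the fsync does not start
 * until the write completes; if the write fails, the fsync is terminated
 * with -ECANCELED. */
static void write_then_sync(struct io_uring *ring, int fd,
                            struct iovec *iov, unsigned nr_iov)
{
    struct io_uring_sqe *sqe;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_writev(sqe, fd, iov, nr_iov, 0);
    sqe->flags |= IOSQE_IO_LINK;          /* link this SQE to the next one     */
    sqe->user_data = 1;

    sqe = io_uring_get_sqe(ring);         /* runs only after the write         */
    io_uring_prep_fsync(sqe, fd, 0);
    sqe->user_data = 2;                   /* no link flag: the chain ends here */

    io_uring_submit(ring);
}
```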
Registered Buffers and Fixed Files
io_uring provides mechanisms for registering user-space buffers and file descriptors to optimize repeated I/O operations by avoiding per-operation setup costs. Registered buffers allow applications to pin specific memory regions in kernel address space, enabling efficient access without repeated mapping or copying for high-frequency I/O workloads.[13] To register buffers, applications use the io_uring_register system call with the IORING_REGISTER_BUFFERS opcode, passing an array of struct iovec describing the buffers and the number of entries. Upon successful registration, the kernel maps these buffers into its address space and assigns them indices, which can then be referenced in submission queue entries (SQEs) using operations like IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED by specifying the buffer index in the buf_index field. This approach is particularly useful for O_DIRECT I/O, where buffers must align with page boundaries and remain stable during operations. Buffers must be anonymous memory, such as from malloc(3), and are charged against the process's RLIMIT_MEMLOCK limit. There is a maximum size of 1 GiB per buffer.[13]
Fixed files extend this optimization to file descriptors, allowing registration of an array of open files to reduce the overhead of file descriptor lookups and reference counting on each I/O submission. Registration occurs via io_uring_register with IORING_REGISTER_FILES, providing an array of file descriptors and their count; the kernel stores internal references indexed from 0. In SQEs, applications set the IOSQE_FIXED_FILE flag and use the index instead of the raw file descriptor, streamlining operations like reads or writes on frequently accessed files. Updates to the registered set can be made with IORING_REGISTER_FILES_UPDATE. This is beneficial for scenarios involving many concurrent file handles, as it minimizes syscalls and context switches per operation. The number of fixed files is limited by the system's RLIMIT_NOFILE per process.[13]
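With liburing's wrappers around io_uring_register(2), both registrations and a fixed-buffer read that references them by index can be sketched as follows; the single buffer and file and the chosen indices are illustrative:

```c
#include <sys/uio.h>
#include <liburing.h>

/* Register one I/O buffer and one file up front, then issue a fixed-buffer
 * read that refers to both by index rather than by pointer or descriptor. */
static int read_with_registered_resources(struct io_uring *ring, int fd)
{
    static char buf[4096];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

    if (io_uring_register_buffers(ring, &iov, 1) < 0)  /* IORING_REGISTER_BUFFERS */
        return -1;
    if (io_uring_register_files(ring, &fd, 1) < 0)     /* IORING_REGISTER_FILES   */
        return -1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    /* The last argument selects registered buffer 0; with IOSQE_FIXED_FILE
     * set, the fd argument is the index into the registered file table. */
    io_uring_prep_read_fixed(sqe, 0, buf, sizeof(buf), 0, 0);
    sqe->flags |= IOSQE_FIXED_FILE;
    sqe->user_data = 7;

    return io_uring_submit(ring);
}
```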
For dynamic feature detection, io_uring supports probe registration through io_uring_register with the IORING_REGISTER_PROBE opcode, which populates an io_uring_probe structure detailing the opcodes and flags supported by the current kernel version. This enables applications to query capabilities at runtime without assuming a specific kernel feature set, facilitating portability across different Linux distributions and versions. The probe includes fields like last_op for the highest supported opcode and an array of io_uring_probe_op entries indicating availability and flags for each opcode.[13]
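In recent liburing releases the probe is exposed through io_uring_get_probe() and io_uring_opcode_supported(); a sketch of a runtime capability check (the opcode being tested is arbitrary):

```c
#include <liburing.h>

/* Ask the running kernel whether IORING_OP_SENDMSG is available before
 * relying on it; io_uring_get_probe() wraps IORING_REGISTER_PROBE. */
static int sendmsg_is_supported(void)
{
    struct io_uring_probe *probe = io_uring_get_probe();
    if (!probe)
        return 0;                               /* probing itself unsupported */

    int ok = io_uring_opcode_supported(probe, IORING_OP_SENDMSG);
    io_uring_free_probe(probe);
    return ok;
}
```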
De-registration releases these resources using the same io_uring_register call with unregistration opcodes: IORING_UNREGISTER_BUFFERS for buffers and IORING_UNREGISTER_FILES for files, passing NULL arguments. This synchronously frees kernel mappings and references, though automatic cleanup occurs when the io_uring instance is torn down via close(2). Applications may unregister to reallocate resources or switch sets, but frequent changes should be minimized to preserve performance gains.[13]
These registration features reduce kernel pinning and lookup overhead for high-frequency I/O, enabling zero-copy paths by allowing direct kernel access to user buffers without per-operation validation or copying. By front-loading the setup costs during registration—typically done once per buffer or file set—they support efficient, scalable asynchronous I/O in performance-critical applications.[13]
History and Development
Origins and Initial Release
io_uring was primarily developed by Jens Axboe, a prominent Linux kernel contributor who worked at Facebook (now Meta) and previously at Red Hat, to address scalability challenges in asynchronous I/O for high-performance storage engines such as RocksDB.[14] The project stemmed from longstanding limitations in the Linux AIO interface, which suffered from inefficiencies in submission and completion handling, particularly under high-load scenarios involving numerous concurrent operations.[2]

In late 2018, Axboe initiated discussions on the Linux Kernel Mailing List (LKML) regarding enhancements to AIO, including a prototype for polled AIO that laid groundwork for io_uring's ring-based design. This prototype was shared via kernel patches and early GitHub repositories, focusing on reducing system call overhead and improving throughput for block I/O workloads.[2] The interface was merged into the Linux kernel version 5.1, released on May 5, 2019, introducing basic submission queue (SQ) and completion queue (CQ) ring buffers supporting core operations like read, write, and poll.[5] Early adoption followed swiftly with the release of the liburing userspace library in 2019 by Axboe, which simplified io_uring integration for applications and emphasized block I/O use cases.[15]

io_uring drew inspiration from predecessors such as the Storage Performance Development Kit (SPDK), a user-space library achieving high I/O performance through polling, as well as earlier asynchronous I/O patches dating back to 2007 discussions on syslets and fibrils. These influences guided io_uring's emphasis on efficient, low-overhead queuing without kernel-user context switches for every operation.[2][7]
Evolution Across Kernel Versions
io_uring has undergone significant enhancements since its initial introduction in Linux kernel 5.1, with major feature additions in subsequent releases addressing limitations such as limited support for fixed resources and polling mechanisms. In kernel 5.5, released in January 2020, key additions included support for fixed buffers to reduce the overhead of buffer registration for repeated I/O operations, multishot poll operations allowing a single submission to handle multiple poll events without resubmission, and timeout operations for scheduling asynchronous timeouts integrated into the ring buffer workflow.[3]

Kernel 5.10, released in December 2020, expanded io_uring's capabilities for resource management and networking by introducing registered files to pin file descriptors for reuse across submissions, reducing setup costs, and adding probe support to query available operations at runtime. It also enhanced socket operations, including zero-copy send (send_zc) for efficient network transmission without data copying. These changes began addressing early gaps in network support, which were initially absent in io_uring's core design.[16][17]

By kernel 5.19 in July 2022, io_uring gained more advanced file handling with asynchronous open and close of files directly through the interface, and deferred cancellation support to allow non-immediate cancellation of pending requests for better control in complex workflows. Concurrently, the liburing user-space library reached version 2.0, providing improved APIs for these features and better integration tools for developers.[18][17]

Starting with kernel 6.0 in October 2022 and continuing through subsequent releases up to 6.12 (November 2024) and into the 6.x series as of November 2025, io_uring saw further maturation with SQPOLL enhancements for more efficient submission queue polling in kernel threads, true asynchronous support extended to all filesystems including buffered writes on XFS without blocking, and the io_uring_cmd interface for direct passthrough commands to device drivers, enabling custom I/O operations. Ongoing enhancements in later 6.x kernels, such as improved zero-copy networking support in 6.15, continue to broaden its applicability.[19][20][21] Development remains driven by maintainer Jens Axboe through contributions on the Linux Kernel Mailing List (LKML), focusing on performance optimizations and broader applicability.[22]
Security Considerations
Known Vulnerabilities
One notable vulnerability in io_uring is CVE-2022-29582, a use-after-free flaw in the timeout handling mechanism caused by a race condition in fs/io_uring.c, which could allow a local attacker to escalate privileges. This issue was fixed in Linux kernel version 5.17.3. Another significant issue is CVE-2023-1872, a use-after-free vulnerability in the io_file_get_fixed function within the io_uring subsystem, potentially leading to memory corruption, particularly in multi-threaded environments where handling of submission queue (SQ) overflow is inadequate; it affects kernels from 5.10 to 6.2.[23] The flaw enables local privilege escalation and was addressed in subsequent stable kernel releases. Earlier concerns include CVE-2021-41073, involving races during buffer registration via IORING_OP_PROVIDE_BUFFERS in loop_rw_iter of fs/io_uring.c, which permits local users to gain elevated privileges and cause denial-of-service (DoS) through pinned memory exhaustion by repeatedly registering buffers without proper limits.[24] This was patched in kernel 5.14.7.

More recent vulnerabilities include CVE-2024-0582, a use-after-free in io_uring buffer handling that allows local privilege escalation via page manipulation, publicly disclosed and exploited in early 2024, with fixes backported to kernels 5.15+ and the 6.x series.[25] Another is CVE-2024-53187, an integer overflow in io_pin_pages during buffer pinning, leading to potential memory corruption, resolved in kernel commits around December 2024.[26] In 2025, CVE-2025-39698 affects io_uring futex operations, enabling local denial-of-service or escalation, patched in September 2025 stable releases.[27]

These vulnerabilities often arise from races in shared submission and completion queue memory between user space and the kernel, as well as improper reference counting in asynchronous execution paths, which are amplified by io_uring's model of direct access to kernel structures without traditional syscall mediation.[23][24] The impacts of these flaws are confined to local attackers, facilitating privilege escalation or system denial-of-service through resource exhaustion, with no reported remote exploits as of November 2025.[23][24] Security audits continue through kernel self-tests and community reviews to identify potential weaknesses in io_uring's complex concurrency model.
Best Practices for Safe Usage
When configuring an io_uring instance, select the smallest power-of-2 ring size that accommodates the expected workload to minimize memory usage while ensuring sufficient capacity for concurrent operations.[8] Enabling the IORING_SETUP_SQPOLL flag, which spawns a dedicated kernel thread for polling the submission queue, should be restricted to trusted code due to its privileged nature and potential for resource exhaustion if misused.[1] To limit exposure to denial-of-service attacks from excessive pinned memory, configure the RLIMIT_MEMLOCK resource limit conservatively, as registered buffers and fixed files consume locked memory charged against this quota.[13] Robust error handling is essential for reliable io_uring usage; always inspect the res field and flags in each completion queue entry (CQE) to determine success or failure, where negative values indicate errno equivalents such as -EINVAL for invalid arguments.[1] For operations returning -EAGAIN, indicating that a resource was temporarily unavailable, implement retry logic that resubmits the request after a brief delay or upon notification via polling, avoiding busy-waiting loops that could degrade system performance.[1] As referenced in submission and completion mechanics, thorough CQE validation prevents silent failures in asynchronous workflows.
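A sketch of such completion handling with liburing follows; the resubmit_request() helper is hypothetical and application-specific, standing in for whatever re-queuing logic the application uses:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <liburing.h>

/* Hypothetical, application-specific: re-prepare and queue the request
 * identified by 'tag' at a later point.  Stubbed out for this sketch. */
static void resubmit_request(struct io_uring *ring, __u64 tag)
{
    (void)ring;
    (void)tag;
}

/* Drain available completions without blocking, treating cqe->res as a
 * -errno value and retrying -EAGAIN instead of busy-waiting. */
static void handle_completions(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    while (io_uring_peek_cqe(ring, &cqe) == 0) {
        __u64 tag = cqe->user_data;
        int   res = cqe->res;
        io_uring_cqe_seen(ring, cqe);          /* release the CQ slot first */

        if (res >= 0) {
            /* success: res is a byte count or opcode-specific value */
        } else if (res == -EAGAIN) {
            resubmit_request(ring, tag);       /* retry later, no busy loop */
        } else {
            fprintf(stderr, "request %llu failed: %s\n",
                    (unsigned long long)tag, strerror(-res));
        }
    }
}
```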
In multithreaded environments, confine submission queue (SQ) operations to a single thread to eliminate race conditions during entry enqueuing, aligning with the lock-free single-producer design of the SQ ring.[28] For completion queue (CQ) consumption in multi-consumer scenarios, adhere to the ring's sequence-locking protocol by verifying the sequence counters in head and tail pointers before accessing entries, ensuring consistency without explicit locks.[1]
To enhance auditability and portability across kernel versions, probe supported opcodes at runtime using the IORING_REGISTER_PROBE operation via io_uring_register(2), allowing applications to adapt to available features dynamically.[13] Prevent file descriptor leaks by avoiding direct use of legacy close(2) on registered files; instead, unregister them explicitly with io_uring_register(2) before closure to ensure proper release by the kernel.[13]
In containerized deployments, apply the PR_SET_NO_NEW_PRIVS prctl(2) flag to the process using io_uring to block privilege escalations via execve(2), complementing seccomp or AppArmor profiles that may restrict io_uring syscalls for security. Continuously monitor pinned memory consumption from registered resources to avert out-of-memory conditions, particularly in resource-constrained environments.[29]
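A small hardening sketch along these lines uses prctl(2) and getrlimit(2); the reporting is illustrative, and seccomp or LSM policy would be layered on separately:

```c
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/resource.h>

/* Forbid gaining new privileges across execve(2) and report the locked-memory
 * budget that registered buffers will be charged against. */
static int harden_process(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl(PR_SET_NO_NEW_PRIVS)");
        return -1;
    }

    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
        fprintf(stderr, "RLIMIT_MEMLOCK: soft=%llu hard=%llu bytes\n",
                (unsigned long long)rl.rlim_cur,
                (unsigned long long)rl.rlim_max);
    return 0;
}
```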
Leverage the liburing library's high-level wrappers, such as io_uring_queue_init and io_uring_submit, which abstract low-level syscalls and incorporate safety checks for common pitfalls like buffer alignment and ring synchronization.[30] Enable io_uring in the kernel by setting CONFIG_IO_URING=y during compilation, and activate debug facilities like ftrace or dynamic debug (via pr_debug in io_uring code) for troubleshooting submission and completion behaviors in development.[31]
Performance and Adoption
Benchmark Results
Benchmarking io_uring has demonstrated significant performance advantages in high-throughput scenarios, particularly for storage and network I/O on modern hardware. Using the fio tool on NVMe SSDs, io_uring achieves up to 481,000 IOPS for 4K random reads on consumer drives like the Samsung 980.[32] On Intel Optane SSD 900P, single-threaded random read IOPS can reach up to 200,000 at higher queue depths, with sequential reads up to 1.8 million, highlighting its scalability for queue depths (QD) greater than 4.[33] Comparisons with traditional asynchronous I/O (AIO) show io_uring delivering up to 15% higher IOPS at elevated queue depths due to reduced lock contention and efficient submission-completion mechanics.[33]

Latency metrics further underscore io_uring's efficiency, with single-digit microsecond p95 latencies observed for batched random reads on SSDs in kernel versions around 6.3.[33] In database workloads like RocksDB, io_uring reduces tail latency by 16.5% compared to synchronous alternatives, benefiting from features like submission queue polling (SQ_POLL).[33] For network I/O, zero-copy operations via io_uring_prep_send_zc yield up to 2.44 times the throughput of MSG_ZEROCOPY in UDP benchmarks, reaching over 116 GB/s on dummy devices, which supports line-rate performance on multi-queue NICs exceeding 100 Gbps in aggregate.[34] These gains stem from batched submissions that minimize system call overhead, scaling linearly across up to 1024 threads without significant contention.[33]

Standard test setups include fio for storage IOPS and latency on NVMe/ext4 filesystems, liburing examples for API-specific evaluations, and will-it-scale for concurrency tests up to 1 million connections, often on multi-core systems like Intel Xeon with Ubuntu kernels.[33] Recent analyses from 2023-2024, including academic evaluations, confirm io_uring's edge in mixed workloads, though results vary by configuration—e.g., IOSQE_ASYNC flags boost IOPS by up to 15% at higher queue depths in evaluations including RocksDB.[33]

Despite these strengths, io_uring exhibits overhead for small I/O operations under 4KB, where batching introduces up to 15% lower IOPS at low queue depths (QD 1-2) due to worker thread scheduling.[33] Polling modes like SQ_POLL, while reducing latency by 50% at high QD, increase CPU utilization by dedicating cores, potentially raising power consumption in latency-sensitive environments.[33]

In Linux kernel 6.12, enhancements to asynchronous filesystem support, including async discard operations, deliver approximately 20% IOPS gains on slower storage devices by avoiding blocking calls, as measured on basic NVMe setups transitioning from synchronous to fully async modes.[35] This update builds on prior versions, enabling better integration for discard-heavy workloads without performance regressions.[20]

| Benchmark | Setup | io_uring result | Comparison (e.g., AIO) | Source |
|---|---|---|---|---|
| 4K Random Read | Samsung 980 NVMe, fio, kernel 5.x | 481,000 | N/A (rated ~500k) | Phoronix (2021)[32] |
| Random Read | Intel Optane 900P, fio, single-threaded QD>4, kernel 6.3 | ~200,000 | Up to +15% vs. non-async at QD≥4 | Thesis (2024)[33] |
| RocksDB p95 Tail Latency Reduction | Intel Xeon, ext4, SQ_POLL, kernel 6.3 | 16.5% vs. sync | N/A | Thesis (2024)[33] |
| UDP Zero-Copy Send | Dummy NIC, 64KB packets | 116 GB/s | 2.44x vs. MSG_ZEROCOPY | Phoronix (2021)[34] |
| Async Discard | Basic NVMe, kernel 6.12 | ~20% gain | vs. sync discard | Phoronix/Liburing Wiki (2024)[35][20] |