seccomp

Seccomp, short for secure computing, is a Linux kernel feature that lets a process transition into a restricted state in which it can invoke only a limited subset of system calls, reducing the kernel's attack surface and limiting what a compromised or malicious process can do. Introduced in Linux 2.6.12 in 2005, seccomp initially offered only a strict mode that permitted four system calls (read(), write(), _exit(), and sigreturn()), with any violation terminating the process via SIGKILL. Strict mode was enabled through the /proc/PID/seccomp interface until 2007, when Linux 2.6.23 replaced that interface with prctl(PR_SET_SECCOMP) and the SECCOMP_MODE_STRICT argument.

To provide greater flexibility, Linux 3.5 (2012) added a filter mode, known as seccomp BPF (Berkeley Packet Filter), which lets a process install BPF programs that inspect the system call number and arguments through a struct seccomp_data. A filter can allow the call (SECCOMP_RET_ALLOW), return an errno (SECCOMP_RET_ERRNO), trap to a signal handler (SECCOMP_RET_TRAP), log the event (SECCOMP_RET_LOG), or kill the process or thread (SECCOMP_RET_KILL_PROCESS or SECCOMP_RET_KILL_THREAD); when several filters are installed, the most restrictive result takes precedence. Installing a filter requires either the CAP_SYS_ADMIN capability or the no_new_privs attribute (set with PR_SET_NO_NEW_PRIVS), which prevents an unprivileged process from applying a filter and then executing a more privileged program under it. Multiple filters can be layered, and they are preserved across fork(), clone(), and execve(). The dedicated seccomp() system call, added in Linux 3.17 (2014), supports operations such as SECCOMP_SET_MODE_FILTER for installing BPF filters and SECCOMP_GET_ACTION_AVAIL for checking supported actions, as well as thread synchronization through SECCOMP_FILTER_FLAG_TSYNC.

Seccomp also supports user-space notifications (SECCOMP_RET_USER_NOTIF, since Linux 5.0), which defer the decision on a system call to a supervising process. Filter support spans many architectures, from x86 (since 3.5) and ARM (since 3.8) through to PA-RISC (since 4.6). In practice, seccomp is widely used for sandboxing, for example in container runtimes such as Docker, where it restricts containerized processes to an allowlist of necessary system calls and reduces exposure to kernel vulnerabilities without requiring changes to the application. Because seccomp BPF programs cannot dereference pointers, filters are not subject to time-of-check-to-time-of-use (TOCTOU) races on argument memory, which makes seccomp a low-overhead mechanism for fine-grained control over process behavior, often combined with other security features such as capabilities. Libraries such as libseccomp simplify its use by offering a platform-independent API for generating and loading filters.

Overview

Definition and Purpose

Seccomp, short for Secure Computing mode, is a Linux kernel feature that lets a process voluntarily restrict its own access to system calls by transitioning into a mode in which only a predefined set of system calls is permitted. An application can thereby declare in advance which kernel interfaces it requires and cut off the rest, limiting its interaction with potentially vulnerable kernel code.

The primary purpose of seccomp is to reduce the kernel attack surface exposed to untrusted or potentially compromised processes, mitigating exploits (for example, those that begin with a buffer overflow and proceed to privilege escalation) that depend on unrestricted access to system calls. Confining a process to a minimal set of system calls is useful wherever code may run with elevated privileges or on shared systems. Seccomp is a one-way transition: once activated, the restrictions cannot be disabled or relaxed, so code that runs later in the process, including code injected by an attacker, cannot undo them. The feature was first introduced in Linux kernel version 2.6.12 in 2005, in response to growing demand for sandboxing in multi-tenant computing environments.

Key Components

In filter mode, seccomp's core components are system call filters written as Berkeley Packet Filter (BPF) programs, which inspect each incoming system call and determine its disposition before it executes. A filter operates on a struct seccomp_data that exposes the system call number, the architecture identifier, the instruction pointer, and up to six arguments. Filters see argument values only and cannot dereference pointers, which avoids time-of-check-to-time-of-use (TOCTOU) races. The BPF program, expressed in a restricted instruction set, loads fields of this structure into the accumulator with BPF_ABS loads and applies conditional logic to choose an outcome.

The value a filter returns dictates the kernel's response to the system call. SECCOMP_RET_ALLOW permits the call to proceed normally. SECCOMP_RET_KILL_THREAD (the historical SECCOMP_RET_KILL) and SECCOMP_RET_KILL_PROCESS terminate the offending thread or the whole process, respectively, as though killed by a SIGSYS signal. SECCOMP_RET_TRAP delivers a catchable SIGSYS signal whose siginfo describes the intercepted call, and SECCOMP_RET_ERRNO skips the call and returns an errno value taken from the lower 16 bits of the return value. In addition, SECCOMP_RET_TRACE notifies a ptrace-attached tracer, and SECCOMP_RET_LOG logs the call before allowing it.

Filters are loaded from user space and evaluated at system call entry. They are installed with the seccomp(SECCOMP_SET_MODE_FILTER, ...) system call or the prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) interface, passing a struct sock_fprog that contains the BPF program. The caller must either have set the no_new_privs attribute with PR_SET_NO_NEW_PRIVS or hold the CAP_SYS_ADMIN capability, a requirement that prevents privilege escalation. Once filters are loaded, the kernel runs every filter attached to the task on each system call, starting with the most recently added, and acts on the highest-precedence (most restrictive) value returned, before the system call handler runs.

Audit logging provides optional observability. The actions listed in /proc/sys/kernel/seccomp/actions_logged are eligible for logging through the kernel's audit subsystem. Log records include details such as the system call number and process information, which helps with debugging and policy development without affecting allowed calls. Actions such as KILL, TRAP, and ERRNO can be logged when enabled, whereas ALLOW is never logged.
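The precedence rule for stacked filters can be modeled in a few lines. The sketch below is a user-space illustration, not kernel code: the helper names `action_rank` and `combine` are invented here, while the constants are the values from `<linux/seccomp.h>`. It assumes the kernel's approach of comparing the action bits as a signed 32-bit number, so that the lowest value is the most restrictive.

```python
# Action values from <linux/seccomp.h>.
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_KILL_THREAD = 0x00000000
SECCOMP_RET_TRAP = 0x00030000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_USER_NOTIF = 0x7FC00000
SECCOMP_RET_TRACE = 0x7FF00000
SECCOMP_RET_LOG = 0x7FFC0000
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ACTION_FULL = 0xFFFF0000


def action_rank(ret):
    """Rank a filter return value; a lower rank is more restrictive.

    The action bits are read as a signed 32-bit integer, which places
    KILL_PROCESS (top bit set) below every other action.
    """
    action = ret & SECCOMP_RET_ACTION_FULL
    return action - (1 << 32) if action & 0x80000000 else action


def combine(results):
    """Pick the value to act on from all filter results (newest filter first).

    min() keeps the first of several equally ranked values, so on a tie
    the most recently installed filter's data bits are used.
    """
    return min(results, key=action_rank)
```

For example, `combine([SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO | 13])` yields the ERRNO value: a later filter that allows a call cannot override an earlier filter that denies it.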

History

Origins and Development

Seccomp originated in early 2005, when Andrea Arcangeli proposed it as a kernel feature to enhance security in multi-tenant environments. The primary motivation was the need to execute untrusted code safely, as in Arcangeli's CPUShare project, which aimed to rent out CPU resources without risking compromise of the host system. Because system calls are a common entry point for attacks on the kernel, limiting access to them addressed that risk directly, in keeping with the broader evolution of Linux security mechanisms such as SELinux toward mandatory access control and sandboxing.

The initial proposal was a simple, irrevocable mode that moved a process into a restricted state permitting only exit(), read() and write() on already-open file descriptors, and sigreturn(). This design kept the exposed kernel interface minimal while still allowing basic I/O for number-crunching tasks. Arcangeli's patches were merged into the mainline kernel in version 2.6.12, released on June 17, 2005, which made seccomp a built-in facility for lightweight confinement. Later development involved other kernel developers, including Paul Moore, who contributed to integrating seccomp actions with the Linux Audit subsystem for logging and compliance monitoring. This foundational work prepared the ground for later enhancements, such as the more flexible filter mode introduced in subsequent kernel versions.

Major Milestones

Seccomp-BPF, also known as seccomp mode 2 or filter mode, was introduced in Linux 3.5, released in July 2012. It allowed processes to load Berkeley Packet Filter (BPF) programs that inspect and filter system calls at runtime, providing far more flexible syscall whitelisting and restriction than the earlier strict mode.

Between 2013 and 2015, seccomp saw further refinements. The dedicated seccomp() system call arrived in Linux 3.17 (October 2014); it complements the existing prctl(PR_SET_SECCOMP) interface and adds flags for filter handling, including better support for multi-threaded programs. Architecture support also expanded during this period: seccomp-BPF became available on ARM in Linux 3.8 (February 2013) and on additional platforms in subsequent releases.

In 2019, Linux 5.0 introduced user-space notifications through the SECCOMP_RET_USER_NOTIF action, which lets a filter defer a syscall decision to a supervising user-space process via a file descriptor and handles complex scenarios without relying on ptrace.

From 2020 to 2025, development was incremental. Linux 5.7 (May 2020) allowed thread synchronization (TSYNC) and user notifications (USER_NOTIF) to be used together, and related tooling and debugging support continued to improve. Concurrently, adoption grew in languages like Rust through bindings such as libseccomp-rs, which provide memory-safe interfaces for constructing and loading filters, while kernel optimizations focused on reducing overhead in high-throughput environments. In January 2025, libseccomp version 2.6.0 was released, adding support for new architectures including SuperH (little and big endian) and LoongArch. As of November 2025, no major new seccomp-specific features have been introduced in the kernel beyond these ongoing improvements.

Seccomp Modes

Strict Mode

Strict mode, also known as mode 1, is the original and most restrictive form of seccomp enforcement in the Linux kernel, designed to minimize the attack surface by severely limiting system call access. It is activated with prctl(PR_SET_SECCOMP, 1) (the value 1 is SECCOMP_MODE_STRICT), which moves the calling thread into this mode with no way to revert or adjust the policy. Once it is enabled, the thread can make only four system calls: read(2) and write(2) on file descriptors that are already open, _exit(2) for termination, and sigreturn(2) for returning from signal handlers. Any other system call immediately terminates the thread with a SIGKILL signal, so the call never executes.

This rigid enforcement suits ultra-minimal workloads that perform simple computations without broader kernel interaction, such as early sandboxed interpreters that process untrusted input over standard I/O. It fits number-crunching scenarios in which input arrives through pipes or sockets opened beforehand and output is written back the same way, with no need for file system operations, networking, or process management. Its inflexibility is a significant limitation for modern applications: it prohibits opening new files, establishing network connections, and even exit_group(2), which multi-threaded programs and common C library exit paths rely on. In contrast to filter mode, which permits programmable syscall policies, strict mode offers no exceptions or nuance and enforces a binary allow-or-kill policy.

Filter Mode

Filter mode, also known as seccomp mode 2, lets a process enforce a programmable system call policy using Berkeley Packet Filter (BPF) programs. Unlike strict mode's fixed whitelist, filter mode loads a custom BPF program that evaluates each incoming system call at runtime. It is activated by calling prctl with PR_SET_SECCOMP and SECCOMP_MODE_FILTER together with a pointer to the BPF program structure, or equivalently by calling seccomp with the SECCOMP_SET_MODE_FILTER operation. Activation requires either the CAP_SYS_ADMIN capability or a prior prctl(PR_SET_NO_NEW_PRIVS, 1) call. The latter guarantees that the process and its descendants cannot gain privileges through execve, so an unprivileged process cannot use a filter to subvert a set-user-ID program.

On entry to the kernel for a system call, the seccomp subsystem runs the loaded BPF program against a structure supplied by the kernel. This structure contains nr (the system call number), arch (the architecture identifier, e.g., AUDIT_ARCH_X86_64), instruction_pointer, and an array args[] holding up to six arguments. The BPF program uses this data to decide whether to allow the call, deny it, or take another action. Evaluation happens before the system call handler executes, so unauthorized operations are stopped early in the entry path.

Filter mode supports whitelisting or blacklisting of specific system calls as well as conditional rules based on argument values. For instance, a policy might whitelist the open system call but restrict it to read-only access by comparing the access-mode bits of the flags argument with O_RDONLY, or allow socket only for specific address families. Because filters cannot dereference pointers, such checks are limited to the argument values themselves; the contents of string arguments such as path names cannot be inspected. Within that limit, conditional rules give finer control than simple syscall enumeration and let applications keep necessary functionality while minimizing the kernel attack surface. Strict mode, by contrast, offers only a predefined whitelist with no programmability.

Filters are inherited by child processes created with fork or clone and are preserved across execve; a child can add further filters but cannot remove the ones it inherited. Filters are attached per thread. In a multi-threaded process, where clone with CLONE_THREAD creates sibling threads, a newly installed filter applies only to the calling thread and to threads it creates afterwards, so synchronization is needed to avoid inconsistent policies. The SECCOMP_FILTER_FLAG_TSYNC flag, passed when the filter is added, makes all threads in the process adopt the same filter tree atomically. This inheritance model supports concurrent execution while ensuring, through kernel enforcement, that new threads cannot weaken existing protections.
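A minimal sketch of an argument-conditional policy follows, written as ordinary Python rather than BPF so that the decision logic is easy to read. Everything here is illustrative: `SyscallRequest` and `policy` are hypothetical names, the set of unconditionally allowed calls is an arbitrary example, and a real filter would express the same tests as BPF instructions over struct seccomp_data. Since O_RDONLY is zero on Linux, the check masks the access-mode bits instead of testing a single flag.

```python
import errno
import os
from dataclasses import dataclass

SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000

# Mask covering the access-mode bits of the open(2) flags argument.
O_ACCMODE = os.O_RDONLY | os.O_WRONLY | os.O_RDWR


@dataclass(frozen=True)
class SyscallRequest:
    """Hypothetical stand-in for the fields a filter sees."""
    name: str
    args: tuple = (0, 0, 0, 0, 0, 0)


def policy(req):
    """Return the seccomp action value this example policy would choose."""
    if req.name in {"read", "write", "close", "exit_group"}:
        return SECCOMP_RET_ALLOW
    if req.name == "open":
        flags = req.args[1]  # open(pathname, flags, mode): flags is argument 1
        if (flags & O_ACCMODE) == os.O_RDONLY:
            return SECCOMP_RET_ALLOW
    # Everything else fails with EACCES instead of running.
    return SECCOMP_RET_ERRNO | errno.EACCES
```

Note that the policy inspects only the integer `flags` value. The path name in argument 0 is a pointer, which a seccomp filter cannot follow.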

Seccomp Filters

Berkeley Packet Filter Integration

Seccomp integrates the Berkeley Packet Filter (BPF) by adapting a subset of classic BPF (cBPF) instructions for programmable system call filtering within the kernel. The permitted instruction set is restricted to safe operations: loading immediate values (BPF_LD | BPF_IMM, BPF_LDX | BPF_IMM), arithmetic (BPF_ALU with opcodes such as BPF_ADD, BPF_SUB, BPF_MUL, BPF_DIV, BPF_AND, BPF_OR, BPF_XOR, BPF_LSH, BPF_RSH, BPF_NEG), conditional jumps (BPF_JMP with conditions such as BPF_JEQ, BPF_JGT, BPF_JGE, BPF_JSET), loads and stores on 16 slots of scratch memory (BPF_LD | BPF_MEM, BPF_ST), register transfers (BPF_MISC | BPF_TAX, BPF_MISC | BPF_TXA), and, crucially, word loads from the input (BPF_LD | BPF_W | BPF_ABS), which read struct seccomp_data at fixed offsets. Pointer dereferencing is prohibited, which avoids time-of-check-to-time-of-use (TOCTOU) vulnerabilities.

The input to the BPF program is a struct seccomp_data, defined as { int nr; __u32 arch; __u64 instruction_pointer; __u64 args[6]; }. Its fields are read as 32-bit words at these byte offsets: the syscall number (nr) at 0, the architecture (arch) at 4, the instruction pointer at 8 and 12, and the arguments args[0] to args[5] at 16/20, 24/28, ..., 56/60. On little-endian systems the first offset of each pair holds the lower 32 bits and the second the upper 32 bits; on big-endian systems the halves are swapped. This layout allows every field to be fetched with aligned 32-bit word reads.

A filter is a linear or jump-based bytecode sequence packaged in a struct sock_fprog, typically generated by user-space tools such as libseccomp. It executes on a virtual machine with two 32-bit registers (A and X), 16 words of read-write scratch memory, and read-only access to the seccomp data. The bytecode computes an action value and returns it with a BPF_RET instruction, which decides the fate of the system call, such as allowing it to proceed or denying it. The simplicity of cBPF makes behavior predictable and guarantees that a program terminates quickly, without unbounded computation.

In the kernel, seccomp filters are attached to a task's task_struct through its seccomp field. Each filter points to the filter installed before it, and because children share their parent's chain, the filters of related tasks form a tree. On system call entry, before dispatching to the specific handler, the kernel calls __secure_computing(), which runs the filters (seccomp_run_filters() in current sources) against a struct seccomp_data populated from the current register state. The kernel walks the chain from the most recently attached filter to the oldest, executes each BPF program, and keeps the most restrictive value returned. Optimizations such as just-in-time (JIT) compilation and caching of syscalls that a filter always allows keep the overhead low for simple filters.

To ensure safety, the kernel validates user-supplied BPF programs when they are attached, using the classic BPF checker together with seccomp-specific checks in kernel/seccomp.c. Validation rejects unknown or disallowed instructions, out-of-bounds jumps and memory accesses, and programs that do not end in a return. Loops are impossible because classic BPF jumps can only go forward. Invalid filters are rejected with EINVAL, which prevents denial-of-service attacks from malformed programs and maintains kernel stability without requiring the complexity of the full eBPF verifier.
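The field offsets described above can be checked by mirroring struct seccomp_data with ctypes. This is a sketch for illustration only: the class name `SeccompData` and the helper `arg_word_offsets` are invented here, and the little-endian default is an assumption that matches x86-64 and most ARM configurations.

```python
import ctypes


class SeccompData(ctypes.Structure):
    """Mirror of struct seccomp_data from <linux/seccomp.h>."""
    _fields_ = [
        ("nr", ctypes.c_int),
        ("arch", ctypes.c_uint32),
        ("instruction_pointer", ctypes.c_uint64),
        ("args", ctypes.c_uint64 * 6),
    ]


def arg_word_offsets(index, little_endian=True):
    """Return (low_word_offset, high_word_offset) for args[index].

    A filter reads 64-bit arguments as two 32-bit words; which word comes
    first in memory depends on the byte order of the architecture.
    """
    if not 0 <= index < 6:
        raise IndexError("seccomp_data carries exactly six arguments")
    base = SeccompData.args.offset + 8 * index
    return (base, base + 4) if little_endian else (base + 4, base)
```

A filter that wants the low 32 bits of the second argument on a little-endian machine would therefore load the word at `arg_word_offsets(1)[0]`, which is offset 24.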

Filter Actions

Seccomp filter actions determine what happens when a system call is evaluated against an attached Berkeley Packet Filter (BPF) program, giving fine-grained control over process behavior without necessarily executing the call. An action is encoded in the value the BPF program returns: the upper 16 bits select the action and the lower 16 bits (SECCOMP_RET_DATA) carry optional data. When several filters are attached, the kernel acts on the highest-precedence (most restrictive) value returned, which gives consistent enforcement on every architecture that supports seccomp filters.

The most permissive action is SECCOMP_RET_ALLOW (0x7fff0000), which lets the system call execute with its original arguments and registers intact. A filter has no implicit default, so every path through the program must end in an explicit return; allowlist filters return SECCOMP_RET_ALLOW for the calls they accept, and denylist filters use it as their final fall-through. Unlike other actions, it triggers no logging or signals.

SECCOMP_RET_KILL (0x00000000, now also named SECCOMP_RET_KILL_THREAD) terminates the calling thread immediately, as though it had been killed by a SIGSYS signal, and the system call does not execute. In a single-threaded process this kills the entire process and, since Linux 4.11, may produce a core dump. If auditing is enabled and the action is in the logged set (configurable through /proc/sys/kernel/seccomp/actions_logged since Linux 4.14), the kernel records the violation.

SECCOMP_RET_TRAP (0x00030000) sends a SIGSYS signal to the calling thread and describes the attempted system call in the siginfo_t: si_call_addr (the instruction pointer), si_syscall (the system call number), si_arch (the architecture), and si_errno (set to the filter's SECCOMP_RET_DATA value). The system call is not executed, so a signal handler can inspect the event and respond to it.

SECCOMP_RET_ERRNO (0x00050000 | errno) makes the kernel return the given error code to user space without executing the system call. The errno value comes from the lower 16 bits of the return value and is capped at 4095. This action makes the call appear to have failed in the ordinary way, which is useful for denying access while keeping applications working, and it involves no signals or core dumps.

SECCOMP_RET_TRACE (0x7ff00000) notifies an attached ptrace tracer by raising a PTRACE_EVENT_SECCOMP event, so that a debugging tool can intervene, modify arguments, or skip the call. If no tracer is present, the call fails with ENOSYS. The tracer can retrieve the filter's SECCOMP_RET_DATA value with PTRACE_GETEVENTMSG.

SECCOMP_RET_USER_NOTIF (0x7fc00000), introduced in Linux 5.0, queues a notification for a user-space supervisor. The supervisor holds a file descriptor obtained by installing the filter through seccomp(2) with SECCOMP_SET_MODE_FILTER and the SECCOMP_FILTER_FLAG_NEW_LISTENER flag. The kernel suspends the calling thread until the supervisor reads the notification with SECCOMP_IOCTL_NOTIF_RECV and answers with SECCOMP_IOCTL_NOTIF_SEND, either failing the call with an error or supplying a return value after emulating it; SECCOMP_IOCTL_NOTIF_ADDFD can install a file descriptor into the target as part of the emulation, and a pidfd can be used to manage the target task if needed. If the filter returns this action but no listener is attached, the call fails with ENOSYS.
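The split between action bits and data bits can be illustrated with a small decoder. The masks and action values are those defined in `<linux/seccomp.h>`; the function `decode` and the `ACTION_NAMES` table are illustrative helpers, not part of any library.

```python
SECCOMP_RET_ACTION_FULL = 0xFFFF0000  # upper 16 bits: which action
SECCOMP_RET_DATA = 0x0000FFFF         # lower 16 bits: action-specific data

ACTION_NAMES = {
    0x80000000: "SECCOMP_RET_KILL_PROCESS",
    0x00000000: "SECCOMP_RET_KILL_THREAD",
    0x00030000: "SECCOMP_RET_TRAP",
    0x00050000: "SECCOMP_RET_ERRNO",
    0x7FC00000: "SECCOMP_RET_USER_NOTIF",
    0x7FF00000: "SECCOMP_RET_TRACE",
    0x7FFC0000: "SECCOMP_RET_LOG",
    0x7FFF0000: "SECCOMP_RET_ALLOW",
}


def decode(ret):
    """Split a 32-bit filter return value into (action name, data)."""
    action = ret & SECCOMP_RET_ACTION_FULL
    data = ret & SECCOMP_RET_DATA
    return ACTION_NAMES.get(action, "unknown"), data
```

For instance, `decode(0x00050000 | 13)` gives `("SECCOMP_RET_ERRNO", 13)`, that is, a call denied with errno 13 (EACCES on Linux).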

Programming Interface

System Calls and Prctl

The original interface for enabling seccomp is the prctl system call with the PR_SET_SECCOMP option, now a legacy mechanism for moving a process into a secure computing mode. The call takes the form prctl(PR_SET_SECCOMP, mode), where mode is either SECCOMP_MODE_STRICT (value 1), a highly restrictive environment that allows only exit, sigreturn, and read and write on already-open file descriptors, or SECCOMP_MODE_FILTER (value 2), which activates a Berkeley Packet Filter (BPF)-based filter for more granular control. In filter mode a third argument, a pointer to a struct sock_fprog containing the BPF program, must be provided; in strict mode it is unused. On success prctl returns 0; on failure it returns -1 with errno set, for example EINVAL for an invalid mode or filter program.

The dedicated seccomp system call, introduced in Linux 3.17, provides a modern, extensible interface that overcomes the limitations of prctl for filter mode. It is invoked as seccomp(unsigned int operation, unsigned int flags, void *args). With the SECCOMP_SET_MODE_FILTER operation, args points to a struct sock_fprog holding the BPF program, which the kernel validates and loads. When flags is 0, the call is functionally equivalent to prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, args); flags such as SECCOMP_FILTER_FLAG_LOG additionally enable logging of filter decisions. Success returns 0; errors return -1 with errno set, including EFAULT for an invalid pointer in args, EACCES for insufficient privileges, EINVAL for an invalid filter program (for example, one longer than 4096 instructions), and ENOMEM if the combined length of all filters attached to the thread would exceed the kernel's limit.

Installing a filter through either interface is subject to a privilege check: the calling thread must either hold the CAP_SYS_ADMIN capability in its user namespace or have previously set the no_new_privs attribute with prctl(PR_SET_NO_NEW_PRIVS, 1). This requirement ensures that an unprivileged process cannot install a filter and then execute a set-user-ID or otherwise privileged program under it. Once activated, seccomp is irreversible for the calling thread. Filters are inherited by children created with fork or clone and are preserved across execve, provided the filter permits the execve call itself.
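The call sequence can be sketched from Python with ctypes. Treat this as an illustration under stated assumptions rather than a recommended way to sandbox a Python program: it assumes a Linux C library that exports `prctl`, the wrapper names are invented here, and calling `set_no_new_privs()` or `install_filter()` permanently changes the calling thread, so the functions are only defined, not invoked.

```python
import ctypes
import os

# Option and mode values from <linux/prctl.h> and <linux/seccomp.h>.
PR_SET_SECCOMP = 22
PR_SET_NO_NEW_PRIVS = 38
PR_GET_NO_NEW_PRIVS = 39
SECCOMP_MODE_STRICT = 1
SECCOMP_MODE_FILTER = 2


def _prctl(option, arg2=0, arg3=0, arg4=0, arg5=0):
    """Call prctl(2) through the C library and raise OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.prctl.restype = ctypes.c_int
    result = libc.prctl(
        ctypes.c_int(option),
        ctypes.c_ulong(arg2), ctypes.c_ulong(arg3),
        ctypes.c_ulong(arg4), ctypes.c_ulong(arg5),
    )
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


def set_no_new_privs():
    """Set no_new_privs for the calling thread. This cannot be undone."""
    _prctl(PR_SET_NO_NEW_PRIVS, 1)


def no_new_privs_is_set():
    """Report whether no_new_privs is already set (a read-only query)."""
    return _prctl(PR_GET_NO_NEW_PRIVS) == 1


def install_filter(sock_fprog_address):
    """Install a filter; the argument is the address of a struct sock_fprog."""
    _prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, sock_fprog_address)
```

An unprivileged caller would invoke `set_no_new_privs()` before `install_filter()`; without it, and without CAP_SYS_ADMIN, the second call is expected to fail with EACCES.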

Loading Filters

Seccomp filters are constructed either directly from raw Berkeley Packet Filter (BPF) instructions or through higher-level libraries such as libseccomp, which wrap the process in a more user-friendly API. In the raw approach, the developer writes the filter as an array of BPF instructions described by a struct sock_fprog, which consists of an unsigned short len field giving the number of instructions and a pointer to an array of struct sock_filter elements. Each element contains a 16-bit opcode (code), two 8-bit jump offsets (jt and jf), and a 32-bit constant (k). The filter logic inspects the seccomp_data structure, including the system call number (nr), the architecture (arch), and up to six arguments (args[0] to args[5]).

With libseccomp, the developer initializes a filter context with seccomp_init(), specifying a default action such as SCMP_ACT_KILL for calls that match no rule, and then adds rules with functions such as seccomp_rule_add() to allow or restrict syscalls, optionally with argument checks. Internally, libseccomp compiles these rules into equivalent BPF bytecode and populates a sock_fprog structure for submission to the kernel.

Once constructed, the filter is attached to the calling thread with the seccomp() system call, invoked as seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog), where prog is the sock_fprog. The caller must first set no_new_privs with prctl(PR_SET_NO_NEW_PRIVS, 1) or hold the CAP_SYS_ADMIN capability. Supported flags include SECCOMP_FILTER_FLAG_TSYNC to synchronize the filter across all threads in the process, SECCOMP_FILTER_FLAG_LOG to enable logging of filter actions (since Linux 4.14), and SECCOMP_FILTER_FLAG_NEW_LISTENER to return a file descriptor that delivers user notifications for syscalls on which the filter returns SECCOMP_RET_USER_NOTIF (since Linux 5.0).

Multiple filters can be stacked. Each new filter is added to the thread's filter chain, and on every system call all filters run, the most recently loaded first. A later filter cannot override an earlier one: the result with the highest precedence among all filters (e.g., SECCOMP_RET_KILL_PROCESS over SECCOMP_RET_ALLOW) determines the outcome, so an additional filter can only tighten the policy. The kernel enforces limits, such as a maximum of 4096 instructions per filter and 32768 in total across the chain, to prevent resource exhaustion. Verification occurs during loading: the kernel rejects invalid programs, such as those containing disallowed instructions, out-of-range jumps, or no final return, with EINVAL.

User-space tools such as seccomp-tools support further analysis by dumping BPF code from running processes via ptrace, disassembling it into a readable form with syscall names, and emulating filter behavior against test inputs, with support for architectures such as x86_64 and ARM64. Together these steps help confirm that a filter is correctly constructed and attached before enforcement begins.
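To make the raw format concrete, the sketch below builds a small allowlist filter as `(code, jt, jf, k)` tuples, packs it into the 8-byte sock_filter encoding, and evaluates it with a toy interpreter. The interpreter handles only the three instruction kinds this filter uses and stands in for the kernel's BPF machine. The helper names are invented, the syscall numbers are the x86-64 ones, and native byte order is assumed.

```python
import struct

# Opcode components from <linux/bpf_common.h>.
BPF_LD, BPF_JMP, BPF_RET = 0x00, 0x05, 0x06
BPF_W, BPF_ABS, BPF_JEQ, BPF_K = 0x00, 0x20, 0x10, 0x00

LD_WORD = BPF_LD | BPF_W | BPF_ABS   # A = 32-bit word at offset k
JEQ_K = BPF_JMP | BPF_JEQ | BPF_K    # skip jt or jf instructions
RET_K = BPF_RET | BPF_K              # return constant k

SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_KILL_PROCESS = 0x80000000
AUDIT_ARCH_X86_64 = 0xC000003E
EPERM = 1


def build_allowlist(syscall_numbers):
    """Kill on a foreign architecture, allow listed syscalls, else EPERM."""
    prog = [
        (LD_WORD, 0, 0, 4),                       # A = seccomp_data.arch
        (JEQ_K, 1, 0, AUDIT_ARCH_X86_64),         # match: skip the kill
        (RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        (LD_WORD, 0, 0, 0),                       # A = seccomp_data.nr
    ]
    for nr in syscall_numbers:
        prog.append((JEQ_K, 0, 1, nr))            # no match: skip the allow
        prog.append((RET_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append((RET_K, 0, 0, SECCOMP_RET_ERRNO | EPERM))
    return prog


def pack(prog):
    """Encode instructions as struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return b"".join(struct.pack("=HBBI", *insn) for insn in prog)


def seccomp_data(nr, arch, ip=0, args=(0, 0, 0, 0, 0, 0)):
    """Build the 64-byte input the filter reads."""
    return struct.pack("=iIQ6Q", nr, arch, ip, *args)


def run(prog, data):
    """Toy interpreter for the three instruction kinds used above."""
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == LD_WORD:
            acc = struct.unpack_from("=I", data, k)[0]
        elif code == JEQ_K:
            pc += jt if acc == k else jf
        elif code == RET_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")


# x86-64 numbers: read = 0, write = 1, exit_group = 231.
FILTER = build_allowlist([0, 1, 231])
```

Checking the architecture first matters because syscall numbers differ between architectures; a filter that compared only `nr` could be bypassed through another ABI on the same machine. The bytes from `pack(FILTER)` are what a `sock_fprog` would point to, with `len` set to `len(FILTER)`.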

Applications and Usage

In Containerization

Seccomp plays a crucial role in securing containerized environments by restricting the system calls available to processes, which limits the attack surface in container runtimes like Docker. Docker's default seccomp profile enforces a whitelist-based policy that disables approximately 44 system calls out of over 300, targeting those considered dangerous for container isolation, such as mount (which requires elevated privileges like CAP_SYS_ADMIN) and keyctl (which operates on kernel keyrings that are not namespaced). The profile is taken from the Moby project's default.json and applies automatically to containers unless explicitly overridden, providing a baseline level of protection against privilege escalation attempts from within the container.

Docker allows seccomp profiles to be customized for specific workloads with the --security-opt seccomp=/path/to/profile.json flag, which loads a user-defined JSON file compatible with the Open Container Initiative (OCI) specification. For instance, administrators can generate custom profiles with tools like seccomp-gen, which analyzes strace output from a containerized application in order to whitelist only the syscalls it needs, minimizing unnecessary kernel interaction while preserving functionality. This flexibility allows fine-grained control, such as permitting additional syscalls for specialized applications without disabling seccomp entirely.

In Kubernetes, seccomp is configured in the pod specification through the securityContext.seccompProfile field. The type can be set to RuntimeDefault to apply the container runtime's default profile (e.g., Docker's or containerd's) or to Localhost for a custom profile stored on the node under /var/lib/kubelet/seccomp/. The configuration enforces syscall restrictions per container and supports multi-tenant isolation in shared clusters by preventing workloads from invoking syscalls that could compromise the host kernel or other pods. Seccomp settings defined at the pod level are inherited by the pod's containers unless a container overrides them, which keeps policy consistent across orchestrated environments.

The primary benefit of seccomp in containerization is that it confines processes to a minimal set of approved kernel interactions, which helps prevent container escapes and enforces the principle of least privilege at the syscall layer. By blocking unneeded syscalls, seccomp reduces the risk of exploits that rely on kernel vulnerabilities, complementing other mechanisms such as namespaces and cgroups. These profiles rely on filter mode, which provides the programmable restrictions they need. Profiles can be generated and tested iteratively with tools like seccomp-gen to balance security and functionality without over-restricting benign operations.

Seccomp is widely adopted in production clusters to comply with established standards, such as the CIS Kubernetes Benchmarks (which recommend enabling default seccomp profiles for all workloads) and NIST SP 800-190 (which highlights seccomp as a key control for application container protection). These guidelines underscore seccomp's role in mitigating container-specific risks and make it a standard practice in hardened deployments.
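As an illustration of the custom-profile workflow, the sketch below generates a minimal allowlist profile in the JSON layout that Docker's `--security-opt seccomp=` flag accepts (`defaultAction`, `architectures`, `syscalls`). The function name `build_profile` and the syscall list are hypothetical, and the list is far too short for a real container; in practice it would come from tracing the workload.

```python
import json


def build_profile(allowed_syscalls, default_action="SCMP_ACT_ERRNO"):
    """Return a Docker-style seccomp profile that allows only the given syscalls."""
    return {
        "defaultAction": default_action,        # applied to everything not listed
        "architectures": ["SCMP_ARCH_X86_64"],  # assumption: x86-64 hosts only
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Hypothetical syscall list, for illustration only.
profile = build_profile(["read", "write", "exit_group", "read"])
profile_json = json.dumps(profile, indent=2)
```

Writing `profile_json` to a file and passing its path to `docker run --security-opt seccomp=...` would apply it to a container. A denied call then fails with an error rather than killing the process, because the default action here is `SCMP_ACT_ERRNO`.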

In Web Browsers

Google Chrome integrates seccomp-BPF into its multi-process architecture to enhance renderer process isolation, a feature introduced in version 23 around late 2012 and widely deployed by 2013. This mechanism confines renderer processes—responsible for executing untrusted web content—to a restricted set of system calls, permitting essential operations like network communication via sendto and recvfrom while prohibiting dangerous actions such as file creation with open or process spawning with fork. These filters complement other isolation techniques, including Linux namespaces, which limit the process's visibility into the file system, process list, and network stack, collectively forming Chrome's layered sandbox defense. The seccomp filters for Chrome's renderer are generated at build time through automated tools and policy definitions within the codebase, such as those in the sandbox/linux/seccomp_bpf directory, ensuring consistent and auditable restrictions across deployments. When a violation occurs—such as an attempt to invoke a disallowed syscall—the terminates the process immediately via SIGSYS or SIGKILL, preventing potential ; these incidents are captured as crashes and reported to Google's Crashpad system for analysis and telemetry. Other browsers have adopted similar syscall filtering for , though implementations vary by platform. Firefox began applying seccomp filters to content processes in 2016, with broader rollout by 2017, particularly targeting media decoding and web rendering to mitigate risks from malformed content; this focuses on denying filesystem and device access while allowing rendering essentials. Apple's Safari, built on , employs the macOS kernel's native ing framework—mandatory access controls via Seatbelt profiles—rather than seccomp, as the latter is Linux-specific; this achieves comparable isolation for web content processes by restricting entitlements like file I/O and . 
The adoption of syscall filtering in browsers has bolstered their security by blocking kernel interaction vectors commonly targeted in attacks.
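To make the allowlist idea concrete, the following Python sketch compiles a list of permitted syscall numbers into a classic BPF instruction sequence and evaluates it with a tiny interpreter. It is a teaching model only, not Chrome's policy or its code generator: real renderer policies also inspect arguments and use other return actions. The opcode and action constants mirror the kernel headers, the syscall numbers are the x86-64 ones, little-endian layout is assumed, and the helper names (`compile_allowlist`, `run_filter`, `pack_seccomp_data`) are hypothetical.

```python
import struct

# Classic BPF opcode pieces (values from <linux/bpf_common.h>).
BPF_LD, BPF_JMP, BPF_RET = 0x00, 0x05, 0x06
BPF_W, BPF_ABS, BPF_JEQ, BPF_K = 0x00, 0x20, 0x10, 0x00

# Seccomp return actions (values from <linux/seccomp.h>).
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ALLOW = 0x7FFF0000

AUDIT_ARCH_X86_64 = 0xC000003E


def compile_allowlist(allowed_nrs, arch=AUDIT_ARCH_X86_64):
    """Return (code, jt, jf, k) tuples that allow only the given syscalls."""
    prog = [
        (BPF_LD | BPF_W | BPF_ABS, 0, 0, 4),        # A = seccomp_data.arch
        (BPF_JMP | BPF_JEQ | BPF_K, 1, 0, arch),    # expected arch: skip the kill
        (BPF_RET | BPF_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        (BPF_LD | BPF_W | BPF_ABS, 0, 0, 0),        # A = seccomp_data.nr
    ]
    for nr in allowed_nrs:
        prog.append((BPF_JMP | BPF_JEQ | BPF_K, 0, 1, nr))  # no match: skip allow
        prog.append((BPF_RET | BPF_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append((BPF_RET | BPF_K, 0, 0, SECCOMP_RET_KILL_PROCESS))
    return prog


def pack_seccomp_data(nr, arch=AUDIT_ARCH_X86_64, ip=0, args=(0,) * 6):
    """Pack the fields in the order struct seccomp_data uses (64 bytes)."""
    return struct.pack("<iIQ6Q", nr, arch, ip, *args)


def run_filter(prog, data):
    """Interpret the program; return (action, instructions_executed)."""
    acc, pc, steps = 0, 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        steps += 1
        if code == BPF_LD | BPF_W | BPF_ABS:
            acc = struct.unpack_from("<I", data, k)[0]
        elif code == BPF_JMP | BPF_JEQ | BPF_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET | BPF_K:
            return k, steps
        else:
            raise ValueError(f"unsupported opcode {code:#x}")


# x86-64 syscall numbers: read=0, write=1, sendto=44, recvfrom=45, exit_group=231.
ALLOWED = [0, 1, 44, 45, 231]
prog = compile_allowlist(ALLOWED)

for name, nr in [("read", 0), ("exit_group", 231), ("open", 2), ("fork", 57)]:
    action, steps = run_filter(prog, pack_seccomp_data(nr))
    verdict = "allow" if action == SECCOMP_RET_ALLOW else "kill"
    print(f"{name:<10} -> {verdict} after {steps} instructions")
```

The instruction count returned by `run_filter` grows with the position of a syscall in the list, which is why hand-written filters often place the most frequent syscalls first or use a binary search over syscall numbers.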

Limitations and Future Developments

Current Constraints

Seccomp imposes measurable performance overhead because a Berkeley Packet Filter (BPF) program is evaluated on each system call. Measurements indicate that this adds approximately 66 to 226 cycles per syscall for common operations like getpid or open, depending on the syscall and the filter in use. More intricate filters or high rates of system calls exacerbate this overhead, because a classic BPF filter is evaluated instruction by instruction and its cost grows with filter length and invocation frequency, potentially impacting latency-sensitive applications.

Compatibility remains a key constraint. Seccomp BPF filtering has been available on x86 and x86-64 since its introduction in Linux 3.5, while other architectures gained support later, including ARM (since kernel 3.8), MIPS (since 3.16), and ARM64 (since 3.19). Additionally, an installed filter cannot be removed or relaxed: the transition into seccomp mode is irreversible and further filters can only add restrictions, so permitting an additional syscall requires starting a new process with a different filter.

Debugging seccomp violations presents significant challenges, as tracing typically relies on the audit subsystem (for example, auditd logging events to /var/log/audit/audit.log), without which violations may go unnoticed beyond process termination. The seccomp user-space notification mechanism, available since kernel 5.0, allows a supervisor to intercept and handle selected system calls but demands additional setup, such as polling a notification file descriptor and running a separate supervisor process, complicating routine diagnostics.

Security gaps further limit seccomp's efficacy. It cannot intercept or filter kernel-internal operations, and it does not see calls serviced by the virtual dynamic shared object (vDSO), which executes certain calls, such as gettimeofday, entirely in user space; these never reach the filter, which leads to inconsistent behavior across systems. Moreover, seccomp offers no protection against kernel bugs in the code paths that remain reachable, as it operates solely at the user-kernel boundary and assumes a trusted kernel implementation.
These limitations can be partially mitigated through filter actions like tracing, though such approaches introduce their own overheads.
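As an illustration of the audit-based debugging workflow, the Python sketch below extracts seccomp events from audit log lines and counts them per program and syscall. It assumes the usual `key=value` layout of `type=SECCOMP` records; the sample lines are invented, and the syscall table is a deliberately tiny x86-64 subset that is only meaningful for records with `arch=c000003e`.

```python
import re
from collections import Counter

# key=value pairs; values are either double-quoted or run to the next space.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')

# Incomplete x86-64 syscall table, just enough for the demo.
X86_64_SYSCALLS = {0: "read", 1: "write", 2: "open", 41: "socket", 57: "fork", 59: "execve"}


def parse_seccomp_record(line):
    """Return a dict of fields for a type=SECCOMP audit line, else None."""
    if "type=SECCOMP" not in line:
        return None
    fields = {key: value.strip('"') for key, value in FIELD_RE.findall(line)}
    for key in ("pid", "sig", "syscall"):
        if key in fields:
            fields[key] = int(fields[key])
    return fields


def summarize(lines):
    """Count seccomp events per (command, syscall name)."""
    counts = Counter()
    for line in lines:
        record = parse_seccomp_record(line)
        if record is None or "syscall" not in record:
            continue
        nr = record["syscall"]
        name = X86_64_SYSCALLS.get(nr, f"syscall_{nr}")
        counts[(record.get("comm", "?"), name)] += 1
    return counts


# Invented sample records in the layout the kernel audit subsystem uses.
SAMPLE_LOG = [
    'type=SECCOMP msg=audit(1700000000.123:456): auid=1000 uid=1000 gid=1000 '
    'ses=3 pid=4242 comm="demo" exe="/usr/bin/demo" sig=31 arch=c000003e '
    'syscall=41 compat=0 ip=0x7f1c2a3b4c5d code=0x80000000',
    'type=SYSCALL msg=audit(1700000000.124:457): arch=c000003e syscall=59 success=yes',
    'type=SECCOMP msg=audit(1700000001.001:460): auid=1000 uid=1000 gid=1000 '
    'ses=3 pid=4250 comm="demo" exe="/usr/bin/demo" sig=31 arch=c000003e '
    'syscall=41 compat=0 ip=0x7f1c2a3b4c5d code=0x80000000',
]

for (comm, name), count in summarize(SAMPLE_LOG).items():
    print(f"{comm}: {name} blocked {count} time(s)")
```

In these records `sig=31` corresponds to SIGSYS and `code` carries the filter's return action, so the same parser can be extended to separate logged events from fatal ones.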

Ongoing Enhancements

Since Linux 5.3, proposals have emerged to migrate seccomp filters from classic BPF (cBPF) to extended BPF (eBPF), enabling more expressive policies with features like stateful tracking via maps and advanced helpers for argument validation and memory access. This shift addresses limitations in cBPF's static nature, allowing dynamic behaviors such as flow integrity checks, though full upstream integration remains under discussion as of 2025, with the added expressiveness weighed against security concerns.

Enhancements to the SECCOMP_RET_USER_NOTIF mechanism, introduced in Linux 5.0 and refined in subsequent releases, support handling through a pollable notification file descriptor and multi-process scenarios in which a single listener manages notifications across PIDs. Linux 5.9 added the ability to install file descriptors into the target process (SECCOMP_IOCTL_NOTIF_ADDFD), and Linux 5.19 added the SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV flag, which prevents non-fatal signals from interrupting a target whose notification the supervisor has already received, a problem in environments that deliver frequent preemption signals. These updates facilitate unprivileged syscall interception without the overhead of ptrace-based tracing, enabling efficient user-space mediation.

Seccomp is increasingly combined with Landlock, a stackable Linux Security Module (LSM) merged in Linux 5.13, to form hybrid sandboxing approaches that layer syscall filtering with filesystem access controls. This integration, experimental in tools since 6.0, allows unprivileged processes to restrict both kernel interfaces and ambient rights such as global filesystem access, enhancing isolation in applications without requiring privileged modes. Developers leverage this complementarity for comprehensive confinement, as seccomp handles syscall granularity while Landlock enforces resource policies through LSM hooks.

Future directions for seccomp include potential support for syscall argument emulation via enhanced notifiers and automated profile generation for dynamic policy adaptation, as discussed in recent research prototypes. These build on the eBPF proposals without plans to deprecate the legacy cBPF mode, preserving backward compatibility for existing deployments.

    Aug 13, 2025 · No, seccomp-bpf uses cBPF which is frozen for new features and is a much smaller subset of eBPF. AFAICS the signature checking in the linked ...<|separator|>