seccomp
Seccomp, short for secure computing mode, is a Linux kernel feature that enables processes to transition into a restricted state where they can only invoke a limited subset of system calls, thereby minimizing the kernel's attack surface and enhancing security by preventing unauthorized or malicious operations.[1] Introduced in Linux kernel version 2.6.12 in 2005, seccomp initially operated in a strict mode that permitted only four essential system calls—read(), write(), _exit(), and sigreturn()—with any violation resulting in the process being terminated via SIGKILL.[2] This mode was enabled through the /proc/PID/seccomp interface until 2007, when Linux 2.6.23 replaced it with the prctl(PR_SET_SECCOMP) system call using the SECCOMP_MODE_STRICT operation.[2]
To provide greater flexibility, seccomp evolved with the addition of a filter mode in Linux 3.5 (2012), known as seccomp BPF (Berkeley Packet Filter), which allows processes to define custom filters for incoming system calls using BPF programs that inspect the system call number and arguments via a struct seccomp_data.[1][2] These filters can return actions such as allowing the call (SECCOMP_RET_ALLOW), returning an errno (SECCOMP_RET_ERRNO), trapping to a handler (SECCOMP_RET_TRAP), logging the event (SECCOMP_RET_LOG), or killing the process or thread (SECCOMP_RET_KILL_PROCESS or SECCOMP_RET_KILL_THREAD), with precedence given to the most restrictive action.[3] The filter mode requires either the CAP_SYS_ADMIN capability or the PR_SET_NO_NEW_PRIVS flag to prevent privilege escalation in child processes, and it supports layering multiple filters while preserving them across fork(), clone(), and execve() unless explicitly disallowed.[1]
Further enhancements include the introduction of the dedicated seccomp() system call in Linux 3.17 (2014), which supports operations like SECCOMP_SET_MODE_FILTER for installing BPF filters, SECCOMP_GET_ACTION_AVAIL for checking supported actions, and thread synchronization via SECCOMP_FILTER_FLAG_TSYNC.[3][2] Seccomp also features user-space notifications (SECCOMP_RET_USER_NOTIF) for deferring decisions on system calls to a monitor process, introduced in later kernels, and it has been extended to support various architectures including x86-64 (since 3.5), ARM (since 3.8), and others up to PA-RISC (since 4.6).[1][3]
In practice, seccomp is widely used for sandboxing applications, such as in container runtimes like Docker and Kubernetes, where it restricts containerized processes to a whitelist of necessary system calls, reducing potential vulnerabilities without requiring kernel modifications.[4][5] By leveraging BPF's maturity and safety—avoiding pointer dereferences to prevent time-of-check-to-time-of-use (TOCTOU) attacks—seccomp provides a powerful, low-overhead mechanism for fine-grained control over process behavior, often combined with other security features like Linux capabilities.[1] Libraries such as libseccomp simplify its use by offering a platform-independent API for generating and loading filters.[2]
Overview
Definition and Purpose
Seccomp, short for Secure Computing mode, is a Linux kernel feature that enables processes to voluntarily restrict their access to system calls by transitioning into a mode where only a predefined set of safe system calls is permitted.[6] This mechanism allows applications to declare in advance which kernel interfaces they require, thereby limiting interactions with potentially vulnerable kernel code.[2]
The primary purpose of seccomp is to reduce the kernel's attack surface exposed to untrusted or potentially compromised processes, thereby mitigating risks from exploits such as buffer overflows or privilege escalations that could leverage unrestricted system calls.[2] By confining processes to a minimal set of system calls, seccomp enhances security in environments where code may execute with elevated privileges or in shared systems, preventing malicious or erroneous invocations of dangerous kernel functions.[6]
Seccomp operates as a one-way transition: once activated, a process enters a secure state from which it cannot disable the restrictions or expand its system call access, ensuring that subsequent code, potentially altered by an attacker, cannot undo the safeguards.[2] This feature was first introduced in the Linux kernel version 2.6.12 in 2005, addressing the increasing demand for process isolation and sandboxing in multi-tenant computing environments.[2]
Key Components
In the filter mode, seccomp's core components revolve around system call filters constructed using Berkeley Packet Filter (BPF) instructions, which enable processes to inspect incoming system calls and determine their disposition before execution. These filters operate on a struct seccomp_data structure that provides access to the system call number, architecture identifier, instruction pointer, and up to six system call arguments, allowing for precise evaluation without dereferencing pointers to mitigate time-of-check-to-time-of-use (TOCTOU) vulnerabilities.[1][3] The BPF programs, expressed in a restricted instruction set, load the relevant data into registers via BPF_ABS operations and apply conditional logic to decide outcomes, effectively restricting the kernel's attack surface by limiting allowable system calls.[1]
The predefined actions form the decision points of these filters, returning specific values that dictate the kernel's response to a system call. SECCOMP_RET_ALLOW permits the system call to proceed normally, while SECCOMP_RET_KILL terminates the offending thread (or process, depending on the variant) with a SIGSYS signal. SECCOMP_RET_TRAP delivers a SIGSYS signal to the process, providing details of the intercepted call for handling in a signal handler, and SECCOMP_RET_ERRNO returns a custom errno value (from the lower 16 bits of the return code) without executing the call. Additional actions like SECCOMP_RET_TRACE notify a ptrace-attached tracer and SECCOMP_RET_LOG enable logging before allowing the call, but the primary actions focus on allowance, termination, trapping, or error injection to enforce security policies.[1][3]
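The split between action bits and data bits described above can be illustrated with a small decoder. This is a minimal sketch: the constants are the values published in `<linux/seccomp.h>`, while the `decode` helper and the `ACTION_NAMES` table are names introduced here purely for illustration.

```python
import errno

# Action values and masks as defined in <linux/seccomp.h>.
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_KILL_THREAD = 0x00000000
SECCOMP_RET_TRAP = 0x00030000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_TRACE = 0x7FF00000
SECCOMP_RET_LOG = 0x7FFC0000
SECCOMP_RET_ALLOW = 0x7FFF0000

SECCOMP_RET_ACTION_FULL = 0xFFFF0000  # upper 16 bits select the action
SECCOMP_RET_DATA = 0x0000FFFF         # lower 16 bits carry action-specific data

# Hypothetical lookup table, used only to print readable names.
ACTION_NAMES = {
    SECCOMP_RET_KILL_PROCESS: "KILL_PROCESS",
    SECCOMP_RET_KILL_THREAD: "KILL_THREAD",
    SECCOMP_RET_TRAP: "TRAP",
    SECCOMP_RET_ERRNO: "ERRNO",
    SECCOMP_RET_TRACE: "TRACE",
    SECCOMP_RET_LOG: "LOG",
    SECCOMP_RET_ALLOW: "ALLOW",
}


def decode(ret):
    """Split a filter return value into (action name, data)."""
    action = ret & SECCOMP_RET_ACTION_FULL
    data = ret & SECCOMP_RET_DATA
    return ACTION_NAMES.get(action, "UNKNOWN"), data


# A filter that denies a call with EPERM returns ERRNO combined with the errno.
print(decode(SECCOMP_RET_ERRNO | errno.EPERM))
print(decode(SECCOMP_RET_ALLOW))
```

The decoder only models how a return value is interpreted; it does not install or query any filter.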
Integration with the kernel occurs through user-space APIs that load filters into kernel space for evaluation at system call entry points. Filters are installed using the seccomp(SECCOMP_SET_MODE_FILTER, ...) system call or the prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) interface, passing a struct sock_fprog containing the BPF program; this requires either the PR_SET_NO_NEW_PRIVS bit set or CAP_SYS_ADMIN capability to prevent privilege escalation. Once loaded, the kernel runs every filter in the chain on each system call attempt, starting with the most recently added, and applies the highest-precedence result before the system call handler runs.[1][3]
Audit logging provides optional observability for denied system calls, integrating with the kernel's audit subsystem to record events when specified actions are configured in /proc/sys/kernel/seccomp/actions_logged. Logs capture details such as the system call number, arguments, and process information, aiding in security monitoring and debugging without affecting the performance of allowed calls; by default, actions like KILL, TRAP, and ERRNO trigger logging if enabled, but ALLOW does not.[1][3]
History
Origins and Development
Seccomp originated in early 2005 when Andrea Arcangeli, then working at Red Hat, proposed it as a kernel feature to enhance process isolation in multi-tenant environments.[7] The primary motivation stemmed from the need to securely execute untrusted code in virtualization and grid computing scenarios, such as Arcangeli's CPUShare project, which aimed to rent out CPU resources without risking host system compromise.[2] By limiting system call access, seccomp addressed vulnerabilities where syscalls served as common entry points for attacks, building on the broader evolution of Linux security mechanisms like AppArmor and SELinux that emphasized mandatory access controls and sandboxing.[7]
The initial proposal focused on a simple, irrevocable mode that transitioned a process into a restricted state, permitting only essential syscalls such as exit(), read() on a fixed file descriptor, write() on a fixed file descriptor, and sigreturn().[2] This design ensured minimal attack surface while allowing basic I/O for number-crunching or bytecode execution tasks.[8] Arcangeli's patches were merged into the Linux kernel mainline in version 2.6.12, released on June 17, 2005, marking seccomp's early adoption as a built-in facility for lightweight process confinement.[7][9]
Development involved collaboration among kernel developers, including Paul Moore, who contributed to integrating seccomp actions with the Linux Audit subsystem for logging and compliance monitoring.[10] This foundational work laid the groundwork for later enhancements, such as the more flexible filter mode introduced in subsequent kernel versions.[2]
Major Milestones
The introduction of seccomp-BPF, also known as seccomp mode 2 or filter mode, occurred in Linux kernel version 3.5, released in July 2012. This advancement enabled processes to load programmable Berkeley Packet Filter (BPF) programs to inspect and filter system calls at runtime, providing greater flexibility for syscall whitelisting and restriction compared to the earlier strict mode.[2]
Between 2013 and 2015, seccomp saw further refinements, including the addition of the dedicated seccomp() system call in Linux 3.17 (October 2014), which allowed direct loading and management of filters, complementing the existing prctl(PR_SET_SECCOMP) interface for better multi-threaded support and filter handling.[3] During this period, architecture support expanded beyond the x86 ports available since Linux 3.5, with seccomp-BPF reaching ARM in Linux 3.8 (February 2013) and additional platforms in subsequent releases, enabling broader portability.[3]
In 2019, Linux kernel 5.0 introduced user-space notifications through the SECCOMP_RET_USER_NOTIF action, enhancing seccomp by allowing filters to defer syscall decisions to a supervising user-space process via file descriptors, which improved handling of complex scenarios without relying on ptrace.[11]
From 2020 to 2025, seccomp benefited from eBPF ecosystem advancements starting in Linux 5.7 (May 2020), which included allowing thread synchronization (TSYNC) and user notifications (USER_NOTIF) to be used together, along with improved BPF verifier efficiency and tooling that indirectly enhanced filter performance and development workflows, as well as better ptrace compatibility for debugging filtered processes.[12] Concurrently, adoption grew in languages like Rust through bindings such as libseccomp-rs, providing safer, memory-safe interfaces for filter construction and loading without major API breaks, while kernel optimizations focused on reducing overhead in high-throughput environments. In January 2025, libseccomp version 2.6.0 was released, adding support for new architectures including SuperH (little and big endian) and LoongArch.[13][14] As of November 2025, no major new seccomp-specific features have been introduced in the kernel beyond ongoing eBPF improvements.
Seccomp Modes
Strict Mode
Strict mode, also known as mode 1, represents the original and most restrictive form of seccomp enforcement in the Linux kernel, designed to minimize the attack surface by severely limiting system call access.[3] It is activated by invoking prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT), where SECCOMP_MODE_STRICT has the value 1, which transitions the calling thread into this mode without the ability to revert or apply further configurations.[15] Once enabled, the thread can only execute a minimal set of system calls: read(2) and write(2) on file descriptors that are already open; _exit(2) for termination; and sigreturn(2) for returning from signal handlers.[3] Any attempt to invoke another system call immediately terminates the thread with a SIGKILL signal, so the disallowed call never executes.[1]
This rigid enforcement makes strict mode suitable for ultra-minimal environments, such as legacy applications or highly trusted codebases that perform simple computations without needing broader kernel interactions, including early implementations of sandboxed interpreters processing untrusted bytecode via standard I/O.[3] For instance, it is applicable in number-crunching scenarios where input is received through pipes or sockets and output is produced accordingly, without requiring file operations, networking, or process management beyond basic exit.[3] However, its inflexibility poses significant limitations for modern applications, as it prohibits essential dynamic operations like opening new files, establishing network connections, or handling multi-threaded scenarios with exit_group(2), rendering it impractical for anything beyond absolute minimalism.[1] In contrast to filter mode, which permits programmable syscall policies, strict mode offers no exceptions or nuanced control, enforcing a binary allow-or-kill paradigm.[3]
Filter Mode
Filter mode, also known as seccomp mode 2, represents an advanced configuration of the seccomp facility that allows processes to implement programmable system call policies using Berkeley Packet Filters (BPF). Unlike the rigid restrictions of strict mode, filter mode enables fine-grained control by loading a custom BPF program that evaluates each incoming system call at runtime. This mode is activated by invoking the prctl system call with PR_SET_SECCOMP and SECCOMP_MODE_FILTER as arguments, passing a pointer to the BPF program structure, or equivalently using the seccomp system call with the SECCOMP_SET_MODE_FILTER operation. Activation requires either the CAP_SYS_ADMIN capability or prior invocation of prctl(PR_SET_NO_NEW_PRIVS, 1), which prevents a later execve from granting privileges that the process installing the filter did not have.[1][3]
Upon entering the kernel for a system call, the seccomp subsystem executes the loaded BPF program against a struct seccomp_data structure provided by the kernel. This structure includes the fields nr (the system call number), arch (the architecture identifier, e.g., AUDIT_ARCH_X86_64), instruction_pointer, and an array args[] containing up to six system call arguments. The BPF program processes this data to determine whether to allow, deny, or take alternative actions on the call, with evaluation occurring before the system call handler executes, thus preventing unauthorized operations early in the kernel entry path. This runtime inspection mechanism supports policies that adapt to the specific context of each invocation.[1][3]
Filter mode excels in policy enforcement by permitting whitelisting or blacklisting of specific system calls, as well as conditional allowances based on argument values, enabling sophisticated security profiles. For instance, a policy might whitelist the open system call but restrict it to read-only modes by checking the flags argument for O_RDONLY, or allow socket only for specific address families by checking its domain argument. Because filters cannot dereference pointers, such conditions are limited to values passed directly in the argument registers. These conditionals provide granular control beyond simple syscall enumeration, allowing applications to maintain necessary functionality while minimizing the kernel attack surface. This contrasts with strict mode, which offers only a fixed whitelist without programmability.[1][3]
Regarding inheritance, filters loaded in filter mode automatically propagate to child processes created via fork or clone and are preserved across execve; a child can add further filters but cannot remove the ones it inherited. In multi-threaded environments, where clone with CLONE_THREAD creates sibling threads, the filter applies per task but requires synchronization to avoid inconsistencies; this is achieved using the SECCOMP_FILTER_FLAG_TSYNC flag during filter addition, which ensures all threads in the process adopt the same filter tree atomically. This inheritance model maintains security isolation while supporting concurrent execution, with the kernel ensuring that new threads cannot weaken existing protections.[1][3]
Seccomp Filters
Berkeley Packet Filter Integration
Seccomp integrates the Berkeley Packet Filter (BPF) by adapting a subset of classic BPF (cBPF) instructions to enable programmable system call filtering within the Linux kernel. This adaptation restricts the instruction set to safe operations, including loading immediate values (BPF_LD | BPF_IMM, BPF_LDX | BPF_IMM), arithmetic operations (BPF_ALU with opcodes like BPF_ADD, BPF_SUB, BPF_MUL, BPF_DIV, BPF_AND, BPF_OR, BPF_XOR, BPF_LSH, BPF_RSH, BPF_NEG), conditional jumps (BPF_JMP with conditions like BPF_JEQ, BPF_JGT, BPF_JGE, BPF_JSET), memory loads/stores to 16 slots of scratch space (BPF_LD | BPF_MEM, BPF_ST), register transfers (BPF_MISC | BPF_TAX, BPF_MISC | BPF_TXA), and crucially, absolute word loads from the input data (BPF_LD | BPF_W | BPF_ABS) to read from struct seccomp_data at fixed offsets, while prohibiting pointer dereferencing to mitigate time-of-check-to-time-of-use (TOCTOU) vulnerabilities. Seccomp-specific extensions allow the BPF program to access a structured input via the struct seccomp_data, defined as { int nr; __u32 arch; __u64 instruction_pointer; __u64 args[6]; }. Fields are read as 32-bit words at fixed byte offsets: syscall number (nr) at 0, architecture (arch) at 4, instruction pointer at 8 and 12, and arguments args[0] to args[5] at 16/20, 24/28, 32/36, 40/44, 48/52, and 56/60. On little-endian systems the first offset of each pair holds the lower 32 bits and the second the upper 32 bits; on big-endian systems the order is reversed. This alignment enables efficient 32-bit word reads during BPF execution.[1][16]
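The field offsets of struct seccomp_data can be checked by mirroring the structure with Python's `ctypes`. This is a sketch for verifying the layout only; the class name `SeccompData` and the helper variables are introduced here for illustration and are not part of any seccomp API.

```python
import ctypes


class SeccompData(ctypes.Structure):
    """Mirror of struct seccomp_data as quoted in the text above."""
    _fields_ = [
        ("nr", ctypes.c_int),                     # system call number
        ("arch", ctypes.c_uint32),                # AUDIT_ARCH_* value
        ("instruction_pointer", ctypes.c_uint64),
        ("args", ctypes.c_uint64 * 6),            # six 64-bit arguments
    ]


# Byte offset of every field, as a BPF_ABS load would address it.
offsets = {name: getattr(SeccompData, name).offset
           for name, _ in SeccompData._fields_}

# Each argument occupies 8 bytes, so a filter reads it as two 32-bit words.
arg_offsets = [SeccompData.args.offset + 8 * i for i in range(6)]

print(offsets)
print(arg_offsets)
print(ctypes.sizeof(SeccompData))
```

Because classic BPF loads 32-bit words, a filter that compares a full 64-bit argument has to load and test both halves, at `arg_offsets[i]` and `arg_offsets[i] + 4`.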
The program structure consists of linear or jump-based bytecode sequences compiled into a struct sock_fprog format, typically generated from user-space tools like libseccomp or seccomp-tools, and executed on a virtual machine with two 32-bit registers (A and X), 16 words of read-write scratch memory, and read-only access to the seccomp data packet. The bytecode processes this input to compute an action value, returned via a BPF_RET instruction, which determines the fate of the system call—such as allowing it to proceed or triggering a denial. This design leverages cBPF's simplicity for predictability, ensuring the program terminates quickly without unbounded computation.[1][17]
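To make the execution model concrete, the following sketch evaluates a tiny allowlist program in user space. It is a teaching model under stated assumptions, not the kernel's interpreter: only the three opcodes the example needs are handled, the X register and scratch memory are omitted, the syscall numbers are those of x86-64, and `run_filter` and `pack_seccomp_data` are hypothetical helper names.

```python
import struct

# Classic BPF opcodes, composed from <linux/bpf_common.h> values.
LD_W_ABS = 0x00 | 0x00 | 0x20  # BPF_LD | BPF_W | BPF_ABS
JEQ_K = 0x05 | 0x10 | 0x00     # BPF_JMP | BPF_JEQ | BPF_K
RET_K = 0x06 | 0x00            # BPF_RET | BPF_K

SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ALLOW = 0x7FFF0000
AUDIT_ARCH_X86_64 = 0xC000003E
AUDIT_ARCH_I386 = 0x40000003
NR_READ, NR_WRITE, NR_OPEN = 0, 1, 2  # x86-64 syscall numbers

# Each instruction is (code, jt, jf, k); a jump goes to pc + 1 + offset.
PROGRAM = [
    (LD_W_ABS, 0, 0, 4),               # A = seccomp_data.arch
    (JEQ_K, 0, 4, AUDIT_ARCH_X86_64),  # wrong architecture: jump to kill
    (LD_W_ABS, 0, 0, 0),               # A = seccomp_data.nr
    (JEQ_K, 1, 0, NR_READ),            # read: jump to allow
    (JEQ_K, 0, 1, NR_WRITE),           # write: allow, anything else: kill
    (RET_K, 0, 0, SECCOMP_RET_ALLOW),
    (RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
]


def pack_seccomp_data(nr, arch, ip=0, args=(0,) * 6):
    """Serialize the fields in struct seccomp_data order (64 bytes)."""
    return struct.pack("=iIQ6Q", nr, arch, ip, *args)


def run_filter(program, data):
    """Run the program over the packed data and return its action value."""
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = program[pc]
        if code == LD_W_ABS:
            acc = struct.unpack_from("=I", data, k)[0]
            pc += 1
        elif code == JEQ_K:
            pc += 1 + (jt if acc == k else jf)
        elif code == RET_K:
            return k
        else:
            raise ValueError("opcode 0x%02x is not modelled" % code)


print(hex(run_filter(PROGRAM, pack_seccomp_data(NR_WRITE, AUDIT_ARCH_X86_64))))
print(hex(run_filter(PROGRAM, pack_seccomp_data(NR_OPEN, AUDIT_ARCH_X86_64))))
```

Checking the architecture before the syscall number matters because syscall numbers differ between ABIs; a filter that skips the check could be bypassed by invoking a call through another ABI.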
In the kernel, seccomp filters are attached as a tree to the task_struct of the process via the seccomp field, forming a filter chain in which a task's own filters and those inherited from its ancestors are evaluated together. Upon a system call entry, the syscall entry path invokes __secure_computing(), which calls seccomp_run_filters() before dispatching to the specific handler, passing the current register state mapped to struct seccomp_data. The function traverses the filter chain from the most recently installed filter back to the oldest, executing each BPF program and keeping the lowest (most restrictive) action value; optimizations like just-in-time (JIT) compilation and action caching for constant outcomes keep the per-call overhead small for simple filters.[16][1]
To ensure safety, the kernel validates user-supplied BPF programs during attachment, combining the classic BPF checker with seccomp-specific checks in kernel/seccomp.c. It rejects division by a constant zero, out-of-bounds memory accesses, and reads of uninitialized scratch memory, and it guarantees termination because classic BPF jump offsets are unsigned, so every jump is forward-only and must land inside the program. This prevents denial-of-service attacks from malformed programs, rejecting invalid filters and returning -EINVAL to user space, thereby maintaining kernel stability without requiring the full eBPF verifier complexity.[16]
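A few of these structural checks can be sketched in user space. The function below is a deliberately simplified model, not the kernel's verifier: it covers only program length, in-bounds forward jumps, aligned in-range absolute loads, and a final return instruction, and `check_filter` is a name introduced here for illustration.

```python
BPF_MAXINSNS = 4096     # per-program instruction limit
SECCOMP_DATA_SIZE = 64  # sizeof(struct seccomp_data)

BPF_JMP, BPF_RET = 0x05, 0x06  # instruction classes (code & 0x07)
BPF_JA = 0x00                  # unconditional jump operation (code & 0xF0)
LD_W_ABS = 0x20                # BPF_LD | BPF_W | BPF_ABS


def check_filter(program):
    """Return True if the (code, jt, jf, k) list passes the modelled checks."""
    n = len(program)
    if not 1 <= n <= BPF_MAXINSNS:
        return False
    for pc, (code, jt, jf, k) in enumerate(program):
        if code == LD_W_ABS:
            # Loads must be word-aligned and stay inside seccomp_data.
            if k % 4 != 0 or k >= SECCOMP_DATA_SIZE:
                return False
        elif (code & 0x07) == BPF_JMP:
            # Offsets are unsigned, so jumps only move forward; they must
            # still land on an existing instruction.
            if (code & 0xF0) == BPF_JA:
                targets = [pc + 1 + k]
            else:
                targets = [pc + 1 + jt, pc + 1 + jf]
            if any(target >= n for target in targets):
                return False
    # Execution must not be able to run off the end of the program.
    return (program[-1][0] & 0x07) == BPF_RET


good = [(0x20, 0, 0, 0), (0x15, 0, 1, 1),
        (0x06, 0, 0, 0x7FFF0000), (0x06, 0, 0, 0)]
print(check_filter(good))
print(check_filter([(0x20, 0, 0, 2), (0x06, 0, 0, 0)]))  # misaligned load
```

Since every jump is forward and bounded, a program of n instructions executes at most n instructions, which is why no separate loop analysis is needed in this model.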
Filter Actions
Seccomp filter actions determine the outcome when a system call is evaluated against an attached Berkeley Packet Filter (BPF) program, allowing fine-grained control over process behavior without necessarily executing the call. These actions are encoded as specific return values from the BPF program: the upper 16 bits select the action and the lower 16 bits (SECCOMP_RET_DATA) carry action-specific data. When multiple filters are installed, the kernel applies the highest-precedence action among their results, ensuring consistent enforcement across architectures supporting seccomp filters.
The most permissive action is SECCOMP_RET_ALLOW (0x7fff0000), which permits the system call to execute with its original arguments and registers intact. A filter has no implicit default, so every path through the program must end in an explicit return; allowlist filters return this value for the calls they permit. Unlike other actions, it does not trigger logging or signals, which keeps the allowed path cheap.
SECCOMP_RET_KILL (0x00000000), an alias for SECCOMP_RET_KILL_THREAD, terminates the calling thread immediately as though it had been killed by a SIGSYS signal, preventing the system call from proceeding. SECCOMP_RET_KILL_PROCESS (0x80000000) terminates the entire process in the same way. In single-threaded processes, killing the thread effectively kills the entire process and may produce a core dump since Linux 4.11. If auditing is enabled and the action is in the logged set (configurable via /proc/sys/kernel/seccomp/actions_logged since Linux 4.14), the kernel records the violation for security monitoring.
SECCOMP_RET_TRAP (0x00030000) interrupts the process by sending a SIGSYS signal to the calling thread, providing detailed information about the attempted system call through the SIGSYS fields of siginfo_t. These include si_call_addr (instruction pointer), si_syscall (system call number), si_arch (architecture), and si_errno (set to the BPF program's SECCOMP_RET_DATA value). The system call is not executed, allowing a custom signal handler to inspect or respond to the event.
SECCOMP_RET_ERRNO (0x00050000 | errno) causes the kernel to return a specified error code to userspace without executing the system call, where the errno value (ranging from 0 to 4095) is taken from the lower 16 bits of the BPF return value. This action simulates a failed system call transparently, which is useful for denying access while maintaining application compatibility, and it produces neither signals nor core dumps.
SECCOMP_RET_TRACE (0x7ff00000) notifies any attached ptrace tracer by raising a PTRACE_EVENT_SECCOMP event, allowing debugging tools to intervene, modify arguments, or decide whether to proceed with the call. If no tracer is present, the kernel returns -ENOSYS to the process, effectively denying the call. The event message includes the BPF program's SECCOMP_RET_DATA value via PTRACE_GETEVENTMSG, enabling advanced interception without full process termination.
Introduced in Linux 5.0, SECCOMP_RET_USER_NOTIF (0x7fc00000) queues a notification to a userspace handler via a file descriptor obtained through seccomp(2) with SECCOMP_SET_MODE_FILTER and the SECCOMP_FILTER_FLAG_NEW_LISTENER flag. The kernel suspends the system call until the handler responds using ioctls on the notification file descriptor, such as SECCOMP_IOCTL_NOTIF_SEND to allow, deny, or emulate the call (optionally installing a file descriptor in the target via SECCOMP_IOCTL_NOTIF_ADDFD), allowing unprivileged userspace to emulate or forward calls securely; a pidfd can be used to manage the task if needed. If no listener is attached to the filter, the call fails with ENOSYS.[3]
Programming Interface
System Calls and Prctl
The original interface for enabling seccomp in the Linux kernel is the prctl system call with the PR_SET_SECCOMP option, which remains available as a legacy mechanism to transition a process into a secure computing mode. This call takes the form prctl(PR_SET_SECCOMP, int mode), where mode specifies either SECCOMP_MODE_STRICT (value 1) for a highly restrictive environment allowing only basic system calls like exit, sigreturn, read, and write on already-open file descriptors, or SECCOMP_MODE_FILTER (value 2) to activate a Berkeley Packet Filter (BPF)-based filter for more granular control. For filter mode, a third argument, a pointer to a struct sock_fprog containing the BPF program, must be provided; otherwise, it is unused. On success, prctl returns a nonnegative value (typically 0), while failure returns -1 with errno set, such as EINVAL for an invalid mode or filter program.[15][6]
Introduced in Linux kernel version 3.17, the dedicated seccomp system call provides a modern, extensible interface that supersedes the limitations of prctl for filter mode operations. Invoked as seccomp(int op, unsigned int flags, const struct sock_fprog *uargs), it primarily uses the SECCOMP_SET_MODE_FILTER operation (op) to validate and load a BPF program specified in uargs, which defines the allowed system calls and actions based on the struct seccomp_data input. When flags is 0, this call is functionally equivalent to prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, uargs), but additional flags like SECCOMP_FILTER_FLAG_LOG enable logging of filter decisions. Success returns 0; errors return -1 with errno set, including EFAULT for an invalid pointer in uargs, EACCES for insufficient privileges, and EINVAL if the BPF program is invalid or exceeds the limit of 4096 instructions.[3][6]
Enabling filter mode via either interface requires a privilege check: the calling thread must either possess the CAP_SYS_ADMIN capability in its user namespace or have previously set the no_new_privs bit using prctl(PR_SET_NO_NEW_PRIVS, 1). These requirements prevent an unprivileged process from applying a malicious filter to a set-user-ID or otherwise privileged program that it then executes. Once activated, seccomp restrictions are irreversible for the calling thread: they are inherited by children created via fork or clone and are preserved across execve, provided the filter permits the execve call itself.[3][15][6]
Loading Filters
Seccomp filters are constructed either directly using raw Berkeley Packet Filter (BPF) assembly or through higher-level libraries such as libseccomp, which abstracts the process into a more user-friendly API. In the raw approach, developers build a filter program as an array of BPF instructions stored in a struct sock_fprog, which consists of an unsigned short len field indicating the number of instructions and a pointer to an array of struct sock_filter elements, each containing a 16-bit code, two 8-bit jump offsets (jt and jf), and a 32-bit constant (k). This structure encapsulates the filter logic, which inspects system call details via the seccomp_data structure, including the system call number (nr), architecture (arch), and up to six arguments (args[0-5]).[18]
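The memory layout of these two structures can be reproduced with `ctypes`, which is a convenient way to see how a raw filter is assembled. The sketch below only builds the structures and deliberately does not install the filter, since installation is irreversible for the calling process. The class names, the denylist program, and the x86-64 syscall number for getpid (39) are illustrative assumptions.

```python
import ctypes
import errno


class SockFilter(ctypes.Structure):
    """Mirror of struct sock_filter: one 8-byte classic BPF instruction."""
    _fields_ = [("code", ctypes.c_uint16), ("jt", ctypes.c_uint8),
                ("jf", ctypes.c_uint8), ("k", ctypes.c_uint32)]


class SockFprog(ctypes.Structure):
    """Mirror of struct sock_fprog: instruction count plus array pointer."""
    _fields_ = [("len", ctypes.c_ushort),
                ("filter", ctypes.POINTER(SockFilter))]


LD_W_ABS, JEQ_K, RET_K = 0x20, 0x15, 0x06
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000
AUDIT_ARCH_X86_64 = 0xC000003E
NR_GETPID = 39  # x86-64 syscall number, chosen only as an example

INSNS = [
    (LD_W_ABS, 0, 0, 4),                         # A = arch
    (JEQ_K, 1, 0, AUDIT_ARCH_X86_64),            # expected arch: skip the kill
    (RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
    (LD_W_ABS, 0, 0, 0),                         # A = nr
    (JEQ_K, 0, 1, NR_GETPID),                    # getpid: deny, else allow
    (RET_K, 0, 0, SECCOMP_RET_ERRNO | errno.EPERM),
    (RET_K, 0, 0, SECCOMP_RET_ALLOW),
]

# Contiguous array of instructions, then the descriptor pointing at it.
program = (SockFilter * len(INSNS))(*(SockFilter(*insn) for insn in INSNS))
fprog = SockFprog(len(INSNS), ctypes.cast(program, ctypes.POINTER(SockFilter)))

print(ctypes.sizeof(SockFilter), fprog.len)
```

In a real program, a pointer to `fprog` would be the argument handed to the kernel after `PR_SET_NO_NEW_PRIVS` has been set; that step is left out here on purpose.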
Libraries like libseccomp simplify construction by initializing a filter context with seccomp_init(), specifying a default action such as SCMP_ACT_KILL for disallowed calls, followed by adding specific rules using functions like seccomp_rule_add() to allow or restrict syscalls with optional argument checks. Internally, libseccomp compiles these rules into the equivalent BPF bytecode and populates a sock_fprog structure for kernel submission. Once constructed, the filter is attached to the current process or thread via the seccomp() system call, invoked as seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog), where prog is the sock_fprog pointer; this requires prior invocation of prctl(PR_SET_NO_NEW_PRIVS, 1) to drop privileges or possession of the CAP_SYS_ADMIN capability. Supported flags include SECCOMP_FILTER_FLAG_TSYNC to synchronize the filter across all threads in the process, SECCOMP_FILTER_FLAG_LOG to enable logging of filter actions (available since Linux 4.14), and SECCOMP_FILTER_FLAG_NEW_LISTENER to return a file descriptor for user notifications on trapped syscalls (since Linux 5.0).[18]
Multiple filters can be stacked to form a filter tree per thread, with each new filter appended to the tree and evaluated in reverse chronological order, meaning later-loaded filters run first. Every filter in the chain is executed, and the result with the highest precedence (e.g., SECCOMP_RET_KILL_PROCESS over SECCOMP_RET_ALLOW) determines the outcome, so a subsequently loaded filter can tighten the policy for specific syscalls but cannot re-allow a call that an earlier filter denies. The kernel enforces limits, such as a maximum of 4096 instructions per filter and 32768 total across the tree, to prevent resource exhaustion.[18]
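The way results from stacked filters are combined can be modelled in a few lines. This sketch assumes the rule that the action bits are compared as a signed 32-bit number and the smallest value wins, which reproduces the documented precedence order; `action_only` and `combine` are hypothetical helper names.

```python
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_KILL_THREAD = 0x00000000
SECCOMP_RET_TRAP = 0x00030000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ACTION_FULL = 0xFFFF0000


def action_only(ret):
    """Action bits of a return value, interpreted as a signed 32-bit int."""
    value = ret & SECCOMP_RET_ACTION_FULL
    return value - (1 << 32) if value & 0x80000000 else value


def combine(results):
    """Pick the winning result; `results` is ordered newest filter first."""
    winner = SECCOMP_RET_ALLOW
    for ret in results:
        # Strictly smaller wins, so ties keep the earlier (newer) filter's data.
        if action_only(ret) < action_only(winner):
            winner = ret
    return winner


# A newer filter that allows a call cannot undo an older filter's denial.
print(hex(combine([SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO | 1])))
print(hex(combine([SECCOMP_RET_ERRNO | 13, SECCOMP_RET_KILL_PROCESS])))
```

Treating the value as signed is what places SECCOMP_RET_KILL_PROCESS (0x80000000) below SECCOMP_RET_KILL_THREAD (0), making it the most restrictive outcome.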
Verification occurs during loading, where the kernel's BPF verifier rejects invalid programs—such as those with unreachable instructions, excessive jumps, or invalid actions—returning EINVAL to the caller. User-space tools like seccomp-tools facilitate further analysis by dumping BPF code from running processes via ptrace, disassembling it into readable format with syscall names, and emulating filter behavior against test inputs, supporting architectures like x86_64 and ARM64. These steps ensure filters are correctly constructed and attached before enforcement begins.[18][19]
Applications and Usage
In Containerization
Seccomp plays a crucial role in enhancing the security of containerized environments by restricting system calls available to container processes, thereby limiting potential attack surfaces in container runtimes like Docker. In Docker, the default seccomp profile integrated into the dockerd daemon enforces a whitelist-based policy that disables approximately 44 system calls out of over 300, targeting those deemed dangerous for container isolation, such as mount (which requires elevated privileges like CAP_SYS_ADMIN) and keyctl (related to non-namespaced kernel keyrings). This profile is sourced from the Moby project's default.json and applies automatically to containers unless explicitly overridden, providing a baseline level of protection against privilege escalation attempts within the container.[4][20]
Docker allows customization of seccomp profiles to tailor restrictions to specific workloads, using the --security-opt seccomp=/path/to/profile.json flag during container execution, which loads a user-defined JSON file compliant with the Open Container Initiative (OCI) specification. For instance, administrators can generate custom profiles using tools like seccomp-gen, which analyzes strace output from a containerized application to whitelist only the necessary syscalls, thereby minimizing unnecessary kernel interactions while preserving functionality. This flexibility enables fine-grained control, such as allowing additional syscalls for debugging or specialized applications without disabling seccomp entirely.[4][21]
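The idea of deriving an allowlist profile from a trace can be sketched as follows. This is an illustrative approximation and not the implementation of seccomp-gen: the regular expression, the function name `profile_from_strace`, the sample trace lines, and the choice of x86-64 as the only architecture are all assumptions. The emitted keys (`defaultAction`, `architectures`, `syscalls`, `names`, `action`) follow the JSON profile format that Docker accepts.

```python
import json
import re

# Matches the syscall name at the start of a trace line, with an optional
# "[pid N]" prefix as printed when following child processes.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?([a-z_][a-z0-9_]*)\(")


def profile_from_strace(lines):
    """Build an allowlist profile from strace-style output lines."""
    names = set()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:  # skips signal and exit markers such as "+++ exited +++"
            names.add(match.group(1))
    return {
        "defaultAction": "SCMP_ACT_ERRNO",      # deny everything not listed
        "architectures": ["SCMP_ARCH_X86_64"],  # assumption: x86-64 only
        "syscalls": [{"names": sorted(names), "action": "SCMP_ACT_ALLOW"}],
    }


# Hypothetical trace excerpt used only to exercise the parser.
SAMPLE = [
    'openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3',
    'read(3, "127.0.0.1 localhost", 4096) = 19',
    '[pid  4242] write(1, "ok", 2) = 2',
    'read(3, "", 4096) = 0',
    "+++ exited with 0 +++",
]

profile = profile_from_strace(SAMPLE)
print(json.dumps(profile, indent=2))
```

A profile generated this way covers only the code paths exercised during tracing, so it should be tested against the full workload before being passed to `--security-opt seccomp=...`.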
In Kubernetes, seccomp integration occurs at the pod specification level through the securityContext.seccompProfile field, where the type can be set to RuntimeDefault to apply the underlying container runtime's default profile (e.g., Docker's or containerd's) or Localhost for a custom profile stored on the node at /var/lib/kubelet/seccomp/. This configuration enforces syscall restrictions per container, promoting multi-tenant isolation in shared clusters by preventing workloads from invoking syscalls that could compromise the host kernel or other pods. Kubernetes supports inheritance of seccomp settings from pod to container levels, ensuring consistent application across orchestrated environments.[22]
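A pod manifest using this field can be generated programmatically, as in the sketch below. The function `pod_manifest`, the pod name, the image, and the profile path are placeholders chosen for illustration; only the `securityContext.seccompProfile` structure with its `type` and `localhostProfile` keys reflects the field described above. The result is emitted as JSON, which Kubernetes accepts alongside YAML.

```python
import json


def pod_manifest(name, image, profile_type="RuntimeDefault",
                 localhost_profile=None):
    """Return a minimal Pod manifest with a pod-level seccomp profile."""
    seccomp = {"type": profile_type}
    if profile_type == "Localhost":
        if not localhost_profile:
            raise ValueError("Localhost profiles need a profile path")
        # Path is relative to the kubelet's seccomp directory on the node.
        seccomp["localhostProfile"] = localhost_profile
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "securityContext": {"seccompProfile": seccomp},
            "containers": [{"name": name, "image": image}],
        },
    }


# Placeholder names; substitute a real image and profile file.
default_pod = pod_manifest("demo", "example.invalid/app:1.0")
custom_pod = pod_manifest("demo", "example.invalid/app:1.0",
                          "Localhost", "profiles/audit.json")
print(json.dumps(custom_pod, indent=2))
```

Setting the profile at the pod level applies it to every container in the pod; an individual container can carry its own `securityContext.seccompProfile` when a different profile is needed.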
The primary benefits of seccomp in containerization include helping to prevent container escapes and breakout attacks by confining processes to a minimal set of approved kernel interactions, thus enforcing the principle of least privilege at the syscall layer. By blocking unauthorized syscalls, seccomp reduces the risk of exploits that rely on kernel vulnerabilities, complementing other security mechanisms like namespaces and cgroups; filter mode is what makes such programmable, per-workload restrictions possible. Profiles can be generated and tested iteratively using tools like seccomp-gen to balance security and performance without over-restricting benign operations.[4][22][21]
Seccomp is widely adopted in production Kubernetes clusters, in part to comply with established security standards such as the CIS Kubernetes Benchmarks (which recommend enabling default seccomp profiles for all workloads) and NIST SP 800-190 (which highlights seccomp as a key control for application container runtime protection). These guidelines underscore seccomp's role in mitigating container-specific risks, making it a standard practice in hardened deployments for regulatory alignment and threat mitigation.[23][24]
In Web Browsers
Google Chrome integrates seccomp-BPF into its multi-process architecture to enhance renderer process isolation, a feature introduced in version 23 around late 2012 and widely deployed by 2013. This mechanism confines renderer processes, which execute untrusted web content, to a restricted set of system calls, permitting essential operations like network communication via sendto and recvfrom while prohibiting dangerous actions such as file creation with open or process spawning with fork.[25][26] These filters complement other isolation techniques, including Linux namespaces, which limit the process's visibility into the file system, process list, and network stack, collectively forming Chrome's layered sandbox defense.
The seccomp filters for Chrome's renderer sandbox are generated at build time through automated tools and policy definitions within the Chromium codebase, such as those in the sandbox/linux/seccomp_bpf directory, ensuring consistent and auditable restrictions across deployments. When a violation occurs—such as an attempt to invoke a disallowed syscall—the kernel terminates the process immediately via SIGSYS or SIGKILL, preventing potential exploitation; these incidents are captured as crashes and reported to Google's Crashpad system for analysis and telemetry.[27]
Other browsers have adopted similar syscall filtering for process isolation, though implementations vary by platform. Mozilla Firefox began applying seccomp filters to content processes in 2016, with broader rollout by 2017, particularly targeting media decoding and web rendering to mitigate risks from malformed content; this focuses on denying filesystem and device access while allowing rendering essentials.[28] Apple's Safari, built on WebKit, employs the macOS XNU kernel's native sandboxing framework—mandatory access controls via Seatbelt profiles—rather than seccomp, as the latter is Linux-specific; this achieves comparable isolation for web content processes by restricting entitlements like file I/O and inter-process communication.[29][30]
The adoption of seccomp-BPF in Chrome has demonstrably bolstered security by blocking kernel interaction vectors commonly targeted in browser attacks.[31][32]
Limitations and Future Developments
Current Constraints
Seccomp imposes notable performance overhead due to the evaluation of Berkeley Packet Filter (BPF) programs on each system call. Measurements indicate that this adds approximately 66 to 226 cycles per syscall for common operations like getpid or open, depending on the syscall complexity and filter design.[33] More intricate filters or high rates of system calls exacerbate this overhead, as the linear nature of BPF execution scales with filter depth and invocation frequency, potentially impacting latency-sensitive applications.[33][34]
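To put the per-syscall figures in context, the short Python sketch below converts them into a share of one CPU core. The 3 GHz clock rate and the syscall rates are assumptions chosen for illustration, not measurements; only the 66 and 226 cycle bounds come from the text above.

```python
def filter_overhead_fraction(cycles_per_syscall, syscalls_per_second, cpu_hz):
    """Fraction of one core's cycles spent evaluating the seccomp filter."""
    return cycles_per_syscall * syscalls_per_second / cpu_hz


if __name__ == "__main__":
    cpu_hz = 3.0e9  # assumed clock rate of 3 GHz
    for rate in (1e4, 1e5, 1e6):  # assumed syscalls per second
        low = filter_overhead_fraction(66, rate, cpu_hz)
        high = filter_overhead_fraction(226, rate, cpu_hz)
        print(f"{rate:>9,.0f} syscalls/s: {low:.3%} to {high:.3%} of one core")
```

Under these assumptions, a process issuing one million syscalls per second would spend roughly 2.2% to 7.5% of a core on filter evaluation, while at ten thousand syscalls per second the cost falls below 0.1%.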
Compatibility remains a key constraint: seccomp BPF filtering has been available on x86-64 since its introduction in kernel 3.5, while other architectures gained support later, including ARM (since kernel 3.8), MIPS (since kernel 3.16), and ARM64 (since kernel 3.19).[6][3] Additionally, an installed filter cannot be removed or relaxed, because the transition into filtered mode is irreversible; later filters can only add restrictions, so permitting a further syscall means restarting the workload under a revised filter.[6]
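The one-way nature of filtering can be illustrated with a small Python model. This is a simulation under stated assumptions, not the kernel interface: the helper names are invented, filters are modelled as plain functions from a syscall name to an action, and the combination rule follows the precedence order documented for seccomp, in which the most restrictive result among all installed filters wins.

```python
# Actions in decreasing order of precedence (most restrictive first).
PRECEDENCE = [
    "KILL_PROCESS", "KILL_THREAD", "TRAP", "ERRNO",
    "USER_NOTIF", "TRACE", "LOG", "ALLOW",
]


def make_allowlist(allowed, otherwise="ERRNO"):
    """Model one filter: ALLOW for listed syscalls, `otherwise` for the rest."""
    allowed = frozenset(allowed)
    return lambda syscall: "ALLOW" if syscall in allowed else otherwise


def evaluate(filters, syscall):
    """Run every installed filter and keep the highest-precedence action."""
    results = [f(syscall) for f in filters]
    return min(results, key=PRECEDENCE.index) if results else "ALLOW"


if __name__ == "__main__":
    installed = [make_allowlist({"read", "write", "exit_group"})]
    print(evaluate(installed, "openat"))  # denied by the first filter

    # Stacking a more permissive filter does not re-enable openat.
    installed.append(make_allowlist({"read", "write", "exit_group", "openat"}))
    print(evaluate(installed, "openat"))

    # Stacking a stricter filter does take effect.
    installed.append(make_allowlist({"read", "exit_group"}, otherwise="KILL_PROCESS"))
    print(evaluate(installed, "write"))
```

In the model, the second filter cannot widen what the first one denied, whereas the third filter immediately narrows the permitted set, which mirrors why a running process can tighten its sandbox but never loosen it.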
Debugging seccomp violations presents significant challenges: tracing them typically relies on tools like auditd to log events in /var/log/audit/audit.log, and without such logging a violation may go unnoticed beyond the termination of the process.[35] The seccomp user-space notification mechanism, available since kernel 5.0, allows supervisors to intercept and handle filtered system calls but demands additional setup, such as file descriptor polling and privileged oversight, complicating routine diagnostics.[6][36]
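As a starting point for diagnostics, the Python sketch below extracts the key=value fields from audit records of type SECCOMP and tallies them per command and syscall number. The function names are illustrative, and the sample record is hand-written in the general shape of such records rather than copied from a real log; the fields present can vary between kernels and audit configurations.

```python
import re
from collections import Counter

# key=value pairs, where the value is either a quoted string or a bare token.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_seccomp_record(line):
    """Return the fields of a type=SECCOMP audit record, or None for other lines."""
    if not line.startswith("type=SECCOMP "):
        return None
    return {key: value.strip('"') for key, value in FIELD_RE.findall(line)}


def count_violations(lines):
    """Count SECCOMP records per (command name, syscall number) pair."""
    counts = Counter()
    for line in lines:
        record = parse_seccomp_record(line)
        if record:
            counts[(record.get("comm"), record.get("syscall"))] += 1
    return counts


# Synthetic example record; NOT taken from a real audit.log.
SAMPLE = (
    'type=SECCOMP msg=audit(1700000000.123:456): auid=1000 uid=1000 gid=1000 '
    'ses=1 pid=4242 comm="demo" exe="/usr/local/bin/demo" sig=31 '
    'arch=c000003e syscall=257 compat=0 ip=0x7f0000001234 code=0x0'
)

if __name__ == "__main__":
    for (comm, syscall), n in count_violations([SAMPLE]).items():
        print(f"{comm}: syscall {syscall} blocked {n} time(s)")
```

The syscall field is numeric and architecture-specific, so it has to be mapped back to a name for the architecture given in the arch field before it can be added to or compared against a profile.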
Security gaps further limit seccomp's efficacy, as it cannot intercept or filter kernel-internal operations or syscalls invoked indirectly through the virtual dynamic shared object (vDSO), which executes certain calls—like gettimeofday—entirely in user space, bypassing filters and leading to inconsistent behavior across systems.[6] Moreover, seccomp offers no protection against kernel bugs or vulnerabilities, as it operates solely at the user-kernel boundary and assumes a trusted kernel implementation.[6] These limitations can be partially mitigated through filter actions like tracing, though such approaches introduce their own overheads.[6]