Completely Fair Scheduler
The Completely Fair Scheduler (CFS) is a process scheduler in the Linux kernel designed to allocate CPU time equitably among tasks by approximating an "ideal multi-tasking CPU" scenario, in which each runnable task receives a proportional share of processor resources as if executing in parallel on separate hardware.[1] Developed primarily by Ingo Molnár, CFS was first announced on April 13, 2007, as part of a modular scheduler core overhaul, and was merged into the mainline Linux kernel with version 2.6.23 in October 2007, supplanting the earlier O(1) scheduler as the default for general-purpose (SCHED_NORMAL) workloads.[2][1] This integration marked a shift toward a fairness-focused design that prioritizes desktop interactivity and server efficiency without relying on fixed time slices or complex heuristics.[1]

At its core, CFS employs a red-black tree (rbtree) to keep runnable tasks ordered by their virtual runtime, a metric that accumulates actual CPU usage adjusted for task priority (nice levels), and always selects the task with the lowest virtual runtime for execution.[1] This approach uses nanosecond-granularity accounting for precise fairness, with tunable parameters such as scheduling granularity (default 4–6 ms for desktop use, 10–20 ms for throughput-oriented systems) to balance responsiveness and efficiency.[1] Unlike prior schedulers, CFS avoids runqueue arrays and periodic ticks for task selection, reducing overhead and eliminating artifacts such as unfairness from array switches.[1][2]

CFS operates within a class-based framework, handling the SCHED_NORMAL, SCHED_BATCH (for CPU-intensive tasks), and SCHED_IDLE policies, while real-time classes (SCHED_FIFO and SCHED_RR) are managed separately via sched_rt.c.[1] Key extensions include group scheduling (enabled via CONFIG_FAIR_GROUP_SCHED), which allows hierarchical resource control through cgroups and cpu.shares for proportional bandwidth allocation, and bandwidth control to limit maximum CPU usage per group.[3][1] These features enhance scalability on multi-core SMP systems through simplified load balancing, in which idle CPUs actively pull tasks from busy ones.[1]

Compared to the O(1) scheduler, CFS offers superior resistance to manipulation attacks (e.g., via tools such as fiftyp.c), better handling of priority adjustments, and no performance regressions in benchmarks, while reducing kernel code size and improving overall system isolation.[2][1] It served as the cornerstone of Linux scheduling for nearly two decades, powering diverse environments from desktops to servers. However, CFS was superseded by the Earliest Eligible Virtual Deadline First (EEVDF) scheduler, which became the default in Linux kernel 6.6 (released October 2023), to address certain limitations in fairness and latency; while CFS code remains for compatibility, EEVDF is the current standard for general-purpose scheduling as of 2025.[4][5][6]

Introduction
Overview
The Completely Fair Scheduler (CFS) served as the default process scheduler for SCHED_NORMAL tasks in the Linux kernel from version 2.6.23, released in October 2007, through version 6.5.[1] Developed primarily by Ingo Molnár, CFS was designed to handle the scheduling of typical desktop and server workloads by ensuring equitable CPU resource allocation among competing processes.[1] At its core, CFS employs a fair queuing mechanism that distributes CPU time proportionally to runnable tasks based on their priority and nice values, aiming to approximate an "ideal multi-tasking CPU" where each task receives its fair share without relying on fixed timeslices.[1] This approach prioritizes conceptual fairness over rigid quanta, using virtual runtime as the key metric to track and balance actual execution time against a task's entitled share, thereby minimizing scheduling overhead and promoting responsiveness.[1]

In Linux kernel version 6.6, released in late 2023, CFS was succeeded by the Earliest Eligible Virtual Deadline First (EEVDF) scheduler as the new default for the fair scheduling class, addressing certain latency issues in CFS while building on its foundational principles.[4][6] Despite this transition, CFS's influence persists in the Linux ecosystem: it remains available as a configurable option and continues to underpin many existing deployments and extensions.[1]

Design Goals
The Completely Fair Scheduler (CFS) was designed primarily to enhance desktop interactivity by providing low-latency responses for interactive workloads, addressing the limitations of prior schedulers such as the O(1) scheduler, which struggled with mixed interactive and CPU-bound tasks.[1][2] This focus aimed to model an ideal multi-tasking CPU where each task receives a fair share of processing time proportional to its priority, ensuring that no task is starved while maintaining responsiveness for user-facing applications.[1] A key motivation was to eliminate the heuristic-based fixed timeslices and jiffy-dependent accounting of earlier schedulers, replacing them with nanosecond-granularity runtime tracking to achieve more precise and deterministic fairness without sacrificing efficiency.[2] By using virtual runtime as a measure of accumulated CPU usage adjusted for priority, CFS ensures proportional allocation that favors interactive tasks in desktop scenarios while allowing tunable parameters for server or batch environments.[1]

CFS targeted general-purpose computing across desktops, servers, and embedded systems, with optimizations for low-latency desktop use as the default, up through Linux kernel version 6.5.[1] This design improved upon its predecessors by better handling diverse workloads, such as combining short interactive bursts with long-running computations, without the array-switching artifacts that degraded performance in the O(1) scheduler.[2]

Historical Development
Pre-CFS Schedulers
The earliest Linux kernel schedulers, used from version 0.01 in 1991 through the 2.4 series and into the 2.5 development branch, employed an O(n) algorithm that performed a linear search over all runnable tasks to select the next one to execute.[7] This approach relied on a goodness() function to evaluate task priority and interactivity, iterating through the entire run queue during each scheduling decision, which resulted in O(n) time complexity where n is the number of tasks.[7] As systems scaled to hundreds or thousands of tasks, common in server environments or with threaded applications such as Java virtual machines, the linear search imposed significant overhead, degrading responsiveness and interactivity by increasing scheduling latency proportionally with load.[7] Additionally, the O(n) scheduler assigned large fixed timeslices (often around 210 ms on average), which could delay low-priority tasks for extended periods, such as over 20 seconds in scenarios with 100 competing tasks.[7]
To address these scalability issues, Ingo Molnar announced the O(1) scheduler in January 2002 as a patch for the 2.5 development kernel, with its full integration occurring in the stable Linux 2.6.0 release in December 2003.[8] The O(1) design achieved constant-time task selection by organizing tasks into two per-CPU priority arrays—one active and one expired—allowing the scheduler to pick the highest-priority runnable task directly from the highest non-empty priority bitmap without iterating over all tasks.[9] It retained fixed timeslices scaled by static priority (e.g., (140 - static priority) × 20 ms for priorities below 120), and incorporated interactivity heuristics, such as dynamic priority boosts based on average sleep time, to favor responsive tasks like those in graphical interfaces.[9] This scheduler remained the default from Linux 2.6.0 through 2.6.22.[9]
Despite its efficiency gains, the O(1) scheduler exhibited several limitations that compromised fairness and performance in diverse workloads. Its heavy bias toward interactive tasks via heuristics often led to unfairness in mixed environments, where batch or CPU-bound processes received disproportionately less CPU time, sometimes starving them under load.[9] Interactive tasks themselves suffered high latency during heavy contention, as the fixed timeslice model and priority adjustments failed to adapt smoothly to varying demands.[9] Furthermore, "scheduler artifacts" such as periodic switches between the active and expired arrays introduced unnecessary jitter, disrupting timing-sensitive applications, while the growing complexity of ad-hoc heuristics made the code increasingly difficult to maintain.[9] These issues, building on the foundational scalability problems of the O(n) era, highlighted the need for a more principled approach to CPU allocation.[10]
Introduction and Key Contributors
The Completely Fair Scheduler (CFS) was developed primarily by Ingo Molnár in 2007 as a new process scheduling algorithm for the Linux kernel, drawing inspiration from earlier fair-scheduling concepts while aiming to simplify and enhance performance.[1] Molnár, a prominent kernel developer previously known for the O(1) scheduler, led the implementation with significant contributions from Mike Galbraith, who focused on interactivity tuning to improve responsiveness for user-facing applications.[11] Other collaborators included Peter Zijlstra for performance optimizations and Roman Zippel for integrating ideas from the "Really Fair Scheduler" (RFS).[12] CFS was merged into Linux kernel version 2.6.23 in October 2007, becoming the default scheduler for the SCHED_OTHER class and replacing the prior O(1) scheduler.[1] This integration marked a shift toward a more heuristic-free approach, with the core design encapsulated in a single rule: select the task with the lowest virtual runtime to ensure proportional CPU allocation.[10]

The primary motivations for CFS were to enhance desktop responsiveness and achieve strict fairness in CPU time distribution without relying on the complex, ad-hoc heuristics that plagued earlier schedulers.[2] By prioritizing low latency for interactive tasks, such as those in graphical user interfaces, while maintaining overall throughput, CFS addressed complaints about jittery performance under mixed workloads.[10] Upon release, CFS received positive reception for its ability to reduce latency in interactive scenarios, leading to rapid adoption across major Linux distributions. By 2008, it had become the standard scheduler in releases such as Ubuntu 8.04 (using kernel 2.6.24) and Fedora 9 (using kernel 2.6.25), solidifying its role in improving everyday desktop and server usability.[13][14]

Subsequent Enhancements and Transition to EEVDF
Following the initial introduction of the Completely Fair Scheduler (CFS) in Linux kernel 2.6.23, subsequent releases incorporated several key enhancements to improve its scalability, fairness, and responsiveness across diverse workloads. One significant addition was the integration of CFS with control groups (cgroups) in kernel 2.6.24, enabling group-based scheduling where CPU shares could be allocated proportionally to hierarchical process groups rather than individual tasks. This feature, requiring the CONFIG_CGROUPS and CONFIG_FAIR_GROUP_SCHED kernel options, allowed for better resource isolation in multi-tenant environments such as servers and virtualized systems.[15] Further refinements to symmetric multiprocessing (SMP) load balancing arrived in kernel 2.6.32, where the scheduler's runqueue management was optimized to reduce overhead in multi-core systems by improving task migration decisions and eliminating unnecessary runqueue traversals during balancing operations.[1] These changes enhanced throughput in high-core-count setups without compromising fairness.

Latency-focused tweaks in the late 2.6.3x releases and the subsequent 3.x series bolstered support for interactive and multimedia applications. Kernel 2.6.36 reduced the minimum granularity and target latency parameters, allowing shorter time slices under low load and decreasing worst-case delays from around 20 ms to under 6 ms in typical desktop use cases; this adjustment prioritized responsiveness for latency-sensitive tasks while maintaining overall fairness.[16] Later, kernel 2.6.38 added the autogrouping feature, which automatically groups tasks by session so that CPU-hogging applications cannot starve interactive shells or desktop elements, improving perceived performance on workstations. These modifications collectively addressed criticisms of CFS's occasional high-latency bursts under mixed workloads.

Performance optimizations continued into the 4.x and 5.x series, with notable advancements in bandwidth control and handling of diverse workloads. CFS bandwidth control, merged in kernel 3.2, extended group scheduling by introducing the cpu.cfs_quota_us and cpu.cfs_period_us parameters in cgroups, enforcing hard CPU limits on groups to prevent overconsumption while preserving fairness within quotas. This was particularly useful for containerized environments, where it ensured predictable resource allocation without excessive throttling.[3] In the 4.x kernels onward, refinements to the scheduler addressed performance bugs and improved overall efficiency. Additional tweaks in 5.x, such as improved load tracking for heterogeneous cores, minimized unnecessary migrations and enhanced energy efficiency on modern hardware.

The evolution of CFS culminated in its replacement by the Earliest Eligible Virtual Deadline First (EEVDF) scheduler in Linux kernel 6.6, released in October 2023. Proposed and implemented by kernel developer Peter Zijlstra, EEVDF refines CFS's fairness model by incorporating explicit virtual deadlines, selecting the task with the earliest eligible deadline to reduce latency variance while inheriting CFS's virtual runtime tracking for proportional shares.
This shift addressed longstanding heuristics in CFS that could lead to suboptimal latency in bursty workloads, achieving up to 50% lower tail latencies in microbenchmarks without sacrificing throughput.[17][4] As of November 2025, EEVDF serves as the default scheduler for the fair scheduling class in new kernels.[18]

Core Concepts and Algorithm
Virtual Runtime Mechanism
The virtual runtime (vruntime) in the Completely Fair Scheduler (CFS) represents a normalized measure of a task's CPU usage, adjusted according to its priority weight to ensure proportional allocation of processor time.[1] This metric allows CFS to simulate an ideal multitasking environment in which CPU resources are shared fairly among runnable tasks, regardless of their actual execution history. The vruntime is updated incrementally based on the actual CPU time a task consumes, using the formula

\Delta \text{vruntime} = \text{actual\_runtime} \times \frac{\text{NICE\_0\_LOAD}}{\text{task\_weight}}

where \text{NICE\_0\_LOAD} is a fixed constant of 1024 corresponding to the default weight for a nice level of 0, and \text{task\_weight} is a value derived from the task's nice level (ranging from -20 to 19), with higher weights assigned to higher-priority tasks to slow their vruntime accumulation.[1] For a task at nice level 0, this simplifies to \Delta \text{vruntime} = \text{actual\_runtime}, providing a baseline for equal-priority sharing.[10]

The primary purpose of vruntime is to enable CFS to select the task with the lowest value for scheduling, thereby approximating proportional fairness: higher-priority tasks (with larger weights) receive more CPU time before their vruntime catches up to others. Vruntime increases monotonically for active tasks, and the scheduler tracks a per-CPU minimum vruntime (min_vruntime) to maintain global progress and prevent any task from falling indefinitely behind.[1] Over time, equal-priority tasks converge toward identical vruntime values as they each accrue equivalent normalized usage, enforcing the scheduler's fairness goal.[10]

Sleeping or blocked tasks do not accumulate vruntime during their inactive periods, causing their values to lag behind those of running tasks and granting them higher scheduling eligibility upon wakeup to compensate for missed execution opportunities.[1] This "sleeper fairness" ensures that I/O-bound or intermittently active tasks are not systematically disadvantaged compared to CPU-bound ones.[1] To mitigate scheduling overhead from short-running tasks and variability in load, CFS enforces a granularity parameter, typically 4 to 6 milliseconds on desktop systems, which sets the minimum runtime slice before preemption is considered, balancing responsiveness with efficiency.[1] Vruntime also serves as the ordering key for the runqueue data structures described in the next section, allowing the scheduler to locate the next task quickly.
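The update rule above can be illustrated with a minimal userspace sketch (not kernel code). The NICE_0_LOAD constant and the nice -10 weight of 9548 follow the values quoted in this article; the structure and function names are illustrative assumptions, not the kernel's.

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal userspace sketch of the vruntime update rule described above.
 * NICE_0_LOAD and the nice -10 weight (9548) follow the values quoted in
 * this article; the struct and function names are illustrative only. */
#define NICE_0_LOAD 1024ULL

struct task {
    uint64_t vruntime;    /* normalized CPU usage, in nanoseconds */
    unsigned int weight;  /* scheduling weight derived from the nice level */
};

/* Charge delta_exec nanoseconds of real CPU time to a task: heavier
 * (higher-priority) tasks accumulate vruntime more slowly. */
static void update_vruntime(struct task *t, uint64_t delta_exec)
{
    t->vruntime += delta_exec * NICE_0_LOAD / t->weight;
}

int main(void)
{
    struct task nice_0   = { 0, 1024 };  /* nice 0, the baseline weight    */
    struct task nice_m10 = { 0, 9548 };  /* nice -10, per the weight table */

    /* Both tasks run for 10 ms of wall-clock CPU time. */
    update_vruntime(&nice_0, 10000000);
    update_vruntime(&nice_m10, 10000000);

    /* The nice -10 task advances roughly 9.3x more slowly, so it will be
     * picked again long before the nice 0 task catches up. */
    printf("nice   0: vruntime = %llu ns\n", (unsigned long long)nice_0.vruntime);
    printf("nice -10: vruntime = %llu ns\n", (unsigned long long)nice_m10.vruntime);
    return 0;
}
```

Running both tasks for the same wall-clock time leaves the higher-priority task with a far smaller vruntime, which is why CFS keeps selecting it more often.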
Data Structures and Task Organization

The Completely Fair Scheduler (CFS) employs a red-black tree as its primary data structure to organize runnable tasks in a time-ordered manner, ensuring efficient management based on virtual runtime (vruntime). This self-balancing binary search tree sorts tasks by their vruntime values, with the leftmost node representing the task with the smallest vruntime, which is selected for execution.[1] Operations such as insertion and deletion on the tree achieve O(log N) time complexity, where N is the number of tasks, providing scalability for systems with varying numbers of concurrent processes.[10]

Per-CPU runqueues in CFS are augmented with a CFS-specific runqueue structure known as cfs_rq, which encapsulates the red-black tree root along with additional metadata for task management. The cfs_rq includes fields such as min_vruntime, which tracks the smallest vruntime observed on the runqueue to maintain a monotonic reference for new task placements, and a load sum representing the total weighted load of the tasks within it.[1] This structure integrates with the broader scheduler runqueue (rq) via rq->cfs, allowing CFS to handle fair scheduling while coexisting with other scheduler classes such as real-time tasks.[10]

Each task in CFS is represented by a sched_entity (se) structure, which stores key attributes including the task's vruntime, weight, and red-black tree node for integration into the cfs_rq tree. The se is embedded within the task_struct and forms the basis of the hierarchy of scheduling entities, enabling support for grouped tasks without altering the core organization.[1] Vruntime serves as the primary sorting key within the se, dynamically updated to reflect accrued execution time and ensuring tasks are positioned accordingly in the tree.[10]

This rbtree-based approach offers significant advantages over the runqueues used in prior Linux schedulers, such as the O(1) scheduler, by eliminating fixed priority arrays that could introduce scheduling artifacts like uneven task switching.[2] Unlike the earlier linear run-queue scan, which required O(N) time for selection, the red-black tree enables logarithmic-time access and maintenance, reducing overhead in multiprocessor environments and improving overall system responsiveness without relying on heuristic adjustments.[1]
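The organization described above can be mocked up in a compilable sketch. The field names loosely mirror the kernel's sched_entity and cfs_rq, but the rb_node/rb_root placeholders are not a working red-black tree, and pick_leftmost() is a hypothetical stand-in for the kernel's cached-leftmost lookup.

```c
#include <stddef.h>
#include <stdint.h>

/* Compilable mock-up of how CFS organizes tasks. The field names loosely
 * mirror the kernel's sched_entity and cfs_rq, but the rb_node/rb_root
 * placeholders are not a working red-black tree, and pick_leftmost() is a
 * hypothetical stand-in for the kernel's cached-leftmost lookup. */

struct rb_node { struct rb_node *left, *right, *parent; };
struct rb_root { struct rb_node *node; };

/* Per-task scheduling state; embedded in task_struct in the real kernel. */
struct sched_entity {
    uint64_t vruntime;          /* sort key within the cfs_rq tree */
    unsigned long load_weight;  /* weight derived from the nice level */
    struct rb_node run_node;    /* links the entity into the rbtree */
    int on_rq;                  /* whether the entity is currently queued */
};

/* Per-CPU fair runqueue, reached through rq->cfs in the real kernel. */
struct cfs_rq {
    struct rb_root tasks_timeline;  /* entities ordered by vruntime */
    struct rb_node *rb_leftmost;    /* cached smallest-vruntime entity */
    uint64_t min_vruntime;          /* monotonic floor for new placements */
    unsigned long load_sum;         /* total weight of queued entities */
    unsigned int nr_running;
};

/* Selection reduces to following the cached leftmost pointer: the entity
 * with the smallest vruntime is always the next candidate to run. */
static struct sched_entity *pick_leftmost(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = cfs_rq->rb_leftmost;
    if (!left)
        return NULL;
    /* container_of-style recovery of the enclosing sched_entity */
    return (struct sched_entity *)((char *)left -
                                   offsetof(struct sched_entity, run_node));
}

int main(void)
{
    struct sched_entity se = { .vruntime = 0, .load_weight = 1024, .on_rq = 1 };
    struct cfs_rq rq = {
        .tasks_timeline = { &se.run_node },
        .rb_leftmost    = &se.run_node,
        .min_vruntime   = 0,
        .load_sum       = 1024,
        .nr_running     = 1,
    };
    return pick_leftmost(&rq) == &se ? 0 : 1;
}
```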
Scheduling Decisions and Time Allocation

The Completely Fair Scheduler (CFS) selects the next task to run by identifying the one with the lowest virtual runtime in its per-CPU red-black tree, ensuring that tasks which have accumulated the least normalized CPU usage are prioritized for execution. This selection process, known as picking the "leftmost" node in the tree, occurs during scheduling events such as timer interrupts or task wakeups, promoting fairness by always advancing the task that is most "behind" in its entitled share of CPU time. Once selected, the task's virtual runtime is updated incrementally as it consumes CPU cycles, reflecting its actual execution time adjusted for fairness.[1]

Time allocation in CFS avoids fixed timeslices; instead, the runtime granted to each task is computed dynamically from its relative weight within the total weight of all runnable tasks on the CPU, divided proportionally from a target latency period. The allocated slice approximates (target_latency × task_weight) / total_weight, so higher-weighted tasks (corresponding to higher priorities) receive larger shares while overall fairness is maintained within the latency window. The target latency is configurable via the kernel parameter sysctl_sched_latency_ns, which defaults to approximately 20 ms on typical multi-core desktop systems (scaling with the number of CPUs as 6 ms × (1 + log2(number of CPUs))), and serves to throttle the scheduling frequency to balance responsiveness and throughput. This approach ensures that all tasks progress toward equal virtual runtime over time, with the granularity of allocation tuned by sysctl_sched_min_granularity_ns to prevent excessive context switches.[1]

Preemption in CFS is triggered when a newly runnable task has a virtual runtime sufficiently lower than that of the currently running task, specifically if the difference exceeds the execution delta since the last check, prompting an immediate context switch to restore fairness. This check occurs at potential preemption points, such as timer ticks or when tasks become runnable, and is modulated by the granularity parameters to avoid over-scheduling on lightly loaded systems. Additionally, when a CPU enters an idle state, CFS performs an "idle pull" to balance load by migrating tasks from overloaded CPUs, ensuring underutilized processors do not remain idle while others are burdened.[1]

Wakeup handling in CFS positions newly awakened tasks in the red-black tree such that their virtual runtime accounts for the time spent sleeping, effectively adjusting it to a value that prevents long-sleeping tasks from being starved upon resumption. This placement favors recently woken tasks for potential preemption via the wakeup_preempt() function, which evaluates whether the woken task's virtual runtime warrants interrupting the current runner, thereby preserving low latency for interactive workloads without unfairly penalizing compute-bound ones.[1]
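A small sketch of the proportional slice computation described above follows. The constants echo the default-style values quoted in the text, the names are illustrative, and the real scheduler stretches the scheduling period when many tasks are runnable rather than simply clamping each slice.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the proportional time-slice computation described above: each
 * runnable task receives a share of the target latency in proportion to its
 * weight. The constants echo the defaults quoted in the text; the names are
 * illustrative, and the real scheduler stretches the period (rather than
 * simply clamping per-task) when many tasks are runnable. */
#define TARGET_LATENCY_NS  20000000ULL  /* sysctl_sched_latency_ns-style target */
#define MIN_GRANULARITY_NS  4000000ULL  /* floor that limits context switching */

static uint64_t slice_ns(unsigned long task_weight, unsigned long total_weight)
{
    uint64_t slice = TARGET_LATENCY_NS * task_weight / total_weight;
    return slice < MIN_GRANULARITY_NS ? MIN_GRANULARITY_NS : slice;
}

int main(void)
{
    /* Three runnable tasks: two at nice 0 (weight 1024) and one at
     * nice -10 (weight 9548); total weight = 11596. */
    unsigned long total = 1024 + 1024 + 9548;

    printf("nice   0 slice: %llu ns\n",
           (unsigned long long)slice_ns(1024, total));  /* ~1.77 ms, clamped to 4 ms */
    printf("nice -10 slice: %llu ns\n",
           (unsigned long long)slice_ns(9548, total));  /* ~16.5 ms */
    return 0;
}
```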
Implementation Details
Kernel Integration and Code Structure
The Completely Fair Scheduler (CFS) is integrated into the Linux kernel as one of several scheduling classes within a hierarchical framework, allowing multiple scheduler modules to coexist and handle different task types. CFS specifically manages tasks under the SCHED_NORMAL policy (also known as SCHED_OTHER), while other classes like real-time (SCHED_FIFO and SCHED_RR) and deadline-based (SCHED_DEADLINE) handle higher-priority or specialized workloads. This integration occurs through the sched_class structure, a linked list of scheduler modules ordered by priority, in which CFS occupies the fair-class position. Key hooks such as pick_next_task_fair and enqueue_task_fair are registered in this structure to interface with the core scheduler; for instance, pick_next_task iterates through the classes starting from the highest priority and delegates to CFS's implementation when appropriate, selecting the task with the minimal virtual runtime (vruntime) for fairness.[1][19][10]
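The hook-table pattern can be sketched as follows. This is a heavily simplified mock-up: the real struct sched_class in kernel/sched/sched.h has many more operations with different signatures, and the exact class-ordering mechanism varies across kernel versions, so the fields and dispatch loop here are illustrative only.

```c
#include <stddef.h>

/* Simplified mock-up of the hook-table pattern described above. The real
 * struct sched_class (kernel/sched/sched.h) has many more operations with
 * different signatures; this sketch only shows how the core scheduler
 * dispatches to per-class callbacks such as the fair (CFS) implementations. */

struct rq;           /* per-CPU runqueue (opaque here) */
struct task_struct;  /* a runnable task (opaque here) */

struct sched_class {
    const struct sched_class *next;  /* classes ordered by priority */
    void (*enqueue_task)(struct rq *rq, struct task_struct *p);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p);
    struct task_struct *(*pick_next_task)(struct rq *rq);
};

/* Fair-class callbacks; in the kernel these live in kernel/sched/fair.c. */
static void enqueue_task_fair(struct rq *rq, struct task_struct *p) { (void)rq; (void)p; }
static void dequeue_task_fair(struct rq *rq, struct task_struct *p) { (void)rq; (void)p; }
static struct task_struct *pick_next_task_fair(struct rq *rq) { (void)rq; return NULL; }

static const struct sched_class fair_sched_class = {
    .next           = NULL,  /* the idle class would follow in the real list */
    .enqueue_task   = enqueue_task_fair,
    .dequeue_task   = dequeue_task_fair,
    .pick_next_task = pick_next_task_fair,
};

/* Core-scheduler-style dispatch: walk the classes from highest priority
 * downward and return the first task any class offers. */
static struct task_struct *pick_next_task(struct rq *rq,
                                          const struct sched_class *highest)
{
    const struct sched_class *class;

    for (class = highest; class; class = class->next) {
        struct task_struct *p = class->pick_next_task(rq);
        if (p)
            return p;
    }
    return NULL;
}

int main(void)
{
    /* With only the fair class registered and an empty runqueue,
     * no task is returned. */
    return pick_next_task(NULL, &fair_sched_class) ? 1 : 0;
}
```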
The core implementation of CFS resides primarily in the kernel/sched/fair.c file, which contains the logic for fair task selection, red-black tree management for runnable tasks, and vruntime updates. This file defines the fair scheduling class and its operations, ensuring efficient O(log N) insertions and extractions for task enqueueing and dequeuing. CFS integrates seamlessly with the broader scheduler framework in kernel/sched/core.c, which handles generic mechanisms like timer ticks, context switches, and the main schedule() function that invokes class-specific hooks during preemption or voluntary yielding. For example, during a context switch, core.c calls into fair.c to update runqueues and select the next task, maintaining compatibility with the kernel's tick-based or tickless modes.[1][20][10]
During the kernel boot process, CFS is initialized as the default scheduler for SCHED_NORMAL tasks through the sched_init() function in kernel/sched/core.c, which sets up per-CPU runqueues, allocates structures for the root task group, and configures the initial task (init) with fair scheduling parameters like load weight and cfs_rq (CFS runqueue). This initialization occurs early in the boot sequence after CPU setup, ensuring that regular processes start under CFS without additional configuration. The framework is extensible; since Linux kernel 6.12, the sched_ext class allows loading user-space BPF-based schedulers as plugins, which can override or complement CFS for SCHED_EXT tasks while falling back to CFS on errors or unloading.[21][22]
CFS supports both processes and threads uniformly through the per-task sched_entity structure, embedded within each task_struct to encapsulate scheduling state such as vruntime, load weight, and red-black tree nodes. This design treats threads as lightweight tasks sharing the process's address space but with independent scheduling entities, enabling fine-grained fairness across multithreaded applications without distinguishing between process and thread contexts at the scheduler level.[10][23]
Priority Handling and Nice Levels
In the Completely Fair Scheduler (CFS), task priorities for SCHED_NORMAL (also known as SCHED_OTHER) processes are managed through nice values, which range from -20 (highest priority) to +19 (lowest priority), providing 40 distinct levels to fine-tune CPU allocation.[1] These values influence a task's share of CPU time by adjusting its scheduling weight, ensuring that higher-priority tasks (lower nice values) receive proportionally more execution time while maintaining overall fairness.[1] The weight for each nice level is taken from a precomputed array in the kernel, designed so that each one-step increase in nice value reduces the task's CPU share by roughly 10%. Specifically, the base weight for nice 0 is defined as NICE_0_LOAD = 1024, and weights scale by a factor of approximately 1/1.25 per positive nice increment (or ×1.25 per negative increment), with the exact values tabulated to avoid floating-point computations. This results in the following representative weights; a short sketch after the table shows how these weights translate into relative CPU shares:

| Nice Value | Weight |
|---|---|
| -10 | 9548 |
| 0 | 1024 |
| 10 | 110 |
| 19 | 15 |
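How these weights become relative CPU shares can be seen in a small sketch. The nice 1 weight below is approximated from the ~1.25 per-step factor described above rather than taken from the kernel's exact table; the other values use the table entries shown.

```c
#include <stdio.h>

/* Illustration of how nice-level weights become relative CPU shares when
 * tasks compete for one CPU. The weights for nice -10, 0, and 10 come from
 * the table above; the nice 1 value is approximated from the ~1.25 per-step
 * factor described in the text (the kernel uses an exact table). */

static double share_pct(unsigned int a, unsigned int b)
{
    return 100.0 * a / (a + b);
}

int main(void)
{
    /* Two competing tasks one nice step apart: roughly a ten-point swing,
     * which is the "10% per nice level" rule of thumb. */
    unsigned int w_nice0 = 1024, w_nice1 = 820 /* ~1024 / 1.25 */;
    printf("nice 0 vs nice 1:    %.1f%% vs %.1f%%\n",
           share_pct(w_nice0, w_nice1), share_pct(w_nice1, w_nice0));

    /* Twenty nice levels apart: the higher-priority task dominates. */
    unsigned int w_m10 = 9548, w_p10 = 110;
    printf("nice -10 vs nice 10: %.1f%% vs %.1f%%\n",
           share_pct(w_m10, w_p10), share_pct(w_p10, w_m10));
    return 0;
}
```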
Load Balancing Across CPUs
In symmetric multiprocessing (SMP) systems, the Completely Fair Scheduler (CFS) employs load balancing to distribute tasks across multiple CPUs, ensuring equitable resource utilization while minimizing overhead. This mechanism prevents any single CPU from becoming overburdened, which could lead to inefficiencies in task execution. Load balancing is invoked periodically and opportunistically, integrating seamlessly with the scheduler's core fairness principles.[1]

Load balancing is triggered periodically by scheduler ticks, during which each CPU evaluates potential imbalances using the load_avg metric, a weighted average of active load that accounts for task weights and recent utilization history. This allows the scheduler to assess system-wide load without excessive computational cost, triggering adjustments only when necessary. If an imbalance is detected, the scheduler initiates task migrations to restore equilibrium across runqueues.[1]
The pull and push logic forms the core of CFS's balancing strategy. Idle CPUs actively pull tasks from busier counterparts to utilize available capacity, selecting candidates via iterators over the red-black tree (RB-tree) structures that organize tasks by virtual runtime (vruntime). Conversely, overloaded CPUs push excess tasks to underutilized ones if their load exceeds 1.17 times (default imbalance_pct=117) the average load in the scheduling domain, again leveraging RB-tree iterators to identify the most suitable tasks—typically those with the lowest vruntime to prioritize fairness. This bidirectional approach reduces latency in responding to load variations, with pulls favored for idle scenarios to minimize disruptions.[1][24][25]
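The threshold test can be sketched as follows; the structure and function names are hypothetical, and the real balancing code compares group statistics within a scheduling domain rather than two bare runqueues.

```c
#include <stdbool.h>
#include <stdio.h>

/* Minimal sketch of the imbalance threshold described above: a remote
 * runqueue is treated as overloaded only when its load exceeds the local
 * load by the domain's imbalance percentage (default 117, i.e. 17%).
 * Names and structure are illustrative, not the kernel's balancing code. */

struct cpu_load {
    unsigned long load_avg;  /* weighted, history-decayed runqueue load */
};

static bool worth_pulling(const struct cpu_load *busiest,
                          const struct cpu_load *local,
                          unsigned int imbalance_pct)
{
    /* Scale the local load by imbalance_pct/100 before comparing, so small
     * transient differences do not trigger costly task migrations. */
    return busiest->load_avg * 100 > local->load_avg * imbalance_pct;
}

int main(void)
{
    struct cpu_load busy  = { .load_avg = 2400 };
    struct cpu_load local = { .load_avg = 2000 };

    /* 2400 * 100 > 2000 * 117, so this CPU would try to pull a task. */
    printf("pull from busier CPU: %s\n",
           worth_pulling(&busy, &local, 117) ? "yes" : "no");
    return 0;
}
```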
Load balancing operates within a hierarchical structure of scheduling domains that reflect hardware topology, starting from the lowest level (e.g., multi-core domains sharing L2 cache) and progressing to higher levels such as sockets (DIE domains) and NUMA nodes. At each domain level, balancing is performed by a designated CPU—often the first idle one—to coordinate migrations between groups of CPUs, ensuring locality and reducing inter-domain traffic. This hierarchy optimizes for performance by limiting migrations to necessary scopes, such as within a socket before escalating to node-wide adjustments.[26]
Integration with power-saving features extends CFS's balancing to heterogeneous architectures, such as ARM's big.LITTLE systems, through Energy Aware Scheduling (EAS). EAS augments the standard pull/push decisions with energy models that consider CPU capacity and power states, favoring migrations to efficient "little" cores when possible without compromising fairness. This enhances overall system efficiency in battery-constrained environments by aligning load distribution with heterogeneous performance domains.[27]
To preserve fairness during migrations, CFS adjusts the vruntime of relocated tasks relative to the minimum vruntime on the target runqueue, ensuring proportionality to their weight and preventing any task from gaining or losing undue advantage. This normalization maintains the scheduler's goal of approximating ideal fair sharing across CPUs, where each task receives CPU time proportional to its priority regardless of migration history.[28][1]
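A minimal sketch of this normalization follows, assuming the common subtract-on-dequeue / add-on-enqueue scheme described above; the names and exact bookkeeping are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the vruntime normalization described above: a migrating task's
 * vruntime is made relative to the source runqueue's min_vruntime and then
 * rebased onto the target's, so differing per-CPU clocks neither favor nor
 * penalize it. Field and function names are illustrative assumptions. */

struct entity   { uint64_t vruntime; };
struct runqueue { uint64_t min_vruntime; };

static void migrate(struct entity *se, struct runqueue *src, struct runqueue *dst)
{
    se->vruntime -= src->min_vruntime;  /* make vruntime relative on dequeue */
    se->vruntime += dst->min_vruntime;  /* rebase onto the target on enqueue */
}

int main(void)
{
    struct runqueue cpu0 = { .min_vruntime = 500000000 };  /* 500 ms of progress */
    struct runqueue cpu1 = { .min_vruntime = 120000000 };  /* 120 ms of progress */
    struct entity   task = { .vruntime = 502000000 };      /* 2 ms past cpu0's minimum */

    migrate(&task, &cpu0, &cpu1);

    /* The task keeps its 2 ms offset relative to the new runqueue's minimum. */
    printf("vruntime on cpu1: %llu ns\n", (unsigned long long)task.vruntime);
    return 0;
}
```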
Advanced Features and Extensions
Support for Control Groups
The Completely Fair Scheduler (CFS) integrates with control groups (cgroups) through the cpu controller to enforce fairness and resource limits at the group level, enabling administrators to allocate CPU bandwidth proportionally or absolutely to sets of tasks.[29] This support requires the CONFIG_FAIR_GROUP_SCHED kernel configuration, which extends CFS to treat cgroups as scheduling entities alongside individual tasks.[29]

Group scheduling was introduced in Linux kernel 2.6.24, allowing tasks to be organized into cgroups for fair CPU time division based on relative weights.[15] The cpu.shares parameter, written to a cgroup's cpu.shares file (default 1024), specifies the proportional share of CPU resources; for instance, a group with 2048 shares receives approximately twice the allocation of a group with 1024 shares under load.[29] Tasks are assigned to groups by writing their process IDs to the cgroup's tasks file, and vruntime is tracked per group to normalize fairness within and across hierarchies.[29]

For absolute bandwidth limiting, CFS bandwidth control was merged in Linux kernel 3.2, building on group scheduling to cap CPU usage via the cpu.cfs_quota_us and cpu.cfs_period_us parameters.[30] The cpu.cfs_quota_us value (in microseconds, default -1 for unlimited, range 1000 to 1000000) sets the maximum runtime a group can consume, while cpu.cfs_period_us (default 100000, range 1000 to 1000000) defines the replenishment interval; together, they enforce a fraction of CPU, such as quota=25000 and period=100000 for 25% utilization.[3] When the quota is exhausted, tasks in the group are throttled until the next period, with runtime distributed in slices to per-CPU runqueues for efficient enforcement.[3]

CFS supports hierarchical cgroups, where child groups inherit and aggregate shares from parents, ensuring bandwidth constraints propagate upward; a child is throttled if its parent's quota is depleted.[3] This mechanism prevents overconsumption in nested structures without compromising intra-group task fairness.[29] These features are essential for container orchestration (e.g., Docker via libcontainer) and virtualization platforms (e.g., KVM), where they isolate workloads to specific CPU fractions, avoiding interference while upholding CFS's per-task equity.[31]
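A minimal userspace sketch of setting these knobs is shown below, assuming a cgroup v1 cpu controller mounted at /sys/fs/cgroup/cpu and an already-created group directory named "demo" (both assumptions); error handling is kept to a minimum.

```c
#include <stdio.h>

/* Minimal userspace sketch of setting the cgroup knobs described above,
 * assuming a cgroup v1 cpu controller mounted at /sys/fs/cgroup/cpu and an
 * already-created group directory named "demo" (both assumptions). Error
 * handling is kept to a minimum and writes require sufficient privileges. */

static int write_value(const char *path, long value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%ld\n", value);
    return fclose(f);
}

int main(void)
{
    const char *base = "/sys/fs/cgroup/cpu/demo";
    char path[256];

    /* Relative weight: twice the default 1024 shares. */
    snprintf(path, sizeof(path), "%s/cpu.shares", base);
    write_value(path, 2048);

    /* Hard cap: 25 ms of runtime every 100 ms period, i.e. 25% of one CPU. */
    snprintf(path, sizeof(path), "%s/cpu.cfs_period_us", base);
    write_value(path, 100000);
    snprintf(path, sizeof(path), "%s/cpu.cfs_quota_us", base);
    write_value(path, 25000);

    return 0;
}
```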
Tuning Parameters and Configuration

The Completely Fair Scheduler (CFS) in the Linux kernel provides several sysctl tunables, accessible under /proc/sys/kernel/, to optimize scheduling behavior for different workloads, such as interactive desktops or high-throughput servers. These parameters influence time slice allocation and preemption thresholds, allowing administrators to balance latency and fairness without recompiling the kernel. Changes can be applied temporarily using sysctl commands or persistently via /etc/sysctl.conf or kernel boot parameters. The key tuning parameters are listed in the table below, and a short usage sketch follows it:

| Parameter | Description | Default Value (ns) | Location |
|---|---|---|---|
| sched_latency_ns | Defines the target latency for each scheduler period, during which all runnable tasks should ideally execute once; higher values favor throughput by allowing longer slices for CPU-bound tasks. | 20,000,000 | /proc/sys/kernel/sched_latency_ns |
| sched_min_granularity_ns | Sets the minimum time slice for any task, ensuring short-running tasks are not overly fragmented; used when the number of tasks exceeds the latency target. | 4,000,000 | /proc/sys/kernel/sched_min_granularity_ns |
| sched_wakeup_granularity_ns | Controls the threshold for preempting a running task upon wakeup of a higher-priority one, preventing excessive context switches for short wakeups. | 2,000,000 | /proc/sys/kernel/sched_wakeup_granularity_ns |
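A short sketch of reading and adjusting one of these tunables from a program follows. The /proc/sys/kernel path matches the table, but on more recent kernels the CFS debug knobs were relocated (for example under /sys/kernel/debug/sched/), so the exact path should be verified on the target system; writes require root.

```c
#include <stdio.h>

/* Sketch of inspecting and adjusting one of the tunables listed above. The
 * /proc/sys/kernel path matches the table; on more recent kernels these CFS
 * knobs were relocated (e.g. under /sys/kernel/debug/sched/), so the exact
 * location should be verified on the target system. Writing requires root,
 * and the same effect can be achieved with the sysctl command. */

int main(void)
{
    const char *path = "/proc/sys/kernel/sched_min_granularity_ns";
    long value = 0;
    FILE *f;

    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &value) == 1)
            printf("current minimum granularity: %ld ns\n", value);
        fclose(f);
    }

    /* Example adjustment: raise the floor to 10 ms to favor throughput over
     * latency (equivalent to sysctl kernel.sched_min_granularity_ns=10000000). */
    f = fopen(path, "w");
    if (f) {
        fprintf(f, "%ld\n", 10000000L);
        fclose(f);
    }
    return 0;
}
```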