
Completely Fair Scheduler

The Completely Fair Scheduler (CFS) is a process scheduler in the Linux kernel designed to allocate CPU time equitably among tasks by approximating an "ideal multi-tasking CPU" scenario, where each runnable task receives a proportional share of processor resources as if executing in parallel on separate hardware. Developed primarily by Ingo Molnar, CFS was first announced on April 13, 2007, as part of a modular scheduler core overhaul, and was merged into the mainline Linux kernel with version 2.6.23 in October 2007, supplanting the earlier O(1) scheduler as the default for general-purpose (SCHED_NORMAL) workloads. This integration marked a shift toward a fairness-focused design that prioritizes desktop interactivity and server efficiency without relying on fixed time slices or complex heuristics.

At its core, CFS employs a red-black tree (rbtree) to maintain tasks in chronological order based on their virtual runtime—a metric that accumulates actual CPU usage adjusted for task priority (nice levels)—ensuring the task with the lowest virtual runtime is selected for execution. This approach uses nanosecond-granularity accounting for precise fairness, with tunable parameters like scheduling granularity (default 4–6 ms for desktop, 10–20 ms for throughput-oriented systems) to balance responsiveness and efficiency. Unlike prior schedulers, CFS avoids runqueue arrays and periodic ticks for selection, reducing overhead and eliminating artifacts such as unfairness from array switches.

CFS operates within a class-based scheduler framework, handling the SCHED_NORMAL, SCHED_BATCH (for CPU-intensive tasks), and SCHED_IDLE policies, while real-time classes (SCHED_FIFO and SCHED_RR) are managed separately via sched_rt.c. Key extensions include group scheduling (enabled via CONFIG_FAIR_GROUP_SCHED), which allows hierarchical resource control through cgroups and cpu.shares for proportional allocation, and bandwidth control to limit maximum CPU usage per group. These features enhance scalability in multi-core systems through simplified load balancing, where idle CPUs actively pull tasks from busy ones.

Compared to the O(1) scheduler, CFS offers superior resistance to manipulation attacks (e.g., via tools like fiftyp.c), better handling of priority adjustments, and no performance regressions in benchmarks, while reducing kernel code size and improving overall system isolation. It served as the cornerstone of Linux scheduling for over a decade and a half, powering diverse environments from desktops to servers. However, CFS was superseded by the Earliest Eligible Virtual Deadline First (EEVDF) scheduler, merged as the default in Linux kernel 6.6 (released October 2023), to address certain limitations in fairness and latency; while CFS code remains for compatibility, EEVDF is the current standard for general-purpose scheduling as of 2025.

Introduction

Overview

The Completely Fair Scheduler (CFS) served as the default process scheduler for SCHED_NORMAL tasks in the Linux kernel from version 2.6.23, released in October 2007, through version 6.5. Developed primarily by Ingo Molnar, CFS was designed to handle the scheduling of typical desktop and server workloads by ensuring equitable CPU resource allocation among competing processes. At its core, CFS employs a mechanism that distributes CPU time proportionally to runnable tasks based on their nice values and corresponding weights, aiming to approximate an "ideal multi-tasking CPU" where each task receives its fair share without relying on fixed timeslices. This approach prioritizes conceptual fairness over rigid time quanta, using virtual runtime as the key metric to track and balance actual execution time against a task's entitled share, thereby minimizing scheduling overhead and promoting responsiveness. In version 6.6, released in late 2023, CFS was succeeded by the Earliest Eligible Virtual Deadline First (EEVDF) scheduler as the new default for the fair scheduling class, addressing certain latency issues in CFS while building on its foundational principles. Despite this transition, CFS's influence persists in the Linux ecosystem, as it remains available as a configurable option and continues to underpin many existing deployments and extensions.

Design Goals

The Completely Fair Scheduler (CFS) was designed primarily to enhance desktop interactivity by providing low-latency responses for interactive workloads, addressing the limitations of prior schedulers like the O(1) scheduler that struggled with mixed interactive and CPU-bound tasks. This focus aimed to model an ideal multi-tasking CPU where each task receives a fair share of processing time proportional to its priority, ensuring that no task is starved while maintaining responsiveness for user-facing applications. A key motivation was to eliminate the heuristic-based fixed timeslices and jiffy-dependent accounting of earlier schedulers, replacing them with nanosecond-granularity tracking to achieve more precise and deterministic fairness without sacrificing throughput. By using virtual runtime as a measure of accumulated CPU usage adjusted for task priority (nice level), CFS ensures proportional allocation that favors interactive tasks in desktop scenarios while allowing tunable parameters for server or batch environments. CFS targeted general-purpose computing across desktops, servers, and embedded systems, with optimizations for low-latency use as the default, up through Linux kernel version 6.5. This design improved upon predecessors by better handling diverse workloads, such as combining short interactive bursts with long-running computations, without the array-switching artifacts that degraded performance in the O(1) scheduler.

Historical Development

Pre-CFS Schedulers

The earliest Linux schedulers, used from version 0.01 in 1991 through the 2.4 series and into the 2.5 development branch, employed an O(n) algorithm that performed a linear scan over all runnable tasks to select the next one to execute. This approach relied on a goodness() function to evaluate each task's priority and suitability, iterating through the entire run queue during each scheduling decision, which resulted in O(n) complexity, where n is the number of runnable tasks. As systems scaled to hundreds or thousands of tasks—common in server environments or with threaded applications like virtual machines—the linear scan imposed significant overhead, degrading responsiveness and throughput by increasing scheduling latency proportionally with load. Additionally, the scheduler assigned large fixed timeslices (often around 210 ms on average), which could delay low-priority tasks for extended periods, such as over 20 seconds in scenarios with 100 competing tasks.

To address these scalability issues, Ingo Molnar announced the O(1) scheduler in January 2002 as a patch for the 2.5 development kernel, with its full integration occurring in the stable Linux 2.6.0 release in December 2003. The O(1) design achieved constant-time task selection by organizing tasks into two per-CPU priority arrays—one active and one expired—allowing the scheduler to pick the highest-priority runnable task directly from the highest non-empty priority bitmap without iterating over all tasks. It retained fixed timeslices scaled by static priority (e.g., (140 − static priority) × 20 ms for priorities below 120), and incorporated interactivity heuristics, such as dynamic priority boosts based on average sleep time, to favor responsive tasks like those in graphical interfaces. This scheduler remained the default from Linux 2.6.0 through 2.6.22.

Despite its efficiency gains, the O(1) scheduler exhibited several limitations that compromised fairness and performance in diverse workloads. Its heavy bias toward interactive tasks via heuristics often led to unfairness in mixed environments, where batch or CPU-bound processes received disproportionately less CPU time, sometimes starving them under load. Interactive tasks themselves suffered high latency during heavy contention, as the fixed timeslice model and priority adjustments failed to adapt smoothly to varying demands. Furthermore, "scheduler artifacts" such as periodic switches between active and expired arrays introduced unnecessary jitter, disrupting timing-sensitive applications and complicating further maintenance due to the growing complexity of ad-hoc heuristics. These issues, building on the foundational scalability problems of the O(n) era, highlighted the need for a more principled approach to CPU allocation.

Introduction and Key Contributors

The Completely Fair Scheduler (CFS) was developed primarily by Ingo Molnar in 2007 as a new process scheduling algorithm for the Linux kernel, drawing inspiration from earlier fair-scheduling concepts while aiming to simplify and enhance performance. Molnar, a prominent kernel developer previously known for the O(1) scheduler, led the implementation with significant contributions from Mike Galbraith, who focused on interactivity tuning to improve responsiveness for user-facing applications. Other collaborators included Peter Zijlstra for performance optimizations and Roman Zippel for integrating ideas from the "Really Fair Scheduler" (RFS). CFS was merged into the Linux kernel version 2.6.23 in October 2007, becoming the default scheduler for the SCHED_OTHER class and replacing the prior O(1) scheduler. This integration marked a shift toward a more heuristic-free approach, with the core design encapsulated in a single sentence: always select the task with the lowest virtual runtime to ensure proportional CPU allocation. The primary motivations for CFS were to enhance responsiveness and achieve strict fairness in CPU time distribution without relying on the complex, ad-hoc heuristics that plagued earlier schedulers. By prioritizing low latency for interactive tasks—such as those in graphical user interfaces—while maintaining overall throughput, CFS addressed complaints about jittery performance under mixed workloads. Upon release, CFS received positive reception for its ability to reduce latency in interactive scenarios, leading to rapid adoption across major distributions. By 2008, it had become the standard scheduler in releases like Ubuntu 8.04 (kernel 2.6.24) and Fedora 9 (kernel 2.6.25), solidifying its role in improving everyday desktop and server usability.

Subsequent Enhancements and Transition to EEVDF

Following the initial introduction of the Completely Fair Scheduler (CFS) in Linux kernel 2.6.23, subsequent releases incorporated several key enhancements to improve its scalability, fairness, and responsiveness across diverse workloads. One significant addition was the integration of CFS with control groups (cgroups) in kernel 2.6.24, enabling group-based scheduling where CPU shares could be allocated proportionally to hierarchical process groups rather than individual tasks. This feature, requiring the CONFIG_CGROUPS and CONFIG_FAIR_GROUP_SCHED kernel options, allowed for better resource isolation in multi-tenant environments like servers and virtualized systems. Further refinements to symmetric multiprocessing (SMP) load balancing arrived in kernel 2.6.32, where the scheduler's runqueue management was optimized to reduce overhead in multi-core systems by improving task migration decisions and eliminating unnecessary runqueue traversals during balancing operations. These changes enhanced throughput in high-core-count setups without compromising fairness.

Later 2.6.x and 3.x releases added latency-focused tweaks to improve responsiveness, particularly for interactive and multimedia applications. For instance, kernel 2.6.36 (shortly before the 3.x renumbering) reduced the minimum granularity and target latency parameters, allowing shorter time slices under low load and decreasing worst-case delays from around 20 ms to under 6 ms in typical use cases. This adjustment prioritized responsiveness for latency-sensitive tasks while maintaining overall fairness. Later, in kernel 2.6.38, the autogrouping feature was added, automatically grouping tasks by session (such as per terminal) to prevent CPU-hogging applications from starving interactive shells or desktop elements, thus improving perceived performance on workstations. These modifications collectively addressed criticisms of CFS's occasional high-latency bursts under mixed workloads.

Performance optimizations continued into the 4.x and 5.x series, with notable advancements in bandwidth control and handling of diverse workloads. CFS bandwidth control, merged in kernel 3.2, extended group scheduling by introducing the cpu.cfs_quota_us and cpu.cfs_period_us parameters in cgroups, enforcing hard CPU limits on groups to prevent overconsumption while preserving fairness within quotas. This was particularly useful for containerized environments, where it ensured predictable resource allocation without excessive throttling. In kernels 4.x onward, refinements to the scheduler addressed bugs and improved overall efficiency. Additional tweaks in 5.x, such as improved load tracking for heterogeneous cores, minimized unnecessary migrations and enhanced performance on modern hardware.

The evolution of CFS culminated in its replacement by the Earliest Eligible Virtual Deadline First (EEVDF) scheduler in Linux kernel 6.6, released in October 2023. Proposed and implemented by kernel developer Peter Zijlstra, EEVDF refines CFS's fairness model by incorporating explicit virtual deadlines, selecting the task with the earliest eligible deadline to reduce latency variance while inheriting CFS's virtual runtime tracking for proportional shares. This shift addressed longstanding heuristics in CFS that could lead to suboptimal latency in bursty workloads, achieving up to 50% lower tail latencies in microbenchmarks without sacrificing throughput. As of November 2025, EEVDF serves as the default scheduler for the fair scheduling class in new kernels.

Core Concepts and Algorithm

Virtual Runtime Mechanism

The virtual runtime (vruntime) in the Completely Fair Scheduler (CFS) represents a normalized measure of a task's CPU usage, adjusted according to its weight to ensure proportional allocation of CPU time. This metric allows CFS to simulate an ideal multitasking environment where CPU resources are shared fairly among runnable tasks, regardless of their actual execution history. The vruntime is updated incrementally based on the actual CPU time a task consumes, using the formula

\Delta \text{vruntime} = \text{actual\_runtime} \times \frac{\text{NICE\_0\_LOAD}}{\text{task\_weight}}

where \text{NICE\_0\_LOAD} is a fixed constant of 1024 corresponding to the default weight for a nice level of 0, and \text{task\_weight} is a value derived from the task's nice level (ranging from −20 to 19), with higher weights assigned to higher-priority tasks to slow their vruntime accumulation. For a task at nice level 0, this simplifies to \Delta \text{vruntime} = \text{actual\_runtime}, providing a baseline for equal-priority sharing.

The primary purpose of vruntime is to enable CFS to select the task with the lowest value for scheduling, thereby approximating proportional fairness: higher-priority tasks (with larger weights) receive more CPU time before their vruntime catches up to others. Vruntime increases monotonically for active tasks, with the scheduler tracking a per-CPU minimum vruntime (min_vruntime) to maintain global progress and prevent any task from falling indefinitely behind. Over time, equal-priority tasks converge toward identical vruntime values as they each accrue equivalent normalized usage, enforcing the scheduler's fairness goal.

Sleeping or blocked tasks do not accumulate vruntime during their inactive periods, causing their values to lag behind those of running tasks and granting them higher scheduling eligibility upon wakeup to compensate for missed execution opportunities. This "sleeper fairness" ensures that I/O-bound or intermittently active tasks are not systematically disadvantaged compared to CPU-bound ones. To mitigate scheduling overhead from short-running tasks and variability in load, CFS enforces a granularity parameter—typically 4 to 6 milliseconds on desktop systems—which sets the minimum runtime slice before preemption is considered, balancing responsiveness with efficiency. Vruntime serves as the sorting key in the task-ordering structures described in the next subsection, allowing the scheduler to locate the next task quickly.
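As a concrete illustration of the update formula above, the following minimal C sketch (not kernel code; the names are chosen for clarity) computes the vruntime increment from an execution interval and a task weight:

```c
#include <stdint.h>

#define NICE_0_LOAD 1024ULL  /* weight of a nice-0 task, as defined above */

/* Minimal sketch of the vruntime update described above (not kernel code):
 * delta_exec_ns is the actual CPU time consumed in nanoseconds, and weight
 * is the task's load weight derived from its nice level. */
static uint64_t calc_delta_vruntime(uint64_t delta_exec_ns, uint64_t weight)
{
	/* A nice-0 task accrues vruntime at wall-clock rate; a heavier
	 * (higher-priority) task accrues it proportionally more slowly. */
	return delta_exec_ns * NICE_0_LOAD / weight;
}
```

For example, 6 ms of execution adds 6 ms of vruntime to a nice-0 task (weight 1024) but only 3 ms to a task of weight 2048, so the heavier task stays eligible for the CPU roughly twice as long before others catch up.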

Data Structures and Task Organization

The Completely Fair Scheduler (CFS) employs a red-black tree as its primary data structure to organize runnable tasks in a time-ordered manner, ensuring efficient management based on virtual runtime (vruntime). This tree sorts tasks by their vruntime values, with the leftmost node representing the task with the smallest vruntime, which is selected next for execution. Operations such as insertion and deletion on the tree achieve O(log N) time complexity, where N is the number of tasks, providing scalability for systems with varying numbers of concurrent processes.

Per-CPU runqueues in CFS are augmented with a CFS-specific runqueue structure known as cfs_rq, which encapsulates the red-black tree root along with additional metadata for scheduling accounting. The cfs_rq includes fields such as min_vruntime, which tracks the smallest vruntime observed on the runqueue to maintain a monotonic reference for new task placements, and a load sum representing the total weighted load of tasks within it. This structure integrates with the broader scheduler runqueue (rq) via rq->cfs, allowing CFS to handle fair scheduling while coexisting with other scheduler classes such as those for real-time tasks.

Each task in CFS is represented by a sched_entity (se) structure, which stores key attributes including the task's vruntime, weight, and red-black tree node for integration into the cfs_rq tree. The se is embedded within the task_struct and facilitates a hierarchy of scheduling entities, enabling support for grouped tasks without altering the core organization. Vruntime serves as the primary sorting key within the se, dynamically updated to reflect accrued execution time and ensuring tasks are positioned accordingly in the tree.

This rbtree-based approach differs from the array-based runqueues of the O(1) scheduler by dispensing with fixed priority arrays, which could introduce scheduling artifacts such as uneven behavior at array switches, and from the earlier O(n) scheduler's linear scan of all runnable tasks. Although tree operations cost O(log N) rather than constant time, the logarithmic factor remains small in practice, and the single time-ordered structure removes the array-maintenance bookkeeping and heuristic adjustments of its predecessors, reducing overhead in multiprocessor environments and improving overall system responsiveness.
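The simplified C sketch below summarizes how these pieces fit together; the field names echo the kernel's, but the definitions are illustrative stand-ins rather than the actual kernel structures:

```c
#include <stdint.h>

/* Illustrative stand-ins for the structures described above (not the kernel's
 * definitions). Each runnable entity is keyed by vruntime in the per-CPU tree. */

struct rb_node_sketch {
	struct rb_node_sketch *left, *right, *parent;
	int red;                                 /* red-black coloring bit           */
};

struct sched_entity_sketch {
	uint64_t vruntime;                       /* normalized CPU usage: sort key   */
	uint64_t load_weight;                    /* weight derived from nice level   */
	struct rb_node_sketch run_node;          /* links the entity into the rbtree */
};

struct cfs_rq_sketch {
	struct rb_node_sketch *tasks_timeline;   /* root of the red-black tree       */
	struct sched_entity_sketch *leftmost;    /* cached lowest-vruntime entity    */
	uint64_t min_vruntime;                   /* monotonic floor for placements   */
	uint64_t load_sum;                       /* total weight of runnable entities*/
};
```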

Scheduling Decisions and Time Allocation

The Completely Fair Scheduler (CFS) selects the next task to run by identifying the one with the lowest virtual runtime from its per-CPU red-black tree data structure, ensuring that tasks which have accumulated the least normalized CPU usage are prioritized for execution. This selection process, known as picking the "leftmost" node in the tree, occurs during scheduling events such as timer interrupts or task wakeups, promoting fairness by always advancing the task that is most "behind" in its entitled share of CPU time. Upon selection, the task's virtual runtime is updated incrementally as it consumes CPU cycles, reflecting its actual execution time adjusted for fairness.

Time allocation in CFS avoids fixed timeslices, instead dynamically computing the timeslice for each task based on its relative weight within the total weight of all runnable tasks on the CPU, divided proportionally from a target period. The allocated time approximates (target_latency × task_weight) / total_weight, allowing higher-weighted tasks (corresponding to higher priorities) to receive larger shares while maintaining overall fairness within the latency window. The target latency is configurable via the sysctl_sched_latency_ns tunable, which defaults to approximately 20 ms on typical multi-core systems (scaled with the number of CPUs as 6 ms × (1 + log2(number of CPUs))), and serves to throttle the scheduling frequency to balance responsiveness and throughput. This approach ensures that all tasks progress toward equal virtual runtime over time, with the minimum granularity of allocation tuned by sysctl_sched_min_granularity_ns to prevent excessive context switches.

Preemption in CFS is triggered when a newly runnable task demonstrates a virtual runtime sufficiently lower than the currently running task's—specifically, if the difference exceeds the wakeup granularity threshold—prompting an immediate reschedule to restore fairness. This check occurs during potential preemption points, such as timer ticks or when tasks become runnable, and is modulated by the granularity parameters to avoid over-scheduling on lightly loaded systems. Additionally, when a CPU enters an idle state, CFS performs an "idle pull" to balance load by migrating tasks from overloaded CPUs, ensuring underutilized processors do not remain idle while others are burdened.

Wakeup handling in CFS positions newly awakened tasks in the red-black tree such that their virtual runtime accounts for the time spent sleeping, effectively adjusting it to a value that prevents long-sleeping tasks from being starved upon resumption. This placement favors recently woken tasks for potential preemption via the wakeup_preempt() function, which evaluates whether the woken task's virtual runtime warrants interrupting the current runner, thereby preserving low latency for interactive workloads without unfairly penalizing compute-bound ones.
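A compact C sketch of the allocation and preemption rules just described follows. It mirrors the formulas in the text under the simplifying assumption that granularities can be compared directly against vruntime differences, and is an illustration rather than kernel code:

```c
#include <stdint.h>

static uint64_t sched_latency_ns         = 20000000ULL; /* target latency period */
static uint64_t sched_min_granularity_ns = 4000000ULL;  /* floor on any slice    */

/* Slice for one task: its weighted share of the period. If there are too many
 * tasks to fit within the latency target, the period stretches so every task
 * still receives at least the minimum granularity. */
static uint64_t cfs_slice_ns(uint64_t task_weight, uint64_t total_weight,
                             unsigned int nr_running)
{
	uint64_t period = sched_latency_ns;

	if ((uint64_t)nr_running * sched_min_granularity_ns > period)
		period = (uint64_t)nr_running * sched_min_granularity_ns;

	return period * task_weight / total_weight;
}

/* Wakeup preemption check, simplified: the woken task preempts only when it is
 * behind the current task's vruntime by more than the wakeup granularity. */
static int wakeup_should_preempt(uint64_t curr_vruntime, uint64_t woken_vruntime,
                                 uint64_t wakeup_granularity_ns)
{
	return curr_vruntime > woken_vruntime + wakeup_granularity_ns;
}
```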

Implementation Details

Kernel Integration and Code Structure

The Completely Fair Scheduler (CFS) is integrated into the Linux kernel as one of several scheduling classes within a hierarchical framework, allowing multiple scheduler modules to coexist and handle different task types. CFS specifically manages tasks under the SCHED_NORMAL policy (also known as SCHED_OTHER), while other classes like real-time (SCHED_FIFO and SCHED_RR) and deadline-based (SCHED_DEADLINE) handle higher-priority or specialized workloads. This integration occurs through the sched_class structure, a priority-ordered list of scheduler modules in which CFS occupies the fair class position. Key hooks such as pick_next_task_fair and enqueue_task_fair are registered in this structure to interface with the core scheduler; for instance, pick_next_task iterates through classes starting from the highest priority and delegates to CFS's implementation when appropriate, selecting the task with the minimal virtual runtime (vruntime) for fairness.

The core implementation of CFS resides primarily in the kernel/sched/fair.c file, which contains the logic for fair task selection, red-black tree management for runnable tasks, and vruntime updates. This file defines the fair scheduling class and its operations, ensuring efficient O(log N) insertions and extractions for task enqueueing and dequeuing. CFS integrates seamlessly with the broader scheduler framework in kernel/sched/core.c, which handles generic mechanisms like timer ticks, context switches, and the main schedule() function that invokes class-specific hooks during preemption or voluntary yielding. For example, during a context switch, core.c calls into fair.c to update runqueues and select the next task, maintaining compatibility with the kernel's tick-based or tickless modes.

During the kernel boot process, CFS is initialized as the default scheduler for SCHED_NORMAL tasks through the sched_init() function in kernel/sched/core.c, which sets up per-CPU runqueues, allocates structures for the root task group, and configures the initial task with fair scheduling parameters like load weight and its cfs_rq (CFS runqueue). This initialization occurs early in the boot sequence after CPU setup, ensuring that regular processes start under CFS without additional configuration. The framework is extensible; since Linux 6.12, the sched_ext class allows loading BPF-based schedulers from user space as plugins, which can override or complement CFS for SCHED_EXT tasks while falling back to CFS on errors or unloading.

CFS supports both processes and threads uniformly through the per-task sched_entity structure, embedded within each task_struct to encapsulate scheduling state such as vruntime, load weight, and red-black tree nodes. This design treats threads as lightweight tasks sharing the process's address space but with independent scheduling entities, enabling fine-grained fairness across multithreaded applications without distinguishing between process and thread contexts at the scheduler level.
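The following C sketch models, in simplified form, how a class-based scheduler core delegates to its classes; the structure and loop are illustrative stand-ins for the kernel's sched_class mechanism rather than its actual definitions:

```c
#include <stddef.h>

struct rq;          /* per-CPU runqueue, opaque in this sketch */
struct task;        /* stand-in for task_struct                */

/* Each scheduling class exposes a small set of hooks to the core scheduler. */
struct sched_class_sketch {
	const struct sched_class_sketch *next;            /* next lower-priority class */
	void (*enqueue_task)(struct rq *rq, struct task *p);
	void (*dequeue_task)(struct rq *rq, struct task *p);
	struct task *(*pick_next_task)(struct rq *rq);    /* CFS: leftmost rbtree node */
};

/* The core scheduler asks each class in priority order for a runnable task;
 * the fair (CFS) class is consulted after the real-time and deadline classes. */
static struct task *core_pick_next_task(struct rq *rq,
                                        const struct sched_class_sketch *highest)
{
	for (const struct sched_class_sketch *c = highest; c != NULL; c = c->next) {
		struct task *p = c->pick_next_task(rq);
		if (p != NULL)
			return p;
	}
	return NULL; /* nothing runnable: the idle task runs */
}
```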

Priority Handling and Nice Levels

In the Completely Fair Scheduler (CFS), task priorities for SCHED_NORMAL (also known as SCHED_OTHER) processes are managed through nice values, which range from −20 (highest priority) to +19 (lowest priority), providing 40 distinct levels to fine-tune CPU allocation. These values influence a task's share of CPU time by adjusting its scheduling weight, ensuring that higher-priority tasks (lower nice values) receive proportionally more execution time while maintaining overall fairness. The weight for each nice level is taken from a precomputed array in the kernel, designed to implement multiplicative adjustments where each increment in nice value reduces the task's CPU share by approximately 10%. Specifically, the base weight for nice 0 is defined as NICE_0_LOAD = 1024, and weights scale by a factor approximating 1/1.25 per positive nice increment (or ×1.25 for negative increments), though exact values are tabulated to avoid floating-point computations. This results in the following representative weights:
Nice value | Weight
-10 | 9548
0 | 1024
10 | 110
19 | 15
These weights are applied such that a lower weight (higher nice value) causes the task's virtual runtime to increase more rapidly per unit of actual CPU time consumed, effectively reducing its scheduling frequency and keeping it from interfering with higher-priority or interactive workloads. CFS extends priority handling to special policies like SCHED_BATCH and SCHED_IDLE, treating them as low-priority variants to accommodate non-interactive workloads. SCHED_BATCH, used for batch jobs, inherits weights from the nice value but employs adjusted vruntime accrual and reduced preemption heuristics to minimize disruptions to foreground tasks while preserving cache locality. In contrast, SCHED_IDLE tasks are assigned a fixed minimal weight of 3—lower than even nice 19—so they execute essentially only when no other runnable tasks exist, ideal for maintenance operations without impacting system responsiveness.
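A small worked example, using only the weights from the table above, shows how the multiplicative scaling translates into CPU shares when two tasks compete for one CPU:

```c
#include <stdio.h>

/* Worked example with the tabulated weights above: two runnable tasks on one
 * CPU receive time in proportion to their weights, so ten nice levels of
 * difference (about 1.25^10 ~= 9.3x in weight) yield roughly a 90%/10% split. */
int main(void)
{
	double w_nice0  = 1024.0;  /* nice  0, from the table */
	double w_nice10 = 110.0;   /* nice 10, from the table */
	double total    = w_nice0 + w_nice10;

	printf("nice 0:  %.1f%% of the CPU\n", 100.0 * w_nice0 / total);  /* ~90.3 */
	printf("nice 10: %.1f%% of the CPU\n", 100.0 * w_nice10 / total); /* ~9.7  */
	return 0;
}
```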

Load Balancing Across CPUs

In symmetric multiprocessing (SMP) systems, the Completely Fair Scheduler (CFS) employs load balancing to distribute tasks across multiple CPUs, ensuring equitable resource utilization while minimizing overhead. This mechanism prevents any single CPU from becoming overburdened, which could lead to inefficiencies in task execution. Load balancing is invoked periodically and opportunistically, integrating seamlessly with the scheduler's core fairness principles.

Load balancing is triggered periodically by scheduler ticks, during which each CPU evaluates potential imbalances using the load_avg metric—a weighted average of runqueue load that accounts for task weights and recent utilization history. This allows the scheduler to assess system-wide load without excessive computational cost, triggering adjustments only when necessary. If an imbalance is detected, the scheduler initiates task migrations to restore equilibrium across runqueues.

The pull and push logic forms the core of CFS's balancing strategy. Idle CPUs actively pull tasks from busier counterparts to utilize available capacity, selecting candidates via iterators over the red-black tree (rbtree) structures that organize tasks by virtual runtime (vruntime). Conversely, overloaded CPUs push excess tasks to underutilized ones if their load exceeds 1.17 times (default imbalance_pct=117) the average load in the scheduling domain, again leveraging rbtree iterators to identify the most suitable tasks—typically those with the lowest vruntime to prioritize fairness. This bidirectional approach reduces latency in responding to load variations, with pulls favored for idle scenarios to minimize disruptions.

Load balancing operates within a hierarchy of scheduling domains that reflect hardware topology, starting from the lowest level (e.g., multi-core domains sharing a cache) and progressing to higher levels such as the processor package (DIE domains) and NUMA nodes. At each domain level, balancing is performed by a designated CPU—often the first idle one—to coordinate migrations between groups of CPUs, ensuring locality and reducing inter-domain traffic. This hierarchy optimizes for cache and memory locality by limiting migrations to necessary scopes, such as within a socket before escalating to node-wide adjustments.

Integration with power-saving features extends CFS's balancing to heterogeneous architectures, such as ARM's big.LITTLE systems, through Energy Aware Scheduling (EAS). EAS augments the standard pull/push decisions with energy models that consider CPU capacity and power states, favoring migrations to efficient "little" cores when possible without compromising fairness. This enhances overall system efficiency in battery-constrained environments by aligning load distribution with heterogeneous performance domains.

To preserve fairness during migrations, CFS adjusts the vruntime of relocated tasks relative to the minimum vruntime of the source and target runqueues, preserving their relative progress and preventing any task from gaining or losing undue advantage. This maintains the scheduler's goal of approximating ideal fair sharing across CPUs, where each task receives CPU time proportional to its weight regardless of which runqueue it has occupied.
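The renormalization step in the last paragraph can be sketched as follows; this models the dequeue/enqueue adjustment described in the text and is a simplified illustration, not kernel code:

```c
#include <stdint.h>

struct cfs_rq_min { uint64_t min_vruntime; };

/* On migration the task's vruntime is made relative to its old runqueue's
 * floor and then re-based on the new runqueue's floor, so moving between
 * CPUs neither rewards nor penalizes the task. */
static uint64_t migrate_vruntime(uint64_t vruntime,
                                 const struct cfs_rq_min *src,
                                 const struct cfs_rq_min *dst)
{
	uint64_t progress = vruntime - src->min_vruntime; /* progress beyond old floor */
	return dst->min_vruntime + progress;              /* re-based on the new floor */
}
```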

Advanced Features and Extensions

Support for Control Groups

The Completely Fair Scheduler (CFS) integrates with control groups (cgroups) through the cpu controller to enforce fairness and resource limits at the group level, enabling administrators to allocate CPU bandwidth proportionally or absolutely to sets of tasks. This support requires the CONFIG_FAIR_GROUP_SCHED kernel configuration, which extends CFS to treat groups as scheduling entities alongside individual tasks. Group scheduling was introduced in Linux kernel 2.6.24, allowing tasks to be organized into cgroups for fair CPU time division based on relative weights. The cpu.shares parameter, written to a cgroup's cpu.shares file (default 1024), specifies the proportional share of CPU resources; for instance, a group with 2048 shares receives approximately twice the allocation of a group with 1024 shares under load. Tasks are assigned to groups by writing their process IDs to the cgroup's tasks file, and vruntime is tracked per group to normalize fairness within and across hierarchies.

For absolute bandwidth limiting, CFS bandwidth control was merged in Linux kernel 3.2, building on group scheduling to cap CPU usage via the cpu.cfs_quota_us and cpu.cfs_period_us parameters. The cpu.cfs_quota_us value (in microseconds, default -1 for unlimited, range 1000 to 1000000) sets the maximum runtime a group can consume, while cpu.cfs_period_us (default 100000, range 1000 to 1000000) defines the replenishment interval; together, they enforce a fraction of CPU, such as quota=25000 and period=100000 for 25% utilization. When the quota is exhausted, tasks in the group are throttled until the next period, with runtime distributed in slices to per-CPU runqueues for efficient enforcement.

CFS supports hierarchical cgroups, where child groups inherit and aggregate shares from parents, ensuring bandwidth constraints propagate upward—throttling a child if the parent's quota is depleted. This mechanism prevents overconsumption in nested structures without compromising intra-group task fairness. These features are essential for container orchestration (e.g., via libcontainer) and virtualization platforms (e.g., KVM), where they isolate workloads to specific CPU fractions, avoiding interference between tenants while upholding CFS's per-task fairness.
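As an illustration of the interface described above, the sketch below caps a cgroup at 25% of one CPU and doubles its relative share; the cgroup path (/sys/fs/cgroup/cpu/demo) is an assumed cgroup-v1 mount and group that would need to be created beforehand:

```c
#include <stdio.h>

/* Write a single value to a cgroup control file; returns 0 on success. */
static int write_value(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	if (f == NULL)
		return -1;
	fputs(value, f);
	return fclose(f);
}

int main(void)
{
	/* 25,000 us of runtime per 100,000 us period = 25% of one CPU. */
	write_value("/sys/fs/cgroup/cpu/demo/cpu.cfs_period_us", "100000");
	write_value("/sys/fs/cgroup/cpu/demo/cpu.cfs_quota_us",  "25000");

	/* Relative weighting: give this group twice the default 1024 shares. */
	write_value("/sys/fs/cgroup/cpu/demo/cpu.shares", "2048");
	return 0;
}
```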

Tuning Parameters and Configuration

The Completely Fair Scheduler (CFS) in the Linux kernel provides several tunables accessible via /proc/sys/kernel/ to optimize scheduling behavior for different workloads, such as interactive desktops or high-throughput servers. These parameters influence time slice allocation and preemption thresholds, allowing administrators to balance latency and fairness without recompiling the kernel. Changes can be applied temporarily using sysctl commands or persistently via /etc/sysctl.conf or kernel boot parameters. Key tuning parameters, which directly affect vruntime-based time slices, include the following (a worked example of how they interact follows the table):
Parameter | Description | Default value (ns) | Location
sched_latency_ns | Defines the target latency for each scheduler period, during which all runnable tasks should ideally execute once; higher values favor throughput by allowing longer slices for CPU-bound tasks. | 20,000,000 | /proc/sys/kernel/sched_latency_ns
sched_min_granularity_ns | Sets the minimum time slice for any task, ensuring short-running tasks are not overly fragmented; used when the number of tasks exceeds the latency target. | 4,000,000 | /proc/sys/kernel/sched_min_granularity_ns
sched_wakeup_granularity_ns | Controls the threshold for preempting a running task upon wakeup of a higher-priority one, preventing excessive context switches for short wakeups. | 2,000,000 | /proc/sys/kernel/sched_wakeup_granularity_ns
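To make the interaction between the first two tunables concrete, the following sketch reproduces the behavior described above—every runnable task should get a turn within the latency target, but never with a slice shorter than the minimum granularity (illustrative only, not kernel code):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t sched_latency_ns         = 20000000ULL; /* 20 ms target period */
	uint64_t sched_min_granularity_ns = 4000000ULL;  /*  4 ms minimum slice */

	for (unsigned int nr_running = 2; nr_running <= 8; nr_running *= 2) {
		uint64_t period = sched_latency_ns;

		/* With many tasks, the period stretches so each still gets the
		 * minimum granularity (8 tasks * 4 ms = 32 ms > 20 ms target). */
		if ((uint64_t)nr_running * sched_min_granularity_ns > period)
			period = (uint64_t)nr_running * sched_min_granularity_ns;

		printf("%u equal-weight tasks: period %llu ms, slice %llu ms\n",
		       nr_running,
		       (unsigned long long)(period / 1000000),
		       (unsigned long long)(period / nr_running / 1000000));
	}
	return 0;
}
```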
Additional parameters include sched_child_runs_first, which determines whether the child process runs before the parent after fork (default 0, disabled so the parent runs first), at /proc/sys/kernel/sched_child_runs_first, and sched_migration_cost_ns, which specifies the assumed cost of migrating a task between CPUs to influence load balancing decisions (default 500,000 ns), at /proc/sys/kernel/sched_migration_cost_ns.

User-space tools facilitate per-task or per-group adjustments complementary to these global tunables. The chrt command sets scheduling policies and priorities for real-time or fair-share tasks under CFS, while the nice and renice utilities adjust priority for non-real-time processes. For group-based tuning, cgcreate and cgexec from the cgroup-tools package enable creating control groups and executing commands within them, though this section focuses on system-wide parameters rather than cgroup specifics.

Monitoring CFS behavior is possible through /proc/sched_debug, which provides detailed output on runqueues, task vruntimes, and load balancing when CONFIG_SCHED_DEBUG is enabled in the kernel configuration. This file helps verify tunable impacts, such as slice distributions across CPUs.

Best practices for tuning involve reducing sched_latency_ns and related granularities for interactive workloads to minimize response times, while increasing them for server or batch environments to improve throughput and reduce context-switch overhead. Always test changes under representative loads, as excessive reductions can lead to thrashing, and consult the kernel documentation for version-specific behaviors.

Limitations and Criticisms

Despite its design goals of fairness and efficiency, the Completely Fair Scheduler (CFS) can exhibit overly aggressive preemption, which introduces jitter in latency-sensitive workloads such as interactive or multimedia applications. This arises from heuristics like wakeup preemption, where woken tasks preempt the current one only if their virtual runtime advantage exceeds a tunable threshold (about 2 ms by default), potentially causing unnecessary context switches and variability in response times. The reliance on red-black trees for organizing runnable tasks imposes an O(log N) time complexity for scheduling operations, leading to increased overhead in environments with very high task counts, where tree balancing and insertions become computationally expensive compared to simpler queue structures. CFS prioritizes proportional fairness over strict timing guarantees, making it less optimal for real-time workloads without the separate real-time (SCHED_FIFO/SCHED_RR) or deadline (SCHED_DEADLINE) classes, as normal-priority tasks may experience unbounded delays under overload.

Scalability limitations manifest under massive parallelism, particularly on multi-core and NUMA systems, where poor load balancing—such as the scheduling group construction bug—can leave cores idle while runnable tasks queue on busy ones, resulting in up to 27× slowdowns in benchmarks like NAS lu on multi-node setups. In low-task-count scenarios, CFS incurs higher overhead than the prior O(1) scheduler due to its logarithmic complexity and vruntime tracking, even though the absolute log N cost remains small; this trade-off favors fairness at the expense of minimal-load efficiency. For bursty I/O workloads, CFS's fairness model can delay I/O-bound tasks behind CPU-intensive ones, creating trade-offs where short bursts receive disproportionate penalties to maintain long-term equity, as observed in comparisons with simpler schedulers.

Community discussions on LWN and LKML have debated latency spikes arising from vruntime spread and preemption heuristics, with some kernels showing maximum latencies up to 34 ms under periodic loads; these were partially mitigated in 5.x series kernels through features like dynamic minimum vruntime and reduced timeslices. These shortcomings, including heuristic dependencies and multi-core inefficiencies, prompted ongoing refinements and ultimately influenced the transition to the Earliest Eligible Virtual Deadline First (EEVDF) scheduler in Linux 6.6.

Comparisons

With O(1) Scheduler

The Completely Fair Scheduler (CFS) introduced significant improvements in efficiency over its predecessor, the O(1) scheduler, by replacing the O(1) scheduler's array-based per-priority run queues with a red-black tree data structure for task management. While the O(1) scheduler achieved constant-time O(1) operations through 140 separate active and expired priority queues, this approach became less efficient with very large numbers of tasks due to the fixed overhead of maintaining multiple arrays and switching between them. In contrast, CFS employs O(log n) operations for insertions, deletions, and finding the minimum virtual runtime task, which scales better for systems with over 1000 tasks, as the logarithmic cost remains low (e.g., approximately 10 operations for n=1024) without the combinatorial complexity of array maintenance.

In terms of fairness, CFS provides proportional CPU share allocation using virtual runtime (vruntime) tracking, which ensures tasks receive time slices scaled inversely with their nice priority levels, modeling an "ideal" fair multi-tasking scenario without fixed timeslices. The O(1) scheduler relied on fixed timeslices varying by priority (e.g., roughly 800 ms for the highest-priority tasks down to about 5 ms for the lowest, with 100 ms at the default priority) and ad-hoc interactivity estimators to boost I/O-bound tasks, which often led to unfair treatment of CPU-bound processes and inconsistent desktop responsiveness under mixed workloads. CFS's vruntime-based approach eliminates these heuristics, reducing lag in interactive applications by automatically favoring recently woken tasks through favorable vruntime placement at wakeup, thus achieving more consistent fairness across diverse task types.

CFS also enhances latency predictability by eliminating the "timeslice starvation" problem inherent in the O(1) scheduler, where low-priority tasks could be indefinitely delayed if higher-priority ones continuously expired and reactivated their queues. The O(1) scheduler's use of heuristics for interactivity estimation introduced variability and potential jitter in response times, particularly under high load. CFS addresses this with nanosecond-granularity time accounting and a tunable scheduling granularity (default 4-6 ms for desktops), ensuring bounded worst-case latency without reliance on complex priority boosting, resulting in more predictable response times for both interactive and batch workloads.

Benchmarks evaluating fairness and interactivity confirm these gains, with CFS demonstrating lower standard deviation in task completion times (e.g., 20-30% reduction in variability for 8-32 concurrent pi-calculation processes) compared to O(1), indicating superior CPU resource equity. In interactivity tests using X-window event simulations under audio, video, and full system loads, CFS exhibited lower maximum latencies (often below 180 ms, the human perception threshold) while maintaining acceptable average latencies, outperforming O(1)'s higher jitter and occasional spikes that degraded desktop usability. These results highlight CFS's ability to balance fairness and responsiveness more effectively than the O(1) scheduler in 2007-era Linux environments.

With EEVDF Scheduler

The Completely Fair Scheduler (CFS) and its successor, the Earliest Eligible Virtual Deadline First (EEVDF) scheduler, share core objectives in achieving proportional fairness among tasks by leveraging virtual time metrics to allocate CPU resources equitably based on priority. EEVDF extends this foundation by building directly on much of the existing CFS codebase, maintaining compatibility with established scheduling classes while introducing refinements for modern workloads. Notably, CFS's concept of virtual runtime (vruntime) serves as a direct precursor to EEVDF's virtual runtime tracking, ensuring continuity in how task execution time is normalized across varying priorities.

Key differences lie in EEVDF's scheduling mechanism, which employs virtual deadlines and eligibility calculations to decide which task runs next, selecting the task with the earliest virtual deadline among eligible ones rather than relying solely on vruntime comparisons as in CFS. This approach eliminates several ad-hoc heuristics present in CFS, such as those for wakeup placement and preemption, resulting in a more streamlined implementation that reduces overhead in contended scenarios. EEVDF improves upon CFS particularly in managing short and interactive tasks through its earliest-eligible-virtual-deadline-first strategy, which prioritizes low-latency execution without compromising fairness. Benchmarks indicate substantial latency reductions, with up to 41% lower 99th-percentile latencies in control-plane applications compared to CFS. This leads to decreased jitter and more consistent performance in kernel versions 6.6 and beyond, benefiting interactive and latency-sensitive workloads.

EEVDF became the default scheduler for the normal scheduling class in Linux 6.6, released in late 2023, effectively replacing CFS in new deployments while retaining much of the CFS code for compatibility in specialized configurations. By 2025, EEVDF is the preferred choice for contemporary systems due to its enhanced efficiency and latency characteristics, though CFS remains available via kernel configuration options.
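The following conceptual C sketch illustrates the selection rule described above—eligibility based on service received relative to the queue's weighted average, then earliest virtual deadline first; it is a simplification for exposition, not the kernel's implementation:

```c
#include <stdint.h>
#include <stddef.h>

struct entity {
	uint64_t vruntime;   /* weighted service received so far                   */
	uint64_t vdeadline;  /* virtual deadline: vruntime plus the weighted slice */
};

/* Pick the eligible entity (one that has not yet consumed more than its fair
 * share, approximated here as vruntime <= the queue's weighted average) with
 * the earliest virtual deadline. The kernel performs this search with an
 * augmented rbtree rather than a linear scan. */
static struct entity *pick_eevdf_sketch(struct entity *tasks, size_t n,
                                        uint64_t avg_vruntime)
{
	struct entity *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if (tasks[i].vruntime > avg_vruntime)
			continue;                           /* not eligible yet           */
		if (best == NULL || tasks[i].vdeadline < best->vdeadline)
			best = &tasks[i];                   /* earliest eligible deadline */
	}
	return best;
}
```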
