Completely Fair Scheduler
The Completely Fair Scheduler (CFS) is a process scheduler in the Linux kernel designed to allocate CPU time equitably among tasks by approximating an "ideal multi-tasking CPU" scenario, in which each runnable task receives a proportional share of processor resources as if executing in parallel on separate hardware.[1] Developed primarily by Ingo Molnár, CFS was first announced on April 13, 2007, as part of a modular scheduler core overhaul, and was merged into the mainline Linux kernel with version 2.6.23 in October 2007, supplanting the earlier O(1) scheduler as the default for general-purpose (SCHED_NORMAL) workloads.[2][1] This integration marked a shift toward a fairness-focused design that prioritizes desktop interactivity and server efficiency without relying on fixed time slices or complex heuristics.[1]

At its core, CFS employs a red-black tree (rbtree) to keep runnable tasks ordered by their virtual runtime, a metric that accumulates actual CPU usage adjusted for task priority (nice levels), and always selects the task with the lowest virtual runtime for execution.[1] This approach uses nanosecond-granularity accounting for precise fairness, with tunable parameters such as scheduling granularity (default 4–6 ms for desktop use, 10–20 ms for throughput-oriented systems) to balance responsiveness and efficiency.[1] Unlike prior schedulers, CFS avoids runqueue arrays and periodic ticks for task selection, reducing overhead and eliminating artifacts such as unfairness from array switches.[1][2]

CFS operates within a class-based framework, handling the SCHED_NORMAL, SCHED_BATCH (for CPU-intensive tasks), and SCHED_IDLE policies, while real-time classes (SCHED_FIFO and SCHED_RR) are managed separately via sched_rt.c.[1] Key extensions include group scheduling (enabled via CONFIG_FAIR_GROUP_SCHED), which allows hierarchical resource control through cgroups and cpu.shares for proportional bandwidth allocation, and bandwidth control to limit maximum CPU usage per group.[3][1] These features enhance scalability on multi-core SMP systems through simplified load balancing, in which idle CPUs actively pull tasks from busy ones.[1]

Compared to the O(1) scheduler, CFS offers superior resistance to manipulation attacks (e.g., via tools such as fiftyp.c), better handling of priority adjustments, and no performance regressions in benchmarks, while reducing kernel code size and improving overall system isolation.[2][1] It served as the cornerstone of Linux scheduling for nearly two decades, powering diverse environments from desktops to servers. However, CFS was superseded by the Earliest Eligible Virtual Deadline First (EEVDF) scheduler, which became the default in Linux kernel 6.6 (released October 2023), to address certain limitations in fairness and latency; while CFS code remains for compatibility, EEVDF is the current standard for general-purpose scheduling as of 2025.[4][5][6]

Introduction
Overview
The Completely Fair Scheduler (CFS) served as the default process scheduler for SCHED_NORMAL tasks in the Linux kernel from version 2.6.23, released in October 2007, through version 6.5.[1] Developed primarily by Ingo Molnár, CFS was designed to handle the scheduling of typical desktop and server workloads by ensuring equitable CPU resource allocation among competing processes.[1] At its core, CFS employs a fair queuing mechanism that distributes CPU time proportionally to runnable tasks based on their priority and nice values, aiming to approximate an "ideal multi-tasking CPU" where each task receives its fair share without relying on fixed timeslices.[1] This approach prioritizes conceptual fairness over rigid quanta, using virtual runtime as the key metric to track and balance actual execution time against a task's entitled share, thereby minimizing scheduling overhead and promoting responsiveness.[1]

In Linux kernel version 6.6, released in late 2023, CFS was succeeded by the Earliest Eligible Virtual Deadline First (EEVDF) scheduler as the new default for the fair scheduling class, addressing certain latency issues in CFS while building on its foundational principles.[4][6] Despite this transition, CFS's influence persists in the Linux ecosystem: it remains available as a configurable option and continues to underpin many existing deployments and extensions.[1]

Design Goals
The Completely Fair Scheduler (CFS) was designed primarily to enhance desktop interactivity by providing low-latency responses for interactive workloads, addressing the limitations of prior schedulers such as the O(1) scheduler, which struggled with mixed interactive and CPU-bound tasks.[1][2] This focus aimed to model an ideal multi-tasking CPU where each task receives a fair share of processing time proportional to its priority, ensuring that no task is starved while maintaining responsiveness for user-facing applications.[1] A key motivation was to eliminate the heuristic-based fixed timeslices and jiffy-dependent accounting of earlier schedulers, replacing them with nanosecond-granularity runtime tracking to achieve more precise and deterministic fairness without sacrificing efficiency.[2] By using virtual runtime as a measure of accumulated CPU usage adjusted for priority, CFS ensures proportional allocation that favors interactive tasks in desktop scenarios while allowing tunable parameters for server or batch environments.[1]

CFS targeted general-purpose computing across desktops, servers, and embedded systems, with optimizations for low-latency desktop use as the default, up through Linux kernel version 6.5.[1] This design improved upon its predecessors by better handling diverse workloads, such as combining short interactive bursts with long-running computations, without the array-switching artifacts that degraded performance in the O(1) scheduler.[2]

Historical Development
Pre-CFS Schedulers
The earliest Linux kernel schedulers, used from version 0.01 in 1991 through the 2.4 series and into the 2.5 development branch, employed an O(n) algorithm that performed a linear search over all runnable tasks to select the next one to execute.[7] This approach relied on a goodness() function to evaluate task priority and interactivity, iterating through the entire run queue during each scheduling decision, which resulted in O(n) time complexity where n is the number of tasks.[7] As systems scaled to hundreds or thousands of tasks, common in server environments or with threaded applications such as Java virtual machines, the linear search imposed significant overhead, degrading responsiveness and interactivity by increasing scheduling latency proportionally with load.[7] Additionally, the O(n) scheduler assigned large fixed timeslices (often around 210 ms on average), which could delay low-priority tasks for extended periods, such as over 20 seconds in scenarios with 100 competing tasks.[7]
To address these scalability issues, Ingo Molnar announced the O(1) scheduler in January 2002 as a patch for the 2.5 development kernel, with its full integration occurring in the stable Linux 2.6.0 release in December 2003.[8] The O(1) design achieved constant-time task selection by organizing tasks into two per-CPU priority arrays—one active and one expired—allowing the scheduler to pick the highest-priority runnable task directly from the highest non-empty priority bitmap without iterating over all tasks.[9] It retained fixed timeslices scaled by static priority (e.g., (140 - static priority) × 20 ms for priorities below 120), and incorporated interactivity heuristics, such as dynamic priority boosts based on average sleep time, to favor responsive tasks like those in graphical interfaces.[9] This scheduler remained the default from Linux 2.6.0 through 2.6.22.[9]
Despite its efficiency gains, the O(1) scheduler exhibited several limitations that compromised fairness and performance in diverse workloads. Its heavy bias toward interactive tasks via heuristics often led to unfairness in mixed environments, where batch or CPU-bound processes received disproportionately less CPU time, sometimes starving them under load.[9] Interactive tasks themselves suffered high latency during heavy contention, as the fixed timeslice model and priority adjustments failed to adapt smoothly to varying demands.[9] Furthermore, "scheduler artifacts" such as periodic switches between the active and expired arrays introduced unnecessary jitter, disrupting timing-sensitive applications, while the growing complexity of ad-hoc heuristics made the code increasingly difficult to maintain.[9] These issues, building on the foundational scalability problems of the O(n) era, highlighted the need for a more principled approach to CPU allocation.[10]
Introduction and Key Contributors
The Completely Fair Scheduler (CFS) was developed primarily by Ingo Molnár in 2007 as a new process scheduling algorithm for the Linux kernel, drawing inspiration from earlier fair-scheduling concepts while aiming to simplify and enhance performance.[1] Molnár, a prominent kernel developer previously known for the O(1) scheduler, led the implementation with significant contributions from Mike Galbraith, who focused on interactivity tuning to improve responsiveness for user-facing applications.[11] Other collaborators included Peter Zijlstra for performance optimizations and Roman Zippel for integrating ideas from the "Really Fair Scheduler" (RFS).[12] CFS was merged into Linux kernel version 2.6.23 in October 2007, becoming the default scheduler for the SCHED_OTHER class and replacing the prior O(1) scheduler.[1] This integration marked a shift toward a more heuristic-free approach, with the core design encapsulated in a single rule: select the task with the lowest virtual runtime to ensure proportional CPU allocation.[10]

The primary motivations for CFS were to enhance desktop responsiveness and achieve strict fairness in CPU time distribution without relying on the complex, ad-hoc heuristics that plagued earlier schedulers.[2] By prioritizing low latency for interactive tasks, such as those in graphical user interfaces, while maintaining overall throughput, CFS addressed complaints about jittery performance under mixed workloads.[10] Upon release, CFS received positive reception for its ability to reduce latency in interactive scenarios, leading to rapid adoption across major Linux distributions. By 2008, it had become the standard scheduler in releases such as Ubuntu 8.04 (using kernel 2.6.24) and Fedora 9 (using kernel 2.6.25), solidifying its role in improving everyday desktop and server usability.[13][14]

Subsequent Enhancements and Transition to EEVDF
Following the initial introduction of the Completely Fair Scheduler (CFS) in Linux kernel 2.6.23, subsequent releases incorporated several key enhancements to improve its scalability, fairness, and responsiveness across diverse workloads. One significant addition was the integration of CFS with control groups (cgroups) in kernel 2.6.24, enabling group-based scheduling where CPU shares could be allocated proportionally to hierarchical process groups rather than individual tasks. This feature, requiring the CONFIG_CGROUPS and CONFIG_FAIR_GROUP_SCHED kernel options, allowed for better resource isolation in multi-tenant environments such as servers and virtualized systems.[15] Further refinements to symmetric multiprocessing (SMP) load balancing arrived in kernel 2.6.32, where the scheduler's runqueue management was optimized to reduce overhead in multi-core systems by improving task migration decisions and eliminating unnecessary runqueue traversals during balancing operations.[1] These changes enhanced throughput in high-core-count setups without compromising fairness.

Latency-focused tweaks in the late 2.6.3x releases and the subsequent 3.x series bolstered support for interactive and multimedia applications. Kernel 2.6.36 reduced the minimum granularity and target latency parameters, allowing shorter time slices under low load and decreasing worst-case delays from around 20 ms to under 6 ms in typical desktop use cases; this adjustment prioritized responsiveness for latency-sensitive tasks while maintaining overall fairness.[16] Later, kernel 2.6.38 added the autogrouping feature, which automatically groups tasks by session so that CPU-hogging applications cannot starve interactive shells or desktop elements, improving perceived performance on workstations. These modifications collectively addressed criticisms of CFS's occasional high-latency bursts under mixed workloads.

Performance optimizations continued into the 4.x and 5.x series, with notable advancements in bandwidth control and handling of diverse workloads. CFS bandwidth control, merged in kernel 3.2, extended group scheduling by introducing the cpu.cfs_quota_us and cpu.cfs_period_us parameters in cgroups, enforcing hard CPU limits on groups to prevent overconsumption while preserving fairness within quotas. This was particularly useful for containerized environments, where it ensured predictable resource allocation without excessive throttling.[3] In the 4.x kernels onward, refinements to the scheduler addressed performance bugs and improved overall efficiency. Additional tweaks in 5.x, such as improved load tracking for heterogeneous cores, minimized unnecessary migrations and enhanced energy efficiency on modern hardware.

The evolution of CFS culminated in its replacement by the Earliest Eligible Virtual Deadline First (EEVDF) scheduler in Linux kernel 6.6, released in October 2023. Proposed and implemented by kernel developer Peter Zijlstra, EEVDF refines CFS's fairness model by incorporating explicit virtual deadlines, selecting the task with the earliest eligible deadline to reduce latency variance while inheriting CFS's virtual runtime tracking for proportional shares.
This shift addressed longstanding heuristics in CFS that could lead to suboptimal latency in bursty workloads, achieving up to 50% lower tail latencies in microbenchmarks without sacrificing throughput.[17][4] As of November 2025, EEVDF serves as the default scheduler for the fair scheduling class in new kernels.[18]

Core Concepts and Algorithm
Virtual Runtime Mechanism
The virtual runtime (vruntime) in the Completely Fair Scheduler (CFS) represents a normalized measure of a task's CPU usage, adjusted according to its priority weight to ensure proportional allocation of processor time.[1] This metric allows CFS to simulate an ideal multitasking environment in which CPU resources are shared fairly among runnable tasks, regardless of their actual execution history. The vruntime is updated incrementally based on the actual CPU time a task consumes, using the formula

\Delta \text{vruntime} = \text{actual\_runtime} \times \frac{\text{NICE\_0\_LOAD}}{\text{task\_weight}}

where \text{NICE\_0\_LOAD} is a fixed constant of 1024 corresponding to the default weight for a nice level of 0, and \text{task\_weight} is a value derived from the task's nice level (ranging from -20 to 19), with higher weights assigned to higher-priority tasks to slow their vruntime accumulation.[1] For a task at nice level 0, this simplifies to \Delta \text{vruntime} = \text{actual\_runtime}, providing a baseline for equal-priority sharing.[10]

The primary purpose of vruntime is to enable CFS to select the task with the lowest value for scheduling, thereby approximating proportional fairness: higher-priority tasks (with larger weights) receive more CPU time before their vruntime catches up to others. Vruntime increases monotonically for active tasks, and the scheduler tracks a per-CPU minimum vruntime (min_vruntime) to maintain global progress and prevent any task from falling indefinitely behind.[1] Over time, equal-priority tasks converge toward identical vruntime values as they each accrue equivalent normalized usage, enforcing the scheduler's fairness goal.[10]

Sleeping or blocked tasks do not accumulate vruntime during their inactive periods, causing their values to lag behind those of running tasks and granting them higher scheduling eligibility upon wakeup to compensate for missed execution opportunities.[1] This "sleeper fairness" ensures that I/O-bound or intermittently active tasks are not systematically disadvantaged compared to CPU-bound ones.[1] To mitigate scheduling overhead from short-running tasks and variability in load, CFS enforces a granularity parameter, typically 4 to 6 milliseconds on desktop systems, which sets the minimum runtime slice before preemption is considered, balancing responsiveness with efficiency.[1] Vruntime also serves as the ordering key for the runqueue data structures described in the next section, allowing the scheduler to locate the next task quickly.
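The update rule above can be illustrated with a minimal userspace sketch (not kernel code). The NICE_0_LOAD constant and the nice -10 weight of 9548 follow the values quoted in this article; the structure and function names are illustrative assumptions, not the kernel's.

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal userspace sketch of the vruntime update rule described above.
 * NICE_0_LOAD and the nice -10 weight (9548) follow the values quoted in
 * this article; the struct and function names are illustrative only. */
#define NICE_0_LOAD 1024ULL

struct task {
    uint64_t vruntime;    /* normalized CPU usage, in nanoseconds */
    unsigned int weight;  /* scheduling weight derived from the nice level */
};

/* Charge delta_exec nanoseconds of real CPU time to a task: heavier
 * (higher-priority) tasks accumulate vruntime more slowly. */
static void update_vruntime(struct task *t, uint64_t delta_exec)
{
    t->vruntime += delta_exec * NICE_0_LOAD / t->weight;
}

int main(void)
{
    struct task nice_0   = { 0, 1024 };  /* nice 0, the baseline weight    */
    struct task nice_m10 = { 0, 9548 };  /* nice -10, per the weight table */

    /* Both tasks run for 10 ms of wall-clock CPU time. */
    update_vruntime(&nice_0, 10000000);
    update_vruntime(&nice_m10, 10000000);

    /* The nice -10 task advances roughly 9.3x more slowly, so it will be
     * picked again long before the nice 0 task catches up. */
    printf("nice   0: vruntime = %llu ns\n", (unsigned long long)nice_0.vruntime);
    printf("nice -10: vruntime = %llu ns\n", (unsigned long long)nice_m10.vruntime);
    return 0;
}
```

Running both tasks for the same wall-clock time leaves the higher-priority task with a far smaller vruntime, which is why CFS keeps selecting it more often.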
Data Structures and Task Organization

The Completely Fair Scheduler (CFS) employs a red-black tree as its primary data structure to organize runnable tasks in a time-ordered manner, ensuring efficient management based on virtual runtime (vruntime). This self-balancing binary search tree sorts tasks by their vruntime values, with the leftmost node representing the task with the smallest vruntime, which is selected for execution.[1] Operations such as insertion and deletion on the tree achieve O(log N) time complexity, where N is the number of tasks, providing scalability for systems with varying numbers of concurrent processes.[10]

Per-CPU runqueues in CFS are augmented with a CFS-specific runqueue structure known as cfs_rq, which encapsulates the red-black tree root along with additional metadata for task management. The cfs_rq includes fields such as min_vruntime, which tracks the smallest vruntime observed on the runqueue to maintain a monotonic reference for new task placements, and a load sum representing the total weighted load of the tasks within it.[1] This structure integrates with the broader scheduler runqueue (rq) via rq->cfs, allowing CFS to handle fair scheduling while coexisting with other scheduler classes such as real-time tasks.[10]

Each task in CFS is represented by a sched_entity (se) structure, which stores key attributes including the task's vruntime, weight, and red-black tree node for integration into the cfs_rq tree. The se is embedded within the task_struct and forms the basis of the hierarchy of scheduling entities, enabling support for grouped tasks without altering the core organization.[1] Vruntime serves as the primary sorting key within the se, dynamically updated to reflect accrued execution time and ensuring tasks are positioned accordingly in the tree.[10]

This rbtree-based approach offers significant advantages over the runqueues used in prior Linux schedulers, such as the O(1) scheduler, by eliminating fixed priority arrays that could introduce scheduling artifacts like uneven task switching.[2] Unlike the earlier linear run-queue scan, which required O(N) time for selection, the red-black tree enables logarithmic-time access and maintenance, reducing overhead in multiprocessor environments and improving overall system responsiveness without relying on heuristic adjustments.[1]
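The organization described above can be mocked up in a compilable sketch. The field names loosely mirror the kernel's sched_entity and cfs_rq, but the rb_node/rb_root placeholders are not a working red-black tree, and pick_leftmost() is a hypothetical stand-in for the kernel's cached-leftmost lookup.

```c
#include <stddef.h>
#include <stdint.h>

/* Compilable mock-up of how CFS organizes tasks. The field names loosely
 * mirror the kernel's sched_entity and cfs_rq, but the rb_node/rb_root
 * placeholders are not a working red-black tree, and pick_leftmost() is a
 * hypothetical stand-in for the kernel's cached-leftmost lookup. */

struct rb_node { struct rb_node *left, *right, *parent; };
struct rb_root { struct rb_node *node; };

/* Per-task scheduling state; embedded in task_struct in the real kernel. */
struct sched_entity {
    uint64_t vruntime;          /* sort key within the cfs_rq tree */
    unsigned long load_weight;  /* weight derived from the nice level */
    struct rb_node run_node;    /* links the entity into the rbtree */
    int on_rq;                  /* whether the entity is currently queued */
};

/* Per-CPU fair runqueue, reached through rq->cfs in the real kernel. */
struct cfs_rq {
    struct rb_root tasks_timeline;  /* entities ordered by vruntime */
    struct rb_node *rb_leftmost;    /* cached smallest-vruntime entity */
    uint64_t min_vruntime;          /* monotonic floor for new placements */
    unsigned long load_sum;         /* total weight of queued entities */
    unsigned int nr_running;
};

/* Selection reduces to following the cached leftmost pointer: the entity
 * with the smallest vruntime is always the next candidate to run. */
static struct sched_entity *pick_leftmost(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = cfs_rq->rb_leftmost;
    if (!left)
        return NULL;
    /* container_of-style recovery of the enclosing sched_entity */
    return (struct sched_entity *)((char *)left -
                                   offsetof(struct sched_entity, run_node));
}

int main(void)
{
    struct sched_entity se = { .vruntime = 0, .load_weight = 1024, .on_rq = 1 };
    struct cfs_rq rq = {
        .tasks_timeline = { &se.run_node },
        .rb_leftmost    = &se.run_node,
        .min_vruntime   = 0,
        .load_sum       = 1024,
        .nr_running     = 1,
    };
    return pick_leftmost(&rq) == &se ? 0 : 1;
}
```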
Scheduling Decisions and Time Allocation

The Completely Fair Scheduler (CFS) selects the next task to run by identifying the one with the lowest virtual runtime in its per-CPU red-black tree, ensuring that tasks which have accumulated the least normalized CPU usage are prioritized for execution. This selection process, known as picking the "leftmost" node in the tree, occurs during scheduling events such as timer interrupts or task wakeups, promoting fairness by always advancing the task that is most "behind" in its entitled share of CPU time. Once selected, the task's virtual runtime is updated incrementally as it consumes CPU cycles, reflecting its actual execution time adjusted for fairness.[1]

Time allocation in CFS avoids fixed timeslices; instead, the runtime granted to each task is computed dynamically from its relative weight within the total weight of all runnable tasks on the CPU, divided proportionally from a target latency period. The allocated slice approximates (target_latency × task_weight) / total_weight, so higher-weighted tasks (corresponding to higher priorities) receive larger shares while overall fairness is maintained within the latency window. The target latency is configurable via the kernel parameter sysctl_sched_latency_ns, which defaults to approximately 20 ms on typical multi-core desktop systems (scaling with the number of CPUs as 6 ms × (1 + log2(number of CPUs))), and serves to throttle the scheduling frequency to balance responsiveness and throughput. This approach ensures that all tasks progress toward equal virtual runtime over time, with the granularity of allocation tuned by sysctl_sched_min_granularity_ns to prevent excessive context switches.[1]

Preemption in CFS is triggered when a newly runnable task has a virtual runtime sufficiently lower than that of the currently running task, specifically if the difference exceeds the execution delta since the last check, prompting an immediate context switch to restore fairness. This check occurs at potential preemption points, such as timer ticks or when tasks become runnable, and is modulated by the granularity parameters to avoid over-scheduling on lightly loaded systems. Additionally, when a CPU enters an idle state, CFS performs an "idle pull" to balance load by migrating tasks from overloaded CPUs, ensuring underutilized processors do not remain idle while others are burdened.[1]

Wakeup handling in CFS positions newly awakened tasks in the red-black tree such that their virtual runtime accounts for the time spent sleeping, effectively adjusting it to a value that prevents long-sleeping tasks from being starved upon resumption. This placement favors recently woken tasks for potential preemption via the wakeup_preempt() function, which evaluates whether the woken task's virtual runtime warrants interrupting the current runner, thereby preserving low latency for interactive workloads without unfairly penalizing compute-bound ones.[1]
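A small sketch of the proportional slice computation described above follows. The constants echo the default-style values quoted in the text, the names are illustrative, and the real scheduler stretches the scheduling period when many tasks are runnable rather than simply clamping each slice.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the proportional time-slice computation described above: each
 * runnable task receives a share of the target latency in proportion to its
 * weight. The constants echo the defaults quoted in the text; the names are
 * illustrative, and the real scheduler stretches the period (rather than
 * simply clamping per-task) when many tasks are runnable. */
#define TARGET_LATENCY_NS  20000000ULL  /* sysctl_sched_latency_ns-style target */
#define MIN_GRANULARITY_NS  4000000ULL  /* floor that limits context switching */

static uint64_t slice_ns(unsigned long task_weight, unsigned long total_weight)
{
    uint64_t slice = TARGET_LATENCY_NS * task_weight / total_weight;
    return slice < MIN_GRANULARITY_NS ? MIN_GRANULARITY_NS : slice;
}

int main(void)
{
    /* Three runnable tasks: two at nice 0 (weight 1024) and one at
     * nice -10 (weight 9548); total weight = 11596. */
    unsigned long total = 1024 + 1024 + 9548;

    printf("nice   0 slice: %llu ns\n",
           (unsigned long long)slice_ns(1024, total));  /* ~1.77 ms, clamped to 4 ms */
    printf("nice -10 slice: %llu ns\n",
           (unsigned long long)slice_ns(9548, total));  /* ~16.5 ms */
    return 0;
}
```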
Implementation Details
Kernel Integration and Code Structure
The Completely Fair Scheduler (CFS) is integrated into the Linux kernel as one of several scheduling classes within a hierarchical framework, allowing multiple scheduler modules to coexist and handle different task types. CFS specifically manages tasks under the SCHED_NORMAL policy (also known as SCHED_OTHER), while other classes like real-time (SCHED_FIFO and SCHED_RR) and deadline-based (SCHED_DEADLINE) handle higher-priority or specialized workloads. This integration occurs through the sched_class structure, a linked list of scheduler modules ordered by priority, in which CFS occupies the fair-class position. Key hooks such as pick_next_task_fair and enqueue_task_fair are registered in this structure to interface with the core scheduler; for instance, pick_next_task iterates through the classes starting from the highest priority and delegates to CFS's implementation when appropriate, selecting the task with the minimal virtual runtime (vruntime) for fairness.[1][19][10]
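The hook-table pattern can be sketched as follows. This is a heavily simplified mock-up: the real struct sched_class in kernel/sched/sched.h has many more operations with different signatures, and the exact class-ordering mechanism varies across kernel versions, so the fields and dispatch loop here are illustrative only.

```c
#include <stddef.h>

/* Simplified mock-up of the hook-table pattern described above. The real
 * struct sched_class (kernel/sched/sched.h) has many more operations with
 * different signatures; this sketch only shows how the core scheduler
 * dispatches to per-class callbacks such as the fair (CFS) implementations. */

struct rq;           /* per-CPU runqueue (opaque here) */
struct task_struct;  /* a runnable task (opaque here) */

struct sched_class {
    const struct sched_class *next;  /* classes ordered by priority */
    void (*enqueue_task)(struct rq *rq, struct task_struct *p);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p);
    struct task_struct *(*pick_next_task)(struct rq *rq);
};

/* Fair-class callbacks; in the kernel these live in kernel/sched/fair.c. */
static void enqueue_task_fair(struct rq *rq, struct task_struct *p) { (void)rq; (void)p; }
static void dequeue_task_fair(struct rq *rq, struct task_struct *p) { (void)rq; (void)p; }
static struct task_struct *pick_next_task_fair(struct rq *rq) { (void)rq; return NULL; }

static const struct sched_class fair_sched_class = {
    .next           = NULL,  /* the idle class would follow in the real list */
    .enqueue_task   = enqueue_task_fair,
    .dequeue_task   = dequeue_task_fair,
    .pick_next_task = pick_next_task_fair,
};

/* Core-scheduler-style dispatch: walk the classes from highest priority
 * downward and return the first task any class offers. */
static struct task_struct *pick_next_task(struct rq *rq,
                                          const struct sched_class *highest)
{
    const struct sched_class *class;

    for (class = highest; class; class = class->next) {
        struct task_struct *p = class->pick_next_task(rq);
        if (p)
            return p;
    }
    return NULL;
}

int main(void)
{
    /* With only the fair class registered and an empty runqueue,
     * no task is returned. */
    return pick_next_task(NULL, &fair_sched_class) ? 1 : 0;
}
```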
The core implementation of CFS resides primarily in the kernel/sched/fair.c file, which contains the logic for fair task selection, red-black tree management for runnable tasks, and vruntime updates. This file defines the fair scheduling class and its operations, ensuring efficient O(log N) insertions and extractions for task enqueueing and dequeuing. CFS integrates seamlessly with the broader scheduler framework in kernel/sched/core.c, which handles generic mechanisms like timer ticks, context switches, and the main schedule() function that invokes class-specific hooks during preemption or voluntary yielding. For example, during a context switch, core.c calls into fair.c to update runqueues and select the next task, maintaining compatibility with the kernel's tick-based or tickless modes.[1][20][10]
During the kernel boot process, CFS is initialized as the default scheduler for SCHED_NORMAL tasks through the sched_init() function in kernel/sched/core.c, which sets up per-CPU runqueues, allocates structures for the root task group, and configures the initial task (init) with fair scheduling parameters like load weight and cfs_rq (CFS runqueue). This initialization occurs early in the boot sequence after CPU setup, ensuring that regular processes start under CFS without additional configuration. The framework is extensible; since Linux kernel 6.12, the sched_ext class allows loading user-space BPF-based schedulers as plugins, which can override or complement CFS for SCHED_EXT tasks while falling back to CFS on errors or unloading.[21][22]
CFS supports both processes and threads uniformly through the per-task sched_entity structure, embedded within each task_struct to encapsulate scheduling state such as vruntime, load weight, and red-black tree nodes. This design treats threads as lightweight tasks sharing the process's address space but with independent scheduling entities, enabling fine-grained fairness across multithreaded applications without distinguishing between process and thread contexts at the scheduler level.[10][23]
Priority Handling and Nice Levels
In the Completely Fair Scheduler (CFS), task priorities for SCHED_NORMAL (also known as SCHED_OTHER) processes are managed through nice values, which range from -20 (highest priority) to +19 (lowest priority), providing 40 distinct levels to fine-tune CPU allocation.[1] These values influence a task's share of CPU time by adjusting its scheduling weight, ensuring that higher-priority tasks (lower nice values) receive proportionally more execution time while maintaining overall fairness.[1] The weight for each nice level is taken from a precomputed array in the kernel, designed so that each one-step increase in nice value reduces the task's CPU share by roughly 10%. Specifically, the base weight for nice 0 is defined as NICE_0_LOAD = 1024, and weights scale by a factor of approximately 1/1.25 per positive nice increment (or ×1.25 per negative increment), with the exact values tabulated to avoid floating-point computations. This results in the following representative weights; a short sketch after the table shows how these weights translate into relative CPU shares:

| Nice Value | Weight |
|---|---|
| -10 | 9548 |
| 0 | 1024 |
| 10 | 110 |
| 19 | 15 |
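How these weights become relative CPU shares can be seen in a small sketch. The nice 1 weight below is approximated from the ~1.25 per-step factor described above rather than taken from the kernel's exact table; the other values use the table entries shown.

```c
#include <stdio.h>

/* Illustration of how nice-level weights become relative CPU shares when
 * tasks compete for one CPU. The weights for nice -10, 0, and 10 come from
 * the table above; the nice 1 value is approximated from the ~1.25 per-step
 * factor described in the text (the kernel uses an exact table). */

static double share_pct(unsigned int a, unsigned int b)
{
    return 100.0 * a / (a + b);
}

int main(void)
{
    /* Two competing tasks one nice step apart: roughly a ten-point swing,
     * which is the "10% per nice level" rule of thumb. */
    unsigned int w_nice0 = 1024, w_nice1 = 820 /* ~1024 / 1.25 */;
    printf("nice 0 vs nice 1:    %.1f%% vs %.1f%%\n",
           share_pct(w_nice0, w_nice1), share_pct(w_nice1, w_nice0));

    /* Twenty nice levels apart: the higher-priority task dominates. */
    unsigned int w_m10 = 9548, w_p10 = 110;
    printf("nice -10 vs nice 10: %.1f%% vs %.1f%%\n",
           share_pct(w_m10, w_p10), share_pct(w_p10, w_m10));
    return 0;
}
```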
Load Balancing Across CPUs
In symmetric multiprocessing (SMP) systems, the Completely Fair Scheduler (CFS) employs load balancing to distribute tasks across multiple CPUs, ensuring equitable resource utilization while minimizing overhead. This mechanism prevents any single CPU from becoming overburdened, which could lead to inefficiencies in task execution. Load balancing is invoked periodically and opportunistically, integrating seamlessly with the scheduler's core fairness principles.[1]

Load balancing is triggered periodically by scheduler ticks, during which each CPU evaluates potential imbalances using the load_avg metric, a weighted average of active load that accounts for task weights and recent utilization history. This allows the scheduler to assess system-wide load without excessive computational cost, triggering adjustments only when necessary. If an imbalance is detected, the scheduler initiates task migrations to restore equilibrium across runqueues.[1]
The pull and push logic forms the core of CFS's balancing strategy. Idle CPUs actively pull tasks from busier counterparts to utilize available capacity, selecting candidates via iterators over the red-black tree (RB-tree) structures that organize tasks by virtual runtime (vruntime). Conversely, overloaded CPUs push excess tasks to underutilized ones if their load exceeds 1.17 times (default imbalance_pct=117) the average load in the scheduling domain, again leveraging RB-tree iterators to identify the most suitable tasks—typically those with the lowest vruntime to prioritize fairness. This bidirectional approach reduces latency in responding to load variations, with pulls favored for idle scenarios to minimize disruptions.[1][24][25]
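The threshold test can be sketched as follows; the structure and function names are hypothetical, and the real balancing code compares group statistics within a scheduling domain rather than two bare runqueues.

```c
#include <stdbool.h>
#include <stdio.h>

/* Minimal sketch of the imbalance threshold described above: a remote
 * runqueue is treated as overloaded only when its load exceeds the local
 * load by the domain's imbalance percentage (default 117, i.e. 17%).
 * Names and structure are illustrative, not the kernel's balancing code. */

struct cpu_load {
    unsigned long load_avg;  /* weighted, history-decayed runqueue load */
};

static bool worth_pulling(const struct cpu_load *busiest,
                          const struct cpu_load *local,
                          unsigned int imbalance_pct)
{
    /* Scale the local load by imbalance_pct/100 before comparing, so small
     * transient differences do not trigger costly task migrations. */
    return busiest->load_avg * 100 > local->load_avg * imbalance_pct;
}

int main(void)
{
    struct cpu_load busy  = { .load_avg = 2400 };
    struct cpu_load local = { .load_avg = 2000 };

    /* 2400 * 100 > 2000 * 117, so this CPU would try to pull a task. */
    printf("pull from busier CPU: %s\n",
           worth_pulling(&busy, &local, 117) ? "yes" : "no");
    return 0;
}
```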
Load balancing operates within a hierarchical structure of scheduling domains that reflect hardware topology, starting from the lowest level (e.g., multi-core domains sharing L2 cache) and progressing to higher levels such as sockets (DIE domains) and NUMA nodes. At each domain level, balancing is performed by a designated CPU—often the first idle one—to coordinate migrations between groups of CPUs, ensuring locality and reducing inter-domain traffic. This hierarchy optimizes for performance by limiting migrations to necessary scopes, such as within a socket before escalating to node-wide adjustments.[26]
Integration with power-saving features extends CFS's balancing to heterogeneous architectures, such as ARM's big.LITTLE systems, through Energy Aware Scheduling (EAS). EAS augments the standard pull/push decisions with energy models that consider CPU capacity and power states, favoring migrations to efficient "little" cores when possible without compromising fairness. This enhances overall system efficiency in battery-constrained environments by aligning load distribution with heterogeneous performance domains.[27]
To preserve fairness during migrations, CFS adjusts the vruntime of relocated tasks relative to the minimum vruntime on the target runqueue, ensuring proportionality to their weight and preventing any task from gaining or losing undue advantage. This normalization maintains the scheduler's goal of approximating ideal fair sharing across CPUs, where each task receives CPU time proportional to its priority regardless of migration history.[28][1]
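A minimal sketch of this normalization follows, assuming the common subtract-on-dequeue / add-on-enqueue scheme described above; the names and exact bookkeeping are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the vruntime normalization described above: a migrating task's
 * vruntime is made relative to the source runqueue's min_vruntime and then
 * rebased onto the target's, so differing per-CPU clocks neither favor nor
 * penalize it. Field and function names are illustrative assumptions. */

struct entity   { uint64_t vruntime; };
struct runqueue { uint64_t min_vruntime; };

static void migrate(struct entity *se, struct runqueue *src, struct runqueue *dst)
{
    se->vruntime -= src->min_vruntime;  /* make vruntime relative on dequeue */
    se->vruntime += dst->min_vruntime;  /* rebase onto the target on enqueue */
}

int main(void)
{
    struct runqueue cpu0 = { .min_vruntime = 500000000 };  /* 500 ms of progress */
    struct runqueue cpu1 = { .min_vruntime = 120000000 };  /* 120 ms of progress */
    struct entity   task = { .vruntime = 502000000 };      /* 2 ms past cpu0's minimum */

    migrate(&task, &cpu0, &cpu1);

    /* The task keeps its 2 ms offset relative to the new runqueue's minimum. */
    printf("vruntime on cpu1: %llu ns\n", (unsigned long long)task.vruntime);
    return 0;
}
```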
Advanced Features and Extensions
Support for Control Groups
The Completely Fair Scheduler (CFS) integrates with control groups (cgroups) through the cpu controller to enforce fairness and resource limits at the group level, enabling administrators to allocate CPU bandwidth proportionally or absolutely to sets of tasks.[29] This support requires the CONFIG_FAIR_GROUP_SCHED kernel configuration, which extends CFS to treat cgroups as scheduling entities alongside individual tasks.[29]

Group scheduling was introduced in Linux kernel 2.6.24, allowing tasks to be organized into cgroups for fair CPU time division based on relative weights.[15] The cpu.shares parameter, written to a cgroup's cpu.shares file (default 1024), specifies the proportional share of CPU resources; for instance, a group with 2048 shares receives approximately twice the allocation of a group with 1024 shares under load.[29] Tasks are assigned to groups by writing their process IDs to the cgroup's tasks file, and vruntime is tracked per group to normalize fairness within and across hierarchies.[29]

For absolute bandwidth limiting, CFS bandwidth control was merged in Linux kernel 3.2, building on group scheduling to cap CPU usage via the cpu.cfs_quota_us and cpu.cfs_period_us parameters.[30] The cpu.cfs_quota_us value (in microseconds, default -1 for unlimited, range 1000 to 1000000) sets the maximum runtime a group can consume, while cpu.cfs_period_us (default 100000, range 1000 to 1000000) defines the replenishment interval; together, they enforce a fraction of CPU, such as quota=25000 and period=100000 for 25% utilization.[3] When the quota is exhausted, tasks in the group are throttled until the next period, with runtime distributed in slices to per-CPU runqueues for efficient enforcement.[3]

CFS supports hierarchical cgroups, where child groups inherit and aggregate shares from parents, ensuring bandwidth constraints propagate upward; a child is throttled if its parent's quota is depleted.[3] This mechanism prevents overconsumption in nested structures without compromising intra-group task fairness.[29] These features are essential for container orchestration (e.g., Docker via libcontainer) and virtualization platforms (e.g., KVM), where they isolate workloads to specific CPU fractions, avoiding interference while upholding CFS's per-task equity.[31]
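A minimal userspace sketch of setting these knobs is shown below, assuming a cgroup v1 cpu controller mounted at /sys/fs/cgroup/cpu and an already-created group directory named "demo" (both assumptions); error handling is kept to a minimum.

```c
#include <stdio.h>

/* Minimal userspace sketch of setting the cgroup knobs described above,
 * assuming a cgroup v1 cpu controller mounted at /sys/fs/cgroup/cpu and an
 * already-created group directory named "demo" (both assumptions). Error
 * handling is kept to a minimum and writes require sufficient privileges. */

static int write_value(const char *path, long value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%ld\n", value);
    return fclose(f);
}

int main(void)
{
    const char *base = "/sys/fs/cgroup/cpu/demo";
    char path[256];

    /* Relative weight: twice the default 1024 shares. */
    snprintf(path, sizeof(path), "%s/cpu.shares", base);
    write_value(path, 2048);

    /* Hard cap: 25 ms of runtime every 100 ms period, i.e. 25% of one CPU. */
    snprintf(path, sizeof(path), "%s/cpu.cfs_period_us", base);
    write_value(path, 100000);
    snprintf(path, sizeof(path), "%s/cpu.cfs_quota_us", base);
    write_value(path, 25000);

    return 0;
}
```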
Tuning Parameters and Configuration

The Completely Fair Scheduler (CFS) in the Linux kernel provides several sysctl tunables, accessible under /proc/sys/kernel/, to optimize scheduling behavior for different workloads, such as interactive desktops or high-throughput servers. These parameters influence time slice allocation and preemption thresholds, allowing administrators to balance latency and fairness without recompiling the kernel. Changes can be applied temporarily using sysctl commands or persistently via /etc/sysctl.conf or kernel boot parameters. The key tuning parameters are listed in the table below, and a short usage sketch follows it:

| Parameter | Description | Default Value (ns) | Location |
|---|---|---|---|
| sched_latency_ns | Defines the target latency for each scheduler period, during which all runnable tasks should ideally execute once; higher values favor throughput by allowing longer slices for CPU-bound tasks. | 20,000,000 | /proc/sys/kernel/sched_latency_ns |
| sched_min_granularity_ns | Sets the minimum time slice for any task, ensuring short-running tasks are not overly fragmented; used when the number of tasks exceeds the latency target. | 4,000,000 | /proc/sys/kernel/sched_min_granularity_ns |
| sched_wakeup_granularity_ns | Controls the threshold for preempting a running task upon wakeup of a higher-priority one, preventing excessive context switches for short wakeups. | 2,000,000 | /proc/sys/kernel/sched_wakeup_granularity_ns |
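A short sketch of reading and adjusting one of these tunables from a program follows. The /proc/sys/kernel path matches the table, but on more recent kernels the CFS debug knobs were relocated (for example under /sys/kernel/debug/sched/), so the exact path should be verified on the target system; writes require root.

```c
#include <stdio.h>

/* Sketch of inspecting and adjusting one of the tunables listed above. The
 * /proc/sys/kernel path matches the table; on more recent kernels these CFS
 * knobs were relocated (e.g. under /sys/kernel/debug/sched/), so the exact
 * location should be verified on the target system. Writing requires root,
 * and the same effect can be achieved with the sysctl command. */

int main(void)
{
    const char *path = "/proc/sys/kernel/sched_min_granularity_ns";
    long value = 0;
    FILE *f;

    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &value) == 1)
            printf("current minimum granularity: %ld ns\n", value);
        fclose(f);
    }

    /* Example adjustment: raise the floor to 10 ms to favor throughput over
     * latency (equivalent to sysctl kernel.sched_min_granularity_ns=10000000). */
    f = fopen(path, "w");
    if (f) {
        fprintf(f, "%ld\n", 10000000L);
        fclose(f);
    }
    return 0;
}
```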