
Knowledge distillation

Knowledge distillation is a model compression technique in machine learning in which a compact "student" model is trained to replicate the behavior and generalizations of a larger, more complex "teacher" model, thereby transferring knowledge to enable efficient deployment while preserving much of the original performance. Introduced in 2015 by Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean, the method originally focused on distilling the collective predictions from an ensemble of neural networks into a single smaller network, using softened probability distributions (soft targets) derived from the teacher's outputs as training signals rather than hard labels. This approach demonstrated significant improvements, such as enhancing acoustic models in commercial speech recognition systems and achieving competitive results on benchmarks like MNIST. Since its inception, knowledge distillation has evolved into a versatile paradigm applicable across domains including computer vision, natural language processing, and speech recognition, addressing the challenges of deploying models on edge devices with limited computational resources. Key variants include offline distillation, where a pre-trained teacher guides the student; online distillation, involving simultaneous training of teacher and student; and self-distillation, where a model distills knowledge to itself for refinement. The technique transfers not only output predictions but also intermediate feature representations, attention maps, and relational structures between data points, leading to benefits such as reduced model size, faster inference, and sometimes even improved generalization in the student model. Applications span mobile AI and large language models, where distillation compresses billion-parameter models into efficient versions without substantial accuracy loss.

Overview

Definition and Framework

Knowledge distillation is a machine learning paradigm that enables the transfer of knowledge from a larger, more complex model to a smaller, more efficient one, allowing the latter to approximate the former's predictive capabilities. In this framework, the process revolves around a teacher-student architecture, where the teacher serves as a source of learned representations and the student as the recipient designed for practical deployment. The teacher model typically consists of a large, pre-trained neural network, often an ensemble of models or a highly regularized single network, which generates detailed predictions on input data. These predictions capture not only correct classifications but also nuanced relationships between classes, derived from extensive training on vast datasets. In contrast, the student model is a compact, simpler network (such as a shallower or narrower architecture) trained specifically to replicate the teacher's behavior across a range of inputs, thereby inheriting its generalization patterns while reducing computational demands. Knowledge transfer in this setup occurs through two primary modes: explicit and implicit. Explicit transfer involves directly mimicking the teacher's output predictions, such as softened probability distributions over classes that provide richer supervisory signals than hard labels alone. Implicit transfer, on the other hand, focuses on aligning intermediate representations, such as feature maps from hidden layers, which encode deeper structural insights about the data, such as spatial hierarchies or relational patterns. These modes allow the student to absorb both surface-level decisions and underlying model intuition from the teacher. The general workflow begins with training the teacher model on the original labeled dataset to develop its predictive expertise. Subsequently, the student is trained to mimic the teacher's outputs using a transfer set that may combine labeled examples from the original data with additional unlabeled samples, enabling broader generalization without requiring new annotations. This approach enhances model efficiency for resource-constrained environments, such as mobile devices, by yielding a deployable model with performance close to the teacher's.
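The two transfer modes can be illustrated with a short PyTorch sketch. The function below is illustrative rather than drawn from a specific paper; its arguments (teacher and student logits and intermediate features) are assumed inputs, and the mean-squared-error feature alignment is just one simple choice for implicit transfer.

```python
import torch
import torch.nn.functional as F

def transfer_losses(teacher_logits, student_logits,
                    teacher_feats, student_feats, tau=4.0):
    """Illustrative losses for the two transfer modes described above."""
    # Explicit transfer: KL divergence between temperature-softened
    # output distributions of teacher and student.
    soft_teacher = F.softmax(teacher_logits / tau, dim=-1)
    log_soft_student = F.log_softmax(student_logits / tau, dim=-1)
    explicit = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")

    # Implicit transfer: align hidden representations (assumes matching
    # shapes; in practice a learned projection often bridges mismatches).
    implicit = F.mse_loss(student_feats, teacher_feats)

    return explicit, implicit
```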

Motivation and Benefits

Knowledge distillation is primarily motivated by the need to deploy powerful models in resource-limited settings, such as mobile devices, edge applications, and embedded systems, where large models are impractical due to their high computational demands and memory footprints. By transferring knowledge from a cumbersome teacher model to a compact student model, distillation enables significant reductions in model size and inference time while aiming to preserve much of the original accuracy. For instance, in speech recognition tasks, distillation can compress an ensemble of models with millions of parameters into a single efficient model suitable for deployment, achieving comparable accuracy with reduced overhead. A key benefit lies in the transfer of "dark knowledge," where the teacher's softened probability distributions reveal nuanced relationships between classes, such as relative confidences in incorrect predictions, that hard labels alone cannot capture, leading to improved generalization in the student model. This results in better performance on test sets compared to training solely on hard labels; for example, on MNIST, a distilled small network achieves 74 test errors versus 146 for one trained on hard targets. Additionally, distillation yields cost savings by lowering both inference and storage expenses through smaller models that require fewer resources, facilitating deployment in constrained environments like sensors or battery-powered devices. In practice, distillation has enabled efficient applications such as distilling large language models for chatbots, where models like Vicuna retain conversational abilities with substantial size reductions, or compressing vision models for edge devices, supporting real-time inference with 2-10x speedups. However, trade-offs include a potential accuracy drop of 1-5% relative to the teacher, as seen in semantic segmentation tasks, and the prerequisite of access to a pre-trained teacher model.

Mathematical Foundations

Core Formulation

Knowledge distillation establishes a foundational setup where a pre-trained teacher model, with parameters denoted as T, processes an input x to produce logits z_t = f_T(x), and a student model, with parameters S, generates corresponding logits z_s = f_S(x). Here, f_T and f_S represent the forward-pass functions of the respective models, and the logits z_t and z_s serve as the raw, unnormalized outputs that encode the models' predictions for classification tasks. The core objective is to train the student model by minimizing a divergence measure, such as the Kullback-Leibler divergence, between the softened probability distributions derived from its logits z_s and the teacher's logits z_t, thereby transferring the teacher's learned representations to the more compact student architecture. This minimization occurs over a labeled dataset D = \{(x_i, y_i)\}_{i=1}^N, where the ground-truth labels y_i provide additional hard supervision alongside the soft targets from the teacher, enabling the student to leverage both sources for improved generalization. To compare the outputs for distillation, softened probability distributions are derived from the logits using a temperature-scaled softmax operation: p_t^\tau = \mathrm{softmax}(z_t / \tau) for the teacher and p_s^\tau = \mathrm{softmax}(z_s / \tau) for the student (with \tau > 1), where p_t^\tau and p_s^\tau represent the softened class probabilities that capture the teacher's relative confidences across classes and its learned uncertainties. This notation emphasizes the focus on matching the distributional outputs rather than solely the final predictions, distinguishing distillation from standard supervised training.
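As a small worked illustration of the temperature-scaled softmax, the following snippet (using assumed example logits in PyTorch) shows how increasing \tau flattens p^\tau while preserving the ranking of classes.

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits z_t for a single 3-class example.
z_t = torch.tensor([5.0, 2.0, 0.5])

# Softened probabilities p^tau = softmax(z_t / tau) at several temperatures.
for tau in (1.0, 2.0, 4.0):
    p_t = F.softmax(z_t / tau, dim=-1)
    print(f"tau={tau}: {[round(p, 3) for p in p_t.tolist()]}")

# Higher tau flattens the distribution, exposing the teacher's relative
# confidences in the non-maximal classes (its "dark knowledge").
```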

Loss Functions and Temperature Scaling

In knowledge distillation, the optimization objective for training the student model typically combines two loss terms: a standard cross-entropy loss on the true labels, known as the hard loss L_{\text{hard}}, and a distillation loss that encourages the student to match the softened probability distribution of the teacher, denoted as L_{\text{soft}}. The overall distillation loss is formulated as a weighted combination: L_{\text{dist}} = (1 - \alpha) L_{\text{hard}} + \alpha L_{\text{soft}}, where \alpha \in [0, 1] is a hyperparameter controlling the trade-off between the two objectives, often set between 0.1 and 0.9 to balance label supervision with teacher imitation. The hard loss L_{\text{hard}} is the conventional cross-entropy between the student's predicted probabilities (computed with \tau = 1) and the one-hot encoded ground-truth labels, providing direct supervision from the data. The soft loss L_{\text{soft}} is commonly defined using the Kullback-Leibler (KL) divergence between the teacher's softened output probabilities p_t^\tau and the student's softened outputs p_s^\tau: L_{\text{soft}} = D_{\text{KL}}(p_t^\tau \parallel p_s^\tau) = \sum_i p_{t,i}^\tau \log \frac{p_{t,i}^\tau}{p_{s,i}^\tau}. This measures how closely the student's softened distribution aligns with the teacher's, capturing richer relational information beyond hard labels. Equivalently, L_{\text{soft}} can be expressed as the cross-entropy between the teacher's soft targets and the student's predictions, since the entropy of the teacher's distribution is constant with respect to the student parameters. Temperature scaling plays a crucial role in softening the probability distributions to reveal the teacher's "dark knowledge," such as inter-class relationships encoded in the logits z_t and z_s. The softened probabilities are computed by scaling the logits in the softmax function with a temperature parameter \tau > 1: p_i^\tau = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}. A higher \tau (typically ranging from 2 to 10) flattens the distribution, making it less peaked and more informative for distillation, as it exposes probabilities for incorrect classes that reflect the teacher's learned uncertainties. During inference, \tau is reset to 1 to recover sharp predictions. Because softening reduces the gradients of the soft loss by a factor of approximately 1/\tau^2, the KL divergence is often scaled by \tau^2 to keep its magnitude comparable to the hard loss: L_{\text{soft}} = \tau^2 D_{\text{KL}}(p_t^\tau \parallel p_s^\tau). This adjustment maintains the relative contribution of the soft loss, preventing under-emphasis on the teacher's signals. The same \tau is applied to both teacher and student logits during training for consistency.
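A minimal sketch of the combined objective, assuming PyTorch and the standard formulation above (with the \tau^2 scaling applied to the soft term), might look as follows; the hyperparameter defaults are illustrative rather than prescribed values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, tau=4.0):
    """L_dist = (1 - alpha) * L_hard + alpha * tau^2 * KL(p_t^tau || p_s^tau)."""
    # Hard loss: standard cross-entropy against ground-truth labels (tau = 1).
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft loss: KL divergence between temperature-softened distributions,
    # scaled by tau^2 to compensate for the 1/tau^2 gradient shrinkage.
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    soft_loss = F.kl_div(log_p_s, p_t, reduction="batchmean") * (tau ** 2)

    return (1.0 - alpha) * hard_loss + alpha * soft_loss
```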

Techniques and Variants

Classical Offline Distillation

Classical offline distillation, also known as the original teacher-student framework, involves transferring knowledge from a pre-trained, larger teacher model to a smaller student model in a two-phase process. First, the teacher model, a complex network or an ensemble, is trained on the full dataset to capture rich representations. Once trained, the teacher is frozen, and its outputs, particularly softened probability distributions generated using a high temperature in the softmax, serve as supervisory signals for the student. This approach leverages the teacher's "dark knowledge," which includes inter-class relationships not evident in one-hot hard labels, to guide the student toward more informed predictions. The key steps in classical offline distillation begin with pre-training the teacher on the complete dataset, often employing regularization techniques like dropout to enhance generalization. Next, the frozen teacher generates predictions on a transfer set, typically the same training data, producing soft targets that emphasize relative probabilities across classes. The student model is then optimized using a combined objective that balances the distillation loss, which measures the divergence between teacher and student outputs, with a standard cross-entropy loss on ground-truth hard labels; notably, the teacher remains unchanged throughout this training phase, ensuring a one-way knowledge transfer. This process results in a compact model that approximates the teacher's performance while being more efficient for deployment. Despite its effectiveness, classical offline distillation has notable limitations. It demands significant upfront computational resources to train a large-capacity teacher model, which can be prohibitive in resource-constrained environments. The static nature of the frozen teacher, which relies on fixed outputs, limits adaptability, making it less suitable for tasks where data distributions evolve over time or require dynamic updates. Additionally, the success of distillation hinges on the teacher's quality and the student's capacity to absorb the transferred knowledge, potentially leading to suboptimal results if there is a substantial architectural gap between the models. A prominent example of classical offline distillation is its application to automatic speech recognition, as demonstrated by Hinton et al. In their work on an acoustic model for Android voice search, a single distilled model, derived from a 10-model ensemble, achieved comparable performance with 60.8% frame accuracy and 10.7% word error rate (WER), enabling substantial model size reduction while maintaining efficacy on roughly 2,000 hours of training data. This illustrates the method's utility in compressing acoustic models for mobile deployment without significant accuracy loss.
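The two-phase workflow can be sketched as follows, assuming PyTorch and hypothetical `teacher`, `student`, and `loader` objects standing in for a pre-trained teacher, a smaller student, and a labeled transfer set; this is an illustrative outline rather than the exact procedure of any particular system.

```python
import torch
import torch.nn.functional as F

def train_student_offline(teacher, student, loader, epochs=10,
                          alpha=0.5, tau=4.0, lr=1e-3):
    """Phase two of classical offline distillation: the pre-trained teacher
    is frozen and the student learns from its soft targets plus hard labels."""
    teacher.eval()                      # teacher stays fixed (one-way transfer)
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)

    for _ in range(epochs):
        for inputs, labels in loader:
            with torch.no_grad():       # no gradients flow into the teacher
                teacher_logits = teacher(inputs)
            student_logits = student(inputs)

            hard = F.cross_entropy(student_logits, labels)
            soft = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                            F.softmax(teacher_logits / tau, dim=-1),
                            reduction="batchmean") * (tau ** 2)

            loss = (1 - alpha) * hard + alpha * soft
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```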

Online and Self-Distillation Methods

Online knowledge distillation extends the classical paradigm by training multiple models simultaneously in a collaborative manner, where each model acts as both a teacher and a student to its peers, eliminating the need for a pre-trained, static teacher. This approach, often termed mutual or peer distillation, enables dynamic knowledge exchange during training through the use of softened probability distributions derived from logits. A seminal method in this category is Deep Mutual Learning (DML), proposed by Zhang et al. in 2018, which trains several networks of identical or varying architectures in parallel. In DML, each network minimizes a combined loss of cross-entropy with ground-truth labels and Kullback-Leibler divergence with respect to the softened outputs of the other students, fostering mutual improvement without hierarchical dependencies. This online setup offers advantages such as fewer training phases compared to offline methods and enhanced generalization through diverse peer interactions, particularly beneficial in resource-constrained environments. For instance, on the CIFAR-100 dataset using ResNet-32 architectures, DML achieves a top-1 accuracy of 71.19%, a 2.2% improvement over the baseline single-student training accuracy of 68.99%. Similar gains, around 1-2% in accuracy, have been observed in other vision tasks, underscoring its efficacy for collaborative training scenarios. However, online distillation introduces challenges like increased computational overhead due to parallel model updates and the need for careful coordination among peers to avoid instability. Self-distillation further evolves the technique by enabling a single model to refine itself without any external teacher, leveraging its own predictions or internal components for supervision. In this framework, the model iteratively distills knowledge from previous iterations or from deeper layers to shallower ones, promoting iterative refinement and improved generalization. A foundational example is Born-Again Neural Networks (BANs), introduced by Furlanello et al. in 2018, where a converged model serves as the initial teacher, and subsequent "generations" of the same architecture are trained to mimic the prior version's softened outputs alongside ground-truth labels. This process repeats for multiple iterations, with each student potentially surpassing its predecessor due to accumulated dark knowledge. Another prominent self-distillation variant is the "Be Your Own Teacher" method by Zhang et al. in 2019, which integrates multiple classifiers at different depths within a single network, distilling knowledge from the deepest classifier to shallower ones via Kullback-Leibler divergence on outputs and L2 regularization on intermediate features. This intra-network transfer enhances performance without additional models and allows for early-exit inference, where shallower classifiers provide faster predictions for simpler inputs. Self-distillation's key benefits include simplified deployment, since no separate teacher training is required, and iterative self-improvement, leading to 1-2% accuracy boosts in tasks like image classification, though it demands careful hyperparameter tuning to avoid degradation across iterations and adds training complexity.
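A minimal two-peer sketch of deep mutual learning, assuming PyTorch and hypothetical networks `net_a` and `net_b` with their own optimizers, is shown below; detaching each peer's logits treats it as a fixed teacher for the other's update, approximating the alternating optimization described in the original method.

```python
import torch
import torch.nn.functional as F

def mutual_learning_step(net_a, net_b, opt_a, opt_b, inputs, labels, tau=1.0):
    """One deep-mutual-learning update: each peer is trained on cross-entropy
    with the labels plus KL divergence toward the other peer's predictions."""
    logits_a, logits_b = net_a(inputs), net_b(inputs)

    def peer_loss(own_logits, peer_logits):
        ce = F.cross_entropy(own_logits, labels)
        kl = F.kl_div(F.log_softmax(own_logits / tau, dim=-1),
                      F.softmax(peer_logits.detach() / tau, dim=-1),
                      reduction="batchmean")
        return ce + kl

    loss_a = peer_loss(logits_a, logits_b)
    loss_b = peer_loss(logits_b, logits_a)

    opt_a.zero_grad()
    loss_a.backward()
    opt_a.step()

    opt_b.zero_grad()
    loss_b.backward()
    opt_b.step()
    return loss_a.item(), loss_b.item()
```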

Relations to Model Optimization

Connection to Compression Techniques

Knowledge distillation serves as a key knowledge-transfer method within the broader family of model compression techniques, enabling the reduction of model parameters through behavioral imitation of a larger teacher model by a compact student model. Unlike direct architectural downsizing, which may lead to significant performance degradation, distillation facilitates compression by encoding the teacher's learned representations, such as softened probability distributions, into the student, thereby maintaining representational capacity in a smaller network. This approach is particularly valuable in scenarios requiring deployment on resource-constrained devices, where parameter count and inference speed are critical. Distillation exhibits strong synergies with other compression paradigms, such as quantization and low-rank factorization, allowing for compounded efficiency gains. For instance, post-distillation quantization applies bit-width reduction (e.g., to 8-bit or lower precision) to the already compressed student model, further minimizing memory usage and computational overhead while leveraging the teacher's guidance to mitigate quantization-induced accuracy drops; studies have demonstrated this combination yielding low-precision networks with minimal performance loss compared to quantized teachers alone. Similarly, integrating distillation with low-rank adaptation methods, such as low-rank factorization of weight matrices or adapters like LoRA, exploits parameter redundancy for additional size cuts; for example, progressive low-rank decomposition during fine-tuning can compress Transformer-based models while distilling knowledge to preserve task-specific efficacy. These hybrid strategies often form multi-stage pipelines, where distillation initializes the student before quantization or low-rank techniques are applied. Representative metrics underscore distillation's role in achieving substantial compression ratios alongside FLOP reductions. For example, distilling from ResNet-56 (0.85 million parameters) to ResNet-20 (0.27 million parameters) on CIFAR-10 yields approximately a 68% parameter reduction, with the student attaining 93.15% accuracy, nearly matching the teacher's 93.61%, while also lowering FLOPs for faster inference. In more aggressive cases, such as large language models, distillation combined with other methods can enable up to 10x parameter reduction (e.g., 90% fewer parameters) while retaining 95-98% of the teacher's performance, outperforming standalone downsizing by avoiding the pitfalls of unguided compression. This preservation stems from the teacher's provision of "dark knowledge," including inter-class relationships, which direct compression techniques like naive resizing cannot replicate.
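A simple two-stage pipeline of this kind can be sketched in PyTorch as shown below; the toy student network is a stand-in for a model produced by distillation, and dynamic 8-bit quantization of linear layers is just one of several post-distillation options.

```python
import torch
import torch.nn as nn

# A toy distilled student (stand-in for a model produced by the offline
# distillation loop sketched earlier).
student = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Post-distillation dynamic quantization: weights of the linear layers are
# stored as 8-bit integers and dequantized on the fly during inference,
# shrinking memory use on top of the parameter reduction from distillation.
quantized_student = torch.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8
)

# The quantized student is a drop-in replacement at inference time.
with torch.no_grad():
    logits = quantized_student(torch.randn(1, 784))
```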

Differences from Pruning Algorithms

Pruning algorithms compress neural networks by systematically removing weights or neurons that contribute minimally to the model's performance, often guided by saliency criteria such as weight magnitude or second-order approximations derived from the Hessian. A foundational approach, Optimal Brain Damage, introduced in 1989, employs a second-order Taylor expansion to estimate the expected increase in training error upon weight removal, enabling an iterative process of pruning followed by retraining to maintain accuracy. Magnitude-based methods, a common variant, target weights with the smallest absolute values, assuming they have negligible impact on outputs. Unlike knowledge distillation, which holistically transfers representational knowledge from a larger teacher model to a smaller student by aligning their softened probability distributions via a distillation loss, pruning focuses on local, parameter-level eliminations within a single model. Pruning modifies an existing pre-trained network by inducing sparsity, either unstructured (individual weights) or structured (entire filters or channels), without requiring a separate teacher model, whereas distillation trains a new student model from scratch or with guidance to mimic the teacher's behavior. This distinction means pruning preserves the original architecture's topology more closely, while distillation allows for greater flexibility in designing compact student architectures. Knowledge distillation is preferable when transitioning to a fundamentally different or smaller architecture, such as compressing a large network into a more efficient variant while retaining learned representations, as it leverages the teacher's global knowledge to guide the student. In contrast, pruning suits scenarios where the goal is to accelerate an existing model's inference within the same architecture, such as by sparsifying weights to reduce computational load on resource-constrained devices. Both techniques can synergize with broader compression strategies, but pruning's emphasis on sparsity makes it ideal for hardware-aware optimizations like exploiting sparse matrix operations. For instance, magnitude-based pruning can reduce parameters by up to 9x with minimal initial accuracy loss on networks like AlexNet, but aggressive pruning often incurs accuracy drops that are then recovered through retraining or integration with knowledge distillation to restore performance.
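For contrast with the distillation sketches above, the following PyTorch snippet illustrates unstructured magnitude pruning on a toy single network using `torch.nn.utils.prune`; the architecture and 50% sparsity level are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy single-model network; pruning operates in place on its own weights,
# with no teacher involved.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Unstructured magnitude pruning: zero out the 50% of weights with the
# smallest absolute values in each linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

# Report the resulting sparsity of the first layer's weight tensor.
w = model[0].weight
print(f"sparsity: {float((w == 0).sum()) / w.numel():.2%}")
```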

Historical Development

Origins and Early Concepts

The roots of knowledge distillation trace back to early efforts in neural network optimization during the late 1980s, driven by hardware constraints that limited the deployment of large models. In their seminal work, LeCun et al. introduced "Optimal Brain Damage" (OBD), a technique that systematically removes weights from trained networks to reduce computational demands while preserving performance. OBD employs second-order approximations, specifically the diagonal of the Hessian, to estimate the "saliency" of each weight, measuring the expected increase in training error upon removal, and achieves significant compression, such as up to 60% parameter reduction in digit recognition tasks with negligible accuracy loss. This approach marked an initial shift toward efficient model architectures, laying groundwork for later compression ideas by emphasizing structured simplification over brute-force reduction. By the 1990s, ensemble methods emerged as key inspirations for implicit knowledge sharing among models, addressing limitations in single-learner generalization. Hansen and Salamon demonstrated that combining multiple neural networks, trained independently on the same data, reduces residual error through consensus mechanisms like majority voting, as diverse local minima in the optimization landscape lead to complementary error patterns. Similarly, boosting algorithms, such as AdaBoost introduced by Freund and Schapire, sequentially train weak learners and aggregate their predictions with weighted voting, effectively sharing knowledge to boost overall accuracy beyond individual components. These techniques highlighted the value of collaborative prediction in ensembles, where collective outputs capture robust decision boundaries that no single model achieves alone, influencing subsequent ideas on transferable model insights. A pivotal precursor to formal distillation appeared in 2006 with Buciluǎ et al.'s work on model compression, which explicitly trained compact models to mimic the predictions of large ensembles, treating ensemble outputs as rich supervisory signals. By generating pseudo-labeled data from ensemble probabilities, via methods like MUNGE for data augmentation, they compressed ensembles of hundreds of classifiers into neural networks 1,000 times smaller and faster, often retaining 97% of the original performance on benchmarks like UCI datasets. This represented a conceptual evolution from hardware-focused pruning to viewing model predictions as vessels of transferable knowledge, where probabilistic outputs encode nuanced information beyond hard labels, enabling smaller models to approximate complex ensemble functions effectively.

Key Milestones and Recent Advances

The seminal work introducing knowledge distillation was published in 2015 by Geoffrey Hinton and colleagues, who proposed a teacher-student framework where a compact "student" model learns from the softened probability outputs (soft targets) of a larger "teacher" model, using temperature scaling to reveal inter-class relationships and improve generalization. This approach demonstrated clear improvements in student performance, such as reducing the test error rate on MNIST from 1.46% to 0.74% for a smaller network compared to training on hard labels alone, using a teacher with 1,200 hidden units per layer, and laid the foundation for model compression in neural networks. Between 2016 and 2020, advancements focused on transferring richer representations beyond logits, such as attention maps and relational structures. Attention transfer, proposed in 2016, aligned spatial attention maps derived from feature activations, improving convolutional neural networks on image classification tasks. Relational knowledge distillation, introduced in 2019, shifted emphasis to preserving pairwise relationships between samples in the feature space, allowing student models to approach or surpass teacher accuracy on datasets like CIFAR-100 by capturing structural dependencies. Knowledge distillation subsequently gained prominence in transformer architectures, particularly for natural language processing and vision. DistilBERT (2019, extended in applications through 2021) compressed BERT by 40% while retaining 97% of its performance on GLUE benchmarks, distilling output distributions and hidden states. Similarly, TinyBERT (2019, with follow-ups in 2021) applied general and task-specific distillation to BERT variants, retaining roughly 96% of the teacher's performance on downstream tasks while being substantially smaller and faster. In vision transformers, self-distillation techniques like DINO (2021) interpreted self-supervised learning as a form of distillation without labels, yielding emergent properties such as segmentation masks from self-attention maps in ViTs trained on ImageNet. Data-efficient image transformers (DeiT, 2021) further integrated distillation via a dedicated distillation token and strong augmentation, training ViTs from scratch on ImageNet-1k to match convolutional baselines without large-scale external pre-training data. Recent developments from 2024 to 2025 have emphasized privacy-preserving and generative applications. Federated knowledge distillation has advanced to enable collaborative training across decentralized devices without sharing raw data, as in privacy-preserving frameworks for heterogeneous settings. Integration with diffusion models, such as DiffKD (2023) for denoising features during distillation and methods that distill diffusion models themselves (2024), has aimed to accelerate sampling in generative tasks by reducing the number of denoising steps while preserving quality. Surveys highlight knowledge distillation's role in mobile deployments, driven by its efficiency in compressing large models, and in 2025 it has been increasingly applied to large language models, compressing billion-parameter variants for efficient inference on resource-limited devices. These milestones have shifted knowledge distillation toward scalable, sustainable AI, enabling the deployment of large-scale models on resource-constrained devices and fostering innovations in privacy-preserving training and generative modeling that support broader adoption.
