
Knowledge distillation

Knowledge distillation is a model compression technique in machine learning in which a compact "student" model is trained to replicate the behavior and generalizations of a larger, more complex "teacher" model, thereby transferring knowledge to enable efficient deployment while preserving much of the original performance. Introduced in 2015 by Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean, the method originally focused on distilling the collective predictions from an ensemble of neural networks into a single smaller network, using softened probability distributions (soft targets) derived from the teacher's outputs as training signals rather than hard labels. This approach demonstrated significant improvements, such as enhancing acoustic models in commercial speech recognition systems and achieving competitive results on benchmarks like MNIST. Since its inception, knowledge distillation has evolved into a versatile paradigm applicable across domains including computer vision, natural language processing, and speech recognition, addressing the challenges of deploying models on edge devices with limited computational resources. Key variants include offline distillation, where a pre-trained teacher guides the student; online distillation, involving simultaneous training of teacher and student; and self-distillation, where a model distills knowledge to itself for refinement. The technique transfers not only output predictions but also intermediate feature representations, attention maps, and relational structures between data points, leading to benefits such as reduced model size, faster inference, and sometimes even improved generalization in the student model. Applications span mobile AI and large language models, where distillation compresses billion-parameter models into efficient versions without substantial accuracy loss.

Overview

Definition and Framework

Knowledge distillation is a machine learning paradigm that enables the transfer of knowledge from a larger, more complex model to a smaller, more efficient one, allowing the latter to approximate the former's predictive capabilities. In this framework, the process revolves around a teacher-student architecture, where the teacher serves as a source of learned representations and the student as the recipient designed for practical deployment. The teacher model typically consists of a large, pre-trained neural network, often an ensemble of models or a highly regularized single network, which generates detailed predictions on input data. These predictions capture not only correct classifications but also nuanced relationships between classes, derived from extensive training on vast datasets. In contrast, the student model is a compact, simpler network (such as a shallower or narrower architecture) trained specifically to replicate the teacher's behavior across a range of inputs, thereby inheriting its generalization patterns while reducing computational demands. Knowledge transfer in this setup occurs through two primary modes: explicit and implicit. Explicit transfer involves directly mimicking the teacher's output predictions, such as softened probability distributions over classes that provide richer supervisory signals than hard labels alone. Implicit transfer, on the other hand, focuses on aligning intermediate representations, such as feature maps from hidden layers, which encode deeper structural insights about the data, such as spatial hierarchies or relational patterns. These modes allow the student to absorb both surface-level decisions and underlying model intuition from the teacher. The general workflow begins with training the teacher model on the original labeled dataset to develop its predictive expertise. Subsequently, the student is trained to mimic the teacher's outputs using a transfer set that may combine labeled examples from the original data with additional unlabeled samples, enabling broader generalization without requiring new annotations. This approach enhances model efficiency for resource-constrained environments, such as mobile devices, by yielding a deployable model with performance close to the teacher's.
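The two transfer modes can be illustrated with a short PyTorch sketch. The function below is illustrative rather than drawn from a specific paper; its arguments (teacher and student logits and intermediate features) are assumed inputs, and the mean-squared-error feature alignment is just one simple choice for implicit transfer.

```python
import torch
import torch.nn.functional as F

def transfer_losses(teacher_logits, student_logits,
                    teacher_feats, student_feats, tau=4.0):
    """Illustrative losses for the two transfer modes described above."""
    # Explicit transfer: KL divergence between temperature-softened
    # output distributions of teacher and student.
    soft_teacher = F.softmax(teacher_logits / tau, dim=-1)
    log_soft_student = F.log_softmax(student_logits / tau, dim=-1)
    explicit = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")

    # Implicit transfer: align hidden representations (assumes matching
    # shapes; in practice a learned projection often bridges mismatches).
    implicit = F.mse_loss(student_feats, teacher_feats)

    return explicit, implicit
```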

Motivation and Benefits

Knowledge distillation is primarily motivated by the need to deploy powerful models in resource-limited settings, such as mobile devices, edge applications, and embedded systems, where large models are impractical due to their high computational demands and memory footprints. By transferring knowledge from a cumbersome teacher model to a compact student model, distillation enables significant reductions in model size and inference time while aiming to preserve much of the original accuracy. For instance, in speech recognition tasks, distillation can compress an ensemble of models with millions of parameters into a single efficient model suitable for deployment, achieving comparable accuracy with reduced overhead. A key benefit lies in the transfer of "dark knowledge," where the teacher's softened probability distributions reveal nuanced relationships between classes, such as relative confidences in incorrect predictions, that hard labels alone cannot capture, leading to improved generalization in the student model. This results in better performance on test sets compared to training solely on hard labels; for example, on MNIST, a distilled small network achieves 74 test errors versus 146 for one trained on hard targets. Additionally, distillation yields cost savings by lowering both inference and storage expenses through smaller models that require fewer resources, facilitating deployment in constrained environments like sensors or battery-powered devices. In practice, distillation has enabled efficient applications such as distilling large language models for chatbots, where models like Vicuna retain conversational abilities with substantial size reductions, or compressing vision models for edge devices, supporting real-time inference with 2-10x speedups. However, trade-offs include a potential accuracy drop of 1-5% relative to the teacher, as seen in semantic segmentation tasks, and the prerequisite of access to a pre-trained teacher model.

Mathematical Foundations

Core Formulation

Knowledge distillation establishes a foundational setup where a pre-trained teacher model, with parameters denoted as T, processes an input x to produce logits z_t = f_T(x), and a student model, with parameters S, generates corresponding logits z_s = f_S(x). Here, f_T and f_S represent the forward-pass functions of the respective models, and the logits z_t and z_s serve as the raw, unnormalized outputs that encode the models' predictions for classification tasks. The core objective is to train the student model by minimizing a divergence measure, such as the Kullback-Leibler divergence, between the softened probability distributions derived from its logits z_s and the teacher's logits z_t, thereby transferring the teacher's learned representations to the more compact student architecture. This minimization occurs over a labeled dataset D = \{(x_i, y_i)\}_{i=1}^N, where the ground-truth labels y_i provide additional hard supervision alongside the soft targets from the teacher, enabling the student to leverage both sources for improved generalization. To compare the outputs for distillation, softened probability distributions are derived from the logits using a temperature-scaled softmax operation: p_t^\tau = \mathrm{softmax}(z_t / \tau) for the teacher and p_s^\tau = \mathrm{softmax}(z_s / \tau) for the student (with \tau > 1), where p_t^\tau and p_s^\tau represent the softened class probabilities that capture the teacher's relative confidences across classes and its learned uncertainties. This notation emphasizes the focus on matching the distributional outputs rather than solely the final predictions, distinguishing distillation from standard supervised training.
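As a small worked illustration of the temperature-scaled softmax, the following snippet (using assumed example logits in PyTorch) shows how increasing \tau flattens p^\tau while preserving the ranking of classes.

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits z_t for a single 3-class example.
z_t = torch.tensor([5.0, 2.0, 0.5])

# Softened probabilities p^tau = softmax(z_t / tau) at several temperatures.
for tau in (1.0, 2.0, 4.0):
    p_t = F.softmax(z_t / tau, dim=-1)
    print(f"tau={tau}: {[round(p, 3) for p in p_t.tolist()]}")

# Higher tau flattens the distribution, exposing the teacher's relative
# confidences in the non-maximal classes (its "dark knowledge").
```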

Loss Functions and Temperature Scaling

In knowledge distillation, the optimization objective for training the student model typically combines two loss terms: a standard cross-entropy loss on the true labels, known as the hard loss L_{\text{hard}}, and a distillation loss that encourages the student to match the softened probability distribution of the teacher, denoted as L_{\text{soft}}. The overall distillation loss is formulated as a weighted combination: L_{\text{dist}} = (1 - \alpha) L_{\text{hard}} + \alpha L_{\text{soft}}, where \alpha \in [0, 1] is a hyperparameter controlling the trade-off between the two objectives, often set between 0.1 and 0.9 to balance label supervision with teacher imitation. The hard loss L_{\text{hard}} is the conventional cross-entropy between the student's predicted probabilities (computed with \tau = 1) and the one-hot encoded ground-truth labels, providing direct supervision from the data. The soft loss L_{\text{soft}} is commonly defined using the Kullback-Leibler (KL) divergence between the teacher's softened output probabilities p_t^\tau and the student's softened outputs p_s^\tau: L_{\text{soft}} = D_{\text{KL}}(p_t^\tau \parallel p_s^\tau) = \sum_i p_{t,i}^\tau \log \frac{p_{t,i}^\tau}{p_{s,i}^\tau}. This measures how closely the student's softened distribution aligns with the teacher's, capturing richer relational information beyond hard labels. Equivalently, L_{\text{soft}} can be expressed as the cross-entropy between the teacher's soft targets and the student's predictions, since the entropy of the teacher's distribution is constant with respect to the student parameters. Temperature scaling plays a crucial role in softening the probability distributions to reveal the teacher's "dark knowledge," such as inter-class relationships encoded in the logits z_t and z_s. The softened probabilities are computed by scaling the logits in the softmax function with a temperature parameter \tau > 1: p_i^\tau = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}. A higher \tau (typically ranging from 2 to 10) flattens the distribution, making it less peaked and more informative for distillation, as it exposes probabilities for incorrect classes that reflect the teacher's learned uncertainties. During inference, \tau is reset to 1 to recover sharp predictions. Because softening reduces the gradients of the soft loss by a factor of approximately 1/\tau^2, the KL divergence is often scaled by \tau^2 to keep its magnitude comparable to the hard loss: L_{\text{soft}} = \tau^2 D_{\text{KL}}(p_t^\tau \parallel p_s^\tau). This adjustment maintains the relative contribution of the soft loss, preventing under-emphasis on the teacher's signals. The same \tau is applied to both teacher and student logits during training for consistency.
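A minimal sketch of the combined objective, assuming PyTorch and the standard formulation above (with the \tau^2 scaling applied to the soft term), might look as follows; the hyperparameter defaults are illustrative rather than prescribed values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, tau=4.0):
    """L_dist = (1 - alpha) * L_hard + alpha * tau^2 * KL(p_t^tau || p_s^tau)."""
    # Hard loss: standard cross-entropy against ground-truth labels (tau = 1).
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft loss: KL divergence between temperature-softened distributions,
    # scaled by tau^2 to compensate for the 1/tau^2 gradient shrinkage.
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    soft_loss = F.kl_div(log_p_s, p_t, reduction="batchmean") * (tau ** 2)

    return (1.0 - alpha) * hard_loss + alpha * soft_loss
```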

Techniques and Variants

Classical Offline Distillation

Classical offline distillation, also known as the original teacher-student framework, involves transferring knowledge from a pre-trained, larger teacher model to a smaller student model in a two-phase process. First, the teacher model, a complex network or an ensemble, is trained on the full dataset to capture rich representations. Once trained, the teacher is frozen, and its outputs, particularly softened probability distributions generated using a high temperature in the softmax, serve as supervisory signals for the student. This approach leverages the teacher's "dark knowledge," which includes inter-class relationships not evident in one-hot hard labels, to guide the student toward more informed predictions. The key steps in classical offline distillation begin with pre-training the teacher on the complete dataset, often employing regularization techniques like dropout to enhance generalization. Next, the frozen teacher generates predictions on a transfer set, typically the same training data, producing soft targets that emphasize relative probabilities across classes. The student model is then optimized using a combined objective that balances the distillation loss, which measures the divergence between teacher and student outputs, with a standard cross-entropy loss on ground-truth hard labels; notably, the teacher remains unchanged throughout this training phase, ensuring a one-way knowledge transfer. This process results in a compact model that approximates the teacher's performance while being more efficient for deployment. Despite its effectiveness, classical offline distillation has notable limitations. It demands significant upfront computational resources to train a large-capacity teacher model, which can be prohibitive in resource-constrained environments. The static nature of the frozen teacher, which relies on fixed outputs, limits adaptability, making it less suitable for tasks where data distributions evolve over time or require dynamic updates. Additionally, the success of distillation hinges on the teacher's quality and the student's capacity to absorb the transferred knowledge, potentially leading to suboptimal results if there is a substantial architectural gap between the models. A prominent example of classical offline distillation is its application to automatic speech recognition, as demonstrated by Hinton et al. In their work on an acoustic model for Android voice search, a single distilled model, derived from a 10-model ensemble, achieved comparable performance with 60.8% frame accuracy and 10.7% word error rate (WER), enabling substantial model size reduction while maintaining efficacy on roughly 2,000 hours of training data. This illustrates the method's utility in compressing acoustic models for mobile deployment without significant accuracy loss.
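The two-phase workflow can be sketched as follows, assuming PyTorch and hypothetical `teacher`, `student`, and `loader` objects standing in for a pre-trained teacher, a smaller student, and a labeled transfer set; this is an illustrative outline rather than the exact procedure of any particular system.

```python
import torch
import torch.nn.functional as F

def train_student_offline(teacher, student, loader, epochs=10,
                          alpha=0.5, tau=4.0, lr=1e-3):
    """Phase two of classical offline distillation: the pre-trained teacher
    is frozen and the student learns from its soft targets plus hard labels."""
    teacher.eval()                      # teacher stays fixed (one-way transfer)
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)

    for _ in range(epochs):
        for inputs, labels in loader:
            with torch.no_grad():       # no gradients flow into the teacher
                teacher_logits = teacher(inputs)
            student_logits = student(inputs)

            hard = F.cross_entropy(student_logits, labels)
            soft = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                            F.softmax(teacher_logits / tau, dim=-1),
                            reduction="batchmean") * (tau ** 2)

            loss = (1 - alpha) * hard + alpha * soft
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```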

Online and Self-Distillation Methods

Online knowledge distillation extends the classical paradigm by training multiple models simultaneously in a collaborative manner, where each model acts as both a teacher and a student to its peers, eliminating the need for a pre-trained, static teacher. This approach, often termed mutual or peer distillation, enables dynamic knowledge exchange during training through the use of softened probability distributions derived from logits. A seminal method in this category is Deep Mutual Learning (DML), proposed by Zhang et al. in 2018, which trains several networks of identical or varying architectures in parallel. In DML, each network minimizes a combined loss of cross-entropy with ground-truth labels and Kullback-Leibler divergence with respect to the softened outputs of the other students, fostering mutual improvement without hierarchical dependencies. This online setup offers advantages such as fewer training phases compared to offline methods and enhanced generalization through diverse peer interactions, particularly beneficial in resource-constrained environments. For instance, on the CIFAR-100 dataset using ResNet-32 architectures, DML achieves a top-1 accuracy of 71.19%, a 2.2% improvement over the baseline single-student training accuracy of 68.99%. Similar gains, around 1-2% in accuracy, have been observed in other vision tasks, underscoring its efficacy for collaborative training scenarios. However, online distillation introduces challenges like increased computational overhead due to parallel model updates and the need for careful coordination among peers to avoid instability. Self-distillation further evolves the technique by enabling a single model to refine itself without any external teacher, leveraging its own predictions or internal components for supervision. In this framework, the model iteratively distills knowledge from previous iterations or from deeper layers to shallower ones, promoting iterative refinement and improved generalization. A foundational example is Born-Again Neural Networks (BANs), introduced by Furlanello et al. in 2018, where a converged model serves as the initial teacher, and subsequent "generations" of the same architecture are trained to mimic the prior version's softened outputs alongside ground-truth labels. This process repeats for multiple iterations, with each student potentially surpassing its predecessor due to accumulated dark knowledge. Another prominent self-distillation variant is the "Be Your Own Teacher" method by Zhang et al. in 2019, which integrates multiple classifiers at different depths within a single network, distilling knowledge from the deepest classifier to shallower ones via Kullback-Leibler divergence on outputs and L2 regularization on intermediate features. This intra-network transfer enhances performance without additional models and allows for early-exit inference, where shallower classifiers provide faster predictions for simpler inputs. Self-distillation's key benefits include simplified deployment, since no separate teacher training is required, and iterative self-improvement, leading to 1-2% accuracy boosts in tasks like image classification, though it demands careful hyperparameter tuning to avoid degradation across iterations and adds training complexity.
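A minimal two-peer sketch of deep mutual learning, assuming PyTorch and hypothetical networks `net_a` and `net_b` with their own optimizers, is shown below; detaching each peer's logits treats it as a fixed teacher for the other's update, approximating the alternating optimization described in the original method.

```python
import torch
import torch.nn.functional as F

def mutual_learning_step(net_a, net_b, opt_a, opt_b, inputs, labels, tau=1.0):
    """One deep-mutual-learning update: each peer is trained on cross-entropy
    with the labels plus KL divergence toward the other peer's predictions."""
    logits_a, logits_b = net_a(inputs), net_b(inputs)

    def peer_loss(own_logits, peer_logits):
        ce = F.cross_entropy(own_logits, labels)
        kl = F.kl_div(F.log_softmax(own_logits / tau, dim=-1),
                      F.softmax(peer_logits.detach() / tau, dim=-1),
                      reduction="batchmean")
        return ce + kl

    loss_a = peer_loss(logits_a, logits_b)
    loss_b = peer_loss(logits_b, logits_a)

    opt_a.zero_grad()
    loss_a.backward()
    opt_a.step()

    opt_b.zero_grad()
    loss_b.backward()
    opt_b.step()
    return loss_a.item(), loss_b.item()
```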

Relations to Model Optimization

Connection to Compression Techniques

Knowledge distillation serves as a key knowledge-transfer method within the broader family of model compression techniques, enabling the reduction of model parameters through behavioral imitation of a larger teacher model by a compact student model. Unlike direct architectural downsizing, which may lead to significant performance degradation, distillation facilitates compression by encoding the teacher's learned representations, such as softened probability distributions, into the student, thereby maintaining representational capacity in a smaller network. This approach is particularly valuable in scenarios requiring deployment on resource-constrained devices, where parameter count and inference speed are critical. Distillation exhibits strong synergies with other compression paradigms, such as quantization and low-rank factorization, allowing for compounded efficiency gains. For instance, post-distillation quantization applies bit-width reduction (e.g., to 8-bit or lower precision) to the already compressed student model, further minimizing memory usage and computational overhead while leveraging the teacher's guidance to mitigate quantization-induced accuracy drops; studies have demonstrated this combination yielding low-precision networks with minimal performance loss compared to quantized teachers alone. Similarly, integrating distillation with low-rank adaptation methods, such as low-rank factorization of weight matrices or adapters like LoRA, exploits parameter redundancy for additional size cuts; for example, progressive low-rank decomposition during fine-tuning can compress Transformer-based models while distilling knowledge to preserve task-specific efficacy. These hybrid strategies often form multi-stage pipelines, where distillation initializes the student before quantization or low-rank techniques are applied. Representative metrics underscore distillation's role in achieving substantial compression ratios alongside FLOP reductions. For example, distilling from ResNet-56 (0.85 million parameters) to ResNet-20 (0.27 million parameters) on CIFAR-10 yields approximately a 68% parameter reduction, with the student attaining 93.15% accuracy, nearly matching the teacher's 93.61%, while also lowering FLOPs for faster inference. In more aggressive cases, such as large language models, distillation combined with other methods can enable up to 10x parameter reduction (e.g., 90% fewer parameters) while retaining 95-98% of the teacher's performance, outperforming standalone downsizing by avoiding the pitfalls of unguided compression. This preservation stems from the teacher's provision of "dark knowledge," including inter-class relationships, which direct compression techniques like naive resizing cannot replicate.
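A simple two-stage pipeline of this kind can be sketched in PyTorch as shown below; the toy student network is a stand-in for a model produced by distillation, and dynamic 8-bit quantization of linear layers is just one of several post-distillation options.

```python
import torch
import torch.nn as nn

# A toy distilled student (stand-in for a model produced by the offline
# distillation loop sketched earlier).
student = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Post-distillation dynamic quantization: weights of the linear layers are
# stored as 8-bit integers and dequantized on the fly during inference,
# shrinking memory use on top of the parameter reduction from distillation.
quantized_student = torch.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8
)

# The quantized student is a drop-in replacement at inference time.
with torch.no_grad():
    logits = quantized_student(torch.randn(1, 784))
```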

Differences from Pruning Algorithms

Pruning algorithms compress neural networks by systematically removing weights or neurons that contribute minimally to the model's performance, often guided by saliency criteria such as weight magnitude or second-order approximations derived from the Hessian. A foundational approach, Optimal Brain Damage, introduced in 1989, employs a second-order Taylor expansion to estimate the expected increase in training error upon weight removal, enabling an iterative process of pruning followed by retraining to maintain accuracy. Magnitude-based methods, a common variant, target weights with the smallest absolute values, assuming they have negligible impact on outputs. Unlike knowledge distillation, which holistically transfers representational knowledge from a larger teacher model to a smaller student by aligning their softened probability distributions via a distillation loss, pruning focuses on local, parameter-level eliminations within a single model. Pruning modifies an existing pre-trained network by inducing sparsity, either unstructured (individual weights) or structured (entire filters or channels), without requiring a separate teacher model, whereas distillation trains a new student model from scratch or with guidance to mimic the teacher's behavior. This distinction means pruning preserves the original architecture's topology more closely, while distillation allows for greater flexibility in designing compact student architectures. Knowledge distillation is preferable when transitioning to a fundamentally different or smaller architecture, such as compressing a large network into a more efficient variant while retaining learned representations, as it leverages the teacher's global knowledge to guide the student. In contrast, pruning suits scenarios where the goal is to accelerate an existing model's inference within the same architecture, such as by sparsifying weights to reduce computational load on resource-constrained devices. Both techniques can synergize with broader compression strategies, but pruning's emphasis on sparsity makes it ideal for hardware-aware optimizations like exploiting sparse matrix operations. For instance, magnitude-based pruning can reduce parameters by up to 9x with minimal initial accuracy loss on networks like AlexNet, but aggressive pruning often incurs accuracy drops that are then recovered through retraining or integration with knowledge distillation to restore performance.
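For contrast with the distillation sketches above, the following PyTorch snippet illustrates unstructured magnitude pruning on a toy single network using `torch.nn.utils.prune`; the architecture and 50% sparsity level are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy single-model network; pruning operates in place on its own weights,
# with no teacher involved.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Unstructured magnitude pruning: zero out the 50% of weights with the
# smallest absolute values in each linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

# Report the resulting sparsity of the first layer's weight tensor.
w = model[0].weight
print(f"sparsity: {float((w == 0).sum()) / w.numel():.2%}")
```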

Historical Development

Origins and Early Concepts

The roots of knowledge distillation trace back to early efforts in neural network optimization during the late 1980s, driven by hardware constraints that limited the deployment of large models. In their seminal work, LeCun et al. introduced "Optimal Brain Damage" (OBD), a technique that systematically removes weights from trained networks to reduce computational demands while preserving performance. OBD employs second-order approximations, specifically the diagonal of the Hessian, to estimate the "saliency" of each weight, measuring the expected increase in training error upon removal, and achieves significant compression, such as up to 60% parameter reduction in digit recognition tasks with negligible accuracy loss. This approach marked an initial shift toward efficient model architectures, laying groundwork for later compression ideas by emphasizing structured simplification over brute-force reduction. By the 1990s, ensemble methods emerged as key inspirations for implicit knowledge sharing among models, addressing limitations in single-learner generalization. Hansen and Salamon demonstrated that combining multiple neural networks, trained independently on the same data, reduces residual error through consensus mechanisms like majority voting, as diverse local minima in the optimization landscape lead to complementary error patterns. Similarly, boosting algorithms, such as AdaBoost introduced by Freund and Schapire, sequentially train weak learners and aggregate their predictions with weighted voting, effectively sharing knowledge to boost overall accuracy beyond individual components. These techniques highlighted the value of collaborative prediction in ensembles, where collective outputs capture robust decision boundaries that no single model achieves alone, influencing subsequent ideas on transferable model insights. A pivotal precursor to formal distillation appeared in 2006 with Buciluǎ et al.'s work on model compression, which explicitly trained compact models to mimic the predictions of large ensembles, treating ensemble outputs as rich supervisory signals. By generating pseudo-labeled data from ensemble probabilities, via methods like MUNGE for data augmentation, they compressed ensembles of hundreds of classifiers into neural networks 1,000 times smaller and faster, often retaining 97% of the original performance on benchmarks like UCI datasets. This represented a conceptual evolution from hardware-focused pruning to viewing model predictions as vessels of transferable knowledge, where probabilistic outputs encode nuanced information beyond hard labels, enabling smaller models to approximate complex ensemble functions effectively.

Key Milestones and Recent Advances

The seminal work introducing knowledge distillation was published in 2015 by Geoffrey Hinton and colleagues, who proposed a teacher-student framework where a compact "student" model learns from the softened probability outputs (soft targets) of a larger "teacher" model, using temperature scaling to reveal inter-class relationships and improve generalization. This approach demonstrated clear improvements in student performance, such as reducing the test error rate on MNIST from 1.46% to 0.74% for a smaller network compared to training on hard labels alone, using a teacher with 1,200 hidden units per layer, and laid the foundation for model compression in neural networks. Between 2016 and 2020, advancements focused on transferring richer representations beyond logits, such as attention maps and relational structures. Attention transfer, proposed in 2016, aligned spatial attention maps derived from feature activations, improving convolutional neural networks on image classification tasks. Relational knowledge distillation, introduced in 2019, shifted emphasis to preserving pairwise relationships between samples in the feature space, allowing student models to approach or surpass teacher accuracy on datasets like CIFAR-100 by capturing structural dependencies. Knowledge distillation subsequently gained prominence in transformer architectures, particularly for natural language processing and vision. DistilBERT (2019, extended in applications through 2021) compressed BERT by 40% while retaining 97% of its performance on GLUE benchmarks, distilling output distributions and hidden states. Similarly, TinyBERT (2019, with follow-ups in 2021) applied general and task-specific distillation to BERT variants, retaining roughly 96% of the teacher's performance on downstream tasks while being substantially smaller and faster. In vision transformers, self-distillation techniques like DINO (2021) interpreted self-supervised learning as a form of distillation without labels, yielding emergent properties such as segmentation masks from self-attention maps in ViTs trained on ImageNet. Data-efficient image transformers (DeiT, 2021) further integrated distillation via a dedicated distillation token and strong augmentation, training ViTs from scratch on ImageNet-1k to match convolutional baselines without large-scale external pre-training data. Recent developments from 2024 to 2025 have emphasized privacy-preserving and generative applications. Federated knowledge distillation has advanced to enable collaborative training across decentralized devices without sharing raw data, as in privacy-preserving frameworks for heterogeneous settings. Integration with diffusion models, such as DiffKD (2023) for denoising features during distillation and methods that distill diffusion models themselves (2024), has aimed to accelerate sampling in generative tasks by reducing the number of denoising steps while preserving quality. Surveys highlight knowledge distillation's role in mobile deployments, driven by its efficiency in compressing large models, and in 2025 it has been increasingly applied to large language models, compressing billion-parameter variants for efficient inference on resource-limited devices. These milestones have shifted knowledge distillation toward scalable, sustainable AI, enabling the deployment of large-scale models on resource-constrained devices and fostering innovations in privacy-preserving training and generative modeling that support broader adoption.
