
Transfer learning

Transfer learning is a subfield of machine learning that focuses on improving the performance of models on a target task by leveraging knowledge acquired from a related source task or domain, particularly when the target domain has limited labeled data. This approach addresses the challenge of data scarcity in traditional machine learning, where models are typically trained from scratch on task-specific datasets, by reusing pre-trained representations to accelerate learning and enhance generalization. Originating from early ideas in the 1990s, such as the 1995 NIPS workshop on "Learning to Learn," transfer learning gained prominence with initiatives like DARPA's 2005 program on transfer learning for knowledge reuse across tasks.

In practice, transfer learning involves transferring knowledge across domains (source and target) that differ in data distribution, feature space, or tasks, categorized primarily into inductive, transductive, and unsupervised settings. Inductive transfer learning applies when the source and target tasks differ but some labeled target data is available, often through fine-tuning pre-trained models. Transductive transfer learning assumes the same task across domains but different data distributions, requiring adaptation without target labels, such as domain adaptation techniques. Unsupervised transfer learning operates without labeled data in either domain, focusing on shared structures like clustering or feature learning. A key insight from deep learning research is that lower-layer features in neural networks, such as edge detectors in convolutional networks, tend to be more transferable across tasks than higher-layer task-specific ones.

Transfer learning has become foundational in fields like computer vision and natural language processing, enabling efficient model development. In computer vision, models pre-trained on large datasets like ImageNet are fine-tuned for tasks such as object detection and medical image analysis, reducing training time and data needs. In natural language processing, models like BERT demonstrate transfer by pre-training on massive corpora for masked language modeling and then adapting to downstream tasks like sentiment analysis or question answering. Despite its benefits, challenges persist, including negative transfer, where irrelevant source knowledge degrades target performance, and handling domain shifts due to covariate or label shifts. Ongoing research emphasizes robust methods to mitigate these issues, ensuring reliable generalization in diverse applications.

Fundamentals

Definition

Transfer learning is a subfield of machine learning, which includes supervised learning—where models learn from labeled data to map inputs to outputs—and unsupervised learning—where models identify patterns in unlabeled data without explicit guidance. Formally, transfer learning is defined as a machine learning paradigm that aims to improve the learning of a target predictive function f_T(\cdot) in a target domain D_T by leveraging knowledge from a source domain D_S and source task T_S, where D_S \neq D_T or T_S \neq T_T. A domain D is composed of a feature space \mathcal{X} and a marginal probability distribution P(\mathcal{X}) over it, so the source domain is D_S = \{\mathcal{X}_S, P(\mathcal{X}_S)\} and the target domain is D_T = \{\mathcal{X}_T, P(\mathcal{X}_T)\}. A task T consists of a label space \mathcal{Y} and an objective predictive function, often the conditional probability P(\mathcal{Y}|\mathcal{X}); thus, the source task is T_S = \{\mathcal{Y}_S, P(\mathcal{Y}_S|\mathcal{X}_S)\} and the target task is T_T = \{\mathcal{Y}_T, P(\mathcal{Y}_T|\mathcal{X}_T)\}. In transfer learning scenarios, the goal is to reuse a model or knowledge from the source to initialize or enhance learning in the target, typically when the target has limited data (n_T \ll n_S). Outcomes include positive transfer, where the source knowledge improves target performance; negative transfer, where it degrades performance due to unrelated domains or tasks; and no transfer, which has a neutral effect.

Motivation and Benefits

Transfer learning is motivated by the challenges inherent in traditional machine learning paradigms, where models are typically trained from scratch on task-specific datasets drawn from identical distributions. In practice, labeled data for many real-world applications—such as specialized domains in healthcare or rare event detection—is often scarce and expensive to acquire, limiting the ability of standard supervised learning to achieve robust performance. Transfer learning addresses this by enabling the reuse of knowledge from related tasks or domains with abundant data, thereby accelerating adaptation to the target scenario without requiring extensive new labeling efforts. A key driver is the high computational cost of training large-scale models, particularly deep neural networks, from the ground up, which can demand significant resources in terms of time, hardware, and energy. For instance, pre-training on massive datasets like ImageNet allows subsequent fine-tuning on smaller target datasets, drastically cutting these overheads while leveraging learned representations of general features such as edges or textures. This approach not only mitigates data scarcity but also harnesses prior knowledge to bootstrap learning, making it feasible to deploy sophisticated models in resource-constrained environments. The benefits of transfer learning are particularly pronounced in improving generalization and performance on small datasets, where traditional methods often overfit or underperform due to insufficient training examples. By initializing models with pre-trained weights, transfer learning enhances predictive accuracy, with empirical studies in computer vision tasks, such as semantic segmentation, showing gains of 10-20% in recall accuracy compared to training from random initialization. In natural language processing, fine-tuning pre-trained models like BERT can reduce training time by orders of magnitude—often to just a few hours on a single GPU versus days for from-scratch training—while achieving state-of-the-art results on downstream tasks with minimal additional data. This contrasts sharply with conventional machine learning, which assumes i.i.d. data across training and testing, rendering models brittle to distribution shifts; transfer learning, by contrast, explicitly reuses knowledge across differing distributions, fostering more adaptable and efficient systems. A compelling example is the application of ImageNet-pre-trained convolutional networks to medical imaging, where limited annotated scans pose a barrier; such transfer has demonstrated substantial performance uplifts, such as 2-6% improvements in AUC for disease classification in chest X-ray images, enabling reliable diagnostics with far fewer patient-specific labels. Overall, these advantages make transfer learning indispensable for scaling AI to diverse, data-limited domains.

Historical Development

Origins and Early Work

The concept of transfer learning originated in psychological studies of how learning in one context influences performance in another. In 1901, Edward L. Thorndike and Robert S. Woodworth conducted foundational experiments demonstrating that transfer depends on the presence of identical elements between tasks, rather than broad formal discipline or general faculties of the mind. Their theory of identical elements posited that positive transfer occurs when tasks share specific common features, while negative transfer arises from interfering elements; this challenged earlier notions of widespread mental training effects and emphasized empirical measurement of transfer degrees. These insights from psychology provided an early conceptual foundation for knowledge reuse across domains. In machine learning, early explorations of transfer-like mechanisms appeared in the 1970s amid nascent neural network research. A pioneering effort came from Ante Fulgosi and Stevo Bozinovski in 1976, who investigated transfer learning in the training of a single-layer perceptron, examining how prior exposure to similar patterns accelerated learning on new tasks through weight initialization from previous trainings. Their work demonstrated that pattern similarity between source and target tasks enhanced training efficiency, marking the first explicit application of transfer principles to neural networks and establishing notions of source domains, target tasks, and adaptation via reused parameters. This built on psychological transfer ideas by applying them to computational models, focusing on self-learning systems without external supervision, and laid initial groundwork for multi-task scenarios in AI. Pre-1990s developments further advanced transfer in specific architectures. In pattern recognition, early neural network reuse involved adapting pre-trained shallow networks for related classification problems, such as handwriting or speech recognition, where shared feature detectors from one dataset improved generalization on sparse data. A notable algorithmic contribution was Lorien Pratt's 1992 Discriminability-Based Transfer (DBT) method for neural networks, which quantified the utility of hidden units from a source network using discriminability measures to selectively transfer beneficial hyperplanes, achieving significant speedups in learning (e.g., up to 50% reduction in epochs on benchmark tasks like vowel recognition). Although focused on neural models rather than decision trees, DBT exemplified early systematic reuse of learned representations, prioritizing transferable components based on information-theoretic criteria. By the mid-1990s, surveys began synthesizing these precursors under inductive transfer paradigms. Rich Caruana's 1993 work on multitask learning positioned shared representations across related tasks as a form of inductive transfer, arguing that joint training leverages domain information to improve generalization on individual tasks. This approach, detailed in his 1997 Machine Learning article, served as an early survey of transfer mechanisms, bridging psychological roots and AI implementations by formalizing multitask setups as precursors to modern transfer learning, all without deep architectures. These foundational efforts established core principles of adaptation and reuse, enabling subsequent evolution in deep learning.

Key Milestones and Evolution

The formalization of transfer learning gained momentum in the late 2000s through key surveys that categorized its approaches and distinguished types such as inductive and transductive transfer. A seminal overview by Taylor and Stone in 2009 focused on transfer methods for reinforcement learning domains, proposing a framework to classify techniques based on their representational capabilities and learning goals. This was complemented by the influential 2010 survey by Pan and Yang, which systematically reviewed progress in transfer learning for classification, regression, and clustering tasks, while formally defining core settings like negative transfer and highlighting relationships to domain adaptation and multi-task learning. The 2010s marked a pivotal shift with the integration of transfer learning into deep neural networks, driven by breakthroughs in large-scale pre-training. The 2012 AlexNet model by Krizhevsky et al. demonstrated the efficacy of pre-training deep convolutional networks on massive datasets like ImageNet, achieving a top-5 error rate of 15.3% and sparking widespread adoption of transfer learning in computer vision by enabling feature extraction from pre-trained weights. In 2016, Andrew Ng forecasted during a NIPS tutorial that transfer learning would emerge as the dominant paradigm in machine learning, surpassing traditional supervised approaches due to its ability to leverage vast pre-existing knowledge. This prediction aligned with the rise of transformer-based models; for instance, BERT by Devlin et al. in 2018 introduced bidirectional pre-training on unlabeled text, yielding state-of-the-art results on GLUE benchmarks (average score of 80.5%) and popularizing fine-tuning for natural language processing tasks. Entering the 2020s, research emphasized efficiency and scalability in transfer learning amid growing model sizes. Zoph et al. in 2020 challenged conventional pre-training by showing it could sometimes degrade performance on downstream tasks like object detection on COCO (e.g., a loss of 1.0 AP with strong augmentation), advocating self-training as a robust alternative that improved COCO performance by up to 3.4 AP over baselines without relying on external pre-trained models. Post-2020 developments included the advent of federated transfer learning to address privacy-preserving adaptation across distributed data sources, as explored in comprehensive reviews that categorize hybrid approaches combining federated and transfer mechanisms for heterogeneous domains. Concurrently, the Vision Transformer (ViT) by Dosovitskiy et al. in 2020 extended transfer principles to pure attention-based architectures, achieving 88.55% top-1 accuracy on ImageNet when pre-trained at scale, thus bridging language and vision paradigms. Key advancements since 2021 include contrastive models like CLIP (Radford et al., 2021) for zero-shot multimodal transfer across vision and language, and parameter-efficient techniques such as LoRA (Hu et al., 2021) for large models. Overall, transfer learning has evolved from shallow, instance-based methods in the 2000s to deep pre-training and fine-tuning strategies dominant since the 2010s, with ongoing surveys like Zhuang et al.'s comprehensive review synthesizing over 40 approaches and underscoring the field's progression toward handling domain shifts in large-scale systems. This trajectory reflects a broader transition to knowledge reuse in resource-constrained environments, with recent works up to 2025 highlighting multimodal and privacy-aware extensions.

Classification and Types

Inductive Transfer Learning

Inductive transfer learning refers to the paradigm in transfer learning where the source and target tasks differ, but labeled data is available for the target task, allowing the transfer of knowledge to improve the target learner's performance. This approach assumes that the source domain provides useful knowledge that can be adapted to the target, typically when the source has abundant labeled data while the target has limited labels. Unlike scenarios without target labels, inductive transfer explicitly leverages supervised signals in the target domain to refine the transferred knowledge. A key subtype of inductive transfer learning is multi-task learning, where multiple related tasks are learned simultaneously to leverage shared representations and improve generalization across them. In this setup, the tasks share common features or parameters, enabling inductive bias transfer from one task to others, as originally formalized in early work on multitask frameworks. This subtype is particularly effective when tasks are interdependent, such as predicting related outcomes in classification problems. Mechanisms in inductive transfer learning often involve instance weighting to emphasize source samples relevant to the target domain. A seminal algorithm, TrAdaBoost, introduced in 2007, extends the AdaBoost framework by iteratively adjusting weights: target instances receive standard boosting updates, while source instances are downweighted if they lead to errors on the target, assuming related but shifted distributions between domains. This process relies on the assumption that source and target distributions are similar enough for positive transfer, with source data providing auxiliary supervision without identical tasks. Other methods build on this by incorporating feature alignment or parameter sharing under similar distributional relatedness assumptions. A representative example is digit recognition, where a model pretrained on the MNIST dataset (handwritten digits) is fine-tuned on the SVHN dataset (street-view house numbers), both involving labeled images of digits 0-9 but with differing visual styles and backgrounds. This transfer exploits shared semantics while adapting to domain-specific noise, achieving notable accuracy gains over training from scratch on SVHN alone. Inductive transfer learning is effective for tasks with related label spaces, reducing the need for extensive target labeling and accelerating convergence, but it can suffer from negative transfer if domain shifts are too pronounced, leading to degraded performance compared to target-only training.
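The instance-reweighting idea behind TrAdaBoost can be illustrated with a short sketch. The following is a simplified, hedged approximation of the weight-update loop described above, assuming binary labels and scikit-learn decision stumps as weak learners; the function name, number of rounds, and other details are illustrative rather than taken from the original paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tradaboost_sketch(X_src, y_src, X_tgt, y_tgt, n_rounds=10):
    """Simplified TrAdaBoost-style instance reweighting (binary labels in {0, 1})."""
    n_src, n_tgt = len(X_src), len(X_tgt)
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([y_src, y_tgt])
    w = np.ones(n_src + n_tgt) / (n_src + n_tgt)      # uniform initial weights
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_src) / n_rounds))
    learners, betas = [], []

    for _ in range(n_rounds):
        p = w / w.sum()
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=p)
        err_vec = (h.predict(X) != y).astype(float)
        # Error is measured only on the target portion of the combined data.
        eps = np.sum(p[n_src:] * err_vec[n_src:]) / p[n_src:].sum()
        eps = min(max(eps, 1e-10), 0.499)
        beta_t = eps / (1.0 - eps)
        # Source mistakes are downweighted; target mistakes are upweighted.
        w[:n_src] *= beta_src ** err_vec[:n_src]
        w[n_src:] *= beta_t ** (-err_vec[n_src:])
        learners.append(h)
        betas.append(beta_t)
    return learners, betas
```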

Transductive and Unsupervised Transfer Learning

Transductive transfer learning addresses scenarios where the source domain provides labeled data for a specific task, but the target domain shares the same task while lacking labels, with access to unlabeled target samples available for adaptation. This setting emphasizes domain adaptation techniques to bridge the distribution shift between source and target without requiring target annotations, making it suitable for real-world applications where labeling target data is costly or infeasible. Unlike inductive transfer learning, which relies on labeled target data to refine models for potentially different tasks, transductive approaches focus solely on aligning representations across domains for the shared task. A prominent method in transductive transfer learning is instance-based reweighting, which iteratively reweights instances to emphasize those similar to the target domain while downweighting outliers, effectively boosting a weak learner for the target task. More advanced feature-level techniques include subspace alignment, which represents the source and target domains as low-dimensional subspaces via principal component analysis and learns a linear transformation to align the source subspace basis with the target, minimizing divergence while preserving discriminative features for classification. This approach has demonstrated superior performance in visual tasks, such as adapting recognition models from office environments to Caltech object images, achieving relative accuracy improvements of up to 20% over prior geodesic flow methods on datasets like Office-Caltech. Adversarial training methods further advance transductive adaptation by learning domain-invariant features through a game between a feature extractor and a domain discriminator. The Domain-Adversarial Neural Network (DANN) exemplifies this by incorporating a gradient reversal layer during backpropagation, which encourages the extractor to fool the discriminator into treating source and target samples as indistinguishable, while maintaining task-specific discriminability on source labels. Applied to image classification, DANN has set state-of-the-art results on datasets like Office-31, attaining 73% accuracy in cross-domain transfers (e.g., Amazon to Webcam), surpassing traditional methods by aligning marginal and conditional distributions. These techniques highlight transductive learning's reliance on target domain access to enable effective, unsupervised alignment. Unsupervised transfer learning extends beyond transductive settings by assuming no access to target domain data or labels, relying instead on selecting or extracting transferable knowledge from the source domain alone to generalize to unseen targets. This variant is particularly relevant for zero-shot or broad-domain transfer, where the goal is to identify intrinsic source structures—such as shared features or clusters—that apply universally without target-specific adaptation. In contrast to transductive methods, unsupervised approaches do not leverage target samples, broadening applicability to scenarios with completely novel environments but increasing the risk of negative transfer from irrelevant source elements. Key unsupervised methods include self-taught learning and clustering techniques, such as Self-Taught Clustering, which first learns sparse representations from a large pool of unlabeled source data using algorithms like sparse coding, then clusters these to discover transferable patterns for downstream tasks without supervision. Instance selection strategies, like those in early transfer boosting variants, further refine this by pruning source data to retain only high-relevance subsets based on intrinsic properties, such as density or manifold structure, for application to new domains. These methods have shown efficacy in applications like text clustering, where transferring clustered features from one corpus improves performance on unrelated datasets over non-transfer baselines, emphasizing conceptual reuse over domain-specific tuning. Overall, unsupervised transfer prioritizes robust, generalizable source exploitation, serving as a foundation for more extreme adaptation challenges.
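As an illustration of the subspace alignment idea described above, the following minimal sketch (assuming scikit-learn; the dimensionality and variable names are illustrative) computes a PCA basis for each domain and the closed-form alignment matrix M = X_s^T X_t that maps the source subspace onto the target subspace.

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_alignment(X_src, X_tgt, n_components=20):
    """Sketch of subspace alignment: project source features into a target-aligned subspace."""
    pca_s = PCA(n_components=n_components).fit(X_src)
    pca_t = PCA(n_components=n_components).fit(X_tgt)
    Xs = pca_s.components_.T          # (d, k) source basis vectors as columns
    Xt = pca_t.components_.T          # (d, k) target basis vectors as columns
    M = Xs.T @ Xt                     # alignment matrix minimizing ||Xs M - Xt||_F
    src_aligned = X_src @ Xs @ M      # source samples in the target-aligned subspace
    tgt_proj = X_tgt @ Xt             # target samples in their own subspace
    return src_aligned, tgt_proj

# A classifier trained on src_aligned (with source labels) can then be applied to tgt_proj.
```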

Mathematical Framework

Domain and Task Formalism

In transfer learning, the foundational mathematical framework begins with formal definitions of domains and tasks to distinguish between source and target settings. A domain D is defined as a pair consisting of a feature space \mathcal{X} and a marginal probability distribution P(X) over that space, denoted as D = \{\mathcal{X}, P(X)\}, where \mathcal{X} represents the space of possible input features. Similarly, a task T comprises a label space \mathcal{Y} and a predictive function f(\cdot) = P(Y|X), expressed as T = \{\mathcal{Y}, P(Y|X)\}, where the conditional distribution P(Y|X) models the relationship between inputs and outputs, typically learned from labeled data pairs \{(x_i, y_i)\}. The core objective of transfer learning is established under this formalism: given a source domain D_S = \{\mathcal{X}_S, P(X_S)\} and source task T_S = \{\mathcal{Y}_S, P(Y_S|X_S)\}, along with a target domain D_T = \{\mathcal{X}_T, P(X_T)\} and target task T_T = \{\mathcal{Y}_T, P(Y_T|X_T)\}, the goal is to improve the learning of the target predictive function f_T(\cdot) by leveraging knowledge from the source, particularly when D_S \neq D_T or T_S \neq T_T. In practice, the source typically provides abundant labeled data \{(x_S^i, y_S^i)\}_{i=1}^{n_S} with n_S \gg n_T, while the target has limited or no labels \{(x_T^j, y_T^j)\}_{j=1}^{n_T}. Differences between source and target are often characterized by specific types of distributional shifts. Covariate shift occurs when the marginal distributions differ, P(X_S) \neq P(X_T), but the conditional P(Y|X) remains invariant across domains, assuming \mathcal{X}_S = \mathcal{X}_T. Label shift, also known as prior shift, arises when the label distribution changes, P(Y_S) \neq P(Y_T), while the class-conditional input distribution P(X|Y) stays the same, leading to altered P(Y|X). Concept shift, in contrast, involves a change in the predictive relationship itself, P(Y_S|X_S) \neq P(Y_T|X_T), even if the feature distributions align, encompassing broader task variations such as differing label spaces \mathcal{Y}_S \neq \mathcal{Y}_T. These shifts highlight the challenges in transferring knowledge, as they violate assumptions of identical distributions underlying standard machine learning.
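Under the covariate-shift assumption above (P(X) changes while P(Y|X) is shared), a standard correction is to reweight source examples by the density ratio w(x) = P_T(x)/P_S(x), which can be approximated with a domain classifier. The sketch below is a minimal illustration under that assumption, not a method prescribed in this article; the helper name and classifier choice are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(X_src, X_tgt):
    """Estimate importance weights w(x) ~ P_T(x) / P_S(x) with a domain classifier."""
    X = np.vstack([X_src, X_tgt])
    d = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])  # 0 = source, 1 = target
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_tgt = clf.predict_proba(X_src)[:, 1]            # P(domain = target | x) for source points
    ratio = p_tgt / np.clip(1.0 - p_tgt, 1e-6, None)  # classifier odds approximate the density ratio
    ratio *= len(X_src) / len(X_tgt)                  # correct for differing sample sizes
    return ratio

# The returned weights can be passed as sample_weight when fitting the target predictor on
# source data, so training emphasizes regions of input space where the target is dense.
```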

Adaptation Algorithms and Metrics

In transfer learning, adaptation algorithms aim to bridge the gap between source and target domains or tasks by reweighting data, transforming representations, or adjusting model parameters. Instance-based methods focus on selecting or reweighting source instances to better align with the target domain, assuming that some source data are more relevant than others. A prominent example is TrAdaBoost, which extends AdaBoost by dynamically adjusting weights for source instances during boosting iterations, downweighting those that perform poorly on the target while upweighting useful ones. Feature-based approaches seek to learn a shared feature representation that reduces distribution discrepancies across domains. Transfer Component Analysis (TCA), for instance, projects source and target data into a reproducing kernel Hilbert space (RKHS) to minimize the maximum mean discrepancy (MMD) while preserving within-domain variance, enabling effective adaptation in unsupervised settings. Parameter-based methods transfer learned parameters from a source model to the target, often by sharing lower-layer weights in neural networks and fine-tuning higher layers. This approach leverages the generality of early features, as demonstrated in studies showing that transferring convolutional layers from models pre-trained on ImageNet improves target performance, with transferability decreasing as layers become more task-specific. Theoretical foundations for these algorithms often rely on generalization bounds that quantify the impact of domain shift. A key result from domain adaptation theory provides an upper bound on the target error \epsilon_T(f) of a hypothesis f in terms of the source error \epsilon_S(f), the divergence between domains, and task discrepancy:

\epsilon_T(f) \leq \epsilon_S(f) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) + \lambda

Here, d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) is the \mathcal{H}\Delta\mathcal{H}-divergence measuring the distinguishability of domains under the hypothesis class \mathcal{H}, and \lambda captures the joint error of the optimal hypothesis across domains and tasks. Adaptation algorithms typically minimize proxies for this divergence to tighten the bound and improve target performance. Evaluation in transfer learning employs metrics tailored to assess adaptation quality beyond standard accuracy. The transfer performance gap measures the relative degradation or improvement, often computed as the difference between target accuracy with and without transfer, highlighting the net benefit of transfer. The negative transfer gap (NTG) measures the performance degradation when source knowledge harms the target, serving as a diagnostic for harmful shifts. For distribution similarity, the A-distance provides a non-parametric proxy for the \mathcal{H}\Delta\mathcal{H}-divergence, defined as d_A(D_S, D_T) = 2(1 - 2\hat{\epsilon}(\eta)), where \hat{\epsilon}(\eta) is the error of a classifier \eta trained to distinguish unlabeled source and target samples; lower values indicate better alignment potential. These metrics guide algorithm selection and validation, emphasizing bounds like those from Ben-David et al. (2010) to ensure theoretical guarantees.
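A minimal sketch of the proxy A-distance computation described above, assuming scikit-learn; the 50/50 train/test split and the linear SVM as the domain classifier are illustrative choices.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def proxy_a_distance(X_src, X_tgt):
    """Estimate the proxy A-distance by training a classifier to tell domains apart."""
    X = np.vstack([X_src, X_tgt])
    d = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])
    X_tr, X_te, d_tr, d_te = train_test_split(X, d, test_size=0.5, random_state=0)
    err = 1.0 - LinearSVC(max_iter=5000).fit(X_tr, d_tr).score(X_te, d_te)
    return 2.0 * (1.0 - 2.0 * err)   # d_A = 2(1 - 2*err); lower values mean more similar domains
```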

Practical Techniques

Pre-training and Fine-Tuning

Pre-training is a foundational phase in transfer learning where a deep neural network is trained from scratch on a large-scale source dataset to learn general-purpose representations. In computer vision, models are commonly pre-trained on the ImageNet dataset, which contains over 1.2 million labeled images across 1,000 categories, enabling the extraction of hierarchical features from low-level edges to high-level objects. In natural language processing (NLP), pre-training occurs on massive text corpora, such as the combination of BooksCorpus and English Wikipedia used for BERT, totaling around 3.3 billion words, to capture linguistic patterns and contextual embeddings. This phase leverages abundant unlabeled or weakly labeled data to initialize model parameters, often using self-supervised objectives like masked language modeling in BERT or next-sentence prediction. Fine-tuning follows pre-training by adapting the initialized model to a specific target task with limited labeled data, typically using a lower learning rate to preserve learned representations while updating weights. Strategies include freezing early layers, which capture generic features like textures in vision or syntax in language, and only updating later layers or the task-specific head to prevent catastrophic forgetting. For instance, in computer vision tasks, fine-tuning a pre-trained ResNet on medical images has shown significant accuracy improvements, often around 10% or more, over training from scratch on small datasets. In NLP, fine-tuning BERT on downstream tasks like text classification achieves state-of-the-art results by jointly optimizing the entire model or select layers. Variants of fine-tuning offer flexibility based on computational resources and data availability. Linear probing involves freezing the entire pre-trained backbone and training only a linear classifier on top of the frozen features, which is computationally efficient and preserves representations but may underperform on complex adaptations. Full fine-tuning updates all parameters end-to-end, maximizing adaptability but risking overfitting on small target sets. Progressive unfreezing, as introduced in ULMFiT, gradually unfreezes layers from the classifier head to the body, allowing stable adaptation with techniques like discriminative learning rates that decrease exponentially across layers. These approaches fall under parameter-based transfer learning, where weights are directly reused and adjusted. Practical implementation of pre-training and fine-tuning is facilitated by open-source frameworks like Hugging Face Transformers, which provide pre-trained models such as BERT and Vision Transformers, along with APIs for seamless fine-tuning on custom datasets. This library supports variants like linear probing via simple classifier additions and progressive unfreezing through layer-wise optimizers, democratizing access to transfer learning for researchers and practitioners.
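The freeze-and-replace-head recipe can be sketched in a few lines of PyTorch/torchvision. The 5-class head, optimizer, and learning rate below are illustrative assumptions; switching from linear probing to full fine-tuning only changes which parameters are handed to the optimizer.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pre-trained ResNet-50 and adapt it to a hypothetical 5-class target task.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

for param in model.parameters():                 # freeze the pre-trained backbone
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 5)    # new task-specific head (trainable)

# Only the head's parameters are optimized here (linear probing); for full fine-tuning,
# pass model.parameters() instead and use a smaller learning rate.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```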

Feature and Parameter Reuse

Feature extraction in transfer learning involves utilizing intermediate layers of a pre-trained source model as fixed feature representations for a new classifier on the target task, thereby avoiding the need for full retraining of the source network. This approach leverages the hierarchical nature of deep neural networks, where lower layers capture general features like edges and textures, while higher layers encode task-specific patterns. For instance, in convolutional neural networks (CNNs) pre-trained on large datasets such as ImageNet, embeddings from early to mid-level layers serve as robust inputs for downstream vision tasks, enabling effective transfer even to dissimilar domains. Seminal work has quantified this transferability, showing that features from the first two layers of an 8-layer CNN transfer almost perfectly across tasks, achieving accuracies comparable to training from scratch (e.g., top-1 accuracy of approximately 0.625 on similar datasets), while deeper layers exhibit greater specificity, with drops of up to 25% on dissimilar tasks like distinguishing man-made from natural objects. Parameter sharing represents another key method for reusing learned parameters across tasks or instances, promoting efficiency by constraining the model to learn shared representations. In architectures like Siamese networks, two identical subnetworks share all weights to compute similarity metrics, such as in one-shot image recognition, where the shared CNN backbone processes pairs of images to learn embeddings for comparison without task-specific retraining. This design reduces parameter redundancy and enhances generalization in few-shot scenarios by enforcing invariance to input variations. Similarly, in multi-task setups adapted for transfer, a common backbone (e.g., a shared CNN or transformer encoder) feeds into task-specific heads, allowing parameters from the source task's pre-training to be directly reused for multiple related targets, as demonstrated in early multi-task frameworks where shared lower layers improved performance across diverse related predictions. Hybrid approaches combine feature or parameter reuse with minimal additional training through modular components, such as adapter modules inserted into frozen pre-trained models. These adapters consist of small bottleneck layers—a down-projection to a low-dimensional space followed by a nonlinearity and up-projection—that are added after key operations like self-attention or feed-forward blocks in transformers, enabling task adaptation with only the adapter parameters being updated. Introduced for NLP, this method exemplifies parameter-efficient transfer by repurposing large models like BERT without altering their core weights. On benchmarks like GLUE, adapter tuning achieves a mean score of 80.0, within 0.4 points of full fine-tuning's 80.4, while adding just 3.6% more parameters to the base model. Other parameter-efficient techniques, such as low-rank adaptation (LoRA), further reduce trainable parameters by injecting low-rank matrices into attention layers, achieving comparable performance with even fewer updates and becoming widely adopted by 2025. Such reuse strategies yield significant efficiency gains, particularly in resource-constrained settings, by drastically reducing the number of trainable parameters compared to full model fine-tuning. For example, adapters can decrease the parameter footprint by two orders of magnitude relative to fine-tuning all layers of a large pre-trained model, effectively cutting trainable parameters by over 90% in cases like BERT-large adaptations, where only a fraction of the 340 million parameters (around 12 million) are optimized per task.
This not only lowers computational costs but also facilitates modular deployment, allowing multiple tasks to share a single frozen backbone with lightweight, swappable adapters.
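A minimal sketch of a bottleneck adapter of the kind described above, in PyTorch. The hidden size, bottleneck width, and the use of a generic frozen transformer block are illustrative assumptions rather than a reproduction of any specific published architecture.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual connection."""
    def __init__(self, hidden_dim, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual keeps the frozen path intact

# Usage sketch: freeze a pre-trained transformer block and train only the adapter.
hidden_dim = 768
adapter = Adapter(hidden_dim)
frozen_block = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12, batch_first=True)
for p in frozen_block.parameters():
    p.requires_grad = False

x = torch.randn(2, 16, hidden_dim)            # (batch, sequence, hidden)
out = adapter(frozen_block(x))                 # gradients flow only into the adapter
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```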

Applications

Computer Vision

Transfer learning has revolutionized computer vision by enabling models pre-trained on large-scale datasets like ImageNet to adapt effectively to specialized tasks with limited data, addressing challenges such as domain shifts and data scarcity. In object detection, models such as YOLO are commonly fine-tuned from pre-training on the COCO dataset, allowing for efficient detection of objects in diverse environments; for instance, fine-tuning on vehicle-specific datasets has demonstrated robust performance in real-world scenarios with reduced training time. Similarly, for semantic segmentation, U-Net variants leverage transfer learning by initializing with weights from natural image pre-training and fine-tuning on task-specific data, achieving precise pixel-level predictions in applications like biomedical image analysis. In medical imaging, transferring knowledge from natural images to medical datasets mitigates the scarcity of labeled medical data, with pre-training on large natural image corpora enabling models to learn generalizable features for chest X-ray classification and segmentation, often performing comparably to or better than medical-specific pre-training on larger targets. Case studies highlight the practical impact of these approaches. Pre-training on ImageNet has been shown to boost accuracy on custom small datasets by 15-30% in classification tasks, particularly when fine-tuning with limited labels, by providing robust low-level features like edges and textures that generalize across domains. For domain adaptation, the Office-31 benchmark evaluates cross-dataset transfer, where techniques like adversarial alignment transfer knowledge from source (e.g., Amazon product images) to target domains (e.g., webcam photos), improving classification accuracy by aligning feature distributions and reducing domain discrepancy. These adaptations are crucial for scenarios with distribution shifts, such as varying lighting or viewpoints in office object recognition. In addition, advances in continual learning have further enhanced transfer learning in computer vision by enabling models to adapt to sequential tasks without catastrophic forgetting, as reviewed in recent surveys. Tailored techniques further enhance transfer in vision. Data augmentation strategies, including style transfer and synthetic data generation, help handle domain shifts by generating varied training samples that bridge source and target distributions, improving model robustness without additional labeling. Recent advances in vision-language models, such as CLIP, enable zero-shot transfer by aligning image and text embeddings during pre-training, allowing classification of unseen categories via natural-language prompts; extensions in 2023-2024, like CLIP-PING, have boosted lightweight models' zero-shot performance on downstream tasks. This impact extends to real-time applications like autonomous driving, where transfer learning from simulated or large-scale driving datasets to limited real-world data enables efficient perception systems for object detection and scene understanding, reducing the need for extensive annotations.
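Zero-shot transfer with a CLIP-style model can be sketched with the Hugging Face Transformers API. The checkpoint name, image file, and prompt list below are illustrative; the pattern simply scores an image against natural-language class descriptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot classification: compare one image against text prompts for candidate classes.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                       # illustrative local file
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a truck"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)         # one similarity score per prompt
print(dict(zip(prompts, probs[0].tolist())))
```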

Natural Language Processing

Transfer learning has transformed natural language processing (NLP) by allowing models pre-trained on vast unlabeled text corpora to adapt efficiently to downstream tasks, leveraging shared representations across domains. In NLP, this paradigm is prominently applied to tasks such as sentiment analysis, where models classify text polarity; machine translation, enabling translation between language pairs; and question answering, which involves extracting or generating responses from context. These applications benefit from pre-training paradigms like masked language modeling, followed by task-specific fine-tuning. A landmark case study is BERT, released in 2018, which pre-trains bidirectional encoders on BooksCorpus (800 million words) and English Wikipedia (2.5 billion words) using masked language modeling and next-sentence prediction objectives. Upon fine-tuning, BERT-LARGE established state-of-the-art performance on the GLUE benchmark, achieving an average score of 80.5%—a 7.7 percentage point absolute improvement over prior methods. Specifically, it excelled in sentiment analysis on the SST-2 dataset with 94.9% accuracy and in question answering on SQuAD v1.1 with 93.2 F1 score, demonstrating robust transfer to diverse tasks. The GPT series illustrates generative transfer learning in NLP, shifting focus from discriminative to autoregressive models. GPT-3, a 175-billion-parameter model pre-trained on about 410 billion tokens from diverse sources like Common Crawl, supports few-shot in-context learning for generative tasks without parameter updates. It achieved 85.0 F1 on the CoQA question-answering dataset in few-shot settings and strong scores in machine translation, such as a BLEU of 35.1 for Romanian-to-English, highlighting its ability to transfer broad linguistic knowledge to new generative applications like text completion and summarization. Cross-lingual adaptations extend transfer learning to low-resource languages, enabling models trained primarily on high-resource data like English to perform in underrepresented ones. Multilingual BERT (mBERT), pre-trained on monolingual corpora from 104 languages, facilitates zero-shot and fine-tuned transfer across linguistic families. For instance, mBERT fine-tuned on data from the MasakhaNER dataset reached 89.36 F1 for Swahili named entity recognition, outperforming traditional models by leveraging cross-lingual embeddings despite limited Swahili training data. Recent 2024 advances in NLP incorporate vision-text alignment into transfer learning frameworks. Multimodal large language models (MM-LLMs), such as LLaVA and BLIP-2, employ lightweight projectors (e.g., Q-Former) to align visual encoders like CLIP ViT with pre-trained LLMs, enabling instruction-tuned transfer for tasks integrating text and images, such as visual question answering, while preserving core generative capabilities. Overall, transfer learning democratizes NLP for underrepresented languages by drastically reducing data needs—often enabling viable performance with zero or few target-language examples. In low-resource languages, cross-lingual methods like mT5-xl with constrained decoding boost zero-shot NER F1 scores on datasets like MasakhaNER, making advanced tools accessible without extensive annotation efforts. In 2025, further advancements in instruction-finetuned multilingual LLMs have improved transfer for low-resource tasks.
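The standard fine-tuning workflow for an encoder like BERT can be sketched with the Hugging Face Transformers and Datasets libraries. The hyperparameters and output directory below are illustrative; this follows the common sequence-classification pattern rather than reproducing any specific published setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Fine-tune a pre-trained BERT encoder for binary sentiment classification on SST-2.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = load_dataset("glue", "sst2")
encoded = dataset.map(
    lambda ex: tokenizer(ex["sentence"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

args = TrainingArguments(output_dir="bert-sst2", learning_rate=2e-5,
                         per_device_train_batch_size=16, num_train_epochs=3)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"], eval_dataset=encoded["validation"])
trainer.train()
```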

Challenges and Limitations

Negative Transfer

Negative transfer refers to the phenomenon in transfer learning where the incorporation of knowledge from a source domain or task degrades the performance on the target domain or task, rather than improving it. This occurs primarily when the source and target domains are mismatched, such as through significant covariate shift, label shift, or concept shift, leading the model to overfit to irrelevant source-specific patterns that hinder generalization to the target. In the formalism of domains and tasks, negative transfer is exacerbated when the joint distribution of inputs and labels in the source P_S(X_S, Y_S) diverges substantially from that in the target P_T(X_T, Y_T), causing transferred representations to misalign with target requirements. A prominent example arises in computer vision, where models pretrained on natural images (e.g., ImageNet) and transferred to synthetic image datasets like VisDA or across domains in benchmarks like Office-31 exhibit negative transfer, with accuracy drops of up to 10-20% compared to target-only training in cases such as webcam to DSLR transfers, due to stylistic and distributional differences. Another case is in unsupervised domain adaptation on benchmarks like Office-31, where transferring from a source domain with unrelated categories (e.g., webcam images to DSLR) results in a negative transfer gap—defined as the difference between source-pretrained target performance and optimal baseline—quantifying the harm, often reaching negative values indicating worse outcomes than no transfer. To mitigate negative transfer, domain discrepancy measures such as the Maximum Mean Discrepancy (MMD) are employed to quantify and minimize distributional differences between source and target, enabling adaptive alignment only when similarity thresholds are met. Selective transfer techniques, like adversarial filtering to exclude harmful source samples, have been shown to recover performance losses, improving accuracy by 5-15% on affected benchmarks. Ensemble methods that combine multiple source models, weighting them based on predicted compatibility, further reduce risks by averaging out detrimental influences. Empirical studies reveal negative transfer as a pervasive issue, particularly in unsupervised settings, where it manifests in a significant portion of domain adaptation scenarios across over 20 evaluated algorithms on specialized benchmarks, underscoring the need for proactive detection.
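The MMD measure mentioned above can be estimated directly from samples. The following is a minimal sketch of the biased RBF-kernel estimator, with the kernel bandwidth left as an illustrative parameter; a large value between source and target features signals a pronounced domain gap and a higher risk of negative transfer.

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Squared maximum mean discrepancy between samples X and Y under an RBF kernel."""
    def kernel(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)
    k_xx, k_yy, k_xy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    return k_xx.mean() + k_yy.mean() - 2 * k_xy.mean()
```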

Evaluation and Scalability Issues

Evaluating transfer learning models poses significant challenges due to the limited availability of standardized benchmarks beyond well-known datasets like GLUE for natural language processing and the Office dataset for domain adaptation in computer vision. While GLUE provides a multi-task evaluation framework for assessing generalization across NLP tasks, it has been criticized for not fully capturing out-of-distribution robustness, leading to the development of extensions like GLUE-X to address these gaps. Similarly, the Office dataset, which evaluates domain shifts across office environments, lacks breadth for diverse real-world scenarios, complicating fair comparisons and hindering the identification of robust transfer methods. Cross-validation in shifted domains exacerbates these issues, as traditional splits often fail to account for distribution mismatches between source and target data, resulting in overly optimistic performance estimates that do not generalize well. Scalability remains a core concern in transfer learning, particularly for pre-training large models, where computational demands can be prohibitive. For instance, pre-training GPT-3, with 175 billion parameters, required approximately 3.14 × 10^23 floating-point operations, far exceeding the resources available to most researchers and organizations. This high compute cost not only limits accessibility but also raises environmental concerns due to the energy consumption involved. In federated transfer learning scenarios, where models are adapted across decentralized devices, data privacy adds further complexity, as sharing model updates must comply with regulations like GDPR while preventing leakage of sensitive source data. Additional issues include catastrophic forgetting during fine-tuning, where adapting a pre-trained model to a new task erodes performance on the original tasks, and bias amplification from source data, which can propagate and intensify unfair representations in the target domain. Catastrophic forgetting arises because fine-tuning overwrites shared parameters critical to prior knowledge, as observed in deep transfer learning settings where source-task accuracy drops significantly post-adaptation. Bias amplification occurs when spurious correlations in the source dataset, such as demographic imbalances, persist or worsen in the transferred model, even if the target data is debiased, leading to unreliable downstream applications. To mitigate these challenges, techniques like efficient adapter modules and knowledge distillation offer practical solutions for scalability and evaluation. Adapter modules insert lightweight, task-specific layers into pre-trained models, adding only a fraction of the parameters (e.g., 0.5-3% per task) while preserving overall performance, thus enabling faster adaptation without full retraining. Knowledge distillation compresses large teacher models into smaller student versions by transferring softened output distributions, reducing model size by up to 90% in transfer settings while maintaining accuracy, as demonstrated in vision-language tasks. These approaches facilitate more reliable evaluation by allowing experimentation on resource-constrained setups and help scale transfer learning to broader applications.
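The knowledge distillation objective described above is commonly implemented as a weighted blend of a soft term against the teacher's softened outputs and the usual hard-label loss. The temperature and mixing weight below are illustrative defaults, not values prescribed by any particular paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a KL term on temperature-softened teacher outputs with the hard-label loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # scale by T^2 to keep gradient magnitudes
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```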

Future Directions

Recent Advances

In 2025, advancements in statistical learning emphasized the development of specialized data structures to handle distribution shifts more effectively, as detailed in a comprehensive review that categorizes challenges into model-based and data-based approaches while introducing resolution techniques for typical methods. Surveys on cross-dataset visual recognition have highlighted problem-oriented methods, both shallow and deep, to improve performance across diverse visual datasets by addressing dataset mismatches. In 2025, transfer learning in robotics gained traction through reviews that unified the paradigm under taxonomies considering morphology, task complexity, and data modalities, enabling efficient reuse of prior experiences to accelerate adaptation without starting from scratch. In clinical prediction tasks, such as hospital-specific post-discharge mortality estimation, latent transfer learning frameworks demonstrated reductions in estimation errors by incorporating multi-source data, achieving efficiency gains through decreased standard errors compared to isolated models. In 2025, transfer learning extended to chemistry with approaches leveraging custom-tailored virtual molecular databases to predict catalytic activity in real-world photosensitizers, enhancing model generalization from simulated to experimental data. A survey further explored the integration of transfer learning with large language models in medical systems, showcasing applications in diagnostics and patient management that boost performance in data-scarce healthcare scenarios. Key theoretical contributions included analyses from statistical mechanics, developing effective theories for transfer in fully connected neural networks via Franz-Parisi formalisms to quantify generalization boosts in the proportional limit. One prominent emerging trend in transfer learning is the rise of foundation models, particularly multimodal variants that integrate diverse data types such as text, images, and video to enable more robust knowledge transfer across domains. Models like Flamingo exemplify this shift, leveraging large-scale pre-training on interleaved image-text corpora to achieve few-shot capabilities, thereby reducing the need for extensive task-specific data. This approach has extended to biological applications, where multi-modal transfer learning connects modalities like DNA, RNA, and proteins, facilitating cross-domain adaptations in scientific modeling. Another key trend involves federated and privacy-preserving transfer learning, which allows collaborative model training across distributed devices without sharing raw data, addressing growing concerns over data privacy in sensitive sectors like healthcare and finance. Techniques such as homomorphic encryption and selective knowledge sharing in federated settings have demonstrated improved performance while maintaining privacy in resource-constrained environments. Complementing this is the advancement in continual and lifelong learning paradigms, which mitigate catastrophic forgetting by enabling continuous adaptation to new tasks while retaining prior knowledge, as seen in neural architectures that balance plasticity and stability for sequential learning scenarios. Open questions persist in handling extreme domain shifts, where models struggle with significant distributional mismatches, such as transferring from simulated to real-world environments, often leading to performance degradation without adaptive alignment strategies. Ethical biases in transferred models represent another critical challenge, as pre-trained representations can propagate societal inequities into downstream applications like medical diagnostics, necessitating bias-detection frameworks integrated into transfer pipelines. Scaling to edge devices remains unresolved, with computational overhead limiting deployment on low-resource hardware despite promising federated-transfer approaches. Looking ahead, the integration of transfer learning with quantum machine learning holds potential for exponential speedups in high-dimensional tasks, as hybrid quantum-classical architectures enable robust knowledge transfer in adversarial settings. Auto-transfer systems, which automate source selection and adaptation, are gaining traction for streamlining deployment, with algorithms like automated broad-transfer learning showing efficacy in cross-domain fault diagnosis by dynamically aligning features without manual intervention. Research gaps include the absence of a unified theory for avoiding negative transfer, where source knowledge hinders target performance, as current methods like feature alignment provide empirical fixes but lack theoretical guarantees for generalizability. Additionally, standardized benchmarks for 2025+ large language models in transfer scenarios are underdeveloped, with existing evaluations like ECLeKTic highlighting needs for cross-lingual knowledge transfer metrics to assess long-term adaptability beyond baselines.

References

  1. A Survey on Transfer Learning.
  2. A Comprehensive Survey on Transfer Learning. arXiv:1911.02685, Nov 7, 2019.
  3. Yosinski, Jason; Clune, Jeff; Bengio, Yoshua; Lipson, Hod. How transferable are features in deep neural networks? arXiv, Nov 6, 2014.
  4. A Survey on Transfer Learning. IEEE Journals & Magazine, Oct 16, 2009.
  5. A survey of transfer learning. Journal of Big Data, May 28, 2016.
  6. Deep learning in computer vision: A critical review of emerging ... Dec 15, 2021.
  7. Pre-training on Grayscale ImageNet Improves Medical Image ...
  8. Thorndike, E. L.; Woodworth, R. S. (1901). The influence of improvement in one mental function upon the efficiency of other functions (I).
  9. A Review of Transfer Theories and Effective Instructional Practices.
  10. Reminder of the First Paper on Transfer Learning in Neural ... (work from the 1970s and early 1980s, first published in 1976).
  11. Discriminability-Based Transfer between Neural Networks.
  12. Transfer Learning for Reinforcement Learning Domains: A Survey.
  13. Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. ImageNet Classification with Deep Convolutional Neural Networks.
  14. Transfer Learning - Machine Learning's Next Frontier. ruder.io, Mar 21, 2017.
  15. BERT: Pre-training of Deep Bidirectional Transformers ... arXiv:1810.04805, Oct 11, 2018.
  16. An Image is Worth 16x16 Words: Transformers ... arXiv:2010.11929, Oct 22, 2020.
  17. A Comprehensive Survey on Transfer Learning. arXiv.
  18. Unsupervised Visual Domain Adaptation Using Subspace Alignment.
  19. Domain-Adversarial Training of Neural Networks.
  20. A Unified View of Label Shift Estimation. NeurIPS.
  21. Huyen, Chip. Data Distribution Shifts and Monitoring. Feb 7, 2022.
  22. Boosting for transfer learning. Proceedings of the 24th International Conference on Machine Learning.
  23. A theory of learning from different domains. Machine Learning, Oct 23, 2009.
  24. A Survey on Negative Transfer. arXiv, Aug 9, 2021.
  25. ImageNet Large Scale Visual Recognition Challenge. arXiv, Sep 1, 2014.
  26. Fine-Tuning can Distort Pretrained Features and Underperform Out ... Feb 21, 2022.
  27. Universal Language Model Fine-tuning for Text Classification. arXiv, Jan 18, 2018.
  28. Transformers. Hugging Face documentation.
  29. Siamese Neural Networks for One-shot Image Recognition.
  30. Parameter-Efficient Transfer Learning for NLP. arXiv:1902.00751, Feb 2, 2019.
  31. Deep Learning-based Bio-Medical Image Segmentation using UNet ... May 24, 2023.
  32. Effect of Pre-Training Scale on Intra- and Inter-Domain Full and Few ... May 31, 2021.
  33. Do Better ImageNet Models Transfer Better? arXiv:1805.08974, May 23, 2018.
  34. Accelerating Deep Unsupervised Domain Adaptation with Transfer ... Mar 25, 2019.
  35. Learning Transferable Visual Models From Natural Language ... Feb 26, 2021.
  36. Language Models are Few-Shot Learners. arXiv:2005.14165, May 28, 2020.
  37. How multilingual is Multilingual BERT? arXiv:1906.01502, Jun 4, 2019.
  38. Cross-Lingual Transfer for Low-Resource Natural Language ... arXiv, Feb 4, 2025.
  39. A Survey on Negative Transfer. arXiv:2009.00909.
  40. Characterizing and Avoiding Negative Transfer. CVF Open Access.
  41. A study of the effects of negative transfer on deep unsupervised ... Apr 1, 2021.
  42. A Survey on Negative Transfer.
  43. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural ... Apr 20, 2018.
  44. GLUE-X: Evaluating Natural Language Understanding Models from ... Jul 9, 2023.
  45. Validity Challenges in Machine Learning Benchmarks. Aug 3, 2022.
  47. When does Bias Transfer in Transfer Learning? arXiv:2207.02842, Jul 6, 2022.
  49. Recent Advances in Transfer Learning for Cross-Dataset Visual ...
  50. Transfer learning in robotics: An upcoming breakthrough? A review ... Sep 13, 2024.
  51. A latent transfer learning method for estimating hospital-specific post ... Nov 8, 2024.
  52. Transfer learning from custom-tailored virtual molecular databases ... Oct 1, 2025.
  53. A survey on the applications of transfer learning to enhance the ... Jun 5, 2025.
  54. Statistical Mechanics of Transfer Learning in Fully Connected ...
  55. Privacy-preserving Heterogeneous Federated Transfer Learning.
  56. Continual lifelong learning with neural networks: A review.
  57. Ethical and Bias Considerations in Artificial Intelligence/Machine ...
  58. Using Transfer Learning in Building Federated Learning Models on ...
  59. Adversarially Robust Quantum Transfer Learning. arXiv:2510.16301, Oct 18, 2025.
  60. Automated broad transfer learning for cross-domain fault diagnosis.
  61. ECLeKTic: A novel benchmark for evaluating cross-lingual ...