
Transfer learning

Transfer learning is a subfield of machine learning that focuses on improving the performance of models on a target task by leveraging knowledge acquired from a related source task or domain, particularly when the target domain has limited labeled data. This approach addresses the challenge of data scarcity in traditional machine learning, where models are typically trained from scratch on task-specific datasets, by reusing pre-trained representations to accelerate learning and enhance generalization. Originating from early ideas in the 1990s, such as the 1995 NIPS workshop on "Learning to Learn," transfer learning gained prominence with initiatives like DARPA's 2005 program on transfer learning for knowledge reuse across tasks.

In practice, transfer learning involves transferring knowledge across domains (source and target) that differ in data distribution, feature space, or tasks, categorized primarily into inductive, transductive, and unsupervised settings. Inductive transfer learning applies when the source and target tasks differ but some labeled target data is available, often through fine-tuning pre-trained models. Transductive transfer learning assumes the same task across domains but different data distributions, requiring adaptation without target labels, such as domain adaptation techniques. Unsupervised transfer learning operates without labeled data in either domain, focusing on shared structures like clustering or feature learning. A key insight from deep learning research is that lower-layer features in neural networks, such as edge detectors in convolutional networks, tend to be more transferable across tasks than higher-layer task-specific ones.

Transfer learning has become foundational in fields like computer vision and natural language processing, enabling efficient model development. In computer vision, models pre-trained on large datasets like ImageNet are fine-tuned for tasks such as object detection and medical image analysis, reducing training time and data needs. In natural language processing, models like BERT demonstrate transfer by pre-training on massive corpora for masked language modeling and then adapting to downstream tasks like sentiment analysis or question answering. Despite its benefits, challenges persist, including negative transfer, where irrelevant source knowledge degrades target performance, and handling domain shifts due to covariate or label shifts. Ongoing research emphasizes robust methods to mitigate these issues, ensuring reliable generalization in diverse applications.

Fundamentals

Definition

Transfer learning is a subfield of machine learning, which includes supervised learning—where models learn from labeled data to map inputs to outputs—and unsupervised learning—where models identify patterns in unlabeled data without explicit guidance. Formally, transfer learning is defined as a machine learning paradigm that aims to improve the learning of a target predictive function f_T(\cdot) in a target domain D_T by leveraging knowledge from a source domain D_S and source task T_S, where D_S \neq D_T or T_S \neq T_T. A domain D is composed of a feature space \mathcal{X} and a marginal probability distribution P(\mathcal{X}) over it, so the source domain is D_S = \{\mathcal{X}_S, P(\mathcal{X}_S)\} and the target domain is D_T = \{\mathcal{X}_T, P(\mathcal{X}_T)\}. A task T consists of a label space \mathcal{Y} and an objective predictive function, often the conditional probability P(\mathcal{Y}|\mathcal{X}); thus, the source task is T_S = \{\mathcal{Y}_S, P(\mathcal{Y}_S|\mathcal{X}_S)\} and the target task is T_T = \{\mathcal{Y}_T, P(\mathcal{Y}_T|\mathcal{X}_T)\}. In transfer learning scenarios, the goal is to reuse a model or knowledge from the source to initialize or enhance learning in the target, typically when the target has limited data (n_T \ll n_S). Outcomes include positive transfer, where the source knowledge improves target performance; negative transfer, where it degrades performance due to unrelated domains or tasks; and no transfer, which has a neutral effect.

Motivation and Benefits

Transfer learning is motivated by the challenges inherent in traditional machine learning paradigms, where models are typically trained from scratch on task-specific datasets drawn from identical distributions. In practice, labeled data for many real-world applications—such as specialized domains in healthcare or rare event detection—is often scarce and expensive to acquire, limiting the ability of standard supervised learning to achieve robust performance. Transfer learning addresses this by enabling the reuse of knowledge from related tasks or domains with abundant data, thereby accelerating adaptation to the target scenario without requiring extensive new labeling efforts. A key driver is the high computational cost of training large-scale models, particularly deep neural networks, from the ground up, which can demand significant resources in terms of time, hardware, and energy. For instance, pre-training on massive datasets like ImageNet allows subsequent fine-tuning on smaller target datasets, drastically cutting these overheads while leveraging learned representations of general features such as edges or textures. This approach not only mitigates data scarcity but also harnesses prior knowledge to bootstrap learning, making it feasible to deploy sophisticated models in resource-constrained environments. The benefits of transfer learning are particularly pronounced in improving generalization and performance on small datasets, where traditional methods often overfit or underperform due to insufficient training examples. By initializing models with pre-trained weights, transfer learning enhances predictive accuracy, with empirical studies in computer vision tasks, such as semantic segmentation, showing gains of 10-20% in recall accuracy compared to training from random initialization. In natural language processing, fine-tuning pre-trained models like BERT can reduce training time by orders of magnitude—often to just a few hours on a single GPU versus days for from-scratch training—while achieving state-of-the-art results on downstream tasks with minimal additional data. This contrasts sharply with conventional machine learning, which assumes i.i.d. data across training and testing, rendering models brittle to distribution shifts; transfer learning, by contrast, explicitly reuses knowledge across differing distributions, fostering more adaptable and efficient systems. A compelling example is the application of ImageNet-pre-trained convolutional networks to medical imaging, where limited annotated scans pose a barrier; such transfer has demonstrated substantial performance uplifts, such as 2-6% improvements in AUC for disease classification in chest X-ray images, enabling reliable diagnostics with far fewer patient-specific labels. Overall, these advantages make transfer learning indispensable for scaling AI to diverse, data-limited domains.

Historical Development

Origins and Early Work

The concept of transfer learning originated in psychological studies of how learning in one context influences performance in another. In 1901, Edward L. Thorndike and Robert S. Woodworth conducted foundational experiments demonstrating that transfer depends on the presence of identical elements between tasks, rather than broad formal discipline or general faculties of the mind. Their theory of identical elements posited that positive transfer occurs when tasks share specific common features, while negative transfer arises from interfering elements; this challenged earlier notions of widespread mental training effects and emphasized empirical measurement of transfer degrees. These insights from psychology provided an early conceptual foundation for knowledge reuse across domains. In machine learning, early explorations of transfer-like mechanisms appeared in the 1970s amid nascent neural network research. A pioneering effort came from Ante Fulgosi and Stevo Bozinovski in 1976, who investigated transfer learning in the training of a single-layer perceptron, examining how prior exposure to similar patterns accelerated learning on new tasks through weight initialization from previous trainings. Their work demonstrated that pattern similarity between source and target tasks enhanced training efficiency, marking the first explicit application of transfer principles to neural networks and establishing notions of source domains, target tasks, and adaptation via reused parameters. This built on psychological transfer ideas by applying them to computational models, focusing on self-learning systems without external supervision, and laid initial groundwork for multi-task scenarios in AI. Pre-1990s developments further advanced transfer in specific architectures. In pattern recognition, early neural network reuse involved adapting pre-trained shallow networks for related classification problems, such as handwriting or speech recognition, where shared feature detectors from one dataset improved generalization on sparse data. A notable algorithmic contribution was Lorien Pratt's 1992 Discriminability-Based Transfer (DBT) method for neural networks, which quantified the utility of hidden units from a source network using discriminability measures to selectively transfer beneficial hyperplanes, achieving significant speedups in learning (e.g., up to 50% reduction in epochs on benchmark tasks like vowel recognition). Although focused on neural models rather than decision trees, DBT exemplified early systematic reuse of learned representations, prioritizing transferable components based on information-theoretic criteria. By the mid-1990s, surveys began synthesizing these precursors under inductive transfer paradigms. Rich Caruana's 1993 work on multitask learning positioned shared representations across related tasks as a form of inductive transfer, arguing that joint training leverages domain information to improve generalization on individual tasks. This approach, detailed in his 1997 Machine Learning article, served as an early survey of transfer mechanisms, bridging psychological roots and AI implementations by formalizing multitask setups as precursors to modern transfer learning, all without deep architectures. These foundational efforts established core principles of adaptation and reuse, enabling subsequent evolution in deep learning.

Key Milestones and Evolution

The formalization of transfer learning gained momentum in the late 2000s through key surveys that categorized its approaches and distinguished types such as inductive and transductive transfer. A seminal overview by Taylor and Stone in 2009 focused on transfer methods for reinforcement learning domains, proposing a framework to classify techniques based on their representational capabilities and learning goals. This was complemented by the influential 2010 survey by Pan and Yang, which systematically reviewed progress in transfer learning for classification, regression, and clustering tasks, while formally defining core settings like negative transfer and highlighting relationships to domain adaptation and multi-task learning. The 2010s marked a pivotal shift with the integration of transfer learning into deep neural networks, driven by breakthroughs in large-scale pre-training. The 2012 AlexNet model by Krizhevsky et al. demonstrated the efficacy of pre-training deep convolutional networks on massive datasets like ImageNet, achieving a top-5 error rate of 15.3% and sparking widespread adoption of transfer learning in computer vision by enabling feature extraction from pre-trained weights. In 2016, Andrew Ng forecasted during a NIPS tutorial that transfer learning would emerge as the dominant paradigm in machine learning, surpassing traditional supervised approaches due to its ability to leverage vast pre-existing knowledge. This prediction aligned with the rise of transformer-based models; for instance, BERT by Devlin et al. in 2018 introduced bidirectional pre-training on unlabeled text, yielding state-of-the-art results on GLUE benchmarks (average score of 80.5%) and popularizing fine-tuning for natural language processing tasks. Entering the 2020s, research emphasized efficiency and scalability in transfer learning amid growing model sizes. Zoph et al. in 2020 challenged conventional pre-training by showing it could sometimes degrade performance on downstream tasks like object detection on COCO (e.g., a loss of 1.0 AP with strong augmentation), advocating self-training as a robust alternative that improved COCO performance by up to 3.4 AP over baselines without relying on external pre-trained models. Post-2020 developments included the advent of federated transfer learning to address privacy-preserving adaptation across distributed data sources, as explored in comprehensive reviews that categorize hybrid approaches combining federated and transfer mechanisms for heterogeneous domains. Concurrently, the Vision Transformer (ViT) by Dosovitskiy et al. in 2020 extended transfer principles to pure attention-based architectures, achieving 88.55% top-1 accuracy on ImageNet when pre-trained at scale, thus bridging language and vision paradigms. Key advancements since 2021 include contrastive models like CLIP (Radford et al., 2021) for zero-shot multimodal transfer across vision and language, and parameter-efficient techniques such as LoRA (Hu et al., 2021) for large models. Overall, transfer learning has evolved from shallow, instance-based methods in the 2000s to deep pre-training and fine-tuning strategies dominant since the 2010s, with ongoing surveys like Zhuang et al.'s comprehensive review synthesizing over 40 approaches and underscoring the field's progression toward handling domain shifts in large-scale systems. This trajectory reflects a broader transition to knowledge reuse in resource-constrained environments, with recent works up to 2025 highlighting multimodal and privacy-aware extensions.

Classification and Types

Inductive Transfer Learning

Inductive transfer learning refers to the paradigm in transfer learning where the source and target tasks differ, but labeled data is available for the target task, allowing the transfer of knowledge to improve the target learner's performance. This approach assumes that the source domain provides useful knowledge that can be adapted to the target, typically when the source has abundant labeled data while the target has limited labels. Unlike scenarios without target labels, inductive transfer explicitly leverages supervised signals in the target domain to refine the transferred knowledge. A key subtype of inductive transfer learning is multi-task learning, where multiple related tasks are learned simultaneously to leverage shared representations and improve generalization across them. In this setup, the tasks share common features or parameters, enabling inductive bias transfer from one task to others, as originally formalized in early work on multitask frameworks. This subtype is particularly effective when tasks are interdependent, such as predicting related outcomes in classification problems. Mechanisms in inductive transfer learning often involve instance weighting to emphasize source samples relevant to the target domain. A seminal algorithm, TrAdaBoost, introduced in 2007, extends the AdaBoost framework by iteratively adjusting weights: target instances receive standard boosting updates, while source instances are downweighted if they lead to errors on the target, assuming related but shifted distributions between domains. This process relies on the assumption that source and target distributions are similar enough for positive transfer, with source data providing auxiliary supervision without identical tasks. Other methods build on this by incorporating feature alignment or parameter sharing under similar distributional relatedness assumptions. A representative example is digit recognition, where a model pretrained on the MNIST dataset (handwritten digits) is fine-tuned on the SVHN dataset (street-view house numbers), both involving labeled images of digits 0-9 but with differing visual styles and backgrounds. This transfer exploits shared semantics while adapting to domain-specific noise, achieving notable accuracy gains over training from scratch on SVHN alone. Inductive transfer learning is effective for tasks with related label spaces, reducing the need for extensive target labeling and accelerating convergence, but it can suffer from negative transfer if domain shifts are too pronounced, leading to degraded performance compared to target-only training.
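The instance-reweighting idea behind TrAdaBoost can be illustrated with a short sketch. The following is a simplified, hedged approximation of the weight-update loop described above, assuming binary labels and scikit-learn decision stumps as weak learners; the function name, number of rounds, and other details are illustrative rather than taken from the original paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tradaboost_sketch(X_src, y_src, X_tgt, y_tgt, n_rounds=10):
    """Simplified TrAdaBoost-style instance reweighting (binary labels in {0, 1})."""
    n_src, n_tgt = len(X_src), len(X_tgt)
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([y_src, y_tgt])
    w = np.ones(n_src + n_tgt) / (n_src + n_tgt)      # uniform initial weights
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_src) / n_rounds))
    learners, betas = [], []

    for _ in range(n_rounds):
        p = w / w.sum()
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=p)
        err_vec = (h.predict(X) != y).astype(float)
        # Error is measured only on the target portion of the combined data.
        eps = np.sum(p[n_src:] * err_vec[n_src:]) / p[n_src:].sum()
        eps = min(max(eps, 1e-10), 0.499)
        beta_t = eps / (1.0 - eps)
        # Source mistakes are downweighted; target mistakes are upweighted.
        w[:n_src] *= beta_src ** err_vec[:n_src]
        w[n_src:] *= beta_t ** (-err_vec[n_src:])
        learners.append(h)
        betas.append(beta_t)
    return learners, betas
```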

Transductive and Unsupervised Transfer Learning

Transductive transfer learning addresses scenarios where the source domain provides labeled data for a specific task, but the target domain shares the same task while lacking labels, with access to unlabeled target samples available for adaptation. This setting emphasizes domain adaptation techniques to bridge the distribution shift between source and target without requiring target annotations, making it suitable for real-world applications where labeling target data is costly or infeasible. Unlike inductive transfer learning, which relies on labeled target data to refine models for potentially different tasks, transductive approaches focus solely on aligning representations across domains for the shared task. A prominent method in transductive transfer learning is instance-based reweighting, which iteratively reweights instances to emphasize those similar to the target domain while downweighting outliers, effectively boosting a weak learner for the target task. More advanced feature-level techniques include subspace alignment, which represents the source and target domains as low-dimensional subspaces via principal component analysis and learns a linear transformation to align the source subspace basis with the target, minimizing divergence while preserving discriminative features for classification. This approach has demonstrated superior performance in visual tasks, such as adapting recognition models from office environments to Caltech object images, achieving relative accuracy improvements of up to 20% over prior geodesic flow methods on datasets like Office-Caltech. Adversarial training methods further advance transductive adaptation by learning domain-invariant features through a game between a feature extractor and a domain discriminator. The Domain-Adversarial Neural Network (DANN) exemplifies this by incorporating a gradient reversal layer during backpropagation, which encourages the extractor to fool the discriminator into treating source and target samples as indistinguishable, while maintaining task-specific discriminability on source labels. Applied to image classification, DANN has set state-of-the-art results on datasets like Office-31, attaining 73% accuracy in cross-domain transfers (e.g., Amazon to Webcam), surpassing traditional methods by aligning marginal and conditional distributions. These techniques highlight transductive learning's reliance on target domain access to enable effective, unsupervised alignment. Unsupervised transfer learning extends beyond transductive settings by assuming no access to target domain data or labels, relying instead on selecting or extracting transferable knowledge from the source domain alone to generalize to unseen targets. This variant is particularly relevant for zero-shot or broad-domain transfer, where the goal is to identify intrinsic source structures—such as shared features or clusters—that apply universally without target-specific adaptation. In contrast to transductive methods, unsupervised approaches do not leverage target samples, broadening applicability to scenarios with completely novel environments but increasing the risk of negative transfer from irrelevant source elements. Key unsupervised methods include self-taught learning and clustering techniques, such as Self-Taught Clustering, which first learns sparse representations from a large pool of unlabeled source data using algorithms like sparse coding, then clusters these to discover transferable patterns for downstream tasks without supervision. Instance selection strategies, like those in early transfer boosting variants, further refine this by pruning source data to retain only high-relevance subsets based on intrinsic properties, such as density or manifold structure, for application to new domains. These methods have shown efficacy in applications like text clustering, where transferring clustered features from one corpus improves performance on unrelated datasets over non-transfer baselines, emphasizing conceptual reuse over domain-specific tuning. Overall, unsupervised transfer prioritizes robust, generalizable source exploitation, serving as a foundation for more extreme adaptation challenges.
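As an illustration of the subspace alignment idea described above, the following minimal sketch (assuming scikit-learn; the dimensionality and variable names are illustrative) computes a PCA basis for each domain and the closed-form alignment matrix M = X_s^T X_t that maps the source subspace onto the target subspace.

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_alignment(X_src, X_tgt, n_components=20):
    """Sketch of subspace alignment: project source features into a target-aligned subspace."""
    pca_s = PCA(n_components=n_components).fit(X_src)
    pca_t = PCA(n_components=n_components).fit(X_tgt)
    Xs = pca_s.components_.T          # (d, k) source basis vectors as columns
    Xt = pca_t.components_.T          # (d, k) target basis vectors as columns
    M = Xs.T @ Xt                     # alignment matrix minimizing ||Xs M - Xt||_F
    src_aligned = X_src @ Xs @ M      # source samples in the target-aligned subspace
    tgt_proj = X_tgt @ Xt             # target samples in their own subspace
    return src_aligned, tgt_proj

# A classifier trained on src_aligned (with source labels) can then be applied to tgt_proj.
```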

Mathematical Framework

Domain and Task Formalism

In transfer learning, the foundational mathematical framework begins with formal definitions of domains and tasks to distinguish between source and target settings. A domain D is defined as a pair consisting of a feature space \mathcal{X} and a marginal probability distribution P(X) over that space, denoted as D = \{\mathcal{X}, P(X)\}, where \mathcal{X} represents the space of possible input features. Similarly, a task T comprises a label space \mathcal{Y} and a predictive function f(\cdot) = P(Y|X), expressed as T = \{\mathcal{Y}, P(Y|X)\}, where the conditional distribution P(Y|X) models the relationship between inputs and outputs, typically learned from labeled data pairs \{(x_i, y_i)\}. The core objective of transfer learning is established under this formalism: given a source domain D_S = \{\mathcal{X}_S, P(X_S)\} and source task T_S = \{\mathcal{Y}_S, P(Y_S|X_S)\}, along with a target domain D_T = \{\mathcal{X}_T, P(X_T)\} and target task T_T = \{\mathcal{Y}_T, P(Y_T|X_T)\}, the goal is to improve the learning of the target predictive function f_T(\cdot) by leveraging knowledge from the source, particularly when D_S \neq D_T or T_S \neq T_T. In practice, the source typically provides abundant labeled data \{(x_S^i, y_S^i)\}_{i=1}^{n_S} with n_S \gg n_T, while the target has limited or no labels \{(x_T^j, y_T^j)\}_{j=1}^{n_T}. Differences between source and target are often characterized by specific types of distributional shifts. Covariate shift occurs when the marginal distributions differ, P(X_S) \neq P(X_T), but the conditional P(Y|X) remains invariant across domains, assuming \mathcal{X}_S = \mathcal{X}_T. Label shift, also known as prior shift, arises when the label distribution changes, P(Y_S) \neq P(Y_T), while the class-conditional input distribution P(X|Y) stays the same, leading to altered P(Y|X). Concept shift, in contrast, involves a change in the predictive relationship itself, P(Y_S|X_S) \neq P(Y_T|X_T), even if the feature distributions align, encompassing broader task variations such as differing label spaces \mathcal{Y}_S \neq \mathcal{Y}_T. These shifts highlight the challenges in transferring knowledge, as they violate assumptions of identical distributions underlying standard machine learning.
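Under the covariate-shift assumption above (P(X) changes while P(Y|X) is shared), a standard correction is to reweight source examples by the density ratio w(x) = P_T(x)/P_S(x), which can be approximated with a domain classifier. The sketch below is a minimal illustration under that assumption, not a method prescribed in this article; the helper name and classifier choice are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(X_src, X_tgt):
    """Estimate importance weights w(x) ~ P_T(x) / P_S(x) with a domain classifier."""
    X = np.vstack([X_src, X_tgt])
    d = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])  # 0 = source, 1 = target
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_tgt = clf.predict_proba(X_src)[:, 1]            # P(domain = target | x) for source points
    ratio = p_tgt / np.clip(1.0 - p_tgt, 1e-6, None)  # classifier odds approximate the density ratio
    ratio *= len(X_src) / len(X_tgt)                  # correct for differing sample sizes
    return ratio

# The returned weights can be passed as sample_weight when fitting the target predictor on
# source data, so training emphasizes regions of input space where the target is dense.
```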

Adaptation Algorithms and Metrics

In transfer learning, adaptation algorithms aim to bridge the gap between source and target domains or tasks by reweighting data, transforming representations, or adjusting model parameters. Instance-based methods focus on selecting or reweighting source instances to better align with the target domain, assuming that some source data are more relevant than others. A prominent example is TrAdaBoost, which extends AdaBoost by dynamically adjusting weights for source instances during boosting iterations, downweighting those that perform poorly on the target while upweighting useful ones. Feature-based approaches seek to learn a shared feature representation that reduces distribution discrepancies across domains. Transfer Component Analysis (TCA), for instance, projects source and target data into a reproducing kernel Hilbert space (RKHS) to minimize the maximum mean discrepancy (MMD) while preserving within-domain variance, enabling effective adaptation in unsupervised settings. Parameter-based methods transfer learned parameters from a source model to the target, often by sharing lower-layer weights in neural networks and fine-tuning higher layers. This approach leverages the generality of early features, as demonstrated in studies showing that transferring convolutional layers from models pre-trained on ImageNet improves target performance, with transferability decreasing as layers become more task-specific. Theoretical foundations for these algorithms often rely on generalization bounds that quantify the impact of domain shift. A key result from domain adaptation theory provides an upper bound on the target error \epsilon_T(f) of a hypothesis f in terms of the source error \epsilon_S(f), the divergence between domains, and task discrepancy:

\epsilon_T(f) \leq \epsilon_S(f) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) + \lambda

Here, d_{\mathcal{H}\Delta\mathcal{H}}(D_S, D_T) is the \mathcal{H}\Delta\mathcal{H}-divergence measuring the distinguishability of domains under the hypothesis class \mathcal{H}, and \lambda captures the joint error of the optimal hypothesis across domains and tasks. Adaptation algorithms typically minimize proxies for this divergence to tighten the bound and improve target performance. Evaluation in transfer learning employs metrics tailored to assess adaptation quality beyond standard accuracy. The transfer performance gap measures the relative degradation or improvement, often computed as the difference between target accuracy with and without transfer, highlighting the net benefit of transfer. The negative transfer gap (NTG) measures the performance degradation when source knowledge harms the target, serving as a diagnostic for harmful shifts. For distribution similarity, the A-distance provides a non-parametric proxy for the \mathcal{H}\Delta\mathcal{H}-divergence, defined as d_A(D_S, D_T) = 2(1 - 2\hat{\epsilon}(\eta)), where \hat{\epsilon}(\eta) is the error of a classifier \eta trained to distinguish unlabeled source and target samples; lower values indicate better alignment potential. These metrics guide algorithm selection and validation, emphasizing bounds like those from Ben-David et al. (2010) to ensure theoretical guarantees.
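A minimal sketch of the proxy A-distance computation described above, assuming scikit-learn; the 50/50 train/test split and the linear SVM as the domain classifier are illustrative choices.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def proxy_a_distance(X_src, X_tgt):
    """Estimate the proxy A-distance by training a classifier to tell domains apart."""
    X = np.vstack([X_src, X_tgt])
    d = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])
    X_tr, X_te, d_tr, d_te = train_test_split(X, d, test_size=0.5, random_state=0)
    err = 1.0 - LinearSVC(max_iter=5000).fit(X_tr, d_tr).score(X_te, d_te)
    return 2.0 * (1.0 - 2.0 * err)   # d_A = 2(1 - 2*err); lower values mean more similar domains
```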

Practical Techniques

Pre-training and Fine-Tuning

Pre-training is a foundational phase in transfer learning where a deep neural network is trained from scratch on a large-scale source dataset to learn general-purpose representations. In computer vision, models are commonly pre-trained on the ImageNet dataset, which contains over 1.2 million labeled images across 1,000 categories, enabling the extraction of hierarchical features from low-level edges to high-level objects. In natural language processing (NLP), pre-training occurs on massive text corpora, such as the combination of BooksCorpus and English Wikipedia used for BERT, totaling around 3.3 billion words, to capture linguistic patterns and contextual embeddings. This phase leverages abundant unlabeled or weakly labeled data to initialize model parameters, often using self-supervised objectives like masked language modeling in BERT or next-sentence prediction. Fine-tuning follows pre-training by adapting the initialized model to a specific target task with limited labeled data, typically using a lower learning rate to preserve learned representations while updating weights. Strategies include freezing early layers, which capture generic features like textures in vision or syntax in language, and only updating later layers or the task-specific head to prevent catastrophic forgetting. For instance, in computer vision tasks, fine-tuning a pre-trained ResNet on medical images has shown significant accuracy improvements, often around 10% or more, over training from scratch on small datasets. In NLP, fine-tuning BERT on downstream tasks like text classification achieves state-of-the-art results by jointly optimizing the entire model or select layers. Variants of fine-tuning offer flexibility based on computational resources and data availability. Linear probing involves freezing the entire pre-trained backbone and training only a linear classifier on top of the frozen features, which is computationally efficient and preserves representations but may underperform on complex adaptations. Full fine-tuning updates all parameters end-to-end, maximizing adaptability but risking overfitting on small target sets. Progressive unfreezing, as introduced in ULMFiT, gradually unfreezes layers from the classifier head to the body, allowing stable adaptation with techniques like discriminative learning rates that decrease exponentially across layers. These approaches fall under parameter-based transfer learning, where weights are directly reused and adjusted. Practical implementation of pre-training and fine-tuning is facilitated by open-source frameworks like Hugging Face Transformers, which provide pre-trained models such as BERT and Vision Transformers, along with APIs for seamless fine-tuning on custom datasets. This library supports variants like linear probing via simple classifier additions and progressive unfreezing through layer-wise optimizers, democratizing access to transfer learning for researchers and practitioners.
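The freeze-and-replace-head recipe can be sketched in a few lines of PyTorch/torchvision. The 5-class head, optimizer, and learning rate below are illustrative assumptions; switching from linear probing to full fine-tuning only changes which parameters are handed to the optimizer.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pre-trained ResNet-50 and adapt it to a hypothetical 5-class target task.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

for param in model.parameters():                 # freeze the pre-trained backbone
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 5)    # new task-specific head (trainable)

# Only the head's parameters are optimized here (linear probing); for full fine-tuning,
# pass model.parameters() instead and use a smaller learning rate.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```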

Feature and Parameter Reuse

Feature extraction in transfer learning involves utilizing intermediate layers of a pre-trained source model as fixed feature representations for a new classifier on the target task, thereby avoiding the need for full retraining of the source network. This approach leverages the hierarchical nature of deep neural networks, where lower layers capture general features like edges and textures, while higher layers encode task-specific patterns. For instance, in convolutional neural networks (CNNs) pre-trained on large datasets such as ImageNet, embeddings from early to mid-level layers serve as robust inputs for downstream vision tasks, enabling effective transfer even to dissimilar domains. Seminal work has quantified this transferability, showing that features from the first two layers of an 8-layer CNN transfer almost perfectly across tasks, achieving accuracies comparable to training from scratch (e.g., top-1 accuracy of approximately 0.625 on similar datasets), while deeper layers exhibit greater specificity, with drops of up to 25% on dissimilar tasks like distinguishing man-made from natural objects. Parameter sharing represents another key method for reusing learned parameters across tasks or instances, promoting efficiency by constraining the model to learn shared representations. In architectures like Siamese networks, two identical subnetworks share all weights to compute similarity metrics, such as in one-shot image recognition, where the shared CNN backbone processes pairs of images to learn embeddings for comparison without task-specific retraining. This design reduces parameter redundancy and enhances generalization in few-shot scenarios by enforcing invariance to input variations. Similarly, in multi-task setups adapted for transfer, a common backbone (e.g., a shared CNN or transformer encoder) feeds into task-specific heads, allowing parameters from the source task's pre-training to be directly reused for multiple related targets, as demonstrated in early multi-task frameworks where shared lower layers improved performance across diverse related predictions. Hybrid approaches combine feature or parameter reuse with minimal additional training through modular components, such as adapter modules inserted into frozen pre-trained models. These adapters consist of small bottleneck layers—a down-projection to a low-dimensional space followed by a nonlinearity and up-projection—that are added after key operations like self-attention or feed-forward blocks in transformers, enabling task adaptation with only the adapter parameters being updated. Introduced for NLP, this method exemplifies parameter-efficient transfer by repurposing large models like BERT without altering their core weights. On benchmarks like GLUE, adapter tuning achieves a mean score of 80.0, within 0.4 points of full fine-tuning's 80.4, while adding just 3.6% more parameters to the base model. Other parameter-efficient techniques, such as low-rank adaptation (LoRA), further reduce trainable parameters by injecting low-rank matrices into attention layers, achieving comparable performance with even fewer updates and becoming widely adopted by 2025. Such reuse strategies yield significant efficiency gains, particularly in resource-constrained settings, by drastically reducing the number of trainable parameters compared to full model fine-tuning. For example, adapters can decrease the parameter footprint by two orders of magnitude relative to fine-tuning all layers of a large pre-trained model, effectively cutting trainable parameters by over 90% in cases like BERT-large adaptations, where only a fraction of the 340 million parameters (around 12 million) are optimized per task.
This not only lowers computational costs but also facilitates modular deployment, allowing multiple tasks to share a single frozen backbone with lightweight, swappable adapters.
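A minimal sketch of a bottleneck adapter of the kind described above, in PyTorch. The hidden size, bottleneck width, and the use of a generic frozen transformer block are illustrative assumptions rather than a reproduction of any specific published architecture.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual connection."""
    def __init__(self, hidden_dim, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual keeps the frozen path intact

# Usage sketch: freeze a pre-trained transformer block and train only the adapter.
hidden_dim = 768
adapter = Adapter(hidden_dim)
frozen_block = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=12, batch_first=True)
for p in frozen_block.parameters():
    p.requires_grad = False

x = torch.randn(2, 16, hidden_dim)            # (batch, sequence, hidden)
out = adapter(frozen_block(x))                 # gradients flow only into the adapter
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```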

Applications

Computer Vision

Transfer learning has revolutionized computer vision by enabling models pre-trained on large-scale datasets like ImageNet to adapt effectively to specialized tasks with limited data, addressing challenges such as domain shifts and data scarcity. In object detection, models such as YOLO are commonly fine-tuned from pre-training on the COCO dataset, allowing for efficient detection of objects in diverse environments; for instance, fine-tuning on vehicle-specific datasets has demonstrated robust performance in real-world scenarios with reduced training time. Similarly, for semantic segmentation, U-Net variants leverage transfer learning by initializing with weights from natural image pre-training and fine-tuning on task-specific data, achieving precise pixel-level predictions in applications like biomedical image analysis. In medical imaging, transferring knowledge from natural images to medical datasets mitigates the scarcity of labeled medical data, with pre-training on large natural image corpora enabling models to learn generalizable features for chest X-ray classification and segmentation, often performing comparably to or better than medical-specific pre-training on larger targets. Case studies highlight the practical impact of these approaches. Pre-training on ImageNet has been shown to boost accuracy on custom small datasets by 15-30% in classification tasks, particularly when fine-tuning with limited labels, by providing robust low-level features like edges and textures that generalize across domains. For domain adaptation, the Office-31 benchmark evaluates cross-dataset transfer, where techniques like adversarial alignment transfer knowledge from source (e.g., Amazon product images) to target domains (e.g., webcam photos), improving classification accuracy by aligning feature distributions and reducing domain discrepancy. These adaptations are crucial for scenarios with distribution shifts, such as varying lighting or viewpoints in office object recognition. In addition, advances in continual learning have further enhanced transfer learning in computer vision by enabling models to adapt to sequential tasks without catastrophic forgetting, as reviewed in recent surveys. Tailored techniques further enhance transfer in vision. Data augmentation strategies, including style transfer and synthetic data generation, help handle domain shifts by generating varied training samples that bridge source and target distributions, improving model robustness without additional labeling. Recent advances in vision-language models, such as CLIP, enable zero-shot transfer by aligning image and text embeddings during pre-training, allowing classification of unseen categories via natural-language prompts; extensions in 2023-2024, like CLIP-PING, have boosted lightweight models' zero-shot performance on downstream tasks. This impact extends to real-time applications like autonomous driving, where transfer learning from simulated or large-scale driving datasets to limited real-world data enables efficient perception systems for object detection and scene understanding, reducing the need for extensive annotations.
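Zero-shot transfer with a CLIP-style model can be sketched with the Hugging Face Transformers API. The checkpoint name, image file, and prompt list below are illustrative; the pattern simply scores an image against natural-language class descriptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot classification: compare one image against text prompts for candidate classes.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                       # illustrative local file
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a truck"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)         # one similarity score per prompt
print(dict(zip(prompts, probs[0].tolist())))
```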

Natural Language Processing

Transfer learning has transformed natural language processing (NLP) by allowing models pre-trained on vast unlabeled text corpora to adapt efficiently to downstream tasks, leveraging shared representations across domains. In NLP, this paradigm is prominently applied to tasks such as sentiment analysis, where models classify text polarity; machine translation, enabling translation between language pairs; and question answering, which involves extracting or generating responses from context. These applications benefit from pre-training paradigms like masked language modeling, followed by task-specific fine-tuning. A landmark case study is BERT, released in 2018, which pre-trains bidirectional encoders on BooksCorpus (800 million words) and English Wikipedia (2.5 billion words) using masked language modeling and next-sentence prediction objectives. Upon fine-tuning, BERT-LARGE established state-of-the-art performance on the GLUE benchmark, achieving an average score of 80.5%—a 7.7 percentage point absolute improvement over prior methods. Specifically, it excelled in sentiment analysis on the SST-2 dataset with 94.9% accuracy and in question answering on SQuAD v1.1 with 93.2 F1 score, demonstrating robust transfer to diverse tasks. The GPT series illustrates generative transfer learning in NLP, shifting focus from discriminative to autoregressive models. GPT-3, a 175-billion-parameter model pre-trained on about 410 billion tokens from diverse sources like Common Crawl, supports few-shot in-context learning for generative tasks without parameter updates. It achieved 85.0 F1 on the CoQA question-answering dataset in few-shot settings and strong scores in machine translation, such as a BLEU of 35.1 for Romanian-to-English, highlighting its ability to transfer broad linguistic knowledge to new generative applications like text completion and summarization. Cross-lingual adaptations extend transfer learning to low-resource languages, enabling models trained primarily on high-resource data like English to perform in underrepresented ones. Multilingual BERT (mBERT), pre-trained on monolingual corpora from 104 languages, facilitates zero-shot and fine-tuned transfer across linguistic families. For instance, mBERT fine-tuned on data from the MasakhaNER dataset reached 89.36 F1 for Swahili named entity recognition, outperforming traditional models by leveraging cross-lingual embeddings despite limited Swahili training data. Recent 2024 advances in NLP incorporate vision-text alignment into transfer learning frameworks. Multimodal large language models (MM-LLMs), such as LLaVA and BLIP-2, employ lightweight projectors (e.g., Q-Former) to align visual encoders like CLIP ViT with pre-trained LLMs, enabling instruction-tuned transfer for tasks integrating text and images, such as visual question answering, while preserving core generative capabilities. Overall, transfer learning democratizes NLP for underrepresented languages by drastically reducing data needs—often enabling viable performance with zero or few target-language examples. In low-resource languages, cross-lingual methods like mT5-xl with constrained decoding boost zero-shot NER F1 scores on datasets like MasakhaNER, making advanced tools accessible without extensive annotation efforts. In 2025, further advancements in instruction-finetuned multilingual LLMs have improved transfer for low-resource tasks.
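The standard fine-tuning workflow for an encoder like BERT can be sketched with the Hugging Face Transformers and Datasets libraries. The hyperparameters and output directory below are illustrative; this follows the common sequence-classification pattern rather than reproducing any specific published setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Fine-tune a pre-trained BERT encoder for binary sentiment classification on SST-2.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = load_dataset("glue", "sst2")
encoded = dataset.map(
    lambda ex: tokenizer(ex["sentence"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

args = TrainingArguments(output_dir="bert-sst2", learning_rate=2e-5,
                         per_device_train_batch_size=16, num_train_epochs=3)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"], eval_dataset=encoded["validation"])
trainer.train()
```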

Challenges and Limitations

Negative Transfer

Negative transfer refers to the phenomenon in transfer learning where the incorporation of knowledge from a source domain or task degrades the performance on the target domain or task, rather than improving it. This occurs primarily when the source and target domains are mismatched, such as through significant covariate shift, label shift, or concept shift, leading the model to overfit to irrelevant source-specific patterns that hinder generalization to the target. In the formalism of domains and tasks, negative transfer is exacerbated when the joint distribution of inputs and labels in the source P_S(X_S, Y_S) diverges substantially from that in the target P_T(X_T, Y_T), causing transferred representations to misalign with target requirements. A prominent example arises in computer vision, where models pretrained on natural images (e.g., ImageNet) and transferred to synthetic image datasets like VisDA or across domains in benchmarks like Office-31 exhibit negative transfer, with accuracy drops of up to 10-20% compared to target-only training in cases such as webcam to DSLR transfers, due to stylistic and distributional differences. Another case is in unsupervised domain adaptation on benchmarks like Office-31, where transferring from a source domain with unrelated categories (e.g., webcam images to DSLR) results in a negative transfer gap—defined as the difference between source-pretrained target performance and optimal baseline—quantifying the harm, often reaching negative values indicating worse outcomes than no transfer. To mitigate negative transfer, domain discrepancy measures such as the Maximum Mean Discrepancy (MMD) are employed to quantify and minimize distributional differences between source and target, enabling adaptive alignment only when similarity thresholds are met. Selective transfer techniques, like adversarial filtering to exclude harmful source samples, have been shown to recover performance losses, improving accuracy by 5-15% on affected benchmarks. Ensemble methods that combine multiple source models, weighting them based on predicted compatibility, further reduce risks by averaging out detrimental influences. Empirical studies reveal negative transfer as a pervasive issue, particularly in unsupervised settings, where it manifests in a significant portion of domain adaptation scenarios across over 20 evaluated algorithms on specialized benchmarks, underscoring the need for proactive detection.
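The MMD measure mentioned above can be estimated directly from samples. The following is a minimal sketch of the biased RBF-kernel estimator, with the kernel bandwidth left as an illustrative parameter; a large value between source and target features signals a pronounced domain gap and a higher risk of negative transfer.

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Squared maximum mean discrepancy between samples X and Y under an RBF kernel."""
    def kernel(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * sq)
    k_xx, k_yy, k_xy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    return k_xx.mean() + k_yy.mean() - 2 * k_xy.mean()
```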

Evaluation and Scalability Issues

Evaluating transfer learning models poses significant challenges due to the limited availability of standardized benchmarks beyond well-known datasets like GLUE for natural language processing and the Office dataset for domain adaptation in computer vision. While GLUE provides a multi-task evaluation framework for assessing generalization across NLP tasks, it has been criticized for not fully capturing out-of-distribution robustness, leading to the development of extensions like GLUE-X to address these gaps. Similarly, the Office dataset, which evaluates domain shifts across office environments, lacks breadth for diverse real-world scenarios, complicating fair comparisons and hindering the identification of robust transfer methods. Cross-validation in shifted domains exacerbates these issues, as traditional splits often fail to account for distribution mismatches between source and target data, resulting in overly optimistic performance estimates that do not generalize well. Scalability remains a core concern in transfer learning, particularly for pre-training large models, where computational demands can be prohibitive. For instance, pre-training GPT-3, with 175 billion parameters, required approximately 3.14 × 10^23 floating-point operations, far exceeding the resources available to most researchers and organizations. This high compute cost not only limits accessibility but also raises environmental concerns due to the energy consumption involved. In federated transfer learning scenarios, where models are adapted across decentralized devices, data privacy adds further complexity, as sharing model updates must comply with regulations like GDPR while preventing leakage of sensitive source data. Additional issues include catastrophic forgetting during fine-tuning, where adapting a pre-trained model to a new task erodes performance on the original tasks, and bias amplification from source data, which can propagate and intensify unfair representations in the target domain. Catastrophic forgetting arises because fine-tuning overwrites shared parameters critical to prior knowledge, as observed in deep transfer learning settings where source-task accuracy drops significantly post-adaptation. Bias amplification occurs when spurious correlations in the source dataset, such as demographic imbalances, persist or worsen in the transferred model, even if the target data is debiased, leading to unreliable downstream applications. To mitigate these challenges, techniques like efficient adapter modules and knowledge distillation offer practical solutions for scalability and evaluation. Adapter modules insert lightweight, task-specific layers into pre-trained models, adding only a fraction of the parameters (e.g., 0.5-3% per task) while preserving overall performance, thus enabling faster adaptation without full retraining. Knowledge distillation compresses large teacher models into smaller student versions by transferring softened output distributions, reducing model size by up to 90% in transfer settings while maintaining accuracy, as demonstrated in vision-language tasks. These approaches facilitate more reliable evaluation by allowing experimentation on resource-constrained setups and help scale transfer learning to broader applications.
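The knowledge distillation objective described above is commonly implemented as a weighted blend of a soft term against the teacher's softened outputs and the usual hard-label loss. The temperature and mixing weight below are illustrative defaults, not values prescribed by any particular paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a KL term on temperature-softened teacher outputs with the hard-label loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # scale by T^2 to keep gradient magnitudes
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```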

Future Directions

Recent Advances

In 2025, advancements in statistical learning emphasized the development of specialized data structures to handle distribution shifts more effectively, as detailed in a comprehensive review that categorizes challenges into model-based and data-based approaches while introducing resolution techniques for typical methods. Surveys on cross-dataset visual recognition have highlighted problem-oriented methods, both shallow and deep, to improve performance across diverse visual datasets by addressing dataset mismatches. In 2025, transfer learning in robotics gained traction through reviews that unified the paradigm under taxonomies considering morphology, task complexity, and data modalities, enabling efficient reuse of prior experiences to accelerate adaptation without starting from scratch. In clinical prediction tasks, such as hospital-specific post-discharge mortality estimation, latent transfer learning frameworks demonstrated reductions in estimation errors by incorporating multi-source data, achieving efficiency gains through decreased standard errors compared to isolated models. In 2025, transfer learning extended to chemistry with approaches leveraging custom-tailored virtual molecular databases to predict catalytic activity in real-world photosensitizers, enhancing model generalization from simulated to experimental data. A survey further explored the integration of transfer learning with large language models in medical systems, showcasing applications in diagnostics and patient management that boost performance in data-scarce healthcare scenarios. Key theoretical contributions included analyses from statistical mechanics, developing effective theories for transfer in fully connected neural networks via Franz-Parisi formalisms to quantify generalization boosts in the proportional limit. One prominent emerging trend in transfer learning is the rise of foundation models, particularly multimodal variants that integrate diverse data types such as text, images, and video to enable more robust knowledge transfer across domains. Models like Flamingo exemplify this shift, leveraging large-scale pre-training on interleaved image-text corpora to achieve few-shot capabilities, thereby reducing the need for extensive task-specific data. This approach has extended to biological applications, where multi-modal transfer learning connects modalities like DNA, RNA, and proteins, facilitating cross-domain adaptations in scientific modeling. Another key trend involves federated and privacy-preserving transfer learning, which allows collaborative model training across distributed devices without sharing raw data, addressing growing concerns over data privacy in sensitive sectors like healthcare and finance. Techniques such as homomorphic encryption and selective knowledge sharing in federated settings have demonstrated improved performance while maintaining privacy in resource-constrained environments. Complementing this is the advancement in continual and lifelong learning paradigms, which mitigate catastrophic forgetting by enabling continuous adaptation to new tasks while retaining prior knowledge, as seen in neural architectures that balance plasticity and stability for sequential learning scenarios. Open questions persist in handling extreme domain shifts, where models struggle with significant distributional mismatches, such as transferring from simulated to real-world environments, often leading to performance degradation without adaptive alignment strategies. Ethical biases in transferred models represent another critical challenge, as pre-trained representations can propagate societal inequities into downstream applications like medical diagnostics, necessitating bias-detection frameworks integrated into transfer pipelines. Scaling to edge devices remains unresolved, with computational overhead limiting deployment on low-resource hardware despite promising federated-transfer approaches. Looking ahead, the integration of transfer learning with quantum machine learning holds potential for exponential speedups in high-dimensional tasks, as hybrid quantum-classical architectures enable robust knowledge transfer in adversarial settings. Auto-transfer systems, which automate source selection and adaptation, are gaining traction for streamlining deployment, with algorithms like automated broad-transfer learning showing efficacy in cross-domain fault diagnosis by dynamically aligning features without manual intervention. Research gaps include the absence of a unified theory for avoiding negative transfer, where source knowledge hinders target performance, as current methods like feature alignment provide empirical fixes but lack theoretical guarantees for generalizability. Additionally, standardized benchmarks for 2025+ large language models in transfer scenarios are underdeveloped, with existing evaluations like ECLeKTic highlighting needs for cross-lingual knowledge transfer metrics to assess long-term adaptability beyond baselines.

References

  1. A Survey on Transfer Learning.
  2. A Comprehensive Survey on Transfer Learning. arXiv:1911.02685, Nov 7, 2019.
  3. Yosinski, Jason; Clune, Jeff; Bengio, Yoshua; Lipson, Hod. How transferable are features in deep neural networks? arXiv, Nov 6, 2014.
  4. A Survey on Transfer Learning. IEEE Journals & Magazine, Oct 16, 2009.
  5. A survey of transfer learning. Journal of Big Data, May 28, 2016.
  6. Deep learning in computer vision: A critical review of emerging ... Dec 15, 2021.
  7. Pre-training on Grayscale ImageNet Improves Medical Image ...
  8. Thorndike, E. L.; Woodworth, R. S. (1901). The influence of improvement in one mental function upon the efficiency of other functions (I).
  9. A Review of Transfer Theories and Effective Instructional Practices.
  10. Reminder of the First Paper on Transfer Learning in Neural ... (work from the 1970s and early 1980s, first published in 1976).
  11. Discriminability-Based Transfer between Neural Networks.
  12. Transfer Learning for Reinforcement Learning Domains: A Survey.
  13. Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. ImageNet Classification with Deep Convolutional Neural Networks.
  14. Transfer Learning - Machine Learning's Next Frontier. ruder.io, Mar 21, 2017.
  15. BERT: Pre-training of Deep Bidirectional Transformers ... arXiv:1810.04805, Oct 11, 2018.
  16. An Image is Worth 16x16 Words: Transformers ... arXiv:2010.11929, Oct 22, 2020.
  17. A Comprehensive Survey on Transfer Learning. arXiv.
  18. Unsupervised Visual Domain Adaptation Using Subspace Alignment.
  19. Domain-Adversarial Training of Neural Networks.
  20. A Unified View of Label Shift Estimation. NeurIPS.
  21. Huyen, Chip. Data Distribution Shifts and Monitoring. Feb 7, 2022.
  22. Boosting for transfer learning. Proceedings of the 24th International Conference on Machine Learning.
  23. A theory of learning from different domains. Machine Learning, Oct 23, 2009.
  24. A Survey on Negative Transfer. arXiv, Aug 9, 2021.
  25. ImageNet Large Scale Visual Recognition Challenge. arXiv, Sep 1, 2014.
  26. Fine-Tuning can Distort Pretrained Features and Underperform Out ... Feb 21, 2022.
  27. Universal Language Model Fine-tuning for Text Classification. arXiv, Jan 18, 2018.
  28. Transformers. Hugging Face documentation.
  29. Siamese Neural Networks for One-shot Image Recognition.
  30. Parameter-Efficient Transfer Learning for NLP. arXiv:1902.00751, Feb 2, 2019.
  31. Deep Learning-based Bio-Medical Image Segmentation using UNet ... May 24, 2023.
  32. Effect of Pre-Training Scale on Intra- and Inter-Domain Full and Few ... May 31, 2021.
  33. Do Better ImageNet Models Transfer Better? arXiv:1805.08974, May 23, 2018.
  34. Accelerating Deep Unsupervised Domain Adaptation with Transfer ... Mar 25, 2019.
  35. Learning Transferable Visual Models From Natural Language ... Feb 26, 2021.
  36. Language Models are Few-Shot Learners. arXiv:2005.14165, May 28, 2020.
  37. How multilingual is Multilingual BERT? arXiv:1906.01502, Jun 4, 2019.
  38. Cross-Lingual Transfer for Low-Resource Natural Language ... arXiv, Feb 4, 2025.
  39. A Survey on Negative Transfer. arXiv:2009.00909.
  40. Characterizing and Avoiding Negative Transfer. CVF Open Access.
  41. A study of the effects of negative transfer on deep unsupervised ... Apr 1, 2021.
  42. A Survey on Negative Transfer.
  43. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural ... Apr 20, 2018.
  44. GLUE-X: Evaluating Natural Language Understanding Models from ... Jul 9, 2023.
  45. Validity Challenges in Machine Learning Benchmarks. Aug 3, 2022.
  47. When does Bias Transfer in Transfer Learning? arXiv:2207.02842, Jul 6, 2022.
  49. Recent Advances in Transfer Learning for Cross-Dataset Visual ...
  50. Transfer learning in robotics: An upcoming breakthrough? A review ... Sep 13, 2024.
  51. A latent transfer learning method for estimating hospital-specific post ... Nov 8, 2024.
  52. Transfer learning from custom-tailored virtual molecular databases ... Oct 1, 2025.
  53. A survey on the applications of transfer learning to enhance the ... Jun 5, 2025.
  54. Statistical Mechanics of Transfer Learning in Fully Connected ...
  55. Privacy-preserving Heterogeneous Federated Transfer Learning.
  56. Continual lifelong learning with neural networks: A review.
  57. Ethical and Bias Considerations in Artificial Intelligence/Machine ...
  58. Using Transfer Learning in Building Federated Learning Models on ...
  59. Adversarially Robust Quantum Transfer Learning. arXiv:2510.16301, Oct 18, 2025.
  60. Automated broad transfer learning for cross-domain fault diagnosis.
  61. ECLeKTic: A novel benchmark for evaluating cross-lingual ...