
Multi-task learning

Multi-task learning (MTL) is a paradigm in which multiple related tasks are trained simultaneously using a shared model, leveraging commonalities such as shared representations to improve generalization and performance across all tasks. Introduced in the 1990s, MTL originated as an approach to inductive transfer that enhances learning for a primary task by incorporating training signals from auxiliary tasks, often implemented through architectures like neural networks with shared hidden layers. Key benefits of MTL include improved data efficiency, since related tasks effectively amplify the available training signal; reduced overfitting via shared parameters that regularize the model; and faster convergence compared to single-task learning. In practice, MTL has been reported to reduce error rates by 5-25% in domains such as medical prediction, with gains equivalent to a 50-100% increase in training data volume. It works by combining error gradients from multiple tasks into a unified optimization process, often using a weighted loss such as (1 - \lambda) \times \text{Main Task Loss} + \sum (\lambda \times \text{Auxiliary Task Loss}), where \lambda balances task contributions. MTL architectures typically feature shared encoders for extracting common features followed by task-specific decoders, enabling applications in diverse fields. Notable modern uses span computer vision (e.g., joint object detection and segmentation in autonomous driving), natural language processing (e.g., simultaneous machine translation and summarization), healthcare (e.g., multi-outcome prognosis), and recommender systems (e.g., predicting ratings and clicks). Recent advancements, particularly with deep neural networks since the 2010s, address challenges like task imbalance through techniques such as gradient surgery and dynamic weighting, while integration with pre-trained foundation models has further amplified its scalability in the 2020s.

Overview

Definition and Core Concepts

Multi-task learning (MTL) is a subfield of machine learning in which a model is trained to simultaneously solve multiple related tasks, leveraging shared representations or parameters to exploit interdependencies among the tasks for improved performance and generalization. In this paradigm, the model learns a unified representation from data across all tasks, allowing knowledge transfer that enhances the learning of each individual task compared to training them in isolation. This approach contrasts with single-task learning, where separate models are developed independently for each task, potentially leading to redundant computations and missed opportunities for cross-task synergies. Core concepts in MTL revolve around shared representations, such as common feature extractors that capture underlying patterns beneficial to multiple tasks; auxiliary tasks, which serve as supportive problems to regularize the model and provide additional supervisory signals; and inductive bias derived from task relatedness, which guides the learner toward hypotheses that generalize better across the tasks. For instance, in a shared representation setup, an initial layer might extract general features like edges in images, which are then branched into task-specific heads for classification or regression, as opposed to fully independent models that duplicate such foundational learning. This structure effectively acts as implicit data augmentation by amplifying the training signals through related tasks, increasing the effective sample size and reducing overfitting without requiring additional labeled data for the primary task. MTL differs from related paradigms like transfer learning, where knowledge is sequentially transferred from a source task to a target task after pre-training, whereas MTL emphasizes joint training of all tasks from the outset to enable bidirectional knowledge sharing. The foundational formalization of MTL traces back to Rich Caruana's work in the 1990s, which introduced the idea of using related tasks to impose a beneficial inductive bias, thereby improving generalization through the implicit augmentation of training data via cross-task signals.

Historical Development and Motivation

Multi-task learning (MTL) emerged in the late 1990s as a technique to enhance generalization in machine learning by jointly training models on multiple related tasks, drawing inspiration from how humans learn interconnected skills. The foundational work by Rich Caruana in 1997 formalized MTL, demonstrating its potential through shared representations in neural networks to leverage domain-specific information from auxiliary tasks, particularly in data-scarce environments. Early efforts focused on shallow models and kernel-based paradigms, with key advancements including regularized formulations for feature sharing in the mid-2000s, such as those by Evgeniou and Pontil (2004), which used kernel methods to capture task correlations. The field experienced a resurgence in the 2010s, driven by the deep learning revolution following the success of convolutional neural networks around 2012. Researchers began integrating MTL into deep architectures, emphasizing shared encoders to exploit hierarchical representations across tasks; for instance, Misra et al. (2016) introduced cross-stitch units for adaptive parameter sharing in vision tasks, while Luong et al. (2016) applied shared encoders in sequence-to-sequence models for machine translation. Between 2015 and 2020, MTL also became intertwined with transfer learning, exemplified by Taskonomy (Zamir et al., 2018), which pretrained models on diverse visual tasks to enable efficient downstream transfer. In the 2020s, MTL has evolved alongside foundation models, incorporating multi-task pretraining for vision-language tasks; notable examples include the 12-in-1 model by Lu et al. (2020), which unified multiple vision-and-language objectives, and extensions of CLIP-like architectures such as M2-CLIP (2024), which use adapters for multi-task video understanding. Recent advances (2023–2025) emphasize scalable multi-task learning in pretrained models like CLIP variants and mPLUG-2, enabling joint learning across text, image, and video modalities for large-scale systems. The primary motivations for MTL include improved generalization via shared inductive biases, reduced overfitting through auxiliary tasks, enhanced efficiency in low-data regimes, and computational efficiency for complex systems; empirical studies on early benchmarks, such as sentiment and text classification tasks, report relative error reductions of 10–20% compared to single-task learning. This evolution was propelled by the shift from shallow to deep neural networks post-2012, deeper integration with transfer learning, and the rise of foundation models handling multiple modalities. Early work was largely confined to supervised tasks, but by the 2010s, expansions to semi-supervised and reinforcement learning paradigms addressed these gaps, broadening MTL's applicability.

Methods

Task Relationship Modeling

Task relationship modeling in multi-task learning involves identifying and quantifying dependencies among tasks to guide the design of shared representations and avoid negative transfer. This foundational step enables the selective sharing of knowledge between similar tasks while isolating dissimilar ones, improving overall generalization. Approaches typically begin by analyzing task similarities through data-driven metrics, followed by clustering or modeling to exploit overlaps. Seminal work in this area, such as the use of Dirichlet process priors for inferring task clusters, demonstrated that grouping related tasks can enhance predictive performance by capturing latent structures in task relatedness. Task grouping methods cluster tasks based on similarity measures derived from task embeddings or correlation matrices, allowing joint training within clusters to leverage shared patterns. For instance, hierarchical clustering algorithms applied to multi-task settings, introduced in the mid-2000s, use nonparametric Bayesian models to automatically determine the number of clusters and assign tasks accordingly, as in the infinite relational model, which treats tasks as nodes in a relational graph and infers cluster assignments via posterior sampling. These methods compute similarities from task outputs or gradients, grouping tasks with high similarity to form sub-networks for joint training. In practice, such clustering has been shown to reduce negative transfer in datasets with heterogeneous tasks by limiting interference from unrelated groups. Overlap exploitation techniques model shared structure between tasks using low-rank approximations of task covariances, assuming that related tasks lie in a low-dimensional manifold. A key approach regularizes the joint parameter matrix across tasks to enforce low-rank structure, capturing correlations via nuclear norm penalties on the matrix of task predictors. This allows decomposition of task-specific parameters into shared low-rank components plus sparse individual deviations, effectively modeling subspace overlaps. For example, in computer vision, tasks like semantic segmentation and object detection exhibit overlap in feature representations for edge and region detection, where low-rank modeling groups them to share convolutional filters, leading to improved accuracy on benchmarks like PASCAL VOC over single-task baselines. Strategies for handling unrelated or negatively correlated tasks treat them as regularizers to enhance robustness, preventing overfitting in joint optimization. In empirical studies, including negatively correlated tasks in multi-task frameworks was found to act as implicit noise injection, improving generalization on held-out data in scenarios with task conflicts, as evidenced in regularization-based relation learning that assigns negative weights to dissimilar pairs. This approach uses adaptive penalties to downweight negative influences during training, ensuring that unrelated tasks contribute to regularization without dominating shared parameters. Evidence from synthetic and real-world datasets, such as gene expression prediction, shows that such inclusion mitigates overfitting in high-dimensional settings. Metrics for task relationships include kernel distances and mutual information, which quantify similarity without assuming specific model architectures. Kernel distances, derived from kernel methods, measure divergence between task covariance kernels as the Frobenius norm of their difference, providing a kernel-based similarity score. Mutual information, estimated via kernel density approximations, captures nonlinear dependencies between task outputs. The following snippet illustrates computing a correlation-based task similarity matrix, a simple precursor to these metrics:
import numpy as np

def compute_task_similarity(task_outputs):
    # task_outputs: list of arrays, each of shape (n_samples, n_features) for a task;
    # arrays are assumed to flatten to equal lengths so correlations are well defined
    n_tasks = len(task_outputs)
    similarity_matrix = np.zeros((n_tasks, n_tasks))
    for i in range(n_tasks):
        for j in range(i + 1, n_tasks):
            corr = np.corrcoef(task_outputs[i].flatten(), task_outputs[j].flatten())[0, 1]
            similarity_matrix[i, j] = similarity_matrix[j, i] = abs(corr)  # absolute value for grouping
    np.fill_diagonal(similarity_matrix, 1.0)  # each task is maximally similar to itself
    return similarity_matrix
These metrics enable preprocessing steps like thresholding the similarity matrix for task grouping, with kernel-based methods proving effective in clustering formulations for MTL setups. Mutual information complements them by handling non-Gaussian dependencies, as shown in dependence-maximizing frameworks.
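Building on the similarity matrix above, tasks can be grouped by hierarchical clustering at a chosen similarity threshold; the following sketch relies on SciPy's clustering utilities, and the 0.5 distance threshold is an arbitrary illustration rather than a recommended value:
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def group_tasks(similarity_matrix, distance_threshold=0.5):
    # Convert similarity into a distance matrix (1 - similarity) and condense it
    distance_matrix = 1.0 - similarity_matrix
    np.fill_diagonal(distance_matrix, 0.0)
    condensed = squareform(distance_matrix, checks=False)
    # Average-linkage hierarchical clustering over the task distance matrix
    linkage_matrix = linkage(condensed, method='average')
    # Cut the dendrogram at the threshold to obtain task group labels
    return fcluster(linkage_matrix, t=distance_threshold, criterion='distance')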

Knowledge Transfer Techniques

In multi-task learning, knowledge transfer techniques facilitate the sharing of learned representations and parameters across related tasks to improve generalization and data efficiency. These methods, evolving from early regularization approaches in the 2000s to deep learning adaptations, emphasize architectural designs that balance task independence and interdependence without requiring explicit task groupings. Parameter sharing is a foundational technique for such transfer, where components of the model are jointly optimized to capture commonalities. Hard parameter sharing employs a shared "trunk" of layers, typically convolutional or recurrent, followed by task-specific heads, as popularized in deep multi-task architectures with multiple output heads around 2016. This approach reduces overfitting compared to task-independent models, particularly when tasks share low-level features such as edges in vision tasks. Soft parameter sharing, in contrast, assigns separate parameter sets to each task but induces similarity via regularized constraints on parameter differences, allowing flexibility for loosely related tasks while promoting alignment. Architectures like cross-stitch networks exemplify this by learning task-specific combinations of shared activations, enhancing transfer without full parameter fusion. Regularization-based transfer enforces low-rank structures or predictive consistency across tasks to prevent negative transfer. Trace norm regularization, a seminal method from the late 2000s, promotes low-rank weight matrices across tasks by penalizing the nuclear norm of concatenated task parameters, enabling sparse data regimes to leverage task correlations effectively. In deep learning variants, adaptations like sign dropout extend traditional dropout by selectively masking gradients based on task relations, mitigating interference in multi-task settings during the shift to neural networks. Cross-task distillation further supports transfer by using predictions from one task as soft labels to guide another, as demonstrated in multi-task recommendation systems where auxiliary task outputs distill knowledge to primary tasks, improving accuracy without additional data. Auxiliary task design involves introducing synthetic or self-supervised tasks to enrich representations for primary objectives, a strategy dating to the early 1990s but refined in deep learning. For instance, reconstruction tasks as auxiliaries for classification compel models to learn robust features by predicting input reconstructions alongside labels, boosting primary-task performance on benchmarks. Recent setups, such as hierarchical frameworks combining classification and regression, use auxiliary cognitive tasks to enhance main diagnostic goals in healthcare applications. For non-stationary tasks where distributions evolve over time, continual multi-task learning employs replay buffers to mitigate catastrophic forgetting, a line of work that gained traction post-2018. These buffers store exemplars from prior tasks, replaying them during training on new tasks to preserve knowledge; methods like CLEAR use experience replay to significantly reduce forgetting in sequential benchmarks. Curiosity-driven variants further prioritize diverse buffer samples, supporting efficient adaptation in dynamic environments without full retraining. Recent advancements as of 2025 include integration with large pretrained models for scalable continual learning in transformer-based architectures.
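The contrast between hard and soft parameter sharing can be made concrete with a brief PyTorch sketch of soft sharing, in which each task keeps its own encoder and an L2 penalty on the differences between corresponding weights softly couples them (module names and the penalty coefficient are illustrative, not drawn from a specific library):
import torch
import torch.nn as nn

class SoftSharedEncoders(nn.Module):
    # Soft parameter sharing: one encoder per task, coupled by a penalty on weight differences
    def __init__(self, in_dim=784, hidden=128, n_tasks=2):
        super().__init__()
        self.encoders = nn.ModuleList([nn.Linear(in_dim, hidden) for _ in range(n_tasks)])

    def forward(self, x, task_id):
        return torch.relu(self.encoders[task_id](x))

    def sharing_penalty(self):
        # Sum of squared differences between each pair of task encoders' weights
        penalty = 0.0
        for i in range(len(self.encoders)):
            for j in range(i + 1, len(self.encoders)):
                penalty = penalty + (self.encoders[i].weight - self.encoders[j].weight).pow(2).sum()
        return penalty

# Usage sketch: total_loss = sum(task_losses) + 1e-3 * model.sharing_penalty()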

Optimization and Learning Paradigms

In multi-task learning, optimization typically involves minimizing a joint objective that combines losses from multiple tasks, often through a weighted sum to balance their contributions during training. The basic formulation employs static weights, but dynamic weighting schemes adapt these based on task-specific characteristics to prevent dominant tasks from overshadowing others. A prominent approach introduces task uncertainty as a learnable parameter, where the weight for each task's loss is inversely proportional to its homoscedastic uncertainty, modeled as w_i = \frac{1}{2\sigma_i^2} for task i, with \sigma_i optimized alongside model parameters via gradient descent. This method, applied to scene geometry and semantics tasks, improves performance by automatically scaling losses according to their noise levels, achieving relative error reductions of up to 25% on depth benchmarks compared to equal weighting. The following pseudocode illustrates the forward pass and loss computation for uncertainty-weighted multi-task optimization:
for each batch in training data:
    total_loss = 0
    for each task i in tasks:
        predictions_i = model(batch_inputs)[task_i]
        loss_i = task_loss_i(predictions_i, batch_targets_i)
        weighted_loss_i = loss_i * exp(-2 * log_sigma_i) / 2   # equivalent to loss_i / (2 * sigma_i^2)
        total_loss += weighted_loss_i + log_sigma_i            # log-sigma term penalizes inflated uncertainties
    optimizer.step(total_loss)
This framework extends standard weighted-sum optimization by incorporating uncertainty estimation, enabling robust training across heterogeneous tasks. Bayesian paradigms in multi-task optimization leverage probabilistic models to capture task correlations and uncertainties, particularly through multi-task Gaussian processes (MTGPs) that share kernels across tasks for efficient inference. MTGPs model outputs as draws from a shared Gaussian process prior, allowing knowledge transfer via coregionalization or intrinsic coregionalization models, which has been shown to outperform single-task GPs in predictive accuracy on synthetic and real-world datasets from 2015 onward. For hyperparameter tuning, Bayesian optimization extends to multi-task settings by treating tasks as dimensions in a joint acquisition function, such as multi-task expected improvement, facilitating shared exploration of hyperparameters like learning rates across tasks and reducing tuning time by up to 50%. These developments, spanning 2015-2020, emphasize scalable approximations like sparse inducing points to handle high-dimensional data. Evolutionary methods address multi-task optimization by evolving populations across multiple fitness landscapes simultaneously, exploiting inter-task synergies through genetic algorithms. Multifactorial optimization, introduced post-2016, represents individuals with scalar fitness factors for each task, enabling implicit parallel search and knowledge transfer via crossover between similar tasks, as demonstrated in benchmark suites where it achieves convergence speeds 2-5 times faster than single-task evolutionary algorithms on constrained problems. This models the joint problem as a multifactorial evolutionary process, where the overall fitness combines task-specific factors, promoting adaptive transfer in dynamic environments. Game-theoretic paradigms frame multi-task optimization as a cooperative game among tasks, seeking equilibria that balance individual and joint objectives. Inspired by Nash equilibrium concepts, recent works from the 2020s treat tasks as agents in a cooperative game, optimizing shared parameters to reach stable points where no task can unilaterally improve its loss without harming others, with applications to deep multi-task settings. These approaches use techniques like policy gradient ascent on a game payoff matrix to enforce cooperative balancing, particularly effective in heterogeneous scenarios like vision-language tasks. Recent advances include interleaved training regimes that alternate between tasks based on learning progress, mimicking human cognitive switching to enhance retention in continual learning. A 2025 method modulates interleaving via energy-based learning progress, where task selection probability is proportional to a free-energy estimate of improvement, reducing catastrophic forgetting on sequential benchmarks while adapting to heterogeneous tasks through dynamic scheduling. This energy-modulated approach prioritizes tasks with high marginal gains, integrating seamlessly with existing optimizers for efficient deployment in resource-constrained settings.
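A runnable PyTorch counterpart to the uncertainty-weighting pseudocode shown earlier in this section (a minimal sketch of the homoscedastic-uncertainty scheme; the class and variable names are illustrative rather than drawn from a specific library):
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    # Learns one log-sigma per task; each task loss is weighted by 1 / (2 * sigma^2)
    # and a log-sigma term is added as a regularizer, as in the scheme described above.
    def __init__(self, n_tasks):
        super().__init__()
        self.log_sigmas = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, task_losses):
        # task_losses: iterable of scalar loss tensors, one per task
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-2.0 * self.log_sigmas[i])
            total = total + 0.5 * precision * loss + self.log_sigmas[i]
        return total

# Usage sketch: optimize model parameters and log-sigmas jointly
# criterion = UncertaintyWeightedLoss(n_tasks=2)
# optimizer = torch.optim.Adam(list(model.parameters()) + list(criterion.parameters()))
# total_loss = criterion([loss_task1, loss_task2]); total_loss.backward(); optimizer.step()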

Mathematical Foundations

General Problem Formulation

Multi-task learning (MTL) extends the single-task learning paradigm by jointly optimizing multiple related tasks to leverage shared information, improving generalization across all tasks. In single-task learning, the objective is to minimize a regularized empirical risk L(\theta) + \Omega(\theta), where \theta represents the model parameters, L(\theta) is the empirical loss on task-specific data, and \Omega(\theta) is a regularizer to prevent overfitting. MTL generalizes this to T tasks by introducing shared parameters \theta and task-specific components, formulating the problem as minimizing a composite objective \mathcal{L}(\theta) = \sum_{t=1}^T w_t L_t(\theta) + \Omega(\theta), where L_t(\theta) = \frac{1}{n_t} \sum_{j=1}^{n_t} \ell(y_{tj}, f_t(x_{tj}; \theta)) is the average loss for task t over its dataset D_t = \{(x_{tj}, y_{tj})\}_{j=1}^{n_t}, \ell is a task-specific loss function (e.g., squared error or cross-entropy), f_t maps inputs to outputs for task t, and w_t \geq 0 are weights balancing task contributions (often set to 1 for equal weighting). This joint optimization assumes tasks share a common representation, extending scalar-valued functions (single output) to vector-valued mappings across tasks without assuming independent per-task structures. The tasks in MTL are assumed to be related through a shared latent structure, such as common input features or underlying representations that capture domain-specific patterns across the T tasks. Formally, each task t defines an input-output mapping from \mathcal{X}_t to \mathcal{Y}_t, but homogeneity is often imposed where \mathcal{X}_t = \mathcal{X} for all t to enable parameter sharing; heterogeneous cases align features via transformations. This relatedness is crucial, as unrelated tasks can lead to interference rather than transfer, but the formulation exploits correlations in the joint data distribution to induce a beneficial bias in \theta. Evaluation in MTL combines task-specific metrics, such as mean squared error for regression or accuracy for classification on held-out data per task, with MTL-specific measures like the avoidance of negative transfer, where performance on a target task degrades due to joint training with dissimilar tasks. To see the role of gradients in optimization, consider the gradient of the composite objective: \nabla_\theta \mathcal{L}(\theta) = \sum_{t=1}^T w_t \nabla_\theta L_t(\theta), which aggregates task gradients scaled by w_t; unbalanced gradients can cause dominant tasks to overshadow others, leading to suboptimal convergence. Techniques like dynamic loss weighting adjust w_t to normalize gradient magnitudes, ensuring equitable updates across tasks and mitigating negative transfer.
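To make the gradient aggregation concrete, the following sketch computes per-task gradients with respect to the shared parameters and rescales them to comparable magnitudes before summing; it assumes PyTorch and a list shared_params of shared parameters, and the normalization rule is a simplified illustration rather than a specific published method:
import torch

def balanced_gradient_step(task_losses, shared_params, optimizer, eps=1e-8):
    # Compute each task's gradient with respect to the shared parameters separately
    per_task_grads = []
    for loss in task_losses:
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True)
        per_task_grads.append(grads)

    # Normalize each task's gradient per parameter and sum the contributions
    optimizer.zero_grad()
    for param_idx, param in enumerate(shared_params):
        combined = torch.zeros_like(param)
        for grads in per_task_grads:
            g = grads[param_idx]
            combined += g / (g.norm() + eps)
        param.grad = combined
    optimizer.step()  # task-specific heads would be updated separately in a full loop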

Vector-Valued Reproducing Kernel Hilbert Spaces

Vector-valued reproducing kernel Hilbert spaces (RKHS) provide a functional analytic framework for multi-task learning (MTL) by extending scalar-valued RKHS to handle vector-valued outputs, enabling the modeling of multiple related tasks within a single Hilbert space of functions. Formally, a vector-valued RKHS \mathcal{H}_K consists of functions f: \mathcal{X} \to \mathbb{R}^T, where \mathcal{X} is the input space and T denotes the number of tasks, equipped with a matrix-valued kernel K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{T \times T} that is positive semi-definite, meaning for any n, points x_1, \dots, x_n \in \mathcal{X}, and vectors c_1, \dots, c_n \in \mathbb{R}^T, the inequality \sum_{i,j=1}^n c_i^\top K(x_i, x_j) c_j \geq 0 holds. The kernel K induces an inner product on \mathcal{H}_K such that the space is complete, and the reproducing property states that for any f \in \mathcal{H}_K, x \in \mathcal{X}, and v \in \mathbb{R}^T, \langle f(x), v \rangle_{\mathbb{R}^T} = \langle f, K(x, \cdot) v \rangle_{\mathcal{H}_K}, allowing point evaluations via inner products with kernel sections. A common construction for vector-valued kernels in MTL is the separable kernel, particularly when tasks share identical input structures, given by K(x, y) = k(x, y) I_T, where k: \mathcal{X} \times \mathcal{X} \to \mathbb{R} is a positive definite scalar kernel (e.g., Gaussian or linear) and I_T is the T \times T identity matrix. This form assumes task independence in the output space while leveraging shared input representations, leading to an RKHS where functions decompose as f(x) = \sum_{t=1}^T f_t(x) e_t with each f_t \in \mathcal{H}_k, the scalar RKHS induced by k. The eigenvalue decomposition of the scalar kernel facilitates analysis; for instance, the Mercer decomposition k(x, y) = \sum_{i=1}^\infty \lambda_i \phi_i(x) \phi_i(y) extends to the vector-valued case, yielding an orthonormal basis for \mathcal{H}_K with eigenvalues \lambda_i I_T, which simplifies regularization and bounds on function norms. More general separable kernels incorporate task correlations via K(x, y) = k(x, y) B, where B \succeq 0 is a fixed task covariance matrix, capturing prior beliefs about task relatedness. To incorporate known task structures, such as prior covariances between tasks, vector-valued kernels can be designed using sums of separable forms, K(x, y) = \sum_{q=1}^Q B_q k_q(x, y), where each B_q \succeq 0 models a latent component of task correlation and the k_q are scalar kernels. This aligns with the linear model of coregionalization, where task outputs are linear combinations of shared latent functions. Kronecker products further enable efficient kernel construction, particularly for the joint Gram matrix over training points X = \{x_i\}_{i=1}^n, as K(X, X) = B \otimes k(X, X), reducing inversion costs from O((nT)^3) to O(n^3 + T^3) via eigendecomposition of B and k(X, X). Such designs allow encoding prior knowledge, like hierarchical task relations, directly into the kernel. Learning in vector-valued RKHS involves joint optimization over functions and task relations, typically minimizing empirical risk plus a regularizer, such as \min_{f \in \mathcal{H}_K} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) + \lambda \|f\|_{\mathcal{H}_K}^2, where \ell is a vector-valued loss (e.g., squared error) and \lambda > 0 controls complexity. The vector-valued representer theorem guarantees that the solution lies in the span of kernel sections: f(x) = \sum_{i=1}^n K(x, x_i) c_i for coefficients c_i \in \mathbb{R}^T, reducing the infinite-dimensional problem to finite-dimensional linear algebra, solvable via kernel matrix inversion.
This enables structured MTL by estimating task covariances B alongside f, often through alternating optimization or marginal likelihood maximization in Gaussian process views. Despite these advantages, vector-valued RKHS face limitations in scalability, particularly for large T, as the kernel matrix scales as (nT) \times (nT), leading to O(n^3 T^3) time for exact inversion, which becomes prohibitive beyond moderate task counts (e.g., T > 10). Approximations developed in the 2010s, such as low-rank factorizations of B or Nyström methods for the input kernel, mitigate this by reducing effective dimensionality while preserving reproducing properties.
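The separable-kernel construction and the representer-theorem solution can be illustrated with a small NumPy sketch that uses a Gaussian input kernel and a hand-specified task covariance B; the function names are illustrative and the implementation is not optimized:
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    # Scalar Gaussian kernel k(x, y) on the inputs
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def fit_multitask_krr(X, Y, B, lam=1e-2, lengthscale=1.0):
    # X: (n, d) inputs shared by all tasks; Y: (n, T) outputs; B: (T, T) task covariance.
    # Separable kernel K = B kron k(X, X); coefficients solve (K + lam * I) c = vec(Y).
    n, T = Y.shape
    K_input = rbf_kernel(X, X, lengthscale)
    K = np.kron(B, K_input)                                        # (nT, nT) joint kernel matrix
    c = np.linalg.solve(K + lam * np.eye(n * T), Y.T.reshape(-1))  # task-major stacking of outputs
    return c.reshape(T, n)

def predict_multitask_krr(X_train, X_test, coeffs, B, lengthscale=1.0):
    # f_t(x) = sum_s B[t, s] * sum_i k(x, x_i) * c_{s, i}, per the representer theorem
    K_cross = rbf_kernel(X_test, X_train, lengthscale)             # (m, n)
    return K_cross @ coeffs.T @ B.T                                # (m, T) predictions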

Applications

Computer Vision and Image Processing

Multi-task learning (MTL) has been extensively applied in computer vision and image processing to jointly address interrelated tasks such as object detection and semantic segmentation, leveraging shared feature representations to enhance overall performance. A prominent use case is the integration of object detection and instance segmentation, as exemplified by Mask R-CNN, which extends Faster R-CNN by adding a mask prediction branch to perform both bounding box detection and pixel-level segmentation in a single forward pass. Variants of Mask R-CNN, developed since 2017, have further optimized this joint learning for diverse scenarios, including real-time applications in autonomous driving and robotics, where simultaneous detection and segmentation reduce computational overhead compared to separate models. Another key application involves semantic segmentation combined with depth estimation, enabling holistic scene understanding; for instance, QuadroNet employs MTL to jointly predict 2D object detection, semantic segmentation, depth estimation, and surface normals from monocular images, achieving real-time performance on edge hardware. In terms of architectures, MTL in vision often relies on shared convolutional neural network (CNN) backbones followed by task-specific heads to extract common low-level features like edges and textures while allowing specialization for higher-level tasks. This hard-sharing approach, where the backbone parameters are jointly optimized across tasks, has been formalized in frameworks that demonstrate improved generalization over single-task models. Adaptations of multi-task deep neural networks (MT-DNN), originally developed for natural language processing, have evolved for vision tasks, incorporating transformer-based backbones since 2019 to handle multi-modal inputs; recent developments, such as vision transformer adapters, enable generalizable MTL by learning task affinities that transfer to unseen vision domains such as medical imaging. By 2025, these evolutions include weighted vision transformers that balance task contributions dynamically, supporting efficient joint training across tasks in resource-constrained environments. The benefits of MTL in this domain are particularly evident in medical imaging, where shared representations lead to parameter efficiency and faster inference without sacrificing accuracy. For example, on the CheXpert dataset—a large collection of 224,316 chest radiographs with labels for 14 pathologies—models trained via MTL outperform single-task baselines by exploiting hierarchical disease dependencies, achieving higher diagnostic scores while reducing the need for task-specific annotations. In multi-disease classification tasks, MTL has been demonstrated in semi-supervised approaches that leverage auxiliary tasks like segmentation to boost performance on limited labeled data. Recent lightweight MTL models for edge devices, such as those presented at WACV 2025, further enable deployment on mobile hardware for real-time vision tasks; for instance, multi-task supervised compression models reduce computational requirements while maintaining or improving detection accuracy on benchmarks like COCO, facilitating applications in portable medical diagnostics and autonomous systems. Despite these advantages, MTL in vision faces challenges like negative transfer, where optimizing for one task (e.g., high-resolution segmentation) degrades performance on another (e.g., low-level depth estimation) in diverse scenes such as varying lighting or occlusions. This issue arises from conflicting gradients in shared backbones.
Mitigation strategies include dynamic weighting of task losses during training, such as scaling by exponential moving averages of validation losses to prioritize beneficial tasks and suppress harmful ones, which has been shown to improve convergence and final performance in vision benchmarks like Cityscapes. Lightweight transformer-based MTL models incorporate such dynamic re-weighting to adapt to scene variability, ensuring robust transfer across indoor-outdoor environments without extensive retraining.
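A minimal sketch of EMA-based loss re-weighting of the kind described above (the decay rate and the inverse-EMA weighting rule are illustrative simplifications rather than the exact published scheme):
class EMALossWeighter:
    # Tracks an exponential moving average (EMA) of each task's loss and assigns
    # larger weights to tasks whose EMA loss is smaller, normalized to sum to n_tasks.
    def __init__(self, n_tasks, decay=0.9):
        self.decay = decay
        self.ema = [None] * n_tasks

    def weights(self, current_losses):
        for i, loss in enumerate(current_losses):
            value = float(loss)
            self.ema[i] = value if self.ema[i] is None else (
                self.decay * self.ema[i] + (1.0 - self.decay) * value)
        inverse = [1.0 / (e + 1e-8) for e in self.ema]
        scale = len(inverse) / sum(inverse)
        return [w * scale for w in inverse]

# Usage sketch: w = weighter.weights([loss_seg.item(), loss_depth.item()])
#               total_loss = w[0] * loss_seg + w[1] * loss_depth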

Natural Language Processing

Multi-task learning (MTL) in natural language processing (NLP) has prominently featured joint modeling of interrelated linguistic tasks, such as multi-label text classification, machine translation combined with summarization, and question answering paired with natural language inference. The GLUE benchmark, introduced in 2018, exemplifies this by aggregating nine diverse NLU tasks—including single-sentence classification (e.g., SST-2 and CoLA), paraphrase and similarity tasks (e.g., QQP and MRPC), and natural language inference (e.g., MNLI, QNLI, and RTE)—to evaluate models' ability to share linguistic knowledge across limited-data scenarios. These tasks highlight MTL's role in capturing shared semantic and syntactic structure, enabling models to generalize better than single-task training on benchmarks like GLUE. Modern approaches in NLP leverage pretrained language models, reformulating tasks into unified formats for joint training. The T5 model, released in 2019, pioneered the text-to-text framework by treating all tasks as text generation problems, allowing multi-task training on translation datasets (e.g., English-to-German WMT data) and summarization datasets (e.g., CNN/Daily Mail), where task-specific prefixes guide the shared encoder-decoder architecture. Its multilingual extension, mT5 from 2020, extends this to over 100 languages, supporting cross-lingual tasks such as question answering and classification by pretraining on massively multilingual corpora, achieving robust zero-shot transfer. From 2020 to 2025, these foundation models have integrated prompt-based alignment to enhance efficiency, as seen in frameworks like CrossPT (2025), which decomposes prompts into shared and task-specific components via attention mechanisms, improving cross-task transfer on GLUE-like benchmarks in low-resource settings. The MTFPA framework (2025) further advances prompt-based multi-task learning through hybrid alignment of task prompts, though primarily demonstrated in vision-language contexts adaptable to NLP. Performance gains from MTL in NLP often stem from shared embeddings that capture common linguistic features, yielding improvements of 5-10% on downstream tasks such as classification and inference by mitigating overfitting and leveraging auxiliary data. Recent 2024-2025 studies on interleaved multi-task training, such as optimizing dataset combinations for large language models, report enhanced efficiency and up to 8% gains in biomedical tasks (e.g., named entity recognition and relation extraction) through iterative selection of synergistic task mixtures, reducing training costs while boosting generalization. MTL in NLP has evolved from sequence-based models like early transformers to post-2022 multimodal hybrids integrating text with vision, as in large multimodal models that jointly process textual inference and visual inputs for richer representations. This shift enables hybrid tasks, such as image captioning with entailment verification, building on shared encoders from T5-like architectures to handle interleaved text-image data streams.
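The text-to-text reformulation can be sketched with the Hugging Face Transformers library using the public t5-small checkpoint; the task prefixes follow the conventions described for T5, while the example strings themselves are illustrative:
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Different tasks are distinguished only by a textual prefix; the same
# encoder-decoder handles translation and summarization.
examples = [
    "translate English to German: The house is wonderful.",
    "summarize: Multi-task learning trains one model on several related tasks ...",
]
for text in examples:
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))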

Other Domains and Emerging Uses

Multi-task learning (MTL) has found significant applications in scientific domains, particularly in bioinformatics for predicting protein-protein interactions (PPIs). A 2025 review highlights how deep learning advancements, including MTL frameworks, have improved PPI prediction accuracy by leveraging shared representations across interaction types, drawing on work from 2021 to 2025. For instance, the DeepPFP architecture employs MTL to simultaneously predict multiple protein functions, achieving superior performance over single-task baselines on benchmarks like CAFA3 by integrating evolutionary and structural features. Similarly, the MPBind model uses MTL to forecast binding sites for diverse partners such as proteins and DNA, demonstrating enhanced generalization in multi-partner interaction tasks. In sensor-based anomaly detection, MTL addresses challenges in industrial monitoring by jointly modeling normal and anomalous patterns across heterogeneous sensors. The MLAD framework, introduced in 2025, clusters sensors via time-series analysis and applies cluster-constrained graph neural networks for representation learning, followed by multi-task anomaly scoring; it outperforms baselines like isolation forests by up to 15% in F1-score on industrial benchmark datasets. This approach enables efficient detection in cyber-physical systems by sharing knowledge between clustering and detection tasks. Industrial applications of MTL extend to recommendation systems, where it enhances user profiling by jointly optimizing personalization and preference modeling. A 2025 framework for joint group profiling and recommendation uses deep neural MTL to infer group behaviors from individual interactions, improving click-through rate predictions by 8-12% on real-world e-commerce data compared to independent models. In short-video platforms, user behavior-aware MTL integrates viewing history and engagement signals across tasks like click prediction and retention, yielding a 10% uplift in recommendation diversity. In robotics, MTL facilitates perception-action integration for complex environments, particularly following post-2019 advances in autonomous systems. For vision-based autonomous driving, the FASNet model (2020) employs MTL with future state predictions to handle tasks like lane detection and trajectory forecasting, reducing collision risks by 20% in simulated urban scenarios over single-task networks. More recent work on robotic manipulators (2025) uses MTL in reinforcement learning to share policies across grasping and navigation, accelerating convergence by 30% in multi-task benchmarks like RLBench. These methods enable robots to transfer skills from simulation to real-world deployment, improving adaptability in dynamic settings. Emerging trends in MTL involve its integration with foundation models for multimodal data processing, emphasizing interleaved training paradigms since 2024. Multimodal task vectors (MTVs) enable many-shot in-context learning in interleaved large multimodal models like QwenVL by aligning vision-language tasks, boosting zero-shot accuracy on benchmarks such as VQA by 15% through shared embeddings. A 2024 interfacing approach for foundation models creates interleaved shared embedding spaces via multi-task multi-modal training, allowing seamless extension to new modalities with minimal retraining. In climate science, MTL enhances predictive modeling by jointly forecasting related geophysical variables; a 2022 MTL-NET model forecasts the Indian Ocean Dipole up to seven months ahead, surpassing dynamical models like CFSv2 in correlation scores by 0.1-0.2. Similarly, a 2023 MTL framework retrieves passive microwave precipitation and related surface variables simultaneously, improving retrieval accuracy by 5-10% over univariate methods on GPM datasets.
Case studies in healthcare diagnostics illustrate MTL's efficiency gains for multi-disease prediction from imaging and text. A 2023 large image-text (LIT) model for CT scans uses MTL to jointly diagnose conditions by fusing radiological reports with images. In chronic disease prediction, a 2025 multimodal network processes electronic health records alongside other clinical modalities for multiple disease-prediction tasks, including cardiovascular conditions, achieving high scores (e.g., around 0.89 for individual conditions) comparable to single-task models while leveraging shared representations across nationwide cohorts. These applications highlight MTL's role in scalable diagnostics, with shared encoders enabling 20% faster inference in resource-constrained settings.

Implementations

Software Libraries and Frameworks

Several libraries and frameworks facilitate the implementation of multi-task learning (MTL), providing tools for shared representations, task-specific heads, and joint optimization across classical and deep learning paradigms. These libraries emphasize modularity to support custom architectures while handling common MTL challenges like task imbalance through weighted losses and dynamic sampling. In the classical machine learning domain, scikit-multilearn offers a scikit-learn-compatible module for multi-label classification, which extends to MTL scenarios by treating tasks as interdependent labels. It supports algorithms like classifier chains and label powerset for joint prediction, leveraging sparse matrices for efficiency on large datasets. For instance, a basic setup involves wrapping a base estimator:
from skmultilearn.problem_transform import ClassifierChain
from sklearn.ensemble import RandomForestClassifier

base_estimator = RandomForestClassifier()
model = ClassifierChain(classifier=base_estimator)
model.fit(X_train, y_train)  # y_train as multi-label matrix
This library, built on scikit-learn and scipy, has been widely adopted for its integration with the Python scientific ecosystem since its release in 2017. For deep learning, PyTorch-based libraries like LibMTL provide comprehensive support for MTL, including predefined architectures (e.g., hard parameter sharing), weighting strategies (e.g., uncertainty weighting), and evaluation metrics across tasks. LibMTL allows users to define a shared encoder followed by task-specific heads, with built-in handling for gradient conflicts via adaptive weighting and optimization strategies. A simple shared encoder with task-specific heads, written in plain PyTorch, is:
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
        self.task_heads = nn.ModuleDict({
            'task1': nn.Linear(128, 10),
            'task2': nn.Linear(128, 2)
        })

    def forward(self, x):
        features = self.encoder(x)
        return {task: head(features) for task, head in self.task_heads.items()}

model = SharedEncoder()
# In LibMTL, a Trainer object then combines a shared encoder, task-specific decoders,
# and a weighting strategy such as uncertainty weighting; see the LibMTL documentation
# for the exact constructor arguments and training loop.
First released in the early 2020s, LibMTL emphasizes reproducibility through standardized benchmarks on datasets like NYUv2 and Cityscapes. As of 2025, LibMTL continues to evolve with support for larger-scale benchmarks. TensorFlow integrates MTL via its Keras Functional API, enabling multi-output models with shared layers and task-specific losses, often used in recommenders for joint retrieval and rating prediction. The API supports weighted losses by specifying per-output weights in the model.compile step, and task sampling can be implemented via custom data generators. For example:
import tensorflow as tf
from tensorflow import keras

inputs = keras.Input(shape=(784,))
shared = keras.layers.Dense(128, activation='relu')(inputs)
task1 = keras.layers.Dense(10, name='task1')(shared)
task2 = keras.layers.Dense(2, name='task2')(shared)

model = keras.Model(inputs=inputs, outputs=[task1, task2])
model.compile(optimizer='adam',
              loss={'task1': 'sparse_categorical_crossentropy', 'task2': 'binary_crossentropy'},
              loss_weights={'task1': 1.0, 'task2': 0.5})
This approach has been demonstrated in the official TensorFlow Recommenders multi-task tutorials since 2023. Specialized frameworks like Hugging Face Transformers enable MTL fine-tuning for NLP and vision tasks, using a shared backbone with multiple heads. The ecosystem includes utilities for multitask prompt tuning and joint fine-tuning on datasets like GLUE, supporting features such as dynamic padding and task-specific schedulers. Recent extensions allow integration of MTL via the library's training utilities, as shown in community examples for multi-head fine-tuning. For kernel-based MTL, implementations draw from foundational vector-valued kernel methods, with general-purpose libraries extended via custom kernels, though dedicated packages remain limited; early works from the mid-2000s influenced modern extensions in Gaussian process toolkits. Community resources enhance adoption, including GitHub repositories for LibMTL and torchMTL that provide reproducible code for baselines, and benchmarks like those in LibMTL for evaluating MTL performance across domains. These tools collectively lower barriers to MTL experimentation, focusing on scalable, verifiable implementations.
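A common pattern for such multi-head fine-tuning is to wrap a pretrained Transformers backbone with task-specific heads in plain PyTorch; the following is a minimal sketch in which the task names and head sizes are illustrative:
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiHeadBert(nn.Module):
    # Shared BERT backbone with one classification head per task
    def __init__(self, model_name="bert-base-uncased", task_sizes=None):
        super().__init__()
        task_sizes = task_sizes or {"sentiment": 2, "topic": 4}  # illustrative tasks
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden = self.backbone.config.hidden_size
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, n) for t, n in task_sizes.items()})

    def forward(self, task, **inputs):
        # Use the [CLS] token representation as a shared sentence embedding
        cls = self.backbone(**inputs).last_hidden_state[:, 0]
        return self.heads[task](cls)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultiHeadBert()
batch = tokenizer(["multi-task learning shares a backbone"], return_tensors="pt")
logits = model("sentiment", **batch)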

Practical Deployment Considerations

Deploying multi-task learning (MTL) models in production environments requires addressing significant scalability challenges, particularly when handling large numbers of tasks. In distributed training setups common in 2020s cloud infrastructures, task parallelism enables efficient scaling by assigning individual tasks to separate computing resources, such as GPUs, while sharing model updates across nodes to maintain parameter consistency. For instance, the Distributed Sparse Multi-task Learning (DSML) algorithm achieves this by having each machine process its task independently and communicate debiased parameter estimates to a central node, scaling effectively to high-dimensional features (p up to thousands) and numerous tasks (m > 100) with minimal communication overhead. Memory optimization for shared parameters is critical, as large task sets can lead to quadratic computational costs in affinity estimation; techniques like gradient-based approximations reduce this by projecting high-dimensional gradients into lower-dimensional spaces, cutting computational cost by up to 32x in FLOPs and enabling training on 500 tasks with 21 million edges in under 112 GPU hours. Evaluation in production MTL deployments often encounters pitfalls related to distinguishing positive from negative transfer, where joint training can either enhance or degrade task performance compared to single-task baselines. Post-2019 standards emphasize metrics such as inter-task transfer gain, defined as the relative reduction in a task's loss when trained jointly versus individually (e.g., S_t^{i \rightarrow j} = 1 - \frac{L_j(\phi_{t+1}^{\{i,j\}}, \theta_{t+1}^j)}{L_j(\phi_{t+1}^{\{j\}}, \theta_{t+1}^j)}), to quantify positive transfer (values > 0) and identify negative transfer (values < 0). Task interference, a key pitfall, arises from cross-task gradient conflicts during optimization, measurable through approximations like negative cosine similarity of task gradients, which signal when shared representations hinder specific tasks. These metrics help detect when MTL underperforms single-task learning in certain benchmarks, guiding adjustments to avoid deployment failures. Best practices for deployment include task selection heuristics that prioritize related tasks to maximize positive transfer, such as computing gradient similarities or alignments to group tasks. In non-stationary environments, monitoring for data or concept drift is essential, using adaptive federated MTL frameworks that dynamically cluster tasks and update models to handle heterogeneous, time-varying distributions. Integration with MLOps pipelines, a 2024-2025 trend, involves automated tools for drift detection, enabling continuous retraining and redeployment to prevent performance decay in production. Real-world deployments highlight failures from imbalanced tasks, such as in computational chemistry applications with 128 prediction tasks, where dominant easy tasks cause negative transfer, degrading performance on harder ones by 10-30% relative to single-task models. Mitigations like curriculum learning, implemented via dynamic task dropping (e.g., scheduling based on task incompleteness and sample scarcity), allow gradual introduction of complex tasks, reducing interference and improving average accuracy by 5-15% across benchmarks. These approaches ensure robust deployment by balancing task influences throughout training.
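Gradient conflict between two tasks can be monitored with a short diagnostic of the kind described above (a PyTorch sketch; the function name and the shared-parameter handle are illustrative):
import torch

def gradient_cosine_similarity(loss_a, loss_b, shared_params):
    # Negative values indicate conflicting task gradients on the shared parameters
    grads_a = torch.autograd.grad(loss_a, shared_params, retain_graph=True)
    grads_b = torch.autograd.grad(loss_b, shared_params, retain_graph=True)
    flat_a = torch.cat([g.reshape(-1) for g in grads_a])
    flat_b = torch.cat([g.reshape(-1) for g in grads_b])
    return torch.nn.functional.cosine_similarity(flat_a, flat_b, dim=0).item()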

    The association of tasks is dynamically decided based on the validation loss values. The result is a dynamic task-specific weighting parameter λ against each ...