
Self-supervised learning

Self-supervised learning (SSL) is a paradigm within machine learning, specifically a subset of unsupervised learning, that trains models to extract meaningful representations from vast amounts of unlabeled data by automatically generating supervisory signals from the structure or relationships inherent in the data itself, thereby mitigating the need for costly human-annotated labels. This approach contrasts with traditional supervised learning, which relies on paired input-output examples, and instead uses pretext tasks—such as predicting rotations, solving jigsaw puzzles, or reconstructing masked portions of inputs—to foster generalizable representation learning. SSL has emerged as a cornerstone for scaling modern models, particularly in domains where labeled data is scarce or expensive to obtain.

The origins of SSL trace back to early work in the 1990s, such as de Sa's 1994 exploration of co-occurrence relationships between visual and auditory data to learn representations without labels. It gained momentum in the 2010s with the rise of deep neural networks, building on foundational techniques like autoencoders and transitioning to pretext-based methods; for instance, Doersch et al. (2015) introduced context prediction in images, while Noroozi and Favaro (2016) proposed solving jigsaw puzzles as a spatial understanding task. A pivotal shift occurred around 2020 with the advent of large-scale contrastive methods, exemplified by Momentum Contrast (MoCo) by He et al. (2020), which achieved competitive performance on ImageNet benchmarks using unlabeled pretraining followed by fine-tuning. Subsequent innovations, such as Simple Contrastive Learning of Representations (SimCLR) by Chen et al. (2020), further demonstrated that SSL could rival supervised methods in image classification tasks through careful augmentation and large batch sizes.

SSL encompasses several core paradigms, each leveraging different mechanisms to create pseudo-labels. Generative approaches, like masked image modeling (MIM) in Masked Autoencoders (MAE) by He et al. (2022), reconstruct masked input portions to learn pixel-level and semantic features, achieving state-of-the-art results such as 83.6% top-1 accuracy on ImageNet-1K fine-tuning. Contrastive methods, including SimCLR and MoCo, maximize agreement between augmented views of the same instance while repelling dissimilar ones, enabling robust invariant representations without negative sampling in later variants like BYOL (Grill et al., 2020). Non-contrastive and hybrid techniques, such as those in Data2Vec (Baevski et al., 2022), unify modalities by predicting latent representations across vision, speech, and text. These methods often employ transformer architectures, scaling effectively to billions of parameters and diverse data types.

In applications, SSL has revolutionized computer vision by powering tasks like object detection, semantic segmentation, and video understanding; for example, VideoMAE extends MAE to the Kinetics video datasets, attaining 80.0% accuracy on action recognition. In natural language processing, models like BERT (Devlin et al., 2019) use masked language modeling—a generative SSL pretext—to pretrain on unlabeled text corpora, enabling downstream fine-tuning for question answering and sentiment analysis with minimal labels. Beyond these, SSL extends to multimodal settings, as in CLIP (Radford et al., 2021), which aligns image-text pairs contrastively for zero-shot transfer. Its impact lies in democratizing AI by harnessing abundant unlabeled data from the web, fostering efficiency in resource-constrained environments, and inspiring ongoing research into theoretical foundations and unified frameworks across modalities. As of 2025, SSL continues to evolve with scalable multimodal models and applications in new domains like robotics and healthcare.

Background and Fundamentals

Definition and Motivation

Self-supervised learning (SSL) is a paradigm within machine learning that enables models to learn meaningful representations from unlabeled data by generating supervisory signals, or pseudo-labels, directly from the inherent structure of the input data itself. Unlike traditional supervised learning, which relies on human-annotated labels, SSL formulates pretext tasks that transform portions of the data into predictive problems, allowing the model to extract features without external supervision. This approach focuses on representation learning, where the goal is to produce generalizable embeddings that can be fine-tuned for downstream tasks.

The primary motivation for SSL stems from the significant challenges in supervised deep learning, particularly the scarcity and high cost of acquiring large-scale labeled datasets, which often require substantial human effort and expertise. By leveraging abundant unlabeled data—such as internet-scale collections of images, text, or audio—SSL addresses this bottleneck, enabling the training of scalable models that perform effectively across diverse domains. This is especially valuable in fields like healthcare or robotics, where annotations are limited or expensive to obtain.

Key benefits of SSL include enhanced data efficiency, as pre-trained representations can be transferred to new tasks with minimal additional labeling, reducing overfitting and improving generalization on small datasets. It also facilitates adaptability to low-resource settings, achieving performance levels comparable to supervised methods in many cases while promoting robust, transferable features. For instance, SSL supports transfer learning by producing versatile embeddings that capture semantic structures, benefiting applications from image classification to language modeling.

Common pretext tasks in SSL include masked prediction, where models infer missing elements from partially obscured inputs—such as filling in masked words of a sentence for language models—to learn contextual relationships. In images, rotation prediction requires estimating the orientation of rotated visuals, encouraging the model to understand geometric invariances and object structures. These tasks conceptually exploit self-generated labels derived from the data's intrinsic properties, fostering representations that align with natural data distributions without manual intervention.
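As a concrete illustration, the sketch below (a minimal PyTorch example; the make_rotation_batch helper and the tiny encoder are hypothetical stand-ins, not code from any published method) derives rotation pseudo-labels directly from unlabeled images and trains a four-way classifier on them.

```python
# Minimal sketch of the rotation-prediction pretext task: each image is
# rotated by a random multiple of 90 degrees, and the rotation index
# serves as a free pseudo-label for a standard classification loss.
import torch
import torch.nn as nn

def make_rotation_batch(images: torch.Tensor):
    """images: (B, C, H, W) -> rotated images plus pseudo-labels in {0,1,2,3}."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack(
        [torch.rot90(img, k=int(k), dims=(1, 2)) for img, k in zip(images, labels)]
    )
    return rotated, labels

# Any image encoder with a 4-way head can be trained on these pseudo-labels;
# a tiny CNN stands in for a real backbone here.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4),
)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)           # stand-in for unlabeled images
rotated, labels = make_rotation_batch(images)
loss = criterion(encoder(rotated), labels)   # self-generated supervision
loss.backward()
```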

Historical Development

Precursors to self-supervised learning, such as autoencoders developed in the 1980s by Rumelhart, Hinton, and Williams, demonstrated how neural networks could learn internal representations by reconstructing inputs through a compressed hidden layer. This approach used portions of the input data itself as supervisory signals, laying foundational ideas for representation learning without explicit labels. The roots of SSL as a distinct paradigm trace back to the 1990s, with early work such as de Sa's (1994) exploration of co-occurrence relationships between visual and auditory data to learn multimodal representations without labels. Neuroscience-inspired models further advanced these concepts in the same decade, particularly through predictive coding frameworks that enabled networks to learn by anticipating future inputs based on prior sensory data, as explored by Rao and Ballard (1999).

The 2000s marked a resurgence in unsupervised pretraining techniques that presaged modern self-supervised methods. Hinton's introduction of deep belief networks in 2006 utilized layer-wise training with restricted Boltzmann machines to initialize deep architectures, facilitating effective learning from unlabeled data. This was complemented by Vincent et al.'s denoising autoencoders in 2008, which corrupted inputs with noise and trained models to reconstruct the clean versions, thereby capturing robust features invariant to perturbations. These developments addressed challenges in training deep networks and highlighted the potential of self-generated supervisory tasks.

The 2010s brought a boom in self-supervised learning, spurred by the success of supervised deep learning on large labeled datasets like ImageNet in 2012, which underscored the limitations of annotation costs and motivated unlabeled pretraining strategies. Pioneering work included Pathak et al.'s Context Encoders in 2016, which employed inpainting—predicting missing image regions from context—as a pretext task for representation learning in computer vision. The decade's momentum accelerated with contrastive methods, such as van den Oord et al.'s Contrastive Predictive Coding (CPC) in 2018, which maximized mutual information between predictions and future data segments to learn general representations. In natural language processing, Devlin et al.'s BERT in 2018 popularized masked language modeling, where models predicted withheld tokens in sentences, achieving state-of-the-art transfer performance after pretraining on vast unlabeled corpora.

The 2020s witnessed scaling laws and methodological diversification in self-supervised learning, with Chen et al.'s SimCLR in 2020 showing that larger models and datasets, trained via contrastive objectives, could rival supervised benchmarks on image classification without labels. Non-contrastive approaches gained traction, exemplified by Caron et al.'s DINO in 2021, which used self-distillation to align teacher-student network predictions, yielding high-quality visual representations. Building on the ImageNet-era emphasis on pretraining, recent trends from 2023 to 2025 have shifted toward multimodal self-supervised learning, with extensions of models like CLIP—originally by Radford et al. in 2021—such as BLIP-2 (2023) for vision-language tasks, integrating vision and language through contrastive alignment of image-text pairs to enable zero-shot capabilities across modalities.

Core Methods

Autoassociative Approaches

Autoassociative approaches in self-supervised learning involve training models to reconstruct their input data, thereby learning useful representations without explicit labels. These methods, often exemplified by autoencoders, employ an encoder-decoder architecture in which the encoder compresses the input into a lower-dimensional latent representation and the decoder reconstructs the original input from this representation. The learning process minimizes the reconstruction error, enabling the model to capture essential features of the data distribution. This paradigm was initially proposed in the context of modular learning in neural networks.

Key variants of autoencoders extend this core idea to address specific challenges. Standard autoencoders focus on basic reconstruction through deterministic mappings. Variational autoencoders (VAEs) introduce probabilistic latent spaces, modeling the latent variables as distributions rather than point estimates to enable generative capabilities and regularization. Denoising autoencoders enhance robustness by training on corrupted inputs—such as those with added noise—and reconstructing the clean originals, which helps learn invariant features.

In modern self-supervised learning, particularly for vision, masked autoencoders (MAEs) represent a prominent advancement (He et al., 2022). MAEs use a Vision Transformer (ViT) architecture with an asymmetric encoder-decoder design, where a high ratio (e.g., 75%) of image patches are randomly masked and the model is trained to reconstruct the masked patches from the visible ones. This approach leverages the transformer's attention mechanism to learn both low-level and high-level features efficiently. On ImageNet-1K, MAE pretraining followed by fine-tuning achieves 83.6% top-1 accuracy with a ViT-Base model, as reported by He et al. (2022), demonstrating superior scalability to large unlabeled datasets compared to earlier variants.

The mathematical foundation of these approaches centers on optimization objectives that enforce faithful reconstruction. For standard autoencoders, the loss is typically the squared reconstruction error: L = \| x - \hat{x} \|^2, where x is the input and \hat{x} is the reconstructed output. In VAEs, the objective is the evidence lower bound (ELBO): \mathcal{L} = \mathbb{E}[\log p(x|z)] - \mathrm{KL}(q(z|x) \| p(z)), which balances reconstruction fidelity with a Kullback-Leibler divergence term that regularizes the approximate posterior q(z|x) toward a prior p(z), often a standard Gaussian. These formulations ensure the learned representations are both compact and informative. For MAEs, the loss focuses on reconstruction error over masked pixels only, promoting semantic understanding.

Autoassociative methods offer simplicity, as they require no negative sampling or pairwise comparisons, making them computationally efficient for large datasets. They are particularly advantageous for tasks like dimensionality reduction and anomaly detection, where the latent space provides a compressed yet semantically rich encoding of the data. Unlike contrastive methods, which rely on distinguishing positive and negative pairs, autoassociative approaches emphasize generative reconstruction of the input itself.

A notable example is sparse autoencoders, which incorporate sparsity constraints on the latent representations to promote efficient coding, inspired by biological sensory systems. By penalizing non-zero activations in the hidden units, these models discover overcomplete bases that sparsely represent natural images, leading to the emergence of edge-like filters akin to simple-cell receptive fields in the visual cortex.
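To make the masked-reconstruction objective concrete, here is a minimal sketch that assumes patch vectors have already been extracted; it is an illustrative stand-in rather than the actual MAE implementation, which uses an asymmetric ViT encoder-decoder.

```python
# Minimal sketch of masked-reconstruction training in the spirit of a
# masked autoencoder: a random 75% of patch vectors are zeroed out and
# the MSE loss is computed on the masked positions only.
import torch
import torch.nn as nn

patch_dim, mask_ratio = 64, 0.75
model = nn.Sequential(            # stand-in encoder-decoder over patch vectors
    nn.Linear(patch_dim, 32), nn.ReLU(), nn.Linear(32, patch_dim),
)

patches = torch.randn(16, 49, patch_dim)              # (batch, patches, dim)
mask = torch.rand(patches.shape[:2]) < mask_ratio     # True where masked
corrupted = patches.masked_fill(mask.unsqueeze(-1), 0.0)

recon = model(corrupted)
loss = ((recon - patches) ** 2)[mask].mean()          # loss on masked patches only
loss.backward()
```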

Contrastive Approaches

Contrastive approaches in self-supervised learning focus on discriminative techniques that learn representations by pulling together positive pairs—typically augmented views of the same data instance—and pushing apart negative pairs from distinct instances. This process encourages the model to capture invariant features across transformations while distinguishing semantically dissimilar samples, fostering robust embeddings without explicit labels. The paradigm draws inspiration from noise-contrastive estimation but adapts it for representation learning, emphasizing mutual information maximization between positives.

The foundational objective in these methods is the InfoNCE (Noise-Contrastive Estimation) loss, which approximates a lower bound on the mutual information between positive pairs by treating the task as a classification problem: identifying the correct positive among multiple negatives. Mathematically, for a batch of samples, the loss for a positive pair (z_i, z_j) is defined as: \mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{N} \exp(\text{sim}(z_i, z_k)/\tau)}, where \text{sim}(\cdot, \cdot) denotes cosine similarity, \tau > 0 is a temperature parameter controlling sharpness, z_i and z_j are projected representations of the positive pair, and the denominator sums over one positive and N-1 negatives z_k. This formulation, introduced with Contrastive Predictive Coding, ensures that the model prioritizes alignment for positives while contrasting against negatives to avoid collapse into trivial solutions.

Key implementations have advanced this framework across domains. In computer vision, SimCLR (Chen et al., 2020) simplifies the pipeline by relying on large-batch training for in-batch negative sampling, combined with strong data augmentations like random cropping, horizontal flips, and color jittering, achieving state-of-the-art linear classification accuracy on ImageNet (e.g., 76.5% top-1 with ResNet-50, as reported in 2020). To mitigate the memory demands of large batches, MoCo (He et al., 2020) introduces a momentum-updated encoder and a dynamic queue of negative embeddings, enabling stable training with smaller batches while maintaining a large effective pool of negatives; this approach outperforms supervised pre-training on multiple downstream tasks. For sequential data, such as audio or text, Contrastive Predictive Coding (CPC; van den Oord et al., 2018) adapts the loss to predict future latent representations from past contexts, contrasting them against non-predictive negatives, and has demonstrated effectiveness in learning hierarchical features for raw waveforms.

Data augmentations are central to generating positive pairs, as they define what constitutes "similarity." In vision tasks, augmentations preserve semantic content through geometric (e.g., crops, rotations) and photometric (e.g., brightness adjustments) variations, ensuring the encoder learns invariance to such perturbations. For natural language, analogous strategies include token masking, synonym substitutions, and sentence reordering to create coherent yet diverse views, as explored in contrastive models for text embeddings. These augmentations must balance informativeness and diversity to avoid mode collapse or overly simplistic representations.

A notable limitation of contrastive approaches is the high computational overhead from negative sampling, which often necessitates large mini-batches (thousands of samples) or external memory structures to provide sufficient negatives for effective discrimination, scaling poorly on resource-constrained hardware. This dependency can hinder scalability, particularly for high-dimensional data or real-time applications, prompting ongoing research into efficient alternatives.
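A minimal PyTorch sketch of the InfoNCE objective above, assuming a SimCLR-style setup in which each instance yields two projected views (z1[i], z2[i]) and the other instances in the batch serve as negatives; this simplified variant contrasts each view in the first set against all views in the second.

```python
# Minimal sketch of the InfoNCE loss with in-batch negatives: the
# positives lie on the diagonal of the similarity matrix, so the
# objective reduces to a cross-entropy over batch indices.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # cosine similarities / tau
    targets = torch.arange(z1.size(0))          # positive pair index for each row
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(256, 128), torch.randn(256, 128)   # projected views
loss = info_nce(z1, z2)
```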

Non-Contrastive Approaches

Non-contrastive approaches in self-supervised learning generate representations without relying on negative samples, instead using mechanisms like architectural asymmetries or predictive heads to promote diversity and prevent representational collapse. These methods address key limitations of contrastive techniques, such as the dependency on large batches of negative examples, by focusing solely on positive pairs derived from data augmentations. This positive-only learning paradigm leverages inherent biases in network design to ensure that representations remain informative and non-trivial.

A prominent example is Bootstrap Your Own Latent (BYOL; Grill et al., 2020), which employs two neural networks: an online network that processes augmented views and a target network whose weights are an exponential moving average of the online network's. The online network includes a predictor module to further decorrelate its output from the target, encouraging the learning of invariant features. The loss function is the mean squared error (MSE) between the normalized projections of the online and target networks, with a stop-gradient operation applied to the target to avoid collapse: \mathcal{L} = \left\| \text{sg}\left[ z_t \right] - \hat{z}_o \right\|^2_2, where z_t is the target projection, \hat{z}_o is the predicted online projection, and \text{sg}[\cdot] denotes the stop-gradient. This setup allows BYOL to scale effectively without negative sampling, demonstrating representation quality comparable to contrastive methods.

SimSiam (Chen and He, 2021) simplifies this further by using a symmetric architecture with identical encoder backbones for the two augmented views, omitting the momentum-averaged target network. It prevents collapse through a stop-gradient on one branch and a predictor on the other, optimizing a negative cosine similarity loss between the outputs. This approach highlights that simple architectural choices, without momentum encoders or negative pairs, suffice for meaningful self-supervised learning.

SwAV (Caron et al., 2020) introduces an online clustering mechanism using a set of learnable prototypes, where augmented views are assigned to clusters in a swapped manner to enforce consistency without direct feature comparisons. The method alternates between solving an optimal transport problem for cluster assignments and training the network to predict these assignments, enabling efficient handling of large batch sizes and reducing the memory demands associated with negatives. Building on these ideas, DINO (Caron et al., 2021) applies a self-distillation framework to vision transformers, using a student-teacher setup where the teacher is updated as an exponential moving average of the student, and employs centering and sharpening of teacher outputs to encourage diverse predictions across augmentations. This leads to emergent properties like attention maps resembling object segmentations, without explicit supervision.

Other notable methods include Barlow Twins (Zbontar et al., 2021), which prevents collapse by driving the cross-correlation matrix between the outputs of two augmented views toward the identity matrix, enforcing invariance on the diagonal while reducing redundancy off the diagonal. Similarly, VICReg (Bardes et al., 2022) enforces variance, invariance, and covariance regularization on positive pairs to maintain informative representations without negatives or momentum encoders. These approaches further emphasize the efficacy of positive-only learning in diverse settings.

Overall, non-contrastive methods offer advantages in simpler training pipelines and lower memory usage, as they eliminate the need for storing or computing negative examples, and they perform particularly well in data-scarce settings by focusing on robust positive signals.
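The positive-only objective with a stop-gradient can be sketched as follows; this is an illustrative simplification loosely following the BYOL/SimSiam loss forms above, omitting the projector and predictor MLPs and the momentum-averaged target network used in practice.

```python
# Minimal sketch of a positive-only loss: negative cosine similarity
# between the online prediction and a stop-gradient copy of the target
# projection. The stop-gradient (detach) is the key ingredient that
# helps prevent representational collapse.
import torch
import torch.nn.functional as F

def positive_only_loss(online_pred: torch.Tensor, target_proj: torch.Tensor):
    p = F.normalize(online_pred, dim=1)
    z = F.normalize(target_proj.detach(), dim=1)   # detach = stop-gradient
    return -(p * z).sum(dim=1).mean()

online_pred = torch.randn(32, 256, requires_grad=True)   # output of the predictor
target_proj = torch.randn(32, 256)                        # target-branch projection
loss = positive_only_loss(online_pred, target_proj)
loss.backward()
```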

Comparisons with Other Paradigms

Versus Supervised Learning

Self-supervised learning (SSL) fundamentally differs from supervised learning in its approach to training. In SSL, models are pretrained on vast amounts of unlabeled data using pretext tasks that generate supervisory signals from the data itself, such as predicting rotations or solving jigsaw puzzles, before fine-tuning on labeled data for downstream tasks. In contrast, supervised learning trains models end-to-end directly on labeled datasets, where each input is paired with explicit annotations, relying entirely on human-provided labels throughout the process. This distinction allows SSL to leverage abundant unlabeled data, mitigating the dependency on costly annotations that constrain supervised methods.

Performance trade-offs between the two paradigms highlight SSL's strengths in label efficiency, particularly for transfer learning. For instance, SSL-pretrained models often match or surpass fully supervised baselines in downstream vision tasks when using only 1% of labeled data for fine-tuning, as demonstrated by contrastive methods like SimCLR, which achieved 76.5% top-1 accuracy on ImageNet via linear probing—comparable to supervised ResNet-50 training on the full labeled dataset—while outperforming supervised models trained from scratch with 100 times fewer labels. However, SSL incurs higher computational costs during pretraining due to the need for large batch sizes, extended training epochs, and processing billions of unlabeled samples, whereas supervised learning typically requires less overall compute since it focuses solely on labeled data. Post-2020 studies, such as those on SimCLR, show that SSL can achieve competitive performance on ImageNet with 100 times fewer labels than supervised training, underscoring its value in low-data regimes. As of 2025, multimodal SSL models continue to demonstrate label efficiency, achieving near-supervised performance with minimal annotations in zero-shot settings.

Data scalability further accentuates these differences: SSL thrives on web-scale unlabeled corpora, such as billions of images scraped from the internet, enabling robust generalization without annotation bottlenecks. Supervised learning, however, is hampered by labeling expenses; for example, annotating the ImageNet dataset involved human workers spending a median of 26 seconds per image, resulting in substantial time and financial investment for just 1.2 million labeled examples.

Hybrid approaches integrate SSL as a foundational pretraining stage, followed by supervised fine-tuning or simple linear probing on frozen representations, which has proven effective for downstream tasks like object detection, where methods like MoCo outperform supervised pretraining by significant margins on datasets such as PASCAL VOC and COCO. This synergy positions SSL as a complementary paradigm, enhancing supervised models' efficiency in resource-limited settings.
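The linear-probing protocol mentioned above can be sketched as follows, with synthetic arrays standing in for embeddings extracted by a frozen pretrained encoder; only the linear head is trained, and the names used here are placeholders.

```python
# Minimal sketch of linear evaluation ("linear probing") on frozen
# SSL features: a logistic-regression head is fit on the embeddings
# while the encoder itself stays untouched.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 512))   # stand-in for frozen embeddings
labels = rng.integers(0, 10, size=1000)   # stand-in for class labels

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # only the linear head is trained
print("linear-probe accuracy:", probe.score(X_te, y_te))
```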

Versus Unsupervised Learning

Self-supervised learning (SSL) operates within the broader paradigm of unsupervised learning, which encompasses techniques that extract patterns from unlabeled data without external supervision. Traditional unsupervised methods include clustering algorithms like k-means, which partition data into groups based on similarity measures such as Euclidean distance to minimize intra-cluster variance, and dimensionality reduction techniques like principal component analysis (PCA), which identify principal components that capture data variance. Additionally, generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs) focus on modeling the underlying probability distribution of the data to enable synthesis of new samples, often through objectives like evidence lower bound maximization in VAEs or adversarial min-max games in GANs.

In distinction from these, SSL generates pseudo-labels or supervisory signals directly from the input data via pretext tasks—such as predicting rotations, solving jigsaw puzzles, or completing masked portions—to train models for learning transferable representations, particularly hierarchical embeddings suited for downstream applications. This task-driven approach contrasts with the intrinsic goals of broader unsupervised learning, where methods like clustering aim solely for data grouping without transfer intent, and generative models prioritize likelihood estimation and sample quality over discriminative feature extraction. While unsupervised techniques often conclude with standalone outputs like cluster assignments or generated instances, SSL's mechanisms foster representations that enhance performance when fine-tuned on labeled data for specific tasks.

SSL can be regarded as a specialized subset of unsupervised learning, sharing the core use of unlabeled data but refining it toward representation learning with downstream utility; for example, VAEs may incorporate pretext-like reconstruction tasks but are traditionally optimized for generative capabilities rather than transfer. Overlaps exist where SSL principles integrate with other unsupervised tools, such as using learned SSL features to initialize clustering, yet the paradigms diverge in their end objectives—SSL emphasizes predictive supervision derived from data structure, while unsupervised methods like k-means or GANs pursue exploratory or synthetic aims without self-generated labels.

Evaluation further highlights these differences: SSL success is gauged by downstream task metrics, such as classification accuracy after fine-tuning or linear probing on benchmarks like ImageNet, where models transfer learned representations to achieve high performance with minimal labeled data. In contrast, unsupervised methods are assessed via intrinsic criteria, including silhouette scores for clustering quality (measuring cluster cohesion and separation) or reconstruction errors and Fréchet Inception Distance (FID) for generative fidelity. Empirical evidence underscores SSL's advantages, with methods like BERT achieving an average score of 80.5% on the GLUE benchmark, surpassing prior supervised baselines of around 70-75%.
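To illustrate the difference in evaluation style, the following sketch (on synthetic data; real evaluations use learned representations) computes an intrinsic clustering metric alongside a downstream probe accuracy on the same embeddings.

```python
# Minimal sketch contrasting intrinsic and downstream evaluation:
# silhouette score for a clustering of the embeddings versus the
# accuracy of a linear classifier trained on those embeddings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(600, 64))   # stand-in for learned representations
labels = rng.integers(0, 3, size=600)     # stand-in for downstream labels

cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)
print("intrinsic (silhouette):", silhouette_score(embeddings, cluster_ids))

clf = LogisticRegression(max_iter=1000).fit(embeddings[:500], labels[:500])
print("downstream (probe accuracy):", clf.score(embeddings[500:], labels[500:]))
```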

Applications

In Computer Vision

Self-supervised learning has revolutionized computer vision by enabling the training of robust visual representations from vast unlabeled image and video datasets, which are then transferred to downstream tasks like object classification, detection, and segmentation. This paradigm leverages pretext tasks to instill semantic understanding without human annotations, often matching or surpassing supervised pretraining in data-scarce scenarios. Key strategies focus on image-level and pixel-level pretraining, followed by evaluation on standard benchmarks and application to complex tasks.

Early pretraining paradigms emphasized image-level pretext tasks to capture global structure. Rotation prediction requires models to determine the orientation of rotated images, fostering invariance to transformations. The jigsaw puzzle approach, introduced by Noroozi and Favaro in 2016, shuffles image patches and trains the network to reconstruct the original arrangement, thereby learning spatial relationships and object compositions. At the pixel level, inpainting tasks involve reconstructing masked image regions from context. Pathak et al. in 2016 developed context encoders for this purpose, using adversarial training to generate plausible completions and extract local texture features. These methods laid the groundwork for scalable representation learning by exploiting inherent image redundancies.

Influential models have advanced these paradigms through contrastive and non-contrastive frameworks. SimCLR, proposed by Chen et al. in 2020, employs instance discrimination via a Siamese network that contrasts augmented views of the same image, achieving 76.5% top-1 accuracy in linear evaluation on ImageNet using a ResNet-50 backbone. MoCo, by He et al. in 2020, enhances this with a momentum-updated encoder and a dynamic negative queue, decoupling the number of negatives from the batch size and enabling stable training for high-quality features. Non-contrastive alternatives include DINO (2021), where Caron et al. used self-distillation between student and teacher networks to align probability distributions, yielding self-supervised features with emergent localization properties. iBOT (2021), from Zhou et al., integrates masked image modeling, self-distillation, and contrastive objectives in a unified framework, mimicking BERT's masked language modeling for holistic image understanding.

Downstream applications demonstrate the efficacy of these representations. In object detection, self-supervised pretraining boosts models like Faster R-CNN; for instance, MoCo-pretrained detectors significantly improve generalization on COCO with just 1% labeled data compared to fully supervised counterparts trained from scratch. For semantic segmentation, DenseCL (2020) by Wang et al. applies pixel-wise contrastive learning during pretraining, enhancing dense prediction tasks on datasets like Cityscapes, where it outperforms supervised baselines by capturing fine-grained boundaries. Benchmarks such as linear probing on ImageNet validate these gains, with semi-supervised protocols (e.g., 1% or 10% labels) showing self-supervised models closing the gap to fully labeled training, as evidenced by SimCLR's robust performance across scales.

Building on earlier works like VideoMAE (2022) from Tong et al., which extends masked autoencoding to spatiotemporal patches and attains state-of-the-art action recognition on Kinetics-400 (80.9% top-1 accuracy), developments from 2023 to 2025 have further expanded self-supervised learning to dynamic and multimodal domains. For example, VideoMAE V2 (2023) scales masking strategies for improved video representations. Multimodal approaches like CLIP, developed by Radford et al. in 2021, align images with text via contrastive pretraining on 400 million pairs, enabling zero-shot transfer to vision tasks and influencing hybrid models in subsequent years. These advances underscore self-supervised learning's role in handling diverse visual data efficiently.
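As a sketch of how positive pairs are generated for such contrastive pretraining, the following uses standard torchvision transforms to produce two stochastic views of one unlabeled image; the specific augmentation parameters are illustrative rather than those of any particular paper.

```python
# Minimal sketch of two-view augmentation for contrastive pretraining:
# applying the same stochastic transform twice to one image yields a
# positive pair, while other images in the batch provide negatives.
import torch
from torchvision import transforms
from PIL import Image

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

image = Image.new("RGB", (256, 256))           # stand-in for an unlabeled image
view1, view2 = augment(image), augment(image)  # two stochastic views of the same instance
assert view1.shape == view2.shape == torch.Size([3, 224, 224])
```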

In Natural Language Processing

Self-supervised learning in natural language processing (NLP) primarily leverages pretext tasks on unlabeled text corpora to learn rich representations of language structure and semantics. These representations are then fine-tuned for downstream tasks, enabling models to achieve state-of-the-art performance with minimal labeled data. Key pretext tasks include masked language modeling (MLM), where approximately 15% of input tokens are randomly masked and the model predicts them based on bidirectional context, and next sentence prediction (NSP), which trains the model to determine if two sentences are contiguous in the original text. Another variant is permuted language modeling, which generates predictions over permutations of the input sequence to capture bidirectional dependencies without the masking artifacts of MLM.

Seminal models exemplify these approaches. BERT (Bidirectional Encoder Representations from Transformers), introduced in 2018, uses bidirectional MLM combined with NSP to pre-train on large corpora like BooksCorpus and English Wikipedia, producing contextual embeddings that capture nuanced linguistic relationships. RoBERTa, released in 2019, refines BERT's MLM by removing NSP, using dynamic masking, larger batch sizes, and extended training on datasets including CC-News and OpenWebText, resulting in more robust representations without architectural changes. ELECTRA, from 2020, shifts from generative MLM to a discriminative replaced-token detection task, where a lightweight generator proposes replacements for masked tokens and the main model discriminates real from fake tokens; this approach is more sample-efficient, achieving comparable performance to BERT with four times less compute.

These self-supervised pre-trained models excel in downstream applications after fine-tuning. For natural language inference and other general understanding tasks, models like BERT and RoBERTa achieve high performance on the GLUE benchmark, a suite of nine diverse tasks including sentiment analysis and textual similarity, marking a significant improvement over prior supervised baselines. In question answering, BERT variants achieve F1 scores exceeding 90% on the SQuAD dataset, demonstrating precise span extraction from passages by leveraging learned contextual cues. For machine translation, self-supervised embeddings from models like mBERT serve as strong initializers, enabling effective fine-tuning on parallel corpora and improving translation quality in low-resource language pairs.

Scaling self-supervised learning has further amplified its impact. The GPT series, exemplified by GPT-3 in 2020, employs unidirectional causal language modeling—a form of self-supervision where the model predicts the next token in a sequence—trained on approximately 300 billion tokens from diverse web sources, enabling few-shot learning across tasks without task-specific fine-tuning. This scaling yields efficiency gains, such as up to 10-fold reductions in data requirements and faster adaptation for downstream applications, by transferring general language knowledge.

Recent advancements extend self-supervised learning to multilingual settings, particularly for low-resource languages. Extensions of mBERT, which pre-trains on 104 languages via MLM on a shared vocabulary, have been refined in 2024-2025 works like mmBERT to incorporate cross-lingual prompting and self-supervised adaptation, boosting zero-shot transfer performance on tasks like text classification in underrepresented languages by 5-15% over monolingual baselines.
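A minimal sketch of BERT-style dynamic masking is shown below; the token ids, special-token id, and vocabulary size are placeholders, and the 80/10/10 replacement split follows the scheme described for MLM.

```python
# Minimal sketch of MLM masking: roughly 15% of tokens are selected; of
# those, 80% become [MASK], 10% a random token, and 10% stay unchanged.
# Labels for unselected positions are set to -100 so the loss ignores them.
import torch

def mask_tokens(input_ids: torch.Tensor, mask_id: int, vocab_size: int,
                mlm_prob: float = 0.15):
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100                       # ignored by the MLM loss

    input_ids = input_ids.clone()
    replace_mask = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids[replace_mask] = mask_id              # 80% of selected: [MASK]

    random_mask = selected & ~replace_mask & (torch.rand(input_ids.shape) < 0.5)
    input_ids[random_mask] = torch.randint(vocab_size, (int(random_mask.sum()),))
    return input_ids, labels                       # remaining selected tokens: unchanged

ids = torch.randint(5, 30522, (2, 16))             # toy batch of token ids
masked_ids, labels = mask_tokens(ids, mask_id=103, vocab_size=30522)
```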

In Other Domains

Self-supervised learning has been adapted to audio and speech processing through pretext tasks that leverage the temporal and sequential nature of sound data. In wav2vec 2.0, contrastive learning is employed to learn representations by predicting quantized latent representations of masked audio segments, enabling effective discrimination without labeled data. This approach has been extended in HuBERT, which uses BERT-like masked prediction of hidden units derived from offline clustering of audio features, leading to 10-20% reductions in word error rate (WER) on automatic speech recognition tasks compared to supervised baselines.

In robotics and reinforcement learning, self-supervised methods focus on predicting world models from sequences of states and actions to enable planning and control. Dreamer, for instance, learns a latent dynamics model by reconstructing observations and predicting rewards in a self-supervised manner, facilitating sample-efficient policy learning in simulated environments. Such techniques also support sim-to-real transfer by pretraining policies on simulated data with pretext tasks like trajectory reconstruction, reducing the need for real-world annotations and improving generalization to physical robots.

For graph-structured data, such as in recommendation systems, contrastive self-supervised learning generates augmented views of graphs to learn robust embeddings. GraphCL applies contrastive learning between augmented subgraphs (e.g., via edge dropping or attribute masking) to capture structural invariances, outperforming supervised methods on graph classification benchmarks. Similarly, PinSage at Pinterest uses graph convolutions with self-supervised proximity prediction as a pretext task to generate embeddings for billions of items, enhancing personalized recommendations by 15-20% in engagement metrics.

Multimodal self-supervised learning bridges domains like vision and text, as seen in CLIP, which aligns image and caption representations through contrastive learning on large-scale paired data, enabling zero-shot transfer to downstream tasks. In biology, emerging applications as of 2025 extend this to protein structure prediction, with ESMFold-inspired models using masked language modeling on evolutionary sequence data to predict structures, achieving near-AlphaFold accuracy on unseen proteins via self-supervision alone.

A key aspect across these domains is the design of domain-specific augmentations for pretext tasks, such as time-warping or speed perturbation in audio to simulate variations in speech, which enhance representation robustness without altering semantic content.
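As a sketch of view generation for graph contrastive learning of the GraphCL kind, the following drops a random fraction of edges from a toy edge list to produce two augmented views; the function name and drop ratio are illustrative.

```python
# Minimal sketch of an edge-dropping augmentation for graph contrastive
# learning: each augmented view is a random subgraph of the same instance,
# and the two views form a positive pair for a contrastive objective.
import numpy as np

def drop_edges(edge_index: np.ndarray, drop_ratio: float = 0.2, rng=None) -> np.ndarray:
    """edge_index: (2, E) array of source/target node ids."""
    rng = rng or np.random.default_rng()
    keep = rng.random(edge_index.shape[1]) >= drop_ratio
    return edge_index[:, keep]

edges = np.array([[0, 0, 1, 2, 3],
                  [1, 2, 2, 3, 0]])        # toy graph as an edge list
view1 = drop_edges(edges, rng=np.random.default_rng(1))
view2 = drop_edges(edges, rng=np.random.default_rng(2))
# view1 and view2 are two stochastic subgraphs of the same graph instance.
```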

Challenges and Future Directions

Key Limitations

One major limitation in self-supervised learning (SSL), particularly in non-contrastive methods, is representational collapse, where the learned representations converge to trivial constant solutions that fail to capture meaningful features. This occurs because the optimization landscape allows for degenerate equilibria, such as all embeddings mapping to the same point, leading to ineffective downstream performance. Mitigations include techniques like stop-gradients, momentum encoders, feature normalization, and predictor networks to enforce diversity and prevent such collapses during training (see the variance-regularization sketch at the end of this section).

SSL approaches are computationally intensive, often requiring substantial resources for pretraining on large datasets. For instance, the SimCLR framework demands training on 128 TPU v3 cores with a batch size of 4096 for hundreds of epochs, equivalent to days of computation on hundreds of GPUs for comparable setups. This high compute demand contributes to significant environmental impacts.

Evaluation protocols in SSL frequently over-rely on linear probing, where a simple linear classifier is trained atop frozen representations, which may inflate perceived performance but fail to capture full model capabilities. Studies from 2022 highlight that proxy tasks like linear probing do not always generalize to complex downstream scenarios, showing drops in accuracy when full fine-tuning or real-world distribution shifts are considered.

Without explicit labels to guide learning, SSL models can amplify inherent biases in unlabeled datasets, such as demographic skews in web-scraped image corpora, leading to skewed representations that propagate unfairness to downstream applications. For example, visual SSL models trained on internet-scale data have been shown to exacerbate gender or racial imbalances present in the source material. SSL models also often underperform supervised counterparts on out-of-distribution (OOD) data, due to their reliance on distributional assumptions in pretraining that do not hold under shifts.

Emerging Trends

One prominent emerging trend in self-supervised learning (SSL) is the integration with foundation models, where SSL pre-training enhances the generalization of large-scale architectures across diverse tasks. This approach leverages vast unlabeled data to bootstrap representations, as seen in models like CLIP, which aligns image and text embeddings through contrastive objectives to achieve zero-shot transfer capabilities. Recent surveys highlight how such integrations reduce reliance on labeled data while improving downstream performance in vision-language tasks, with models like Data2Vec extending SSL paradigms uniformly across modalities such as vision, speech, and language.

Another key direction involves multimodal and cross-modal SSL, unifying representations from disparate data types to tackle complex real-world scenarios. Masked-modeling methods have evolved to support cross-modal reconstruction, enabling scalable learning in domains like robotics and healthcare where labeled multimodal data is scarce. For instance, in medical imaging, SSL facilitates transfer learning by pre-training on unlabeled scans, achieving improvements in few-shot segmentation tasks compared to supervised baselines. Emerging work also explores SSL in time series and graph data, with predictive modeling paradigms showing promise for anomaly detection and forecasting by exploiting temporal invariances.

Efficiency and theoretical unification represent critical advancements, addressing the computational demands of large-scale SSL. Innovations in masked image modeling, such as BEiT v2 and iBOT, optimize pre-training by focusing on semantic reconstruction, reducing training costs while maintaining competitive fine-tuning accuracies on ImageNet (e.g., 87.8% top-1 for iBOT). Theoretically, efforts to unify contrastive and generative SSL under information-theoretic frameworks aim to identify optimal pretext tasks, with recent analyses of SSL training dynamics targeting better interpretability. Additionally, combining SSL with continual learning and reinforcement learning is gaining traction, enabling adaptive agents in dynamic environments like autonomous systems. As of 2025, trends also include privacy-preserving SSL techniques, such as federated pretraining on unlabeled data, to address data privacy in distributed settings.

Challenges persist in robustness, but trends toward task-agnostic pretext tasks and multimodal integrations promise broader applicability. Surveys emphasize the need for benchmarks that evaluate SSL across distribution shifts, with methods combining contrastive and reconstructive learning emerging as a pathway to robust, generalizable representations.
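As referenced under Key Limitations, a variance-regularization term of the kind used by VICReg-style methods offers one concrete mitigation for representational collapse; the sketch below is illustrative, with the target standard deviation and the way the term is weighted against the main SSL loss chosen arbitrarily.

```python
# Minimal sketch of a variance-regularization term against collapse:
# a hinge penalty keeps the per-dimension standard deviation of the
# batch embeddings above a target value, discouraging all embeddings
# from converging to the same point.
import torch
import torch.nn.functional as F

def variance_penalty(z: torch.Tensor, target_std: float = 1.0, eps: float = 1e-4):
    std = torch.sqrt(z.var(dim=0) + eps)          # std of each embedding dimension
    return F.relu(target_std - std).mean()        # penalize dimensions that collapse

z = torch.randn(128, 256, requires_grad=True)      # batch of embeddings
loss = variance_penalty(z)                          # added to the main SSL loss in practice
loss.backward()
```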
