Zero-shot learning
Zero-shot learning (ZSL) is a machine learning paradigm that enables predictive models to recognize, classify, or perform tasks on categories or instances not encountered during training, by leveraging auxiliary semantic information, such as textual descriptions, attributes, or embeddings, that bridges knowledge from observed (seen) classes to novel (unseen) ones.[1] ZSL was pioneered in computer vision by Lampert et al. in 2009, who framed visual object categorization as transferring discriminative knowledge between seen and unseen classes via shared binary attributes, allowing classification without direct training examples for the target categories.[2] This approach addresses fundamental limitations of traditional supervised learning, such as the need for exhaustive labeled datasets, and has proven essential in open-world settings where new classes emerge dynamically.[3] Over time, ZSL has evolved from attribute-based methods to more sophisticated frameworks, gaining prominence with the rise of deep learning and pre-trained representations such as word vectors from Word2Vec or BERT.[4]

At its core, ZSL operates through knowledge transfer mechanisms that align visual or multimodal features with semantic spaces. Embedding-based methods project inputs and class descriptions into a shared latent space for compatibility matching, while generative techniques synthesize pseudo-samples for unseen classes using variational autoencoders or GANs to mitigate domain shift.[3] Variants include conventional ZSL, which assumes test data come solely from unseen classes, and generalized ZSL (GZSL), a more realistic setting in which test data mix seen and unseen classes; GZSL often suffers from hubness and from bias toward seen classes.[4] Evaluation typically relies on benchmarks such as Animals with Attributes (AwA), Caltech-UCSD Birds (CUB), and SUN, measuring accuracy via top-k recognition or, in GZSL, the harmonic mean of seen and unseen accuracies to balance performance across the two groups.[5]

ZSL's applications span diverse domains, including image and video classification for novel species or objects in wildlife monitoring, natural language processing for zero-shot text classification and question answering in multilingual settings, and robotics for adapting to unseen environments or actions without retraining.[6] In healthcare, it facilitates zero-shot diagnosis from medical images using semantic descriptions, while in autonomous vehicles it supports recognition of rare traffic scenarios via knowledge graphs or ontologies.[7]

Advances as of 2021 integrated ZSL with large foundation models, enabling emergent capabilities such as zero-shot prompting in vision-language systems like CLIP, which align images and text for broad generalization.[8] Further progress in 2024–2025 includes diffusion-based generative methods and improved zero-shot performance in multimodal large language models such as GPT-4V, though challenges in true generalization persist.[9] Despite this progress, open problems include semantic loss during transfer, scalability to high-dimensional data, and robustness against noisy auxiliary information.[4]
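The GZSL harmonic mean criterion can be made concrete with a short sketch. The following Python snippet (the accuracy values are hypothetical) computes H = 2·As·Au / (As + Au), which is high only when the model performs well on both seen and unseen classes:

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, classes):
    """Mean of per-class accuracies; assumes every class in `classes`
    occurs at least once in y_true."""
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in classes]))

def harmonic_mean(acc_seen, acc_unseen):
    """GZSL score H = 2 * As * Au / (As + Au): high only when the model
    does well on both seen and unseen classes."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2.0 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Hypothetical GZSL outcome: strongly biased toward seen classes.
print(harmonic_mean(0.80, 0.40))  # 0.533..., below the 0.60 arithmetic mean
```

Unlike the arithmetic mean, the harmonic mean penalizes imbalance, so a model that ignores unseen classes cannot score well.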
Fundamentals
Definition and Motivation
Zero-shot learning (ZSL) is a machine learning paradigm that enables models to recognize and classify instances from unseen classes at test time, without any training examples for those classes, by leveraging auxiliary semantic information to transfer knowledge from seen classes.[4] This approach was formally introduced in the seminal work by Lampert et al. (2009), which framed ZSL as the problem of object classification where training and test classes are disjoint, meaning no visual examples of the target classes are available during training.[10] In essence, ZSL shifts the focus from data-driven pattern recognition to semantically informed inference, allowing systems to handle open-world scenarios where new categories continually emerge.

The motivation for ZSL stems from the practical limitations of traditional supervised learning, which demands extensive labeled data for every class, a requirement that is often infeasible given data scarcity, high annotation costs, and the dynamic nature of real-world environments.[4] By enabling generalization to novel categories without retraining, ZSL addresses these challenges and emulates human-like cognition, where individuals can infer properties of unfamiliar objects from linguistic descriptions or prior knowledge rather than direct observation. This capability is particularly valuable in domains like computer vision and natural language processing, where the explosion of potential classes outpaces data collection efforts.

In the basic ZSL workflow, models are trained on a set of seen classes using paired visual features and auxiliary information, such as class attributes or textual descriptions, to learn a compatibility function that maps visual inputs to a shared semantic space.[4] At inference, unseen classes, described only semantically, are classified by projecting test instances into this space and matching them to the nearest unseen class representation via semantic transfer. For instance, a model trained on images of horses and of striped animals could classify a zebra (an unseen class) by recognizing its visual features as compatible with the attribute combination "striped horse", without ever encountering zebra images during training.[10]
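A minimal sketch of this workflow is given below, using hypothetical attribute vectors and synthetic features; ridge regression stands in for the learned compatibility function, one of many possible choices:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # dimensionality of the (synthetic) visual feature space

# Hypothetical binary attribute space: [striped, horse_like, feline].
attributes = {
    "horse": np.array([0.0, 1.0, 0.0]),  # seen
    "tiger": np.array([1.0, 0.0, 1.0]),  # seen
    "zebra": np.array([1.0, 1.0, 0.0]),  # unseen: a "striped horse"
    "lion":  np.array([0.0, 0.0, 1.0]),  # unseen
}
seen, unseen = ["horse", "tiger"], ["zebra", "lion"]

# Synthetic "visual features": class attributes pushed through a fixed
# random embedding plus noise (a stand-in for a real feature extractor).
embed = rng.normal(size=(3, D))
X = np.vstack([attributes[c] @ embed + 0.1 * rng.normal(size=D)
               for c in seen for _ in range(50)])
S = np.vstack([attributes[c] for c in seen for _ in range(50)])

# Compatibility function: ridge regression from visual features to the
# attribute space, trained only on seen classes.
lam = 1e-2
W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ S)  # D x 3

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def classify_unseen(x):
    """Project a test instance into attribute space and return the most
    compatible unseen class."""
    a = x @ W
    return max(unseen, key=lambda c: cosine(a, attributes[c]))

# A test image of a zebra: a class never seen during training.
x_test = attributes["zebra"] @ embed + 0.1 * rng.normal(size=D)
print(classify_unseen(x_test))  # -> "zebra"
```

The test instance is recognized because its projected attributes (striped, horse-like) match the semantic description of "zebra" more closely than that of any other unseen class, exactly the "striped horse" inference described above.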
Comparison with Other Paradigms
Zero-shot learning (ZSL) fundamentally differs from supervised learning by enabling the recognition of entirely novel classes without any labeled training examples for those classes, instead leveraging auxiliary information like semantic descriptions or attributes to transfer knowledge from seen classes. In supervised learning, models require extensive labeled datasets covering all target classes to learn discriminative features, limiting applicability to scenarios where new categories emerge without prior data collection.[11]

In contrast to few-shot learning, which adapts models using a minimal number of labeled examples (typically 1 to 5 per novel class) to generalize via metric-based or optimization techniques, ZSL relies solely on auxiliary knowledge without direct exemplars, emphasizing semantic bridging over episodic training.[12] One-shot learning, a specific case of few-shot learning, provides exactly one labeled example per new class to facilitate adaptation, whereas ZSL avoids even this single instance by focusing on cross-modal or embedding alignments for inference on unseen categories.

ZSL also extends beyond traditional transfer learning, which typically involves pre-training on a source task with abundant data and fine-tuning on a related target task that often shares similar classes or features; ZSL instead generalizes to semantically related but completely novel classes through compatibility functions or shared latent spaces.[11] This semantic transfer in ZSL supports open-world applications where test classes are disjoint from training ones, unlike transfer learning's emphasis on domain adaptation within overlapping distributions.

The following table summarizes key distinctions among these paradigms; a short code sketch after the table illustrates the few-shot versus zero-shot contrast.

| Paradigm | Data Requirements for Novel Classes | Generalization Type | Typical Use Cases |
|---|---|---|---|
| Supervised Learning | Many labeled examples per class | Discrimination among seen classes only | Abundant labeled datasets for closed-set classification[11] |
| Transfer Learning | Labeled source data; optional target labels | To related tasks/domains via feature reuse | Fine-tuning pre-trained models on similar problems[11] |
| Few-Shot Learning | 1–5 labeled examples per class | To novel classes with minimal support | Data-efficient adaptation in dynamic environments[12] |
| One-Shot Learning | Exactly 1 labeled example per class | To novel classes from single instance | Extreme data scarcity, e.g., personalized recognition |
| Zero-Shot Learning | Zero labeled examples; auxiliary information | To unseen classes via semantics | Open-vocabulary tasks like emerging categories |
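The contrast between the few-shot and zero-shot rows can be made concrete with a short sketch. Below, both paradigms classify by nearest prototype; the only difference is how the prototype for a novel class is obtained. The data are toy values, and a fixed hypothetical matrix stands in for a semantic-to-feature mapping learned on seen classes:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # feature dimensionality (toy setting)

def nearest_prototype(x, protos):
    """Classify x as the class whose prototype is nearest (Euclidean)."""
    return min(protos, key=lambda c: np.linalg.norm(x - protos[c]))

# Few-shot: each novel class supplies a handful of labeled support
# examples, and the prototype is their mean (metric-based adaptation).
support = {
    "cat": rng.normal(loc=+1.0, size=(5, D)),
    "dog": rng.normal(loc=-1.0, size=(5, D)),
}
fewshot_protos = {c: s.mean(axis=0) for c, s in support.items()}

# Zero-shot: no labeled examples at all; the prototype is an auxiliary
# semantic vector mapped into feature space. The mapping here is a fixed
# hypothetical matrix standing in for one learned on seen classes.
semantics = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
learned_map = np.vstack([np.full(D, +1.0), np.full(D, -1.0)])  # 2 x D
zeroshot_protos = {c: v @ learned_map for c, v in semantics.items()}

x = rng.normal(loc=+1.0, size=D)  # query drawn near the "cat" region
print(nearest_prototype(x, fewshot_protos))   # five labels per class
print(nearest_prototype(x, zeroshot_protos))  # zero labels per class
```

Both calls return the same decision here, but the zero-shot variant reaches it without a single labeled example of the novel classes, which is the defining distinction in the table above.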