
Zero-shot learning

Zero-shot learning (ZSL) is a machine learning paradigm that enables predictive models to recognize, classify, or perform tasks on categories or instances not encountered during training, by leveraging auxiliary semantic information—such as textual descriptions, attributes, or embeddings—that bridges knowledge from observed (seen) classes to novel (unseen) ones. Introduced in computer vision for visual object categorization, ZSL was pioneered by Lampert et al. in 2009, who framed it as transferring discriminative knowledge via shared binary attributes between seen and unseen classes, allowing classification without direct examples for target categories. This approach addresses fundamental limitations of traditional supervised learning, such as the need for exhaustive labeled datasets, and has proven essential in scenarios with open-world data distributions, where new classes emerge dynamically. Over time, ZSL has evolved from attribute-based methods to more sophisticated frameworks, gaining prominence with the rise of deep learning and pre-trained semantic representations such as word vectors from models like Word2Vec or GloVe.

At its core, ZSL operates through mechanisms that align visual or other input features with semantic spaces; for instance, embedding-based methods project inputs and class descriptions into a shared space for matching, while generative techniques synthesize pseudo-samples for unseen classes using variational autoencoders or GANs to mitigate domain shifts. Variants include conventional ZSL, which assumes test data solely from unseen classes, and generalized ZSL (GZSL), a more realistic setting that tests on mixtures of seen and unseen classes and often suffers from hubness and bias toward seen classes. Evaluation typically relies on benchmarks like Animals with Attributes (AwA), Caltech-UCSD Birds (CUB), and SUN, measuring accuracy via top-k recognition or harmonic means in GZSL to balance seen-unseen performance.

ZSL's applications span diverse domains, including image and video recognition of novel categories or objects in computer vision, natural language processing for zero-shot text classification and translation in multilingual settings, and robotics for adapting to unseen environments or actions without retraining. In healthcare, it facilitates zero-shot diagnosis from medical images using semantic descriptions, while in autonomous vehicles, it supports handling of rare traffic scenarios via knowledge graphs or ontologies. Advances as of 2021 integrated ZSL with large foundation models, enabling emergent capabilities like zero-shot prompting in vision-language systems such as CLIP, which align images and text for broad generalization. Further progress in 2024-2025 includes diffusion-based generative methods and enhanced zero-shot performance in multimodal large language models like GPT-4V, though challenges in true generalization persist. Despite this progress, ongoing challenges include semantic information loss during transfer, scalability to high-dimensional data, and robustness against noisy auxiliary information.

Fundamentals

Definition and Motivation

Zero-shot learning (ZSL) is a paradigm that enables models to recognize and classify instances from unseen classes at test time, without any training examples for those classes, by leveraging auxiliary semantic information to transfer knowledge from seen classes. This approach was formally introduced in the seminal work by Lampert et al. (2009), which framed ZSL as the problem of object classification where training and test classes are disjoint, meaning no visual examples of the target classes are available during training. In essence, ZSL shifts the focus from purely data-driven learning to semantically informed inference, allowing systems to handle open-world scenarios where new categories continually emerge.

The motivation for ZSL stems from the practical limitations of traditional supervised learning, which demands extensive labeled data for every class—a requirement that is often infeasible due to data scarcity, high annotation costs, and the dynamic nature of real-world environments. By enabling generalization to novel categories without retraining, ZSL addresses these challenges and emulates human-like reasoning, where individuals can infer properties of unfamiliar objects from linguistic descriptions or prior knowledge rather than direct observation. This capability is particularly valuable in domains like computer vision and natural language processing, where the explosion of potential classes outpaces data collection efforts.

In the basic ZSL workflow, models are trained on a set of seen classes using paired visual features and auxiliary information, such as class attributes or textual descriptions, to learn a compatibility function that maps visual inputs to a shared semantic space. At test time, unseen classes—described only semantically—are classified by projecting test instances into this space and matching them to the nearest unseen class representation via semantic transfer. For instance, a model trained on images of horses and on attributes like stripes could classify a zebra (an unseen class) by recognizing its visual features as compatible with the attribute combination "striped horse," without ever encountering zebra images during training.
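The following is a minimal sketch of the test-time step just described: project a visual feature into the semantic space and match it to the nearest unseen-class prototype. The projection matrix, attribute dimensions, and class prototypes are illustrative assumptions, not values from any published model.

```python
import numpy as np

# Attribute prototypes for two unseen classes: [has_stripes, horse_like, aquatic]
unseen_prototypes = {
    "zebra":   np.array([1.0, 1.0, 0.0]),
    "dolphin": np.array([0.0, 0.0, 1.0]),
}

def classify_unseen(x, W, prototypes):
    """Project a visual feature into the semantic space (z = W @ x) and
    return the unseen class whose prototype is closest by cosine similarity."""
    z = W @ x
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return max(prototypes, key=lambda c: cos(z, prototypes[c]))

# Toy example: a 5-dimensional visual feature and a hand-set projection that
# stands in for a mapping learned on seen classes.
W = np.random.RandomState(0).randn(3, 5) * 0.1
W[0, 0] = W[1, 1] = 1.0                          # pretend training aligned these dimensions
x_test = np.array([0.9, 0.8, 0.1, 0.0, 0.0])     # "striped, horse-like" image feature
print(classify_unseen(x_test, W, unseen_prototypes))  # -> "zebra"
```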

Comparison with Other Paradigms

Zero-shot learning (ZSL) fundamentally differs from supervised learning by enabling the recognition of entirely novel classes without any labeled training examples for those classes, instead leveraging auxiliary information like semantic descriptions or attributes to transfer knowledge from seen classes. In supervised learning, models require extensive labeled datasets covering all target classes to learn discriminative features, limiting applicability to scenarios where new categories emerge without prior annotation. In contrast to few-shot learning, which adapts models using a minimal number of labeled examples (typically 1 to 5 per class) to generalize via metric-based or optimization-based techniques, ZSL relies solely on auxiliary knowledge without direct exemplars, emphasizing semantic bridging over episodic training. One-shot learning, a specific case of few-shot learning, provides exactly one labeled example per new class to facilitate adaptation, whereas ZSL avoids even this single instance by focusing on cross-modal or semantic alignments for inference on unseen categories. ZSL also extends beyond traditional transfer learning, which typically involves pre-training on a source task with abundant data and fine-tuning on a related target task—often sharing similar classes or features—by enabling generalization to semantically related but completely novel classes through compatibility functions or shared latent spaces. This semantic transfer in ZSL supports open-world applications where test classes are disjoint from training ones, unlike transfer learning's emphasis on adaptation within overlapping distributions. The following table summarizes key distinctions among these paradigms:
Paradigm | Data Requirements for Novel Classes | Generalization Type | Typical Use Cases
Supervised Learning | Many labeled examples per class | Intra-class discrimination within seen data | Abundant labeled datasets for closed-set classification
Transfer Learning | Labeled source data; optional target labels | To related tasks/domains via feature reuse | Fine-tuning pre-trained models on similar problems
Few-Shot Learning | 1–5 labeled examples per class | To novel classes with minimal support | Data-efficient adaptation in dynamic environments
One-Shot Learning | Exactly 1 labeled example per class | To novel classes from a single instance | Extreme data scarcity, e.g., personalized recognition
Zero-Shot Learning | Zero labeled examples; auxiliary semantic information | To unseen classes via semantics | Open-vocabulary tasks with emerging categories

Historical Background

Origins and Early Developments

Zero-shot learning originated in the late 2000s as researchers sought to overcome the scalability issues of supervised image classification systems, where models trained on fixed datasets struggled to handle emerging categories without additional labeled data. A key motivation stemmed from challenges observed in early benchmarks like the Caltech-101 dataset, introduced in 2003, which contained images across 101 object categories but underscored the need for methods that could generalize to unseen classes in dynamic real-world scenarios. The foundational work in this area was presented by Lampert, Nickisch, and Harmeling in 2009, who proposed an attribute-based framework for detecting unseen object classes through between-class attribute transfer. In their approach, human-interpretable attributes—such as "has stripes" or "is furry"—served as intermediaries to map visual features from known classes to novel ones, enabling recognition without direct training examples for the target categories. This method relied on hand-crafted image features, such as SIFT descriptors and color histograms, combined with probabilistic models for attribute prediction and class inference. To support evaluation, they introduced the Animals with Attributes (AwA) dataset, comprising over 30,000 images of 50 animal species annotated with 85 binary attributes, which became a standard benchmark for early zero-shot experiments. Concurrent developments further solidified attribute-centric ideas for zero-shot recognition. For instance, Farhadi et al. (2009) explored describing objects by their attributes to facilitate recognition of unfamiliar categories, emphasizing how semantic descriptions could bridge seen and unseen categories without deep neural architectures. The term "zero-shot learning" itself was formally coined in a 2009 paper by Palatucci et al., who framed it as a problem involving semantic output codes for predicting labels from auxiliary information, laying theoretical groundwork in a non-vision-specific setting. These pre-deep-learning efforts prioritized logical rules and shallow classifiers, marking the shift toward semantic knowledge transfer in recognition tasks.

Key Milestones in Deep Learning Era

The deep learning era transformed zero-shot learning (ZSL) by leveraging neural networks for visual feature extraction and aligning them with rich semantic representations, building on earlier attribute-based foundations. A pivotal advancement occurred in 2013 with the DeViSE model, which embedded images from deep convolutional networks into a semantic space derived from skip-gram word embeddings, enabling zero-shot transfer by measuring compatibility between visual and textual descriptions. This approach demonstrated improved performance on large-scale datasets like ImageNet, achieving hit rates of up to 10% across thousands of novel labels not observed during training by exploiting a vast unlabeled text corpus for semantic knowledge.

From 2015 to 2017, the focus shifted toward learning frameworks that optimized joint embeddings between visual features and class semantics, often using ranking losses to handle fine-grained distinctions. Akata et al. introduced structured output embeddings in 2015, mapping images and hierarchical class labels into a joint embedding space via bidirectional ranking, which boosted zero-shot accuracy on datasets like CUB-200-2011 to 28.0% by incorporating semantic hierarchies. This was extended in 2016 with latent embeddings for zero-shot classification, where a model learned a piecewise compatibility function in a low-dimensional latent space, improving performance in generalized ZSL settings on Animals with Attributes. These methods emphasized end-to-end training with deep architectures, moving beyond hand-crafted attributes.

A landmark review by Xian et al. in 2017 synthesized these developments, establishing standardized benchmarks like Animals with Attributes 2 and ImageNet subsets while critiquing the limitations of direct transfer methods, thus catalyzing the transition to more robust, generative paradigms. This evaluation underscored the strength of embedding-based models, with top methods reaching 50-60% accuracy on seen classes but highlighting domain shift challenges for unseen ones. The emergence of generative ZSL in 2018 addressed these gaps by synthesizing features for unseen classes, exemplified by Verma et al.'s framework that used conditional variational autoencoders to generate diverse samples conditioned on semantic attributes, mitigating the hubness problem and achieving around 46.7% accuracy in generalized ZSL evaluations. This generative shift enabled better calibration between seen and unseen domains, paving the way for handling imbalanced class distributions.

In the 2020s, large language models revolutionized ZSL in natural language processing through in-context learning, as demonstrated by GPT-3, which performed tasks like translation and question answering in zero-shot settings by conditioning on prompted instructions without fine-tuning, attaining around 60% average score on SuperGLUE benchmarks. A key multimodal advancement was CLIP in 2021, which aligned images and text via contrastive learning for broad zero-shot transfer in vision-language tasks. This paradigm extended ZSL principles to multimodal and vision-language models, influencing prompting-based approaches for visual tasks.

Core Concepts

Auxiliary Information

Auxiliary information in zero-shot learning (ZSL) refers to the semantic side data that connects visual features of seen classes to descriptions of unseen classes, enabling recognition without direct training examples. This information typically includes attributes, word vectors, knowledge graphs, and textual descriptions, each offering distinct ways to encode class semantics. Attributes consist of human-interpretable descriptors, either binary (e.g., presence or absence of "has stripes") or continuous (e.g., numerical values for "body size"), that articulate key properties of classes. Word vectors, derived from models like Word2Vec, represent classes as dense embeddings capturing linguistic similarities, such as proximity between "zebra" and "horse" in vector space. Knowledge graphs, such as WordNet hierarchies, model classes as nodes connected by relational edges (e.g., "is-a" or "part-of" links), providing structured ontological knowledge. Textual descriptions encompass natural language summaries or sentences that describe class characteristics, often in paragraph form.

The primary role of auxiliary information is to establish transferable semantic links between seen and unseen classes, allowing models to infer properties of novel categories. For instance, attributes can compose a zebra's description as a horse-like animal with stripes, facilitating recognition via shared similarities. Word vectors enable analogy-based inference, where unseen classes are positioned relative to seen ones based on co-occurrence patterns in language data. Knowledge graphs support hierarchical propagation of knowledge, such as inferring attributes of a "hyena" from its relations to semantically neighboring animal classes, as sketched below. Textual descriptions allow for flexible, context-rich bridging, capturing nuanced details that rigid attributes might overlook. These mechanisms collectively address the absence of unseen-class examples by aligning visual and semantic spaces.

Auxiliary information is constructed through manual or automatic processes, balancing accuracy with scalability. Manual construction involves expert labeling, as seen in the Animals with Attributes (AwA) dataset, where 85 binary and continuous attributes describe 50 animal classes, enabling early ZSL experiments on fine-grained recognition. In contrast, automatic methods harvest data from large corpora; for example, word vectors are trained on vast text collections such as Wikipedia dumps using models like skip-gram, while textual descriptions are extracted directly from encyclopedia articles for each class. Knowledge graphs are often pre-built from lexical resources like WordNet but can be augmented automatically via relation extraction tools. Early ZSL works frequently combined manual attributes with automatic embeddings to enhance transferability.

Despite their utility, auxiliary information has notable limitations that can hinder ZSL performance. Attributes may introduce subjectivity or incompleteness in coverage, failing to capture all relevant traits for diverse classes. Word vectors and embeddings are prone to biases from their training corpora, leading to distorted semantic distances (e.g., cultural biases in word embeddings). Knowledge graphs suffer from incompleteness, as not all relations or nodes are exhaustively defined, potentially isolating unseen classes from seen ones. Textual descriptions can vary in quality and length, introducing variability in semantic density. These issues often amplify the hubness problem in high-dimensional spaces or exacerbate domain shift between visual and semantic modalities.
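The sketch below illustrates the knowledge-graph propagation idea mentioned above: an unseen class inherits an attribute vector as a weighted average of its graph neighbors. Class names, edge weights, and attribute values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Attribute vectors for seen classes: [carnivore, striped, domesticated]
seen_attributes = {
    "wolf":  np.array([1.0, 0.0, 0.0]),
    "dog":   np.array([1.0, 0.0, 1.0]),
    "horse": np.array([0.0, 0.0, 1.0]),
}

# Graph edges linking the unseen class "hyena" to related seen classes,
# weighted by assumed semantic relatedness.
neighbors_of_unseen = {"wolf": 0.6, "dog": 0.4}

def propagate_attributes(neighbors, attributes):
    """Estimate an unseen class's attribute vector as the weighted average
    of its neighbors' attribute vectors in the knowledge graph."""
    total = sum(neighbors.values())
    return sum(w * attributes[c] for c, w in neighbors.items()) / total

hyena_attrs = propagate_attributes(neighbors_of_unseen, seen_attributes)
print(hyena_attrs)  # [1.0, 0.0, 0.4]: carnivorous, not striped, rarely domesticated
```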

Semantic Embeddings and Spaces

In zero-shot learning (ZSL), semantic embeddings and shared embedding spaces form the foundational mechanism for transferring knowledge from seen to unseen classes by projecting heterogeneous data—such as visual features and textual descriptions—into a unified space. This joint embedding allows for semantic alignment, where the similarity between an input instance and a class label is computed based on their proximity, enabling classification without direct training examples for novel categories. Typically, these spaces are low-dimensional manifolds, though hyperbolic geometries have been explored to better capture hierarchical semantic relationships inherent in class taxonomies.

Pre-trained embeddings serve as the building blocks for this joint space. For semantic descriptions, word embeddings like Word2Vec or GloVe, which capture co-occurrence statistics in text corpora, or contextual models such as BERT, which generate dynamic representations from transformer architectures, provide rich textual vectors for class labels. On the visual side, deep convolutional neural networks (CNNs), exemplified by ResNet architectures, extract feature vectors from images, which are then mapped into the semantic space to align with textual embeddings. This projection ensures that visually similar objects cluster near semantically related class descriptions, facilitating zero-shot inference.

A core component is the compatibility function, which quantifies the alignment between an input \mathbf{x} (e.g., an image feature) and a class label y (e.g., a semantic description). It is commonly expressed as s(\mathbf{x}, y) = \phi(\mathbf{x})^\top \psi(y), where \phi: \mathcal{X} \to \mathcal{Z} projects the visual feature \mathbf{x} into the joint embedding space \mathcal{Z}, and \psi: \mathcal{Y} \to \mathcal{Z} embeds the semantic representation of y into the same space; the inner product then measures their semantic compatibility, and the function is bilinear when both projections are linear. This function underpins classification by selecting the class with the highest score, and it is often refined through ranking losses during training on seen classes.

To ensure effective transfer, calibration of the embedding space is essential, particularly to maintain balanced representations that prevent overemphasis on seen classes and promote equitable semantic coverage for unseen ones. Techniques such as normalization or bias correction adjust the projections to mitigate distributional shifts, ensuring that the joint space reflects the auxiliary semantics uniformly. High-dimensional embeddings, however, introduce the hubness problem, where certain points (hubs) become nearest neighbors to disproportionately many others due to the curse of dimensionality, leading to biased similarity searches that favor seen-class prototypes over unseen ones. This phenomenon degrades ZSL performance, as hubs can dominate compatibility scores, and is typically addressed through normalization or hubness-aware similarity metrics, though it remains a key challenge in embedding design.
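A small sketch of the compatibility scoring described above follows. The projection matrices stand in for learned parameters, and the dimensions and class embeddings are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
d_visual, d_semantic, d_joint = 2048, 300, 128

W_v = rng.randn(d_joint, d_visual) * 0.01    # learned visual projection (phi)
W_s = rng.randn(d_joint, d_semantic) * 0.01  # learned semantic projection (psi)

def compatibility(x, class_embeddings):
    """Score every candidate class y by the inner product of the projected
    visual feature phi(x) and the projected class embedding psi(y)."""
    phi_x = W_v @ x                               # visual -> joint space
    scores = {}
    for name, a in class_embeddings.items():
        psi_y = W_s @ a                           # semantic -> joint space
        scores[name] = float(phi_x @ psi_y)
    return scores

x = rng.randn(d_visual)                            # e.g. a ResNet image feature
classes = {"zebra": rng.randn(d_semantic), "otter": rng.randn(d_semantic)}
scores = compatibility(x, classes)
prediction = max(scores, key=scores.get)           # argmax over compatibility scores
```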

Methodologies

Attribute-Based Approaches

Attribute-based approaches represent an early paradigm in zero-shot learning, leveraging explicit semantic attributes as auxiliary information to transfer knowledge from seen to unseen classes without relying on deep neural networks. These methods decompose classification into intermediate attribute prediction tasks, enabling recognition through semantic matching rather than direct visual alignment. In the direct attribute prediction (DAP) scheme, a model first predicts a vector of attribute scores \mathbf{a} for an input \mathbf{x} using classifiers trained on seen classes, then determines the class label y by matching \mathbf{a} to predefined class prototypes \mathbf{A}_y via \arg\max_y P(y|\mathbf{a}). The probabilistic foundation of this approach models the class posterior as P(y|\mathbf{x}) \propto P(\mathbf{a}|\mathbf{x}) P(y|\mathbf{a}), where attribute independence is often assumed so that the joint likelihood can be computed efficiently as a product of per-attribute probabilities. A foundational implementation appears in Lampert et al. (2009), who trained one-vs-all support vector machines (SVMs) for each attribute on seen-class data, achieving zero-shot recognition by propagating attribute predictions to unseen class prototypes. These techniques offer high interpretability, as attribute predictions provide human-readable explanations for classifications, but they demand extensive manual annotation of attributes and scale poorly with increasing numbers of attributes or classes due to the difficulty of modeling attribute dependencies. Benchmarking typically occurs on datasets like Animals with Attributes (AwA), which includes 30,475 images of 50 animal classes annotated with 85 binary attributes, and aPY (Attribute Pascal and Yahoo), featuring 15,339 images across 32 coarse-grained classes with 64 attributes.
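A hedged sketch of the DAP pipeline follows, using per-attribute logistic classifiers as stand-ins for the per-attribute SVMs of the original work; the features, attribute labels, and class prototypes are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n_train, d, n_attr = 200, 20, 5

X_train = rng.randn(n_train, d)
A_train = (rng.rand(n_train, n_attr) > 0.5).astype(int)   # per-image attribute labels

# One probabilistic classifier per attribute, trained only on seen-class data.
attribute_models = [LogisticRegression(max_iter=1000).fit(X_train, A_train[:, j])
                    for j in range(n_attr)]

# Binary attribute prototypes for two unseen classes (illustrative values).
unseen_prototypes = {"zebra": np.array([1, 1, 0, 0, 1]),
                     "whale": np.array([0, 0, 1, 1, 0])}

def dap_predict(x):
    """Estimate P(a_j = 1 | x) per attribute, then score each unseen class by
    the product of matching attribute probabilities (naive independence)."""
    p = np.array([m.predict_proba(x.reshape(1, -1))[0, 1] for m in attribute_models])
    scores = {c: np.prod(np.where(proto == 1, p, 1 - p))
              for c, proto in unseen_prototypes.items()}
    return max(scores, key=scores.get)

print(dap_predict(rng.randn(d)))
```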

Embedding and Compatibility Methods

Embedding and compatibility methods represent a class of techniques in zero-shot learning that project visual features into a shared semantic space with class descriptions, enabling classification via compatibility scores between projected features and unseen class embeddings. These approaches avoid explicit attribute prediction by learning end-to-end mappings that align visual and semantic representations, allowing direct inference on novel classes through similarity computation. By focusing on projection functions and scoring mechanisms, they facilitate knowledge transfer without requiring generative modeling or transductive assumptions.

The foundational framework involves learning a projection matrix W_v that maps visual features \phi(x) into the semantic space, sometimes paired with a reverse projection W_s that maps semantic embeddings \psi(y) back toward the visual space, with the objective of minimizing the alignment error \| W_v \phi(x) - \psi(y) \| over seen-class samples during training. This bidirectional alignment ensures that visual inputs are embedded close to their corresponding semantic descriptions, while the semantic-to-visual projection regularizes the space for robustness. At inference, a compatibility score such as s(x, y) = (W_v \phi(x))^\top \psi(y) (or a simplified bilinear form \phi(x)^\top W \psi(y)) ranks candidate unseen classes, selecting the highest-scoring match as the prediction.

Compatibility learning employs a ranking-based objective to enforce that correct class embeddings score higher than incorrect ones for a given input. Specifically, the loss is formulated as \min L = \sum_{y' \neq y} \log(1 + \exp(s(x, y') - s(x, y))), where y is the true class and the sum runs over incorrect classes y', promoting margin-based separation in the embedded space via a smooth approximation to pairwise hinge ranking. This objective is optimized jointly with visual feature extraction, often using stochastic gradient descent on deep convolutional networks.

Key variants include Attribute Label Embedding (ALE), which learns the compatibility matrix W with a ranking objective on seen classes, yielding efficient bilinear scoring against attribute vectors. DeViSE (Deep Visual-Semantic Embedding), in contrast, projects deep visual activations directly into a word-vector semantic space using a hinge ranking loss \max(0, \mu + s(x, y') - s(x, y)) (with margin \mu), supporting an open-ended vocabulary by leveraging large-scale linguistic knowledge from unannotated text. These methods differ in their embedding choices—ALE uses compact attribute vectors, while DeViSE employs high-dimensional word embeddings—but both emphasize scalable learning over exhaustive pairwise comparisons.

Performance in generalized zero-shot settings is evaluated using the harmonic mean accuracy, defined as H = 2 \times \frac{\text{acc}_\text{seen} \times \text{acc}_\text{unseen}}{\text{acc}_\text{seen} + \text{acc}_\text{unseen}}, which balances recognition on seen and unseen classes to penalize bias toward training data. For instance, ALE achieves around 30-40% unseen accuracy on Animals with Attributes datasets under standard splits, while DeViSE reports up to 6.0% top-1 accuracy on ImageNet zero-shot subsets (2-hop labels), highlighting the trade-off between embedding richness and generalization.
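The snippet below spells out the smooth pairwise ranking loss defined above for a single training example. The bilinear score, dimensions, and random vectors are hypothetical placeholders for a learned compatibility matrix and real features.

```python
import numpy as np

def compat(x, a, W):
    """Bilinear compatibility s(x, y) = phi(x)^T W psi(y)."""
    return x @ W @ a

def ranking_loss(x, true_attr, wrong_attrs, W):
    """Smooth pairwise ranking loss: sum over incorrect classes y' of
    log(1 + exp(s(x, y') - s(x, y)))."""
    s_true = compat(x, true_attr, W)
    return sum(np.log1p(np.exp(compat(x, a, W) - s_true)) for a in wrong_attrs)

rng = np.random.RandomState(0)
d_v, d_s = 64, 16
W = rng.randn(d_v, d_s) * 0.01           # compatibility matrix, learned in practice
x = rng.randn(d_v)                       # visual feature phi(x)
true_attr = rng.randn(d_s)               # semantic embedding psi(y) of the true class
wrong_attrs = [rng.randn(d_s) for _ in range(3)]
print(ranking_loss(x, true_attr, wrong_attrs, W))  # value driven down during training
```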

Generative and Transductive Techniques

Generative techniques in zero-shot learning (ZSL) address the challenge of unseen classes by synthesizing visual features conditioned on semantic embeddings, thereby augmenting the training data for classifiers. These methods typically employ generative adversarial networks (GANs) or variational autoencoders (VAEs) to model the distribution of visual features for novel categories. For instance, in a conditional GAN setup, a generator produces features x for an unseen class by sampling a latent vector z \sim \mathcal{N}(0, I) and conditioning on the class's semantic representation \psi(y), formalized as x = G(z, \psi(y)), where G is the generator. This synthesis allows the model to train on both real seen-class features and generated unseen-class features, mitigating the imbalance in standard ZSL setups.

A seminal approach is the feature generating network f-CLSWGAN, which integrates a conditional GAN optimized with the Wasserstein distance and a classification loss to produce high-quality features for unseen classes. This method balances seen and unseen class representations during training, achieving significant improvements in generalized ZSL (GZSL) by reducing the seen-unseen bias on benchmarks like Animals with Attributes 2 (AWA2). Similarly, conditional VAE-based methods, such as CVAE-ZSL, utilize a decoder p(x|z, y) to reconstruct features from a latent variable informed by semantic attributes, enabling probabilistic generation that captures class variability. These generative strategies have demonstrated robustness across datasets, outperforming embedding-based methods by up to 10-15% in unseen class accuracy on tasks like image classification.

Transductive techniques extend ZSL by leveraging unlabeled test samples at inference time to refine decision boundaries, often through clustering or manifold constraints. These methods assume access to test instances and their semantic descriptions, allowing adaptation to the target domain. For example, visual constraints on cluster centers can align prototypes derived from test clusters with semantic spaces, improving projection generality and addressing domain shifts. Clustering-based approaches, such as those using k-means on test features followed by prototype matching, detect out-of-distribution samples and assign labels by minimizing distances to semantic clusters, enhancing accuracy in transductive settings by 5-10% over inductive baselines on datasets like CUB-200. This incorporation of test-time information enables better generalization, particularly in scenarios with distribution mismatches.

Hybrid generative-transductive methods combine feature synthesis with test-time refinement for more balanced performance in GZSL. Such integrations alleviate the hubness problem and the bias toward seen classes, yielding state-of-the-art results like 65.6% accuracy in GZSL on the Flowers-102 (FLO) dataset. Overall, these techniques advance ZSL by enabling data-efficient learning and robust generalization, though they require careful hyperparameter tuning to avoid mode collapse in GANs.
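Below is a hedged sketch of conditional feature synthesis in the spirit of the approaches above: a generator G(z, psi(y)) maps noise plus a class embedding to a synthetic visual feature. The architecture, dimensions, and attribute vector are illustrative assumptions, not a published configuration, and the adversarial training loop is omitted.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, z_dim=128, attr_dim=85, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + attr_dim, 4096),
            nn.LeakyReLU(0.2),
            nn.Linear(4096, feat_dim),
            nn.ReLU(),            # CNN features are non-negative after ReLU
        )

    def forward(self, z, attr):
        # Condition on the class embedding by concatenating it with the noise.
        return self.net(torch.cat([z, attr], dim=1))

G = ConditionalGenerator()
attr_zebra = torch.rand(1, 85)                    # semantic description psi(y) of an unseen class
z = torch.randn(16, 128)                          # 16 latent samples
fake_features = G(z, attr_zebra.expand(16, -1))   # 16 synthetic "zebra" features
# Together with real seen-class features, such synthetic features can train an
# ordinary softmax classifier over the full (seen + unseen) label space.
```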

Generalized Zero-Shot Learning

Problem Setup and Formulation

In generalized zero-shot learning (GZSL), the training phase utilizes labeled data from a set of seen classes \mathcal{S}, supplemented by auxiliary semantic information (such as attribute vectors or word embeddings) available for both the seen classes \mathcal{S} and a disjoint set of unseen classes \mathcal{U}. At test time, the model must classify samples drawn from the combined label space \mathcal{Y} = \mathcal{S} \cup \mathcal{U}, with the primary goal of achieving balanced recognition across both seen and unseen classes while mitigating the inherent bias toward seen classes that arises from training exclusively on \mathcal{S}. GZSL models are trained to minimize the empirical risk on seen classes, but evaluation must quantify the trade-off between unseen-class generalization and seen-class retention in a test setting that mixes both, where performance drops sharply if the model overfits to seen classes.

Unlike standard zero-shot learning (ZSL), which evaluates models solely on samples from \mathcal{U} and assumes no overlap with training classes, GZSL tests on \mathcal{S} \cup \mathcal{U}, exposing the domain shift between the source domain (seen data) and the target domain (mixed seen-unseen data). This extension makes GZSL a more practical and stringent benchmark for real-world deployment, where unseen classes emerge alongside familiar ones. Performance in GZSL is typically assessed using the harmonic mean of per-class accuracies, H = \frac{2 \cdot \mathrm{Acc}_s \cdot \mathrm{Acc}_u}{\mathrm{Acc}_s + \mathrm{Acc}_u}, where \mathrm{Acc}_s and \mathrm{Acc}_u denote the mean accuracies on seen and unseen classes, respectively. This metric emphasizes balanced competence, since it diminishes sharply if a model overfits to seen classes at the expense of unseen ones. Prominent benchmark datasets for GZSL evaluation include Animals with Attributes 2 (AWA2), comprising 37,322 images from 50 animal classes with 85 attributes, and Caltech-UCSD Birds-200-2011 (CUB), which features 11,788 images across 200 species annotated with 312 attributes. These datasets, split into seen and unseen classes according to predefined protocols, enable standardized comparisons of GZSL methods under realistic semantic gaps.
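A minimal helper for the harmonic-mean metric defined above, with illustrative accuracy values:

```python
def gzsl_harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    """H = 2 * Acc_s * Acc_u / (Acc_s + Acc_u); returns 0 if both accuracies are 0."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# A model with 85% seen but only 35% unseen accuracy scores far below its arithmetic mean:
print(gzsl_harmonic_mean(0.85, 0.35))  # ~0.50, penalizing the seen-class bias
```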

Mitigation Strategies for Domain Shift

In generalized zero-shot learning (GZSL), domain shift manifests as a bias toward seen classes during test-time evaluation, where models trained solely on seen data struggle to generalize to unseen classes due to distributional discrepancies in the visual-semantic space. Mitigation strategies address this by enforcing balanced representations, adjusting decision boundaries, or transferring knowledge across modalities to reduce the seen-unseen imbalance. These approaches typically aim to improve the harmonic mean (H-mean) accuracy, which balances seen and unseen class recognition rates and is a standard metric for evaluating GZSL performance on benchmarks like the SUN dataset.

Semantic autoencoders provide a foundational strategy for creating balanced embeddings that mitigate domain shift by aligning visual features with semantic attributes while preventing overfitting to seen classes. In this framework, an encoder projects input visual features into a low-dimensional semantic space, and a decoder reconstructs the original features, with regularization terms ensuring the embeddings remain discriminative yet transferable to unseen classes. For instance, the Semantic AutoEncoder (SAE) incorporates a reconstruction constraint to handle the hubness issue—where seen class prototypes dominate the embedding space—and has been extended to GZSL variants using multimodal variational autoencoders for shared latent spaces across vision and semantics. These methods reduce bias by learning invariant representations, leading to more equitable treatment of seen and unseen domains.

Calibration methods offer post-hoc adjustments to decision boundaries, correcting the inherent bias without retraining the core model. A seminal technique is calibrated stacking, which combines zero-shot predictions for unseen classes with seen-class predictions using a tunable calibration factor γ to reweight scores, effectively shifting the decision boundary to favor unseen classes during inference (see the sketch at the end of this subsection). This approach treats GZSL as a biased classification problem and adjusts logits post-training, improving balance on datasets with significant domain gaps. More advanced logit adjustment strategies build on this by theoretically deriving adjustment parameters from class priors, further refining boundaries to account for semantic discrepancies.

Knowledge transfer via cross-modal distillation enhances semantic alignment by propagating information from visual to textual spaces, reducing shift through teacher-student frameworks. In these methods, a pre-trained vision-language model (e.g., CLIP) acts as the teacher, distilling cross-modal alignments into a student model that generates or refines embeddings for unseen classes, ensuring consistent visual-semantic mappings. This preserves rich textual semantics while adapting visual features, mitigating bias in GZSL by leveraging large-scale pre-training to bridge modality gaps. For example, self-distillation variants enforce feature consistency across modalities, boosting transfer to unseen domains without auxiliary data.

Outlier detection techniques address domain shift by modeling seen classes as inliers and treating unseen samples as outliers during inference, routing the latter to specialized zero-shot mechanisms. Density-based methods, such as the Local Outlier Factor (LOF), estimate the manifold of seen class features and compute outlier scores to distinguish test samples, preventing misclassification of unseen instances as seen ones. Integrated into GZSL pipelines, these approaches use the outlier probability to adjust predictions, effectively calibrating the decision process for imbalanced domains.
Early adaptations combined LOF with semantic embeddings to improve unseen recall without generative overhead. Recent advancements since 2023 incorporate large language models (LLMs) for bias mitigation through prompt-based semantic enrichment, leveraging their zero-shot capabilities to refine class descriptions and embeddings dynamically. By prompting LLMs to generate or adjust semantic attributes for unseen classes—e.g., via self-adaptive prompts that align structural semantics—these methods calibrate the visual-semantic space at test time, reducing domain bias in vision-language models. Further 2024-2025 works include evolutionary GZSL for robust attribute modeling and cluster-based disentangling to alleviate domain gaps, as well as extensions to point clouds using dynamic calibration. This LLM-driven approach enhances interpretability and adaptability, with prompts guiding adaptation to mitigate shifts in diverse domains like fine-grained recognition. For evaluation, such strategies have notably improved the H-mean on the SUN dataset; for instance, semantic-reinforced methods achieve around 39.7% H-mean, representing gains of 6-10% over uncalibrated baselines by better balancing seen-unseen accuracies.
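As a concrete illustration of the calibrated stacking adjustment discussed above, the sketch below subtracts a tuned factor γ from seen-class scores before taking the argmax over the joint label space; the scores and γ value are illustrative only, and in practice γ is chosen on a validation split.

```python
import numpy as np

def calibrated_stacking(scores: np.ndarray, seen_mask: np.ndarray, gamma: float) -> int:
    """scores: compatibility scores over all classes; seen_mask: True for seen
    classes. Returns the predicted class index after down-weighting seen classes."""
    adjusted = scores - gamma * seen_mask.astype(float)
    return int(np.argmax(adjusted))

scores = np.array([2.1, 1.9, 1.7, 1.6])           # classes 0-1 seen, 2-3 unseen
seen_mask = np.array([True, True, False, False])
print(calibrated_stacking(scores, seen_mask, gamma=0.0))  # 0: biased toward a seen class
print(calibrated_stacking(scores, seen_mask, gamma=0.5))  # 2: an unseen class now wins
```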

Applications

Computer Vision

Zero-shot learning (ZSL) has found significant applications in computer vision, particularly for image and video recognition tasks where labeled data for novel classes is unavailable or scarce. In these scenarios, models leverage auxiliary information such as semantic attributes or textual descriptions to generalize to unseen categories, enabling flexible deployment in real-world settings without retraining. A prominent use case is fine-grained recognition, exemplified by bird species identification on the Caltech-UCSD Birds-200-2011 (CUB) dataset, which contains 11,788 images across 200 bird classes with rich attribute annotations for semantic transfer. ZSL approaches here map visual features to attribute spaces, allowing classification of unseen bird species based on descriptions like beak shape or plumage patterns, addressing the challenge of subtle inter-class variations in fine-grained tasks. Another key application is action recognition from videos, where ZSL extends to temporal data by aligning video embeddings with semantic action labels, as demonstrated on datasets like UCF-101 and HMDB-51 for recognizing novel human activities without video-specific training.

The CLIP model, introduced in 2021, represents a seminal technique for zero-shot image-text matching through contrastive pretraining on 400 million image-caption pairs, enabling direct similarity computation between visual and natural language embeddings. This embedding-based method mirrors the compatibility functions of broader ZSL methodologies, scoring class hypotheses via cosine similarity in a joint space. On the ImageNet dataset, a standard benchmark with 1,000 classes, CLIP achieves approximately 63.3% zero-shot top-1 accuracy using its base configuration (ViT-B/32). Similarly, the Animals with Attributes (AwA) dataset, featuring 50 animal classes with 85 binary attributes and 30,475 images, is widely used to assess attribute-based ZSL performance in coarser-grained recognition.

Beyond classification, ZSL supports zero-shot object detection on the COCO dataset, where models localize and categorize unseen objects using semantic embeddings, as pioneered in works that split COCO's 80 classes into 48 seen and 17 unseen for testing. In medical imaging, ZSL aids diagnosis of rare diseases by generalizing from common pathologies to unseen conditions via trait-guided representations, such as in chest X-ray analysis where models adapt to novel disease traits without patient-specific examples. These applications highlight ZSL's impact in scaling vision systems to diverse, data-limited domains while maintaining robust generalization.
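The following sketch shows how CLIP-style zero-shot classification is commonly invoked through the Hugging Face transformers library; the image path, prompt set, and candidate labels are hypothetical, and the model identifier follows the hub's naming convention rather than anything specific to this article.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("bird.jpg")                        # hypothetical test image
candidate_labels = ["indigo bunting", "blue jay", "cardinal"]
prompts = [f"a photo of a {c}" for c in candidate_labels]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores; softmax turns them into class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
prediction = candidate_labels[int(probs.argmax())]
```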

Natural Language Processing

Zero-shot learning (ZSL) in natural language processing (NLP) enables models to perform tasks such as classification, generation, and understanding on unseen classes or languages without task-specific training data, leveraging pre-trained representations and semantic alignments. This capability is particularly valuable in dynamic environments where new categories emerge frequently, allowing systems to generalize from existing knowledge to novel scenarios. In NLP, ZSL often relies on pre-trained language models or auxiliary knowledge to bridge seen and unseen domains, facilitating applications in resource-constrained settings.

Key use cases include sentiment analysis on unseen topics, where models classify opinions on novel subjects like emerging social issues without prior examples, achieving accuracies up to 70-80% on benchmarks by aligning textual embeddings with sentiment labels. Similarly, intent detection in chatbots uses ZSL to identify user goals in unseen dialogues, such as novel query types in customer support, enabling robust handling of out-of-distribution inputs through capsule networks or prompt-based inference. A prominent technique is prompt-based ZSL in large language models (LLMs) like GPT-3, where natural language instructions guide inference without fine-tuning; for instance, prompting "Classify this text as positive/negative: [text]" yields effective zero-shot classification by exploiting the model's internalized world knowledge. This approach has demonstrated strong performance on diverse tasks, with recent LLMs achieving over 90% accuracy in simple zero-shot classification settings compared to earlier models.

Representative examples include zero-shot named entity recognition (NER) for new entity types, such as identifying emerging entity categories in financial texts unseen during training, by leveraging type descriptions to generate entity-aware embeddings and outperforming traditional NER by 10-15% F1-score on low-resource datasets. Another example is multilingual machine translation without parallel data, where multilingual systems perform zero-shot inference between unseen language pairs, such as English-to-Swahili, by relying on shared latent spaces and achieving BLEU scores around 20-30 for distant languages. Advances since 2022 have centered on instruction fine-tuning of models such as Flan-T5, where diverse task instructions are used to enhance zero-shot generalization; for example, Flan-T5 improves zero-shot performance on GLUE tasks by 15-20% over the base T5 model through instruction tuning scaled across a large collection of tasks. These methods also integrate generative techniques, such as synthesizing pseudo-labels for unseen classes during inference. Datasets like GLUE variants, including SuperGLUE and instruction-augmented subsets, serve as zero-shot benchmarks in NLP, evaluating models on tasks such as natural language inference and related classification tasks without in-domain training, with top instruction-tuned models reaching 80-90% average scores.
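A hedged sketch of entailment-style zero-shot text classification via the Hugging Face pipeline follows; the model name and candidate labels are illustrative choices, not ones prescribed by this article.

```python
from transformers import pipeline

# NLI-based zero-shot classifier: each candidate label is tested as a hypothesis.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "The new battery lasts twice as long as the previous model."
candidate_labels = ["battery life", "screen quality", "price", "customer support"]

result = classifier(text, candidate_labels=candidate_labels)
print(result["labels"][0], result["scores"][0])  # top label and its confidence
```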

Emerging Domains

Zero-shot learning (ZSL) is increasingly applied in interdisciplinary fields beyond traditional computer vision and natural language processing, enabling systems to adapt to novel scenarios with minimal or no task-specific training. In robotics, ZSL facilitates zero-shot grasping of novel objects through textual or visual descriptions, allowing robots to infer manipulation strategies for unseen items without prior physical interaction. For instance, frameworks like ORACLE-Grasp leverage large multimodal models to select grasp poses based on semantic prompts, achieving robust performance on diverse household objects in real-world settings. Similarly, ShapeGrasp decomposes objects into geometric primitives using vision-language models, enabling task-oriented grasping of unfamiliar shapes with success rates exceeding 80% in zero-shot evaluations. These approaches enhance robotic adaptability in unstructured environments, such as warehouses or homes.

In healthcare, ZSL supports the diagnosis of rare conditions by generalizing from descriptions of symptoms and known diseases to identify unseen pathologies. Large language models (LLMs) applied in zero-shot prompting have shown promise in differential diagnosis for rare diseases, outperforming traditional methods on benchmarks like RareBench by integrating knowledge graphs for reasoning over sparse data. Agentic systems further advance this by enabling training-free diagnosis in few-shot or zero-shot scenarios, with evaluations demonstrating improved accuracy for conditions like genetic disorders that lack extensive annotated datasets. Such applications are particularly valuable for resource-limited settings, where collecting labeled data for every case is infeasible.

Recommendation systems benefit from ZSL to handle new user or item types without historical interactions, promoting cold-start solutions through attribute-based or generative embeddings. Zero-shot recommender systems like ZESRec train on auxiliary data to predict preferences for unseen entities, achieving up to 20% improvements in hit rates compared to content-based baselines on product review datasets. More recent generative foundation models, such as RecBase, enable cross-domain zero-shot recommendations by pretraining on diverse interaction patterns, enhancing generalization for emerging categories.

Notable examples include zero-shot reinforcement learning (RL) for unseen environments, where model-based agents like Dr. G use self-supervised world models to generalize policies across visual tasks, succeeding in 70-90% of novel continuous control scenarios without task-specific training. In audio processing, ZSL classifies novel sounds via attribute descriptions, with methods like sound attribute transfer improving zero-shot accuracy on environmental audio benchmarks by 10-15% over baseline CLAP models through discriminative captioning. Recent advancements from 2024-2025 integrate ZSL with LLMs, such as adaptations of Flamingo-inspired architectures, for embodied tasks like navigation and manipulation in physical spaces. These systems support few-shot to zero-shot transfer in embodied AI, enabling agents to reason over visual-textual inputs for complex behaviors in simulated and real environments.

The scalability of ZSL proves beneficial in dynamic settings like autonomous driving, where it allows vehicles to handle unforeseen traffic patterns or obstacles by generalizing from semantic descriptions, reducing the need for exhaustive scenario-specific training data. However, real-time constraints pose unique challenges in these domains, particularly in robotics and healthcare, where the computational demands of large models can exceed hardware limits, necessitating lightweight approximations without sacrificing accuracy.
In embodied applications, ensuring low-latency ZSL under partial observability remains essential for safety-critical deployments.

Challenges and Future Directions

Persistent Limitations

In generalized zero-shot learning (GZSL), domain shift remains a core limitation, where the distribution mismatch between seen and unseen classes causes models to favor seen-class predictions during inference on mixed test sets. This shift leads to seen-class dominance, with unseen-class accuracies often substantially lower due to the model's bias toward the more familiar training distribution. High-dimensional embeddings exacerbate this through the hubness problem, in which a small number of points (hubs) become nearest neighbors to disproportionately many queries, skewing similarity scores and hindering accurate unseen-class matching. Additionally, an inherent bias stems from over-reliance on seen-class semantics during training, which fails to adequately represent the diverse attributes of unseen categories, resulting in brittle generalization beyond the training domain. Scalability poses further challenges, particularly for generative approaches that synthesize unseen-class features; these methods incur high computational costs from training complex models like GANs or VAEs on large feature spaces. Performance also critically depends on the quality of auxiliary information, such as semantic attributes or word embeddings, which can introduce errors if annotations are incomplete or inconsistent across datasets. Evaluation protocols suffer from gaps, including the absence of fully standardized splits for unseen classes, which allows information leakage or inconsistent partitioning and complicates fair comparison across studies. Empirically, these issues manifest in significant performance drops; on the AWA2 dataset, for instance, average accuracies typically decline from approximately 85% on seen classes to 35% on unseen classes, a reduction of 50 percentage points.

Recent Advances and Outlook

Recent advances in zero-shot learning (ZSL) have been driven by the emergence of foundation models, which leverage large-scale pre-training to enable effective zero-shot inference across vision and language tasks. For instance, BLIP-2, introduced in 2023, employs a lightweight querying transformer to bridge frozen image encoders and large language models (LLMs), achieving state-of-the-art zero-shot performance on vision-language benchmarks like VQAv2 with accuracies around 65% without task-specific fine-tuning. Building on this, 2025 developments in multimodal large language models (MLLMs), such as those explored in zero-shot anomaly detection, have extended these capabilities to reasoning-intensive applications by integrating visual grounding with textual prompts, demonstrating improved generalization to unseen scenarios in industrial inspection tasks. Additionally, neuro-symbolic hybrids have gained traction, combining neural embeddings with symbolic reasoning to enhance interpretability and accuracy in ZSL.

Key trends include the integration of LLMs for reasoning-based ZSL, where chain-of-thought prompting and logical augmentation enable models to infer unseen class relationships through step-by-step semantic decomposition. A 2023 method, LogiCoT, augments zero-shot prompts with logical structures, improving arithmetic and reasoning accuracy in LLMs by around 1-4% on benchmarks like GSM8K without additional training data. By 2025, instruction-following LLMs have further advanced this by supporting zero-shot task generalization, achieving high accuracies in text classification tasks via hand-crafted prompts. Parallel to this, federated learning frameworks have emerged for privacy-preserving ZSL, allowing distributed clients to share mid-level semantic knowledge without raw data exchange. A 2025 extension to decentralized settings enables zero-shot adaptation across heterogeneous devices, reducing communication overhead by 40% compared to centralized methods.
Looking ahead, scaling ZSL to over 1 million classes is being pursued through knowledge graphs that encode hierarchical semantic relationships, facilitating transductive inference for large-scale categories. A 2025 transductive ZSL method leveraging knowledge graphs and graph convolutions on biomedical datasets like UMLS achieved 72% top-1 accuracy for novel disease classifications by propagating embeddings across graph edges. Ethical considerations are increasingly prominent, particularly in addressing biases from semantic transfer that propagate cultural or demographic imbalances to unseen classes; recent frameworks advocate for fairness-aware distillation in ZSL pipelines to mitigate disparity amplification, with audits showing up to 25% reduction in biased predictions on fairness benchmarks. The 2025 outlook emphasizes ZSL deployment on edge devices, where in-context learning enables reconfiguration without retraining, as highlighted in industry surveys on edge AI trends. Persistent open problems include robustness to adversarial perturbations on unseen classes, where vision-language models remain vulnerable; 2025 studies propose training-free calibration that improves zero-shot adversarial accuracy by 12-18% on CIFAR-10-C under strong attacks, though gaps persist for compositional unseen threats.
