Few-shot learning

Few-shot learning (FSL) is a subfield of machine learning that enables models to recognize patterns, classify, or perform tasks on new categories using only a limited number of labeled examples, typically one to five per class, by leveraging prior knowledge from related tasks to achieve rapid generalization. This approach addresses the data-hungry nature of traditional methods, which often require thousands of samples for training, thereby mitigating the high costs, data-collection concerns, and computational demands associated with large datasets. Motivated by human cognition's ability to learn novel concepts from minimal exposure, FSL has gained prominence since the mid-2010s, with foundational works introducing techniques to optimize models for quick adaptation.

Key methodologies in FSL can be categorized at the data, model, and algorithm levels. At the data level, techniques augment limited samples through transformations or synthesis to enrich training. Model-level approaches, such as prototypical networks, compute class prototypes in an embedding space and classify queries based on distance metrics, enabling effective few-shot classification without task-specific fine-tuning. Algorithm-level methods, exemplified by model-agnostic meta-learning (MAML), train models via bi-level optimization to find initialization parameters that allow fast adaptation to new tasks through a few gradient updates. Hybrid paradigms, including transfer learning and in-context learning with large language models, further enhance FSL by combining pre-trained representations with few-shot prompts.

FSL finds applications in computer vision (image classification and object detection), natural language processing (text classification and entity recognition), and other domains where labeled data is scarce. Recent advancements, surveyed in over 200 studies, emphasize robustness to distribution shifts, though challenges persist in scaling to real-world variability and long-tailed distributions.

Introduction

Definition and Scope

Few-shot learning is a machine learning paradigm designed to enable models to generalize effectively to new tasks or categories using only a very small number of labeled training examples, typically 1 to 5 per class; these examples are referred to as "shots." This approach contrasts sharply with traditional supervised methods, which often require vast amounts of data to achieve high performance due to their reliance on gradient-based optimization and large-scale parameter updates. By incorporating prior knowledge or meta-level learning strategies, few-shot learning mimics human-like rapid adaptation, allowing systems to make accurate predictions or classifications despite data scarcity.

The scope of few-shot learning extends across supervised, semi-supervised, and unsupervised settings, adapting to varying levels of available labels. In the supervised variant, which is the most commonly studied, tasks are typically formulated as "N-way K-shot" classification problems: the model must differentiate among N novel classes (ways) using exactly K labeled support examples per class, where the novel classes are disjoint from the base classes used for pre-training. This emphasizes task-agnostic generalization, where the model learns not just specific classes but the process of learning from limited data.

Few-shot learning addresses critical real-world challenges where acquiring large labeled datasets is impractical or impossible, such as diagnosing rare diseases from limited patient records or identifying newly discovered species in ecological monitoring. These scenarios highlight the paradigm's value in domains like healthcare and biodiversity conservation, where data scarcity arises from ethical, logistical, or rarity constraints.

The conceptual foundations of few-shot learning have early roots in 1990s research within cognitive science and machine learning, including studies on human rapid word learning from sparse examples and theoretical work quantifying prior knowledge in probably approximately correct (PAC) learning models. However, its formalization as a distinct paradigm emerged in the 2010s, coinciding with advances in deep learning, through influential contributions like one-shot learning systems and matching networks.
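To make the N-way K-shot formulation concrete, the following is a minimal sketch of how a single episode might be sampled from a labeled dataset. The function name `sample_episode`, the `n_query` parameter, and the toy label list are illustrative assumptions rather than part of any particular benchmark's tooling.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode.

    Returns (support, query): lists of (example_index, episode_label) pairs,
    where episode_label is in range(n_way). Hypothetical helper for illustration.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Only classes with enough examples for both support and query are eligible.
    eligible = sorted(c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query)
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query examples")
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(i, episode_label) for i in picked[:k_shot]]
        query += [(i, episode_label) for i in picked[k_shot:]]
    return support, query

# Toy dataset (assumed): 10 classes with 20 examples each, labels only.
labels = [c for c in range(10) for _ in range(20)]
support, query = sample_episode(labels, n_way=5, k_shot=1, n_query=15,
                                rng=random.Random(0))
```

In this sketch the support and query sets never share an example, and class identities are relabeled per episode so that a model cannot rely on fixed class indices.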

Historical Development

The concept of few-shot learning draws early inspiration from studies in the 1980s and 1990s, which explored human capabilities for rapid learning from minimal examples, such as one-shot learning of new concepts, in contrast to the data-intensive requirements of traditional models. These insights highlighted challenges like catastrophic forgetting in sequential learning, prompting initial efforts to mimic human-like efficiency.

In the 2000s, Bayesian approaches emerged as foundational methods for handling few samples, with seminal work by Miller et al. introducing the Congealing algorithm in 2000 to align limited digit examples via shared densities on transforms. This was followed by Fei-Fei et al.'s variational Bayesian framework in 2003, which enabled one-shot learning of object categories by leveraging prior knowledge from base classes to infer novel ones from 1-5 examples, achieving promising results on real-world images. Subsequent Bayesian advancements, such as Lake et al.'s probabilistic program induction in the early 2010s, further bridged cognitive science and machine learning by modeling concepts as executable programs, attaining human-level performance on one-shot handwritten character recognition. These non-deep methods dominated until around 2015, emphasizing generative priors over discriminative classifiers.

The deep learning boom after 2012, ignited by AlexNet's success on ImageNet, catalyzed a paradigm shift in few-shot learning from rule-based and Bayesian systems to neural architectures capable of leveraging large-scale pretraining. Koch et al. marked this transition in 2015 with Siamese convolutional networks, which used similarity learning for one-shot image recognition, outperforming prior non-deep baselines. A surge followed in 2016-2017 at major conferences like NeurIPS and ICML, with Vinyals et al.'s Matching Networks introducing attention-based metric learning with episodic training for few-shot classification, achieving state-of-the-art results on the Omniglot dataset. Concurrently, Finn et al.'s Model-Agnostic Meta-Learning (MAML) in 2017 proposed optimization-based meta-learning to find initial parameters enabling rapid adaptation, influencing subsequent techniques. ImageNet-scale pretraining became integral, allowing few-shot adaptation by transferring representations from millions of labeled images to downstream tasks with scarce data.

Since 2018, few-shot learning has seen institutional growth through dedicated workshops at NeurIPS and ICML, building on the meta-learning workshop first held at NeurIPS in 2017, to foster interdisciplinary advances in meta-learning and low-data regimes. Researchers like Chelsea Finn have played pivotal roles, with her work on MAML sparking broader interest in optimization for few-shot adaptation and inspiring hybrid methods that integrate meta-learning with transfer learning. This evolution reflects a maturation from cognitively inspired Bayesian roots to scalable deep neural frameworks, enabling practical few-shot capabilities in vision and beyond.

Background Concepts

Limitations of Traditional Supervised Learning

Traditional supervised learning, especially when employing deep neural networks, relies heavily on vast amounts of labeled data to achieve reliable performance. These models typically require thousands of labeled examples per class to converge effectively and learn generalizable features, as the high dimensionality of the parameter space demands sufficient coverage to avoid underfitting or suboptimal solutions. For instance, in large-scale image classification tasks like ImageNet, successful training involves over 1,000 examples per class to enable the network to capture intricate patterns without excessive variance. When the available data is scarce, such as only a handful of examples per class, the models fail to learn meaningful representations, resulting in severe overfitting where the network memorizes noise in the limited training set rather than extracting underlying structures.

Empirical studies underscore the dramatic performance degradation in low-data regimes. On benchmarks like miniImageNet, a baseline supervised classifier trained from scratch with just 5 examples per class (5-shot setting) in a 5-way task achieves approximately 50% accuracy, roughly 30-40 percentage points below the 80-90% accuracies attainable with thousands of examples per class in full supervised training scenarios. Similar trends appear on CIFAR-FS, where reducing training data from abundant samples to 5 per class causes accuracy to plummet due to insufficient exposure to class variability, highlighting how poorly standard approaches fare outside data-rich environments.

Beyond data quantity, traditional supervised learning exhibits poor generalization to unseen distributions, making it vulnerable to shifts in input data or task conditions. Models trained on one distribution often fail to handle even mild perturbations or corruptions, such as changes in lighting or viewpoint in images, leading to substantial accuracy declines on out-of-distribution test sets. In sequential settings, this is exacerbated by catastrophic forgetting, where adapting the model to new data overwrites previously learned knowledge, preventing stable performance across multiple tasks without extensive retraining.

The resource implications further compound these challenges, as amassing large labeled datasets incurs high costs in time, labor, and computation. In specialized domains like medical imaging, labeling requires expert clinicians to annotate images or records, often costing thousands of dollars per dataset due to the need for accuracy and compliance. Similarly, in robotics, collecting labeled interaction data demands physical experimentation with hardware, involving setup, safety protocols, and iterative trials that can take weeks or months, rendering traditional supervised learning impractical for real-world deployment where data collection is constrained.

Zero-shot learning (ZSL) enables models to classify instances from unseen classes without any training examples for those classes, relying instead on auxiliary information such as semantic attributes, textual descriptions, or embeddings to bridge the gap between seen and unseen categories. This paradigm was formalized in 2009 by Lampert et al., who introduced attribute-based transfer for object categorization, using datasets like Animals with Attributes where models predict attribute vectors (e.g., "has stripes" or "is furry") for unseen classes based on mappings learned from seen classes. ZSL often involves projecting visual features into a shared semantic space for compatibility prediction, allowing generalization to entirely novel classes without direct supervision.

Transfer learning, in contrast, leverages knowledge from models pretrained on large-scale source datasets to improve performance on related target tasks with limited data, typically by fine-tuning the pretrained weights. Popularized following the success of deep convolutional networks like AlexNet, trained on ImageNet in 2012, transfer learning gained traction around 2014 through systematic studies on feature reusability, showing that lower-layer features (e.g., edges and textures) transfer well across domains while higher layers require more adaptation. It is particularly effective for few-shot scenarios when the source and target domains are similar, as domain mismatch can degrade performance despite abundant source data.

Few-shot learning occupies a middle ground between these paradigms, permitting 1 to K (typically small K, e.g., 5) examples per novel class to adapt models, thus bridging the no-data constraint of ZSL and the data-intensive pretraining of transfer learning. While ZSL suits scenarios with entirely novel classes and relies on semantic priors, and transfer learning emphasizes hierarchical feature reuse from massive pretraining, few-shot methods integrate both by combining auxiliary knowledge with sparse supervision to improve generalization to novel classes. Historically, ZSL's foundations trace to Lampert et al.'s 2009 work on attribute transfer, while transfer learning's modern form emerged from AlexNet's 2012 ImageNet dominance and subsequent 2014 analyses of deep feature transferability.

Core Methods

Meta-Learning Techniques

Meta-learning, often described as "learning to learn," is a paradigm in few-shot learning that trains models on a distribution of tasks drawn from a meta-training set, enabling rapid adaptation to novel tasks with minimal examples through a process involving inner and outer loops. In the inner loop, the model performs task-specific updates to adapt to a support set of few examples, while the outer loop optimizes the initial model parameters to minimize the loss on a query set across multiple tasks, thereby learning a generalizable initialization that facilitates quick adaptation. This approach addresses the data scarcity in few-shot scenarios by leveraging meta-knowledge from diverse tasks, in contrast to traditional supervised learning, which requires large datasets per task.

A seminal method in this framework is Model-Agnostic Meta-Learning (MAML), which learns initial parameters \theta that can be efficiently fine-tuned for any task using gradient descent, regardless of the underlying model architecture. For a given task, the adapted parameters are computed as \theta' = \theta - \alpha \nabla_\theta \mathcal{L}_{\text{task}}(\theta), where \alpha is the inner-loop learning rate and \mathcal{L}_{\text{task}} is the task-specific loss; the meta-objective then updates \theta via \theta \leftarrow \theta - \beta \nabla_\theta \sum_{\text{task}} \mathcal{L}_{\text{task}}(\theta'), with \beta as the outer-loop learning rate, minimizing the post-adaptation loss across tasks. Because \theta' depends on \theta, the outer gradient differentiates through the inner update, which introduces second-order terms. MAML has been particularly effective for few-shot adaptation, which can take as few as one to five gradient steps.

The Reptile algorithm simplifies MAML's meta-update to a first-order rule, avoiding computationally expensive second-order gradients while approximating the same objective. In Reptile, after inner-loop updates on a task reach \theta', the meta-parameters are shifted toward \theta' by a small multiple of their difference, iteratively over tasks, which empirically yields generalization similar to MAML but with reduced overhead. This makes Reptile suitable for larger models and datasets.

Beyond optimization-focused methods, meta-learning has been applied using recurrent architectures, such as LSTM-based meta-learners that treat episode-wise adaptation as a sequence prediction problem, updating weights via backpropagation through time to handle few-shot tasks. In reinforcement learning, approaches like RL² acquire policies that learn from sparse rewards across procedurally generated tasks, demonstrating transfer to unseen environments. These techniques highlight meta-learning's versatility across domains.

Evaluations of meta-learning methods have shown strong performance on benchmark datasets like Omniglot, a collection of handwritten characters designed for few-shot classification, where MAML achieves over 95% accuracy in 5-way 1-shot settings after meta-training on similar tasks, underscoring the efficacy of task distributions in enabling rapid generalization.
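The inner and outer loops can be illustrated on a deliberately tiny problem. The sketch below is an assumption-laden toy, not the published implementations: each task is a 1-D regression y = a * x with a task-specific slope a, the model is y_hat = w * x with a single scalar meta-parameter w (playing the role of \theta), and all function names and hyperparameters are hypothetical. Because the loss is quadratic in w, the second-order term in the MAML meta-gradient can be written in closed form, which keeps the example dependency-free.

```python
import random

def loss(w, data):
    """Mean squared error of the model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def grad(w, data):
    """Derivative of the loss with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in data) / len(data)

def hessian(data):
    """Second derivative of the loss (constant in w for this model)."""
    return sum(2 * x * x for x, _ in data) / len(data)

def sample_task(rng, k=5):
    """A task is a slope a in [1, 3]; returns (support, query) with k points each."""
    a = rng.uniform(1.0, 3.0)
    def draw():
        return [(x, a * x) for x in (rng.uniform(-1, 1) for _ in range(k))]
    return draw(), draw()

def maml_train(steps=2000, alpha=0.1, beta=0.01, tasks_per_step=4, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        meta_grad = 0.0
        for _ in range(tasks_per_step):
            support, query = sample_task(rng)
            w_adapted = w - alpha * grad(w, support)   # inner loop: theta'
            # Chain rule through the inner step: d(theta')/d(theta) = 1 - alpha * H
            meta_grad += grad(w_adapted, query) * (1 - alpha * hessian(support))
        w -= beta * meta_grad                          # outer loop
    return w

def reptile_train(steps=2000, alpha=0.1, epsilon=0.02, inner_steps=5, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        support, _ = sample_task(rng)
        w_task = w
        for _ in range(inner_steps):
            w_task -= alpha * grad(w_task, support)
        w += epsilon * (w_task - w)   # first-order: move toward adapted parameters
    return w

w_meta = maml_train()
w_reptile = reptile_train()
```

In this toy setting both procedures should settle near the middle of the slope range, from which a single gradient step on a new task's support set moves the parameter toward that task's slope. Real models replace the closed-form Hessian term with automatic differentiation.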

Metric-Learning Approaches

Metric-learning approaches in few-shot learning focus on learning a feature embedding space where classification of query samples is performed by measuring distances, such as Euclidean or cosine distance, to support samples from each class. These methods map both support and query instances into a shared embedding space using a neural network encoder, enabling non-parametric classification without updating model parameters during inference. This paradigm contrasts with optimization-based techniques by emphasizing representation learning over task-specific adaptation, often integrating seamlessly with meta-learning frameworks for episode-based training.

Prototypical Networks exemplify this approach by computing a prototype for each class as the mean of the embedded support points. The prototype for class c is defined as \mathbf{p}_c = \frac{1}{|S_c|} \sum_{(x_i, y_i) \in S_c} f_\phi(x_i), where f_\phi is the embedding function parameterized by \phi, and S_c denotes the support set for class c. A query sample x is then classified to the class with the nearest prototype, using a distance metric like squared Euclidean distance, with class probabilities given by a softmax over the negative distances. Trained via episodic training with cross-entropy loss on the classification probabilities, this method achieves strong performance on image classification tasks.

Matching Networks employ attention mechanisms to perform non-parametric matching between query and support samples in the embedding space. They use a bidirectional LSTM to embed each support example in the context of the whole support set, producing embeddings that capture contextual information. An attention module then computes cosine similarities between the query and each support embedding and uses the resulting attention weights to combine the support labels into class probabilities without explicit prototypes. This approach was initially demonstrated on one-shot learning benchmarks.

Relation Networks extend metric learning by replacing hand-crafted distance functions with a learnable deep similarity module. Instead of fixed metrics like Euclidean distance, a small neural network g_\phi computes a relation score between the embedded query f_\theta(x) and each embedded support sample f_\theta(x_i), and the query is assigned to the class with the highest score. The entire model is trained end-to-end using mean squared error loss on episode-wise tasks, allowing the metric to adapt to complex data distributions. This approach improves robustness on challenging datasets by learning nuanced comparisons.

These methods offer interpretable predictions through explicit distance computations and efficient inference, as classification requires only forward passes without gradient updates. On the miniImageNet benchmark, they demonstrate 2-5% accuracy gains over earlier baselines in 5-way 1-shot settings, with Prototypical Networks reaching approximately 49.4% and Relation Networks around 50.4%.
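The prototype-and-distance rule used by Prototypical Networks can be sketched in a few lines. This is a minimal illustration that assumes the embeddings f_\phi(x) have already been computed; the toy 2-D vectors, class names, and helper names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def prototypes(support):
    """support: dict mapping class label -> list of embedded support vectors."""
    return {c: mean_vector(vs) for c, vs in support.items()}

def class_probabilities(query, protos):
    """Softmax over negative squared distances from the query to each prototype."""
    logits = {c: -squared_euclidean(query, p) for c, p in protos.items()}
    m = max(logits.values())                      # subtract max for numerical stability
    exps = {c: math.exp(l - m) for c, l in logits.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

# Toy 2-way 2-shot episode with pre-computed (assumed) embeddings.
support = {"cat": [[1.0, 0.0], [0.8, 0.2]], "dog": [[0.0, 1.0], [0.2, 0.8]]}
protos = prototypes(support)
probs = class_probabilities([0.9, 0.2], protos)
prediction = max(probs, key=probs.get)
```

During episodic training, the cross-entropy between these probabilities and the true query labels would be backpropagated into the encoder; at inference time nothing beyond the forward pass shown here is needed.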

Optimization-Based Methods

Optimization-based methods in few-shot learning aim to design loss functions or optimization updates that enable rapid convergence with only a few gradient steps during adaptation to new tasks. These approaches typically involve meta-learning frameworks where the model learns an initialization or update rule that allows quick fine-tuning on limited support examples, often addressing challenges like overfitting and slow convergence in low-data regimes. By modifying the training objectives, for example through regularization or constraints, these methods enhance generalization across tasks while minimizing computational overhead in the inner optimization loop.

One prominent strategy incorporates auxiliary tasks in low-shot learning to regularize the model, thereby improving feature representations for novel classes. For instance, self-supervised auxiliary tasks, such as predicting rotations or patch locations, are combined with supervised classification on base classes during pre-training. This joint optimization can also leverage abundant unlabeled data to learn transferable features, effectively regularizing against overfitting when adapting to few-shot novel classes. Empirical evaluations on benchmarks like miniImageNet demonstrate that this approach boosts 5-way 1-shot accuracy to 62.93% using a wide residual network backbone, an absolute improvement of approximately 1.8 percentage points over baselines without auxiliary supervision. In semi-supervised cross-domain settings, such as using unlabeled data from tieredImageNet to aid miniImageNet adaptation, accuracy further increases to 64.03% in 1-shot scenarios, highlighting regularization benefits across domains.

Gradient Episodic Memory (GEM) addresses catastrophic forgetting in sequential task learning by constraining gradient updates so that they preserve performance on prior tasks, making it suitable for the episodic nature of few-shot meta-training. GEM maintains an episodic memory of representative examples from past tasks and computes the current gradient g for the new task. To prevent interference, it solves a quadratic program for the projected gradient \tilde{g}, the closest vector to g whose inner product with every stored-task gradient is non-negative: \min_{\tilde{g}} \frac{1}{2} \| g - \tilde{g} \|_2^2 \quad \text{subject to} \quad \langle \tilde{g}, g_k \rangle \geq 0 \quad \forall k < t, where g_k are gradients computed on past task examples in memory. The non-negative inner products ensure that, to first order, the update does not increase losses on earlier tasks. This projection limits backward interference while still allowing forward transfer, and the quadratic program is solved in its dual form, which has only one variable per previous task. GEM has been shown to reduce forgetting in continual few-shot scenarios, preserving accuracy on base tasks while adapting to novel ones.

Implicit Model-Agnostic Meta-Learning (iMAML) advances bilevel optimization in few-shot learning by employing implicit gradients, which circumvent the need for explicit differentiation through unrolled inner-loop computations. Unlike explicit methods that backpropagate through multiple gradient steps, iMAML solves a regularized inner objective for each task: \phi' = \arg\min_{\phi} \mathcal{L}(\phi, D_{tr}) + \frac{\lambda}{2} \|\phi - \theta\|_2^2, where \theta are meta-parameters, D_{tr} is the support set, and \lambda > 0 is a regularization strength. The meta-gradient is then computed via implicit differentiation of the optimality condition, with the resulting linear system solved approximately by conjugate gradients using Hessian-vector products, enabling scaling to deeper networks and longer inner horizons without forming or inverting the Hessian explicitly. This avoids the memory-intensive unrolling of explicit MAML while achieving comparable or superior accuracy. On few-shot image classification benchmarks, iMAML attains 49.30% accuracy in 5-way 1-shot miniImageNet (Hessian-free variant), outperforming explicit MAML's 48.70% by about 0.6 percentage points, with gains of up to 1-2 points in higher-shot settings. These methods tie into broader meta-learning by refining MAML-like initialization strategies for efficient few-shot adaptation.

Overall, the optimization-based approaches cited here yield modest accuracy improvements on standard few-shot benchmarks like miniImageNet, from about 0.6 points for iMAML over MAML to about 1.8 points for auxiliary self-supervision, by enhancing rapid convergence through targeted loss modifications.
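For intuition about the GEM constraint, the special case of a single stored-task gradient has a closed-form solution: the quadratic program reduces to projecting g onto the half-space \langle \tilde{g}, g_k \rangle \geq 0. The sketch below covers only that one-constraint case as an illustrative assumption; with several stored tasks, GEM solves the full quadratic program instead. The function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_single_constraint(g, g_mem):
    """Closest vector to g (in L2 norm) satisfying <g_tilde, g_mem> >= 0.

    Single-constraint special case only; not the general multi-task GEM solver.
    """
    inner = dot(g, g_mem)
    if inner >= 0:
        return list(g)                 # no conflict: keep the gradient unchanged
    scale = inner / dot(g_mem, g_mem)  # remove the component that opposes g_mem
    return [gi - scale * mi for gi, mi in zip(g, g_mem)]

# The new-task gradient conflicts with the memory gradient along the second axis.
g = [1.0, -2.0]
g_mem = [0.0, 1.0]
g_tilde = project_single_constraint(g, g_mem)
```

Here the conflicting component along `g_mem` is removed while the rest of the update is kept, so the step no longer increases the stored task's loss to first order.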

Advanced Techniques

Data Augmentation Strategies

Data augmentation strategies in few-shot learning serve to artificially expand the limited training datasets by generating synthetic examples, thereby increasing the effective sample size without the need for additional labeled data. This approach is especially vital for addressing challenges in imbalanced datasets or scenarios involving novel classes, where only a handful of examples per class are available, helping to mitigate overfitting and improve model generalization.

Classical techniques, including geometric transformations such as rotations, flips, and translations, as well as color jittering, have been specifically adapted for few-shot settings to preserve semantic content while introducing variability. A foundational contribution in this area came from the introduction of hallucination-based methods, where a meta-learned "hallucinator" model generates imaginary examples from the support set, effectively simulating additional real instances during training. This technique demonstrated substantial improvements, achieving up to a 6-point accuracy gain on low-shot classification tasks within the ImageNet benchmark.

In feature-space augmentation, methods like mixup and its variants operate by interpolating between pairs of examples, either on the inputs or directly in the embedding space, to create diverse synthetic samples from the few available shots, promoting smoother decision boundaries and better representation learning. A key formulation for simple mixup is: \begin{align*} \mathbf{x}' &= \lambda \mathbf{x}_i + (1 - \lambda) \mathbf{x}_j, \\ \mathbf{y}' &= \lambda \mathbf{y}_i + (1 - \lambda) \mathbf{y}_j, \end{align*} where \lambda is sampled from a \mathrm{Beta}(\alpha, \alpha) distribution, and (\mathbf{x}_i, \mathbf{y}_i) and (\mathbf{x}_j, \mathbf{y}_j) are pairs of input features and one-hot labels from the support set. Extending this to manifold mixup in few-shot contexts regularizes the feature manifold enriched by self-supervision, yielding 3-8% accuracy improvements over prior baselines on standard benchmarks like mini-ImageNet and CIFAR-FS.

For domain-specific applications in computer vision, augmentation pipelines often employ a progression from basic to more intricate transformations, such as starting with geometric shifts and advancing to intensity perturbations, to build robustness against variations in novel classes. These pipelines, integrated into simple pre-training and fine-tuning frameworks, have significantly elevated few-shot performance on vision datasets like tiered-ImageNet.

Despite their benefits, standalone data augmentation strategies in few-shot learning often deliver only modest performance gains, typically in the range of 5-10% accuracy improvement, due to challenges in generating sufficiently diverse and high-quality samples without introducing noise. To maximize efficacy, these methods are frequently hybridized with meta-learning approaches, which leverage episodic training to better adapt the augmentations to task-specific distributions.
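The mixup interpolation can be written directly from its definition. The following is a minimal sketch that assumes feature vectors and one-hot labels are plain Python lists; the function name, the toy vectors, and the choice alpha = 0.4 are illustrative assumptions.

```python
import random

def mixup(x_i, y_i, x_j, y_j, alpha, rng):
    """Return a convex combination of two (features, one-hot label) pairs.

    lam is drawn from Beta(alpha, alpha); the same lam mixes inputs and labels.
    """
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam

rng = random.Random(0)
x_mix, y_mix, lam = mixup(
    [1.0, 0.0, 2.0], [1.0, 0.0],   # support example from class 0
    [0.0, 4.0, 2.0], [0.0, 1.0],   # support example from class 1
    alpha=0.4, rng=rng,
)
```

Small values of alpha concentrate lam near 0 or 1, so most synthetic samples stay close to one of the two originals; larger values produce more evenly blended samples. Applying the same function to intermediate-layer features rather than raw inputs gives the manifold variant.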

Generative Model Integration

Generative model integration in few-shot learning leverages probabilistic models such as variational autoencoders (VAEs), generative adversarial networks (GANs), and diffusion models to synthesize diverse examples from a limited number of support shots, thereby expanding the effective training data and enhancing classifier performance on novel classes. This approach addresses data scarcity by conditioning generation on the few available examples, enabling the creation of realistic variations that capture class-specific features without relying on external datasets.

Few-shot GANs adapt standard GAN architectures for conditional generation tailored to unseen classes during inference, often incorporating meta-learning to fine-tune the generator and discriminator rapidly. A notable early example is the FIGR framework (2019), which meta-trains a GAN using the Reptile algorithm to produce novel images from as few as four support examples per class, demonstrating effective adaptation for tasks like digit and character synthesis. Another GAN-based approach is ProtoGAN (2019), which synthesizes additional instances for novel categories in action recognition using class prototypes to condition the generation. Variational approaches, such as aligned VAEs, extend this paradigm by operating in latent spaces to generate class prototypes from support samples.

Diffusion-based methods, emerging prominently since 2022, offer superior sample quality for few-shot image synthesis by reversing a forward noise-adding process. In Few-Shot Diffusion Models (FSDM), conditioning on patch-based representations of support shots enables high-fidelity generation, with the reverse denoising step approximated as p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \approx \mathcal{N}(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)), where \boldsymbol{\mu}_\theta and \boldsymbol{\Sigma}_\theta are predicted by a neural network parameterized by \theta, conditioned on the timestep t and noisy input \mathbf{x}_t. This yields diverse outputs that outperform prior diffusion baselines in sample-quality metrics on datasets such as miniImageNet. Subsequent works as of 2024 have further advanced few-shot diffusion by addressing challenges like the curse of dimensionality in high-dimensional data.

Empirical results show these methods yield substantial accuracy gains in few-shot classification, with up to 12% improvement over baselines on fine-grained benchmarks like CUB-200 when using aligned VAEs for augmentation. Nonetheless, reliance on synthetic data introduces ethical risks, as generative models can amplify biases present in the limited support shots, potentially leading to skewed representations in downstream applications.

Applications and Benchmarks

Computer Vision Tasks

Few-shot learning has been extensively applied to image classification tasks, where models adapt to recognize new object categories using only a limited number of labeled examples, typically 1 to 5 per class. This addresses the challenge of data scarcity in visual recognition by leveraging pre-trained feature extractors, such as convolutional neural networks or Vision Transformers, to generalize from base classes to novel ones. For instance, fine-tuning Vision Transformers on the miniImageNet dataset enables high adaptability, achieving competitive performance in 5-way classification settings by incorporating prototypical networks or relation modules that compute similarities between support and query images.

In object detection and semantic segmentation, few-shot methods extend to localizing and delineating instances of unseen classes, producing bounding boxes or pixel-level masks from sparse annotations. Meta-learning frameworks train detectors to predict novel categories by aligning features from support images to query scenes, often building on architectures like Faster R-CNN for detection. The FSOD framework, introduced in 2020, provides a dataset and evaluation protocol for few-shot object detection, demonstrating how region proposal networks can be adapted with attention mechanisms to handle class imbalance and low-shot scenarios. For segmentation, approaches like prototypical alignment networks generate class prototypes from a few support masks, enabling precise delineation of unseen objects in images without extensive retraining.

Real-world deployments of few-shot learning in computer vision include wildlife monitoring, where models identify rare species from 1-5 photographs captured by camera traps, facilitating biodiversity conservation in data-limited environments. In medical imaging, few-shot techniques detect novel pathologies, such as rare retinal diseases in fundus photographs, by adapting segmentation models to unseen anomalies with minimal expert annotations, thus accelerating clinical diagnostics for under-represented conditions. These applications often integrate generative models to augment support sets, enhancing robustness to variations such as pose.

Key benchmarks for evaluating few-shot performance in computer vision include miniImageNet, which splits 100 classes into 64/16/20 for training/validation/testing in 5-way tasks, and tieredImageNet, a larger variant with 608 classes organized hierarchically to reduce semantic overlap between training and test classes. State-of-the-art methods on these datasets achieve accuracies of approximately 84-85% in 5-way 5-shot settings as of 2025, with improvements driven by transductive inference, self-supervised pre-training, and vision-language models like CLIP adaptations.

Natural Language Processing

In natural language processing, few-shot learning has been particularly impactful for text classification tasks, where large pre-trained language models enable effective adaptation with minimal examples through in-context prompting. For instance, the GPT-3 model, with 175 billion parameters, demonstrates strong performance on sentiment analysis benchmarks like the Stanford Sentiment Treebank (SST-2), achieving 95.0% accuracy in a 32-shot setting without any fine-tuning, by incorporating task instructions and examples directly into the input prompt. This prompt-based approach leverages the model's parametric knowledge to generalize from a few demonstrations, outperforming smaller models and approaching supervised baselines on datasets such as SST-2 and customer reviews. Such methods highlight how scaling model size enhances few-shot capabilities, allowing classification of sentiments or topics with as few as 16 examples while maintaining high precision.

Few-shot learning also addresses challenges in machine translation, especially for low-resource languages where parallel data is scarce. Meta-learning techniques, such as META-MT, enable rapid adaptation of neural machine translation systems by treating translation tasks as meta-optimization problems, training the model to quickly adjust to new language pairs or domains. In experiments on domain-specific corpora (simulating low-resource scenarios), META-MT improves BLEU scores by up to 2.5 points over standard fine-tuning when adapting with only 4,000 tokens of in-domain data (equivalent to roughly 200-300 sentences), demonstrating effective transfer for tasks akin to IWSLT low-resource setups. This approach is particularly valuable for adapting to underrepresented languages, where 5-shot or similar minimal adaptations yield substantial gains in translation quality compared to zero-shot baselines.

For question answering and text generation, in-context learning in transformer-based models facilitates few-shot performance by conditioning outputs on a handful of examples embedded in the prompt. GPT-3 exemplifies this, attaining competitive results on QA datasets like Natural Questions, with 21.4% exact match in a 64-shot setting, by framing queries and answers in natural language prompts that guide the model's autoregressive generation. Similarly, in few-shot named entity recognition (NER), meta-learning frameworks like FEWNER enable models to recognize entities with 1-10 labeled examples per class, achieving F1 scores of around 70-80% on benchmarks such as CoNLL-2003 by optimizing entity embeddings and classifiers through episodic training. These techniques underscore the role of transformers in enabling generative tasks with limited supervision, where the model infers patterns from contextual examples to produce coherent responses or extractions.

Despite these advances, few-shot learning in NLP faces challenges, including high sensitivity to prompt design, where subtle changes in phrasing can alter performance by 10-20 percentage points. Benchmarks like FewGLUE, a few-shot version of SuperGLUE with 32 examples per task, reveal persistent gaps compared to full supervision; for example, an average score of around 63% has been reported on FewGLUE tasks with optimized prompts, lagging the 85%+ of fully trained models and highlighting limitations in robustness for complex reasoning. Addressing these requires refined prompting strategies and hybrid methods that incorporate metric learning for better embeddings, though robustness remains a key hurdle in deploying few-shot systems reliably.

Evaluation Datasets and Metrics

In few-shot learning, evaluation relies on standardized benchmarks that simulate the scarcity of labeled data through episodic tasks, in which models are tested on novel classes with limited examples. These datasets are typically split into meta-training, meta-validation, and meta-testing sets with disjoint classes to assess generalization across tasks, emphasizing the model's ability to adapt quickly without extensive retraining. Common protocols use N-way K-shot settings, such as 5-way 1-shot or 20-way 5-shot, where N denotes the number of classes per episode and K the number of examples per class in the support set.
For computer vision tasks, Omniglot serves as a foundational benchmark. It comprises 1,623 handwritten characters from 50 alphabets with 20 examples per character; 30 alphabets form the background set used for training and the remaining 20 are held out for evaluation. It is often evaluated in a 20-way 1-shot setup to test rapid concept acquisition, mimicking human-like one-shot learning on simple, abstract symbols. Another widely used vision dataset is miniImageNet, a subset of ImageNet with 100 classes (64 for training, 16 for validation, and 20 for testing) and 600 images per class, resized to 84x84 pixels; it supports 5-way K-shot evaluations (K=1 or 5) to gauge performance on more complex, natural images.
In NLP, FewRel provides a large-scale dataset for few-shot relation classification, containing 70,000 sentences across 100 relations extracted from Wikipedia; it is typically evaluated in 5-way and 10-way episodes with 1 or 5 shots to test relational reasoning under data scarcity. For natural language inference, the SNLI dataset is adapted, featuring 570,000 sentence pairs labeled as entailment, contradiction, or neutral; because it has only three labels, few-shot setups on SNLI are at most 3-way and limit the number of examples per label to assess semantic understanding with minimal data.
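The N-way K-shot protocol described above can be sketched as a small episode sampler. This is a minimal illustration, not any benchmark's official loader: the function name `sample_episode`, the toy data layout (20 classes with 20 string placeholders each), and the choice of 15 query examples per class are assumptions made for the example.

```python
import random
from typing import Dict, List, Sequence, Tuple

Labeled = Tuple[str, int]  # (example, class id)


def sample_episode(
    data_by_class: Dict[int, Sequence[str]],
    n_way: int,
    k_shot: int,
    n_query: int,
    rng: random.Random,
) -> Tuple[List[Labeled], List[Labeled]]:
    """Draw one N-way K-shot episode with n_query query examples per class."""
    # Pick N distinct classes for this episode.
    classes = rng.sample(sorted(data_by_class), n_way)
    support: List[Labeled] = []
    query: List[Labeled] = []
    for c in classes:
        # Sample without replacement so support and query never overlap.
        picked = rng.sample(list(data_by_class[c]), k_shot + n_query)
        support.extend((x, c) for x in picked[:k_shot])
        query.extend((x, c) for x in picked[k_shot:])
    return support, query


# Toy stand-in for a meta-test split: 20 classes, 20 examples per class.
toy_data = {c: [f"class{c}_img{i}" for i in range(20)] for c in range(20)}
support, query = sample_episode(
    toy_data, n_way=5, k_shot=1, n_query=15, rng=random.Random(0)
)
print(len(support), len(query))  # 5 75
```

In practice the sampler would be called on the meta-test classes only, so that every episode contains classes never seen during meta-training.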
To test cross-domain robustness, Meta-Dataset aggregates ten diverse image sources (e.g., ImageNet, Omniglot, traffic signs, birds, textures, and MSCOCO), enabling evaluation of few-shot classifiers across heterogeneous distributions; because episodes are sampled from multiple datasets, this benchmark highlights domain-shift challenges. Recent extensions as of 2025 incorporate multimodal data for broader applicability.
Performance metrics in few-shot learning prioritize task-specific measures within the episodic framework, where each episode divides data into a support set for adaptation and a query set for testing. For classification, accuracy on the query set is the primary metric, averaged over many episodes to account for variability; in balanced N-way setups, it directly reflects generalization to unseen classes. For few-shot object detection, mean Average Precision (mAP) evaluates localization and classification jointly, often at an IoU threshold such as 0.5, to quantify detection quality with scarce annotations. To balance per-class performance in imbalanced scenarios, the harmonic mean of per-class accuracies is sometimes used, providing a robust indicator of equitable performance across classes.
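The classification metrics above can be made concrete with a short sketch. It computes query-set accuracy for one episode, the mean over episodes with a 95% confidence half-width, and the harmonic mean of per-class accuracies. The function names are hypothetical, and the normal-approximation interval (1.96 standard errors) is a common reporting convention assumed here rather than something the text prescribes.

```python
import math
import statistics
from collections import defaultdict
from typing import Hashable, Sequence, Tuple


def episode_accuracy(preds: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Fraction of query examples classified correctly in one episode."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def mean_with_ci95(episode_accs: Sequence[float]) -> Tuple[float, float]:
    """Mean accuracy over episodes and a normal-approximation 95% half-width."""
    mean = statistics.fmean(episode_accs)
    half_width = 1.96 * statistics.stdev(episode_accs) / math.sqrt(len(episode_accs))
    return mean, half_width


def harmonic_mean_class_accuracy(
    preds: Sequence[Hashable], labels: Sequence[Hashable]
) -> float:
    """Harmonic mean of per-class accuracies; zero if any class is never correct."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y in zip(preds, labels):
        total[y] += 1
        correct[y] += int(p == y)
    accs = [correct[c] / total[c] for c in total]
    if min(accs) == 0.0:
        return 0.0
    return len(accs) / sum(1.0 / a for a in accs)


# Toy episode: class "a" is 3/4 correct, class "b" only 1/4 correct.
labels = ["a"] * 4 + ["b"] * 4
preds = ["a", "a", "a", "b", "a", "a", "a", "b"]
acc = episode_accuracy(preds, labels)               # 0.5
hm = harmonic_mean_class_accuracy(preds, labels)    # 0.375
mean_acc, ci = mean_with_ci95([0.5, 0.6, 0.7])
print(acc, hm, mean_acc, ci)
```

In the toy episode the plain accuracy (0.5) hides the weak class, while the harmonic mean (0.375) is pulled toward it, which is why the latter is preferred when per-class balance matters.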

Challenges and Future Directions

Open Problems

One prominent open problem in few-shot learning (FSL) is the generalization gap, particularly the poor transfer of learned representations across domains. For instance, models trained on curated benchmarks often fail to generalize to real-world scenarios because of distribution shifts, leading to significant performance degradation in cross-domain tasks. This issue is exacerbated by meta-overfitting, where FSL models, especially meta-learning approaches, overfit to the specific benchmark tasks used during meta-training, such as miniImageNet, rather than acquiring robust, transferable knowledge. As a result, real-world deployment remains challenging, as models degrade when they encounter unseen variations in data or task structure.
Robustness issues further complicate FSL, with models showing heightened vulnerability to label noise in low-data regimes. Label noise, common in practical applications, amplifies errors: with so few samples, models struggle to distinguish noisy from clean examples, which leads to biased prototypes or embeddings. Fairness concerns are also amplified by small samples, where inherent biases can result in discriminatory outcomes, such as disproportionate error rates across demographic groups, because those groups are underrepresented in the few available examples.
Scalability poses another critical hurdle, driven by the high computational cost of meta-training and the difficulty of handling high-dimensional data. Meta-learning paradigms like MAML require nested optimization loops over numerous tasks, resulting in substantial training times and resource demands that limit applicability in resource-constrained environments. Extending FSL to high-dimensional modalities, such as video, intensifies this problem, as temporal and spatial complexity increases the parameter space and compute requirements, often leading to inefficient processing and poor generalization with few shots.
Finally, the theoretical foundations of FSL remain underdeveloped, with a notable lack of guarantees on generalization and conditions for success. While some analyses provide bounds for specific settings, such as meta-sparse regression, there is no comprehensive framework explaining why FSL succeeds on certain task distributions but fails on others, which hinders the design of reliable algorithms. This gap underscores the need for rigorous theoretical work on generalization bounds and meta-optimization to predict and improve FSL performance systematically.
Recent advancements in few-shot learning have increasingly integrated large foundation models, such as vision-language models (VLMs) and large language models (LLMs), to enable zero-shot to few-shot capabilities. For instance, adaptations of CLIP have demonstrated improved few-shot performance by aligning visual and textual representations through prompt learning and lightweight adapters. Similarly, Flamingo, a VLM that interleaves visual and textual data during pretraining, supports few-shot learning for tasks like visual question answering (VQA) by conditioning on a small number of image-text pairs, outperforming prior models in zero-shot settings and extending to few-shot with minimal adaptation. These integrations leverage the emergent generalization properties of foundation models, allowing few-shot adaptation without full retraining.
Self-supervised pretraining has emerged as a key strategy for improving few-shot initialization by exploiting vast unlabeled data, particularly through contrastive learning extensions that learn robust representations before few-shot adaptation. Methods such as contrastive mixtures in self-supervised frameworks have shown that pretraining on unlabeled images improves few-shot classification accuracy on benchmarks such as miniImageNet, as the learned features reduce reliance on labeled support sets. For example, data-efficient contrastive self-supervision, as in SwAV or MoCo variants, enables better transfer to few-shot scenarios by capturing invariant structure in the data, leading to more stable prototypes in metric-based learners. This approach is particularly effective in low-data regimes, where self-supervised initialization mitigates overfitting and boosts generalization across domains.
Multimodal few-shot learning has gained traction by combining vision and language modalities, enabling tasks like VQA with only 1-5 examples through frozen or lightly adapted models. In approaches that pair frozen LLMs with visual encoders, few-shot prompting with image-question-answer triplets achieves competitive performance on VQA datasets by leveraging cross-modal alignment. Meta-learning extensions for VLMs refine this further by distilling adaptive prompts, allowing rapid binding of novel visual concepts to linguistic descriptions in few-shot benchmarks. These methods highlight the potential for unified representations that generalize across vision-language tasks with sparse supervision.
Looking ahead, neurosymbolic hybrids are poised to enhance interpretability in few-shot learning by fusing neural networks with symbolic reasoning, as seen in frameworks that learn domain abstractions from a few skill demonstrations for long-horizon tasks. Federated few-shot learning addresses privacy-preserving needs by enabling collaborative adaptation across distributed clients with limited labels, as in FewFedPIT, which improves model utility while maintaining privacy protection. Scaling pre-trained models continues to improve few-shot adaptation, with empirical scaling laws showing that larger pretrained capacities yield predictable improvements in adaptation efficiency.
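Returning to the label-noise problem raised under Open Problems, the following sketch shows why a single mislabeled support example biases a class prototype far more at 1-5 shots than with abundant data. It is a deliberately idealized illustration: every clean support embedding is assumed to sit exactly at its class center, the two-dimensional centers are made up, and the helper names `prototype` and `shift_from_one_bad_label` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def prototype(embeddings: Sequence[Vector]) -> Vector:
    """Class prototype as the coordinate-wise mean of support embeddings."""
    n = len(embeddings)
    return [sum(dim) / n for dim in zip(*embeddings)]


def shift_from_one_bad_label(
    clean_center: Vector, other_center: Vector, k_shot: int
) -> float:
    """Distance the prototype moves when 1 of k support points is mislabeled.

    Idealization: clean points coincide with clean_center, and the single
    mislabeled point coincides with other_center (a different class).
    """
    clean = [clean_center] * k_shot
    noisy = [clean_center] * (k_shot - 1) + [other_center]
    return math.dist(prototype(clean), prototype(noisy))


center_a = [0.0, 0.0]
center_b = [4.0, 3.0]  # 5 units away from center_a
for k in (1, 5, 50):
    print(k, shift_from_one_bad_label(center_a, center_b, k))
```

Under these assumptions the prototype shift is the inter-class distance divided by K, so one bad label moves a 1-shot prototype the full 5 units, a 5-shot prototype 1 unit, and a 50-shot prototype only 0.1 units.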

References

  1. [1]
    A Summary of Approaches to Few-Shot Learning - arXiv
    Mar 7, 2022 · Few-Shot Learning refers to the problem of learning the underlying pattern in the data just from a few training samples. Requiring a large ...
  2. [2]
    [2205.06743] A Comprehensive Survey of Few-shot Learning - arXiv
    May 13, 2022 · This survey investigates 200+ papers on few-shot learning (FSL), comparing concepts, proposing a taxonomy, and highlighting applications in ...
  3. [3]
  4. [4]
    [1703.05175] Prototypical Networks for Few-shot Learning - arXiv
    Mar 15, 2017 · We propose prototypical networks for the problem of few-shot classification, where a classifier must generalize to new classes not seen in the training set.
  5. [5]
    Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
    Mar 9, 2017 · We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent.
  6. [6]
    Generalizing from a Few Examples: A Survey on Few-shot Learning
    Jun 12, 2020 · Few-shot learning (FSL) uses prior knowledge to rapidly generalize to new tasks with only a few samples and supervised information.
  7. [7]
    Few Shot Learning for Rare Disease Diagnosis - DSpace@MIT
    The goal of this thesis is to develop few shot learning methods that can overcome the data limitations of deep learning approaches to diagnose patients with ...
  8. [8]
    Applying Few-Shot Learning for In-the-Wild Camera-Trap Species ...
    Jul 31, 2023 · Few-shot learning aims to adapt to a new task with a small amount of labeled data, and researchers have explored multiple ways of achieving that ...
  9. [9]
    [PDF] Building Machines That Learn and Think Like People
    Apr 1, 2016 · Furthermore, the human capacity for one-shot learning suggests that these models are built upon rich domain knowledge rather than starting from ...
  10. [10]
    [PDF] Human-level concept learning through probabilistic program induction
    Dec 10, 2015 · The model uses probabilistic program induction, representing concepts as simple programs, to learn from single examples and achieve human-level ...
  11. [11]
    A Survey on Machine Learning from Few Samples - arXiv
    Sep 6, 2020 · In this survey, we review the evolution history ...
  12. [12]
  13. [13]
    [1606.04080] Matching Networks for One Shot Learning - arXiv
    Jun 13, 2016 · In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories.
  14. [14]
    How transferable are features in deep neural networks? - arXiv
    In this paper we experimentally quantify the generality versus specificity of neurons in each layer of a deep convolutional neural network and report a few ...
  15. [15]
    [1909.04630] Meta-Learning with Implicit Gradients - arXiv
    Sep 10, 2019 · Experimentally, we show that these benefits of implicit MAML translate into empirical gains on few-shot image recognition benchmarks.
  16. [16]
    [1706.08840] Gradient Episodic Memory for Continual Learning - arXiv
    Jun 26, 2017 · Gradient Episodic Memory (GEM) is a model for continual learning that alleviates forgetting and allows transfer of knowledge to previous tasks.
  17. [17]
    A Comprehensive Survey on Data Augmentation
    Summary of the role of data augmentation in few-shot learning and its importance for imbalanced classes.
  18. [18]
    [1801.05401] Low-Shot Learning from Imaginary Data - arXiv
    Jan 16, 2018 · Authors: Yu-Xiong Wang, Ross Girshick, Martial Hebert, Bharath Hariharan.
  19. [19]
    Charting the Right Manifold: Manifold Mixup for Few-shot Learning
    Jul 28, 2019 · This work investigates the role of learning relevant feature manifold for few-shot tasks using self-supervision and regularization techniques.
  20. [20]
    Pushing the Limits of Simple Pipelines for Few-Shot Learning - arXiv
    Apr 15, 2022 · We seek to push the limits of a simple-but-effective pipeline for more realistic and practical settings of few-shot image classification.
  21. [21]
    Effective and Robust Data Augmentation for Few-Shot Learning
    We propose a novel data augmentation method FlipDA that jointly uses a generative model and a classifier to generate label-flipped data.
  22. [22]
    [PDF] and Few-Shot Learning via Aligned Variational Autoencoders
    The CADA-VAE model uses aligned VAEs to learn a shared latent space of image features and class embeddings, enabling knowledge transfer to unseen classes.
  23. [23]
    [1901.02199] FIGR: Few-shot Image Generation with Reptile - arXiv
    Jan 8, 2019 · FIGR is a GAN meta-trained with Reptile for few-shot image generation, generating novel images with as little as 4 images from an unseen class.
  24. [24]
    ProtoGAN: Towards Few Shot Learning for Action Recognition - arXiv
    Sep 17, 2019 · In this paper, we address this problem by proposing a novel ProtoGAN framework which synthesizes additional examples for novel categories.
  25. [25]
    [2205.15463] Few-Shot Diffusion Models - arXiv
    May 30, 2022 · In this paper, we present Few-Shot Diffusion Models (FSDM), a framework for few-shot generation leveraging conditional DDPMs.
  26. [26]
    AI models collapse when trained on recursively generated data
    Jul 24, 2024 · Model collapse is a degenerative process affecting generations of learned generative models, in which the data they generate end up polluting the training set ...
  27. [27]
    [PDF] Few-Shot Object Detection With Attention-RPN and Multi-Relation ...
    We propose a general few-shot object detection network that learns the matching metric between image pairs based on the Faster R-CNN framework equipped with ...
  28. [28]
    A few-shot rare wildlife image classification method based on style ...
    A model trained by our method was used to classify six rare wildlife species with a classification accuracy of 92.2% and an F1 score of 93.3%. The deep ...
  29. [29]
    [PDF] Language Models are Few-Shot Learners - arXiv
    Jul 22, 2020 · In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call. GPT-3, and measuring ...
  30. [30]
    None
    Summary of few-shot adaptation results on the IWSLT dataset.
  31. [31]
    FewRel: A Large-Scale Supervised Few-Shot Relation Classification ...
    Oct 24, 2018 · We present a Few-Shot Relation Classification Dataset (FewRel), consisting of 70,000 sentences on 100 relations derived from Wikipedia and annotated by ...
  32. [32]
    A Dataset of Datasets for Learning to Learn from Few Examples - arXiv
    Mar 7, 2019 · Meta-Dataset is a large-scale benchmark for training and evaluating models for few-shot classification, consisting of diverse datasets and ...
  33. [33]
    [2405.12299] Perturbing the Gradient for Alleviating Meta Overfitting
    May 20, 2024 · This paper proposes a number of solutions to tackle meta-overfitting on few-shot learning settings, such as few-shot sinusoid regression and few shot ...
  34. [34]
    An Overview of Deep Neural Networks for Few-Shot Learning
    Dec 19, 2024 · This paper provides a comprehensive survey of FSL, reviewing prominent deep learning based approaches of FSL.
  35. [35]
    [2204.05494] Few-shot Learning with Noisy Labels - arXiv
    Apr 12, 2022 · Robustness to label noise is therefore essential for FSL methods to be practical, but this problem surprisingly remains largely unexplored.
  36. [36]
    A Comprehensive Review of Few-shot Action Recognition - arXiv
    Jul 20, 2024 · Few-shot action recognition aims to address the high cost and impracticality of manually labeling complex and variable video data in action recognition.
  37. [37]
    Flamingo: a Visual Language Model for Few-Shot Learning - arXiv
    Apr 29, 2022 · These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning ...
  38. [38]
    Multimodal Few-Shot Learning with Frozen Language Models - arXiv
    Jun 25, 2021 · We present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language).
  39. [39]
    [2011.03426] Self-Supervised Learning from Contrastive Mixtures ...
    Nov 6, 2020 · We specifically address the few-shot learning scenario where ... self-supervised pretraining without contrastive loss terms. Of all ...
  40. [40]
    [2302.14794] Meta Learning to Bridge Vision and Language Models ...
    Feb 28, 2023 · We evaluate our approach on recently proposed multimodal few-shot benchmarks, measuring how rapidly the model can bind novel visual concepts to ...
  41. [41]
    Few-Shot Neuro-Symbolic Imitation Learning for Long-Horizon ...
    Aug 29, 2025 · We propose a novel neuro-symbolic framework that jointly learns continuous control policies and symbolic domain abstractions from a few skill ...
  42. [42]
    FewFedPIT: Towards Privacy-preserving and Few-shot Federated ...
    Mar 10, 2024 · In this paper, we propose a novel federated algorithm, FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few- ...
  43. [43]
    Scaling Laws for the Few-Shot Adaptation of Pre-trained ... - arXiv
    Oct 13, 2021 · Our current main goal is to investigate how the amount of pre-training data affects the few-shot generalization performance of standard image classifiers.