
Feature learning

Feature learning, also known as representation learning, is a core paradigm in machine learning that enables algorithms to automatically discover and extract useful representations from raw, high-dimensional data, thereby facilitating downstream tasks such as classification, clustering, and prediction without relying on manual feature engineering. This approach addresses the critical dependency of model performance on data representation, where traditional methods require labor-intensive hand-crafting of features tailored to specific tasks, often limiting scalability and generalization across domains. By contrast, feature learning algorithms learn hierarchical or disentangled representations that capture underlying explanatory factors of variation in the data, such as edges in images or semantic relationships in text, improving efficiency and effectiveness in handling complex, unstructured inputs like audio, video, and text. Key methods in feature learning encompass unsupervised techniques, including probabilistic models like restricted Boltzmann machines (RBMs) and sparse coding, which learn sparse, overcomplete representations to model data distributions; autoencoder variants, such as denoising and contractive autoencoders, that compress and reconstruct data to enforce robust encodings; and manifold learning approaches that assume data lies on low-dimensional manifolds embedded in high-dimensional spaces. Supervised and self-supervised paradigms extend these by incorporating labels or pretext tasks to guide representation quality, while deep architectures, often pre-trained layer-wise, enable end-to-end learning of multi-level features in neural networks. Feature learning has driven transformative advancements across applications, notably reducing speech recognition error rates by up to 30% through deep belief networks, achieving top performance on the ImageNet large-scale visual recognition challenge by dropping object classification errors from 26.1% to 15.3% with convolutional networks, and enabling distributed word representations in natural language processing that outperform traditional bag-of-words models. Ongoing research focuses on challenges like out-of-distribution generalization, where learned features must remain robust to shifts in data distributions, as explored in recent analyses of model behavior under spurious correlations. These developments underscore feature learning's role in advancing machine learning toward more autonomous and interpretable systems.

Fundamentals

Definition and Motivation

Feature learning, also known as representation learning, refers to a set of techniques in machine learning where algorithms automatically discover useful representations or transformations of raw data that make subsequent tasks easier to perform, without relying on manual feature engineering. These representations aim to disentangle the underlying factors of variation in the data, transforming high-dimensional inputs into more compact and informative forms that capture essential structures. The primary motivation for feature learning arises from the limitations of traditional manual feature design, which requires extensive domain expertise and struggles with the curse of dimensionality in high-dimensional data such as images, audio, or text. Manual engineering often fails to scale effectively, as crafting effective features by hand becomes impractical for complex, large-scale datasets where the input space grows exponentially, leading to sparse sampling and poor generalization. In contrast, feature learning enables end-to-end optimization, allowing models to adapt representations directly to the data's intrinsic properties and the demands of downstream tasks. Key benefits include enhanced generalization by focusing on invariant and transferable features, scalability to high-dimensional inputs through dimensionality reduction and hierarchical abstraction, and flexibility across diverse tasks like classification, regression, and clustering. The basic workflow involves feeding raw data into a learning algorithm, which iteratively refines hierarchical or task-specific features through optimization, and then applying these features to downstream models for prediction or analysis. For instance, in image processing, raw pixel intensities serve as inputs, whereas learned features might represent edges or textures that abstract low-level patterns into higher-level concepts useful for object recognition.

Historical Development

The origins of feature learning trace back to early 20th-century statistical methods for dimensionality reduction. Principal component analysis (PCA), introduced by Karl Pearson in 1901, provided a foundational technique for linear dimensionality reduction by identifying principal axes of variation in data, enabling the extraction of uncorrelated features from high-dimensional observations. In the 1990s, independent component analysis (ICA) emerged as a key advancement, with Pierre Comon's 1994 formulation defining it as a method to separate multivariate signals into statistically independent subcomponents, and Tony Bell and Terrence Sejnowski's 1995 infomax approach popularizing its application through information maximization principles. These techniques laid the groundwork for unsupervised feature extraction in statistical signal processing. The late 1990s and early 2000s saw the development of sparse coding models, which inspired modern dictionary learning. Bruno Olshausen and David Field's 1996 work demonstrated that learning sparse linear codes for natural images could yield receptive fields resembling those in the mammalian visual cortex, introducing the idea of overcomplete dictionaries for efficient signal representation. Autoencoders, first conceptualized in the 1980s by David Rumelhart, Geoffrey Hinton, and Ronald Williams as part of backpropagation-based representation learning, gained renewed traction in the 2000s for feature extraction. Dictionary learning advanced further with the K-SVD algorithm by Michal Aharon, Michael Elad, and Alfred Bruckstein in 2006, which iteratively optimizes sparse representations over overcomplete bases, enhancing applications in signal and image processing. A pivotal revival occurred in 2006 with Geoffrey Hinton's introduction of deep belief networks (DBNs), which used layer-wise unsupervised pretraining to address vanishing gradients in deep architectures, reigniting interest in hierarchical feature learning. The 2010s marked the explosion of deep learning paradigms, exemplified by Alex Krizhevsky, Ilya Sutskever, and Hinton's AlexNet in 2012, a convolutional neural network that achieved breakthrough performance on the ImageNet dataset through supervised feature hierarchies, catalyzing widespread adoption of end-to-end learning. Unsupervised and self-supervised methods surged, with Tomas Mikolov et al.'s Word2Vec in 2013 enabling dense vector representations of words via predictive training on text corpora. Jacob Devlin et al.'s BERT in 2018 extended this to bidirectional transformer-based pretraining, learning contextual features from masked language modeling. In vision, Ting Chen et al.'s SimCLR framework in 2020 simplified contrastive self-supervised learning, producing robust visual representations without labels. The integration of transformers facilitated scalable feature learning across modalities in the late 2010s and 2020s. Alec Radford et al.'s CLIP in 2021 aligned image and text features through contrastive pretraining on vast collections of image-text pairs, enabling zero-shot capabilities. Foundation models like OpenAI's GPT-3, detailed by Tom Brown et al. in 2020, demonstrated few-shot learning of versatile representations via autoregressive pretraining on internet-scale text. Subsequent iterations, including GPT-4 in 2023, emphasized efficient, transferable features for diverse tasks. In 2024, advancements continued with OpenAI's GPT-4o, which integrated native processing of text, audio, and images for more cohesive multimodal representations; Meta's Llama 3, scaling open-source models for efficient representation learning; and Google's Gemini 1.5, enhancing long-context feature extraction.
Key reviews, such as Yoshua Bengio, Aaron Courville, and Pascal Vincent's 2013 survey on representation learning and Yann LeCun, Bengio, and Hinton's 2015 overview of deep learning, synthesized these milestones, highlighting the shift toward hierarchical, data-driven feature extraction. As of November 2025, emphasis has grown on scalable, efficient representations in foundation models, supporting multimodal and self-supervised paradigms for real-world deployment.

Supervised Feature Learning

Supervised Dictionary Learning

Supervised dictionary learning extends the principles of dictionary learning by incorporating labeled data to guide the optimization process toward discriminative representations suitable for tasks such as classification. In this approach, a dictionary D—an overcomplete basis consisting of atoms—and sparse codes S are learned such that the input data X can be approximately reconstructed as X \approx D S, while simultaneously minimizing a supervised loss that leverages class labels to enhance discrimination between categories. This integration of labels ensures that the learned features are not only sparse and reconstructive but also task-oriented, improving performance in downstream supervised learning scenarios. The formulation typically involves minimizing a composite objective function that balances reconstruction fidelity, sparsity, and supervised task performance. A common objective is to minimize \|X - D S\|_2^2 + \lambda \|S\|_1 augmented with a supervised loss term, such as the hinge loss for integration with support vector machines (SVMs), which penalizes misclassifications based on the labels. More advanced variants directly incorporate the classification error into the dictionary update, as in the label-consistent K-SVD (LC-KSVD) algorithm proposed by Jiang et al. in 2011, which enforces label consistency by adding a term that aligns sparse codes with class labels through a linear transformation. Another seminal method, introduced by Mairal et al. in 2008, learns a shared overcomplete dictionary along with class-specific decision functions (linear or kernel-based) to enable discriminative sparse representations for classification. A key optimization problem in supervised dictionary learning can be expressed as: \arg\min_{D,S} \frac{1}{2} \|Y - f(DS)\|_2^2 + \alpha \|X - DS\|_2^2 + \beta \Omega(S), where Y represents the label matrix, f is a classifier function (e.g., linear or softmax), \alpha and \beta are regularization parameters, and \Omega(S) is a sparsity-inducing penalty such as the \ell_1-norm. This formulation, central to methods like LC-KSVD, alternates between updating the sparse codes S via pursuit algorithms (e.g., orthogonal matching pursuit) and refining the dictionary D to minimize both reconstruction and classification errors. The sparsity enforced in these methods promotes interpretability by selecting only a few relevant atoms per signal, allowing identification of discriminative features, and facilitates seamless integration with classifiers like SVMs for end-to-end optimization. In applications such as face recognition, labels guide the formation of class-specific atoms in the dictionary, leading to improved recognition accuracy; for instance, LC-KSVD has demonstrated superior performance over unsupervised counterparts on datasets like Extended Yale B, achieving 97.5% accuracy in controlled settings. Unlike unsupervised dictionary learning, which lacks label guidance and focuses solely on reconstruction, supervised variants explicitly optimize for task discrimination.
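The alternating structure of this optimization can be illustrated with a minimal NumPy sketch that couples an ISTA sparse-coding step with least-squares updates of the dictionary and a linear classifier on the codes; the function names, step sizes, and simplified updates are illustrative assumptions rather than the exact LC-KSVD or Mairal et al. procedures.

```python
import numpy as np

def soft_threshold(Z, t):
    """Proximal operator of the l1 norm."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def supervised_dictionary_learning(X, Y, k, lam=0.1, alpha=1.0, outer=30):
    """Alternating minimization over sparse codes S, dictionary D, and a
    linear classifier W on the codes (simplified, illustrative updates)."""
    n, m = X.shape
    c = Y.shape[0]
    rng = np.random.default_rng(0)
    D = rng.standard_normal((n, k))
    D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
    W = np.zeros((c, k))
    S = np.zeros((k, m))
    for _ in range(outer):
        # Sparse coding (ISTA) for fixed D, W: gradient of reconstruction
        # plus classification terms, then soft-thresholding for sparsity.
        L = np.linalg.norm(D.T @ D + alpha * W.T @ W, 2)  # Lipschitz bound
        for _ in range(20):
            grad = D.T @ (D @ S - X) + alpha * W.T @ (W @ S - Y)
            S = soft_threshold(S - grad / L, lam / L)
        # Least-squares updates of dictionary and classifier on fixed codes.
        S_pinv = np.linalg.pinv(S)
        D = X @ S_pinv
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)
        W = Y @ S_pinv
    return D, W, S
```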

Supervised Neural Networks

Supervised neural networks extract task-specific features from labeled data through an end-to-end optimization process driven by backpropagation, which computes gradients of the loss with respect to network weights and updates them to align intermediate representations with the provided labels. This mechanism enables hierarchical feature learning, where shallow layers detect low-level patterns such as edges and textures in input data, while deeper layers compose these into high-level abstractions like object parts or semantic concepts relevant to the task. As a linear precursor, supervised dictionary learning provided foundational ideas for sparse, class-discriminative representations that influenced the development of these more flexible neural models. Key architectures include multilayer perceptrons (MLPs), which process tabular data by stacking fully connected layers to learn non-linear transformations, and convolutional neural networks (CNNs), designed for spatial data like images through convolutional filters that capture local invariances. Pioneering examples are LeNet, introduced in 1989 for handwritten digit recognition using convolutional layers to extract digit-specific features from pixel inputs, and AlexNet from 2012, which scaled deeper CNNs with dropout and GPU acceleration to achieve breakthrough performance on large-scale image classification by learning robust visual hierarchies. Training involves minimizing a loss function, such as the cross-entropy loss defined as L = -\sum_{i} y_i \log(\hat{y}_i), where y represents the true label distribution and \hat{y} the predicted probabilities from softmax outputs, with learned features manifesting as the activations in hidden layers. Non-linearities, introduced via activation functions like ReLU (f(x) = \max(0, x)), allow networks to model complex mappings efficiently without vanishing gradients. These networks offer advantages in handling intricate data dependencies through their parametric, hierarchical structure, and support transfer learning by reusing pre-trained features from source tasks—such as convolutional bases pre-trained on ImageNet—to initialize models for related targets, often yielding substantial gains in performance and data efficiency. The field saw a major revival after 2010, fueled by GPU-enabled parallel computation that made training deep architectures feasible on massive datasets. By 2025, advancements like low-rank adaptation (LoRA) have enhanced efficient fine-tuning of large supervised models by injecting low-rank matrices into pre-trained weights, reducing computational overhead while preserving feature quality. For instance, in image classification tasks, a CNN iteratively refines features from raw pixels in early layers to discriminative semantic representations in later ones, enabling accurate predictions on complex visual inputs.
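A minimal NumPy sketch of the forward pass and cross-entropy loss illustrates where the learned features live (the hidden activations h), under the simplifying assumption of a single fully connected hidden layer; all names here are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    """One hidden layer: the activations h are the learned features
    that backpropagation shapes to fit the labels."""
    h = relu(X @ W1 + b1)                  # hidden features
    return h, softmax(h @ W2 + b2)         # class probabilities

def cross_entropy(probs, Y_onehot):
    """L = -sum_i y_i log(y_hat_i), averaged over the batch."""
    return -np.mean(np.sum(Y_onehot * np.log(probs + 1e-12), axis=1))
```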

Unsupervised Feature Learning

Clustering-Based Methods

Clustering-based methods partition data points into groups based on similarity, revealing inherent structures that can be used as features in downstream tasks. These features often take the form of cluster assignments, centroids, or probabilities of belonging to each cluster, enabling representation of data in a lower-dimensional or more interpretable space. For instance, hard assignments assign each point to a single cluster, while soft assignments use probabilities to capture uncertainty, allowing for more nuanced feature representations. A key example of soft clustering is the Gaussian Mixture Model (GMM), a probabilistic approach that models data as a mixture of Gaussian distributions. GMMs estimate parameters via the expectation-maximization (EM) algorithm, assigning points soft memberships based on posterior probabilities, which is useful for feature learning in density estimation and clustering tasks. A foundational approach is the K-means algorithm, which iteratively partitions data into K clusters to minimize intra-cluster variance. The process begins with random initialization of K centroids, followed by assignment of each data point to the nearest centroid based on Euclidean distance, and then updating each centroid as the mean of points in its cluster; this alternates until convergence. The objective function optimized is the within-cluster sum of squares: \min_{\mu_1, \dots, \mu_K} \sum_{k=1}^K \sum_{i \in C_k} \| x_i - \mu_k \|^2, where C_k denotes the set of points in cluster k and \mu_k is its centroid. In feature learning, the resulting centroids or assignment vectors serve as a dictionary for encoding new data, such as projecting inputs onto cluster centers to form sparse representations. Variants extend K-means to handle complex data structures. Spectral clustering leverages the eigenvectors of a similarity graph's Laplacian matrix to embed data in a lower-dimensional space before applying K-means, effectively capturing non-linear manifolds and non-convex clusters. Hierarchical clustering, in contrast, constructs a tree-like dendrogram by successively merging or splitting clusters, enabling multi-scale feature extraction where features at different resolutions represent coarse-to-fine groupings. These methods derive features by treating clusters as prototypes. In text analysis, documents are often represented as histograms over word clusters, extending the bag-of-words model to capture topic-like structures via cluster proportions. Similarly, in computer vision, local image descriptors are clustered to form a visual vocabulary, with images encoded as histograms of visual words for tasks like image classification. Post-clustering, embeddings can be refined by concatenating assignment vectors or distances to centroids. Despite their utility, clustering-based methods have limitations, including K-means' assumption of isotropic, spherical clusters, which fails on elongated or irregular shapes. Additionally, the algorithm is sensitive to the choice of K, addressed via the elbow method—plotting within-cluster variance against K to identify a point of diminishing returns—or the silhouette score, which measures how well points fit their clusters relative to others. The silhouette score for a point i is given by s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}, where a(i) is the mean intra-cluster distance and b(i) is the mean distance to points in the nearest other cluster. By 2025, clustering remains relevant through integration with deep networks, as in Deep Embedded Clustering (DEC), which jointly optimizes an autoencoder for feature extraction and a clustering layer for assignments, improving representations on benchmarks like MNIST.
Recent advances, including deep learning-driven feature extraction with convolutional neural networks and multi-scale unsupervised networks, further enhance clustering performance on complex data such as images.
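A compact NumPy sketch of Lloyd's iterations and centroid-based encoding illustrates how cluster prototypes become features; the random initialization, fixed iteration budget, and distance-vector encoding are illustrative choices.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and
    centroid recomputation until assignments stabilize."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

def encode(X, centroids):
    """Feature map: each point becomes its vector of distances to the
    cluster prototypes, a simple dictionary-style encoding."""
    return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
```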

Dimensionality Reduction Techniques

Dimensionality reduction techniques in feature learning aim to transform high-dimensional data into a lower-dimensional representation while preserving essential structure, such as variance or local neighborhoods, to facilitate subsequent analysis or modeling. These methods are particularly valuable in unsupervised settings, where they extract latent features by identifying directions of maximum variability or manifold geometries without relying on labels. Linear approaches like principal component analysis (PCA) focus on global variance preservation, whereas nonlinear methods such as locally linear embedding (LLE) and Isomap emphasize local or geodesic structures to uncover non-Euclidean manifolds. Principal component analysis is a foundational linear technique that projects data onto orthogonal axes capturing the directions of greatest variance. Given a centered data matrix X \in \mathbb{R}^{n \times d}, PCA computes the covariance matrix \Sigma = \frac{1}{n} X^T X and performs the eigen-decomposition \Sigma = U \Lambda U^T, where U contains the eigenvectors and \Lambda the eigenvalues ordered by descending magnitude. The reduced representation Z = X U_k is obtained by projecting onto the top k eigenvectors, maximizing the retained variance \frac{1}{n} \operatorname{tr}(Z^T Z). In practice, singular value decomposition (SVD) of X is often used for numerical stability, yielding X = V \Sigma W^T, with principal directions as the columns of W. PCA assumes linear relationships among features and that variance is a suitable proxy for information, which may not hold for nonlinear data structures. Beyond linear projections, nonlinear methods address manifold assumptions where high-dimensional data lies on a low-dimensional curved surface. Locally linear embedding (LLE) preserves local neighborhood geometry by reconstructing each point as a linear combination of its neighbors in both input and embedding spaces. For a point x_i with k-nearest neighbors, reconstruction weights W_{ij} minimize \sum_i \| x_i - \sum_j W_{ij} x_j \|^2 subject to \sum_j W_{ij} = 1, assuming local linearity. The low-dimensional embedding Y \in \mathbb{R}^{n \times m} then minimizes the embedding cost \sum_i \| y_i - \sum_j W_{ij} y_j \|^2, solved via an eigenvalue problem on the matrix M = (I - W)^T (I - W), retaining the bottom m+1 eigenvectors (excluding the trivial all-ones solution). LLE assumes that local neighborhoods remain linearly reconstructible in the low-dimensional space but does not explicitly preserve global distances. Isomap extends classical multidimensional scaling (MDS) to nonlinear manifolds by estimating geodesic distances via shortest paths on a neighborhood graph, complementing LLE by preserving global geometry. It constructs a graph where edges represent Euclidean distances to nearest neighbors, computes all-pairs shortest paths using algorithms such as Dijkstra's or Floyd-Warshall, and applies classical MDS to the geodesic distance matrix for embedding. This approach assumes the manifold is isometric to a convex subset of Euclidean space, making it suitable for data with intrinsic curvature, such as facial images or protein configurations. In contrast, t-distributed stochastic neighbor embedding (t-SNE) is primarily a visualization tool that minimizes divergences between high- and low-dimensional probability distributions of pairwise similarities but is non-parametric and less ideal for general feature extraction due to its sensitivity to hyperparameters and lack of out-of-sample extensions. A more recent nonlinear technique, Uniform Manifold Approximation and Projection (UMAP), builds on manifold theory to preserve both local and global structure through fuzzy simplicial sets and a cross-entropy optimization. UMAP is faster and more scalable than t-SNE, making it suitable for large datasets in feature learning and visualization as of 2025.
In feature learning, reduced representations from these techniques serve as compact inputs to downstream classifiers, often improving efficiency and generalization by mitigating the curse of dimensionality. For instance, PCA-derived features can be whitened by scaling components inversely with their standard deviations, decorrelating and normalizing inputs for models like support vector machines. Computationally, PCA via SVD scales as O(\min(n^2 d, n d^2)), efficient for moderate dimensions, while LLE involves O(n^2 k) for neighbor searches and eigenvalue solving at O(n^3), limiting scalability without approximations. These methods thus enable discovery of interpretable low-dimensional features underlying complex datasets.
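The following NumPy sketch computes PCA projections via SVD and applies the whitening described above; the epsilon guard and function name are illustrative assumptions.

```python
import numpy as np

def pca_whiten(X, k, eps=1e-8):
    """PCA via SVD on the centered data; returns the projected features
    and a whitened version scaled to unit variance per component."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                    # top-k principal components
    var = (s[:k] ** 2) / len(X)          # variance along each component
    Z_white = Z / np.sqrt(var + eps)     # decorrelate and normalize
    return Z, Z_white
```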

Independent Component Analysis

Independent Component Analysis (ICA) is an unsupervised feature learning technique that decomposes multivariate data into statistically independent components, enabling the extraction of underlying features from mixed signals without prior knowledge of the mixing process. This method is particularly valuable in scenarios where data arises from linear mixtures of hidden sources, such as in audio signal processing and neuroimaging, by assuming that the original sources are non-Gaussian and independent. The core principle of ICA models the observed data \mathbf{X} as a linear transformation of independent source signals \mathbf{S}, expressed as \mathbf{X} = \mathbf{A} \mathbf{S}, where \mathbf{A} is the unknown mixing matrix. The goal is to estimate an unmixing matrix \mathbf{W} \approx \mathbf{A}^{-1} such that the recovered signals \mathbf{Y} = \mathbf{W} \mathbf{X} approximate the independent sources \mathbf{S}. Often, PCA is applied as a preprocessing step to whiten the data, centering and sphering \mathbf{X} to simplify the ICA estimation. The objective of ICA is to maximize the statistical independence of the components, typically by measuring and enhancing non-Gaussianity through negentropy, approximated as J(\mathbf{y}) \approx \sum_k \left[ G(y_k) - G(v) \right]^2, where G is a non-quadratic contrast function approximating the sources' distributions and v is a standard Gaussian variable. Alternatively, independence can be achieved by minimizing mutual information between components, which equates to maximizing the joint entropy of the outputs under fixed marginal constraints. Key algorithms for ICA include FastICA, which employs fixed-point iteration for rapid convergence, and Infomax, which uses gradient ascent to maximize the log-likelihood of the data under an independence model. In FastICA, the update rule for the weight vector \mathbf{w} in one unit is given by: \mathbf{w}^+ \propto \mathbb{E} \left\{ \mathbf{z} g(\mathbf{w}^T \mathbf{z}) \right\} - \mathbb{E} \left\{ g'(\mathbf{w}^T \mathbf{z}) \right\} \mathbf{w}, where \mathbf{z} is the whitened data, g is a nonlinearity (e.g., g(u) = \tanh(u)), and the expectation is over the data distribution; subsequent orthogonalization and normalization ensure decorrelation. Infomax, in contrast, trains a neural network via backpropagation to maximize mutual information, treating separation as an information-theoretic optimization problem. ICA relies on key assumptions: the source signals are statistically independent, the mixing is linear, and at most one source is Gaussian to ensure identifiability, as Gaussian sources would be indistinguishable under linear transformations. These assumptions distinguish ICA from methods focused on mere decorrelation, emphasizing higher-order statistics for separation. In applications, ICA excels at blind source separation, such as isolating individual speech signals in the cocktail party problem for audio processing, where multiple voices overlap in a noisy environment. It is also widely used for feature extraction in electroencephalography (EEG) and functional magnetic resonance imaging (fMRI), decomposing brain signals into independent components to identify artifacts or neural patterns.
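A one-unit FastICA iteration with the tanh nonlinearity can be sketched in NumPy as follows, assuming the data have already been centered and whitened; the tolerance and seeding are illustrative.

```python
import numpy as np

def fastica_one_unit(Z, iters=200, tol=1e-6, seed=0):
    """One-unit FastICA fixed-point iteration on whitened data Z (d x n),
    using g(u) = tanh(u) and g'(u) = 1 - tanh(u)^2 as in the text."""
    rng = np.random.default_rng(seed)
    d, n = Z.shape
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for _ in range(iters):
        u = w @ Z                                   # projections, shape (n,)
        g, g_prime = np.tanh(u), 1.0 - np.tanh(u) ** 2
        w_new = (Z * g).mean(axis=1) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)              # renormalize
        if abs(abs(w_new @ w) - 1.0) < tol:         # converged (up to sign)
            return w_new
        w = w_new
    return w
```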

Unsupervised Dictionary Learning

Unsupervised dictionary learning aims to discover an overcomplete set of basis elements, or atoms, from unlabeled data to represent signals as sparse linear combinations of these atoms, thereby capturing intrinsic data structures without supervision. This approach contrasts with complete bases like wavelets by allowing more atoms than data dimensions, promoting efficient and interpretable representations. The core idea originates from sparse coding models that enforce sparsity to mimic biological visual systems, where neurons respond selectively to image parts. The standard formulation minimizes the reconstruction error while enforcing sparsity in the coefficients: \min_{D, S} \frac{1}{2} \| X - D S \|_2^2 + \lambda \| S \|_1, \quad \text{subject to } \| d_i \|_2 \leq 1 \ \forall i Here, X \in \mathbb{R}^{n \times m} is the data matrix with m signals of dimension n, D \in \mathbb{R}^{n \times k} is the dictionary with k > n atoms d_i, and S \in \mathbb{R}^{k \times m} are the sparse codes, with \lambda > 0 balancing fidelity and sparsity. This optimization jointly learns the dictionary D and codes S, often solved via alternating minimization: sparse coding for fixed D, then dictionary update. Constraints on atom norms prevent trivial solutions like scaling up atoms and down coefficients. Key algorithms include the K-SVD method, introduced in 2006, which iteratively alternates between sparse coding using orthogonal matching pursuit and dictionary updates via singular value decomposition on restricted error matrices to refine individual atoms while preserving sparsity. For large-scale data, online variants process samples sequentially using stochastic approximations, updating the dictionary incrementally to reduce computational cost and enable scalability to millions of examples. These methods have become foundational for unsupervised feature extraction due to their balance of accuracy and efficiency. Sparsity plays a crucial role by representing each signal as a combination of only a few atoms, typically 5-10% non-zero coefficients per signal, which encourages part-based decompositions such as edges or textures in images rather than holistic representations. This promotes disentangled features that align with perceptual organization, as sparse activations lead to localized, overlapping receptive fields akin to simple cells in the primary visual cortex. In applications, dictionary learning excels in image denoising, where learned atoms serve as adaptive filters for textures; for instance, training on noisy patches yields dictionaries that reconstruct clean signals with PSNR improvements of 2-5 dB over fixed-wavelet methods on standard benchmark images. It also supports texture synthesis by sampling sparse codes from the learned dictionary to generate novel patches that preserve statistical properties of input textures. An important extension is non-negative matrix factorization (NMF), which imposes non-negativity on both factors for additive parts-based learning: \min_{W, H} \| X - W H \|_F^2 \quad \text{subject to } W \geq 0, H \geq 0 where W acts as the dictionary and H the coefficients, yielding interpretable features like facial parts from images, as non-negativity prevents subtractive cancellations. NMF variants often use multiplicative updates for optimization. Evaluation typically measures reconstruction error via mean squared error on held-out data and sparsity level through metrics like the average number of non-zero coefficients per signal or the \ell_0-pseudo-norm, ensuring the dictionary achieves low distortion with high compression. These metrics confirm the method's ability to generalize beyond training sets.
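The alternation between sparse coding and dictionary updates can be sketched in NumPy as follows; this uses an ISTA coding step and a simple least-squares (MOD-style) dictionary update rather than the full K-SVD atom-by-atom SVD procedure.

```python
import numpy as np

def soft_threshold(Z, t):
    """Proximal operator of the l1 norm."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def dictionary_learning(X, k, lam=0.1, outer=30, inner=30, seed=0):
    """Alternating minimization of 0.5||X - DS||^2 + lam*||S||_1 with
    unit-norm atoms; a MOD-style sketch, not the K-SVD algorithm."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((n, k))
    D /= np.linalg.norm(D, axis=0)
    S = np.zeros((k, m))
    for _ in range(outer):
        L = np.linalg.norm(D.T @ D, 2)            # Lipschitz constant
        for _ in range(inner):                    # sparse coding for fixed D
            S = soft_threshold(S - (D.T @ (D @ S - X)) / L, lam / L)
        D = X @ np.linalg.pinv(S)                 # least-squares update (MOD)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)  # renormalize atoms
    return D, S
```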

Semi-Supervised Feature Learning

Graph-Based Methods

Graph-based methods in semi-supervised feature learning construct a similarity graph G = (V, E), where vertices V represent data points (both labeled and unlabeled), and edges E are weighted by pairwise similarities, often derived from kernel functions or distance metrics such as Gaussian kernels. Labels from a small set of annotated nodes are propagated to unlabeled ones by enforcing smoothness on the graph manifold, assuming that nearby points share similar features and labels. This is formalized through Laplacian regularization, minimizing the objective \min_f \sum_{i,j} W_{ij} \|f_i - f_j\|^2 + \sum_{i \in \mathcal{L}} \|f_i - y_i\|^2, where W is the affinity matrix, f are the predicted labels or features, \mathcal{L} denotes labeled nodes, and y_i are ground-truth labels. The first term promotes local consistency via the graph Laplacian L = D - W (with D the degree matrix), while the second enforces fitting to labeled data. Key algorithms include label propagation, introduced in 2003 as an iterative smoothing process that diffuses labels across the graph until convergence, treating predictions as harmonic functions that satisfy the Laplace equation on unlabeled nodes. Graph embeddings, such as Laplacian eigenmaps, learn low-dimensional representations by solving the eigendecomposition of the normalized Laplacian, preserving local geometry and manifold structure in the embedding space. These embeddings serve as learned features, capturing the intrinsic data geometry while incorporating label information for semi-supervised refinement. The process solves for feature assignments F via the linear system (I - \alpha S) F = Y, where S = D^{-1/2} W D^{-1/2} is the normalized affinity matrix, \alpha \in (0,1) controls propagation strength, and Y is the initial label matrix extended with zeros for unlabeled nodes; this yields closed-form harmonic solutions that extend labels smoothly. In feature learning, these methods produce embeddings as harmonic functions on the graph, providing low-dimensional coordinates that respect the manifold's geometry and propagate supervisory signals effectively. This leverages the manifold assumption—that high-dimensional data lies on a low-dimensional structure—enabling robust feature extraction even with scarce labels, as demonstrated in text classification tasks like document categorization, where graph-based propagation improved accuracy over supervised baselines by utilizing document similarity graphs. Advantages include scalability to large datasets via sparse representations and superior performance in low-label regimes, outperforming inductive methods by 10-20% in error rates on benchmark text datasets with only 1-5% labeled data. Variants extend this framework, such as transductive SVMs adapted to graphs, which incorporate manifold regularization into the SVM objective for joint optimization of decision boundaries and label propagation on the similarity graph. Recent updates as of 2025 integrate graph neural networks (GNNs) for dynamic graphs, where evolving topologies are handled by temporal message passing to learn time-varying features in semi-supervised settings, achieving state-of-the-art results on node classification with missing attributes by evolving structures during training.
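A closed-form label-spreading step in the style described above can be sketched in NumPy as follows, assuming a dense symmetric affinity matrix W and a label matrix Y with one-hot rows for labeled nodes and zero rows for unlabeled ones.

```python
import numpy as np

def label_spreading(W, Y, alpha=0.9):
    """Closed-form propagation: solve (I - alpha*S) F = (1 - alpha)*Y
    with S the symmetrically normalized affinity matrix."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt           # normalized affinity
    n = len(W)
    F = np.linalg.solve(np.eye(n) - alpha * S, (1 - alpha) * Y)
    return F.argmax(axis=1)                   # predicted class per node
```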

Generative Model Approaches

Generative model approaches to semi-supervised feature learning leverage both limited labeled data and abundant unlabeled data by modeling the joint distribution P(X, Y) for labeled examples and the marginal distribution P(X) for unlabeled ones, enabling the extraction of robust features that capture underlying data structures. This probabilistic framework allows the model to infer labels for unlabeled data through posterior inference, thereby improving generalization in scenarios with scarce annotations. Features are typically derived from latent variables Z in variational autoencoder (VAE)-like architectures, where Z represents a compressed, disentangled representation that encodes class-invariant properties while incorporating supervisory signals from labels. Key algorithms in this domain include semi-supervised generative adversarial networks (GANs) with an auxiliary classifier, introduced in 2016, which extend the GAN framework by training a discriminator to not only distinguish real from fake samples but also predict class labels on real data, using unlabeled samples to refine the discriminator's features. Another foundational approach is ladder networks, proposed in 2015, which integrate denoising objectives with supervised learning through a hierarchical stack of encoder-decoder pairs that enforce consistency across layers, allowing clean feature representations to propagate bidirectionally. These approaches train by maximizing the likelihood on labeled data while applying entropy minimization on unlabeled predictions, with the objective formulated as L = \sum_{\text{labeled}} \log P(Y \mid X) - \lambda \sum_{\text{unlabeled}} H(P(Y \mid X)), where H(\cdot) denotes the entropy of the predicted class distribution and \lambda weights the penalty, encouraging confident pseudo-labels for unlabeled samples to regularize the model. The latent variable Z serves as the primary source of semi-supervised representations, providing embeddings that blend supervised discrimination with density modeling to enhance downstream tasks such as classification or retrieval. In applications, these methods excel in image segmentation with few labels, where generative models like semi-supervised GANs generate plausible segmentations for unlabeled images to augment training, achieving notable improvements in pixel-wise accuracy on datasets like Cityscapes. Similarly, for anomaly detection, they model normal data distributions to identify deviations in unlabeled samples, as seen in setups using VAEs to flag outliers in industrial monitoring with minimal labeled anomalies. Despite their strengths, these approaches assume the correctness of the underlying generative model, which can lead to poor performance if the assumed data distribution is misspecified, and they often incur high computational costs due to iterative sampling and inference in high-dimensional spaces.
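The combined objective can be sketched as a loss over predicted class probabilities, written here for minimization (negative log-likelihood on labeled data plus a weighted entropy penalty on unlabeled predictions); the weighting parameter is an illustrative assumption.

```python
import numpy as np

def semi_supervised_loss(p_labeled, y_onehot, p_unlabeled, weight=0.5):
    """Cross-entropy on labeled predictions plus an entropy penalty that
    pushes unlabeled predictions toward confident pseudo-labels."""
    ce = -np.mean(np.sum(y_onehot * np.log(p_labeled + 1e-12), axis=1))
    ent = -np.mean(np.sum(p_unlabeled * np.log(p_unlabeled + 1e-12), axis=1))
    return ce + weight * ent
```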

Self-Supervised Feature Learning

Core Principles and Contrastive Learning

Self-supervised feature learning is a paradigm within machine learning that generates supervisory signals directly from the input data itself, enabling models to learn meaningful representations without relying on explicit labels. This approach treats the data as both input and output, creating pretext tasks that encourage the model to capture underlying structures, such as spatial relationships or contextual dependencies. As a branch of unsupervised methods focused on pre-training, self-supervised learning has gained prominence for its ability to leverage vast amounts of unlabeled data to initialize models for downstream supervised tasks. Pretext tasks form the core of self-supervised learning by defining surrogate objectives that provide pseudo-labels derived from the data. Examples include rotation prediction, where the model learns to identify the angle (e.g., 0°, 90°, 180°, or 270°) to which an image has been rotated; colorization, which involves predicting color values for grayscale images to reconstruct the original; and jigsaw puzzles, where the model rearranges shuffled image patches to their correct positions. These tasks promote the extraction of invariant and discriminative features, such as edges, textures, and object parts. By 2025, masked modeling has emerged as a dominant trend, extending beyond vision to non-vision domains like text and sequences, exemplified by BERT-style approaches that predict masked tokens in sentences to learn contextual embeddings. Contrastive learning, a key mechanism in self-supervised feature learning, operates by contrasting positive pairs—typically augmented views of the same instance—against negative pairs from different instances to pull similar representations closer and push dissimilar ones apart in the embedding space. This is often formulated using the InfoNCE (Noise-Contrastive Estimation) loss, which maximizes the similarity between positive pairs while minimizing it for negatives: \mathcal{L}_{NCE} = -\mathbb{E} \left[ \log \frac{\exp(\operatorname{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{N} \exp(\operatorname{sim}(z_i, z_k)/\tau)} \right] Here, z_i and z_j are projections of positive pair embeddings, \operatorname{sim}(\cdot, \cdot) is a similarity function (e.g., cosine similarity), \tau is a temperature parameter, and the sum includes one positive and N-1 negatives. Seminal techniques include instance discrimination via contrastive predictive coding (CPC), which predicts future representations in latent space using autoregressive modeling, and Momentum Contrast (MoCo), which maintains a dynamic queue of negative samples updated via a momentum encoder to enable large-batch training without collapsing representations. These methods offer significant advantages, including scalability to massive unlabeled datasets—such as billions of images—without the need for annotations, and strong transferability to downstream tasks like image classification and object detection, where pre-trained features achieve competitive performance after minimal adaptation. For instance, MoCo v2 demonstrates linear probing accuracy exceeding 70% on ImageNet, rivaling supervised pre-training while using only unlabeled data. Evaluation typically involves linear probing, where a simple linear classifier is trained on frozen self-supervised features to assess representation quality on held-out labeled data, providing a standardized metric for transferability across benchmarks.
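A batch version of the InfoNCE loss, with positives on the diagonal of the similarity matrix, can be sketched in NumPy as follows; the temperature value and normalization details are illustrative.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE over a batch: row i of z1 and row i of z2 form the positive
    pair; all other rows act as negatives. Inputs are l2-normalized."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / tau                     # cosine similarity / tau
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positives on the diagonal
```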

Applications in Text and Language

Self-supervised feature learning has revolutionized natural language processing by enabling models to derive rich representations from vast unlabeled text corpora, focusing on tasks that predict linguistic structures without explicit supervision. One foundational approach is the skip-gram model in Word2Vec, which learns word embeddings by maximizing the log probability of context words given a target word, formulated as \max \sum_{t=1}^{T} \sum_{-c \leq o \leq c, o \neq 0} \log P(w_{t+o} | w_t), where T is the sequence length and c is the context window size. This method produces dense vector representations that capture semantic similarities, such as "king" - "man" + "woman" ≈ "queen," facilitating downstream applications like semantic search. Transformer-based models advanced this paradigm with bidirectional contextual embeddings, exemplified by BERT's masked language modeling (MLM), where 15% of input tokens are randomly masked and the model predicts them based on surrounding context, alongside next sentence prediction to understand discourse relations. These pre-training objectives yield hierarchical features, starting from token-level embeddings and building to sentence-level representations through multi-layer self-attention mechanisms, enabling nuanced understanding of syntax and semantics. The GPT series, conversely, employs causal language modeling, training on next-token prediction in a unidirectional manner to generate coherent sequences, with models like GPT-3 scaling to 175 billion parameters for emergent zero-shot capabilities in tasks such as translation and question answering. By November 2025, GPT-5 further enhances zero-shot feature extraction through integrated reasoning modules, achieving superior performance on benchmarks without task-specific fine-tuning. These learned features transfer effectively to downstream applications, including sentiment analysis, where fine-tuned BERT variants significantly outperform traditional methods on datasets like SST-2, capturing subtle emotional nuances in reviews. In machine translation, contextual embeddings from models like mBERT—pre-trained on 104 languages—improve cross-lingual transfer, enabling zero-shot translation between unseen language pairs with scores up to 15 points higher than non-contextual baselines. Multilingual extensions like mBERT democratize access to high-quality features for low-resource languages, supporting applications in global sentiment monitoring and translation services. Despite these advances, self-supervised language models face significant challenges, including high computational costs for pre-training on terabyte-scale corpora, often requiring thousands of GPU-hours that limit accessibility for smaller research groups. Additionally, biases inherent in pre-training data—such as gender or racial stereotypes amplified during training—persist in embeddings, necessitating mitigation techniques like debiasing to ensure fairer downstream applications.
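For a toy vocabulary, the skip-gram objective for one (center, context) pair can be sketched with a full softmax, an illustrative simplification since practical implementations rely on negative sampling or hierarchical softmax.

```python
import numpy as np

def skipgram_loss(center_id, context_id, W_in, W_out):
    """Negative log P(context | center) under a full softmax.
    W_in (V x d) holds input embeddings; W_out (V x d) output embeddings."""
    v = W_in[center_id]                 # embedding of the center word
    scores = W_out @ v                  # one score per vocabulary word
    scores -= scores.max()              # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[context_id]
```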

Applications in Images and Vision

Self-supervised feature learning has revolutionized computer vision by enabling the extraction of robust image representations from vast unlabeled datasets, leveraging pretext tasks and data augmentations to foster invariance to transformations like cropping, color distortion, and geometric changes. In this domain, contrastive methods such as SimCLR treat augmented views of the same image as positive pairs while contrasting them against others, achieving strong linear evaluation accuracies on ImageNet by simplifying prior frameworks to focus on large-batch training and strong augmentations without architectural complexities like memory banks. Similarly, non-contrastive approaches like DINO employ self-distillation in a teacher-student setup on Vision Transformers (ViTs), where the student predicts the teacher's output on augmented crops without relying on negative samples, leading to emergent properties such as semantic segmentation structure in self-attention maps and superior transfer to downstream tasks. These methods learn features that capture spatial hierarchies and object-centric structures, outperforming supervised pretraining when fine-tuned on limited labeled data for classification and segmentation. Pretext tasks further enhance feature invariance in vision; for instance, rotation prediction trains networks to classify the angle (e.g., 0°, 90°, 180°, or 270°) an image has been rotated, encouraging the model to discern orientation-invariant semantics without labels. To prevent representational collapse in Siamese architectures, SimSiam introduces a stop-gradient operation on one branch during similarity maximization between augmented views, allowing simple predictors to yield representations competitive with contrastive baselines on classification. Vision Transformers, pretrained self-supervised via methods like DINO or masked image modeling, treat images as sequences of patch embeddings, yielding features that excel in capturing global dependencies and local details, with self-supervised ViTs often surpassing convolutional counterparts in transfer scenarios. Recent advancements by 2025 integrate diffusion models into self-supervised pipelines, using denoising as a pretext task to generate intermediate representations that support few-shot object detection and scene understanding with minimal annotations. These diffusion-based approaches enable efficient deployment on edge devices by distilling encoders from generative priors, reducing computational overhead while maintaining accuracy on downstream tasks. In downstream applications, self-supervised features pretrained on large corpora like ImageNet significantly boost object detection on COCO and semantic segmentation on ADE20K, particularly in low-data regimes where they outperform fully supervised models trained from scratch—for example, achieving up to 6 points higher mIoU with only 1% labeled data. Evaluation often employs k-NN classification on frozen features, where top-1 accuracies exceeding 70% on benchmarks signal the quality of learned representations for transfer.

Applications in Graphs, Video, Audio, and Multimodal Data

Self-supervised feature learning has been extended to graph-structured data through contrastive approaches that generate node- or graph-level representations without labels. Graph Contrastive Learning (GraphCL), introduced in 2020, employs data augmentations such as subgraph sampling and attribute masking to create positive pairs from the same graph, while contrasting them against negatives from other graphs, using a graph neural network encoder and InfoNCE loss to maximize agreement between augmented views. This method outperforms prior unsupervised baselines on classification tasks on benchmark datasets like Cora, achieving up to 5% accuracy gains in semi-supervised settings. Complementing node-level methods, InfoGraph (2020) focuses on graph-level embeddings by maximizing mutual information between a graph and its substructures, such as nodes, edges, and subgraphs, via a variational bound, enabling effective representation for tasks like graph classification on molecular datasets. In video data, self-supervised learning leverages temporal dynamics through contrastive objectives that align features across frames or clips. Temporal Contrastive Learning for Video Representation (TCLR), proposed in 2021, introduces a framework with instance-level and temporal contrastive losses to enforce variation within video instances over time, without relying on explicit pretext tasks like rotation prediction, and demonstrates superior transfer to action recognition on Kinetics-400, improving top-1 accuracy by 2-3% over prior methods. Motion prediction serves as another key pretext task, where models forecast future frame displacements or optical flow from past frames, as explored in works decoupling motion from static context to learn spatiotemporal features transferable to downstream video understanding. Modality-specific augmentations, such as temporal cropping or frame shuffling, are crucial for generating robust positives in these approaches. For audio signals, contrastive methods adapt to sequential spectrograms or raw waveforms by predicting latent representations. Contrastive Predictive Coding (CPC), from 2018, trains encoders to predict future samples in a latent space using an autoregressive model and noise-contrastive estimation on audio sequences, yielding representations that rival supervised features for speech tasks like phoneme recognition on LibriSpeech. Building on this, wav2vec 2.0 (2020) applies masked prediction to raw audio, where a convolutional encoder contextualizes masked latent vectors, and a contrastive loss distinguishes true targets from distractors, achieving word error rates competitive with supervised models after pre-training on 960 hours of unlabeled data. These techniques highlight the role of temporal augmentations, like time masking or noise addition, in capturing phonetic and prosodic structures. Multimodal self-supervised learning aligns representations across modalities, such as vision and language, using joint contrastive objectives. CLIP (2021) trains separate encoders for images and text on 400 million pairs via a contrastive loss that maximizes similarity for matching pairs while minimizing it for non-matches, enabling zero-shot transfer to image classification on ImageNet with 76% top-1 accuracy using a vision transformer. Extending to unified architectures, Flamingo (2022) integrates frozen vision and language models with cross-attention layers for few-shot learning on visual tasks, processing interleaved image-text inputs to generate representations adaptable to multimodal tasks like captioning. Cross-modal alignment remains central, often via shared embedding spaces or bidirectional contrastive losses.
Across these domains, common strategies include modality-tailored augmentations—such as edge dropping in graphs to preserve semantics while introducing variability—and cross-modal objectives that bridge disparate data types for richer representations. However, challenges persist, including scalability for large-scale graphs and videos, where memory-intensive augmentations limit training on datasets exceeding millions of nodes or hours of footage, and precise alignment in multimodal settings, where distribution shifts between modalities can degrade transfer performance.

Deep and Multilayer Architectures

Restricted Boltzmann Machines

Restricted Boltzmann machines (RBMs) are probabilistic graphical models consisting of a bipartite graph with visible units \mathbf{v} representing input data and hidden units \mathbf{h} capturing latent features, where connections exist only between visible and hidden layers, with no intra-layer connections. This restricted connectivity simplifies inference and learning compared to fully connected Boltzmann machines. The joint probability distribution over visible and hidden units is defined by an energy function E(\mathbf{v}, \mathbf{h}) = -\mathbf{b}^T \mathbf{v} - \mathbf{c}^T \mathbf{h} - \mathbf{v}^T \mathbf{W} \mathbf{h}, where \mathbf{b} and \mathbf{c} are biases for visible and hidden units, respectively, and \mathbf{W} is the weight matrix between layers. The probability of a visible vector is given by P(\mathbf{v}) = \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}}{Z}, where Z is the partition function summing over all possible configurations. RBMs typically assume binary units, with visible units often modeled as Bernoulli or Gaussian distributions to handle binary or real-valued data. Training RBMs involves maximizing the log-likelihood of the data, approximated via contrastive divergence (CD-k), an efficient method using k steps of Gibbs sampling to estimate gradients without computing the intractable partition function. In CD-k, positive phase updates use data-driven expectations, while the negative phase uses model-generated "fantasy" particles from approximate sampling, adjusting weights to make P(\mathbf{v}) approximate the empirical data distribution; the conditional distributions factorize as \prod_i P(v_i \mid \mathbf{h}) thanks to the bipartite structure. This approach enables unsupervised learning of features by minimizing the Kullback-Leibler divergence between data and model distributions. In feature learning, hidden unit activations serve as compact, distributed representations of input data, capturing higher-order correlations. RBMs can be stacked layer-wise: the hidden layer of one RBM becomes the visible layer for the next, enabling greedy pre-training of deep networks; the top two layers form an associative memory, with lower layers trained as directed belief networks, forming deep belief networks (DBNs) as introduced by Hinton et al. in 2006. This layer-wise procedure initializes weights effectively for subsequent fine-tuning, addressing vanishing gradient issues in deep architectures. Applications of RBMs include dimensionality reduction, where stacked RBMs compress high-dimensional data into low-dimensional codes outperforming PCA on datasets like MNIST, achieving better reconstruction error. In collaborative filtering, RBMs model user ratings as visible units, learning personalized recommendations; for instance, on the Netflix Prize dataset, they yielded errors around 0.90, competitive with matrix factorization methods. The binary nature of hidden features suits sparse, combinatorial data representations. Inference in RBMs is tractable due to the bipartite structure: the activation probability for hidden unit i is h_i = \sigma(c_i + \mathbf{v}^T \mathbf{W}_{:i}), where \sigma is the logistic sigmoid function, allowing mean-field approximations or exact sampling. Reconstruction of visible units from hidden activations follows symmetrically: v_j' = \sigma(b_j + \mathbf{h}^T \mathbf{W}_{j:}). Limitations of RBMs include their restriction to binary units, limiting applicability to continuous data without extensions like Gaussian RBMs, and slow training due to Gibbs sampling mixing times, though mitigated by persistent contrastive divergence (PCD), which reuses Markov chains across updates for faster convergence.
RBMs serve as a generative, probabilistic counterpart to deterministic methods like autoencoders.
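A single CD-1 update for a binary RBM can be sketched in NumPy as follows, following the energy function and conditional probabilities above; the learning rate and the use of probabilities rather than samples in the negative phase are common but illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.05, rng=None):
    """One CD-1 update for a binary RBM: positive statistics from the
    data, negative statistics from a single Gibbs reconstruction."""
    if rng is None:
        rng = np.random.default_rng(0)
    p_h0 = sigmoid(v0 @ W + c)                    # P(h=1 | v0)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    p_v1 = sigmoid(h0 @ W.T + b)                  # reconstruction P(v=1 | h0)
    p_h1 = sigmoid(p_v1 @ W + c)                  # P(h=1 | v1)
    batch = len(v0)
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / batch
    b += lr * (v0 - p_v1).mean(axis=0)
    c += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b, c
```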

Autoencoders

Autoencoders are a class of neural networks designed to learn efficient data representations by compressing inputs into a lower-dimensional latent space and then reconstructing the original input from this representation. The architecture consists of an encoder function f(x) that maps the input x to a latent code z, followed by a decoder function g(z) that reconstructs the output \hat{x} = g(f(x)). Training minimizes the reconstruction loss, typically the squared error \|x - g(f(x))\|^2, which encourages the network to capture essential features while discarding noise or redundancies. This bottleneck structure in the latent space z enforces compression, making autoencoders useful for feature extraction in high-dimensional data. Several variants extend the basic architecture to improve robustness and generative capabilities. Denoising autoencoders introduce noise to the input during training, such as Gaussian perturbations or masking, and optimize the reconstruction of the clean input, thereby learning features invariant to corruptions. Variational autoencoders (VAEs) incorporate probabilistic modeling by treating the encoder as an approximate posterior q(z|x) over the latent variables and the decoder as the likelihood p(x|z), with a prior p(z) often assumed Gaussian. They maximize the evidence lower bound (ELBO): \mathcal{L} = \mathbb{E}_{q(z|x)} [\log p(x|z)] - D_{KL}(q(z|x) \| p(z)), where the first term promotes reconstruction fidelity and the second enforces alignment with the prior via Kullback-Leibler divergence. Sparse autoencoders promote sparsity in the latent representations by adding an L1 penalty on the activations of the hidden units, encouraging only a few neurons to activate for any input and yielding more interpretable, distributed features. Autoencoders are trained using stochastic gradient descent (SGD) on the reconstruction loss, with their popularity surging in the 2010s following advances in deep learning that enabled effective optimization of deeper architectures. The latent representations z serve as compact features for downstream tasks, outperforming traditional methods like PCA on nonlinear data manifolds. In applications, autoencoders excel in anomaly detection, where high reconstruction errors flag outliers as deviations from learned normal patterns. They also facilitate pre-training for supervised tasks, initializing classifiers with robust features to improve performance on limited labeled data.
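A single gradient step for a one-layer ReLU autoencoder on the squared reconstruction loss can be sketched in NumPy as follows; real autoencoders stack several nonlinear layers, so this shallow form is an illustrative simplification.

```python
import numpy as np

def autoencoder_step(X, W_enc, W_dec, lr=1e-2):
    """One gradient step on 0.5 * ||x - g(f(x))||^2 averaged over the
    batch; Z holds the latent features used for downstream tasks."""
    Z_pre = X @ W_enc
    Z = np.maximum(Z_pre, 0.0)                     # encoder f(x) with ReLU
    X_hat = Z @ W_dec                              # decoder g(z)
    err = X_hat - X
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ ((err @ W_dec.T) * (Z_pre > 0)) / len(X)
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec
    return loss, Z
```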

Modern Deep Networks for Feature Extraction

Modern deep networks have revolutionized feature extraction by enabling the learning of complex, hierarchical representations from raw data, surpassing earlier architectures through innovations in attention mechanisms and residual connections. Transformers, introduced in 2017, utilize self-attention to capture global dependencies across input sequences, producing contextual embeddings that serve as rich features for downstream tasks such as natural language processing and beyond. These embeddings allow models to weigh the importance of different parts of the input dynamically, facilitating the extraction of relational features without relying on recurrent structures. Evolutions in convolutional neural networks (CNNs) have further advanced feature learning by addressing challenges in training very deep architectures. ResNet, proposed in 2015, incorporates residual blocks that enable the construction of networks with hundreds of layers, where shortcut connections mitigate vanishing gradients and allow the learning of residual functions, yielding robust hierarchical feature maps for image recognition. Building on this, EfficientNet in 2019 introduced compound scaling to balance network depth, width, and resolution, achieving state-of-the-art performance on benchmarks like ImageNet with significantly fewer parameters, thus extracting efficient, transferable features for vision tasks. Hybrid architectures combine the strengths of CNNs and transformers to leverage local and global dependencies. The Vision Transformer (ViT), from 2020, treats images as sequences of patches embedded into transformer inputs, enabling end-to-end learning of spatial features that rival or exceed CNNs on large-scale datasets when pre-trained appropriately. As of 2025, emerging trends emphasize generative and relational learning within deep networks. Diffusion models have gained prominence for representation learning through iterative denoising processes, learning balanced representations that capture data manifolds without overfitting to superficial patterns, as demonstrated in recent analyses of their learning dynamics. Graph neural networks, such as Graph Convolutional Networks (GCNs) introduced in 2016, extend feature learning to relational data by propagating information across graph structures, producing embeddings that encode neighborhood dependencies for tasks like semi-supervised node classification. Training these networks typically involves self-supervised pre-training on large unlabeled datasets to learn general representations, followed by fine-tuning on specific tasks, with intermediate layer activations serving as versatile feature maps for transfer. This paradigm enhances feature quality across domains, from vision to graphs. These architectures offer key advantages, including hierarchical feature building that captures multi-scale abstractions, scalability to massive datasets via parallelizable components, and adaptability to diverse data types like images, text, and graphs, driving breakthroughs in representation learning.
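The self-attention core that produces these contextual embeddings can be sketched in NumPy as scaled dot-product attention; this omits the learned query/key/value projections and multi-head structure of full transformers.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted mix of value vectors, with weights
    derived from query-key similarity: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # contextual embeddings
```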

Dynamic Feature Learning

Representations for Sequential Data

Representations for sequential data in feature learning involve models that process time-series or ordered inputs by maintaining evolving hidden states to capture temporal dependencies. The core principle is to learn a sequence of hidden states h_t at each time step t, updated via h_t = f(h_{t-1}, x_t), where x_t is the input at time t and f is a nonlinear function, typically parameterized by weights in a neural network; these hidden states serve as dynamic features representing the evolving context of the sequence. Recurrent neural networks (RNNs), introduced in the 1980s, form the foundational architecture for this approach, enabling the network to retain information from previous time steps through recurrent connections. A key advancement came with long short-term memory (LSTM) units in 1997, which address limitations in standard RNNs by incorporating gating mechanisms to regulate information flow. In LSTMs, the forget gate is computed as f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), the input gate as i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), and the output gate similarly, where \sigma is the sigmoid function, W are weights, and b are biases. The cell state evolves as c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, and the hidden state is h_t = o_t \odot \tanh(c_t), with \odot denoting element-wise multiplication and \tilde{c}_t the candidate cell update. A variant, the gated recurrent unit (GRU) introduced in 2014, simplifies the LSTM by merging the cell and hidden states into a single unit with fewer gates—an update gate and a reset gate—reducing parameter count while maintaining comparable performance on sequential tasks. Bidirectional RNNs extend these architectures by processing sequences in both forward and backward directions, concatenating hidden states from each pass to provide fuller contextual representations for tasks requiring global sequence information. These representations find applications in domains like speech recognition, where deep RNNs have achieved state-of-the-art performance by modeling acoustic sequences as evolving features for phonetic decoding, and stock price prediction, where LSTM-based models capture temporal patterns in historical prices to forecast future trends. In sequence-to-sequence (seq2seq) frameworks, the encoder's hidden states serve as contextual features for the decoder, enabling tasks such as machine translation. By the mid-2020s, transformer-based architectures, leveraging self-attention mechanisms, have increasingly supplanted RNNs for sequential data, enabling parallel processing and better handling of long-range dependencies in tasks like language modeling and time-series forecasting. A primary challenge in training these models is the vanishing gradient problem, where gradients diminish exponentially over long sequences during backpropagation through time, hindering learning of long-term dependencies; LSTMs mitigate this by using constant error carousels in cell states to propagate gradients more stably.
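The gate equations above translate directly into a single LSTM step; in this sketch, P is an assumed dictionary of weight matrices and biases acting on the concatenated [h_{t-1}, x_t] vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, P):
    """Single LSTM step following the gate equations in the text."""
    z = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    f = sigmoid(P["W_f"] @ z + P["b_f"])          # forget gate
    i = sigmoid(P["W_i"] @ z + P["b_i"])          # input gate
    o = sigmoid(P["W_o"] @ z + P["b_o"])          # output gate
    c_tilde = np.tanh(P["W_c"] @ z + P["b_c"])    # candidate cell update
    c_t = f * c_prev + i * c_tilde                # cell state
    h_t = o * np.tanh(c_t)                        # hidden state = feature
    return h_t, c_t
```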

Methods for Evolving and Temporal Structures

In methods for evolving and temporal structures, feature learning focuses on generating dynamic embeddings e_t(v) for nodes v or edges at discrete time steps t, which capture the structural evolution of networks through temporal transitions between snapshots. These representations adapt to changes in graph topology, such as node additions, edge formations, or attribute updates, enabling models to capture relational dynamics without recomputing static features from scratch. The core principle involves propagating information across time-aware neighborhoods, where embeddings evolve by integrating historical states with current structural deltas, often leveraging recurrent mechanisms to maintain continuity.

Key algorithms in this domain include dynamic graph neural networks (GNNs), which extend static GNNs to handle temporal changes. EvolveGCN, introduced in 2019, treats a dynamic graph as a sequence of snapshots and evolves the GCN parameters over time using recurrent neural networks (RNNs), allowing efficient updates without full retraining. Building on this, TGAT (Temporal Graph Attention Network) from 2020 incorporates time-aware attention mechanisms to weigh historical interactions based on their recency and relevance, producing inductive embeddings suitable for unseen nodes in evolving graphs. These methods significantly outperform static baselines on dynamic tasks in interaction and social networks, as they explicitly model temporal dependencies in aggregation.

For continuous-time settings, temporal point processes model event sequences in graphs, where features represent intensity functions that govern interaction rates. The Hawkes process, a self-exciting model, defines the conditional intensity as \lambda(t) = \mu + \sum_{t_i < t} \alpha \exp(-\beta (t - t_i)), with baseline rate \mu, excitation amplitude \alpha, and decay rate \beta, capturing how past events influence future ones in networked systems. Graph Hawkes Neural Networks extend this by parameterizing the intensity function via GNNs over the graph structure, enabling feature learning for temporal knowledge graphs and improving accuracy on forecasting tasks.

Applications span domains with evolving relational data. In social network analysis, dynamic GNNs track user interactions for community detection and influence propagation, as demonstrated in temporal citation networks where embeddings evolve to predict links. Traffic forecasting benefits from these methods by modeling road networks as time-varying graphs, with models like temporal GNNs achieving lower errors in short-term flow predictions compared to autoregressive baselines. As of 2025, real-time fraud detection in transaction graphs has adopted dynamic GNN frameworks, such as FinGuard-GNN, which updates embeddings incrementally to detect evolving patterns in financial activity.

Embedding updates typically follow a recurrent formulation, such as e_t = \text{RNN}(e_{t-1}, \Delta A_t), where \Delta A_t denotes the adjacency change at time t, optionally incorporating additional sequential RNNs for evolving attributes. This allows scalable processing of large-scale temporal graphs, with memory embeddings in models like Temporal Graph Networks (TGNs) significantly reducing computational overhead through efficient summaries of interaction histories.

Evaluation emphasizes tasks like dynamic link prediction over varying time horizons, assessing how well models forecast future edges given historical snapshots. Metrics such as area under the curve (AUC) and mean reciprocal rank (MRR) are computed across short (e.g., next timestep) and long (e.g., 10-step) horizons, revealing that time-aware methods like TGAT often outperform snapshot-based approaches on datasets like Wikipedia edits or MOOC interactions.
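For concreteness, the following NumPy sketch evaluates the Hawkes conditional intensity defined above for a toy event history; the parameter values and event times are illustrative assumptions rather than fitted quantities, and a Graph Hawkes Neural Network would replace the fixed \mu, \alpha, \beta with GNN-parameterized functions of the graph state.

```python
import numpy as np

def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity of a self-exciting Hawkes process.

    Implements lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i)).

    t           : time at which to evaluate the intensity
    event_times : array of past event timestamps
    mu          : baseline rate
    alpha       : excitation amplitude contributed by each past event
    beta        : exponential decay rate of the excitation
    """
    past = event_times[event_times < t]
    return mu + np.sum(alpha * np.exp(-beta * (t - past)))

# Toy usage: three past interactions on one edge of a temporal graph
events = np.array([0.5, 1.2, 2.0])
for t in (1.0, 2.1, 5.0):
    print(t, hawkes_intensity(t, events, mu=0.1, alpha=0.8, beta=1.5))
# The intensity spikes just after each event and decays back toward mu,
# mirroring how recent interactions raise the rate of future ones.
```

The same pattern of consuming recent history also underlies the recurrent embedding update e_t = \text{RNN}(e_{t-1}, \Delta A_t) mentioned above: in both cases the feature at time t is a decayed or gated summary of past structural events.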

References

  1. [1]
    [PDF] Representation Learning: A Review and New Perspectives - arXiv
    Apr 23, 2014 · In Bengio and LeCun (2007), one of us introduced the notion of AI-tasks, which are challenging for current machine learning algorithms, and ...
  2. [2]
    [PDF] Understanding and Improving Feature Learning for Out-of ...
    Understanding feature learning in neural networks is crucial to understanding how they generalize to different data distributions [2, 11, 12, 62, 67, 70]. Deep ...
  3. [3]
    Representation Learning: A Review and New Perspectives - arXiv
    Jun 24, 2012 · This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, ...
  4. [4]
    Representation Learning. Chapter 15 of the Deep Learning Book (Goodfellow, Bengio, Courville).
  5. [5]
    Feature Learning in CNNs. Deep Learning Book, Convolutional Networks chapter: https://www.deeplearningbook.org/contents/convnets.html
  6. [6]
    [PDF] Pearson, K. 1901. On lines and planes of closest fit to systems of ...
    Pearson, K. 1901. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2:559-572. http://pbil.univ-lyon1.fr/R/pearson1901.
  7. [7]
    [PDF] Independent Component Analysis - Computer Science
    Table of contents excerpt: Independent component analysis (definition, applications, how to find the independent components), history of ICA.
  8. [8]
    [PDF] Emergence of simple-cell receptive field properties by learning a ...
    We show that a learning algorithm that attempts to find sparse linear codes for natural scenes will develop a complete family of localized, oriented, bandpass ...
  9. [9]
    [PDF] K-SVD: An Algorithm for Designing Overcomplete Dictionaries for ...
    In this paper, we present a novel algorithm for adapting dictionaries so as to represent signals sparsely. Given a set of training signals, we seek the ...
  10. [10]
    [PDF] A Fast Learning Algorithm for Deep Belief Nets
    We show how to use “complementary priors” to eliminate the explaining-away effects that make inference difficult in densely connected belief nets.
  11. [11]
    [PDF] ImageNet Classification with Deep Convolutional Neural Networks
    The specific contributions of this paper are as follows: we trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in ...
  12. [12]
    Efficient Estimation of Word Representations in Vector Space - arXiv
    Jan 16, 2013 · We propose two novel model architectures for computing continuous vector representations of words from very large data sets.
  13. [13]
    [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers ...
    Oct 11, 2018 · BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
  14. [14]
    A Simple Framework for Contrastive Learning of Visual ... - arXiv
    Feb 13, 2020 · This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised ...
  15. [15]
    Learning Transferable Visual Models From Natural Language ...
    Feb 26, 2021 · Learning Transferable Visual Models From Natural Language Supervision. Authors:Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, ...
  16. [16]
    [2005.14165] Language Models are Few-Shot Learners - arXiv
    May 28, 2020 · GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks ...
  17. [17]
    Deep learning | Nature
    Published: 27 May 2015. Deep learning. Yann LeCun, Yoshua Bengio & Geoffrey Hinton. Nature volume 521, pages 436–444 (2015).
  18. [18]
    Deep learning: Historical overview from inception to actualization ...
    This study aims to provide a historical narrative of deep learning, tracing its origins from the cybernetic era to its current state-of-the-art status.
  19. [19]
    Learning representations by back-propagating errors - Nature
    Oct 9, 1986 · We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in ...
  20. [20]
    [0809.3083] Supervised Dictionary Learning - arXiv
    Sep 18, 2008 · This paper proposes a new step in that direction, with a novel sparse representation for signals belonging to different classes in terms of a shared dictionary.
  21. [21]
    ImageNet Classification with Deep Convolutional Neural Networks
    Authors. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. Abstract. We trained a large, deep convolutional neural network to classify the 1.3 million ...
  22. [22]
    [PDF] Handwritten Digit Recognition with a Back-Propagation Network
    The main point of this paper is to show that large back-propagation (BP) networks can be applied to real image-recognition problems without a large, complex ...
  23. [23]
    [PDF] Rectified Linear Units Improve Restricted Boltzmann Machines
    Rectified linear units (RLUs) improve RBMs by learning better features for object recognition and face verification, and preserving relative intensities unlike ...
  24. [24]
    How transferable are features in deep neural networks? - arXiv
    Nov 6, 2014 · How transferable are features in deep neural networks?, by Jason Yosinski and 3 other authors.
  25. [25]
    LoRA: Low-Rank Adaptation of Large Language Models - arXiv
    Jun 17, 2021 · We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the ...
  26. [26]
    [PDF] Learning Feature Representations with K-means
    More recently, we have found that using K-means clustering as the unsupervised learning module in these types of “feature learning” pipelines can lead to excellent ...
  27. [27]
    An Analysis of Single-Layer Networks in Unsupervised Feature ...
    We will apply several off-the-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to CIFAR-10, ...
  28. [28]
    An efficient k-means clustering algorithm: analysis and implementation
    A popular heuristic for k-means clustering is Lloyd's (1982) algorithm. We present a simple and efficient implementation of Lloyd's k-means clustering algorithm ...
  29. [29]
    [PDF] On Spectral Clustering: Analysis and an algorithm Andrew Y. Ng CS ...
    In this paper, we present a simple spectral clustering algorithm that can be ... Here, we build upon the recent work of Weiss [11] and Meila and Shi [6], who.
  30. [30]
    A Review on Analysis of K-Means Clustering Machine Learning ...
    Apr 15, 2024 · The objective of writing the paper is how K-Means clustering algorithm is applied on the model dataset based on unsupervised learning. We used ...
  31. [31]
    [PDF] Video Google: A Text Retrieval Approach to Object Matching in Videos
    Building a visual vocabulary. The objective here is to vector quantize the descriptors into clusters which will be the visual 'words' for text retrieval.
  32. [32]
    [PDF] a graphical aid to the interpretation and validation of cluster analysis
    Silhouettes of an example where eight points are divided over two very tight clusters, for k = 2.
  33. [33]
    Unsupervised Deep Embedding for Clustering Analysis - arXiv
    Nov 19, 2015 · In this paper, we propose Deep Embedded Clustering (DEC), a method that simultaneously learns feature representations and cluster assignments using deep neural ...
  34. [34]
    Principal component analysis: a review and recent developments
    Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing ...
  35. [35]
    [PDF] Think Globally, Fit Locally: Unsupervised Learning of Low ...
    Here we describe locally linear embedding (LLE), an unsupervised learning algorithm ... In previous work (Roweis and Saul, 2000), for example, we applied LLE.
  36. [36]
    A Guide to Principal Component Analysis (PCA) for Machine Learning
    What are the assumptions and limitations of PCA? · PCA assumes a correlation between features. · PCA is sensitive to the scale of the features. · PCA is not robust ...
  37. [37]
    [PDF] Visualizing Data using t-SNE - Journal of Machine Learning Research
    We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map.
  38. [38]
    Principal Component Analysis (PCA): Explained Step-by-Step | Built In
    Principal component analysis was first introduced by Karl Pearson in 1901 as a method for identifying the principal axes of variation in multidimensional data, ...
  39. [39]
    [PDF] Independent Component Analysis: Algorithms and Applications
    For details, see (Hyvärinen, 1999b). In FastICA, convergence speed is optimized by the choice of the matrices diag(αi) and diag(βi). Another advantage of ...
  40. [40]
    [PDF] Independent Component Analysis: A Tutorial
    The FastICA algorithm and the underlying contrast functions have a number of desirable properties when compared with existing methods for ICA. 1. The ...
  41. [41]
    [PDF] An information-maximisation approach to blind separation and blind ...
    A brief report of this research appears in Bell & Sejnowski (1995). 2 Information maximisation. The basic problem tackled here is how to maximise the mutual ...
  42. [42]
  43. [43]
    Learning the parts of objects by non-negative matrix factorization
    Oct 21, 1999 · Here we demonstrate an algorithm for non-negative matrix factorization that is able to learn parts of faces and semantic features of text.
  44. [44]
    [PDF] Learning with Local and Global Consistency - NIPS papers
    We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference.
  45. [45]
    [PDF] Semi-Supervised Learning Using Gaussian Fields and Harmonic ...
    In this paper we introduce a new approach to semi-supervised learning that is based on a random field model defined on a weighted graph over the unlabeled and.
  46. [46]
    [PDF] Laplacian Eigenmaps for Dimensionality Reduction and Data ...
    To simplify the analysis, the neighboring points (x_ij's) are assumed to lie on a locally linear patch on the manifold.
  47. [47]
    [PDF] Semi-Supervised Learning with Graphs - cs.wisc.edu
    We present a series of novel semi-supervised learning approaches arising from a graph representation, where labeled and unlabeled instances are represented as.
  48. [48]
    Dynamic graph structure evolution for node classification with ...
    Jul 16, 2025 · This paper proposes the evolving graph structure (EGS) framework for semi-supervised node classification with missing attributes.
  49. [49]
    Semi-Supervised Learning with Deep Generative Models - arXiv
    This paper revisits semi-supervised learning with generative models, using deep generative models and variational methods to improve generalization from small ...
  50. [50]
    Conditional Image Synthesis With Auxiliary Classifier GANs - arXiv
    Oct 30, 2016 · This paper introduces new methods for training GANs for image synthesis, using label conditioning for 128x128 resolution images with global ...
  51. [51]
    [1507.02672] Semi-Supervised Learning with Ladder Networks - arXiv
    Jul 9, 2015 · Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the ...
  52. [52]
    [PDF] Semi Supervised Semantic Segmentation Using Generative ...
    This paper uses a GAN-based semi-supervised framework with a generator and classifier, using both labeled and unlabeled data, including fake images, to improve ...
  53. [53]
    Semi-Supervised Anomaly Detection Based on Deep Generative ...
    Jun 4, 2022 · We propose a novel semi-supervised anomaly detection approach based on deep generative models with Transformers for identifying unusual (abnormal) images from ...
  54. [54]
    [1603.08511] Colorful Image Colorization - arXiv
    Mar 28, 2016 · Moreover, we show that colorization can be a powerful pretext task for self-supervised feature learning, acting as a cross-channel encoder.
  55. [55]
    Unsupervised Learning of Visual Representations by Solving ... - arXiv
    Mar 30, 2016 · We build a convolutional neural network (CNN) that can be trained to solve Jigsaw puzzles as a pretext task, which requires no manual labeling.
  56. [56]
    Rethinking self-supervised learning for time series forecasting
    Dec 3, 2024 · In time series forecasting, masked modeling offers a distinct advantage by implicitly guiding the model to capture fine-grained temporal ...
  57. [57]
    Representation Learning with Contrastive Predictive Coding - arXiv
    We propose a universal unsupervised learning approach to extract useful representations from high-dimensional data, which we call Contrastive Predictive Coding.
  58. [58]
    Momentum Contrast for Unsupervised Visual Representation Learning
    Nov 13, 2019 · We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up,
  59. [59]
    Rethinking Evaluation Protocols of Visual Representations Learned ...
    Apr 7, 2023 · In this work, we try to figure out the cause of performance sensitivity by conducting extensive experiments with state-of-the-art SSL methods.
  60. [60]
    Introducing GPT-5 - OpenAI
    Aug 7, 2025 · Introducing GPT-5. Our smartest, fastest, most useful model yet, with built-in thinking that puts expert-level intelligence in everyone's hands.
  61. [61]
    BERT applications in natural language processing: a review
    Mar 15, 2025 · This review study examines the complex nature of BERT, including its structure, utilization in different NLP tasks, and the further development of its design ...
  62. [62]
    Self-Supervised Learning Principles Challenges and Emerging ...
    Feb 24, 2025 · This survey provides a comprehensive overview of self-supervised learning, covering its fundamental principles, major methodological approaches, ...
  63. [63]
    Contrastive Self-Supervised Learning of Graph Representations
    Jul 15, 2020 · Abstract:We propose Graph Contrastive Learning (GraphCL), a general framework for learning node representations in a self supervised manner.
  64. [64]
    InfoGraph: Unsupervised and Semi-supervised Graph-Level ... - arXiv
    Jul 31, 2019 · This paper studies learning the representations of whole graphs in both unsupervised and semi-supervised scenarios.
  65. [65]
    TCLR: Temporal Contrastive Learning for Video Representation
    Jan 20, 2021 · We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video ...
  66. [66]
    wav2vec: Unsupervised Pre-training for Speech Recognition - arXiv
    Apr 11, 2019 · We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio ...
  67. [67]
    [PDF] Boltzmann Machines - Computer Science
    Mar 25, 2007 · A Boltzmann Machine is a network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off ...
  68. [68]
    [PDF] Restricted Boltzmann Machines
    The main worry with CD is that there will be deep minima of the energy function far away from the data. To find these we need to run the Markov chain for ...
  69. [69]
    [PDF] Training Products of Experts by Minimizing Contrastive Divergence
    Mayraz and Hinton (in preparation) report good comparative results for the larger MNIST database at www.research.att.com/~yann/ocr/mnist and they were careful ...
  70. [70]
    Reducing the Dimensionality of Data with Neural Networks - Science
    We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much ...
  71. [71]
    [PDF] Restricted Boltzmann Machines for Collaborative Filtering
    Salakhutdinov, R., & Hinton, G. E. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. AI and Statistics. Srebro, N., & ...
  72. [72]
    [PDF] Using Fast Weights to Improve Persistent Contrastive Divergence
    (Tieleman, 2008) showed that, given a fixed amount of computation, restricted Boltzmann machines can learn better models using this “Persistent Contrastive.
  73. [73]
    [1312.6114] Auto-Encoding Variational Bayes - arXiv
    Dec 20, 2013 · Authors: Diederik P Kingma, Max Welling.
  74. [74]
    [PDF] Sparse autoencoder
    These notes describe the sparse autoencoder learning algorithm, which is one approach to automatically learn features from unlabeled data. In some domains, such ...
  75. [75]
    [1706.03762] Attention Is All You Need - arXiv
    Jun 12, 2017 · Attention Is All You Need, by Ashish Vaswani and 7 other authors.
  76. [76]
    [1512.03385] Deep Residual Learning for Image Recognition - arXiv
    Dec 10, 2015 · We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously.
  77. [77]
    EfficientNet: Rethinking Model Scaling for Convolutional Neural ...
    We propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient.
  78. [78]
    [2010.11929] An Image is Worth 16x16 Words: Transformers ... - arXiv
    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Authors:Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk ...
  79. [79]
    [2412.01021] On the Feature Learning in Diffusion Models - arXiv
    Dec 2, 2024 · Diffusion models learn balanced data representations due to denoising, unlike classification models which focus on easy-to-learn patterns.
  80. [80]
    Semi-Supervised Classification with Graph Convolutional Networks
    We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks.
  81. [81]
    [PDF] Self-Supervised Learning in Deep Networks - arXiv
    During the training process, we first pre-train the model with self- supervision to enable it to learn common feature expressions on a large amount of unlabeled ...
  82. [82]
    Finding Structure in Time - Elman - 1990 - Cognitive Science
    A set of simulations is reported which range from relatively simple problems (temporal version of XOR) to discovering syntactic/semantic features for words.
  83. [83]
    [PDF] 1990-elman.pdf - Gwern
    The current report develops a proposal along these lines first described by Jordan (1986) which involves the use of recurrent links in order to provide networks ...
  84. [84]
    [PDF] LONG SHORT-TERM MEMORY 1 INTRODUCTION
    See Schmidhuber and Hochreiter (1996) and Hochreiter and Schmidhuber (1996, 1997) for additional results in this vein. LSTM architecture. We use a 3-layer net ...
  85. [85]
    [PDF] Learning Phrase Representations using RNN Encoder–Decoder for ...
    In this paper, we propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes ...
  86. [86]
    Bidirectional recurrent neural networks | IEEE Journals & Magazine
    Abstract: In the first part of this paper, a regular recurrent neural network (RNN) is extended to a bidirectional recurrent neural network (BRNN).
  87. [87]
    Speech Recognition with Deep Recurrent Neural Networks - arXiv
    Mar 22, 2013 · Abstract:Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist ...
  88. [88]
    Stock Market Prediction Using LSTM Recurrent Neural Network
    This article aims to build a model using Recurrent Neural Networks (RNN) and especially Long-Short Term Memory model (LSTM) to predict future stock market ...
  89. [89]
    [PDF] the vanishing gradient problem during learning recurrent neural nets ...
    Bengio, Y., P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks, 5(2):157-166 (1994) ...
  90. [90]
    [PDF] Learning long-term dependencies with gradient descent is difficult
    Our only claim here is that discrete propagation of error offers interesting solutions to the vanishing gradient problem in recurrent network. Our ...
  91. [91]
    EvolveGCN: Evolving Graph Convolutional Networks for Dynamic ...
    Feb 26, 2019 · We propose EvolveGCN, which adapts the graph convolutional network (GCN) model along the temporal dimension without resorting to node embeddings.
  92. [92]
    [2002.07962] Inductive Representation Learning on Temporal Graphs
    Feb 19, 2020 · We propose the temporal graph attention (TGAT) layer to efficiently aggregate temporal-topological neighborhood features as well as to learn the time-feature ...
  93. [93]
    [PDF] Neural Temporal Point Processes: A Review - IJCAI
    Temporal point processes (TPP) are probabilistic generative models for continuous-time event sequences. Neural TPPs combine the fundamental ...
  94. [94]
    [PDF] Graph Hawkes Neural Network for Forecasting on Temporal ...
    The Hawkes process has become a standard method for modeling self-exciting event sequences with different event types. A recent work has generalized the Hawkes ...
  95. [95]
    Enhancement of traffic forecasting through graph neural network ...
    This study investigates information fusion methods for GNN-based traffic predictions, including their benefits and challenges.
  96. [96]
  97. [97]
    Temporal Graph Networks for Deep Learning on Dynamic Graphs
    Jun 18, 2020 · In this paper, we present Temporal Graph Networks (TGNs), a generic, efficient framework for deep learning on dynamic graphs represented as sequences of timed ...
  98. [98]
    [PDF] Towards Better Evaluation for Dynamic Link Prediction
    Nodes, edges, weights or attributes in a dynamic graph can be added, deleted or adjusted over time. Therefore, understanding and analyzing the temporal patterns ...