
Feature learning

Feature learning, also known as representation learning, is a core paradigm in machine learning that enables algorithms to automatically discover and extract useful representations from raw, high-dimensional data, thereby facilitating downstream tasks such as classification, clustering, and prediction without relying on manual feature engineering. This approach addresses the critical dependency of model performance on data representation, where traditional methods require labor-intensive hand-crafting of features tailored to specific tasks, often limiting scalability and generalization across domains. By contrast, feature learning algorithms learn hierarchical or disentangled representations that capture underlying explanatory factors of variation in the data, such as edges in images or semantic relationships in text, improving efficiency and effectiveness in handling complex, unstructured inputs like audio, video, and text. Key methods in feature learning encompass unsupervised techniques, including probabilistic models like restricted Boltzmann machines (RBMs) and sparse coding, which learn sparse, overcomplete representations to model data distributions; autoencoder variants, such as denoising and contractive autoencoders, that compress and reconstruct data to enforce robust encodings; and manifold learning approaches that assume data lies on low-dimensional manifolds embedded in high-dimensional spaces. Supervised and self-supervised paradigms extend these by incorporating labels or pretext tasks to guide representation quality, while deep architectures, often pre-trained layer-wise, enable end-to-end learning of multi-level features in neural networks. Feature learning has driven transformative advancements across applications, notably reducing speech recognition error rates by up to 30% through deep belief networks, achieving top performance on the ImageNet large-scale visual recognition challenge by dropping object classification errors from 26.1% to 15.3% with convolutional networks, and enabling distributed word representations in natural language processing that outperform traditional bag-of-words models. Ongoing research focuses on challenges like out-of-distribution generalization, where learned features must remain robust to shifts in data distributions, as explored in recent analyses of model behavior under spurious correlations. These developments underscore feature learning's role in advancing machine learning toward more autonomous and interpretable systems.

Fundamentals

Definition and Motivation

Feature learning, also known as representation learning, refers to a set of techniques in machine learning where algorithms automatically discover useful representations or transformations of raw data that make subsequent tasks easier to perform, without relying on manual feature engineering. These representations aim to disentangle the underlying factors of variation in the data, transforming high-dimensional inputs into more compact and informative forms that capture essential structures. The primary motivation for feature learning arises from the limitations of traditional manual feature design, which requires extensive domain expertise and struggles with the curse of dimensionality in high-dimensional data such as images, audio, or text. Manual engineering often fails to scale effectively, as crafting effective features by hand becomes impractical for complex, large-scale datasets where the input space grows exponentially, leading to sparse sampling and poor generalization. In contrast, feature learning enables end-to-end optimization, allowing models to adapt representations directly to the data's intrinsic properties and the demands of downstream tasks. Key benefits include enhanced generalization by focusing on invariant and transferable features, scalability to high-dimensional inputs through dimensionality reduction and hierarchical abstraction, and flexibility across diverse tasks like classification, regression, and clustering. The basic workflow involves feeding raw data into a learning algorithm, which iteratively refines hierarchical or task-specific features through optimization, and then applying these features to downstream models for prediction or analysis. For instance, in image processing, raw pixel intensities serve as inputs, whereas learned features might represent edges or textures that abstract low-level patterns into higher-level concepts useful for object recognition.

Historical Development

The origins of feature learning trace back to early 20th-century statistical methods for dimensionality reduction. Principal component analysis (PCA), introduced by Karl Pearson in 1901, provided a foundational technique for linear dimensionality reduction by identifying principal axes of variation in data, enabling the extraction of uncorrelated features from high-dimensional observations. In the 1990s, independent component analysis (ICA) emerged as a key advancement, with Pierre Comon's 1994 formulation defining it as a method to separate multivariate signals into statistically independent subcomponents, and Tony Bell and Terrence Sejnowski's 1995 infomax approach popularizing its application through information maximization principles. These techniques laid the groundwork for unsupervised feature extraction in statistical signal processing. The late 1990s and early 2000s saw the development of sparse coding models, which inspired modern dictionary learning. Bruno Olshausen and David Field's 1996 work demonstrated that learning sparse linear codes for natural images could yield receptive fields resembling those in the mammalian visual cortex, introducing the idea of overcomplete dictionaries for efficient signal representation. Autoencoders, first conceptualized in the 1980s by David Rumelhart, Geoffrey Hinton, and Ronald Williams as part of backpropagation-based representation learning, gained renewed traction in the 2000s for feature extraction. Dictionary learning advanced further with the K-SVD algorithm by Michal Aharon, Michael Elad, and Alfred Bruckstein in 2006, which iteratively optimizes sparse representations over overcomplete bases, enhancing applications in signal and image processing. A pivotal revival occurred in 2006 with Geoffrey Hinton's introduction of deep belief networks (DBNs), which used layer-wise unsupervised pretraining to address vanishing gradients in deep architectures, reigniting interest in hierarchical feature learning. The 2010s marked the explosion of deep learning paradigms, exemplified by Alex Krizhevsky, Ilya Sutskever, and Hinton's AlexNet in 2012, a convolutional neural network that achieved breakthrough performance on the ImageNet dataset through supervised feature hierarchies, catalyzing widespread adoption of end-to-end learning. Unsupervised and self-supervised methods surged, with Tomas Mikolov et al.'s Word2Vec in 2013 enabling dense vector representations of words via predictive training on text corpora. Jacob Devlin et al.'s BERT in 2018 extended this to bidirectional transformer-based pretraining, learning contextual features from masked language modeling. In vision, Ting Chen et al.'s SimCLR framework in 2020 simplified contrastive self-supervised learning, producing robust visual representations without labels. The integration of transformers facilitated scalable feature learning across modalities in the late 2010s and 2020s. Alec Radford et al.'s CLIP in 2021 aligned image and text features through contrastive pretraining on vast collections of image-text pairs, enabling zero-shot capabilities. Foundation models like OpenAI's GPT-3, detailed by Tom Brown et al. in 2020, demonstrated few-shot learning of versatile representations via autoregressive pretraining on internet-scale text. Subsequent iterations, including GPT-4 in 2023, emphasized efficient, transferable features for diverse tasks. In 2024, advancements continued with OpenAI's GPT-4o, which integrated native processing of text, audio, and images for more cohesive multimodal representations; Meta's Llama 3, scaling open-source models for efficient representation learning; and Google's Gemini 1.5, enhancing long-context feature extraction.
Key reviews, such as Yoshua Bengio, Aaron Courville, and Pascal Vincent's 2013 survey on representation learning and Yann LeCun, Bengio, and Hinton's 2015 overview of deep learning, synthesized these milestones, highlighting the shift toward hierarchical, data-driven feature extraction. As of November 2025, emphasis has grown on scalable, efficient representations in foundation models, supporting multimodal and self-supervised paradigms for real-world deployment.

Supervised Feature Learning

Supervised Dictionary Learning

Supervised dictionary learning extends the principles of dictionary learning by incorporating labeled data to guide the optimization process toward discriminative representations suitable for tasks such as classification. In this approach, a dictionary D—an overcomplete basis consisting of atoms—and sparse codes S are learned such that the input data X can be approximately reconstructed as X \approx D S, while simultaneously minimizing a supervised loss that leverages class labels to enhance discrimination between categories. This integration of labels ensures that the learned features are not only sparse and reconstructive but also task-oriented, improving performance in downstream supervised learning scenarios. The formulation typically involves minimizing a composite objective function that balances reconstruction fidelity, sparsity, and supervised task performance. A common objective is to minimize \|X - D S\|_2^2 + \lambda \|S\|_1 augmented with a supervised loss term, such as the hinge loss for integration with support vector machines (SVMs), which penalizes misclassifications based on the labels. More advanced variants directly incorporate the classification error into the dictionary update, as in the label-consistent K-SVD (LC-KSVD) algorithm proposed by Jiang et al. in 2011, which enforces label consistency by adding a term that aligns sparse codes with class labels through a linear transformation. Another seminal method, introduced by Mairal et al. in 2008, learns a shared overcomplete dictionary along with class-specific decision functions (linear or kernel-based) to enable discriminative sparse representations for classification. A key optimization problem in supervised dictionary learning can be expressed as: \arg\min_{D,S} \frac{1}{2} \|Y - f(DS)\|_2^2 + \alpha \|X - DS\|_2^2 + \beta \Omega(S), where Y represents the label matrix, f is a classifier function (e.g., linear or softmax), \alpha and \beta are regularization parameters, and \Omega(S) is a sparsity-inducing penalty such as the \ell_1-norm. This formulation, central to methods like LC-KSVD, alternates between updating the sparse codes S via pursuit algorithms (e.g., orthogonal matching pursuit) and refining the dictionary D to minimize both reconstruction and classification errors. The sparsity enforced in these methods promotes interpretability by selecting only a few relevant atoms per signal, allowing identification of discriminative features, and facilitates seamless integration with classifiers like SVMs for end-to-end optimization. In applications such as face recognition, labels guide the formation of class-specific atoms in the dictionary, leading to improved recognition accuracy; for instance, LC-KSVD has demonstrated superior performance over unsupervised counterparts on datasets like Extended Yale B, achieving 97.5% accuracy in controlled settings. Unlike unsupervised dictionary learning, which lacks label guidance and focuses solely on reconstruction, supervised variants explicitly optimize for task discrimination.
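The alternating structure of this optimization can be illustrated with a minimal NumPy sketch that couples an ISTA sparse-coding step with least-squares updates of the dictionary and a linear classifier on the codes; the function names, step sizes, and simplified updates are illustrative assumptions rather than the exact LC-KSVD or Mairal et al. procedures.

```python
import numpy as np

def soft_threshold(Z, t):
    """Proximal operator of the l1 norm."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def supervised_dictionary_learning(X, Y, k, lam=0.1, alpha=1.0, outer=30):
    """Alternating minimization over sparse codes S, dictionary D, and a
    linear classifier W on the codes (simplified, illustrative updates)."""
    n, m = X.shape
    c = Y.shape[0]
    rng = np.random.default_rng(0)
    D = rng.standard_normal((n, k))
    D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
    W = np.zeros((c, k))
    S = np.zeros((k, m))
    for _ in range(outer):
        # Sparse coding (ISTA) for fixed D, W: gradient of reconstruction
        # plus classification terms, then soft-thresholding for sparsity.
        L = np.linalg.norm(D.T @ D + alpha * W.T @ W, 2)  # Lipschitz bound
        for _ in range(20):
            grad = D.T @ (D @ S - X) + alpha * W.T @ (W @ S - Y)
            S = soft_threshold(S - grad / L, lam / L)
        # Least-squares updates of dictionary and classifier on fixed codes.
        S_pinv = np.linalg.pinv(S)
        D = X @ S_pinv
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)
        W = Y @ S_pinv
    return D, W, S
```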

Supervised Neural Networks

Supervised neural networks extract task-specific features from labeled data through an end-to-end optimization process driven by backpropagation, which computes gradients of the loss with respect to network weights and updates them to align intermediate representations with the provided labels. This mechanism enables hierarchical feature learning, where shallow layers detect low-level patterns such as edges and textures in input data, while deeper layers compose these into high-level abstractions like object parts or semantic concepts relevant to the task. As a linear precursor, supervised dictionary learning provided foundational ideas for sparse, class-discriminative representations that influenced the development of these more flexible neural models. Key architectures include multilayer perceptrons (MLPs), which process tabular data by stacking fully connected layers to learn non-linear transformations, and convolutional neural networks (CNNs), designed for spatial data like images through convolutional filters that capture local invariances. Pioneering examples are LeNet, introduced in 1989 for handwritten digit recognition using convolutional layers to extract digit-specific features from pixel inputs, and AlexNet from 2012, which scaled deeper CNNs with dropout and GPU acceleration to achieve breakthrough performance on large-scale image classification by learning robust visual hierarchies. Training involves minimizing a loss function, such as the cross-entropy loss defined as L = -\sum_{i} y_i \log(\hat{y}_i), where y represents the true label distribution and \hat{y} the predicted probabilities from softmax outputs, with learned features manifesting as the activations in hidden layers. Non-linearities, introduced via activation functions like ReLU (f(x) = \max(0, x)), allow networks to model complex mappings efficiently without vanishing gradients. These networks offer advantages in handling intricate data dependencies through their parametric, hierarchical structure, and support transfer learning by reusing pre-trained features from source tasks—such as convolutional bases pre-trained on ImageNet—to initialize models for related targets, often yielding substantial gains in performance and data efficiency. The field saw a major revival after 2010, fueled by GPU-enabled parallel computation that made training deep architectures feasible on massive datasets. By 2025, advancements like low-rank adaptation (LoRA) have enhanced efficient fine-tuning of large supervised models by injecting low-rank matrices into pre-trained weights, reducing computational overhead while preserving feature quality. For instance, in image classification tasks, a CNN iteratively refines features from raw pixels in early layers to discriminative semantic representations in later ones, enabling accurate predictions on complex visual inputs.
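A minimal NumPy sketch of the forward pass and cross-entropy loss illustrates where the learned features live (the hidden activations h), under the simplifying assumption of a single fully connected hidden layer; all names here are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    """One hidden layer: the activations h are the learned features
    that backpropagation shapes to fit the labels."""
    h = relu(X @ W1 + b1)                  # hidden features
    return h, softmax(h @ W2 + b2)         # class probabilities

def cross_entropy(probs, Y_onehot):
    """L = -sum_i y_i log(y_hat_i), averaged over the batch."""
    return -np.mean(np.sum(Y_onehot * np.log(probs + 1e-12), axis=1))
```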

Unsupervised Feature Learning

Clustering-Based Methods

Clustering-based methods partition data points into groups based on similarity, revealing inherent structures that can be used as features in downstream tasks. These features often take the form of cluster assignments, centroids, or probabilities of belonging to each cluster, enabling representation of data in a lower-dimensional or more interpretable space. For instance, hard assignments assign each point to a single cluster, while soft assignments use probabilities to capture uncertainty, allowing for more nuanced feature representations. A key example of soft clustering is the Gaussian Mixture Model (GMM), a probabilistic approach that models data as a mixture of Gaussian distributions. GMMs estimate parameters via the expectation-maximization (EM) algorithm, assigning points soft memberships based on posterior probabilities, which is useful for feature learning in density estimation and clustering tasks. A foundational approach is the K-means algorithm, which iteratively partitions data into K clusters to minimize intra-cluster variance. The process begins with random initialization of K centroids, followed by assignment of each data point to the nearest centroid based on Euclidean distance, and then updating each centroid as the mean of points in its cluster; this alternates until convergence. The objective function optimized is the within-cluster sum of squares: \min_{\mu_1, \dots, \mu_K} \sum_{k=1}^K \sum_{i \in C_k} \| x_i - \mu_k \|^2, where C_k denotes the set of points in cluster k and \mu_k is its centroid. In feature learning, the resulting centroids or assignment vectors serve as a dictionary for encoding new data, such as projecting inputs onto cluster centers to form sparse representations. Variants extend K-means to handle complex data structures. Spectral clustering leverages the eigenvectors of a similarity graph's Laplacian matrix to embed data in a lower-dimensional space before applying K-means, effectively capturing non-linear manifolds and non-convex clusters. Hierarchical clustering, in contrast, constructs a tree-like dendrogram by successively merging or splitting clusters, enabling multi-scale feature extraction where features at different resolutions represent coarse-to-fine groupings. These methods derive features by treating clusters as prototypes. In text analysis, documents are often represented as histograms over word clusters, extending the bag-of-words model to capture topic-like structures via cluster proportions. Similarly, in computer vision, local image descriptors are clustered to form a visual vocabulary, with images encoded as histograms of visual words for tasks like image classification. Post-clustering, embeddings can be refined by concatenating assignment vectors or distances to centroids. Despite their utility, clustering-based methods have limitations, including K-means' assumption of isotropic, spherical clusters, which fails on elongated or irregular shapes. Additionally, the algorithm is sensitive to the choice of K, addressed via the elbow method—plotting within-cluster variance against K to identify a point of diminishing returns—or the silhouette score, which measures how well points fit their clusters relative to others. The silhouette score for a point i is given by s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}, where a(i) is the mean intra-cluster distance and b(i) is the mean distance to points in the nearest other cluster. By 2025, clustering remains relevant through integration with deep networks, as in Deep Embedded Clustering (DEC), which jointly optimizes an autoencoder for feature extraction and a clustering layer for assignments, improving representations on benchmarks like MNIST.
Recent advances, including deep learning-driven feature extraction with convolutional neural networks and multi-scale unsupervised networks, further enhance clustering performance on complex data such as images.
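A compact NumPy sketch of Lloyd's iterations and centroid-based encoding illustrates how cluster prototypes become features; the random initialization, fixed iteration budget, and distance-vector encoding are illustrative choices.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and
    centroid recomputation until assignments stabilize."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

def encode(X, centroids):
    """Feature map: each point becomes its vector of distances to the
    cluster prototypes, a simple dictionary-style encoding."""
    return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
```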

Dimensionality Reduction Techniques

Dimensionality reduction techniques in feature learning aim to transform high-dimensional data into a lower-dimensional representation while preserving essential structure, such as variance or local neighborhoods, to facilitate subsequent analysis or modeling. These methods are particularly valuable in unsupervised settings, where they extract latent features by identifying directions of maximum variability or manifold geometries without relying on labels. Linear approaches like principal component analysis (PCA) focus on global variance preservation, whereas nonlinear methods such as locally linear embedding (LLE) and Isomap emphasize local or geodesic structures to uncover non-Euclidean manifolds. Principal component analysis is a foundational linear technique that projects data onto orthogonal axes capturing the directions of greatest variance. Given a centered data matrix X \in \mathbb{R}^{n \times d}, PCA computes the covariance matrix \Sigma = \frac{1}{n} X^T X and performs the eigen-decomposition \Sigma = U \Lambda U^T, where U contains the eigenvectors and \Lambda the eigenvalues ordered by descending magnitude. The reduced representation Z = X U_k is obtained by projecting onto the top k eigenvectors, maximizing the retained variance \frac{1}{n} \operatorname{tr}(Z^T Z). In practice, singular value decomposition (SVD) of X is often used for numerical stability, yielding X = V \Sigma W^T, with principal directions as the columns of W. PCA assumes linear relationships among features and that variance is a suitable proxy for information, which may not hold for nonlinear data structures. Beyond linear projections, nonlinear methods address manifold assumptions where high-dimensional data lies on a low-dimensional curved surface. Locally linear embedding (LLE) preserves local neighborhood geometry by reconstructing each point as a linear combination of its neighbors in both input and embedding spaces. For a point x_i with k-nearest neighbors, reconstruction weights W_{ij} minimize \sum_i \| x_i - \sum_j W_{ij} x_j \|^2 subject to \sum_j W_{ij} = 1, assuming local linearity. The low-dimensional embedding Y \in \mathbb{R}^{n \times m} then minimizes the embedding cost \sum_i \| y_i - \sum_j W_{ij} y_j \|^2, solved via an eigenvalue problem on the matrix M = (I - W)^T (I - W), retaining the bottom m+1 eigenvectors (excluding the trivial all-ones solution). LLE assumes that local neighborhoods remain linearly reconstructible in the low-dimensional space but does not explicitly preserve global distances. Isomap extends classical multidimensional scaling (MDS) to nonlinear manifolds by estimating geodesic distances via shortest paths on a neighborhood graph, complementing LLE by preserving global geometry. It constructs a graph where edges represent Euclidean distances to nearest neighbors, computes all-pairs shortest paths using algorithms such as Dijkstra's or Floyd-Warshall, and applies classical MDS to the geodesic distance matrix for embedding. This approach assumes the manifold is isometric to a convex subset of Euclidean space, making it suitable for data with intrinsic curvature, such as facial images or protein configurations. In contrast, t-distributed stochastic neighbor embedding (t-SNE) is primarily a visualization tool that minimizes divergences between high- and low-dimensional probability distributions of pairwise similarities but is non-parametric and less ideal for general feature extraction due to its sensitivity to hyperparameters and lack of out-of-sample extensions. A more recent nonlinear technique, Uniform Manifold Approximation and Projection (UMAP), builds on manifold theory to preserve both local and global structure through fuzzy simplicial sets and a cross-entropy optimization. UMAP is faster and more scalable than t-SNE, making it suitable for large datasets in feature learning and visualization as of 2025.
In feature learning, reduced representations from these techniques serve as compact inputs to downstream classifiers, often improving efficiency and generalization by mitigating the curse of dimensionality. For instance, PCA-derived features can be whitened by scaling components inversely with their standard deviations, decorrelating and normalizing inputs for models like support vector machines. Computationally, PCA via SVD scales as O(\min(n^2 d, n d^2)), efficient for moderate dimensions, while LLE involves O(n^2 k) for neighbor searches and eigenvalue solving at O(n^3), limiting scalability without approximations. These methods thus enable discovery of interpretable low-dimensional features underlying complex datasets.
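The following NumPy sketch computes PCA projections via SVD and applies the whitening described above; the epsilon guard and function name are illustrative assumptions.

```python
import numpy as np

def pca_whiten(X, k, eps=1e-8):
    """PCA via SVD on the centered data; returns the projected features
    and a whitened version scaled to unit variance per component."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                    # top-k principal components
    var = (s[:k] ** 2) / len(X)          # variance along each component
    Z_white = Z / np.sqrt(var + eps)     # decorrelate and normalize
    return Z, Z_white
```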

Independent Component Analysis

Independent Component Analysis (ICA) is an unsupervised feature learning technique that decomposes multivariate data into statistically independent components, enabling the extraction of underlying features from mixed signals without prior knowledge of the mixing process. This method is particularly valuable in scenarios where data arises from linear mixtures of hidden sources, such as in audio signal processing and neuroimaging, by assuming that the original sources are non-Gaussian and independent. The core principle of ICA models the observed data \mathbf{X} as a linear transformation of independent source signals \mathbf{S}, expressed as \mathbf{X} = \mathbf{A} \mathbf{S}, where \mathbf{A} is the unknown mixing matrix. The goal is to estimate an unmixing matrix \mathbf{W} \approx \mathbf{A}^{-1} such that the recovered signals \mathbf{Y} = \mathbf{W} \mathbf{X} approximate the independent sources \mathbf{S}. Often, PCA is applied as a preprocessing step to whiten the data, centering and sphering \mathbf{X} to simplify the ICA estimation. The objective of ICA is to maximize the statistical independence of the components, typically by measuring and enhancing non-Gaussianity through negentropy, approximated as J(\mathbf{y}) \approx \sum_k \left[ G(y_k) - G(v) \right]^2, where G is a non-quadratic contrast function approximating the sources' distributions and v is a standard Gaussian variable. Alternatively, independence can be achieved by minimizing mutual information between components, which equates to maximizing the joint entropy of the outputs under fixed marginal constraints. Key algorithms for ICA include FastICA, which employs fixed-point iteration for rapid convergence, and Infomax, which uses gradient ascent to maximize the log-likelihood of the data under an independence model. In FastICA, the update rule for the weight vector \mathbf{w} in one unit is given by: \mathbf{w}^+ \propto \mathbb{E} \left\{ \mathbf{z} g(\mathbf{w}^T \mathbf{z}) \right\} - \mathbb{E} \left\{ g'(\mathbf{w}^T \mathbf{z}) \right\} \mathbf{w}, where \mathbf{z} is the whitened data, g is a nonlinearity (e.g., g(u) = \tanh(u)), and the expectation is over the data distribution; subsequent orthogonalization and normalization ensure decorrelation. Infomax, in contrast, trains a neural network via backpropagation to maximize mutual information, treating separation as an information-theoretic optimization problem. ICA relies on key assumptions: the source signals are statistically independent, the mixing is linear, and at most one source is Gaussian to ensure identifiability, as Gaussian sources would be indistinguishable under linear transformations. These assumptions distinguish ICA from methods focused on mere decorrelation, emphasizing higher-order statistics for separation. In applications, ICA excels at blind source separation, such as isolating individual speech signals in the cocktail party problem for audio processing, where multiple voices overlap in a noisy environment. It is also widely used for feature extraction in electroencephalography (EEG) and functional magnetic resonance imaging (fMRI), decomposing brain signals into independent components to identify artifacts or neural patterns.
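A one-unit FastICA iteration with the tanh nonlinearity can be sketched in NumPy as follows, assuming the data have already been centered and whitened; the tolerance and seeding are illustrative.

```python
import numpy as np

def fastica_one_unit(Z, iters=200, tol=1e-6, seed=0):
    """One-unit FastICA fixed-point iteration on whitened data Z (d x n),
    using g(u) = tanh(u) and g'(u) = 1 - tanh(u)^2 as in the text."""
    rng = np.random.default_rng(seed)
    d, n = Z.shape
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for _ in range(iters):
        u = w @ Z                                   # projections, shape (n,)
        g, g_prime = np.tanh(u), 1.0 - np.tanh(u) ** 2
        w_new = (Z * g).mean(axis=1) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)              # renormalize
        if abs(abs(w_new @ w) - 1.0) < tol:         # converged (up to sign)
            return w_new
        w = w_new
    return w
```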

Unsupervised Dictionary Learning

Unsupervised dictionary learning aims to discover an overcomplete set of basis elements, or atoms, from unlabeled data to represent signals as sparse linear combinations of these atoms, thereby capturing intrinsic data structures without supervision. This approach contrasts with complete bases like wavelets by allowing more atoms than data dimensions, promoting efficient and interpretable representations. The core idea originates from sparse coding models that enforce sparsity to mimic biological visual systems, where neurons respond selectively to image parts. The standard formulation minimizes the reconstruction error while enforcing sparsity in the coefficients: \min_{D, S} \frac{1}{2} \| X - D S \|_2^2 + \lambda \| S \|_1, \quad \text{subject to } \| d_i \|_2 \leq 1 \ \forall i Here, X \in \mathbb{R}^{n \times m} is the data matrix with m signals of dimension n, D \in \mathbb{R}^{n \times k} is the dictionary with k > n atoms d_i, and S \in \mathbb{R}^{k \times m} are the sparse codes, with \lambda > 0 balancing fidelity and sparsity. This optimization jointly learns the dictionary D and codes S, often solved via alternating minimization: sparse coding for fixed D, then dictionary update. Constraints on atom norms prevent trivial solutions like scaling up atoms and down coefficients. Key algorithms include the K-SVD method, introduced in 2006, which iteratively alternates between sparse coding using orthogonal matching pursuit and dictionary updates via singular value decomposition on restricted error matrices to refine individual atoms while preserving sparsity. For large-scale data, online variants process samples sequentially using stochastic approximations, updating the dictionary incrementally to reduce computational cost and enable scalability to millions of examples. These methods have become foundational for unsupervised feature extraction due to their balance of accuracy and efficiency. Sparsity plays a crucial role by representing each signal as a combination of only a few atoms, typically 5-10% non-zero coefficients per signal, which encourages part-based decompositions such as edges or textures in images rather than holistic representations. This promotes disentangled features that align with perceptual organization, as sparse activations lead to localized, overlapping receptive fields akin to simple cells in the primary visual cortex. In applications, dictionary learning excels in image denoising, where learned atoms serve as adaptive filters for textures; for instance, training on noisy patches yields dictionaries that reconstruct clean signals with PSNR improvements of 2-5 dB over fixed-wavelet methods on standard benchmark images. It also supports texture synthesis by sampling sparse codes from the learned dictionary to generate novel patches that preserve statistical properties of input textures. An important extension is non-negative matrix factorization (NMF), which imposes non-negativity on both factors for additive parts-based learning: \min_{W, H} \| X - W H \|_F^2 \quad \text{subject to } W \geq 0, H \geq 0 where W acts as the dictionary and H the coefficients, yielding interpretable features like facial parts from images, as non-negativity prevents subtractive cancellations. NMF variants often use multiplicative updates for optimization. Evaluation typically measures reconstruction error via mean squared error on held-out data and sparsity level through metrics like the average number of non-zero coefficients per signal or the \ell_0-pseudo-norm, ensuring the dictionary achieves low distortion with high compression. These metrics confirm the method's ability to generalize beyond training sets.
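The alternation between sparse coding and dictionary updates can be sketched in NumPy as follows; this uses an ISTA coding step and a simple least-squares (MOD-style) dictionary update rather than the full K-SVD atom-by-atom SVD procedure.

```python
import numpy as np

def soft_threshold(Z, t):
    """Proximal operator of the l1 norm."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def dictionary_learning(X, k, lam=0.1, outer=30, inner=30, seed=0):
    """Alternating minimization of 0.5||X - DS||^2 + lam*||S||_1 with
    unit-norm atoms; a MOD-style sketch, not the K-SVD algorithm."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((n, k))
    D /= np.linalg.norm(D, axis=0)
    S = np.zeros((k, m))
    for _ in range(outer):
        L = np.linalg.norm(D.T @ D, 2)            # Lipschitz constant
        for _ in range(inner):                    # sparse coding for fixed D
            S = soft_threshold(S - (D.T @ (D @ S - X)) / L, lam / L)
        D = X @ np.linalg.pinv(S)                 # least-squares update (MOD)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)  # renormalize atoms
    return D, S
```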

Semi-Supervised Feature Learning

Graph-Based Methods

Graph-based methods in semi-supervised feature learning construct a similarity graph G = (V, E), where vertices V represent data points (both labeled and unlabeled), and edges E are weighted by pairwise similarities, often derived from kernel functions or distance metrics such as Gaussian kernels. Labels from a small set of annotated nodes are propagated to unlabeled ones by enforcing smoothness on the graph manifold, assuming that nearby points share similar features and labels. This is formalized through Laplacian regularization, minimizing the objective \min_f \sum_{i,j} W_{ij} \|f_i - f_j\|^2 + \sum_{i \in \mathcal{L}} \|f_i - y_i\|^2, where W is the affinity matrix, f are the predicted labels or features, \mathcal{L} denotes labeled nodes, and y_i are ground-truth labels. The first term promotes local consistency via the graph Laplacian L = D - W (with D the degree matrix), while the second enforces fitting to labeled data. Key algorithms include label propagation, introduced in 2003 as an iterative smoothing process that diffuses labels across the graph until convergence, treating predictions as harmonic functions that satisfy the Laplace equation on unlabeled nodes. Graph embeddings, such as Laplacian eigenmaps, learn low-dimensional representations by solving the eigendecomposition of the normalized Laplacian, preserving local geometry and manifold structure in the embedding space. These embeddings serve as learned features, capturing the intrinsic data geometry while incorporating label information for semi-supervised refinement. The process solves for feature assignments F via the linear system (I - \alpha S) F = Y, where S = D^{-1/2} W D^{-1/2} is the normalized affinity matrix, \alpha \in (0,1) controls propagation strength, and Y is the initial label matrix extended with zeros for unlabeled nodes; this yields closed-form harmonic solutions that extend labels smoothly. In feature learning, these methods produce embeddings as harmonic functions on the graph, providing low-dimensional coordinates that respect the manifold's geometry and propagate supervisory signals effectively. This leverages the manifold assumption—that high-dimensional data lies on a low-dimensional structure—enabling robust feature extraction even with scarce labels, as demonstrated in text classification tasks like document categorization, where graph-based propagation improved accuracy over supervised baselines by utilizing document similarity graphs. Advantages include scalability to large datasets via sparse representations and superior performance in low-label regimes, outperforming inductive methods by 10-20% in error rates on benchmark text datasets with only 1-5% labeled data. Variants extend this framework, such as transductive SVMs adapted to graphs, which incorporate manifold regularization into the SVM objective for joint optimization of decision boundaries and label propagation on the similarity graph. Recent updates as of 2025 integrate graph neural networks (GNNs) for dynamic graphs, where evolving topologies are handled by temporal message passing to learn time-varying features in semi-supervised settings, achieving state-of-the-art results on node classification with missing attributes by evolving structures during training.
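A closed-form label-spreading step in the style described above can be sketched in NumPy as follows, assuming a dense symmetric affinity matrix W and a label matrix Y with one-hot rows for labeled nodes and zero rows for unlabeled ones.

```python
import numpy as np

def label_spreading(W, Y, alpha=0.9):
    """Closed-form propagation: solve (I - alpha*S) F = (1 - alpha)*Y
    with S the symmetrically normalized affinity matrix."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt           # normalized affinity
    n = len(W)
    F = np.linalg.solve(np.eye(n) - alpha * S, (1 - alpha) * Y)
    return F.argmax(axis=1)                   # predicted class per node
```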

Generative Model Approaches

Generative model approaches to semi-supervised feature learning leverage both limited labeled data and abundant unlabeled data by modeling the joint distribution P(X, Y) for labeled examples and the marginal distribution P(X) for unlabeled ones, enabling the extraction of robust features that capture underlying data structures. This probabilistic framework allows the model to infer labels for unlabeled data through posterior inference, thereby improving generalization in scenarios with scarce annotations. Features are typically derived from latent variables Z in variational autoencoder (VAE)-like architectures, where Z represents a compressed, disentangled representation that encodes class-invariant properties while incorporating supervisory signals from labels. Key algorithms in this domain include semi-supervised generative adversarial networks (GANs) with an auxiliary classifier, introduced in 2016, which extend the GAN framework by training a discriminator to not only distinguish real from fake samples but also predict class labels on real data, using unlabeled samples to refine the discriminator's features. Another foundational approach is ladder networks, proposed in 2015, which integrate denoising objectives with supervised learning through a hierarchical stack of encoder-decoder pairs that enforce consistency across layers, allowing clean feature representations to propagate bidirectionally. These approaches train by maximizing the likelihood on labeled data while applying entropy minimization on unlabeled predictions, with the objective formulated as L = \sum_{\text{labeled}} \log P(Y \mid X) - \lambda \sum_{\text{unlabeled}} H(P(Y \mid X)), where H(\cdot) denotes the entropy of the predicted class distribution and \lambda weights the penalty, encouraging confident pseudo-labels for unlabeled samples to regularize the model. The latent variable Z serves as the primary source of semi-supervised representations, providing embeddings that blend supervised discrimination with density modeling to enhance downstream tasks such as classification or retrieval. In applications, these methods excel in image segmentation with few labels, where generative models like semi-supervised GANs generate plausible segmentations for unlabeled images to augment training, achieving notable improvements in pixel-wise accuracy on datasets like Cityscapes. Similarly, for anomaly detection, they model normal data distributions to identify deviations in unlabeled samples, as seen in setups using VAEs to flag outliers in industrial monitoring with minimal labeled anomalies. Despite their strengths, these approaches assume the correctness of the underlying generative model, which can lead to poor performance if the assumed data distribution is misspecified, and they often incur high computational costs due to iterative sampling and inference in high-dimensional spaces.
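The combined objective can be sketched as a loss over predicted class probabilities, written here for minimization (negative log-likelihood on labeled data plus a weighted entropy penalty on unlabeled predictions); the weighting parameter is an illustrative assumption.

```python
import numpy as np

def semi_supervised_loss(p_labeled, y_onehot, p_unlabeled, weight=0.5):
    """Cross-entropy on labeled predictions plus an entropy penalty that
    pushes unlabeled predictions toward confident pseudo-labels."""
    ce = -np.mean(np.sum(y_onehot * np.log(p_labeled + 1e-12), axis=1))
    ent = -np.mean(np.sum(p_unlabeled * np.log(p_unlabeled + 1e-12), axis=1))
    return ce + weight * ent
```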

Self-Supervised Feature Learning

Core Principles and Contrastive Learning

Self-supervised feature learning is a paradigm within machine learning that generates supervisory signals directly from the input data itself, enabling models to learn meaningful representations without relying on explicit labels. This approach treats the data as both input and output, creating pretext tasks that encourage the model to capture underlying structures, such as spatial relationships or contextual dependencies. As a branch of unsupervised methods focused on pre-training, self-supervised learning has gained prominence for its ability to leverage vast amounts of unlabeled data to initialize models for downstream supervised tasks. Pretext tasks form the core of self-supervised learning by defining surrogate objectives that provide pseudo-labels derived from the data. Examples include rotation prediction, where the model learns to identify the angle (e.g., 0°, 90°, 180°, or 270°) to which an image has been rotated; colorization, which involves predicting color values for grayscale images to reconstruct the original; and jigsaw puzzles, where the model rearranges shuffled image patches to their correct positions. These tasks promote the extraction of invariant and discriminative features, such as edges, textures, and object parts. By 2025, masked modeling has emerged as a dominant trend, extending beyond vision to non-vision domains like text and sequences, exemplified by BERT-style approaches that predict masked tokens in sentences to learn contextual embeddings. Contrastive learning, a key mechanism in self-supervised feature learning, operates by contrasting positive pairs—typically augmented views of the same instance—against negative pairs from different instances to pull similar representations closer and push dissimilar ones apart in the embedding space. This is often formulated using the InfoNCE (Noise-Contrastive Estimation) loss, which maximizes the similarity between positive pairs while minimizing it for negatives: \mathcal{L}_{NCE} = -\mathbb{E} \left[ \log \frac{\exp(\operatorname{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{N} \exp(\operatorname{sim}(z_i, z_k)/\tau)} \right] Here, z_i and z_j are projections of positive pair embeddings, \operatorname{sim}(\cdot, \cdot) is a similarity function (e.g., cosine similarity), \tau is a temperature parameter, and the sum includes one positive and N-1 negatives. Seminal techniques include instance discrimination via contrastive predictive coding (CPC), which predicts future representations in latent space using autoregressive modeling, and Momentum Contrast (MoCo), which maintains a dynamic queue of negative samples updated via a momentum encoder to enable large-batch training without collapsing representations. These methods offer significant advantages, including scalability to massive unlabeled datasets—such as billions of images—without the need for annotations, and strong transferability to downstream tasks like image classification and object detection, where pre-trained features achieve competitive performance after minimal adaptation. For instance, MoCo v2 demonstrates linear probing accuracy exceeding 70% on ImageNet, rivaling supervised pre-training while using only unlabeled data. Evaluation typically involves linear probing, where a simple linear classifier is trained on frozen self-supervised features to assess representation quality on held-out labeled data, providing a standardized metric for transferability across benchmarks.
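A batch version of the InfoNCE loss, with positives on the diagonal of the similarity matrix, can be sketched in NumPy as follows; the temperature value and normalization details are illustrative.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE over a batch: row i of z1 and row i of z2 form the positive
    pair; all other rows act as negatives. Inputs are l2-normalized."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / tau                     # cosine similarity / tau
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positives on the diagonal
```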

Applications in Text and Language

Self-supervised feature learning has revolutionized natural language processing by enabling models to derive rich representations from vast unlabeled text corpora, focusing on tasks that predict linguistic structures without explicit supervision. One foundational approach is the skip-gram model in Word2Vec, which learns word embeddings by maximizing the log probability of context words given a target word, formulated as \max \sum_{t=1}^{T} \sum_{-c \leq o \leq c, o \neq 0} \log P(w_{t+o} | w_t), where T is the sequence length and c is the context window size. This method produces dense vector representations that capture semantic similarities, such as "king" - "man" + "woman" ≈ "queen," facilitating downstream applications like semantic search. Transformer-based models advanced this paradigm with bidirectional contextual embeddings, exemplified by BERT's masked language modeling (MLM), where 15% of input tokens are randomly masked and the model predicts them based on surrounding context, alongside next sentence prediction to understand discourse relations. These pre-training objectives yield hierarchical features, starting from token-level embeddings and building to sentence-level representations through multi-layer self-attention mechanisms, enabling nuanced understanding of syntax and semantics. The GPT series, conversely, employs causal language modeling, training on next-token prediction in a unidirectional manner to generate coherent sequences, with models like GPT-3 scaling to 175 billion parameters for emergent zero-shot capabilities in tasks such as translation and question answering. By November 2025, GPT-5 further enhances zero-shot feature extraction through integrated reasoning modules, achieving superior performance on benchmarks without task-specific fine-tuning. These learned features transfer effectively to downstream applications, including sentiment analysis, where fine-tuned BERT variants significantly outperform traditional methods on datasets like SST-2, capturing subtle emotional nuances in reviews. In machine translation, contextual embeddings from models like mBERT—pre-trained on 104 languages—improve cross-lingual transfer, enabling zero-shot translation between unseen language pairs with scores up to 15 points higher than non-contextual baselines. Multilingual extensions like mBERT democratize access to high-quality features for low-resource languages, supporting applications in global sentiment monitoring and translation services. Despite these advances, self-supervised language models face significant challenges, including high computational costs for pre-training on terabyte-scale corpora, often requiring thousands of GPU-hours that limit accessibility for smaller research groups. Additionally, biases inherent in pre-training data—such as gender or racial stereotypes amplified during training—persist in embeddings, necessitating mitigation techniques like debiasing to ensure fairer downstream applications.
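For a toy vocabulary, the skip-gram objective for one (center, context) pair can be sketched with a full softmax, an illustrative simplification since practical implementations rely on negative sampling or hierarchical softmax.

```python
import numpy as np

def skipgram_loss(center_id, context_id, W_in, W_out):
    """Negative log P(context | center) under a full softmax.
    W_in (V x d) holds input embeddings; W_out (V x d) output embeddings."""
    v = W_in[center_id]                 # embedding of the center word
    scores = W_out @ v                  # one score per vocabulary word
    scores -= scores.max()              # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[context_id]
```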

Applications in Images and Vision

Self-supervised feature learning has revolutionized computer vision by enabling the extraction of robust image representations from vast unlabeled datasets, leveraging pretext tasks and data augmentations to foster invariance to transformations like cropping, color distortion, and geometric changes. In this domain, contrastive methods such as SimCLR treat augmented views of the same image as positive pairs while contrasting them against others, achieving strong linear evaluation accuracies on ImageNet by simplifying prior frameworks to focus on large-batch training and strong augmentations without architectural complexities like memory banks. Similarly, non-contrastive approaches like DINO employ self-distillation in a teacher-student setup on Vision Transformers (ViTs), where the student predicts the teacher's output on augmented crops without relying on negative samples, leading to emergent properties such as semantic segmentation structure in self-attention maps and superior transfer to downstream tasks. These methods learn features that capture spatial hierarchies and object-centric structures, outperforming supervised pretraining when fine-tuned on limited labeled data for classification and segmentation. Pretext tasks further enhance feature invariance in vision; for instance, rotation prediction trains networks to classify the angle (e.g., 0°, 90°, 180°, or 270°) an image has been rotated, encouraging the model to discern orientation-invariant semantics without labels. To prevent representational collapse in Siamese architectures, SimSiam introduces a stop-gradient operation on one branch during similarity maximization between augmented views, allowing simple predictors to yield representations competitive with contrastive baselines on classification. Vision Transformers, pretrained self-supervised via methods like DINO or masked image modeling, treat images as sequences of patch embeddings, yielding features that excel in capturing global dependencies and local details, with self-supervised ViTs often surpassing convolutional counterparts in transfer scenarios. Recent advancements by 2025 integrate diffusion models into self-supervised pipelines, using denoising as a pretext task to generate intermediate representations that support few-shot object detection and scene understanding with minimal annotations. These diffusion-based approaches enable efficient deployment on edge devices by distilling encoders from generative priors, reducing computational overhead while maintaining accuracy on downstream tasks. In downstream applications, self-supervised features pretrained on large corpora like ImageNet significantly boost object detection on COCO and semantic segmentation on ADE20K, particularly in low-data regimes where they outperform fully supervised models trained from scratch—for example, achieving up to 6 points higher mIoU with only 1% labeled data. Evaluation often employs k-NN classification on frozen features, where top-1 accuracies exceeding 70% on benchmarks signal the quality of learned representations for transfer.

Applications in Graphs, Video, Audio, and Multimodal Data

Self-supervised feature learning has been extended to graph-structured data through contrastive approaches that generate node- or graph-level representations without labels. Graph Contrastive Learning (GraphCL), introduced in 2020, employs data augmentations such as subgraph sampling and attribute masking to create positive pairs from the same graph, while contrasting them against negatives from other graphs, using a graph neural network encoder and InfoNCE loss to maximize agreement between augmented views. This method outperforms prior unsupervised baselines on classification tasks on benchmark datasets like Cora, achieving up to 5% accuracy gains in semi-supervised settings. Complementing node-level methods, InfoGraph (2020) focuses on graph-level embeddings by maximizing mutual information between a graph and its substructures, such as nodes, edges, and subgraphs, via a variational bound, enabling effective representation for tasks like graph classification on molecular datasets. In video data, self-supervised learning leverages temporal dynamics through contrastive objectives that align features across frames or clips. Temporal Contrastive Learning for Video Representation (TCLR), proposed in 2021, introduces a framework with instance-level and temporal contrastive losses to enforce variation within video instances over time, without relying on explicit pretext tasks like rotation prediction, and demonstrates superior transfer to action recognition on Kinetics-400, improving top-1 accuracy by 2-3% over prior methods. Motion prediction serves as another key pretext task, where models forecast future frame displacements or optical flow from past frames, as explored in works decoupling motion from static context to learn spatiotemporal features transferable to downstream video understanding. Modality-specific augmentations, such as temporal cropping or frame shuffling, are crucial for generating robust positives in these approaches. For audio signals, contrastive methods adapt to sequential spectrograms or raw waveforms by predicting latent representations. Contrastive Predictive Coding (CPC), from 2018, trains encoders to predict future samples in a latent space using an autoregressive model and noise-contrastive estimation on audio sequences, yielding representations that rival supervised features for speech tasks like phoneme recognition on LibriSpeech. Building on this, wav2vec 2.0 (2020) applies masked prediction to raw audio, where a convolutional encoder contextualizes masked latent vectors, and a contrastive loss distinguishes true targets from distractors, achieving word error rates competitive with supervised models after pre-training on 960 hours of unlabeled data. These techniques highlight the role of temporal augmentations, like time masking or noise addition, in capturing phonetic and prosodic structures. Multimodal self-supervised learning aligns representations across modalities, such as vision and language, using joint contrastive objectives. CLIP (2021) trains separate encoders for images and text on 400 million pairs via a contrastive loss that maximizes similarity for matching pairs while minimizing it for non-matches, enabling zero-shot transfer to image classification on ImageNet with 76% top-1 accuracy using a vision transformer. Extending to unified architectures, Flamingo (2022) integrates frozen vision and language models with cross-attention layers for few-shot learning on visual tasks, processing interleaved image-text inputs to generate representations adaptable to multimodal tasks like captioning. Cross-modal alignment remains central, often via shared embedding spaces or bidirectional contrastive losses.
Across these domains, common strategies include modality-tailored augmentations—such as edge dropping in graphs to preserve semantics while introducing variability—and cross-modal objectives that bridge disparate data types for richer representations. However, challenges persist, including scalability for large-scale graphs and videos, where memory-intensive augmentations limit training on datasets exceeding millions of nodes or hours of footage, and precise alignment in multimodal settings, where distribution shifts between modalities can degrade transfer performance.

Deep and Multilayer Architectures

Restricted Boltzmann Machines

Restricted Boltzmann machines (RBMs) are probabilistic graphical models consisting of a bipartite graph with visible units \mathbf{v} representing input data and hidden units \mathbf{h} capturing latent features, where connections exist only between visible and hidden layers, with no intra-layer connections. This restricted connectivity simplifies inference and learning compared to fully connected Boltzmann machines. The joint probability distribution over visible and hidden units is defined by an energy function E(\mathbf{v}, \mathbf{h}) = -\mathbf{b}^T \mathbf{v} - \mathbf{c}^T \mathbf{h} - \mathbf{v}^T \mathbf{W} \mathbf{h}, where \mathbf{b} and \mathbf{c} are biases for visible and hidden units, respectively, and \mathbf{W} is the weight matrix between layers. The probability of a visible vector is given by P(\mathbf{v}) = \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}}{Z}, where Z is the partition function summing over all possible configurations. RBMs typically assume binary units, with visible units often modeled as Bernoulli or Gaussian distributions to handle binary or real-valued data. Training RBMs involves maximizing the log-likelihood of the data, approximated via contrastive divergence (CD-k), an efficient method using k steps of Gibbs sampling to estimate gradients without computing the intractable partition function. In CD-k, positive phase updates use data-driven expectations, while the negative phase uses model-generated "fantasy" particles from approximate sampling, adjusting weights to make P(\mathbf{v}) approximate the empirical data distribution; the conditional distributions factorize as \prod_i P(v_i \mid \mathbf{h}) thanks to the bipartite structure. This approach enables unsupervised learning of features by minimizing the Kullback-Leibler divergence between data and model distributions. In feature learning, hidden unit activations serve as compact, distributed representations of input data, capturing higher-order correlations. RBMs can be stacked layer-wise: the hidden layer of one RBM becomes the visible layer for the next, enabling greedy pre-training of deep networks; the top two layers form an associative memory, with lower layers trained as directed belief networks, forming deep belief networks (DBNs) as introduced by Hinton et al. in 2006. This layer-wise procedure initializes weights effectively for subsequent fine-tuning, addressing vanishing gradient issues in deep architectures. Applications of RBMs include dimensionality reduction, where stacked RBMs compress high-dimensional data into low-dimensional codes outperforming PCA on datasets like MNIST, achieving better reconstruction error. In collaborative filtering, RBMs model user ratings as visible units, learning personalized recommendations; for instance, on the Netflix Prize dataset, they yielded errors around 0.90, competitive with matrix factorization methods. The binary nature of hidden features suits sparse, combinatorial data representations. Inference in RBMs is tractable due to the bipartite structure: the activation probability for hidden unit i is h_i = \sigma(c_i + \mathbf{v}^T \mathbf{W}_{:i}), where \sigma is the logistic sigmoid function, allowing mean-field approximations or exact sampling. Reconstruction of visible units from hidden activations follows symmetrically: v_j' = \sigma(b_j + \mathbf{h}^T \mathbf{W}_{j:}). Limitations of RBMs include their restriction to binary units, limiting applicability to continuous data without extensions like Gaussian RBMs, and slow training due to Gibbs sampling mixing times, though mitigated by persistent contrastive divergence (PCD), which reuses Markov chains across updates for faster convergence.
RBMs serve as a generative, probabilistic counterpart to deterministic methods like autoencoders.
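A single CD-1 update for a binary RBM can be sketched in NumPy as follows, following the energy function and conditional probabilities above; the learning rate and the use of probabilities rather than samples in the negative phase are common but illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.05, rng=None):
    """One CD-1 update for a binary RBM: positive statistics from the
    data, negative statistics from a single Gibbs reconstruction."""
    if rng is None:
        rng = np.random.default_rng(0)
    p_h0 = sigmoid(v0 @ W + c)                    # P(h=1 | v0)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    p_v1 = sigmoid(h0 @ W.T + b)                  # reconstruction P(v=1 | h0)
    p_h1 = sigmoid(p_v1 @ W + c)                  # P(h=1 | v1)
    batch = len(v0)
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / batch
    b += lr * (v0 - p_v1).mean(axis=0)
    c += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b, c
```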

Autoencoders

Autoencoders are a class of neural networks designed to learn efficient data representations by compressing inputs into a lower-dimensional latent space and then reconstructing the original input from this representation. The architecture consists of an encoder function f(x) that maps the input x to a latent code z, followed by a decoder function g(z) that reconstructs the output \hat{x} = g(f(x)). Training minimizes the reconstruction loss, typically the squared error \|x - g(f(x))\|^2, which encourages the network to capture essential features while discarding noise or redundancies. This bottleneck structure in the latent space z enforces compression, making autoencoders useful for feature extraction in high-dimensional data. Several variants extend the basic architecture to improve robustness and generative capabilities. Denoising autoencoders introduce noise to the input during training, such as Gaussian perturbations or masking, and optimize the reconstruction of the clean input, thereby learning features invariant to corruptions. Variational autoencoders (VAEs) incorporate probabilistic modeling by treating the encoder as an approximate posterior q(z|x) over the latent variables and the decoder as the likelihood p(x|z), with a prior p(z) often assumed Gaussian. They maximize the evidence lower bound (ELBO): \mathcal{L} = \mathbb{E}_{q(z|x)} [\log p(x|z)] - D_{KL}(q(z|x) \| p(z)), where the first term promotes reconstruction fidelity and the second enforces alignment with the prior via Kullback-Leibler divergence. Sparse autoencoders promote sparsity in the latent representations by adding an L1 penalty on the activations of the hidden units, encouraging only a few neurons to activate for any input and yielding more interpretable, distributed features. Autoencoders are trained using stochastic gradient descent (SGD) on the reconstruction loss, with their popularity surging in the 2010s following advances in deep learning that enabled effective optimization of deeper architectures. The latent representations z serve as compact features for downstream tasks, outperforming traditional methods like PCA on nonlinear data manifolds. In applications, autoencoders excel in anomaly detection, where high reconstruction errors flag outliers as deviations from learned normal patterns. They also facilitate pre-training for supervised tasks, initializing classifiers with robust features to improve performance on limited labeled data.
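A single gradient step for a one-layer ReLU autoencoder on the squared reconstruction loss can be sketched in NumPy as follows; real autoencoders stack several nonlinear layers, so this shallow form is an illustrative simplification.

```python
import numpy as np

def autoencoder_step(X, W_enc, W_dec, lr=1e-2):
    """One gradient step on 0.5 * ||x - g(f(x))||^2 averaged over the
    batch; Z holds the latent features used for downstream tasks."""
    Z_pre = X @ W_enc
    Z = np.maximum(Z_pre, 0.0)                     # encoder f(x) with ReLU
    X_hat = Z @ W_dec                              # decoder g(z)
    err = X_hat - X
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ ((err @ W_dec.T) * (Z_pre > 0)) / len(X)
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec
    return loss, Z
```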

Modern Deep Networks for Feature Extraction

Modern deep networks have revolutionized feature extraction by enabling the learning of complex, hierarchical representations from raw data, surpassing earlier architectures through innovations in attention mechanisms and residual connections. Transformers, introduced in 2017, utilize self-attention to capture global dependencies across input sequences, producing contextual embeddings that serve as rich features for downstream tasks such as natural language processing and beyond. These embeddings allow models to weigh the importance of different parts of the input dynamically, facilitating the extraction of relational features without relying on recurrent structures. Evolutions in convolutional neural networks (CNNs) have further advanced feature learning by addressing challenges in training very deep architectures. ResNet, proposed in 2015, incorporates residual blocks that enable the construction of networks with hundreds of layers, where shortcut connections mitigate vanishing gradients and allow the learning of residual functions, yielding robust hierarchical feature maps for image recognition. Building on this, EfficientNet in 2019 introduced compound scaling to balance network depth, width, and resolution, achieving state-of-the-art performance on benchmarks like ImageNet with significantly fewer parameters, thus extracting efficient, transferable features for vision tasks. Hybrid architectures combine the strengths of CNNs and transformers to leverage local and global dependencies. The Vision Transformer (ViT), from 2020, treats images as sequences of patches embedded into transformer inputs, enabling end-to-end learning of spatial features that rival or exceed CNNs on large-scale datasets when pre-trained appropriately. As of 2025, emerging trends emphasize generative and relational learning within deep networks. Diffusion models have gained prominence for representation learning through iterative denoising processes, learning balanced representations that capture data manifolds without overfitting to superficial patterns, as demonstrated in recent analyses of their learning dynamics. Graph neural networks, such as Graph Convolutional Networks (GCNs) introduced in 2016, extend feature learning to relational data by propagating information across graph structures, producing embeddings that encode neighborhood dependencies for tasks like semi-supervised node classification. Training these networks typically involves self-supervised pre-training on large unlabeled datasets to learn general representations, followed by fine-tuning on specific tasks, with intermediate layer activations serving as versatile feature maps for transfer. This paradigm enhances feature quality across domains, from vision to graphs. These architectures offer key advantages, including hierarchical feature building that captures multi-scale abstractions, scalability to massive datasets via parallelizable components, and adaptability to diverse data types like images, text, and graphs, driving breakthroughs in representation learning.
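The self-attention core that produces these contextual embeddings can be sketched in NumPy as scaled dot-product attention; this omits the learned query/key/value projections and multi-head structure of full transformers.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted mix of value vectors, with weights
    derived from query-key similarity: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # contextual embeddings
```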

Dynamic Feature Learning

Representations for Sequential Data

Representations for sequential data in feature learning involve models that process time-series or ordered inputs by maintaining evolving hidden states to capture temporal dependencies. The core principle is to learn a sequence of hidden states h_t at each time step t, updated via h_t = f(h_{t-1}, x_t), where x_t is the input at time t and f is a nonlinear function, typically parameterized by weights in a neural network; these hidden states serve as dynamic features representing the evolving context of the sequence. Recurrent neural networks (RNNs), introduced in the 1980s, form the foundational architecture for this approach, enabling the network to retain information from previous time steps through recurrent connections. A key advancement came with long short-term memory (LSTM) units in 1997, which address limitations in standard RNNs by incorporating gating mechanisms to regulate information flow. In LSTMs, the forget gate is computed as f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), the input gate as i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), and the output gate similarly, where \sigma is the sigmoid function, W are weights, and b are biases. The cell state evolves as c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, and the hidden state is h_t = o_t \odot \tanh(c_t), with \odot denoting element-wise multiplication and \tilde{c}_t the candidate cell update. A variant, the gated recurrent unit (GRU) introduced in 2014, simplifies the LSTM by merging the cell and hidden states into a single unit with fewer gates—an update gate and a reset gate—reducing parameter count while maintaining comparable performance on sequential tasks. Bidirectional RNNs extend these architectures by processing sequences in both forward and backward directions, concatenating hidden states from each pass to provide fuller contextual representations for tasks requiring global sequence information. These representations find applications in domains like speech recognition, where deep RNNs have achieved state-of-the-art performance by modeling acoustic sequences as evolving features for phonetic decoding, and stock price prediction, where LSTM-based models capture temporal patterns in historical prices to forecast future trends. In sequence-to-sequence (seq2seq) frameworks, the encoder's hidden states serve as contextual features for the decoder, enabling tasks such as machine translation. By the mid-2020s, transformer-based architectures, leveraging self-attention mechanisms, have increasingly supplanted RNNs for sequential data, enabling parallel processing and better handling of long-range dependencies in tasks like language modeling and time-series forecasting. A primary challenge in training these models is the vanishing gradient problem, where gradients diminish exponentially over long sequences during backpropagation through time, hindering learning of long-term dependencies; LSTMs mitigate this by using constant error carousels in cell states to propagate gradients more stably.
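The gate equations above translate directly into a single LSTM step; in this sketch, P is an assumed dictionary of weight matrices and biases acting on the concatenated [h_{t-1}, x_t] vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, P):
    """Single LSTM step following the gate equations in the text."""
    z = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    f = sigmoid(P["W_f"] @ z + P["b_f"])          # forget gate
    i = sigmoid(P["W_i"] @ z + P["b_i"])          # input gate
    o = sigmoid(P["W_o"] @ z + P["b_o"])          # output gate
    c_tilde = np.tanh(P["W_c"] @ z + P["b_c"])    # candidate cell update
    c_t = f * c_prev + i * c_tilde                # cell state
    h_t = o * np.tanh(c_t)                        # hidden state = feature
    return h_t, c_t
```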

Methods for Evolving and Temporal Structures

In methods for evolving and temporal structures, feature learning focuses on generating dynamic embeddings e_t(v) for nodes v or edges at discrete time steps t, which capture the structural evolution of networks through temporal transitions between snapshots. These representations adapt to changes in graph topology, such as node additions, edge formations, or attribute updates, enabling models to capture relational dynamics without recomputing static features from scratch. The core principle involves propagating information across time-aware neighborhoods, where embeddings evolve by integrating historical states with current structural deltas, often leveraging recurrent mechanisms to maintain continuity.

Key algorithms in this domain include dynamic graph neural networks (GNNs), which extend static GNNs to handle temporal changes. EvolveGCN, introduced in 2019, treats a dynamic graph as a sequence of snapshots and evolves the GCN parameters over time using recurrent neural networks (RNNs), allowing efficient updates without full retraining. Building on this, TGAT (Temporal Graph Attention Network) from 2020 incorporates time-aware attention mechanisms to weigh historical interactions based on their recency and relevance, producing inductive embeddings suitable for unseen nodes in evolving graphs. These methods significantly outperform static baselines on dynamic tasks in interaction and social networks, as they explicitly model temporal dependencies in aggregation.

For continuous-time settings, temporal point processes model event sequences in graphs, where features represent intensity functions that govern interaction rates. The Hawkes process, a self-exciting model, defines the conditional intensity as \lambda(t) = \mu + \sum_{t_i < t} \alpha \exp(-\beta (t - t_i)), with baseline rate \mu, excitation amplitude \alpha, and decay rate \beta, capturing how past events influence future ones in networked systems. Graph Hawkes Neural Networks extend this by parameterizing the intensity function via GNNs over the graph structure, enabling feature learning for temporal knowledge graphs and improving accuracy on forecasting tasks.

Applications span domains with evolving relational data. In social network analysis, dynamic GNNs track user interactions for community detection and influence propagation, as demonstrated in temporal citation networks where embeddings evolve to predict links. Traffic forecasting benefits from these methods by modeling road networks as time-varying graphs, with models like temporal GNNs achieving lower errors in short-term flow predictions compared to autoregressive baselines. As of 2025, real-time fraud detection in transaction graphs has adopted dynamic GNN frameworks, such as FinGuard-GNN, which updates embeddings incrementally to detect evolving patterns in financial activity.

Embedding updates typically follow a recurrent formulation, such as e_t = \text{RNN}(e_{t-1}, \Delta A_t), where \Delta A_t denotes the adjacency change at time t, optionally incorporating additional sequential RNNs for evolving attributes. This allows scalable processing of large-scale temporal graphs, with memory embeddings in models like Temporal Graph Networks (TGNs) significantly reducing computational overhead through efficient summaries of interaction histories.

Evaluation emphasizes tasks like dynamic link prediction over varying time horizons, assessing how well models forecast future edges given historical snapshots. Metrics such as area under the curve (AUC) and mean reciprocal rank (MRR) are computed across short (e.g., next timestep) and long (e.g., 10-step) horizons, revealing that time-aware methods like TGAT often outperform snapshot-based approaches on datasets like Wikipedia edits or MOOC interactions.
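For concreteness, the following NumPy sketch evaluates the Hawkes conditional intensity defined above for a toy event history; the parameter values and event times are illustrative assumptions rather than fitted quantities, and a Graph Hawkes Neural Network would replace the fixed \mu, \alpha, \beta with GNN-parameterized functions of the graph state.

```python
import numpy as np

def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Conditional intensity of a self-exciting Hawkes process.

    Implements lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i)).

    t           : time at which to evaluate the intensity
    event_times : array of past event timestamps
    mu          : baseline rate
    alpha       : excitation amplitude contributed by each past event
    beta        : exponential decay rate of the excitation
    """
    past = event_times[event_times < t]
    return mu + np.sum(alpha * np.exp(-beta * (t - past)))

# Toy usage: three past interactions on one edge of a temporal graph
events = np.array([0.5, 1.2, 2.0])
for t in (1.0, 2.1, 5.0):
    print(t, hawkes_intensity(t, events, mu=0.1, alpha=0.8, beta=1.5))
# The intensity spikes just after each event and decays back toward mu,
# mirroring how recent interactions raise the rate of future ones.
```

The same pattern of consuming recent history also underlies the recurrent embedding update e_t = \text{RNN}(e_{t-1}, \Delta A_t) mentioned above: in both cases the feature at time t is a decayed or gated summary of past structural events.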

References

  1. [1]
    [PDF] Representation Learning: A Review and New Perspectives - arXiv
    Apr 23, 2014 · In Bengio and LeCun (2007), one of us introduced the notion of AI-tasks, which are challenging for current machine learning algorithms, and ...
  2. [2]
    [PDF] Understanding and Improving Feature Learning for Out-of ...
    Understanding feature learning in neural networks is crucial to understanding how they generalize to different data distributions [2, 11, 12, 62, 67, 70]. Deep ...
  3. [3]
    Representation Learning: A Review and New Perspectives - arXiv
    Jun 24, 2012 · This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, ...
  4. [4]
    Representation Learning. Chapter 15 of the Deep Learning Book (Goodfellow, Bengio, Courville).
  5. [5]
    Feature Learning in CNNs. Deep Learning Book, Convolutional Networks chapter: https://www.deeplearningbook.org/contents/convnets.html
  6. [6]
    [PDF] Pearson, K. 1901. On lines and planes of closest fit to systems of ...
    Pearson, K. 1901. On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2:559-572. http://pbil.univ-lyon1.fr/R/pearson1901.
  7. [7]
    [PDF] Independent Component Analysis - Computer Science
    Table of contents excerpt: Independent component analysis (definition, applications, how to find the independent components), history of ICA.
  8. [8]
    [PDF] Emergence of simple-cell receptive field properties by learning a ...
    We show that a learning algorithm that attempts to find sparse linear codes for natural scenes will develop a complete family of localized, oriented, bandpass ...
  9. [9]
    [PDF] K-SVD: An Algorithm for Designing Overcomplete Dictionaries for ...
    In this paper, we present a novel algorithm for adapting dictionaries so as to represent signals sparsely. Given a set of training signals, we seek the ...
  10. [10]
    [PDF] A Fast Learning Algorithm for Deep Belief Nets
    We show how to use “complementary priors” to eliminate the explaining-away effects that make inference difficult in densely connected belief nets.
  11. [11]
    [PDF] ImageNet Classification with Deep Convolutional Neural Networks
    The specific contributions of this paper are as follows: we trained one of the largest convolutional neural networks to date on the subsets of ImageNet used in ...
  12. [12]
    Efficient Estimation of Word Representations in Vector Space - arXiv
    Jan 16, 2013 · We propose two novel model architectures for computing continuous vector representations of words from very large data sets.
  13. [13]
    [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers ...
    Oct 11, 2018 · BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
  14. [14]
    A Simple Framework for Contrastive Learning of Visual ... - arXiv
    Feb 13, 2020 · This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised ...
  15. [15]
    Learning Transferable Visual Models From Natural Language ...
    Feb 26, 2021 · Learning Transferable Visual Models From Natural Language Supervision. Authors:Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, ...
  16. [16]
    [2005.14165] Language Models are Few-Shot Learners - arXiv
    May 28, 2020 · GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks ...
  17. [17]
    Deep learning | Nature
    Published: 27 May 2015. Deep learning. Yann LeCun, Yoshua Bengio & Geoffrey Hinton. Nature volume 521, pages 436–444 (2015).
  18. [18]
    Deep learning: Historical overview from inception to actualization ...
    This study aims to provide a historical narrative of deep learning, tracing its origins from the cybernetic era to its current state-of-the-art status.
  19. [19]
    Learning representations by back-propagating errors - Nature
    Oct 9, 1986 · We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in ...
  20. [20]
    [0809.3083] Supervised Dictionary Learning - arXiv
    Sep 18, 2008 · This paper proposes a new step in that direction, with a novel sparse representation for signals belonging to different classes in terms of a shared dictionary.
  21. [21]
    ImageNet Classification with Deep Convolutional Neural Networks
    Authors. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. Abstract. We trained a large, deep convolutional neural network to classify the 1.3 million ...
  22. [22]
    [PDF] Handwritten Digit Recognition with a Back-Propagation Network
    The main point of this paper is to show that large back-propagation (BP) networks can be applied to real image-recognition problems without a large, complex ...
  23. [23]
    [PDF] Rectified Linear Units Improve Restricted Boltzmann Machines
    Rectified linear units (RLUs) improve RBMs by learning better features for object recognition and face verification, and preserving relative intensities unlike ...
  24. [24]
    How transferable are features in deep neural networks? - arXiv
    Nov 6, 2014 · How transferable are features in deep neural networks?, by Jason Yosinski and 3 other authors.
  25. [25]
    LoRA: Low-Rank Adaptation of Large Language Models - arXiv
    Jun 17, 2021 · We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the ...
  26. [26]
    [PDF] Learning Feature Representations with K-means
    More recently, we have found that using K-means clustering as the unsupervised learning module in these types of “feature learning” pipelines can lead to excellent ...
  27. [27]
    An Analysis of Single-Layer Networks in Unsupervised Feature ...
    We will apply several off-the-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to CIFAR-10, ...
  28. [28]
    An efficient k-means clustering algorithm: analysis and implementation
    A popular heuristic for k-means clustering is Lloyd's (1982) algorithm. We present a simple and efficient implementation of Lloyd's k-means clustering algorithm ...
  29. [29]
    [PDF] On Spectral Clustering: Analysis and an algorithm Andrew Y. Ng CS ...
    In this paper, we present a simple spectral clustering algorithm that can be ... Here, we build upon the recent work of Weiss [11] and Meila and Shi [6], who.
  30. [30]
    A Review on Analysis of K-Means Clustering Machine Learning ...
    Apr 15, 2024 · The objective of writing the paper is how K-Means clustering algorithm is applied on the model dataset based on unsupervised learning. We used ...
  31. [31]
    [PDF] Video Google: A Text Retrieval Approach to Object Matching in Videos
    Building a visual vocabulary. The objective here is to vector quantize the descriptors into clusters which will be the visual 'words' for text retrieval.
  32. [32]
    [PDF] a graphical aid to the interpretation and validation of cluster analysis
    Silhouettes of an example where eight points are divided over two very tight clusters, for k = 2.
  33. [33]
    Unsupervised Deep Embedding for Clustering Analysis - arXiv
    Nov 19, 2015 · In this paper, we propose Deep Embedded Clustering (DEC), a method that simultaneously learns feature representations and cluster assignments using deep neural ...
  34. [34]
    Principal component analysis: a review and recent developments
    Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing ...
  35. [35]
    [PDF] Think Globally, Fit Locally: Unsupervised Learning of Low ...
    Here we describe locally linear embedding (LLE), an unsupervised learning algorithm ... In previous work (Roweis and Saul, 2000), for example, we applied LLE.
  36. [36]
    A Guide to Principal Component Analysis (PCA) for Machine Learning
    What are the assumptions and limitations of PCA? · PCA assumes a correlation between features. · PCA is sensitive to the scale of the features. · PCA is not robust ...
  37. [37]
    [PDF] Visualizing Data using t-SNE - Journal of Machine Learning Research
    We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map.
  38. [38]
    Principal Component Analysis (PCA): Explained Step-by-Step | Built In
    Principal component analysis was first introduced by Karl Pearson in 1901 as a method for identifying the principal axes of variation in multidimensional data, ...
  39. [39]
    [PDF] Independent Component Analysis: Algorithms and Applications
    For details, see (Hyvärinen, 1999b). In FastICA, convergence speed is optimized by the choice of the matrices diag(αi) and diag(βi). Another advantage of ...
  40. [40]
    [PDF] Independent Component Analysis: A Tutorial
    The FastICA algorithm and the underlying contrast functions have a number of desirable properties when compared with existing methods for ICA. 1. The ...
  41. [41]
    [PDF] An information-maximisation approach to blind separation and blind ...
    A brief report of this research appears in Bell & Sejnowski (1995). 2 Information maximisation. The basic problem tackled here is how to maximise the mutual ...
  42. [42]
  43. [43]
    Learning the parts of objects by non-negative matrix factorization
    Oct 21, 1999 · Here we demonstrate an algorithm for non-negative matrix factorization that is able to learn parts of faces and semantic features of text.
  44. [44]
    [PDF] Learning with Local and Global Consistency - NIPS papers
    We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference.
  45. [45]
    [PDF] Semi-Supervised Learning Using Gaussian Fields and Harmonic ...
    In this paper we introduce a new approach to semi-supervised learning that is based on a random field model defined on a weighted graph over the unlabeled and.
  46. [46]
    [PDF] Laplacian Eigenmaps for Dimensionality Reduction and Data ...
    To simplify the analysis, the neighboring points (x_ij's) are assumed to lie on a locally linear patch on the manifold.
  47. [47]
    [PDF] Semi-Supervised Learning with Graphs - cs.wisc.edu
    We present a series of novel semi-supervised learning approaches arising from a graph representation, where labeled and unlabeled instances are represented as.
  48. [48]
    Dynamic graph structure evolution for node classification with ...
    Jul 16, 2025 · This paper proposes the evolving graph structure (EGS) framework for semi-supervised node classification with missing attributes.
  49. [49]
    Semi-Supervised Learning with Deep Generative Models - arXiv
    This paper revisits semi-supervised learning with generative models, using deep generative models and variational methods to improve generalization from small ...
  50. [50]
    Conditional Image Synthesis With Auxiliary Classifier GANs - arXiv
    Oct 30, 2016 · This paper introduces new methods for training GANs for image synthesis, using label conditioning for 128x128 resolution images with global ...
  51. [51]
    [1507.02672] Semi-Supervised Learning with Ladder Networks - arXiv
    Jul 9, 2015 · Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the ...
  52. [52]
    [PDF] Semi Supervised Semantic Segmentation Using Generative ...
    This paper uses a GAN-based semi-supervised framework with a generator and classifier, using both labeled and unlabeled data, including fake images, to improve ...
  53. [53]
    Semi-Supervised Anomaly Detection Based on Deep Generative ...
    Jun 4, 2022 · We propose a novel semi-supervised anomaly detection approach based on deep generative models with Transformers for identifying unusual (abnormal) images from ...
  54. [54]
    [1603.08511] Colorful Image Colorization - arXiv
    Mar 28, 2016 · Moreover, we show that colorization can be a powerful pretext task for self-supervised feature learning, acting as a cross-channel encoder.
  55. [55]
    Unsupervised Learning of Visual Representations by Solving ... - arXiv
    Mar 30, 2016 · We build a convolutional neural network (CNN) that can be trained to solve Jigsaw puzzles as a pretext task, which requires no manual labeling.
  56. [56]
    Rethinking self-supervised learning for time series forecasting
    Dec 3, 2024 · In time series forecasting, masked modeling offers a distinct advantage by implicitly guiding the model to capture fine-grained temporal ...
  57. [57]
    Representation Learning with Contrastive Predictive Coding - arXiv
    We propose a universal unsupervised learning approach to extract useful representations from high-dimensional data, which we call Contrastive Predictive Coding.
  58. [58]
    Momentum Contrast for Unsupervised Visual Representation Learning
    Nov 13, 2019 · We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up,
  59. [59]
    Rethinking Evaluation Protocols of Visual Representations Learned ...
    Apr 7, 2023 · In this work, we try to figure out the cause of performance sensitivity by conducting extensive experiments with state-of-the-art SSL methods.
  60. [60]
    Introducing GPT-5 - OpenAI
    Aug 7, 2025 · Introducing GPT-5. Our smartest, fastest, most useful model yet, with built-in thinking that puts expert-level intelligence in everyone's hands.
  61. [61]
    BERT applications in natural language processing: a review
    Mar 15, 2025 · This review study examines the complex nature of BERT, including its structure, utilization in different NLP tasks, and the further development of its design ...
  62. [62]
    Self-Supervised Learning Principles Challenges and Emerging ...
    Feb 24, 2025 · This survey provides a comprehensive overview of self-supervised learning, covering its fundamental principles, major methodological approaches, ...
  63. [63]
    Contrastive Self-Supervised Learning of Graph Representations
    Jul 15, 2020 · Abstract:We propose Graph Contrastive Learning (GraphCL), a general framework for learning node representations in a self supervised manner.
  64. [64]
    InfoGraph: Unsupervised and Semi-supervised Graph-Level ... - arXiv
    Jul 31, 2019 · This paper studies learning the representations of whole graphs in both unsupervised and semi-supervised scenarios.
  65. [65]
    TCLR: Temporal Contrastive Learning for Video Representation
    Jan 20, 2021 · We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video ...
  66. [66]
    wav2vec: Unsupervised Pre-training for Speech Recognition - arXiv
    Apr 11, 2019 · We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio ...
  67. [67]
    [PDF] Boltzmann Machines - Computer Science
    Mar 25, 2007 · A Boltzmann Machine is a network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off ...
  68. [68]
    [PDF] Restricted Boltzmann Machines
    The main worry with CD is that there will be deep minima of the energy function far away from the data. To find these we need to run the Markov chain for ...
  69. [69]
    [PDF] Training Products of Experts by Minimizing Contrastive Divergence
    Mayraz and Hinton (in preparation) report good comparative results for the larger MNIST database at www.research.att.com/~yann/ocr/mnist and they were careful ...
  70. [70]
    Reducing the Dimensionality of Data with Neural Networks - Science
    We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much ...
  71. [71]
    [PDF] Restricted Boltzmann Machines for Collaborative Filtering
    Salakhutdinov, R., & Hinton, G. E. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. AI and Statistics. Srebro, N., & ...
  72. [72]
    [PDF] Using Fast Weights to Improve Persistent Contrastive Divergence
    (Tieleman, 2008) showed that, given a fixed amount of computation, restricted Boltzmann machines can learn better models using this “Persistent Contrastive.
  73. [73]
    [1312.6114] Auto-Encoding Variational Bayes - arXiv
    Dec 20, 2013 · Authors: Diederik P Kingma, Max Welling.
  74. [74]
    [PDF] Sparse autoencoder
    These notes describe the sparse autoencoder learning algorithm, which is one approach to automatically learn features from unlabeled data. In some domains, such ...
  75. [75]
    [1706.03762] Attention Is All You Need - arXiv
    Jun 12, 2017 · Attention Is All You Need, by Ashish Vaswani and 7 other authors.
  76. [76]
    [1512.03385] Deep Residual Learning for Image Recognition - arXiv
    Dec 10, 2015 · We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously.
  77. [77]
    EfficientNet: Rethinking Model Scaling for Convolutional Neural ...
    We propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient.
  78. [78]
    [2010.11929] An Image is Worth 16x16 Words: Transformers ... - arXiv
    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Authors:Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk ...
  79. [79]
    [2412.01021] On the Feature Learning in Diffusion Models - arXiv
    Dec 2, 2024 · Diffusion models learn balanced data representations due to denoising, unlike classification models which focus on easy-to-learn patterns.
  80. [80]
    Semi-Supervised Classification with Graph Convolutional Networks
    We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks.
  81. [81]
    [PDF] Self-Supervised Learning in Deep Networks - arXiv
    During the training process, we first pre-train the model with self- supervision to enable it to learn common feature expressions on a large amount of unlabeled ...
  82. [82]
    Finding Structure in Time - Elman - 1990 - Cognitive Science
    A set of simulations is reported which range from relatively simple problems (temporal version of XOR) to discovering syntactic/semantic features for words.
  83. [83]
    [PDF] 1990-elman.pdf - Gwern
    The current report develops a proposal along these lines first described by Jordan (1986) which involves the use of recurrent links in order to provide networks ...
  84. [84]
    [PDF] LONG SHORT-TERM MEMORY 1 INTRODUCTION
    See Schmidhuber and Hochreiter (1996) and Hochreiter and Schmidhuber (1996, 1997) for additional results in this vein. LSTM architecture. We use a 3-layer net ...
  85. [85]
    [PDF] Learning Phrase Representations using RNN Encoder–Decoder for ...
    In this paper, we propose a novel neural network model called RNN Encoder–Decoder that consists of two recurrent neural networks (RNN). One RNN encodes ...
  86. [86]
    Bidirectional recurrent neural networks | IEEE Journals & Magazine
    Abstract: In the first part of this paper, a regular recurrent neural network (RNN) is extended to a bidirectional recurrent neural network (BRNN).
  87. [87]
    Speech Recognition with Deep Recurrent Neural Networks - arXiv
    Mar 22, 2013 · Abstract:Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist ...
  88. [88]
    Stock Market Prediction Using LSTM Recurrent Neural Network
    This article aims to build a model using Recurrent Neural Networks (RNN) and especially Long-Short Term Memory model (LSTM) to predict future stock market ...
  89. [89]
    [PDF] the vanishing gradient problem during learning recurrent neural nets ...
    Bengio, Y., P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult”, IEEE Transactions on Neural Networks, 5(2):157-166 (1994) ...
  90. [90]
    [PDF] Learning long-term dependencies with gradient descent is difficult
    Our only claim here is that discrete propagation of error offers interesting solutions to the vanishing gradient problem in recurrent network. Our ...
  91. [91]
    EvolveGCN: Evolving Graph Convolutional Networks for Dynamic ...
    Feb 26, 2019 · We propose EvolveGCN, which adapts the graph convolutional network (GCN) model along the temporal dimension without resorting to node embeddings.
  92. [92]
    [2002.07962] Inductive Representation Learning on Temporal Graphs
    Feb 19, 2020 · We propose the temporal graph attention (TGAT) layer to efficiently aggregate temporal-topological neighborhood features as well as to learn the time-feature ...
  93. [93]
    [PDF] Neural Temporal Point Processes: A Review - IJCAI
    Temporal point processes (TPP) are probabilistic generative models for continuous-time event sequences. Neural TPPs combine the fundamental ...
  94. [94]
    [PDF] Graph Hawkes Neural Network for Forecasting on Temporal ...
    The Hawkes process has become a standard method for modeling self-exciting event sequences with different event types. A recent work has generalized the Hawkes ...
  95. [95]
    Enhancement of traffic forecasting through graph neural network ...
    This study investigates information fusion methods for GNN-based traffic predictions, including their benefits and challenges.
  96. [96]
  97. [97]
    Temporal Graph Networks for Deep Learning on Dynamic Graphs
    Jun 18, 2020 · In this paper, we present Temporal Graph Networks (TGNs), a generic, efficient framework for deep learning on dynamic graphs represented as sequences of timed ...
  98. [98]
    [PDF] Towards Better Evaluation for Dynamic Link Prediction
    Nodes, edges, weights or attributes in a dynamic graph can be added, deleted or adjusted over time. Therefore, understanding and analyzing the temporal patterns ...