
Neural tangent kernel

The Neural Tangent Kernel (NTK) is a kernel that characterizes the training dynamics of overparameterized neural networks in the infinite-width limit, demonstrating that gradient descent on such networks behaves equivalently to kernel regression using this kernel, which remains constant throughout training under specific parameterizations. Introduced in 2018 by Arthur Jacot, Franck Gabriel, and Clément Hongler, the NTK bridges deep learning with classical kernel methods by mapping the evolution of a neural network's output during optimization to kernel gradient descent in function space, where the kernel \Theta(x, x') encodes similarities between inputs based on the network's gradients. At its core, the NTK arises from the empirical kernel \hat{\Theta}_t(x, x') = \nabla_\theta f_t(x)^T \nabla_\theta f_t(x'), where f_t is the network function at time t and \theta are the parameters; in the infinite-width regime, this converges to a deterministic limiting kernel that is positive-definite for non-polynomial activations and inputs on the unit sphere, ensuring global convergence to the minimum-norm solution for least-squares regression. Key theoretical results include convergence guarantees requiring network widths scaling polynomially with the number of training samples (e.g., n > C m^6 \lambda_0^{-4} \delta^{-3} for ReLU networks with high probability), and generalization bounds tying error rates to the kernel's eigenvalues and effective dimensionality. These insights explain why wide networks achieve strong generalization despite overparameterization, with training aligning fastest along the kernel's principal components. The NTK framework has been extended beyond fully connected networks to convolutional architectures via the Convolutional NTK (CNTK), enabling exact computation of the kernel limit and achieving competitive performance on tasks like CIFAR-10 image classification (e.g., 77% accuracy with infinite-width approximations). Applications include neural architecture search, where NTK spectra predict model performance without full training; small-data image classification, outperforming finite-width ResNets on subsets of CIFAR-10 with fewer than 640 samples; and tasks like matrix completion and image inpainting. Further developments incorporate finite-width corrections, multi-head attention layers, and equivariant models, while alternatives like the neural network Gaussian process (NNGP) kernel address initialization behaviors. Despite its theoretical elegance, the NTK has limitations, including high computational demands—e.g., O(m^3) for inverting the kernel Gram matrix on m samples, rendering it impractical for large datasets like ImageNet—and reliance on unrealistically large widths for the theory to hold in practice. It also struggles with non-smooth operations like max pooling and evolving kernels under standard parameterizations, prompting ongoing research into empirical approximations and broader architectures.

Definition and Motivation

Formal Definition

The neural tangent kernel (NTK) arises in the context of analyzing wide neural networks, where the input space is denoted by \mathcal{X} \subseteq \mathbb{R}^{d_0}, the parameter space by \Theta \subseteq \mathbb{R}^P with P the total number of parameters, and the function space by \mathcal{F} consisting of functions f: \mathcal{X} \to \mathbb{R}^C mapping inputs to C-dimensional outputs. For a neural network f_\theta: \mathcal{X} \to \mathbb{R}^C parameterized by \theta \in \Theta, the NTK at initialization is defined as the kernel \Theta(x, x') = \mathbb{E}_{\theta \sim \text{init}} \left[ \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x') \right] for inputs x, x' \in \mathcal{X}, where the expectation is over the random initialization of parameters and the gradients are with respect to \theta. This captures the inner product of the parameter gradients for the network outputs at x and x'. In the scalar output case (C=1), the NTK takes the form \Theta(x, x') = \sum_{l=1}^L \mathbb{E} \left[ \frac{\partial f}{\partial a_l}(x) \frac{\partial f}{\partial a_l}(x') \right], where a_l denotes the pre-activation at layer l, and the expectation includes recursive contributions from preceding layers that propagate the gradient through the network depth. For vector-valued outputs (C > 1), the NTK is a C \times C matrix-valued kernel with component-wise entries \Theta^{pq}(x, x') = \mathbb{E}_{\theta \sim \text{init}} \left[ \nabla_\theta f_\theta^p(x)^\top \nabla_\theta f_\theta^q(x') \right] for output dimensions p, q \in \{1, \dots, C\}, where f_\theta^p is the p-th component of the output.
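The empirical kernel in the definition above can be formed directly from parameter gradients. Below is a minimal NumPy sketch for a two-layer ReLU network with scalar output; it is illustrative only (the 1/\sqrt{\text{width}} output scaling and all function names are assumptions of this example, not a reference implementation).

```python
import numpy as np

def init_params(d_in, width, rng):
    """Standard-normal weights; the 1/sqrt(width) factor is applied in the forward pass."""
    return rng.standard_normal((width, d_in)), rng.standard_normal(width)

def forward(params, x):
    """Scalar output of a two-layer ReLU network, f(x) = w2 . relu(W1 x) / sqrt(width)."""
    W1, w2 = params
    return w2 @ np.maximum(W1 @ x, 0.0) / np.sqrt(len(w2))

def param_gradient(params, x):
    """Flattened gradient of f(x) with respect to all parameters (W1, w2)."""
    W1, w2 = params
    width = len(w2)
    pre = W1 @ x
    grad_W1 = np.outer(w2 * (pre > 0), x) / np.sqrt(width)   # d f / d W1
    grad_w2 = np.maximum(pre, 0.0) / np.sqrt(width)          # d f / d w2
    return np.concatenate([grad_W1.ravel(), grad_w2])

def empirical_ntk(params, X):
    """Empirical NTK: pairwise inner products of parameter gradients."""
    G = np.stack([param_gradient(params, x) for x in X])
    return G @ G.T

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))                 # five inputs in R^3
params = init_params(d_in=3, width=4096, rng=rng)
K = empirical_ntk(params, X)
print(forward(params, X[0]), K.shape)           # network output at x_0 and the 5x5 kernel
```

Averaging K over many independent initializations estimates the expectation appearing in the definition of \Theta.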

Historical Development

The concept of the neural tangent kernel (NTK) built upon earlier theoretical investigations into the behavior of wide neural networks, where over-parameterization simplifies optimization and leads to kernel-like dynamics. A key preceding work was the analysis by Livni, Shalev-Shwartz, and Shamir, which demonstrated that training sufficiently wide, over-specified shallow networks can be computationally efficient under certain conditions. This laid groundwork for understanding the "lazy" training regime, where parameters remain nearly fixed during training, as explored in subsequent 2017 studies on loss landscapes in deep and wide networks by Nguyen and Hein, among others. The NTK was formally introduced in 2018 by Arthur Jacot, Franck Gabriel, and Clément Hongler in their seminal paper "Neural Tangent Kernel: Convergence and Generalization in Neural Networks," which analyzed the training dynamics of fully connected networks in the infinite-width limit. Published at the NeurIPS conference that year, the work showed that gradient descent on such networks corresponds to kernel gradient descent in function space using the NTK, a fixed kernel that remains stable post-initialization. This formulation had immediate impact by bridging modern deep learning practice with classical kernel methods, offering explanations for the observed convergence and generalization in over-parameterized models through a function-space perspective. Early extensions in 2019 by Sanjeev Arora and collaborators applied the NTK framework to convolutional neural networks (CNNs), deriving the convolutional NTK (CNTK) and providing efficient algorithms for its exact computation, which revealed similarities in performance between infinite-width CNNs and finite ones.

Mathematical Framework

Infinite Width Limit

The infinite width limit refers to the theoretical regime in which the number of neurons (width) in each hidden layer of a deep neural network is scaled to infinity, enabling precise mathematical analysis of the network's behavior, particularly under gradient-based training. This limit reveals that over-parameterized networks exhibit simplified dynamics, bridging connections to classical methods like kernel regression and Gaussian processes. Seminal work established that, under appropriate scaling, the network's output at initialization and its evolution during training converge to deterministic limits, facilitating the emergence of the neural tangent kernel (NTK). A key aspect of this limit is the scaling regime designed to maintain signal stability and prevent exploding or vanishing gradients as width grows. The layer widths w are taken to infinity sequentially from input to output layers, while weights are initialized such that their variance is 2/\text{fan-in} for ReLU activations; this He initialization ensures that the variance of pre-activations remains constant (order 1) across layers, promoting reliable convergence in the wide limit. To keep training dynamics stable, the learning rate \eta is scaled proportionally to 1/w in the standard parametrization, counteracting the growth in gradient magnitudes and ensuring the relative parameter updates diminish appropriately. At initialization, the central limit theorem plays a crucial role: the random features generated by the wide layers, due to the summation over many independent neuron outputs, converge in distribution to a multivariate Gaussian. This Gaussian process prior captures the network's predictive distribution before training, with the covariance determined recursively through the layer structure and activation functions. In the infinite width limit, training via gradient descent operates in the "lazy" regime, where parameters undergo negligible changes from their initial values relative to the network's scale. Consequently, the network functions primarily as a fixed, nonlinear feature map composed with an evolving linear readout, akin to kernel ridge regression on the initial features. This lazy behavior arises because the high dimensionality suppresses significant feature adaptation, stabilizing the training trajectory and aligning the network's evolution with that of a deterministic kernel method.
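A quick Monte Carlo check of the initialization scaling described above: the sketch below (a He-initialized ReLU MLP evaluated on a fixed standard-normal input; all names and sizes are illustrative assumptions) estimates the output mean and variance across re-initializations, which should stay of order one as the width grows, consistent with the Gaussian-process picture.

```python
import numpy as np

def relu_mlp_output(x, widths, rng):
    """One forward pass with He initialization: weight variance 2/fan_in per hidden layer."""
    h = x
    for fan_out in widths:
        W = rng.standard_normal((fan_out, h.shape[0])) * np.sqrt(2.0 / h.shape[0])
        h = np.maximum(W @ h, 0.0)
    w_out = rng.standard_normal(h.shape[0]) * np.sqrt(1.0 / h.shape[0])
    return w_out @ h

rng = np.random.default_rng(0)
x = rng.standard_normal(128)                    # fixed input
for width in (16, 256, 4096):
    outs = np.array([relu_mlp_output(x, [width, width], rng) for _ in range(500)])
    print(f"width={width:5d}  output mean={outs.mean():+.3f}  variance={outs.var():.3f}")
```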

Derivation of the NTK

The derivation of the neural tangent kernel (NTK) begins with the dynamics of a neural network under gradient descent training in the infinite-width limit. Consider a neural network parameterized by \theta, with output function f_\theta(x) for input x, trained to minimize a loss L(f_\theta) via continuous-time gradient flow: \frac{d\theta}{dt} = -\nabla_\theta L(f_\theta). This induces an evolution on the function itself, \frac{df_\theta}{dt}(x) = \sum_i \frac{\partial f_\theta}{\partial \theta_i}(x) \frac{d\theta_i}{dt}, which simplifies under the scaling of the infinite-width regime—where parameters are initialized with variance scaling inversely with width—to a kernel-induced dynamics for least-squares loss: \frac{\partial f}{\partial t}(x) = -\mathbb{E}_{x'}[\Theta(t; x, x')(f(t; x') - y)], with \Theta denoting the NTK. The NTK arises as the kernel governing this evolution and takes the form \Theta(t; x, x') = \mathbb{E}_\theta \left[ \sum_i \frac{\partial f_\theta}{\partial \theta_i}(t; x) \frac{\partial f_\theta}{\partial \theta_i}(t; x') \right], where the expectation is over random initializations \theta(0). This expression captures the instantaneous linearization of the network's gradient at time t, reflecting how perturbations in parameters affect outputs at inputs x and x'. In the infinite-width limit, the NTK separates the function evolution from parameter updates, enabling analysis of training as kernel gradient descent. For multilayer networks, the NTK is computed recursively layer by layer. Let \Sigma^l(x, x') be the covariance kernel of the pre-activations at layer l, satisfying \Sigma^l(x, x') = \sigma_w^2 \mathbb{E}[\sigma(u) \sigma(u')] + \sigma_b^2, where (u, u') are jointly Gaussian with zero mean and covariance given by \Sigma^{l-1} evaluated at the pair (x, x'), and \sigma_w^2, \sigma_b^2 are the variances of weights and biases. The NTK \Theta^l(x, x') then satisfies the recurrence \Theta^l(x, x') = \Theta^{l-1}(x, x') \cdot \mathbb{E}[\dot{\sigma}(u) \dot{\sigma}(u')] + \Sigma^l(x, x'), where \dot{\sigma} is the derivative of the activation \sigma, starting from the input kernel \Theta^0(x, x') = x^\top x'. For example, with \sigma(z) = \tanh(z), \dot{\sigma}(z) = 1 - \tanh^2(z), and the expectations are computed over the correlated Gaussians. This builds the full NTK layer by layer, converging deterministically as width tends to infinity. A proof sketch for the NTK's behavior in the infinite-width limit employs series expansions of the parameter flow or mean-field approximations to the stochastic differential equations governing wide networks. These show that fluctuations vanish, yielding a constant limiting kernel \Theta independent of training time, so that \frac{\partial f}{\partial t}(x) = -\mathbb{E}_{x'}[\Theta(x, x')(f(x') - y)].
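For ReLU activations, the Gaussian expectations in the recursion above have closed arc-cosine forms, so the \Sigma^l and \Theta^l recursions can be evaluated exactly without sampling. The sketch below is illustrative: it assumes the common convention \sigma_w^2 = 2, \sigma_b^2 = 0 and applies the weight variance to the derivative term as well, which is one standard way of writing the recursion.

```python
import numpy as np

def relu_expectations(kxx, kxy, kyy):
    """Closed-form E[relu(u) relu(v)] and E[relu'(u) relu'(v)] for centered Gaussians."""
    c = np.clip(kxy / np.sqrt(kxx * kyy), -1.0, 1.0)
    angle = np.arccos(c)
    e_ss = np.sqrt(kxx * kyy) * (np.sin(angle) + (np.pi - angle) * c) / (2 * np.pi)
    e_dd = (np.pi - angle) / (2 * np.pi)
    return e_ss, e_dd

def relu_ntk(X, depth, sw2=2.0, sb2=0.0):
    """Recursive NNGP covariance Sigma and NTK Theta for a fully connected ReLU network."""
    sigma = X @ X.T + sb2                      # Sigma^0: input covariance (plus bias variance)
    theta = sigma.copy()                       # Theta^0
    for _ in range(depth):
        d = np.diag(sigma)
        kxx, kyy = d[:, None], d[None, :]      # per-input variances broadcast over all pairs
        e_ss, e_dd = relu_expectations(kxx, sigma, kyy)
        sigma = sw2 * e_ss + sb2
        theta = theta * sw2 * e_dd + sigma
    return theta, sigma

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # inputs on the unit sphere
Theta, Sigma = relu_ntk(X, depth=3)
print(np.linalg.eigvalsh(Theta))               # all positive => positive-definite Gram matrix
```

The same recursion with the \tanh example in the text would replace the closed forms with numerical Gaussian expectations, for example via Gauss–Hermite quadrature.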

Properties of the NTK

Kernel Evolution During Training

In the infinite-width limit, the neural tangent kernel (NTK) is deterministic and remains constant throughout training, satisfying Θ(t) = Θ(0) for all times t ≥ 0. This constancy arises because the kernel's variation during gradient descent scales as O(1/w), where w denotes the network width, rendering the evolution negligible as w → ∞. In finite-width networks, empirical studies reveal that the NTK evolves slowly during training, deviating from the infinite-width constancy while still approximating kernel regression dynamics over much of the optimization process. Recent analyses from 2023 highlight end-of-training dynamics in which the eigenvectors of the NTK align with the data structure, such as class labels or sample-specific features, leading to a block-diagonal form that enhances intra-class correlations and simplifies late-stage convergence. These alignments, observed across architectures like ResNet and DenseNet on standard image-classification datasets, underscore how finite-width effects amplify feature-specific adaptations at the end of training.
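The width dependence of this drift can be probed directly in a toy setting. The sketch below (an illustrative two-layer ReLU model trained by full-batch gradient descent on random regression targets; all names, sizes, and hyperparameters are assumptions) compares the empirical NTK before and after training; the relative change should shrink as the width grows.

```python
import numpy as np

def output_and_grad(theta, x, d_in, width):
    """Scalar output of a two-layer ReLU net (flat parameters) and its parameter gradient."""
    W1 = theta[:width * d_in].reshape(width, d_in)
    w2 = theta[width * d_in:]
    pre = W1 @ x
    f = w2 @ np.maximum(pre, 0.0) / np.sqrt(width)
    g = np.concatenate([np.outer(w2 * (pre > 0), x).ravel(),
                        np.maximum(pre, 0.0)]) / np.sqrt(width)
    return f, g

def empirical_ntk(theta, X, d_in, width):
    G = np.stack([output_and_grad(theta, x, d_in, width)[1] for x in X])
    return G @ G.T

rng = np.random.default_rng(0)
d_in, n = 3, 8
X = rng.standard_normal((n, d_in))
y = rng.standard_normal(n)

for width in (64, 4096):
    theta = rng.standard_normal(width * (d_in + 1))
    K0 = empirical_ntk(theta, X, d_in, width)
    for _ in range(200):                       # full-batch gradient descent on MSE
        grad = np.zeros_like(theta)
        for x_i, y_i in zip(X, y):
            f_i, g_i = output_and_grad(theta, x_i, d_in, width)
            grad += (f_i - y_i) * g_i
        theta -= 0.1 * grad / n
    Kt = empirical_ntk(theta, X, d_in, width)
    drift = np.linalg.norm(Kt - K0) / np.linalg.norm(K0)
    print(f"width={width:5d}  relative NTK change after training = {drift:.4f}")
```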

Determinism and Stability

At initialization, the neural tangent kernel (NTK) of wide neural networks concentrates around its mean value due to the central limit theorem applied to the network outputs, which behave like Gaussian processes in the infinite-width limit. This concentration arises as the width w increases, with the variance of the NTK scaling as O(1/w), ensuring that the random NTK approaches a deterministic limit with high probability. The NTK is positive semi-definite at initialization, a property that holds strictly for nonpolynomial activation functions regardless of network depth. This positive semi-definiteness allows the NTK to satisfy the conditions of Mercer's theorem, facilitating its decomposition into a feature map and enabling applications in kernel-based analysis and approximations. Randomness in the finite-width NTK at initialization can be mitigated through ensemble averaging over multiple random initializations, which converges to the exact infinite-width limiting kernel by estimating its expectation. This averaging technique reduces variance and provides a practical way to approximate the deterministic kernel regime empirically.
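Ensemble averaging and the O(1/w) variance scaling can be illustrated with the same kind of toy model (again an assumed two-layer ReLU network; names and sizes are illustrative): the entrywise spread of the empirical NTK across independent initializations should shrink as the width grows, while the ensemble mean approaches the limiting kernel.

```python
import numpy as np

def empirical_ntk(theta, X, d_in, width):
    """Empirical NTK of a two-layer ReLU net with flat parameter vector theta."""
    W1 = theta[:width * d_in].reshape(width, d_in)
    w2 = theta[width * d_in:]
    grads = []
    for x in X:
        pre = W1 @ x
        grads.append(np.concatenate([np.outer(w2 * (pre > 0), x).ravel(),
                                     np.maximum(pre, 0.0)]) / np.sqrt(width))
    G = np.stack(grads)
    return G @ G.T

rng = np.random.default_rng(0)
d_in, n_seeds = 3, 20
X = rng.standard_normal((5, d_in))

for width in (64, 1024, 16384):
    kernels = np.stack([empirical_ntk(rng.standard_normal(width * (d_in + 1)), X, d_in, width)
                        for _ in range(n_seeds)])
    ensemble_mean = kernels.mean(axis=0)        # estimate of the deterministic limit
    fluctuation = kernels.std(axis=0).mean()    # average entrywise spread across seeds
    print(f"width={width:6d}  mean entrywise std across initializations = {fluctuation:.4f}")
```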

Interpretations

Linearization of Neural Networks

The linearization of neural networks via the neural tangent kernel (NTK) provides a first-order Taylor expansion of the network's output around its initial parameters, treating the model as locally linear in the parameter space during training. Specifically, for a network function f(x; \theta) where \theta denotes the parameters and x the input, the linearized model is given by f_{\text{lin}}(x; \theta) = f(x; \theta_0) + \langle \theta - \theta_0, \nabla_\theta f(x; \theta_0) \rangle, where \theta_0 is the initialization and \nabla_\theta f(x; \theta_0) is the gradient of the output with respect to the parameters at initialization. This perspective views the evolution of the network during training as movement in the tangent space to the manifold of network functions, where changes in the output are linearly proportional to changes in parameters. In the infinite-width limit, where the hidden layer widths scale to infinity, this linearization becomes exact, as the NTK remains constant throughout training, decoupling parameter updates from nonlinear feature learning. The network output evolves according to a linear model in parameter space: f(\theta_t) = f(\theta_0) + J(\theta_0) (\theta_t - \theta_0), scaled appropriately by width factors (typically O(1/\sqrt{w}) output scaling, where w is the width), leading to deterministic behavior independent of specific initializations beyond their statistical properties. This global linearity arises because the Jacobian J(\theta) = \partial f / \partial \theta stabilizes, preventing the network from deviating from its initial tangent plane during optimization. The NTK itself is the Gram matrix of the Jacobian rows across inputs, defined as \Theta(x, x') = \langle \nabla_\theta f(x; \theta_0), \nabla_\theta f(x'; \theta_0) \rangle, which captures the inner product structure in the tangent space and governs the kernel-induced dynamics of the updates. This highlights how wide networks approximate a linear model whose features are the parameter-gradient vectors, enabling analysis of training trajectories as projections onto principal components defined by the NTK's eigenspectrum. Consequently, training a wide network under gradient descent with mean-squared error loss is equivalent to performing kernel regression in the reproducing kernel Hilbert space induced by the NTK, minimizing \min_g \|y - \Theta^{1/2} g\|_2^2 + \lambda \|g\|_2^2, where g represents coefficients in the tangent feature basis and \lambda is a regularization parameter determined by the training setup and width. This equivalence underscores the NTK's role in transforming nonlinear network optimization into a solvable linear problem, providing insights into convergence rates determined by the kernel's eigenvalues.
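The accuracy of the tangent-plane approximation and its improvement with width can be checked numerically. The sketch below (an illustrative two-layer ReLU model; the O(1/\sqrt{w}) per-coordinate perturbation standing in for a gradient-descent-sized update is an assumption of the example) compares the true output after a parameter perturbation with the linearized prediction f_{\text{lin}}.

```python
import numpy as np

def output_and_grad(theta, x, d_in, width):
    """Two-layer ReLU net with flat parameters: scalar output and its parameter gradient."""
    W1 = theta[:width * d_in].reshape(width, d_in)
    w2 = theta[width * d_in:]
    pre = W1 @ x
    f = w2 @ np.maximum(pre, 0.0) / np.sqrt(width)
    g = np.concatenate([np.outer(w2 * (pre > 0), x).ravel(),
                        np.maximum(pre, 0.0)]) / np.sqrt(width)
    return f, g

rng = np.random.default_rng(1)
d_in = 5
x = rng.standard_normal(d_in)

for width in (64, 1024, 16384):
    theta0 = rng.standard_normal(width * (d_in + 1))
    f0, g0 = output_and_grad(theta0, x, d_in, width)
    delta = rng.standard_normal(theta0.shape) / np.sqrt(width)  # update-sized perturbation
    f_true, _ = output_and_grad(theta0 + delta, x, d_in, width)
    f_lin = f0 + g0 @ delta                                     # tangent-plane prediction
    print(f"width={width:6d}  |f - f_lin| = {abs(f_true - f_lin):.2e}")
```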

Connection to Kernel Methods

The neural tangent kernel (NTK) establishes a direct equivalence between the training dynamics of overparameterized neural networks and classical kernel methods. Specifically, in the infinite-width limit under the NTK parameterization—where weights are rescaled by the inverse square root of the layer width—gradient descent on the network parameters corresponds precisely to kernel gradient descent in function space using the fixed NTK as the kernel. This means the evolution of the network's predictions during training follows the same continuous-time dynamics as kernel ridge regression or gradient descent in kernel methods, but with the NTK serving as the reproducing kernel. This connection draws an analogy to random features models, a technique for approximating kernel methods by mapping inputs to a finite-dimensional feature space via random projections. In the NTK regime, a wide network at initialization behaves as an infinite random features model, where the feature map is given by the gradient of the network output with respect to the parameters, \nabla_\theta f_\theta(x), and the inner product of these features yields the empirical NTK. As the width tends to infinity, this converges to a deterministic kernel, allowing wide networks to be interpreted as kernel machines with architecture-induced random features rather than hand-designed ones. The NTK further defines a reproducing kernel Hilbert space (RKHS) tailored to the network's geometry, where functions in the space are those achievable as linearizations around initialization. The RKHS norm for a function f is given by \|f\|_\Theta = \inf \left\{ \|\theta\| : f(\cdot) = \theta^\top \nabla_\theta f_{\theta_0}(\cdot) \right\}, measuring the minimal parameter perturbation needed to realize f via the network's tangent plane, thus linking function-space complexity to parameter-space complexity. However, the NTK differs from traditional fixed kernels like the radial basis function (RBF) kernel, which are typically shift-invariant and agnostic to data or architecture; the NTK is inherently data-dependent, varying with input pairs (x, x'), and architecture-specific, influenced by factors such as depth, activation functions, and initialization schemes. Recent analyses have raised critiques regarding the practical validity of these interpretations, particularly the assumptions of lazy training and exact kernel equivalence in finite-width settings, suggesting discrepancies between NTK predictions and actual network behavior.

Main Theoretical Results

Convergence in Function Space

In the context of the neural tangent kernel (NTK), training dynamics in the infinite-width limit can be analyzed through the lens of kernel gradient descent in function space, where gradient flow on the mean-squared error (MSE) loss leads to convergence toward the function that minimizes \|f - y\|^2, with f denoting the network function and y the labels. Specifically, under gradient flow, the evolution of the residual g_t = f_t - y follows \partial_t g_t = -\Theta g_t, where \Theta is the NTK operator, resulting in an exponential decay governed by the eigenvalues of \Theta. This dynamics ensures that the error decreases to zero as time t \to \infty, provided the target function lies in the reproducing kernel Hilbert space (RKHS) induced by the NTK. A central theorem establishes that this convergence occurs in a time scale of O(1/\lambda_{\min}), where \lambda_{\min} is the smallest eigenvalue of the NTK Gram matrix \tilde{K}, assuming the NTK remains constant during training in the infinite-width regime. Theorem 2 in the foundational work formalizes this for convex loss functions, including MSE, by showing that the NTK's positive-definiteness implies the loss is strictly convex in the function space, guaranteeing convergence to the unique global minimizer. In the overparameterized setting, where the number of parameters exceeds the data dimensionality, the solution corresponds to ridgeless kernel regression, yielding an interpolating function f_\infty that perfectly fits the training data: f_\infty(x_k) = y_k for all training points x_k, with the explicit form f_\infty(x) = \kappa_x^\top \tilde{K}^{-1} y + (f_0(x) - \kappa_x^\top \tilde{K}^{-1} y_0), where \kappa_x is the kernel vector at x and f_0, y_0 are initial values. The positivity of the NTK, as established by Proposition 2, ensures that the minimizer is unique within the RKHS, as the NTK induces a positive-definite inner product that prevents multiple solutions to the regression problem. Furthermore, when \Theta is positive definite, the convergence rate is exponential, with residual components decaying as e^{-\lambda_i t} along the eigenspaces of \Theta, and the slowest rate determined by \lambda_{\min}; for instance, in numerical examples, the residual norm under the input measure can decay as approximately 0.5 e^{-\lambda_2 t} for a small eigenvalue \lambda_2. This analysis holds under the assumption that the NTK is time-invariant, a property that emerges in the infinite-width limit and simplifies the training dynamics to a linear system.
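Under a constant positive-definite NTK, the residual dynamics \partial_t g_t = -\Theta g_t have the closed-form solution g_t = e^{-\Theta t} g_0, so each eigencomponent decays at its own exponential rate and the slowest mode is set by \lambda_{\min}. A short sketch under that assumption (the Gram matrix below is a synthetic stand-in, not an NTK computed from a network):

```python
import numpy as np

def residual_at_time(theta_gram, g0, t):
    """Closed-form solution g(t) = exp(-Theta t) g(0) via the eigendecomposition of Theta."""
    lam, V = np.linalg.eigh(theta_gram)
    return V @ (np.exp(-lam * t) * (V.T @ g0))

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
Theta = A @ A.T + 0.1 * np.eye(6)   # synthetic positive-definite stand-in for an NTK Gram matrix
g0 = rng.standard_normal(6)         # initial residual f_0(X) - y

print("lambda_min =", np.linalg.eigvalsh(Theta).min())
for t in (0.0, 1.0, 10.0, 100.0):
    print(f"t = {t:6.1f}   ||g(t)|| = {np.linalg.norm(residual_at_time(Theta, g0, t)):.6f}")
```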

Generalization Bounds

In the neural tangent kernel (NTK) regime, where infinite-width neural networks behave as kernel machines, generalization bounds leverage classical results from reproducing kernel Hilbert space (RKHS) theory applied to the NTK \Theta. The excess generalization error for kernel ridge regression using \Theta satisfies \text{Gen error} \leq O\left(\sqrt{\frac{d}{n}} \log n\right), where d denotes the effective dimension of \Theta, typically defined as d = \sum_i \frac{\lambda_i}{\lambda + \lambda_i} with \lambda_i the eigenvalues of the integral operator induced by \Theta and \lambda > 0 the regularization parameter, and n is the number of training samples. This bound captures the trade-off between model complexity (via d) and sample size, arising from uniform convergence arguments such as Rademacher complexity bounds for the RKHS norm-bounded function class. The eigenvalue spectrum of the NTK fundamentally determines its effective dimension and thus the model's capacity for generalization, mirroring the covariance structure in Gaussian processes. For ReLU-activated fully connected networks, the NTK eigenvalues often exhibit power-law decay (e.g., \lambda_k \sim k^{-\alpha} with \alpha > 1 depending on depth and input dimension), which bounds d sublinearly in the ambient dimension and promotes low-frequency bias, enhancing out-of-sample performance by limiting overfitting to noise. This spectral decay ensures that the effective dimension remains manageable even for high-dimensional inputs, akin to the prior induced by a Gaussian process with the conjugate NTK kernel. Within the NTK framework, double descent emerges implicitly at the interpolation threshold, where test error peaks when the effective model complexity matches n and subsequently declines in the overparameterized regime due to the stabilizing eigenvalue distribution of \Theta. This behavior aligns overparameterization with improved generalization without explicit regularization. From 2020 to 2022, theoretical advances have refined the bias-variance decomposition in the NTK regime for wide neural networks, showing that bias decreases with increasing width while variance follows a non-monotonic curve driven by the NTK's multi-scale spectral structure, yielding double or triple descent in high dimensions. These results establish that optimal generalization occurs beyond the interpolation threshold, with variance controlled by the tail of the NTK eigenvalue spectrum.
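The effective dimension in the bound above is a simple functional of the kernel spectrum and can be computed directly. In the sketch below, the power-law spectrum \lambda_k \sim k^{-\alpha} is an assumed model of an NTK spectrum rather than one measured from a network.

```python
import numpy as np

def effective_dimension(eigvals, ridge):
    """Effective dimension d = sum_i lambda_i / (ridge + lambda_i) of a kernel spectrum."""
    return float(np.sum(eigvals / (ridge + eigvals)))

k = np.arange(1, 10_001, dtype=float)
for alpha in (1.5, 2.0, 3.0):                  # assumed power-law decay exponents
    eigvals = k ** -alpha
    for ridge in (1e-2, 1e-4):
        d = effective_dimension(eigvals, ridge)
        print(f"alpha={alpha}  ridge={ridge:.0e}  effective dimension = {d:9.1f}")
```

Faster spectral decay (larger \alpha) yields a smaller effective dimension for the same ridge, tightening the bound.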

Applications

Overparameterized Models

In overparameterized neural networks, where the number of parameters p greatly exceeds the number of training samples n (i.e., p \gg n), the neural tangent kernel (NTK) framework explains how wide networks can achieve zero training error, entering the interpolation regime. This occurs because, in the infinite-width limit, the network's evolution under gradient descent behaves linearly, allowing the model to fit any training data perfectly, as the NTK acts as a fixed positive-definite kernel that enables exact interpolation. The NTK also elucidates the implicit bias of gradient descent in this regime, directing the solution toward the minimum-norm interpolant in the parameter space, which corresponds equivalently to the minimum reproducing kernel Hilbert space (RKHS) norm solution induced by the NTK. This bias favors smoother functions in the function space, promoting generalization despite interpolation, as the RKHS norm penalizes complex representations. A key phenomenon explained by the NTK is the double descent curve in test error, where error first decreases, peaks near the interpolation threshold (p \approx n), and then decreases again in the overparameterized regime. This behavior arises from the eigenvalue spectrum of the NTK: smaller eigenvalues in the overparameterized limit lead to slower learning of certain directions but overall better alignment with the data distribution, mitigating the peak at interpolation. Empirical validations of the NTK in overparameterized settings include experiments on MNIST and CIFAR-10 using wide ResNets trained in the lazy regime, where parameter updates are small and the NTK remains nearly constant. For instance, on CIFAR-10, the NTK approximation for wide ResNets achieves approximately 77% test accuracy, consistent with the infinite-width kernel limit and capturing lazy training dynamics, though lower than feature-adapting finite networks (around 90% or higher depending on depth). Similar results hold for MNIST, where NTK-based regression achieves near-perfect training fit and competitive test performance (over 98% accuracy) in wide architectures.

Kernel Regression Equivalence

The neural tangent kernel (NTK) provides an exact equivalence to kernel ridge regression in the infinite-width limit of neural networks, allowing learning problems to be reformulated and solved using standard kernel methods. Specifically, for a dataset of n training points \{x_i, y_i\}_{i=1}^n, the NTK matrix \Theta is computed as \Theta_{ij} = \Theta(x_i, x_j), where \Theta(\cdot, \cdot) is the NTK function. The ridge regression solution is then obtained by solving the linear system (\Theta + \lambda I)\alpha = y for the coefficients \alpha \in \mathbb{R}^n, with regularization parameter \lambda > 0. Predictions for a new input x are given by f(x) = k(x)^T \alpha, where k(x) = [\Theta(x, x_1), \dots, \Theta(x, x_n)]^T. This formulation leverages the reproducing kernel Hilbert space (RKHS) properties of the NTK, enabling the neural network's output to be expressed as a finite linear combination of kernel evaluations on the training data. In the ridgeless limit as \lambda \to 0, the solution reduces to the minimum-norm interpolant using the Moore-Penrose pseudoinverse \Theta^+, yielding \alpha = \Theta^+ y and predictions f(x) = k(x)^T \Theta^+ y. This case achieves perfect interpolation on the training set when \Theta is positive semi-definite and full rank, mirroring the behavior of overparameterized neural networks that achieve zero training loss. The NTK's positive-definiteness, established for certain activation functions like ReLU on the hypersphere with at least two hidden layers, ensures the existence of such a unique minimum-norm solution. Training via gradient descent in the kernel regime corresponds to iterative kernel gradient descent, which converges to the ridge regression solution under the constant NTK assumption. Starting from an initial function f_0, the updates follow f_{t+1} = f_t - \eta \nabla_f \mathcal{L}(f_t), where the gradient is \nabla_f \mathcal{L}(f) = \Theta (f(X) - y) for squared loss \mathcal{L}(f) = \frac{1}{2} \|f(X) - y\|^2. With a sufficiently small learning rate \eta < 2 / \lambda_{\max}(\Theta), this process converges exponentially to the ridge solution f^* = \Theta (\Theta + \lambda I)^{-1} y, providing a dynamical systems interpretation of kernel methods. Exact computation of the kernel ridge solution requires forming and inverting the n \times n NTK matrix, incurring O(n^3) time and space complexity, which limits scalability for large n. To address this, approximations such as the Nyström method subsample the kernel matrix to estimate \Theta \approx Q M^{-1} Q^T, where Q is an n \times m subsampled block with m \ll n and M its m \times m submatrix, reducing complexity to O(n m^2 + m^3). Alternatively, random features approximate the NTK via Monte Carlo sampling of its eigenfunction expansion, mapping inputs to a finite-dimensional feature space \phi(x) \in \mathbb{R}^d such that \Theta(x, x') \approx \phi(x)^T \phi(x'), enabling linear regression in O(n d^2 + d^3) time for moderate d. These techniques preserve the theoretical guarantees of kernel regression while enabling practical use on datasets beyond small scales.
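The ridge solution, its ridgeless limit, and a Nyström approximation can all be written in a few lines. The sketch below uses a synthetic placeholder kernel where an NTK Gram matrix would normally be plugged in; all names are illustrative, not a reference implementation.

```python
import numpy as np

def krr_fit(K, y, ridge):
    """Solve (K + ridge * I) alpha = y for the kernel ridge regression coefficients."""
    return np.linalg.solve(K + ridge * np.eye(K.shape[0]), y)

def krr_predict(k_test, alpha):
    """Predict f(x) = k(x)^T alpha, with k_test[i, j] = Theta(x_test_i, x_train_j)."""
    return k_test @ alpha

def nystrom(K, m, rng):
    """Nystrom approximation K ~ Q M^+ Q^T built from m sampled columns."""
    idx = rng.choice(K.shape[0], size=m, replace=False)
    Q = K[:, idx]
    M = K[np.ix_(idx, idx)]
    return Q @ np.linalg.pinv(M) @ Q.T

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
K = (1.0 + X @ X.T) ** 2                 # placeholder kernel; substitute an NTK Gram matrix here
y = rng.standard_normal(200)

alpha = krr_fit(K, y, ridge=1e-3)
train_fit = krr_predict(K, alpha)               # predictions on the training inputs
alpha_ridgeless = np.linalg.pinv(K) @ y         # minimum-norm interpolant as ridge -> 0
K_ny = nystrom(K, m=50, rng=rng)
print("train residual:", np.linalg.norm(train_fit - y))
print("Nystrom relative error:", np.linalg.norm(K - K_ny) / np.linalg.norm(K))
```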

Practical Applications

Beyond theoretical insights, the NTK has practical uses in various domains. In neural architecture search (NAS), the NTK's eigenvalue spectrum or condition number serves as a training-free proxy to predict architecture performance and trainability without full training, correlating with outcomes on benchmarks such as CIFAR-10 and ImageNet. For small-data image classification, NTK kernel regression outperforms finite-width ResNet-34 models on limited samples, achieving consistently better accuracy on subsets of CIFAR-10 with 640 or fewer training examples due to lower variance and fewer hyperparameters. The NTK also applies to matrix completion tasks, such as virtual drug screening, where precomputed NTK Gram matrices enable efficient optimization. Similarly, for image inpainting with convolutional networks, the NTK facilitates optimization using precomputed Gram matrices on images of size 2^p \times 2^q.

Extensions

Finite Width and Other Architectures

In finite-width neural networks, the neural tangent kernel (NTK) at initialization is a random matrix whose expectation converges to the infinite-width limit as the width increases, but it exhibits fluctuations of order O(1/\sqrt{w}), where w is the width of the hidden layers. During training with gradient descent, these fluctuations cause the NTK to evolve away from its initial value, leading to deviations from the lazy training regime observed in the infinite-width case, where the NTK remains constant. This evolution introduces feature learning effects even in wide but finite networks, with the magnitude of changes scaling inversely with width, impacting convergence and generalization. Extensions of the NTK to convolutional neural networks (CNNs) define the convolutional NTK (CNTK), which incorporates the spatial structure of convolutional layers while preserving the kernel's analytic tractability. For ReLU activations, the CNTK builds on arc-cosine kernels to compute the inner products between feature maps, enabling exact evaluation for infinitely wide CNNs. Introduced by Arora et al. in 2019, this framework demonstrates that wide CNNs trained with gradient descent approximate kernel regression using the CNTK, achieving competitive performance on image classification tasks like CIFAR-10, where CNTK accuracy trails fully trained networks by only 6-7%. The CNTK thus bridges theoretical analysis with practical CNN architectures by fusing convolutional invariances into the kernel computation. For transformer architectures, the NTK formulation accounts for self-attention mechanisms and positional encodings, resulting in structured kernels that capture sequence dependencies. Positional encodings, such as sinusoidal or learned variants, impose a banded or Toeplitz structure on the NTK, enhancing its ability to model long-range interactions in sequences. This extension, explored in works like Hron et al. (2020), reveals that infinitely wide attention networks in the NTK regime perform implicit kernel regression on input embeddings, with applications in tasks such as language modeling and translation, where the kernel's spectral properties influence extrapolation to longer sequences. Graph neural tangent kernels (GNTKs) extend the NTK to graph neural networks (GNNs) by integrating the graph structure into the kernel definition, treating infinitely wide GNNs as kernel machines over graph signals. The GNTK aggregates neighborhood information through message-passing operations, yielding a kernel that is invariant to graph permutations and incorporates Laplacian eigenvalues. Proposed by Du et al. in 2019, GNTKs combine the expressive power of GNNs with the theoretical guarantees of kernel methods, enabling provable learning of smooth graph functions and applications in node classification and graph classification, where they outperform traditional graph kernels on benchmarks like molecular property prediction.

Recent Developments

In 2023, researchers introduced the weighted neural tangent kernel (WNTK), which extends the standard NTK by incorporating sample weights into the kernel formulation to better model training dynamics under adjusted optimization schemes. This generalization addresses limitations of the NTK in capturing optimizers beyond plain gradient descent, demonstrating improved accuracy—such as a 2.14% gain on binary-classification tasks—and faster convergence to a stable kernel in the infinite-width limit, as proven through stability theorems. Advancing the analysis of training dynamics, a 2025 study examined the evolution of NTK eigenvectors at the edge of stability (EoS), a regime where the NTK's largest eigenvalue oscillates at a level inversely proportional to the learning rate during training. The work reveals that higher learning rates enhance the alignment of leading NTK eigenvectors and the full kernel matrix with training targets across various architectures, providing theoretical insights via a two-layer linear network model and empirical validation on deep networks. This eigenvector dynamics perspective deepens understanding of generalization and optimization behavior in overparameterized models. Theoretical progress in 2024 established strict positivity of the NTK for networks of arbitrary depth, provided the activation function is non-polynomial, using a characterization of polynomial functions to prove positive-definiteness. This result strengthens connections between NTK positivity and the memorization capacity of wide networks, enabling zero training loss via gradient descent and offering implications for reliable optimization landscapes. Recent applications of the NTK framework have illuminated score estimation in diffusion models, where neural networks trained by gradient descent approximate score functions for generative tasks. Leveraging NTK theory, analyses from 2024 derive explicit rates for these approximations, highlighting how the kernel regime explains optimization and generalization in score-based generative modeling, with bounds scaling favorably in network width.

Limitations

Key Assumptions

The Neural Tangent Kernel (NTK) theory fundamentally relies on the infinite-width limit, where the widths of all hidden layers approach infinity, ensuring that the network function converges to a Gaussian process at initialization and that the NTK remains constant and deterministic throughout training under the NTK parameterization. This limit, combined with a sufficiently small learning rate to maintain the "lazy" training regime—where parameters undergo minimal deviations from their initial values—enables the equivalence between network training and kernel gradient descent, but these assumptions break down in feature-learning regimes where larger learning rates or adaptive scaling allow nonlinear adaptations beyond linearization. Initialization in NTK analyses adopts a mean-field perspective, assuming weights are drawn independently and identically from a standard Gaussian distribution \mathcal{N}(0,1), scaled appropriately by layer depth and width to control variance; this i.i.d. setup facilitates the probabilistic convergence of the kernel but neglects dependencies introduced by mechanisms like batch normalization, which couple parameters across samples and layers. Activation functions are presumed to be smooth, specifically continuous and twice differentiable with bounded second derivatives (e.g., sigmoid or hyperbolic tangent), to preserve the positive-definiteness and stability of the NTK; while ReLU activations are frequently employed in practice and yield tractable kernels despite their piecewise linearity and non-differentiability at zero, more complex smooth functions like Swish can alter the kernel's form due to their multiplicative gating structure, potentially introducing non-stationarities in the feature map derivatives. The input distribution is assumed to be well-behaved, with samples drawn i.i.d. from a fixed measure supported on a compact domain such as the unit sphere, ensuring the empirical kernel matrix is well-conditioned and invertible for convergence guarantees; this precludes adversarial inputs or distribution shifts that could render the kernel degenerate or violate the required separability of points.

Empirical and Theoretical Gaps

Despite its theoretical appeal, the neural tangent kernel (NTK) framework exhibits a significant mismatch with the behavior of finite-width neural networks, as it primarily describes the "lazy training" regime where network parameters change minimally during optimization, leading to linearized dynamics around initialization. In contrast, real-world finite-width networks often operate in the "rich" or feature-learning regime, where parameters evolve substantially to adapt representations, enabling better generalization but deviating from NTK predictions. This discrepancy arises because the NTK assumes an infinite-width limit, which imposes unrealistic width requirements in practice, with finite-width corrections involving higher-order terms that are not fully captured in the standard theory. Empirically, experiments on models like three-layer linear networks and convolutional neural networks (CNNs) demonstrate stable training only within specific richness parameters, beyond which instability occurs, highlighting the NTK's inability to model feature evolution in practical settings. The NTK's applicability is also architecture-specific, performing adequately for CNNs but poorly for transformer-based models, particularly in tasks involving high-frequency features. For vision transformers (ViTs), NTK-based metrics, such as those used in training-free neural architecture search, yield low correlation with actual performance (e.g., Kendall-Tau correlations below 0.15 in pure ViT spaces), due to the NTK's focus on low-frequency components that overlook the high-frequency features critical to multi-head self-attention mechanisms. In contrast, NTK metrics achieve higher correlations (over 0.5) in CNN search spaces, where low-frequency approximations align better with convolutional inductive biases. A major practical barrier is the NTK's scalability, as computing the empirical NTK Gram matrix for n training samples and input dimension d incurs O(n² d⁴) time complexity in standard implementations, rendering it infeasible for large-scale datasets like ImageNet (n ≈ 10⁶), where full computation could take months on high-end GPUs. While approximations like Nyström methods reduce this to O(n √n log n), they introduce approximation errors that can undermine theoretical guarantees, limiting NTK's use in real-world deployment. Recent open questions in NTK research, particularly from 2024 onward, include its adaptation to continual learning, where the fixed kernel assumption fails in finite-width networks transitioning to the feature-learning regime, leading to catastrophic forgetting and task-dependent gradient updates that diverge from Bayesian ideals; however, recent advancements as of 2025, such as path-coordinated frameworks and parameter-efficient analyses, have begun addressing these challenges. Similarly, robustness to noise remains underexplored, with extensions to aleatoric noise providing estimators for posterior uncertainty but lacking validation on complex datasets or non-Gaussian noise distributions, leaving gaps in handling real-world perturbations like label noise or adversarial inputs.