The Neural Tangent Kernel (NTK) is a kernel function that characterizes the training dynamics of overparameterized neural networks in the infinite-width limit, demonstrating that gradient descent on such networks behaves equivalently to kernel regression using this kernel, which remains constant throughout training under specific parameterizations.[1] Introduced in 2018, the NTK bridges deep learning with classical kernel methods by mapping the evolution of a neural network's output function during optimization to a linear differential equation in function space, where the kernel \Theta(x, x') encodes similarities between inputs based on the network's gradients.[1][2]

At its core, the NTK arises from the empirical kernel \hat{\Theta}_t(x, x') = \nabla_\theta f_t(x)^T \nabla_\theta f_t(x'), where f_t is the network function at time t and \theta are the parameters; in the infinite-width regime, this converges to a deterministic limit that is positive-definite for non-polynomial activations and inputs restricted to the unit sphere, ensuring global convergence to the minimum-norm solution for least-squares loss.[1] Key theoretical results include convergence guarantees requiring network widths scaling polynomially with the number of training samples (e.g., width n > C m^6 \lambda_0^{-4} \delta^{-3} for ReLU networks on m samples, with high probability), and generalization bounds tying error rates to the kernel's eigenvalues and effective dimensionality.[2] These insights explain why wide networks achieve strong generalization despite overparameterization, with training aligning fastest along the kernel's principal components, motivating techniques like early stopping.[1][3]

The NTK framework has been extended beyond fully connected networks to convolutional architectures via the Convolutional NTK (CNTK), enabling exact computation of the kernel limit and achieving competitive performance on tasks like CIFAR-10 classification (e.g., 77% accuracy with infinite-width approximations).[3] Applications include neural architecture search, where NTK spectra predict model performance without full training; small-data supervised learning, outperforming finite-width ResNets on subsets of CIFAR-10 with fewer than 640 samples; and tasks like matrix completion and image inpainting.[2] Further developments incorporate finite-width corrections, multi-head attention layers, and equivariant models, while alternatives like the Neural Network Gaussian Process (NNGP) kernel address initialization behaviors.[2]

Despite its theoretical elegance, the NTK has limitations, including high computational demands—e.g., O(m^3) for inverting the Gram matrix on m samples, rendering it impractical for large datasets like ImageNet—and reliance on unrealistically large widths for convergence in practice.[2] It also struggles with non-smooth operations like max pooling and evolving kernels under standard parameterizations, prompting ongoing research into empirical approximations and broader architectures.[2]
Definition and Motivation
Formal Definition
The neural tangent kernel (NTK) arises in the context of analyzing wide neural networks, where the input space is denoted by \mathcal{X} \subseteq \mathbb{R}^{d_0}, the parameter space by \Theta \subseteq \mathbb{R}^P with P the total number of parameters, and the function space by \mathcal{F} consisting of functions f: \mathcal{X} \to \mathbb{R}^C mapping inputs to C-dimensional outputs.[1]

For a neural network f_\theta: \mathcal{X} \to \mathbb{R}^C parameterized by \theta \in \Theta, the NTK at initialization is defined as the kernel \Theta(x, x') = \mathbb{E}_{\theta \sim \text{init}} \left[ \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x') \right] for inputs x, x' \in \mathcal{X}, where the expectation is over the random initialization of parameters and the gradients are with respect to \theta.[1] This captures the inner product of the parameter gradients for the network outputs at x and x'.

In the scalar output case (C = 1), the NTK takes the form \Theta(x, x') = \sum_{l=1}^L \mathbb{E} \left[ \frac{\partial f}{\partial a_l}(x) \frac{\partial f}{\partial a_l}(x') \right], where a_l denotes the pre-activation at layer l, and the expectation includes recursive contributions from preceding layers that propagate the kernel through the network depth.[1]

For vector-valued outputs (C > 1), the NTK is a C \times C matrix with component-wise entries \Theta^{pq}(x, x') = \mathbb{E}_{\theta \sim \text{init}} \left[ \nabla_\theta f_\theta^p(x)^\top \nabla_\theta f_\theta^q(x') \right] for output dimensions p, q \in \{1, \dots, C\}, where f_\theta^p is the p-th component of the output vector.[1]
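As an illustration of this definition, the empirical (finite-width) kernel \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x') can be computed directly for a small network. The following NumPy sketch assumes a scalar-output, one-hidden-layer ReLU network with 1/\sqrt{\text{width}} output scaling; the helper names (init_params, grads, empirical_ntk) are illustrative rather than part of any standard library.

```python
import numpy as np

def init_params(d, width, rng):
    """NTK-style initialization: i.i.d. standard Gaussian weights."""
    W = rng.standard_normal((width, d))   # first-layer weights
    v = rng.standard_normal(width)        # readout weights
    return W, v

def grads(x, W, v):
    """Gradient of f_theta(x) = v . relu(W x) / sqrt(width) with respect to (W, v)."""
    width = v.shape[0]
    pre = W @ x                           # pre-activations
    act = np.maximum(pre, 0.0)            # ReLU
    dv = act / np.sqrt(width)             # d f / d v_j
    dW = (v * (pre > 0))[:, None] * x[None, :] / np.sqrt(width)  # d f / d W_jk
    return np.concatenate([dW.ravel(), dv])

def empirical_ntk(x1, x2, W, v):
    """Theta_hat(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>."""
    return grads(x1, W, v) @ grads(x2, W, v)

rng = np.random.default_rng(0)
d, width = 3, 4096
W, v = init_params(d, width, rng)
x, xp = rng.standard_normal(d), rng.standard_normal(d)
print(empirical_ntk(x, xp, W, v))         # one draw of the random finite-width kernel
```

Averaging this quantity over many independent draws of (W, v) approximates the expectation appearing in the definition above.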
Historical Development
The concept of the neural tangent kernel (NTK) built upon earlier theoretical investigations into the behavior of wide neural networks, where over-parameterization simplifies optimization and leads to kernel-like dynamics. A key preceding work was the 2014 analysis by Livni, Shalev-Shwartz, and Shamir, which demonstrated that training wide shallow networks can be computationally efficient and equivalent to kernel ridge regression under certain conditions.[4] This laid groundwork for understanding the "lazy" training regime, where parameters remain nearly fixed during gradient descent, as explored in subsequent 2017 studies on loss landscapes in deep and wide networks by Nguyen and Hein, among others.

The NTK was formally introduced in 2018 by Arthur Jacot, Franck Gabriel, and Clément Hongler in their seminal paper "Neural Tangent Kernel: Convergence and Generalization in Neural Networks," which analyzed the training dynamics of deep networks in the infinite-width limit.[1] Published at the NeurIPS conference that year, the work showed that gradient descent on such networks corresponds to kernel gradient descent in function space using the NTK, a fixed kernel that remains stable post-initialization. This formulation had immediate impact by bridging modern deep learning practices with classical kernel methods, offering explanations for the observed convergence and generalization in over-parameterized models through a linearization perspective.[1]

Early extensions in 2019 by Sanjeev Arora and collaborators applied the NTK framework to convolutional neural networks (CNNs), deriving the convolutional NTK (CNTK) and providing efficient algorithms for its exact computation, which revealed similarities in performance between infinite-width CNNs and finite ones.[3]
Mathematical Framework
Infinite Width Limit
The infinite width limit refers to the theoretical regime in which the number of neurons (width) in each hidden layer of a deep neural network is scaled to infinity, enabling precise mathematical analysis of the network's behavior, particularly under gradient-based training. This limit reveals that over-parameterized networks exhibit simplified dynamics, bridging connections to classical methods like kernel regression and Gaussian processes. Seminal work established that, under appropriate scaling, the network's output at initialization and its evolution during training converge to deterministic limits, facilitating the emergence of the neural tangent kernel (NTK).[1]

A key aspect of this limit is the scaling regime designed to maintain signal stability and prevent exploding or vanishing gradients as width grows. The layer widths w are taken to infinity sequentially from input to output layers, while weights are initialized such that their variance is 2/\text{fan-in} for ReLU activations; this He initialization ensures that the variance of pre-activations remains constant (order 1) across layers, promoting reliable convergence in the wide limit. To keep training dynamics stable, the learning rate \eta is scaled proportionally to 1/w in the standard parametrization, counteracting the growth in gradient magnitudes and ensuring the relative parameter updates diminish appropriately.[5]

At initialization, the central limit theorem plays a crucial role: the random features generated by the wide layers, due to the summation over many independent neuron outputs, converge in distribution to a multivariate Gaussian process. This Gaussian process prior captures the network's predictive distribution before training, with the covariance kernel determined recursively through the layer structure and activation functions.[6]

In the infinite width limit, training via gradient descent operates in the "lazy" regime, where parameters undergo negligible changes from their initial values relative to the network's scale. Consequently, the network functions primarily as a fixed, nonlinear feature map composed with an evolving linear readout, akin to kernel ridge regression on the initial features. This lazy behavior arises because the high dimensionality suppresses significant feature adaptation, stabilizing the training trajectory and aligning the network's evolution with that of a deterministic kernel method.[1]
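The variance-preservation argument can be checked numerically. The sketch below is a toy example assuming a plain fully connected ReLU stack with 2/\text{fan-in} Gaussian weights; it propagates a single input through several random layers and prints the empirical pre-activation variance, which stays at the same order across depth.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, depth = 16, 2048, 5
x = rng.standard_normal(d)                # random input with unit variance per coordinate

h = x
for layer in range(depth):
    fan_in = h.shape[0]
    # He initialization: Var[W_ij] = 2 / fan-in keeps pre-activation variance roughly constant
    W = rng.standard_normal((width, fan_in)) * np.sqrt(2.0 / fan_in)
    pre = W @ h
    print(f"layer {layer}: pre-activation variance ~ {pre.var():.3f}")
    h = np.maximum(pre, 0.0)              # ReLU activation
```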
Derivation of the NTK
The derivation of the neural tangent kernel (NTK) begins with the dynamics of a neural network under gradient descent training in the infinite-width limit. Consider a neural network parameterized by \theta, with output function f_\theta(x) for input x, trained to minimize a loss L(f_\theta) via continuous-time gradient flow: \frac{d\theta}{dt} = -\nabla_\theta L(f_\theta).[1] This induces an evolution of the function itself, \frac{df_\theta}{dt}(x) = \sum_i \frac{\partial f_\theta}{\partial \theta_i}(x) \frac{d\theta_i}{dt}, which simplifies under the scaling of the infinite-width regime—where parameters are initialized with variance scaling inversely with width—to a kernel-induced dynamics for least-squares loss: \frac{\partial f}{\partial t}(x) = -\mathbb{E}_{x'}[\Theta(t; x, x')(f(t; x') - y(x'))], with \Theta denoting the NTK.[1]

The NTK arises as the kernel governing this evolution and takes the form \Theta(t; x, x') = \mathbb{E}_\theta \left[ \sum_i \frac{\partial f_\theta}{\partial \theta_i}(t; x) \frac{\partial f_\theta}{\partial \theta_i}(t; x') \right], where the expectation is over random initializations \theta(0).[1] This expression captures the instantaneous linearization of the network's gradient at time t, reflecting how perturbations in parameters affect outputs at inputs x and x'. In the infinite-width limit, the NTK separates the function evolution from parameter updates, enabling analysis of training as kernel gradient descent.[1]

For multilayer networks, the NTK is computed recursively layer by layer. Let \Sigma^l(x, x') be the covariance kernel of the pre-activations at layer l, satisfying \Sigma^l(x, x') = \sigma_w^2 \mathbb{E}[\sigma(u) \sigma(u')] + \sigma_b^2, where (u, u') \sim \mathcal{N}(\mathbf{0}, \Lambda^{l-1}(x, x')) with \Lambda^{l-1}(x, x') the 2 \times 2 covariance matrix built from \Sigma^{l-1}(x, x), \Sigma^{l-1}(x, x'), and \Sigma^{l-1}(x', x'), and \sigma_w^2, \sigma_b^2 are the variances of weights and biases. The NTK \Theta^l(x, x') then satisfies the recurrence \Theta^l(x, x') = \Theta^{l-1}(x, x') \cdot \sigma_w^2 \mathbb{E}[\dot{\sigma}(u) \dot{\sigma}(u')] + \Sigma^l(x, x'), where \dot{\sigma} is the derivative of the activation \sigma, starting from the input kernel \Theta^0(x, x') = x^\top x'. For example, with \sigma(z) = \tanh(z), \dot{\sigma}(z) = 1 - \tanh^2(z), and the expectations are computed over the correlated Gaussians.[1] This builds the full NTK layer by layer, converging deterministically as width tends to infinity.[1]

A proof sketch for the NTK's behavior in the infinite-width limit employs the Dyson series expansion of the parameter flow or mean-field approximations to the stochastic differential equations governing wide networks. These show that fluctuations vanish, yielding a constant limiting kernel \Theta independent of training time, so that \frac{\partial f}{\partial t}(x) = -\sum_{x'} \Theta(x, x')(f(x') - y(x')), with the sum running over the training inputs.[1]
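For ReLU activations the Gaussian expectations in this recursion have closed forms (the arc-cosine kernel), so the limiting NTK of a fully connected network can be evaluated exactly. The following sketch implements the layer-by-layer recursion above for a scalar-output ReLU network; the weight-variance convention (\sigma_w^2 = 2, \sigma_b^2 = 0) and the function names are illustrative choices, and conventions for where the \sigma_w^2 factor appears vary across references.

```python
import numpy as np

def relu_expectations(k11, k12, k22):
    """Closed-form Gaussian expectations for ReLU:
    E[sigma(u) sigma(u')] and E[sigma'(u) sigma'(u')] for (u, u') ~ N(0, [[k11, k12], [k12, k22]])."""
    c = np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0)   # correlation coefficient
    theta = np.arccos(c)
    e_sig = np.sqrt(k11 * k22) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
    e_dsig = (np.pi - theta) / (2 * np.pi)
    return e_sig, e_dsig

def ntk_relu(x, xp, depth, sw2=2.0, sb2=0.0):
    """Layer-by-layer NTK recursion for a fully connected ReLU network."""
    # layer-0 kernels: plain inner products of the inputs
    k_xx, k_xxp, k_xpxp = x @ x, x @ xp, xp @ xp
    theta = k_xxp                                       # Theta^0(x, x')
    for _ in range(depth):
        e_sig_xx, _ = relu_expectations(k_xx, k_xx, k_xx)
        e_sig_xpxp, _ = relu_expectations(k_xpxp, k_xpxp, k_xpxp)
        e_sig, e_dsig = relu_expectations(k_xx, k_xxp, k_xpxp)
        # covariance recursion: Sigma^l = sw2 * E[sigma sigma] + sb2
        new_xx = sw2 * e_sig_xx + sb2
        new_xxp = sw2 * e_sig + sb2
        new_xpxp = sw2 * e_sig_xpxp + sb2
        # NTK recursion: Theta^l = Theta^{l-1} * sw2 * E[sigma' sigma'] + Sigma^l
        theta = theta * sw2 * e_dsig + new_xxp
        k_xx, k_xxp, k_xpxp = new_xx, new_xxp, new_xpxp
    return theta

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(8), rng.standard_normal(8)
print(ntk_relu(x, xp, depth=3))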
Properties of the NTK
Kernel Evolution During Training
In the infinite-width limit, the neural tangent kernel (NTK) is deterministic and remains constant throughout training, satisfying Θ(t) = Θ(0) for all times t ≥ 0.[1] This constancy arises because the kernel's variation during gradient descent scales as O(1/w), where w denotes the network width, rendering the evolution negligible as w → ∞.[1]

In finite-width networks, empirical studies reveal that the NTK evolves slowly during training, deviating from the infinite-width constancy while still approximating kernel regression dynamics over much of the optimization process. Recent analyses from 2023 highlight end-of-training dynamics in which the eigenvectors of the NTK align with the data structure, such as class labels or sample-specific features, leading to a block-diagonal form that enhances intra-class correlations and simplifies late-stage convergence.[7] These alignments, observed across architectures like ResNet and DenseNet on datasets including CIFAR-10, underscore how finite-width effects amplify feature-specific adaptations late in training.[7]
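The width dependence of the kernel's drift can be observed in a toy experiment. The sketch below assumes a one-hidden-layer ReLU network trained by full-batch gradient descent on a small synthetic regression problem and measures the relative change of the empirical NTK between initialization and the end of training; the drift is expected to shrink as the width grows, consistent with the O(1/w) scaling described above (all names and hyperparameters are illustrative).

```python
import numpy as np

def jac(X, W, v):
    """Per-example gradients of f(x) = v . relu(W x) / sqrt(width); rows are d f(x_i) / d theta."""
    width = v.shape[0]
    pre = X @ W.T                                   # (n, width) pre-activations
    act = np.maximum(pre, 0.0)
    dv = act / np.sqrt(width)                       # gradients w.r.t. readout weights
    dW = (pre > 0) * v / np.sqrt(width)             # chain factor for first-layer weights
    dW = dW[:, :, None] * X[:, None, :]             # (n, width, d)
    return np.concatenate([dW.reshape(len(X), -1), dv], axis=1)

def ntk_drift(width, steps=300, lr=0.2, seed=0):
    rng = np.random.default_rng(seed)
    n, d = 8, 3
    X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
    W, v = rng.standard_normal((width, d)), rng.standard_normal(width)
    J0 = jac(X, W, v)
    K0 = J0 @ J0.T                                  # empirical NTK at initialization
    for _ in range(steps):
        J = jac(X, W, v)
        f = (np.maximum(X @ W.T, 0.0) @ v) / np.sqrt(width)
        g = J.T @ (f - y) / n                       # gradient of the MSE loss
        W -= lr * g[:width * d].reshape(width, d)
        v -= lr * g[width * d:]
    J = jac(X, W, v)
    K = J @ J.T                                     # empirical NTK after training
    return np.linalg.norm(K - K0) / np.linalg.norm(K0)

for w in (64, 4096):
    print(f"width {w}: relative NTK change ~ {ntk_drift(w):.3f}")
```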
Determinism and Stability
At initialization, the neural tangent kernel (NTK) of wide neural networks concentrates around its mean value due to the central limit theorem applied to the network outputs, which behave like Gaussian processes in the infinite-width limit. This concentration arises as the width w increases, with the variance of the NTK scaling as O(1/w), ensuring that the random NTK approaches a deterministic limit with high probability.[8]

The NTK is positive semi-definite at initialization, a property that holds strictly for nonpolynomial activation functions regardless of network depth.[1] This positive semi-definiteness allows the NTK to satisfy the conditions of the Mercer theorem, facilitating its decomposition into a feature map and enabling applications in kernel-based analysis and approximations.

Randomness in the finite-width NTK at initialization can be mitigated through ensemble averaging over multiple random initializations, which converges to the exact infinite-width limiting kernel by estimating its expectation.[8] This averaging technique reduces variance and provides a practical way to approximate the deterministic kernel regime empirically.
Interpretations
Linearization of Neural Networks
The linearization of neural networks via the neural tangent kernel (NTK) provides a first-order Taylor approximation of the network's output around its initial parameters, treating the model as locally linear in the parameter space during training. Specifically, for a neural network function f(x; \theta) where \theta denotes the parameters and x the input, the linearized model is given by f_{\text{lin}}(x; \theta) = f(x; \theta_0) + \langle \theta - \theta_0, \nabla_\theta f(x; \theta_0) \rangle, where \theta_0 is the initialization and \nabla_\theta f(x; \theta_0) is the gradient with respect to the parameters at initialization. This approximation views the evolution of the network during gradient descent as movement in the tangent space to the manifold of network functions, where changes in the output are linearly proportional to changes in parameters.[1][9]

In the infinite-width limit, where the hidden layer widths scale to infinity, this local linearization becomes globally exact, as the NTK remains constant throughout training, decoupling parameter updates from nonlinear feature evolution. The network output evolves according to a linear dynamics equation in function space: f(\theta_t) = f(\theta_0) + J(\theta_0) (\theta_t - \theta_0), scaled appropriately by width factors (typically O(1/\sqrt{w}) for learning rate adjustments, where w is the width), leading to deterministic behavior independent of specific initializations beyond their statistical properties. This global linearity arises because the Jacobian J(\theta) = \partial f / \partial \theta stabilizes, preventing the network from deviating from its initial tangent plane during optimization.[1][9]

The NTK itself is the Gram matrix of the Jacobian rows across inputs, defined as \Theta(x, x') = \langle \nabla_\theta f(x; \theta_0), \nabla_\theta f(x'; \theta_0) \rangle, which captures the inner product structure in the tangent space and governs the kernel-induced geometry of the function updates. This interpretation highlights how wide networks approximate a linear model whose features are the Jacobian vectors, enabling analysis of training trajectories as projections onto principal components defined by the NTK's eigenspectrum.[1][9]

Consequently, training a wide neural network under gradient descent with mean-squared error loss is equivalent to performing kernel ridge regression in the function space induced by the NTK, minimizing \min_g \|y - \Theta^{1/2} g\|_2^2 + \lambda \|g\|_2^2, where g represents coefficients in the tangent feature basis and \lambda is a regularization parameter related to the learning rate and width. This equivalence underscores the NTK's role in transforming nonlinear network optimization into a solvable linear problem, providing insights into convergence rates determined by the kernel's eigenvalues.[1][9]
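The quality of this tangent-space approximation can be checked directly by comparing the network's output after a small parameter perturbation with the prediction of the linearized model. The sketch below uses a one-hidden-layer ReLU network and a central-difference estimate of \nabla_\theta f; the perturbation size, width, and helper names are illustrative.

```python
import numpy as np

def f(theta, x, width):
    """Two-layer ReLU network under NTK parameterization; theta packs (W, v)."""
    d = x.shape[0]
    W = theta[:width * d].reshape(width, d)
    v = theta[width * d:]
    return float(np.maximum(W @ x, 0.0) @ v) / np.sqrt(width)

def num_grad(theta, x, width, eps=1e-4):
    """Central-difference estimate of grad_theta f(x; theta)."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e, x, width) - f(theta - e, x, width)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
d, width = 3, 512
theta0 = rng.standard_normal(width * d + width)
x = rng.standard_normal(d)

delta = 0.05 * rng.standard_normal(theta0.size) / np.sqrt(theta0.size)  # small perturbation
g0 = num_grad(theta0, x, width)
f_lin = f(theta0, x, width) + delta @ g0          # first-order Taylor model
f_true = f(theta0 + delta, x, width)
print(f"true: {f_true:.6f}   linearized: {f_lin:.6f}")
```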
Connection to Kernel Methods
The neural tangent kernel (NTK) establishes a direct equivalence between the training dynamics of overparameterized neural networks and classical kernel methods. Specifically, in the infinite-width limit under the NTK parameterization—where weights are rescaled by the inverse square root of the layer width—gradient descent on the network parameters corresponds precisely to kernel gradient descent in function space using the fixed NTK as the kernel matrix. This means the evolution of the network's predictions during training follows the same continuous-time ordinary differential equation as kernel ridge regression or gradient descent in kernel methods, but with the NTK serving as the reproducing kernel.[1]

This connection draws an analogy to random features models, a technique for approximating kernel methods by mapping inputs to a finite-dimensional feature space via random projections. In the NTK regime, a wide neural network at initialization behaves as an infinite random features model, where the feature map is given by the Jacobian of the network output with respect to the parameters, \nabla_\theta f_\theta(x), and the inner product of these features yields the empirical NTK. As the width tends to infinity, this converges to a deterministic kernel, allowing neural networks to be interpreted as kernel machines with architecture-induced random features rather than hand-designed ones.[2]

The NTK further defines a reproducing kernel Hilbert space (RKHS) tailored to neural network geometry, where functions in the space are those achievable as linearizations around initialization. The RKHS norm of a function f is given by \|f\|_\Theta = \inf \left\{ \|\theta - \theta_0\| : f(\cdot) = \langle \theta - \theta_0, \nabla_\theta f_{\theta_0}(\cdot) \rangle \right\}, measuring the minimal parameter perturbation needed to realize f via the network's tangent plane, thus linking generalization to parameter-space complexity.[2] However, the NTK differs from traditional fixed kernels like the radial basis function (RBF), which are typically shift-invariant and agnostic to data or architecture; the NTK is inherently data-dependent, varying with input pairs (x, x'), and architecture-specific, influenced by factors such as depth, activation functions, and initialization schemes.[1][2]

Recent analyses as of 2025 have raised critiques regarding the practical validity of these interpretations, particularly the assumptions of lazy training and exact equivalence in finite-width settings, suggesting discrepancies between NTK predictions and actual neural network behavior.[10]
In the context of the neural tangent kernel (NTK), training dynamics in the infinite-width limit can be analyzed through the lens of kernel regression, where gradient flow on the mean squared error (MSE) loss leads to convergence toward the solution that minimizes \|f - y\|^2, with f denoting the network function and y the target labels.[1] Specifically, under gradient flow, the evolution of the residual g_t = f_t - y follows \partial_t g_t = -\Theta g_t, where \Theta is the NTK operator, resulting in an exponential decay governed by the eigenvalues of \Theta.[1] This dynamics ensures that the training error decreases to zero as time t \to \infty, provided the target function lies in the reproducing kernel Hilbert space (RKHS) induced by the NTK.[1]

A central theorem establishes that this convergence occurs in a time scale of O(1/\lambda_{\min}), where \lambda_{\min} is the smallest eigenvalue of the NTK Gram matrix \tilde{K}, assuming the NTK remains constant during training in the infinite-width regime.[1] Theorem 2 in the foundational work formalizes this for convex loss functions, including MSE, by showing that the NTK's positive-definiteness implies the loss is strictly convex in the function space, guaranteeing convergence to the unique global minimizer.[1] In the overparameterized setting, where the number of parameters exceeds the data dimensionality, the solution corresponds to ridgeless kernel regression, yielding an interpolating function f_\infty that perfectly fits the training data: f_\infty(x_k) = y_k for all training points x_k, with the explicit form f_\infty(x) = \kappa_x^\top \tilde{K}^{-1} y + (f_0(x) - \kappa_x^\top \tilde{K}^{-1} y_0), where \kappa_x is the kernel vector at x and f_0, y_0 are initial values.[1]

The positivity of the NTK, as established by Proposition 2, ensures that the minimizer is unique within the RKHS, as the NTK induces a positive-definite inner product that prevents multiple solutions to the interpolation problem.[1] Furthermore, when \Theta is positive definite, the convergence rate is exponential, with residual components decaying as e^{-\lambda_i t} along the eigenspaces of \Theta, with the slowest rate determined by \lambda_{\min}; for instance, in numerical examples, the residual norm under the input measure can decay as approximately 0.5 e^{-\lambda_2 t} for a small eigenvalue \lambda_2.[1] This analysis holds under the assumption that the NTK is time-invariant, a property that emerges in the infinite-width limit and simplifies the training dynamics to a linear system.[1]
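This linear dynamics can be simulated directly from the eigendecomposition of a kernel Gram matrix: each component of the residual decays as e^{-\lambda_i t} along the corresponding eigenvector. The sketch below uses a random positive-definite matrix as a stand-in for the NTK Gram matrix; it illustrates the closed-form solution of \partial_t g_t = -\Theta g_t rather than any particular network.

```python
import numpy as np

def residual_decay(K, g0, times):
    """Closed-form solution of d g/dt = -K g: components decay as exp(-lambda_i t)
    along the eigenvectors of the (PSD) kernel Gram matrix K."""
    lam, U = np.linalg.eigh(K)
    c0 = U.T @ g0                          # residual expressed in the NTK eigenbasis
    return [U @ (np.exp(-lam * t) * c0) for t in times]

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
K = A @ A.T + 1e-3 * np.eye(n)             # stand-in for a positive-definite NTK Gram matrix
g0 = rng.standard_normal(n)                # initial residual f_0(X) - y

times = (0.0, 0.5, 2.0)
for t, g in zip(times, residual_decay(K, g0, times)):
    print(f"t = {t}: ||residual|| = {np.linalg.norm(g):.4f}")
```

The slowest-decaying component corresponds to \lambda_{\min}, matching the O(1/\lambda_{\min}) convergence time scale stated above.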
Generalization Bounds
In the neural tangent kernel (NTK) regime, where infinite-width neural networks behave as kernel machines, generalization bounds leverage classical results from reproducing kernel Hilbert space (RKHS) theory applied to the NTK \Theta. The excess generalization error for kernel ridge regression using \Theta satisfies \text{Gen error} \leq O\left(\sqrt{\frac{d}{n}} \log n\right), where d denotes the effective dimension of \Theta, typically defined as d = \sum_i \frac{\lambda_i}{\lambda + \lambda_i} with \lambda_i the eigenvalues of the integral operator induced by \Theta and \lambda > 0 the regularization parameter, and n is the number of training samples. This bound captures the trade-off between model complexity (via d) and sample size, arising from uniform convergence arguments such as Rademacher complexity in the RKHS norm-bounded function class.[11]

The eigenvalue spectrum of the NTK fundamentally determines its effective dimension and thus the model's capacity for generalization, mirroring the covariance structure in Gaussian processes. For ReLU-activated fully connected networks, the NTK eigenvalues often exhibit polynomial decay (e.g., \lambda_k \sim k^{-\alpha} with \alpha > 1 depending on depth and input dimension), which bounds d sublinearly in the ambient dimension and promotes low-frequency bias, enhancing out-of-sample performance by limiting overfitting to noise. This spectral decay ensures that the effective dimension remains manageable even for high-dimensional inputs, akin to the prior induced by a Gaussian process with the conjugate NTK kernel.[12]

Within the NTK framework, double descent emerges implicitly at the interpolation threshold, where test error peaks when the effective model complexity matches n and subsequently declines in the overparameterized regime due to the stabilizing eigenvalue distribution of \Theta. This behavior aligns overparameterization with improved generalization without explicit regularization.[13]

From 2020 to 2022, theoretical advances have refined the bias-variance decomposition in the NTK kernel regime for wide neural networks, showing that bias decreases with increasing width while variance follows a non-monotonic curve driven by the NTK's multi-scale spectral structure, yielding double or triple descent in high dimensions. These results establish that optimal generalization occurs beyond interpolation, with variance controlled by the tail of the NTK eigenvalues.[13][14]
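The effective dimension in this bound is a simple function of the kernel spectrum and the regularization parameter. The sketch below evaluates d = \sum_i \lambda_i / (\lambda + \lambda_i) for an illustrative polynomially decaying spectrum \lambda_k \sim k^{-\alpha}; the decay exponents and regularization values are arbitrary choices used only to show the trend.

```python
import numpy as np

def effective_dimension(eigvals, lam):
    """d_eff = sum_i lambda_i / (lambda + lambda_i) for regularization parameter lambda."""
    return float(np.sum(eigvals / (lam + eigvals)))

# Illustrative polynomially decaying NTK spectrum, lambda_k ~ k^(-alpha)
k = np.arange(1, 10_001)
for alpha in (1.5, 2.5):
    eigvals = k ** (-alpha)
    for lam in (1e-2, 1e-4):
        print(f"alpha={alpha}, lambda={lam:.0e}: d_eff ~ {effective_dimension(eigvals, lam):.1f}")
```

Faster spectral decay (larger \alpha) yields a smaller effective dimension at a given regularization level, consistent with the low-frequency bias described above.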
Applications
Overparameterized Models
In overparameterized neural networks, where the number of parameters p greatly exceeds the number of training samples n (i.e., p \gg n), the neural tangent kernel (NTK) framework explains how wide networks can achieve zero training error, entering the interpolation regime. This occurs because, in the infinite-width limit, the network's evolution under gradient descent behaves linearly, allowing the model to fit any training data perfectly without overfitting in expectation, as the NTK acts as a fixed kernel that enables exact interpolation.

The NTK also elucidates the implicit bias of gradient descent in this regime, directing the solution toward the minimum-norm interpolant in the parameter space, which corresponds equivalently to the minimum reproducing kernel Hilbert space (RKHS) norm solution induced by the NTK. This bias favors smoother functions in the function space, promoting generalization despite interpolation, as the RKHS norm penalizes complex representations.

A key phenomenon explained by the NTK is the double descent curve in test error, where error first decreases, peaks near the interpolation threshold (p \approx n), and then decreases again in the overparameterized regime. This behavior arises from the eigenvalue spectrum of the NTK: smaller eigenvalues in the overparameterized limit lead to slower learning of certain directions but overall better alignment with the data distribution, mitigating the peak at interpolation.[15]

Empirical validations of the NTK in overparameterized settings include experiments on MNIST and CIFAR-10 using wide ResNets trained in the lazy regime, where parameter updates are small and the NTK remains nearly constant. For instance, on CIFAR-10, the NTK approximation for wide ResNets achieves approximately 77% test accuracy, consistent with the infinite-width kernel limit and capturing lazy training dynamics, though lower than feature-adapting finite networks (around 90% or higher depending on depth).[3] Similar results hold for MNIST, where NTK-based regression achieves near-perfect training fit and competitive test performance (over 98% accuracy) in wide architectures.
Kernel Regression Equivalence
The neural tangent kernel (NTK) provides an exact equivalence to kernel ridge regression in the infinite-width limit of neural networks, allowing learning problems to be reformulated and solved using standard kernel methods. Specifically, for a dataset of n training points \{x_i, y_i\}_{i=1}^n, the NTK matrix \Theta is computed as \Theta_{ij} = \Theta(x_i, x_j), where \Theta(\cdot, \cdot) is the NTK function. The ridge regression solution is then obtained by solving the linear system (\Theta + \lambda I)\alpha = y for the coefficients \alpha \in \mathbb{R}^n, with regularization parameter \lambda > 0. Predictions for a new input x are given by f(x) = k(x)^T \alpha, where k(x) = [\Theta(x, x_1), \dots, \Theta(x, x_n)]^T. This formulation leverages the reproducing kernel Hilbert space (RKHS) properties of the NTK, enabling the neural network's output to be expressed as a finite linear combination of kernel evaluations on the training data.[1]

In the ridgeless limit as \lambda \to 0, the solution reduces to the minimum-norm interpolant using the Moore-Penrose pseudoinverse \Theta^+, yielding \alpha = \Theta^+ y and predictions f(x) = k(x)^T \Theta^+ y. This case achieves perfect interpolation on the training set when \Theta is positive semi-definite and full rank, mirroring the behavior of overparameterized neural networks that achieve zero training loss. The NTK's positive definiteness, established for certain activation functions like ReLU on the hypersphere with at least two hidden layers, ensures the existence of such a unique minimum-norm solution.[1]

Training via gradient descent in the kernel regime corresponds to iterative kernel gradient descent. Starting from an initial function f_0, the updates follow f_{t+1} = f_t - \eta \nabla_f \mathcal{L}(f_t), where the gradient is \nabla_f \mathcal{L}(f) = \Theta (f(X) - y) for squared loss \mathcal{L}(f) = \frac{1}{2} \|f(X) - y\|^2. With a sufficiently small learning rate \eta < 2 / \lambda_{\max}(\Theta), this process converges exponentially under the constant NTK assumption: without explicit regularization it approaches the interpolating (ridgeless) solution, while early stopping or an added penalty yields the ridge solution, whose fitted values on the training set are f^* = \Theta (\Theta + \lambda I)^{-1} y. This provides a dynamical systems interpretation of kernel methods.[1]

Exact computation of the kernel ridge solution requires forming and inverting the n \times n NTK matrix, incurring O(n^3) time and O(n^2) memory complexity, which limits scalability for large n. To address this, approximations such as the Nyström method subsample the kernel matrix to estimate \Theta \approx Q M^{-1} Q^T, where Q is an n \times m subsampled block with m \ll n and M its m \times m submatrix, reducing complexity to O(n m^2 + m^3). Alternatively, random features approximate the NTK via Monte Carlo sampling of its eigenfunction expansion, mapping inputs to a finite-dimensional feature space \phi(x) \in \mathbb{R}^d such that \Theta(x, x') \approx \phi(x)^T \phi(x'), enabling linear regression in O(n d^2 + d^3) time for moderate d. These techniques preserve the theoretical guarantees of kernel regression while enabling practical use on datasets beyond small scales.[1]
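The ridge, ridgeless, and Nyström computations described above reduce to a few lines of linear algebra once the NTK Gram matrix is available. The following sketch uses a random positive semi-definite matrix as a stand-in for \Theta and illustrative sizes; in practice the Gram matrix would be built from an NTK evaluation such as the recursion given earlier.

```python
import numpy as np

def krr_fit(K, y, lam):
    """Solve (K + lam I) alpha = y for the kernel ridge coefficients."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

def krr_predict(k_test, alpha):
    """f(x) = k(x)^T alpha, with k_test of shape (n_test, n_train)."""
    return k_test @ alpha

def nystrom(K, m, rng):
    """Rank-m Nystrom approximation K ~ Q M^+ Q^T from m sampled columns."""
    idx = rng.choice(K.shape[0], size=m, replace=False)
    Q = K[:, idx]                            # n x m block of sampled columns
    M = K[np.ix_(idx, idx)]                  # m x m submatrix
    return Q @ np.linalg.pinv(M) @ Q.T

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
K = A @ A.T / n                              # stand-in for an NTK Gram matrix
y = rng.standard_normal(n)

alpha_ridge = krr_fit(K, y, lam=1e-2)
print("train MSE (ridge):", float(np.mean((krr_predict(K, alpha_ridge) - y) ** 2)))

alpha_ridgeless = np.linalg.pinv(K) @ y      # minimum-norm interpolant as lambda -> 0
print("interpolation error (ridgeless):", float(np.linalg.norm(K @ alpha_ridgeless - y)))

K_approx = nystrom(K, m=50, rng=rng)
print("Nystrom relative error:", float(np.linalg.norm(K - K_approx) / np.linalg.norm(K)))
```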
Practical Applications
Beyond theoretical insights, the NTK has practical uses in various domains. In neural architecture search (NAS), the NTK's eigenvalue spectrum or condition number serves as a proxy to predict architecture performance and trainability without full training, correlating with outcomes on datasets like CIFAR-10 and ImageNet.[2]

For small-data supervised learning, NTK kernel regression outperforms finite-width ResNet-34 models on limited samples, achieving consistently better accuracy on subsets of CIFAR-10 with 640 or fewer training examples due to lower variance and fewer hyperparameters.[2][16]

The NTK also applies to matrix completion tasks, such as collaborative filtering and virtual drug screening, where precomputed NTK Gram matrices enable efficient optimization.[2][17] Similarly, for image inpainting with convolutional networks, the NTK facilitates optimization using precomputed Gram matrices on images of size 2^p \times 2^q.[2][17]
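As an illustration of the architecture-search proxy mentioned above, the condition number of an NTK Gram matrix can be computed for a candidate architecture without any training. The sketch below uses a random stand-in Gram matrix; actual NAS methods differ in how the kernel is estimated and how the resulting score is combined with other metrics.

```python
import numpy as np

def ntk_condition_number(K):
    """Condition number lambda_max / lambda_min of an NTK Gram matrix,
    used as a training-free proxy for trainability in architecture search."""
    lam = np.linalg.eigvalsh(K)              # eigenvalues in ascending order
    return lam[-1] / max(lam[0], 1e-12)

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 32))
K = A @ A.T / 32                             # stand-in for the Gram matrix of a candidate architecture
print(f"condition number ~ {ntk_condition_number(K):.1f}")
```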
Extensions
Finite Width and Other Architectures
In finite-width neural networks, the neural tangent kernel (NTK) at initialization is a random matrix whose expectation converges to the infinite-width limit as the width increases, but it exhibits fluctuations of order O(1/\sqrt{w}), where w is the width of the hidden layers.[18] During training with gradient descent, these fluctuations cause the NTK to evolve away from its initial value, leading to deviations from the lazy training regime observed in the infinite-width case, where the NTK remains constant.[19] This evolution introduces feature learning effects even in wide but finite networks, with the magnitude of changes scaling inversely with width, impacting convergence and generalization.[18]

Extensions of the NTK to convolutional neural networks (CNNs) define the convolutional NTK (CNTK), which incorporates the spatial structure of convolutional layers while preserving the kernel's positive definiteness. For ReLU activations, the CNTK builds on arc-cosine kernels to compute the inner products between feature maps, enabling exact evaluation for infinitely wide CNNs.[3] Introduced by Arora et al. in 2019, this framework demonstrates that wide CNNs trained with gradient descent approximate kernel regression using the CNTK, achieving competitive performance on image classification tasks like CIFAR-10, where CNTK accuracy trails fully trained networks by only 6-7%.[3] The CNTK thus bridges theoretical analysis with practical CNN architectures by fusing convolutional invariances into the kernel computation.

For transformer architectures, the NTK formulation accounts for self-attention mechanisms and positional encodings, resulting in structured kernels that capture sequence dependencies. Positional encodings, such as sinusoidal or learned variants, impose a banded or Toeplitz structure on the attention NTK, enhancing its ability to model long-range interactions in sequences. This extension, explored in works like Hron et al. (2020), reveals that infinitely wide transformers in the NTK regime perform implicit kernel regression on input embeddings, with applications in natural language processing tasks such as language modeling and translation, where the kernel's spectral properties influence extrapolation to longer sequences.[20]

Graph neural tangent kernels (GNTKs) extend the NTK to graph neural networks (GNNs) by integrating graph structure into the kernel definition, treating infinitely wide GNNs as kernel machines over graph signals. The GNTK aggregates neighborhood information through message-passing operations, yielding a kernel that is invariant to graph permutations and incorporates Laplacian eigenvalues for spectral analysis.[21] Proposed by Du et al. in 2019, GNTKs combine the expressive power of GNNs with the theoretical guarantees of kernel methods, enabling provable learning of smooth graph functions and applications in node classification and graph regression, where they outperform traditional graph kernels on benchmarks like molecular property prediction.[21]
Recent Developments
In 2023, researchers introduced the weighted neural tangent kernel (WNTK), which extends the standard NTK by incorporating sample weights into the kernel formulation to better model neural network dynamics under adjusted gradient descent.[22] This generalization addresses limitations of the NTK in capturing optimizers beyond plain gradient descent, demonstrating improved classification accuracy—such as a 2.14% gain on CIFAR-10 binary-classification tasks—and faster convergence to a stable kernel in the infinite-width limit, as proven through stability theorems.[22]

Advancing the analysis of training dynamics, a 2025 study examined the evolution of NTK eigenvectors at the edge of stability (EoS), a regime where the NTK's largest eigenvalue oscillates inversely with learning rate during gradient descent.[23] The work reveals that higher learning rates enhance the alignment of leading NTK eigenvectors and the full kernel matrix with training targets across various architectures, providing theoretical insights via a two-layer linear network model and empirical validation on deep networks.[23] This eigenvector dynamics perspective deepens understanding of feature learning and optimization behavior in overparameterized models.

Theoretical progress in 2024 established strict positivity of the NTK for feedforward networks of arbitrary depth, provided the activation function is non-polynomial, using a novel characterization of polynomial functions to prove positive definiteness.[24] This result strengthens connections between NTK positivity and the memorization capacity of wide networks, enabling zero training loss via gradient descent and offering implications for reliable optimization landscapes.[24]

Recent applications of the NTK framework have illuminated score estimation in diffusion models, where neural networks trained by gradient descent approximate score functions for generative tasks. Leveraging NTK theory, analyses from 2024 derive explicit convergence rates for these approximations, highlighting how the kernel regime explains optimization and generalization in score-based generative modeling, with bounds scaling favorably in network width.[25]
Limitations
Key Assumptions
The Neural Tangent Kernel (NTK) theory fundamentally relies on the infinite-width limit, where the widths of all hidden layers approach infinity, ensuring that the network function converges to a Gaussian process at initialization and that the NTK remains constant and deterministic throughout training under gradient descent. This limit, combined with a sufficiently small learning rate to maintain the "lazy" training regime—where parameters undergo minimal deviations from their initial values—enables the equivalence between neural network training and kernel gradient descent, but these assumptions break down in feature-learning regimes where larger learning rates or adaptive scaling allow nonlinear adaptations beyond linearization.[1]

Initialization in NTK analyses adopts a mean-field perspective, assuming weights are drawn independently and identically from a standard Gaussian distribution \mathcal{N}(0,1), scaled appropriately by layer depth and width to control variance; this i.i.d. setup facilitates the probabilistic convergence of the kernel but neglects dependencies introduced by mechanisms like batch normalization, which couple parameters across samples and layers.[1][26]

Activation functions are presumed to be smooth, specifically Lipschitz continuous and twice differentiable with bounded second derivatives (e.g., error function or hyperbolic tangent), to preserve the positive-definiteness and stability of the NTK; while ReLU activations are frequently employed in practice and yield tractable kernels despite their piecewise linearity and non-differentiability at zero, more complex smooth functions like Swish can alter the kernel's form due to their multiplicative gating structure, potentially introducing non-stationarities in the feature map derivatives.[1][3][27]

The input data distribution is assumed to be stationary, with samples drawn i.i.d. from a fixed measure supported on a compact domain such as the unit sphere, ensuring the empirical kernel matrix is well-conditioned and invertible for convergence guarantees; this precludes adversarial inputs or distribution shifts that could render the kernel degenerate or violate the required separability of data points.[1]
Empirical and Theoretical Gaps
Despite its theoretical appeal, the neural tangent kernel (NTK) framework exhibits a significant mismatch with the behavior of finite-width neural networks, as it primarily describes the "lazy training" regime where network parameters change minimally during optimization, leading to linearization around initialization. In contrast, real-world finite-width networks often operate in the "rich" or feature-learning regime, where parameters evolve substantially to adapt representations, enabling better generalization but deviating from NTK predictions. This discrepancy arises because the NTK assumes an infinite-width limit, which imposes unrealistic scaling requirements for convergence, with finite-width corrections involving higher-order terms that are not fully captured in the standard theory. Empirically, experiments on models like three-layer linear networks and convolutional neural networks (CNNs) demonstrate stable training only within specific richness parameters, beyond which divergence occurs, highlighting the NTK's inability to model feature evolution in practical settings.

The NTK's applicability is also architecture-specific, performing adequately for CNNs but poorly for transformer-based models, particularly in tasks involving high-frequency signal processing. For vision transformers (ViTs), NTK-based metrics, such as those used in neural architecture search, yield low correlation with actual performance (e.g., Kendall-Tau correlations below 0.15 in pure ViT spaces), due to the NTK's focus on low-frequency components that overlook the high-frequency features critical to multi-head self-attention mechanisms. In contrast, NTK metrics achieve higher correlations (over 0.5) in CNN search spaces, where low-frequency approximations align better with convolutional inductive biases.[28]

A major practical barrier is the NTK's scalability, as computing the empirical NTK Gram matrix for n training samples and input dimension d incurs O(n² d⁴) time complexity in standard implementations, rendering it infeasible for large-scale datasets like ImageNet (n ≈ 10⁶), where full computation could take months on high-end GPUs. While approximations like Nyström methods reduce this to O(n √n log n), they introduce approximation errors that can undermine theoretical guarantees, limiting NTK's use in real-world deployment.

Recent open questions in NTK research, particularly from 2024 onward, include its adaptation to continual learning, where the fixed kernel assumption fails in finite-width networks transitioning to feature learning, leading to catastrophic forgetting and dependent gradient updates that diverge from Bayesian ideals; however, recent advancements as of 2025, such as path-coordinated frameworks and parameter-efficient fine-tuning analyses, have begun addressing these challenges.[29][30] Similarly, robustness to noise remains underexplored, with extensions to aleatoric noise providing estimators for posterior uncertainty but lacking validation on complex datasets or non-Gaussian noise distributions, leaving gaps in handling real-world perturbations like label noise or adversarial inputs.