Neural network Gaussian process
A Neural Network Gaussian Process (NNGP) is a Gaussian process whose covariance kernel is defined recursively through the layered architecture of a deep neural network, emerging in the limit as the network's hidden layers become infinitely wide. This theoretical equivalence establishes that the prior distribution over functions induced by a randomly initialized deep neural network converges to a Gaussian process with a non-stationary kernel that reflects the network's depth and activation functions, enabling exact Bayesian inference without requiring network training.[1]
The NNGP framework builds on the earlier observation that single-layer neural networks with infinite width and i.i.d. parameter priors behave as Gaussian processes, extending this result to arbitrarily deep architectures by deriving closed-form expressions for the kernel at each layer.[1] This derivation involves computing the expected inner products of pre-activations and activations across layers, accounting for nonlinearities like ReLU or erf, and reveals how signal propagation in random deep networks leads to increasingly complex covariance structures with depth.[1] Empirical evaluations demonstrate that NNGP predictions on benchmarks such as MNIST and CIFAR-10 often surpass those of finite-width neural networks, particularly in terms of uncertainty calibration, while finite-width networks approximate the NNGP as width scales up.[1]
Beyond its foundational role in understanding the Bayesian interpretation of deep learning, the NNGP has influenced extensions to specialized domains, including graph neural networks where infinite-width limits yield graph-specific kernels for tasks like node classification, and to dependent weight models that incorporate correlations in parameters for more flexible priors.[2] Recent theoretical advances have also explored NNGP kernels in high-dimensional settings, such as cardinality estimation in databases, where they provide uncertainty-aware approximations leveraging the universal function approximation of neural architectures.[3] Additionally, connections to the Neural Tangent Kernel highlight how NNGPs describe the untrained network prior, contrasting with dynamics during gradient descent training.[4] These developments underscore the NNGP's utility in bridging non-parametric Bayesian methods with scalable neural modeling, facilitating applications in uncertainty quantification and kernel-based learning.
Introduction
Definition and Motivation
A neural network Gaussian process (NNGP) refers to the probabilistic model obtained in the limit where the width of a neural network—defined as the number of neurons in each hidden layer—approaches infinity, under Bayesian priors on the network weights. In this regime, the function computed by the network, when evaluated at any finite set of inputs, converges in distribution to a multivariate Gaussian with mean zero and a covariance matrix specified by a kernel function that depends on the network's architecture and activation functions.[1] This equivalence establishes that infinitely wide Bayesian neural networks behave as Gaussian processes, providing a non-parametric prior over functions where the kernel encodes the inductive biases of the original network structure.[5]
The motivation for studying NNGPs stems from the desire to combine the expressive power of neural networks with the principled uncertainty quantification offered by Gaussian processes, particularly in scenarios where finite-width networks suffer from overfitting or lack reliable predictive variances. By taking the infinite-width limit, exact Bayesian inference becomes tractable, as the posterior over functions can be computed analytically using the GP framework, avoiding the intractability of integrating over high-dimensional weight spaces in standard Bayesian neural networks.[1] This correspondence addresses key limitations in deep learning, such as poor calibration of confidence estimates, by enabling scalable methods for tasks requiring probabilistic predictions, like active learning or safety-critical applications. Furthermore, it provides theoretical insights into why wide neural networks generalize well, bridging empirical observations in machine learning with probabilistic modeling.[5]
To illustrate, consider a simple fully connected single-layer neural network with random Gaussian-initialized weights and a nonlinear activation like ReLU applied to the hidden units before linear readout. As the number of hidden units increases to infinity, the pre-activation outputs at different inputs become jointly Gaussian due to the central limit theorem, resulting in the network's overall function following a GP prior with zero mean and a covariance kernel that reflects the expected similarity between inputs under the activation's geometry.[5] This GP prior induces smooth, data-dependent functions that adapt to the input distribution, demonstrating how the infinite-width limit transforms a parametric model into a flexible, non-parametric one capable of approximating a wide class of mappings.[1]
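As a rough numerical illustration of this limiting behavior (a minimal NumPy sketch, not drawn from the cited references; the width, variances, and ReLU activation are illustrative choices), the output of a wide single-hidden-layer network at a fixed input can be sampled across many random initializations, and its empirical mean and variance approach those of the corresponding Gaussian prior as the width grows:
python
import numpy as np

rng = np.random.default_rng(0)
d_in, width, n_draws = 3, 4096, 2000
sigma_w, sigma_b = 1.0, 0.1
x = rng.normal(size=d_in)  # one fixed input

outputs = np.empty(n_draws)
for i in range(n_draws):
    # Hidden layer: i.i.d. Gaussian weights with 1/sqrt(fan-in) scaling
    W1 = rng.normal(0.0, sigma_w / np.sqrt(d_in), size=(width, d_in))
    b1 = rng.normal(0.0, sigma_b, size=width)
    h = np.maximum(0.0, W1 @ x + b1)  # ReLU activations
    # Linear readout: weights scaled by 1/sqrt(width)
    W2 = rng.normal(0.0, sigma_w / np.sqrt(width), size=width)
    outputs[i] = W2 @ h + rng.normal(0.0, sigma_b)

# Empirical moments approach the zero mean and kernel variance of the Gaussian limit
print(outputs.mean(), outputs.var())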
While the core NNGP framework initially focuses on fully connected architectures, it naturally extends to more complex structures like convolutional networks by defining analogous recursive kernels that incorporate spatial invariances.[1] The non-parametric nature of the resulting GP allows for highly flexible function approximation without fixed-capacity constraints, making NNGPs particularly valuable for modeling complex data modalities where traditional parametric neural networks may underperform in capturing uncertainty.[5]
Historical Development
The concept of neural network Gaussian processes (NNGPs) originated with Radford Neal's 1996 PhD thesis, which demonstrated that Bayesian priors over the weights of a single-hidden-layer neural network with infinitely many units converge to a Gaussian process in the infinite-width limit.[5] This equivalence provided a Bayesian interpretation of neural networks, linking them to nonparametric models and influencing early work in probabilistic machine learning.[6]
The extension to deep neural networks was established in 2018 by Jaehoon Lee and colleagues, who derived that fully connected networks with multiple layers also converge to Gaussian processes under the infinite-width limit, introducing a recursive computation for the resulting kernel that captures layer-wise transformations.[1] This work generalized Neal's result, enabling the analysis of deep architectures as hierarchical Gaussian processes and highlighting the role of activation functions in kernel design.[7]
Recent developments have explored NNGPs in specialized contexts, such as wide networks achieving perfect fitting of training data through their Gaussian process equivalence, as examined in a 2023 NeurIPS paper on deep equilibrium models.[8] In 2025, research in Nature Physics showed that certain quantum neural networks, based on Haar-random unitary or orthogonal layers, converge to Gaussian processes in the large Hilbert space limit, bridging classical and quantum paradigms.[9] A subsequent October 2025 study established quantitative convergence rates for trained quantum neural networks to Gaussian processes under infinite width.[10] In July 2025, researchers revisited the equivalence of Bayesian neural networks and Gaussian processes, emphasizing the role of learnable activations in the infinite-width limit.[11] Additionally, a 2023 study in Machine Learning: Science and Technology applied NNGPs to model potential energy surfaces for polyatomic molecules, demonstrating their efficiency in capturing complex molecular interactions with fewer parameters than traditional methods.[12]
The theoretical understanding of NNGPs has evolved from Neal's initial heuristic arguments based on Bayesian priors to rigorous measure-theoretic proofs in subsequent works, such as the 2018 derivation ensuring convergence for composed layers.[1] This progression has profoundly influenced Bayesian deep learning by providing tools for uncertainty quantification and kernel-based approximations in scalable neural models.[5]
Background Concepts
Gaussian Processes
A Gaussian process (GP) is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution.[13] It is completely specified by a mean function \mu(x) and a covariance (or kernel) function k(x, x'), and is denoted by f \sim \mathcal{GP}(\mu, k), where f represents the function values at inputs x.[13] This formulation treats the GP as a prior distribution over functions, enabling probabilistic modeling of unknown mappings from inputs to outputs.[14]
Gaussian processes possess several key properties that make them suitable for function approximation. They are non-parametric models, operating in an infinite-dimensional function space rather than fixed-dimensional parameter spaces, which allows flexibility in capturing complex patterns without assuming a specific functional form.[14] Additionally, GPs support exact posterior inference through simple conditioning operations on the joint Gaussian distribution, providing a fully Bayesian framework where beliefs about the function update based on observed data.[14] Unlike deterministic models such as neural networks, GPs inherently model stochasticity, yielding distributions over possible functions.[15]
The choice of covariance kernel is central to a GP, as it encodes assumptions about the function's properties, such as smoothness or periodicity. A widely used kernel is the squared exponential (or radial basis function) kernel, given by
k(x, x') = \sigma_f^2 \exp\left( -\frac{\|x - x'\|^2}{2\ell^2} \right),
where \sigma_f^2 controls the vertical scale and \ell the length scale; this kernel generates infinitely differentiable, highly smooth sample functions.[16] Another common choice is the Matérn kernel, parameterized by a smoothness factor \nu (often \nu = 3/2 or \nu = 5/2), which allows for more realistic roughness in functions; for \nu = 5/2, it takes the form
k(x, x') = \sigma_f^2 \left(1 + \sqrt{5} r + \frac{5}{3} r^2 \right) \exp\left( -\sqrt{5} r \right),
with r = \|x - x'\| / \ell.[16] These kernels facilitate GP regression tasks by enabling point predictions alongside uncertainty quantification, where the predictive variance reflects data density and model confidence.[14]
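For concreteness, these two covariance functions translate directly into code; the following short sketch (function names are illustrative) implements them for single pairs of inputs:
python
import numpy as np

def squared_exponential(x, xp, sigma_f=1.0, ell=1.0):
    # k(x, x') = sigma_f^2 * exp(-||x - x'||^2 / (2 * ell^2))
    return sigma_f**2 * np.exp(-np.sum((x - xp) ** 2) / (2 * ell**2))

def matern_52(x, xp, sigma_f=1.0, ell=1.0):
    # Matern kernel with nu = 5/2, expressed with r = ||x - x'|| / ell
    r = np.linalg.norm(x - xp) / ell
    return sigma_f**2 * (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)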
Inference in Gaussian processes typically involves regression, where noisy observations y = f(x) + \epsilon are modeled with \epsilon \sim \mathcal{N}(0, \sigma_n^2). The posterior predictive distribution at a test point x_* is Gaussian, with mean
\mu_* = K(x_*, X) [K(X, X) + \sigma_n^2 I]^{-1} y
and covariance
\sigma_*^2 = k(x_*, x_*) - K(x_*, X) [K(X, X) + \sigma_n^2 I]^{-1} K(X, x_*),
assuming a zero-mean prior for simplicity; here, K(X, X) denotes the kernel matrix over the training inputs X, K(x_*, X) the vector of kernel values between the test point and the training inputs, and y the vector of training targets.[13] This exact inference comes at a computational cost of O(n^3) time and O(n^2) space for n data points, dominated by the inversion of the n \times n covariance matrix.[13]
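The posterior formulas above amount to a few lines of linear algebra; the following NumPy sketch (with an illustrative squared exponential kernel, toy data, and hypothetical helper names) computes the predictive mean and variance at a single test point:
python
import numpy as np

def rbf(A, B, sigma_f=1.0, ell=1.0):
    # Pairwise squared exponential kernel matrix between the rows of A and B
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-d2 / (2 * ell**2))

# Toy one-dimensional regression data
X = np.linspace(-3, 3, 20)[:, None]
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(0).normal(size=20)
x_star = np.array([[0.5]])
sigma_n2 = 0.01

K = rbf(X, X) + sigma_n2 * np.eye(len(X))   # K(X, X) + sigma_n^2 I
k_star = rbf(x_star, X)                     # K(x_*, X)
alpha = np.linalg.solve(K, y)               # [K(X, X) + sigma_n^2 I]^{-1} y
mu_star = k_star @ alpha                    # posterior mean
var_star = rbf(x_star, x_star) - k_star @ np.linalg.solve(K, k_star.T)  # posterior variance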
Neural Network Fundamentals
Feedforward neural networks consist of multiple layers of interconnected neurons, where each neuron in a given layer receives inputs from the previous layer, applies a linear transformation, and passes the result through a nonlinear activation function. The basic architecture includes an input layer, one or more hidden layers, and an output layer, with connections defined by weights and biases. For a network with L layers, the computation in layer l involves weights W^l \in \mathbb{R}^{n^l \times n^{l-1}} and biases b^l \in \mathbb{R}^{n^l}, where n^l denotes the width (number of neurons) in layer l. The pre-activation for layer l is given by z^l = W^l h^{l-1} + b^l, followed by the activation h^l = \phi(z^l), with h^0 = x the input and \phi the activation function, such as ReLU (\phi(z) = \max(0, z))[17] or the error function (\phi(z) = \erf(z / \sqrt{2})),[1] chosen for their properties in facilitating gradient flow and enabling nonlinear mappings.[18]
The forward pass computes the network output as a nested composition of these layer functions: f(x) = \phi_L(W^L \phi_{L-1}(W^{L-1} \cdots \phi_1(W^1 x + b^1) \cdots + b^{L-1}) + b^L), transforming inputs through successive nonlinearities to produce predictions. In a Bayesian framework, neural networks are treated probabilistically by placing priors on the weights and biases, typically independent Gaussian distributions such as W^l_{ij} \sim \mathcal{N}(0, \sigma_w^2) and b^l_i \sim \mathcal{N}(0, \sigma_b^2), which induce a prior distribution over the functions realized by the network. This perspective views the network as a probabilistic model where the prior captures uncertainty in the mapping from inputs to outputs, analogous to Gaussian processes in Bayesian nonparametrics but defined through finite parametric compositions.[19]
The width n^l and depth L define the network's capacity, with wider layers allowing more parallel computations and deeper layers enabling hierarchical feature extraction; however, to maintain stable signal propagation, weights are scaled at initialization, for example by a factor of 1 / \sqrt{n^{l-1}}, to preserve variance across layers for activations like tanh or ReLU. This scaling prevents vanishing or exploding gradients in deep networks by keeping the variance of activations approximately constant from layer to layer. While training typically employs gradient descent to minimize a loss function by iteratively updating weights via backpropagation, the Bayesian view emphasizes the distribution over functions induced by the prior before any optimization, which reflects the model's inductive biases prior to data adaptation.
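The prior over functions described here can be sampled directly by drawing weights and propagating an input forward; the sketch below (an illustrative NumPy implementation with hypothetical names and parameter values, assuming ReLU hidden layers and a linear output) makes the 1 / \sqrt{n^{l-1}} scaling explicit:
python
import numpy as np

def prior_draw(x, widths, sigma_w=1.0, sigma_b=0.1, rng=None):
    # One draw from the prior over functions: Gaussian weights with 1/sqrt(fan-in)
    # scaling, ReLU hidden layers, and a linear output layer.
    rng = np.random.default_rng() if rng is None else rng
    h = x
    for l in range(1, len(widths)):
        fan_in = widths[l - 1]
        W = rng.normal(0.0, sigma_w / np.sqrt(fan_in), size=(widths[l], fan_in))
        b = rng.normal(0.0, sigma_b, size=widths[l])
        z = W @ h + b                                         # pre-activation z^l
        h = np.maximum(0.0, z) if l < len(widths) - 1 else z  # ReLU, except at the output
    return h

f_x = prior_draw(np.ones(5), widths=[5, 256, 256, 1])  # input, two hidden layers, scalar output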
Theoretical Framework
Infinite-Width Limit
In the infinite-width limit, the outputs of a neural network converge in distribution to those of a Gaussian process, providing a Bayesian nonparametric perspective on wide networks. This equivalence arises primarily from the central limit theorem (CLT), which implies that the pre-activations z^l at layer l, obtained by an affine transformation of the previous layer's activations before the nonlinearity is applied, become Gaussian distributed due to summation over a large number of independent terms proportional to the layer width n^{l-1}. Specifically, for a fully connected network, the pre-activation for the i-th unit in layer l is z^l_i(x) = b^l_i + \sum_{j=1}^{n^{l-1}} w^l_{ij} y^{l-1}_j(x), where the weights w^l_{ij} and biases b^l_i are drawn i.i.d. from zero-mean distributions with appropriately scaled variances; as n^{l-1} \to \infty, the CLT ensures that z^l(x) is Gaussian for any input x.[1][20]
This Gaussianity propagates recursively through the network layers. For a single-layer network, the output is exactly a Gaussian process when the width is infinite, as the function values at finite sets of inputs are jointly Gaussian by construction. Extending to multi-layer networks with L layers, the process applies inductively: assuming the post-activations y^{l-1} from the previous layer form a Gaussian process conditional on the inputs, the pre-activations z^l at the next layer are Gaussian conditional on y^{l-1}, and thus the overall outputs y^L are jointly Gaussian for any finite set of inputs, yielding a Gaussian process prior over functions. This layer-by-layer recursion holds under mild conditions on the nonlinearity, such as the linear envelope property, ensuring the limit is non-degenerate.[5][20]
Crucial to achieving this convergence without degeneracy or explosion in variance is the scaling of the weight variances with the fan-in, typically set to \sigma_w^{2,l} = \hat{\sigma}_w^{2,l} / n^{l-1} for weights in layer l, alongside a constant bias variance \sigma_b^{2,l}. This scaling normalizes the contributions from each previous neuron, preventing the pre-activation variance from vanishing to zero or diverging to infinity as the width grows, thereby maintaining a well-defined Gaussian process limit with finite covariance. Without such scaling, the network outputs could collapse to constants or exhibit unstable behavior, underscoring the importance of this parameterization in theoretical analyses and practical implementations.[1][19]
Derivation of the NNGP Kernel
The Neural Network Gaussian Process (NNGP) kernel is defined as the covariance function K^l(x, x') = \operatorname{Cov}[z^l(x), z^l(x')], where z^l denotes the pre-activation outputs at layer l under the infinite-width prior on the network parameters.[1]
For the input layer (l = 0), the kernel takes the form of a linear kernel:
K^0(x, x') = \sigma_b^2 + \sigma_w^2 \frac{x \cdot x'}{d_{\text{in}}},
where \sigma_b^2 is the bias variance, \sigma_w^2 is the weight variance, and d_{\text{in}} is the input dimension; this follows from the Gaussian initialization of the first-layer weights W^0_{ij} \sim \mathcal{N}(0, \sigma_w^2 / d_{\text{in}}) and biases b^0_j \sim \mathcal{N}(0, \sigma_b^2).[1]
Subsequent layers are computed recursively. Conditioning on the previous layer's kernel K^{l-1}, the pre-activations z^l(x) at layer l are Gaussian with covariance
K^l(x, x') = \sigma_b^2 + \sigma_w^2 \mathbb{E}[\phi(u) \phi(v)],
where \phi is the activation function, and the expectation is over the bivariate Gaussian (u, v) \sim \mathcal{N}(0, \Sigma) with
\Sigma = \begin{pmatrix}
K^{l-1}(x, x) & K^{l-1}(x, x') \\
K^{l-1}(x', x) & K^{l-1}(x', x')
\end{pmatrix}.
This expectation, often denoted \mathbb{E}[\phi(u) \phi(v)] = V_\phi(K^{l-1}(x, x'), K^{l-1}(x, x), K^{l-1}(x', x')), admits closed-form expressions for common activations and can be computed exactly, yielding a deterministic recursion for K^l given K^{l-1}.[1]
In the infinite-width limit (n^l \to \infty), the kernel K^l conditioned on K^{l-1} becomes deterministic due to the law of large numbers applied to the finite-sum structure of the pre-activations, enabling layer-wise exact computation without stochasticity.[1]
For the ReLU activation \phi(z) = \max(0, z), the expectation corresponds to the first-order arc-cosine kernel:
K^l(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{2\pi} \sqrt{K^{l-1}(x, x) K^{l-1}(x', x')} \left[ \sin \theta^{l-1} + (\pi - \theta^{l-1}) \cos \theta^{l-1} \right],
where \theta^{l-1} = \arccos \left( \frac{K^{l-1}(x, x')}{\sqrt{K^{l-1}(x, x) K^{l-1}(x', x')}} \right).[1]
Iterating this recursion to the output layer L, the final pre-activations satisfy z^L \sim \mathcal{GP}(0, K^L), establishing the NNGP as a Gaussian process with kernel K^L; equivalently, the layer-wise statement z^l \mid z^{l-1} \sim \mathcal{GP}(0, K^l) holds at every layer.[1]
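The full recursion, from the input-layer kernel through the arc-cosine update for ReLU, can be implemented in a few lines; the following sketch (illustrative function name and hyperparameter values) returns the kernel matrix K^L for a batch of inputs:
python
import numpy as np

def nngp_relu_kernel(X, depth, sigma_w2=2.0, sigma_b2=0.1):
    # X: (n, d_in) inputs; returns the depth-L NNGP kernel matrix for ReLU activations.
    d_in = X.shape[1]
    K = sigma_b2 + sigma_w2 * (X @ X.T) / d_in  # K^0: input-layer (linear) kernel
    for _ in range(depth):
        diag = np.sqrt(np.diag(K))
        norm = np.outer(diag, diag)             # sqrt(K(x, x) * K(x', x'))
        theta = np.arccos(np.clip(K / norm, -1.0, 1.0))
        K = sigma_b2 + (sigma_w2 / (2 * np.pi)) * norm * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta)
        )
    return K

K_L = nngp_relu_kernel(np.random.default_rng(0).normal(size=(8, 3)), depth=3)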
Extensions and Variants
Convolutional and Other Architectures
The extension of the neural network Gaussian process (NNGP) framework to convolutional neural networks (CNNs) involves deriving kernels that respect the local connectivity and weight-sharing properties of convolutional layers. In this limit, the output of a Bayesian CNN with random weights converges to a Gaussian process whose kernel is translation-invariant, capturing spatial correlations induced by the convolution operation. This kernel is computed recursively across layers, where each convolutional layer transforms the covariance structure of the previous layer through a Monte Carlo approximation of expectations over the nonlinear activation, leading to structured covariances that model feature maps as multivariate Gaussian processes with dependencies along spatial dimensions.[21][22]
The derivation adjusts the standard infinite-width limit by considering infinite channels per layer rather than infinite hidden units, ensuring that the central limit theorem applies to the summed contributions from filter outputs while preserving the fixed spatial resolution until pooling. Pooling operations, such as max or average pooling, are treated as deterministic downsampling in the infinite-channel limit, which reduces the dimensionality of the feature maps without introducing additional stochasticity, thereby maintaining the Gaussian process structure through the network depth. For instance, in multi-stage process modeling, the CNN-GPR approach stacks convolutional layers to hierarchically extract features from sequential data, yielding a composite Gaussian process that outperforms traditional GPR on tasks like semiconductor manufacturing prediction by leveraging spatial hierarchies.[22][23]
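As a brief illustration of how a convolutional NNGP kernel is obtained in practice, the sketch below uses the Neural Tangents library (discussed under Software Libraries below); the architecture, channel counts, and input shapes are illustrative choices:
python
import jax.numpy as jnp
from neural_tangents import stax

# Infinite-channel CNN: the NNGP kernel is computed recursively over layers and spatial locations
_, _, kernel_fn = stax.serial(
    stax.Conv(64, (3, 3), padding='SAME'), stax.Relu(),
    stax.Conv(64, (3, 3), padding='SAME'), stax.Relu(),
    stax.Flatten(), stax.Dense(10)
)

x = jnp.ones((4, 8, 8, 3))    # NHWC batch of images
K = kernel_fn(x, x, 'nngp')   # (4, 4) NNGP kernel between the images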
Beyond CNNs, the NNGP framework extends to other architectures by adapting the kernel to their inductive biases. For recurrent neural networks (RNNs), the infinite-width limit produces a Gaussian process with a temporal kernel that models sequence dependencies through recursive covariance updates, enabling non-Markovian modeling of time-series data such as in dynamical systems identification.[24] In graph neural networks (GNNs), the kernel incorporates graph-structured covariances by propagating messages along edges in the infinite-width regime, resulting in permutation-equivariant Gaussian processes suitable for tasks like molecular property prediction where node features depend on neighborhood topology.[25] Transformer-like architectures yield kernels based on query-key dot-product similarities, where self-attention layers induce cross-covariance structures that capture long-range dependencies in sequences, as seen in correlated Gaussian process models for uncertainty-aware natural language processing.[26]
Recent advancements have further generalized this to quantum neural networks (QNNs), where Haar-random unitary layers in the infinite-Hilbert-space limit converge to Gaussian processes with kernels reflecting quantum entanglement patterns, offering a Bayesian perspective on quantum machine learning for tasks like quantum state tomography.[9]
Relationship to the Neural Tangent Kernel
The Neural Tangent Kernel (NTK) provides a kernel-based description of how neural network functions evolve under gradient flow training in the infinite-width limit, where the network's evolution under gradient descent converges to kernel regression dynamics.[27] This framework reveals that, for sufficiently wide networks, the training process linearizes around initialization, enabling closed-form analysis of convergence and generalization properties akin to those in kernel methods.
The Neural Network Gaussian Process (NNGP) kernel governs the prior distribution over functions at initialization, while the NTK governs the linearized dynamics of the network under gradient descent during training; in certain cases, such as with the error function (erf) activation, the NTK closely approximates the NNGP kernel.[1] This correspondence arises because both kernels emerge from the same infinite-width scaling of random neural networks, with the NNGP capturing the Bayesian prior and the NTK reflecting the linearized dynamics after initialization.[27] For activations like ReLU, the kernels differ but remain closely related through recursive definitions that propagate layer-wise covariances.
Key differences lie in their interpretive roles: the NNGP enables Bayesian sampling from the function prior for uncertainty quantification, whereas the NTK focuses on frequentist optimization trajectories under gradient flow, both facilitating tractable analysis without simulating finite-width networks.[1] These distinctions highlight complementary perspectives, with the NNGP emphasizing pre-training variability and the NTK post-training evolution.
The interplay between NNGP and NTK underscores that infinite-width neural networks behave as kernel machines, bridging deep learning with classical kernel regression and enabling theoretical insights into why wide networks generalize well.[27] This connection, established in foundational works, has influenced subsequent analyses of network scaling and has practical implications for designing efficient approximations in large-scale machine learning.[1]
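In practice, the two kernels can be computed side by side from the same architecture specification; the following sketch uses the Neural Tangents library (introduced under Software Libraries below), with an illustrative fully connected ReLU network:
python
import jax.numpy as jnp
from neural_tangents import stax

_, _, kernel_fn = stax.serial(stax.Dense(512), stax.Relu(), stax.Dense(1))
x = jnp.linspace(-1.0, 1.0, 5).reshape(-1, 1)

K_nngp = kernel_fn(x, x, 'nngp')  # prior covariance of the untrained network
K_ntk = kernel_fn(x, x, 'ntk')    # kernel governing the linearized training dynamics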
Properties and Implications
Statistical Properties
Neural network Gaussian processes (NNGPs) exhibit non-stationary kernels due to their layered compositional structure, which causes the covariance to depend not only on the difference between inputs but also on their individual magnitudes, in contrast to stationary kernels like the radial basis function (RBF). This non-stationarity emerges from the recursive definition of the kernel across layers, where each layer's covariance incorporates expectations over activations that scale with input norms.[20]
NNGPs function as universal priors over smooth functions, leveraging the Gaussian process framework to span a reproducing kernel Hilbert space rich enough to approximate a broad class of continuous functions, beyond what any fixed finite-width neural network can represent. This universality stems from the infinite-width limit, where the central limit theorem ensures convergence to a GP whose kernel can approximate continuous functions under suitable conditions.[20] The mean function of an NNGP is typically zero when using centered priors for weights and biases, while the output variance evolves recursively with network depth, often necessitating variance scaling factors to maintain stable signal propagation across layers.
Sampling from the NNGP prior employs standard Gaussian process sampling techniques, such as drawing functions from the multivariate normal distribution defined by the kernel; for finite-width approximations, Monte Carlo methods average predictions over ensembles of randomly initialized networks to approximate the infinite-width behavior. Inference under the NNGP is thus equivalent to exact GP inference, enabling Bayesian predictions via kernel ridge regression or similar methods.
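Concretely, drawing functions from the NNGP prior reduces to sampling a multivariate normal with the kernel matrix as covariance; the sketch below (illustrative helper name, using the input-layer kernel from the derivation above as an example covariance) shows the standard Cholesky-based approach:
python
import numpy as np

def sample_gp_prior(K, n_samples=5, jitter=1e-6, rng=None):
    # Draw sample functions from the zero-mean GP prior with covariance matrix K
    rng = np.random.default_rng(0) if rng is None else rng
    L = np.linalg.cholesky(K + jitter * np.eye(len(K)))  # jitter stabilizes the factorization
    return (L @ rng.normal(size=(len(K), n_samples))).T  # each row is one sampled function

X = np.linspace(-2.0, 2.0, 50)[:, None]
K0 = 0.1 + 1.0 * (X @ X.T) / X.shape[1]  # input-layer NNGP kernel with sigma_b^2 = 0.1, sigma_w^2 = 1
samples = sample_gp_prior(K0)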
Analysis of NNGP properties often involves recurrence relations for the kernel's eigenvalues, derived from the layer-wise updates, which provide insights into spectral decay and effective dimensionality. These relations highlight generalization behaviors, including double descent, where the test error initially rises with model complexity before declining, mirroring phenomena observed in overparameterized regimes.[28][20]
Applications in Machine Learning
Neural network Gaussian processes (NNGPs) have found significant application in Bayesian neural networks, where the infinite-width limit enables exact Gaussian process posteriors for uncertainty quantification. In this framework, the prior over functions induced by an infinitely wide neural network aligns with a GP, allowing Bayesian inference to capture epistemic uncertainty without variational approximations, which is particularly useful for safety-critical predictions in domains like healthcare and autonomous systems. This equivalence facilitates scalable posterior sampling via GP methods, improving calibration of uncertainty estimates compared to standard neural networks.[1][29]
In regression and classification tasks, NNGPs serve as scalable priors for modeling high-dimensional data, leveraging the neural network's representational power to define expressive kernels that handle complex nonlinearities. For instance, in molecular simulations, NNGP models have been employed to construct global potential energy surfaces (PES) for polyatomic molecules, achieving high accuracy in predicting energies across wide ranges with fewer training points than traditional GPs, due to the kernel's ability to extrapolate from low-energy data to high-energy regimes. This approach demonstrates NNGPs' efficacy in high-dimensional regression, where finite neural networks approximate the infinite-width GP prior to manage computational demands.[12][30]
Hybrid models integrating NNGPs with deep kernels enhance flexibility by combining neural feature extraction with GP inference, enabling better generalization in tasks like active learning. In active learning scenarios, the GP acquisition function derived from the NNGP kernel guides data selection to minimize uncertainty, outperforming standard neural network heuristics by focusing queries on informative regions of the input space. Such hybrids, often termed deep kernel learning, use the NNGP as a guiding prior to optimize kernel hyperparameters, improving sample efficiency in regression benchmarks.[31][29]
Recent applications highlight NNGPs' versatility in specialized domains. In quantum simulations, quantum neural networks in the infinite-width limit converge to Gaussian processes, enabling efficient modeling of quantum states and dynamics with built-in uncertainty quantification for tasks like error mitigation in quantum circuits.[9]
Despite these advances, NNGPs face limitations in scalability for deep networks, as the cubic complexity of GP inference hinders application to large datasets, necessitating sparse approximations or inducing points. For finite-width networks, deviations from the infinite-width GP behavior require perturbative corrections or Monte Carlo methods to approximate the kernel, which can introduce biases in uncertainty estimates for shallower or narrower architectures. These challenges underscore the need for ongoing research into efficient finite-width extensions to broaden practical deployment.
Implementations
Software Libraries
Several open-source software libraries facilitate the implementation and application of neural network Gaussian processes (NNGPs), enabling researchers to compute NNGP kernels, perform Bayesian inference, and integrate them into broader machine learning workflows. These tools often build on established Gaussian process frameworks while allowing extensions for neural-inspired kernels, such as those derived from infinite-width limits of neural networks.
GPyTorch, a PyTorch-based library for Gaussian processes, enables custom implementation of NNGP kernels through its modular kernel composition system. It allows users to define multi-layer NNGP kernels recursively, incorporating activation functions like ReLU or erf, and handles finite-width approximations for practical computations. Key features include automatic differentiation for kernel optimization and scalable inference via inducing points or variational methods, making it suitable for large-scale NNGP modeling. For instance, users can subclass gpytorch.kernels.Kernel to implement NNGP computations, though full multi-layer implementations require custom coding.
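A minimal sketch of such a custom kernel is shown below, implementing a single-hidden-layer ReLU (arc-cosine) NNGP covariance by subclassing gpytorch.kernels.Kernel; the class name and fixed hyperparameters are illustrative, and a practical implementation would register them as trainable parameters and iterate the recursion for deeper networks:
python
import math
import torch
import gpytorch

class OneLayerNNGPKernel(gpytorch.kernels.Kernel):
    # Single-hidden-layer ReLU NNGP kernel; sigma_w2 and sigma_b2 are fixed here for brevity.
    def __init__(self, sigma_w2=2.0, sigma_b2=0.1, **kwargs):
        super().__init__(**kwargs)
        self.sigma_w2, self.sigma_b2 = sigma_w2, sigma_b2

    def forward(self, x1, x2, diag=False, **params):
        d_in = x1.shape[-1]
        K12 = self.sigma_b2 + self.sigma_w2 * x1 @ x2.transpose(-1, -2) / d_in
        K11 = self.sigma_b2 + self.sigma_w2 * x1.pow(2).sum(-1) / d_in
        K22 = self.sigma_b2 + self.sigma_w2 * x2.pow(2).sum(-1) / d_in
        norm = torch.sqrt(K11).unsqueeze(-1) * torch.sqrt(K22).unsqueeze(-2)
        theta = torch.acos(torch.clamp(K12 / norm, -1.0, 1.0))
        K = self.sigma_b2 + self.sigma_w2 / (2 * math.pi) * norm * (
            torch.sin(theta) + (math.pi - theta) * torch.cos(theta)
        )
        return K.diagonal(dim1=-2, dim2=-1) if diag else K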
Google's Neural Tangents library, designed for analyzing neural networks in the infinite-width regime, includes dedicated support for both NNGP and neural tangent kernels (NTK). It supports recursive computation of NNGP kernels across various architectures, with built-in handling for custom activations and depth, and integrates seamlessly with JAX for high-performance numerical operations such as efficient evaluation of the GP posterior. The library's stax module lets users specify a network layer by layer and returns a kernel function that evaluates the corresponding NNGP (and NTK) kernels, enabling tasks such as uncertainty quantification in deep Bayesian neural networks. An example for a single-hidden-layer NNGP with JAX integration is:
python
import jax.numpy as jnp
from neural_tangents import stax

# Define a simple single-hidden-layer network; the width of the hidden Dense layer
# does not affect the infinite-width kernel computed by kernel_fn
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512, W_std=1.0, b_std=0.1), stax.Erf(), stax.Dense(1)
)

# Compute the NNGP kernel matrix for a batch of inputs
x = jnp.ones((10, 1))             # input data, shape (batch, features)
K_nngp = kernel_fn(x, x, 'nngp')  # 10 x 10 NNGP kernel matrix
This facilitates rapid prototyping and scales to deeper networks via batched computations.
GPflow, a TensorFlow-based library for Gaussian processes, allows custom kernel definitions suitable for implementing NNGPs through its support for user-defined kernel classes. Users can implement NNGP kernels by subclassing gpflow.kernels.Kernel and defining recursive covariance computations, with support for finite-width corrections and integration into variational GP models. It emphasizes scalability for non-conjugate likelihoods and provides tools for hyperparameter optimization via MCMC or variational inference. While not specialized for infinite-width limits, GPflow's flexibility allows adaptation for NNGP-based priors in probabilistic modeling.
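A minimal GPflow sketch along these lines (illustrative class name; only the input-layer NNGP covariance is shown, and trainable variances would normally be wrapped in gpflow.Parameter) subclasses gpflow.kernels.Kernel and implements the K and K_diag methods:
python
import tensorflow as tf
import gpflow

class InputLayerNNGP(gpflow.kernels.Kernel):
    # Input-layer NNGP kernel K^0(x, x') = sigma_b^2 + sigma_w^2 * (x . x') / d_in;
    # deeper layers would apply the recursive update inside K().
    def __init__(self, sigma_w2=1.0, sigma_b2=0.1):
        super().__init__()
        self.sigma_w2, self.sigma_b2 = sigma_w2, sigma_b2

    def K(self, X, X2=None):
        X2 = X if X2 is None else X2
        d_in = tf.cast(tf.shape(X)[-1], X.dtype)
        return self.sigma_b2 + self.sigma_w2 * tf.matmul(X, X2, transpose_b=True) / d_in

    def K_diag(self, X):
        d_in = tf.cast(tf.shape(X)[-1], X.dtype)
        return self.sigma_b2 + self.sigma_w2 * tf.reduce_sum(tf.square(X), axis=-1) / d_in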
Additionally, specialized implementations like the open-source nngp repository provide code for constructing covariance kernels equivalent to infinitely wide, fully connected, deep neural networks.[32]
Practical and Computational Considerations
The computation of the neural network Gaussian process (NNGP) kernel itself is typically efficient, as it involves recursive evaluation of covariance functions based on input features, but the subsequent Gaussian process inference scales cubically with the number of data points n, requiring O(n^3) time for kernel matrix inversion and O(n^2) space for storage. This cubic complexity arises from solving linear systems in the GP posterior, making exact NNGP inference prohibitive for datasets beyond a few thousand points. To mitigate this, practitioners often employ sparse approximations, such as inducing points, which reduce complexity to O(m^3 + nm^2) where m \ll n is the number of inducing variables, or structured kernel approximations like those leveraging low-rank decompositions.
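The low-rank idea behind such approximations can be illustrated with a Nyström-style sketch (illustrative function name; not a full sparse variational GP), which replaces the n \times n kernel matrix with a rank-m surrogate built from m inducing points:
python
import numpy as np

def nystrom_approximation(K_nm, K_mm, jitter=1e-6):
    # Approximate K ~= K_nm K_mm^{-1} K_nm^T from m inducing points,
    # reducing the O(n^3) cost of exact inference toward O(n m^2 + m^3).
    L = np.linalg.cholesky(K_mm + jitter * np.eye(len(K_mm)))
    A = np.linalg.solve(L, K_nm.T)  # shape (m, n)
    return A.T @ A                  # rank-m approximation of the n x n kernel matrix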
In practice, exact infinite-width NNGP assumptions are rarely feasible, so finite-width approximations are used to model realistic neural networks. These include empirical estimation of the NNGP kernel by averaging over multiple finite-width network initializations, which captures deviations from the infinite-width limit while introducing controllable bias. Finite-width effects lead to a bias-variance tradeoff: narrower networks exhibit higher variance in predictions due to amplified parameter uncertainty, but they can reduce bias by enabling feature learning absent in the infinite-width regime.[33] Such approximations are particularly useful for uncertainty quantification in Bayesian settings, though they require careful sampling to balance computational cost and accuracy.
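Such an empirical estimate can be formed by averaging products of outputs over many random finite-width initializations, as in the following NumPy sketch (illustrative function name, width, and variances; single hidden ReLU layer):
python
import numpy as np

def empirical_nngp_kernel(X, width=2048, n_init=500, sigma_w=1.0, sigma_b=0.1, rng=None):
    # Monte Carlo estimate of the single-hidden-layer NNGP kernel:
    # average f(x) f(x') over many random finite-width initializations.
    rng = np.random.default_rng(0) if rng is None else rng
    n, d_in = X.shape
    outputs = np.empty((n_init, n))
    for i in range(n_init):
        W1 = rng.normal(0.0, sigma_w / np.sqrt(d_in), size=(width, d_in))
        b1 = rng.normal(0.0, sigma_b, size=(width, 1))
        h = np.maximum(0.0, W1 @ X.T + b1)  # (width, n) ReLU features
        W2 = rng.normal(0.0, sigma_w / np.sqrt(width), size=width)
        outputs[i] = W2 @ h + rng.normal(0.0, sigma_b)
    return outputs.T @ outputs / n_init     # (n, n) empirical covariance estimate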
Hyperparameter selection in NNGPs critically influences the resulting kernel's expressivity and stability. Key parameters include the weight variance \sigma_w^2 and bias variance \sigma_b^2, which control signal propagation through layers; for example, \sigma_w^2 = 2 for ReLU activations preserves variance across layers, while smaller values cause covariances to decay and larger values cause them to grow in deeper networks. Activation function choices, such as erf for analytic tractability or ReLU for compatibility with modern architectures, further shape the kernel's non-stationarity. Depth and width interact through the parameterization: increasing width drives the finite network toward the fixed infinite-width kernel, and the per-weight standard deviation is scaled as \sigma_w / \sqrt{n^{l-1}} to keep the kernel variance constant, enabling stable deep NNGPs without vanishing or exploding signals.[34]
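The effect of the weight variance on depth-wise signal propagation can be seen from the one-dimensional variance recursion for ReLU, q^l = \sigma_b^2 + \sigma_w^2 q^{l-1} / 2 (using \mathbb{E}[\phi(z)^2] = q/2 for z \sim \mathcal{N}(0, q)); the short sketch below (illustrative function name, with \sigma_b^2 = 0 for clarity) shows the critical behavior at \sigma_w^2 = 2:
python
def relu_variance_recursion(depth, sigma_w2, sigma_b2=0.0, q0=1.0):
    # Pre-activation variance after `depth` ReLU layers: q^l = sigma_b^2 + sigma_w^2 * q^{l-1} / 2
    q = q0
    for _ in range(depth):
        q = sigma_b2 + sigma_w2 * q / 2.0
    return q

print(relu_variance_recursion(20, sigma_w2=2.0))  # ~1.0: variance preserved (critical scaling)
print(relu_variance_recursion(20, sigma_w2=1.0))  # decays toward zero
print(relu_variance_recursion(20, sigma_w2=3.0))  # grows rapidly with depth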
NNGPs present optimization challenges due to their inherent non-stationarity, where the kernel depends explicitly on input magnitudes and architecture, complicating hyperparameter tuning via standard GP methods like marginal likelihood maximization. This non-stationarity can lead to ill-conditioned kernel matrices, exacerbating numerical instability in inversion. Integrating NNGPs with stochastic gradient descent (SGD) in hybrid training regimes—such as using NNGP priors to initialize finite-width networks—requires careful variance matching to avoid mode collapse, often addressed through variational approximations or ensemble methods.[35]
Best practices recommend deploying NNGPs primarily for small datasets (n < 10^3) where exact inference is viable, or as informative priors in Bayesian workflows to encode neural-like inductive biases. For large-scale applications, combining NNGPs with neural tangent kernels (NTKs) is effective, leveraging NNGP for prior specification and NTK for scalable gradient-based training dynamics.[36]