
Neural network Gaussian process

A Neural Network Gaussian Process (NNGP) is a Gaussian process whose covariance kernel is defined recursively through the layered architecture of a deep neural network, emerging in the limit as the network's hidden layers become infinitely wide. This theoretical equivalence establishes that the prior distribution over functions induced by a randomly initialized deep network converges to a multivariate Gaussian with a non-stationary covariance kernel that captures the network's depth and activation functions, enabling exact Bayesian inference without requiring network training. The NNGP framework builds on the earlier observation that single-layer neural networks with infinite width and i.i.d. parameter priors behave as Gaussian processes, extending this result to arbitrarily deep architectures by deriving closed-form expressions for the kernel at each layer. This derivation involves computing the expected inner products of pre-activations and activations across layers, accounting for nonlinearities like ReLU or erf, and reveals how signal propagation in random deep networks leads to increasingly complex covariance structures with depth.

Empirical evaluations demonstrate that NNGP predictions on benchmarks such as MNIST and CIFAR-10 often surpass those of finite-width neural networks, particularly in terms of uncertainty calibration, while finite-width networks approximate the NNGP as width scales up. Beyond its foundational role in the Bayesian interpretation of deep learning, the NNGP has influenced extensions to specialized domains, including graph neural networks, where infinite-width limits yield graph-specific kernels for tasks like node classification, and dependent weight models that incorporate correlations in parameters for more flexible priors. Recent theoretical advances have also explored NNGP kernels in high-dimensional settings, such as cardinality estimation, where they provide uncertainty-aware approximations leveraging the universal approximation properties of neural architectures. Additionally, connections to the neural tangent kernel highlight how NNGPs describe the untrained network prior, contrasting with the kernel that governs dynamics during training. These developments underscore the NNGP's utility in bridging non-parametric Bayesian methods with scalable neural modeling, facilitating applications in uncertainty quantification and kernel-based learning.

Introduction

Definition and Motivation

A neural network Gaussian process (NNGP) refers to the probabilistic model obtained in the limit where the width of a neural network, defined as the number of neurons in each hidden layer, approaches infinity, under Bayesian priors on the network weights. In this regime, the function computed by the network, when evaluated at any finite set of inputs, converges in distribution to a multivariate Gaussian with mean zero and a covariance specified by a kernel function that depends on the network's architecture and activation functions. This equivalence establishes that infinitely wide Bayesian neural networks behave as Gaussian processes, providing a non-parametric prior over functions in which the kernel encodes the inductive biases of the original network structure.

The motivation for studying NNGPs stems from the desire to combine the expressive power of neural networks with the principled uncertainty quantification offered by Gaussian processes, particularly in scenarios where finite-width networks suffer from overfitting or lack reliable predictive variances. By taking the infinite-width limit, exact Bayesian inference becomes tractable, as the posterior over functions can be computed analytically using the GP framework, avoiding the intractability of integrating over high-dimensional weight spaces in standard Bayesian neural networks. This correspondence addresses key limitations in deep learning, such as poor calibration of confidence estimates, by enabling scalable methods for tasks requiring probabilistic predictions, like active learning or safety-critical applications. Furthermore, it provides theoretical insights into why wide neural networks generalize well, bridging empirical observations in deep learning with probabilistic modeling.

To illustrate, consider a simple fully connected single-layer neural network with random Gaussian-initialized weights and a nonlinear activation like ReLU applied to the hidden units before a linear readout. As the number of hidden units increases to infinity, the pre-activation outputs at different inputs become jointly Gaussian due to the central limit theorem, so the network's overall function follows a GP prior with zero mean and a covariance kernel that reflects the expected similarity between inputs under the activation's geometry. This GP prior induces smooth, data-dependent functions that adapt to the input distribution, demonstrating how the infinite-width limit transforms a parametric model into a flexible, non-parametric one capable of approximating a wide class of mappings.

While the core NNGP framework initially focuses on fully connected architectures, it naturally extends to more complex structures like convolutional networks by defining analogous recursive kernels that incorporate spatial invariances. The non-parametric nature of the resulting Gaussian process allows for highly flexible function modeling without fixed-capacity constraints, making NNGPs particularly valuable for complex data modalities where traditional neural networks may underperform in capturing uncertainty.
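
The single-hidden-layer illustration above can be made concrete with a short Monte Carlo sketch. The following NumPy code (with hypothetical widths and variances, not taken from any particular source) draws many random single-hidden-layer ReLU networks and checks that the joint outputs at two fixed inputs settle toward a stable Gaussian covariance as the width grows:

```python
import numpy as np

def sample_outputs(width, x1, x2, n_nets=2000, sigma_w=1.0, sigma_b=0.1):
    """Outputs of many random single-hidden-layer ReLU networks at two inputs."""
    rng = np.random.default_rng(0)
    d_in = x1.size
    outs = np.empty((n_nets, 2))
    for i in range(n_nets):
        W1 = rng.normal(0.0, sigma_w / np.sqrt(d_in), size=(width, d_in))
        b1 = rng.normal(0.0, sigma_b, size=width)
        W2 = rng.normal(0.0, sigma_w / np.sqrt(width), size=(1, width))
        b2 = rng.normal(0.0, sigma_b)
        h1 = np.maximum(W1 @ x1 + b1, 0.0)   # hidden activations at input x1
        h2 = np.maximum(W1 @ x2 + b1, 0.0)   # hidden activations at input x2
        outs[i] = [float(W2 @ h1 + b2), float(W2 @ h2 + b2)]
    return outs

x1, x2 = np.array([1.0, 0.0]), np.array([0.6, 0.8])
for width in (10, 100, 1000):
    cov = np.cov(sample_outputs(width, x1, x2).T)
    print(width, cov)  # empirical 2x2 covariance stabilizes as width grows
```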

Historical Development

The concept of neural network Gaussian processes (NNGPs) originated with Radford Neal's 1996 work Bayesian Learning for Neural Networks, which demonstrated that Bayesian priors over the weights of a single-hidden-layer network with infinitely many units induce a prior over functions that converges to a Gaussian process in the infinite-width limit. This equivalence provided a Bayesian interpretation of neural networks, linking them to nonparametric models and influencing early work in probabilistic machine learning.

The extension to deep neural networks was established in 2018 by Jaehoon Lee and colleagues, who showed that fully connected networks with multiple layers also converge to Gaussian processes in the infinite-width limit, introducing a recursive computation for the resulting kernel that captures layer-wise transformations. This work generalized Neal's result, enabling the analysis of deep architectures as hierarchical Gaussian processes and highlighting the role of activation functions in kernel design.

Recent developments have explored NNGPs in specialized contexts, such as wide networks achieving perfect fitting of training data through their Gaussian process equivalence, as examined in a 2023 NeurIPS paper on deep equilibrium models. In 2025, research in Nature Physics showed that certain quantum neural networks, based on Haar-random unitary or orthogonal layers, converge to Gaussian processes in the large Hilbert space limit, bridging classical and quantum paradigms. A subsequent October 2025 study established quantitative convergence rates for trained quantum neural networks to Gaussian processes under infinite width. In July 2025, researchers revisited the equivalence of Bayesian neural networks and Gaussian processes, emphasizing the role of learnable activations in the infinite-width limit. Additionally, a 2023 study in Machine Learning: Science and Technology applied NNGPs to model potential energy surfaces for polyatomic molecules, demonstrating their efficiency in capturing complex molecular interactions with fewer parameters than traditional methods.

The theoretical understanding of NNGPs has evolved from Neal's initial arguments based on Bayesian priors to rigorous measure-theoretic proofs in subsequent works, such as the 2018 analysis of wide deep networks that ensures convergence for composed layers. This progression has profoundly influenced machine learning by providing tools for uncertainty quantification and kernel-based approximations in scalable neural models.

Background Concepts

Gaussian Processes

A Gaussian process (GP) is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution. It is completely specified by a mean function \mu(x) and a covariance (or kernel) function k(x, x'), and is denoted by f \sim \mathcal{GP}(\mu, k), where f represents the function values at inputs x. This formulation treats the GP as a prior distribution over functions, enabling probabilistic modeling of unknown mappings from inputs to outputs.

Gaussian processes possess several key properties that make them suitable for machine learning. They are non-parametric models, operating in an infinite-dimensional function space rather than a fixed-dimensional parameter space, which allows flexibility in capturing complex patterns without assuming a specific functional form. Additionally, GPs support exact posterior inference through simple operations on the joint Gaussian distribution, providing a fully Bayesian framework where beliefs about the function update based on observed data. Unlike deterministic models such as neural networks, GPs inherently model stochasticity, yielding distributions over possible functions.

The choice of covariance kernel is central to a Gaussian process, as it encodes assumptions about the function's properties, such as smoothness or periodicity. A widely used kernel is the squared exponential (or radial basis function) kernel, given by k(x, x') = \sigma_f^2 \exp\left( -\frac{\|x - x'\|^2}{2\ell^2} \right), where \sigma_f^2 controls the vertical scale and \ell the length scale; this kernel generates infinitely differentiable, highly smooth sample functions. Another common choice is the Matérn kernel, parameterized by a smoothness factor \nu (often \nu = 5/2 or 3/2), which allows for more realistic roughness in functions; for \nu = 5/2, it takes the form k(x, x') = \sigma_f^2 \left(1 + \sqrt{5} r + \frac{5}{3} r^2 \right) \exp\left( -\sqrt{5} r \right), with r = \|x - x'\| / \ell. These kernels facilitate regression tasks by enabling point predictions alongside uncertainty quantification, where the predictive variance reflects data density and model confidence.

Inference in Gaussian processes typically involves regression, where noisy observations y = f(x) + \epsilon are modeled with \epsilon \sim \mathcal{N}(0, \sigma_n^2). The posterior predictive distribution at a test point x_* is Gaussian, with mean \mu_* = K(x_*, X) [K(X, X) + \sigma_n^2 I]^{-1} y and variance \sigma_*^2 = k(x_*, x_*) - K(x_*, X) [K(X, X) + \sigma_n^2 I]^{-1} K(X, x_*), assuming a zero-mean prior for simplicity; here K(X, X) denotes the kernel matrix over training inputs X and K(x_*, X) the cross-covariances between the test point and the training inputs. This exact inference comes at a computational cost of O(n^3) time and O(n^2) space for n data points, dominated by the inversion of the n \times n covariance matrix.
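
To make the predictive equations concrete, the following NumPy sketch implements exact GP regression with a squared exponential kernel on toy data; the data, noise level, and hyperparameter values are illustrative rather than drawn from a specific source:

```python
import numpy as np

def rbf_kernel(X1, X2, sigma_f=1.0, length=1.0):
    # Squared exponential kernel k(x, x') = sigma_f^2 exp(-||x - x'||^2 / (2 l^2))
    sq_dists = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return sigma_f**2 * np.exp(-0.5 * sq_dists / length**2)

# Toy training data and test points
X = np.linspace(-3, 3, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)
X_star = np.linspace(-4, 4, 100)[:, None]
sigma_n = 0.1  # observation noise standard deviation

# Exact GP posterior: an O(n^3) solve against the regularized kernel matrix
K = rbf_kernel(X, X) + sigma_n**2 * np.eye(len(X))
K_star = rbf_kernel(X_star, X)
alpha = np.linalg.solve(K, y)
mu_star = K_star @ alpha                                                        # posterior mean
cov_star = rbf_kernel(X_star, X_star) - K_star @ np.linalg.solve(K, K_star.T)  # posterior covariance
```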

Neural Network Fundamentals

Feedforward neural networks consist of multiple layers of interconnected neurons, where each neuron in a given layer receives inputs from the previous layer, applies a linear transformation, and passes the result through a nonlinear activation function. The basic architecture includes an input layer, one or more hidden layers, and an output layer, with connections defined by weights and biases. For a network with L layers, the computation in layer l involves weights W^l \in \mathbb{R}^{n^l \times n^{l-1}} and biases b^l \in \mathbb{R}^{n^l}, where n^l denotes the width (number of neurons) in layer l. The pre-activation for layer l is given by z^l = W^l h^{l-1} + b^l, followed by the activation h^l = \phi(z^l), with h^0 = x the input and \phi the activation function, such as ReLU (\phi(z) = \max(0, z)) or the error function (\phi(z) = \operatorname{erf}(z / \sqrt{2})), chosen for their properties in facilitating gradient flow and enabling nonlinear mappings. The forward pass computes the network output as a nested composition of these layer functions: f(x) = \phi_L(W^L \phi_{L-1}(W^{L-1} \cdots \phi_1(W^1 x + b^1) \cdots + b^{L-1}) + b^L), transforming inputs through successive nonlinearities to produce predictions.

In a Bayesian framework, neural networks are treated probabilistically by placing priors on the weights and biases, typically independent Gaussian distributions such as W^l_{ij} \sim \mathcal{N}(0, \sigma_w^2) and b^l_i \sim \mathcal{N}(0, \sigma_b^2), which induce a prior distribution over the functions realized by the network. This perspective views the network as a probabilistic model where the prior captures uncertainty in the mapping from inputs to outputs, analogous to Gaussian processes in Bayesian nonparametrics but defined through finite parametric compositions.

The width n^l and depth L define the network's capacity, with wider layers allowing more parallel computations and deeper layers enabling hierarchical feature extraction; however, to maintain stable signal propagation, weights are scaled during initialization, such as by a factor of 1 / \sqrt{n^{l-1}}, to preserve variance across layers for activations like tanh or ReLU. This scaling prevents vanishing or exploding gradients in deep networks by ensuring that the variance of activations remains approximately constant from layer to layer. While training typically employs stochastic gradient descent to minimize a loss function by iteratively updating weights via backpropagation, the Bayesian view emphasizes the prior-induced distribution over functions before optimization, where that distribution reflects the model's inductive biases before data adaptation.
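
The following NumPy sketch illustrates one draw from the prior over functions induced by Gaussian priors on weights and biases, using the 1 / \sqrt{n^{l-1}} fan-in scaling discussed above; the widths, variances, and function names are hypothetical:

```python
import numpy as np

def forward(x, widths, sigma_w=1.0, sigma_b=0.1, phi=lambda z: np.maximum(z, 0.0)):
    """One random draw from the prior over functions of a fully connected ReLU network.

    Weights are drawn with standard deviation sigma_w / sqrt(n_{l-1}) so that the
    activation variance stays roughly constant from layer to layer."""
    rng = np.random.default_rng(0)
    h = x
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
        b = rng.normal(0.0, sigma_b, size=n_out)
        z = W @ h + b          # pre-activation z^l = W^l h^{l-1} + b^l
        h = phi(z)             # activation h^l = phi(z^l)
    return z                   # final pre-activation serves as the (linear-readout) output

x = np.random.randn(5)                      # a single 5-dimensional input
out = forward(x, widths=[5, 512, 512, 1])   # input, hidden, and output widths
```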

Theoretical Framework

Infinite-Width Limit

In the infinite-width limit, the outputs of a fully connected neural network converge in distribution to those of a Gaussian process, providing a Bayesian nonparametric perspective on wide networks. This equivalence arises primarily from the central limit theorem (CLT), which implies that pre-activations z^l at layer l, defined as the linear transformation of the previous layer's activations, become Gaussian distributed due to the summation over a large number of independent terms proportional to the layer width n^{l-1}. Specifically, for a fully connected network, the pre-activation for the i-th unit in layer l is z^l_i(x) = b^l_i + \sum_{j=1}^{n^{l-1}} w^l_{ij} y^{l-1}_j(x), where weights w^l_{ij} and biases b^l_i are drawn i.i.d. from distributions with zero mean and appropriate variances; as n^{l-1} \to \infty, the CLT ensures z^l(x) is Gaussian for any input x.

This Gaussianity propagates recursively through the network layers. For a single-layer network, the output is exactly a Gaussian process when the width is infinite, as the function values at finite sets of inputs are jointly Gaussian by construction. Extending to multi-layer networks with L layers, the argument applies inductively: assuming the post-activations y^{l-1} from the previous layer behave as a Gaussian process conditional on the inputs, the pre-activations z^l at the next layer are Gaussian conditional on y^{l-1}, and thus the overall outputs y^L are jointly Gaussian for any finite set of inputs, yielding a Gaussian process prior over functions. This layer-by-layer recursion holds under mild conditions on the nonlinearity, such as the linear envelope property, ensuring the limit is non-degenerate.

Crucial to achieving this convergence without degeneracy or explosion in variance is the scaling of the weight variances with the fan-in, typically set to \sigma_w^{2,l} = \hat{\sigma}_w^{2,l} / n^{l-1} for weights in layer l, alongside a constant bias variance \sigma_b^{2,l}. This scaling normalizes the contributions from each previous neuron, preventing the pre-activation variance from vanishing to zero or diverging to infinity as the width grows, thereby maintaining a well-defined Gaussian process limit with finite covariance. Without such scaling, the network outputs could collapse to constants or exhibit unstable behavior, underscoring the importance of this parameterization in theoretical analyses and practical implementations.
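
A small Monte Carlo check, with hypothetical widths and variances, of how the fan-in scaling \sigma_w^2 / n^{l-1} keeps the pre-activation variance finite and approximately width-independent:

```python
import numpy as np

def preactivation_variance(width, d_in=10, depth=3, sigma_w2=2.0, sigma_b2=0.1, n_draws=1000):
    """Empirical variance of one final pre-activation of a random ReLU network
    whose weight variance is scaled by the fan-in (sigma_w2 / n_in)."""
    rng = np.random.default_rng(0)
    x = np.ones(d_in)  # fixed input
    samples = []
    for _ in range(n_draws):
        h = x
        for _ in range(depth):
            n_in = h.size
            W = rng.normal(0.0, np.sqrt(sigma_w2 / n_in), size=(width, n_in))
            b = rng.normal(0.0, np.sqrt(sigma_b2), size=width)
            z = W @ h + b
            h = np.maximum(z, 0.0)
        samples.append(z[0])  # track a single output unit
    return np.var(samples)

for n in (16, 64, 256):
    print(n, preactivation_variance(n))  # roughly the same finite value for every width
```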

Derivation of the NNGP Kernel

The Neural Network Gaussian Process (NNGP) kernel is defined as the covariance function K^l(x, x') = \operatorname{Cov}[z^l(x), z^l(x')], where z^l denotes the pre-activation outputs at layer l under the infinite-width prior on the network parameters. For the input layer (l = 0), the kernel takes the form of a linear kernel: K^0(x, x') = \sigma_b^2 + \sigma_w^2 \frac{x \cdot x'}{d_{\text{in}}}, where \sigma_b^2 is the bias variance, \sigma_w^2 is the weight variance, and d_{\text{in}} is the input dimension; this follows from the Gaussian initialization of weights W_{ij} \sim \mathcal{N}(0, \sigma_w^2 / d_{\text{in}}) and biases b_j \sim \mathcal{N}(0, \sigma_b^2).

Subsequent layers are computed recursively. Conditioning on the previous layer's kernel K^{l-1}, the pre-activations z^l(x) at layer l are Gaussian with covariance K^l(x, x') = \sigma_b^2 + \sigma_w^2 \mathbb{E}[\phi(u) \phi(v)], where \phi is the activation function and the expectation is over the bivariate Gaussian (u, v) \sim \mathcal{N}(0, \Sigma) with \Sigma = \begin{pmatrix} K^{l-1}(x, x) & K^{l-1}(x, x') \\ K^{l-1}(x', x) & K^{l-1}(x', x') \end{pmatrix}. This expectation, often denoted \mathbb{E}[\phi(u) \phi(v)] = V_\phi(K^{l-1}(x, x'), K^{l-1}(x, x), K^{l-1}(x', x')), admits closed-form expressions for common activations and can be computed exactly, yielding a deterministic recursion for K^l given K^{l-1}. In the infinite-width limit (n^l \to \infty), the kernel K^l conditioned on K^{l-1} becomes deterministic due to the law of large numbers applied to the finite-sum structure of the pre-activations, enabling layer-wise exact computation without stochasticity.

For the ReLU activation \phi(z) = \max(0, z), the expectation corresponds to the first-order arc-cosine kernel: K^l(x, x') = \sigma_b^2 + \frac{\sigma_w^2}{2\pi} \sqrt{K^{l-1}(x, x) K^{l-1}(x', x')} \left[ \sin \theta^{l-1} + (\pi - \theta^{l-1}) \cos \theta^{l-1} \right], where \theta^{l-1} = \arccos \left( \frac{K^{l-1}(x, x')}{\sqrt{K^{l-1}(x, x) K^{l-1}(x', x')}} \right). Iterating this recursion to the output layer L, the final pre-activations satisfy z^L \mid x \sim \mathcal{GP}(0, K^L), establishing the NNGP as a Gaussian process with kernel K^L; equivalently, z^l \mid z^{l-1} \sim \mathcal{GP}(0, K^l) holds layer-wise.
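
The following NumPy sketch implements this ReLU recursion directly from the formulas above; the function name and hyperparameter values are illustrative rather than taken from any library:

```python
import numpy as np

def nngp_kernel_relu(X1, X2, depth=3, sigma_w2=2.0, sigma_b2=0.1):
    """Recursive NNGP kernel for a fully connected ReLU network.

    Starts from K^0(x, x') = sigma_b^2 + sigma_w^2 x.x'/d_in and applies the
    arc-cosine update K^l = sigma_b^2 + sigma_w^2/(2 pi) sqrt(Kxx Kx'x')
                              * (sin(theta) + (pi - theta) cos(theta))."""
    d_in = X1.shape[1]
    K = sigma_b2 + sigma_w2 * (X1 @ X2.T) / d_in
    Kxx = sigma_b2 + sigma_w2 * np.sum(X1 * X1, axis=1) / d_in
    Kyy = sigma_b2 + sigma_w2 * np.sum(X2 * X2, axis=1) / d_in
    for _ in range(depth):
        norm = np.sqrt(np.outer(Kxx, Kyy))
        cos_theta = np.clip(K / norm, -1.0, 1.0)
        theta = np.arccos(cos_theta)
        K = sigma_b2 + sigma_w2 / (2 * np.pi) * norm * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
        # Diagonal update: E[relu(u)^2] = Kxx / 2, i.e. theta = 0 in the formula above
        Kxx = sigma_b2 + sigma_w2 * Kxx / 2.0
        Kyy = sigma_b2 + sigma_w2 * Kyy / 2.0
    return K

X = np.random.randn(5, 3)
K = nngp_kernel_relu(X, X, depth=3)  # 5 x 5 NNGP covariance matrix
```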

Extensions and Variants

Convolutional and Other Architectures

The extension of the neural network Gaussian process (NNGP) framework to convolutional neural networks (CNNs) involves deriving kernels that respect the local connectivity and weight-sharing properties of convolutional layers. In this limit, the output of a Bayesian CNN with random weights converges to a Gaussian process whose kernel reflects the translation structure of the convolution operation, capturing the spatial correlations it induces. This kernel is computed recursively across layers, where each convolutional layer transforms the covariance structure of the previous layer through Gaussian expectations of the nonlinear activation, leading to structured covariances that model feature maps as multivariate Gaussian processes with dependencies along spatial dimensions.

The derivation adjusts the standard infinite-width limit by considering infinite channels per layer rather than infinite hidden units, ensuring that the central limit theorem applies to the summed contributions from filter outputs while preserving the fixed spatial resolution until pooling. Pooling operations, such as max or average pooling, are treated as deterministic downsampling in the infinite-channel limit, which reduces the dimensionality of the feature maps without introducing additional stochasticity, thereby maintaining the Gaussian process structure through the network depth. For instance, in multi-stage manufacturing processes, the CNN-GPR approach stacks convolutional layers to hierarchically extract features from sequential data, yielding a composite kernel that outperforms traditional GPR on manufacturing prediction tasks by leveraging spatial hierarchies.

Beyond CNNs, the NNGP framework extends to other architectures by adapting the kernel to their inductive biases. For recurrent neural networks (RNNs), the infinite-width limit produces a Gaussian process with a temporal kernel that models dependencies through recursive updates, enabling non-Markovian modeling of time-series data such as in dynamical systems identification. In graph neural networks (GNNs), the kernel incorporates graph structure by propagating messages along edges in the infinite-width regime, resulting in permutation-equivariant kernels suitable for tasks like molecular property prediction, where node features depend on neighborhood connectivity. Transformer-like architectures yield kernels based on query-key dot-product similarities, where self-attention layers induce cross-covariance structures that capture long-range dependencies in sequences, as seen in correlated Gaussian process transformer models for uncertainty-aware attention. Recent advancements have further generalized this to quantum neural networks (QNNs), where Haar-random unitary layers in the infinite-Hilbert-space limit converge to Gaussian processes with kernels reflecting the structure of the underlying circuits, offering a Bayesian perspective on quantum machine learning for tasks like quantum state tomography.
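
As a sketch of how such convolutional kernels can be computed in practice, the following example uses the Neural Tangents stax API (described under Implementations below) with an illustrative architecture; average pooling is used so that the Gaussian process structure is preserved exactly:

```python
import jax.numpy as jnp
from neural_tangents import stax

# Illustrative convolutional architecture in the infinite-channel limit
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Conv(1, (3, 3), padding='SAME', W_std=1.0, b_std=0.05),
    stax.Relu(),
    stax.Conv(1, (3, 3), padding='SAME', W_std=1.0, b_std=0.05),
    stax.Relu(),
    stax.GlobalAvgPool(),   # average pooling keeps the GP structure intact
    stax.Dense(1)
)

x = jnp.ones((4, 8, 8, 3))      # 4 images of shape 8x8x3 (NHWC)
K = kernel_fn(x, x, 'nngp')     # 4 x 4 NNGP covariance between the images
```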

Relation to the Neural Tangent Kernel

The neural tangent kernel (NTK) provides a kernel-based description of how the network function evolves in its tangent space during gradient flow training in the infinite-width limit, where the training dynamics converge to those of kernel regression. This framework reveals that, for sufficiently wide networks, the training process linearizes around initialization, enabling closed-form analysis of convergence and generalization properties akin to those in kernel methods.

The Neural Network Gaussian Process (NNGP) kernel governs the prior distribution over functions at initialization, while the NTK describes the kernel governing the gradient descent dynamics during training; in certain cases, such as with the error function (erf) activation, the NTK closely approximates the NNGP kernel. This correspondence arises because both kernels emerge from the same infinite-width scaling of random neural networks, with the NNGP capturing the Bayesian prior and the NTK reflecting the linearized dynamics after initialization. For activations like ReLU, the kernels differ but remain closely related through recursive definitions that propagate layer-wise covariances.

Key differences lie in their interpretive roles: the NNGP enables Bayesian sampling from the function prior for uncertainty quantification, whereas the NTK focuses on frequentist optimization trajectories under gradient flow; both facilitate tractable analysis without simulating finite-width networks. These distinctions highlight complementary perspectives, with the NNGP emphasizing pre-training variability and the NTK post-training evolution. The interplay between NNGP and NTK underscores that infinite-width neural networks behave as kernel machines, bridging deep learning with classical kernel methods and enabling theoretical insights into why wide networks generalize well. This connection, established in foundational works, has influenced subsequent analyses of network scaling and has practical implications for designing efficient approximations in large-scale machine learning.
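
Using the same Neural Tangents API discussed under Implementations, both kernels can be obtained from a single architecture definition; the layer widths and variances below are illustrative:

```python
import jax.numpy as jnp
from neural_tangents import stax

# One architecture yields both the prior (NNGP) and training (NTK) kernels
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512, W_std=1.5, b_std=0.05), stax.Relu(), stax.Dense(1)
)

x = jnp.linspace(-2.0, 2.0, 8).reshape(-1, 1)
kernels = kernel_fn(x, x, ('nngp', 'ntk'))   # request both kernels at once
K_nngp, K_ntk = kernels.nngp, kernels.ntk    # prior covariance vs. training kernel
```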

Properties and Implications

Statistical Properties

Neural network Gaussian processes (NNGPs) exhibit non-stationary kernels due to their layered compositional structure, which causes the covariance to depend not only on the difference between inputs but also on their individual magnitudes, in contrast to stationary kernels like the radial basis function (RBF). This non-stationarity emerges from the recursive definition of the kernel across layers, where each layer's update incorporates expectations over activations that scale with input norms.

NNGPs function as universal priors over smooth functions, leveraging the Gaussian process framework to span a function space richer than that representable by any finite-width neural network. This universality stems from the infinite-width limit, where the central limit theorem ensures convergence to a GP with a kernel capable of approximating continuous functions under suitable conditions. The mean function of an NNGP is typically zero when using centered priors for weights and biases, while the output variance scales recursively with network depth, often necessitating variance scaling factors to maintain stable signal propagation across layers.

Sampling from the NNGP prior employs standard Gaussian process sampling techniques, such as drawing functions from the multivariate Gaussian defined by the kernel; for finite-width approximations, Monte Carlo methods average predictions over ensembles of randomly initialized networks to approximate the infinite-width limit. Inference under the NNGP is thus equivalent to exact Gaussian process inference, enabling Bayesian predictions via kernel regression or similar methods.

Analysis of NNGP properties often involves recurrence relations for the kernel's eigenvalues, derived from the layer-wise updates, which provide insights into spectral decay and effective dimensionality. These relations highlight generalization behavior, including double descent, where the test error initially rises with model complexity before declining, mirroring phenomena observed in overparameterized regimes.
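
A short sketch of prior sampling, drawing functions from the zero-mean Gaussian defined by an NNGP kernel computed with the Neural Tangents library; the architecture, hyperparameters, and jitter value are illustrative:

```python
import numpy as np
import jax.numpy as jnp
from neural_tangents import stax

_, _, kernel_fn = stax.serial(
    stax.Dense(512, W_std=1.5, b_std=0.1), stax.Relu(), stax.Dense(1)
)
x = jnp.linspace(-2.0, 2.0, 50).reshape(-1, 1)
K = np.array(kernel_fn(x, x, 'nngp'))           # NNGP prior covariance over the inputs

# Draw three functions from the zero-mean NNGP prior f ~ N(0, K)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))  # small jitter for numerical stability
samples = L @ np.random.randn(len(x), 3)
```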

Applications in Machine Learning

Neural network Gaussian processes (NNGPs) have found significant application in Bayesian neural networks, where the infinite-width limit enables exact Gaussian process posteriors for uncertainty quantification. In this framework, the prior over functions induced by an infinitely wide neural network coincides with a GP, allowing Bayesian inference to capture epistemic uncertainty without variational approximations, which is particularly useful for safety-critical predictions in domains like healthcare and autonomous systems. This equivalence facilitates scalable posterior sampling via GP methods, improving calibration of uncertainty estimates compared to standard neural networks.

In regression and classification tasks, NNGPs serve as scalable priors for modeling high-dimensional data, leveraging the neural network's representational power to define expressive kernels that handle complex nonlinearities. For instance, in molecular simulations, NNGP models have been employed to construct global potential energy surfaces (PES) for polyatomic molecules, achieving high accuracy in predicting energies across wide ranges with fewer training points than traditional GPs, due to the kernel's ability to extrapolate from low-energy data to high-energy regimes. This approach demonstrates NNGPs' efficacy in high-dimensional regression, where finite neural networks approximate the infinite-width GP prior to manage computational demands.

Hybrid models integrating NNGPs with deep kernels enhance flexibility by combining neural feature extraction with GP inference, enabling better generalization in tasks like active learning. In active learning scenarios, the GP acquisition function derived from the NNGP kernel guides data selection to minimize uncertainty, outperforming standard heuristics by focusing queries on informative regions of the input space (see the sketch after this section). Such hybrids, often termed deep kernel learning, use the NNGP as a guiding prior to optimize kernel hyperparameters, improving sample efficiency on benchmarks.

Recent applications highlight NNGPs' versatility in specialized domains. In quantum simulations, quantum neural networks in the infinite-width limit converge to Gaussian processes, enabling efficient modeling of quantum states and dynamics with built-in uncertainty quantification for tasks like error mitigation in quantum circuits. Despite these advances, NNGPs face limitations in scalability for deep networks, as the cubic complexity of GP inference hinders application to large datasets, necessitating sparse approximations or inducing points. For finite-width networks, deviations from the infinite-width GP behavior require perturbative corrections or Monte Carlo methods to approximate the kernel, which can introduce biases in uncertainty estimates for shallower or narrower architectures. These challenges underscore the need for ongoing research into efficient finite-width extensions to broaden practical deployment.
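
As a schematic illustration of the variance-based acquisition rule mentioned above (not a specific published method), the following sketch selects the unlabeled pool point with the largest GP posterior variance; all names and arguments are hypothetical:

```python
import numpy as np

def select_next_query(K_train, K_cross, K_pool_diag, sigma_n=0.1):
    """Pick the pool point with the largest GP posterior variance.

    K_train:     n x n kernel matrix over labeled points
    K_cross:     m x n cross-kernel between pool and labeled points
    K_pool_diag: length-m vector of prior variances at the pool points"""
    A = K_train + sigma_n**2 * np.eye(len(K_train))
    solved = np.linalg.solve(A, K_cross.T)                   # n x m
    post_var = K_pool_diag - np.sum(K_cross * solved.T, axis=1)
    return int(np.argmax(post_var))                          # index of the most uncertain pool point
```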

Implementations

Software Libraries

Several open-source software libraries facilitate the implementation and application of neural network Gaussian processes (NNGPs), enabling researchers to compute NNGP kernels, perform exact or approximate inference, and integrate them into broader workflows. These tools often build on established Gaussian process frameworks while allowing extensions for neural-inspired kernels, such as those derived from infinite-width limits of neural networks.

GPyTorch, a PyTorch-based library for Gaussian processes, enables custom implementation of NNGP kernels through its modular kernel composition system. It allows users to define multi-layer NNGP kernels recursively, incorporating custom activation functions like ReLU or erf, and handles finite-width approximations for practical computations. Key features include automatic differentiation for kernel hyperparameter optimization and scalable inference via inducing points or variational methods, making it suitable for large-scale NNGP modeling. For instance, users can subclass the base kernel class to implement NNGP computations, though full multi-layer implementations require custom coding.

Google's Neural Tangents library, designed for analyzing neural networks in the infinite-width regime, includes dedicated support for both NNGP kernels and neural tangent kernels (NTK). It supports recursive computation of NNGP kernels across various architectures, with built-in handling for custom activations and depth, and integrates seamlessly with JAX for high-performance numerical operations such as efficient sampling from the GP posterior. The library's stax module allows quick construction of architectures whose kernel function computes the NNGP covariance, enabling tasks such as exact Bayesian inference with infinitely wide networks. An example for a single-hidden-layer NNGP with erf activation is:
```python
import jax.numpy as jnp
from neural_tangents import stax

# Define a simple single-hidden-layer network with erf activation
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(1, W_std=1.0, b_std=0.1), stax.Erf(), stax.Dense(1)
)

# Compute the NNGP kernel matrix for a batch of inputs
x = jnp.ones((10, 1))             # input data
K_nngp = kernel_fn(x, x, 'nngp')  # 10 x 10 NNGP covariance matrix
```

This facilitates rapid prototyping and scales to deeper networks via batched computations.

GPflow, a TensorFlow/Keras-compatible library for Gaussian processes, allows custom kernel definitions suitable for implementing NNGPs through its custom kernel classes. Users can implement NNGP kernels by subclassing gpflow.kernels.Kernel and defining the recursive computations, with support for finite-width corrections and integration into variational GP models. It emphasizes scalability for non-conjugate likelihoods and provides tools for hyperparameter inference via MCMC or variational inference. While not specialized for infinite-width limits, GPflow's flexibility allows adaptation for NNGP-based priors in probabilistic modeling.

Additionally, specialized implementations like the open-source nngp repository provide code for constructing covariance kernels equivalent to infinitely wide, fully connected, deep neural networks.
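
A minimal GPflow skeleton illustrating the subclassing pattern; it implements only the input-layer (linear) NNGP kernel K^0, with the understanding that deeper layers would add the activation-specific recursion on top, and the class name is hypothetical:

```python
import tensorflow as tf
import gpflow

class LinearNNGPLayer(gpflow.kernels.Kernel):
    """Sketch: the input-layer NNGP kernel K^0(x, x') = sigma_b^2 + sigma_w^2 x.x'/d_in."""
    def __init__(self, sigma_w2=1.0, sigma_b2=0.1):
        super().__init__()
        self.sigma_w2 = gpflow.Parameter(sigma_w2, transform=gpflow.utilities.positive())
        self.sigma_b2 = gpflow.Parameter(sigma_b2, transform=gpflow.utilities.positive())

    def K(self, X, X2=None):
        if X2 is None:
            X2 = X
        d_in = tf.cast(tf.shape(X)[-1], X.dtype)
        return self.sigma_b2 + self.sigma_w2 * tf.matmul(X, X2, transpose_b=True) / d_in

    def K_diag(self, X):
        d_in = tf.cast(tf.shape(X)[-1], X.dtype)
        return self.sigma_b2 + self.sigma_w2 * tf.reduce_sum(X * X, axis=-1) / d_in
```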

Practical and Computational Considerations

The computation of the neural network Gaussian process (NNGP) kernel itself is typically efficient, as it involves recursive evaluation of covariance functions based on input features, but the subsequent Gaussian process inference scales cubically with the number of data points n, requiring O(n^3) time for kernel matrix inversion and O(n^2) space for storage. This cubic complexity arises from solving linear systems in the GP posterior, making exact NNGP inference prohibitive for datasets beyond a few thousand points. To mitigate this, practitioners often employ sparse approximations, such as inducing points, which reduce complexity to O(m^3 + nm^2), where m \ll n is the number of inducing variables, or structured kernel approximations such as low-rank decompositions (see the sketch after this section).

In practice, exact infinite-width NNGP assumptions are rarely realized, so finite-width approximations are used to model realistic networks. These include empirical estimation of the NNGP kernel by averaging over multiple finite-width initializations, which captures deviations from the infinite-width limit while introducing controllable estimation error. Finite-width effects lead to a trade-off: narrower networks exhibit higher variance in predictions due to amplified fluctuations, but they can reduce bias by enabling feature learning absent in the infinite-width regime. Such approximations are particularly useful for uncertainty quantification in Bayesian settings, though they require careful sampling to balance computational cost and accuracy.

Hyperparameter selection in NNGPs critically influences the resulting kernel's expressivity and stability. Key parameters include the weight variance \sigma_w^2 and bias variance \sigma_b^2, which control signal propagation through layers; for example, \sigma_w^2 = 2 for ReLU activations preserves variance across layers, while smaller values prevent exploding covariances in deeper networks. Activation function choices, such as erf for analytic tractability or ReLU for compatibility with modern architectures, further shape the kernel's non-stationarity. Depth and width interact via scaling laws: increasing width reduces the effective dimensionality of the function space, but stable behavior requires the per-weight standard deviation to scale as 1 / \sqrt{\text{width}} so that the kernel variance stays constant, enabling stable deep NNGPs without gradient issues.

NNGPs present optimization challenges due to their inherent non-stationarity, where the kernel depends explicitly on input magnitudes and network depth, complicating hyperparameter tuning via standard GP methods like marginal likelihood maximization. This non-stationarity can lead to ill-conditioned kernel matrices, exacerbating numerical instability in inversion. Integrating NNGPs with stochastic gradient descent (SGD) in hybrid training regimes, such as using NNGP priors to initialize finite-width networks, requires careful variance matching to avoid mode collapse, often addressed through variational approximations or ensemble methods.

Best practices recommend deploying NNGPs primarily for small datasets (n < 10^3), where exact inference is viable, or as informative priors in Bayesian workflows to encode neural-like inductive biases. For large-scale applications, combining NNGPs with neural tangent kernels (NTKs) is effective, leveraging the NNGP for prior specification and the NTK for scalable gradient-based training dynamics.
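
A minimal sketch of the Nyström-style low-rank approximation mentioned above, assuming `kernel` is any NNGP kernel function (such as the ReLU recursion sketched earlier); the inducing points are chosen uniformly at random purely for illustration:

```python
import numpy as np

def nystrom_approximation(kernel, X, n_inducing=50, jitter=1e-6):
    """Low-rank approximation K ~ K_nm K_mm^{-1} K_mn with m << n inducing points,
    reducing inference cost from O(n^3) toward O(n m^2)."""
    idx = np.random.choice(len(X), size=n_inducing, replace=False)
    Z = X[idx]                                            # inducing inputs drawn from the data
    K_mm = kernel(Z, Z) + jitter * np.eye(n_inducing)     # jitter guards against ill-conditioning
    K_nm = kernel(X, Z)
    L = np.linalg.cholesky(K_mm)
    A = np.linalg.solve(L, K_nm.T)                        # m x n
    return A.T @ A                                        # rank-m approximation of the full kernel
```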
