Gaussian process
A Gaussian process (GP) is a stochastic process in which every finite collection of random variables from the process has a multivariate normal distribution, fully specified by a mean function and a covariance function.[1] This framework defines a probability distribution over functions, enabling flexible modeling of complex relationships in data without assuming a fixed parametric form.[2] In machine learning, Gaussian processes serve as a powerful nonparametric Bayesian approach for supervised learning tasks, particularly regression and probabilistic classification, where they provide not only point predictions but also full posterior distributions to quantify uncertainty.[3] The covariance function, often called a kernel, encodes assumptions about the smoothness and structure of the underlying function, with common choices including the squared exponential kernel for smooth functions and the Matérn kernel for more flexible roughness.[1] GPs excel in scenarios with small datasets, offering interpretable results through their connection to kernel methods and reproducing kernel Hilbert spaces.[4]

Historically rooted in geostatistics as kriging—a method developed in the 1950s for spatial interpolation—Gaussian processes gained prominence in statistics and machine learning during the late 20th century, with foundational theoretical advancements in the 1990s and early 2000s.[1] Their computational tractability for moderate data sizes, via exact inference using the multivariate Gaussian posterior, contrasts with scalable approximations like inducing points or variational methods needed for large-scale applications in fields such as robotics, optimization, and climate modeling.[3]
Definition and Fundamentals
Formal Definition
A Gaussian process is a stochastic process \{f(\mathbf{x}) : \mathbf{x} \in \mathcal{X}\}, where \mathcal{X} is an arbitrary index set, such that for any finite collection of distinct points \mathbf{x}_1, \dots, \mathbf{x}_n \in \mathcal{X}, the random vector (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))^\top follows a multivariate normal distribution.[5][6] This property ensures that the process is fully characterized by its finite-dimensional distributions, all of which are Gaussian.[7] The finite-dimensional distributions of a Gaussian process are specified by a mean vector \boldsymbol{\mu} = (\mu(\mathbf{x}_1), \dots, \mu(\mathbf{x}_n))^\top, where \mu(\mathbf{x}_i) = \mathbb{E}[f(\mathbf{x}_i)], and a covariance matrix \mathbf{K}, with entries K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) = \mathrm{Cov}[f(\mathbf{x}_i), f(\mathbf{x}_j)], such that \mathbf{K} is positive semi-definite.[5][8] Thus, (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))^\top \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K}). This construction extends to any finite subset, guaranteeing consistency across all marginal distributions.[7]

A Gaussian process is commonly denoted as f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')), where m(\cdot) is the mean function and k(\cdot, \cdot) is the covariance kernel (or simply covariance function).[7][9] The mean and covariance functions determine all finite-dimensional distributions and thus fully specify the process.[8] While the terms "Gaussian process" and "Gaussian random field" are sometimes used interchangeably to describe such collections of random variables with joint Gaussian marginals, the former is often applied when the index set \mathcal{X} is one-dimensional (e.g., time), and the latter when \mathcal{X} is higher-dimensional (e.g., spatial coordinates).[10] In both cases, the underlying mathematical structure remains identical.[5]
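To make the finite-dimensional definition concrete, the following sketch (using NumPy, and assuming a zero mean function and a squared exponential kernel with unit hyperparameters, both arbitrary choices for illustration) draws joint samples of (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n)) from the implied multivariate normal distribution.

```python
import numpy as np

def sq_exp_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    """Squared exponential covariance k(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

# A finite collection of input points x_1, ..., x_n.
x = np.linspace(0.0, 5.0, 50)

# Mean vector and covariance matrix of the finite-dimensional marginal.
mu = np.zeros_like(x)                          # zero mean function assumed for this sketch
K = sq_exp_kernel(x, x)                        # K_ij = k(x_i, x_j)
K += 1e-10 * np.eye(len(x))                    # small jitter for numerical stability

# Each row is one joint draw of (f(x_1), ..., f(x_n)) ~ N(mu, K).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, K, size=3)
print(samples.shape)                           # (3, 50)
```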
Mean and Covariance Functions
A Gaussian process is fully specified by its mean function and covariance function, which together determine the expected value and the dependence structure of the process at any finite collection of points. The mean function, denoted \mu: \mathcal{X} \to \mathbb{R}, is defined as \mu(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})] for any \mathbf{x} \in \mathcal{X}, where f \sim \mathcal{GP}(\mu, k) and \mathcal{X} is the input space.[4] This function captures the overall trend or systematic component of the process, and in many applications, it is assumed to be zero without loss of generality, as non-zero means can often be incorporated into the model through feature transformations or subtracted via centering techniques.[4] The covariance function, or kernel, k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}, is given by k(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}[f(\mathbf{x}), f(\mathbf{x}')], which encodes the similarity or correlation between function values at different inputs.[4] For the covariance function to define a valid Gaussian process, it must be positive semi-definite, meaning that for any finite set of distinct points \mathbf{x}_1, \dots, \mathbf{x}_n \in \mathcal{X} and any coefficients c_1, \dots, c_n \in \mathbb{R}, the inequality \sum_{i=1}^n \sum_{j=1}^n c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0 holds; this ensures that the resulting covariance matrix is positive semi-definite and thus a valid covariance for a multivariate Gaussian distribution.[11]

Given these functions, the joint distribution of the process at a finite set of points \mathbf{x} = [\mathbf{x}_1, \dots, \mathbf{x}_n]^\top follows a multivariate Gaussian: \mathbf{f}(\mathbf{x}) \sim \mathcal{N}(\boldsymbol{\mu}(\mathbf{x}), \mathbf{K}(\mathbf{x}, \mathbf{x})), where \boldsymbol{\mu}(\mathbf{x}) = [\mu(\mathbf{x}_1), \dots, \mu(\mathbf{x}_n)]^\top and \mathbf{K}(\mathbf{x}, \mathbf{x}) is the n \times n covariance matrix with entries K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j).[4] This finite-dimensional distribution fully characterizes the process's behavior at those points. In practice, normalization techniques such as centering the data—subtracting the empirical mean from observations—allow assuming \mu(\mathbf{x}) = \mathbf{0} without loss of generality, simplifying computations and posterior inference in Bayesian settings.[4] For instance, in regression tasks, a non-zero mean can be modeled separately using linear basis functions, leaving the kernel to capture deviations, which streamlines the update formulas for the posterior mean and variance.[4]

A fundamental result is that any two Gaussian processes sharing the same mean and covariance functions are equivalent in terms of their finite-dimensional distributions, rendering them indistinguishable for practical purposes in finite-sample analyses.[11] This uniqueness theorem underscores that the pair (\mu, k) completely specifies the process up to versions that agree almost surely on finite sets.[11]
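The positive semi-definiteness requirement can be checked numerically on any finite set of inputs. The minimal sketch below, again assuming a squared exponential kernel with unit hyperparameters (an arbitrary choice), evaluates the quadratic form \sum_{i,j} c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) for random coefficients and inspects the eigenvalues of the resulting covariance matrix.

```python
import numpy as np

def rbf(x1, x2, variance=1.0, lengthscale=1.0):
    """Squared exponential kernel, used here as an example of a valid covariance function."""
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, size=30)            # an arbitrary finite set of inputs
K = rbf(x, x)                                  # covariance matrix K_ij = k(x_i, x_j)

# The quadratic form sum_ij c_i c_j k(x_i, x_j) = c^T K c is non-negative
# for any coefficients c, and all eigenvalues of K are >= 0 (up to round-off).
c = rng.normal(size=30)
print(c @ K @ c >= -1e-9)                      # True
print(np.linalg.eigvalsh(K).min() >= -1e-9)    # True
```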
Core Properties
Stationarity
In Gaussian processes, strict stationarity refers to the property where all finite-dimensional distributions remain unchanged under translations of the input domain. Specifically, for any finite set of points x_1, \dots, x_n and any shift h, the joint distribution of f(x_1 + h), \dots, f(x_n + h) equals that of f(x_1), \dots, f(x_n).[11] Wide-sense stationarity, a weaker condition, requires the mean function to be constant, \mu(x) = \mu for all x, and the covariance function to depend solely on the lag \tau = x - x', such that k(x, x') = k(\tau).[12] For Gaussian processes, wide-sense stationarity implies strict stationarity, as the multivariate Gaussian distributions are fully specified by their means and covariances.[13]

This stationarity enables a spectral representation of the covariance kernel through Bochner's theorem, which characterizes stationary positive definite functions as the Fourier transforms of positive finite measures.[14] In particular, a continuous function k: \mathbb{R}^d \to \mathbb{R} is the covariance function of a stationary Gaussian process if and only if there exists a positive finite measure \mu such that k(\tau) = \int_{\mathbb{R}^d} e^{i \omega^\top \tau} \, \mu(d\omega). When \mu has a density S(\omega), the power spectral density, this becomes k(\tau) = \int_{\mathbb{R}^d} S(\omega) e^{i \omega^\top \tau} \, d\omega, with S(\omega) \geq 0 for all \omega, ensuring positive definiteness.[14] While stationary Gaussian processes exhibit translation-invariant statistical properties, many real-world applications involve non-stationary processes where the mean or covariance varies with absolute input positions, necessitating alternative kernel constructions.[14]
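As a concrete one-dimensional instance of this correspondence (written in the same angular-frequency convention as the integral above), the squared exponential kernel pairs with a Gaussian spectral density, which is non-negative everywhere as Bochner's theorem requires:

```latex
% One-dimensional squared exponential kernel and its spectral density,
% under the convention k(\tau) = \int_{\mathbb{R}} S(\omega) e^{i\omega\tau}\, d\omega.
k(\tau) = \sigma^2 \exp\!\left(-\frac{\tau^2}{2\ell^2}\right)
\qquad\Longleftrightarrow\qquad
S(\omega) = \frac{\sigma^2 \ell}{\sqrt{2\pi}} \exp\!\left(-\frac{\ell^2 \omega^2}{2}\right) \ge 0
\quad \text{for all } \omega .
```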
Marginal Variance
In a Gaussian process, the marginal distribution of the function value at any single input point x follows a univariate normal distribution, denoted as f(x) \sim \mathcal{N}(\mu(x), \sigma^2(x)), where \mu(x) is the mean function and the marginal variance \sigma^2(x) is given by the covariance kernel evaluated at that point, \sigma^2(x) = k(x, x). This formulation arises because the Gaussian process defines a joint distribution over function values, and the marginal at a single point is simply the diagonal element of the covariance matrix. The marginal variance \sigma^2(x) quantifies the inherent uncertainty or spread of possible function values at x, reflecting the process's variability without conditioning on observations at other points. Taken marginally, this variance does not involve the function values at other locations, emphasizing the pointwise stochastic nature of the process. For stationary Gaussian processes, the marginal variance is constant across the domain, \sigma^2(x) = k(x, x) = \sigma^2 for all x.

When observations are available, the predictive variance at a new point x_* incorporates data dependence, expressed as \mathrm{var}(f(x_*) \mid \mathbf{y}) = k(x_*, x_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*, where \mathbf{k}_* collects covariances between x_* and training inputs, and K is the covariance matrix of the training data; this reduces the prior marginal variance by accounting for information from nearby points.

In regression tasks, the marginal variance plays a key role in the signal-to-noise ratio, where scaling the kernel by a hyperparameter (often \sigma_f^2) balances the amplitude of the signal against noise variance \sigma_n^2, ensuring the model captures meaningful patterns without overfitting. Appropriate scaling, typically learned via maximum likelihood, aligns the process variance with the data's dynamic range.

Degenerate Gaussian processes occur when the marginal variance vanishes, k(x, x) = 0 for all x, implying a deterministic function with zero uncertainty everywhere, such as a constant mean process without stochastic components. This case reduces the Gaussian process to a point mass (Dirac delta) at the mean function, useful for modeling noise-free scenarios but limiting flexibility.
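The reduction from prior to posterior variance can be illustrated with a short sketch; the data, kernel, and hyperparameters below are arbitrary choices for this example rather than any particular reference implementation.

```python
import numpy as np

def rbf(x1, x2, variance=1.0, lengthscale=1.0):
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

rng = np.random.default_rng(2)
x_train = rng.uniform(0.0, 5.0, size=20)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=20)     # synthetic noisy observations
x_star = np.linspace(0.0, 5.0, 100)                       # test inputs
noise_var = 0.1 ** 2                                      # sigma_n^2

K = rbf(x_train, x_train) + noise_var * np.eye(20)        # K + sigma_n^2 I
K_star = rbf(x_train, x_star)                             # k_* for every test point
prior_var = rbf(x_star, x_star).diagonal()                # prior marginal variance k(x_*, x_*)

# Predictive mean and variance (zero prior mean assumed).
mean_star = K_star.T @ np.linalg.solve(K, y_train)
var_star = prior_var - np.einsum("ij,ij->j", K_star, np.linalg.solve(K, K_star))

# Conditioning on data never increases the marginal variance.
print(np.all(var_star <= prior_var + 1e-9))               # True
```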
Sample Path Properties
Sample paths of a Gaussian process, denoted as realizations f(\cdot), exhibit properties such as continuity and differentiability that depend on the covariance kernel and the domain. Almost sure continuity holds under conditions provided by Kolmogorov's continuity theorem, which ensures that the paths are continuous with probability one if the process satisfies a moment condition on increments.[15] For Gaussian processes on \mathbb{R}^d, the paths are Hölder continuous if there exist constants C > 0, \alpha > 0, and \beta > 0 such that \mathbb{E}\left[ |f(x) - f(y)|^\alpha \right] \leq C |x - y|^{d + \beta} for all x, y in a compact set, where the exponent d + \beta accounts for the dimension. This condition is sufficient for Hölder continuity of order \gamma for any \gamma < \beta / \alpha.[15]

Mean-square differentiability of the process occurs when the covariance kernel k(x, x') admits a continuous second mixed partial derivative \frac{\partial^2 k}{\partial x \partial x'}, ensuring that the derivative process is also Gaussian with a well-defined covariance. In this case, the paths are mean-square differentiable, and under additional regularity, sample path differentiable.[14]

The choice of kernel governs the roughness and smoothness of the sample paths; for instance, Matérn kernels provide tunable regularity through the smoothness parameter \nu, where the paths are k-times mean-square differentiable if and only if \nu > k, allowing control over the expected wiggliness of the function. Squared exponential kernels yield infinitely differentiable paths, while rougher kernels such as the exponential kernel (the Matérn kernel with \nu = 1/2) produce continuous but non-differentiable paths.[14]

Gaussian processes can exhibit discontinuous sample paths in certain cases, particularly when defined on discrete domains where the index set lacks a natural topology for continuity, or when the covariance function fails the Kolmogorov condition, such as for kernels that are not continuous. For example, a centered Gaussian process with covariance R(t,s) = 1 if t = s and 0 otherwise has values at distinct points that are independent standard normals, so its sample "paths" are discontinuous by construction; no continuous modification of such a process exists, reflecting the failure of the Kolmogorov condition.[5][16]
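A simple way to see how the kernel controls roughness is through the expected squared increment of a zero-mean stationary process, \mathbb{E}[(f(x+h) - f(x))^2] = 2(k(0) - k(h)). The sketch below (with unit variance and length scale, an arbitrary choice) shows this quantity shrinking linearly in h for the exponential kernel but quadratically for the squared exponential kernel, matching non-differentiable versus infinitely differentiable paths.

```python
import numpy as np

def sq_exp(h, lengthscale=1.0):
    return np.exp(-0.5 * (h / lengthscale) ** 2)

def exponential(h, lengthscale=1.0):
    return np.exp(-np.abs(h) / lengthscale)

# For a zero-mean stationary GP with unit variance,
# E[(f(x + h) - f(x))^2] = 2 * (k(0) - k(h)).
for h in (1e-1, 1e-2, 1e-3):
    rough = 2.0 * (1.0 - exponential(h))   # ~ 2h: linear in h (rough, non-differentiable paths)
    smooth = 2.0 * (1.0 - sq_exp(h))       # ~ h^2: quadratic in h (very smooth paths)
    print(f"h={h:g}   exponential: {rough:.3e}   squared exponential: {smooth:.3e}")
```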
Covariance Kernels
Kernel Properties
A covariance kernel k: \mathcal{X} \times \mathcal{X} \to \mathbb{R} for a Gaussian process must satisfy the positive semi-definiteness condition to ensure that the resulting covariance matrices are valid for multivariate Gaussian distributions. Specifically, for any finite set of distinct points x_1, \dots, x_n \in \mathcal{X} and any coefficients c_1, \dots, c_n \in \mathbb{R}, the quadratic form \sum_{i=1}^n \sum_{j=1}^n c_i c_j k(x_i, x_j) \geq 0, or equivalently, the n \times n Gram matrix K with entries K_{ij} = k(x_i, x_j) has non-negative eigenvalues. This property guarantees that the variance of any linear combination of process values is non-negative, preserving the non-negativity of variances in the finite-dimensional marginal distributions.[17]

Mercer's theorem provides a spectral decomposition for such kernels under mild conditions, such as continuity on a compact domain. It states that there exist non-negative eigenvalues \lambda_m \geq 0 (only finitely many of them non-zero if the kernel is degenerate) and orthonormal functions \{\phi_m\} in the L^2 space such that k(x, x') = \sum_{m=1}^\infty \lambda_m \phi_m(x) \phi_m(x'), where the series converges absolutely and uniformly on compact sets. This expansion represents the kernel as an inner product in a feature space weighted by the eigenvalues, facilitating analysis of the process's spectral content and approximation methods.[18]

The class of positive semi-definite kernels exhibits closure under several operations, enabling the construction of complex kernels from simpler ones. The sum of two kernels k_1 + k_2 is positive semi-definite because the corresponding Gram matrix is the sum of two positive semi-definite matrices. Similarly, scaling by a positive constant c > 0 yields c k, as the eigenvalues are scaled by c. The pointwise product k_1 \cdot k_2 is also positive semi-definite, corresponding to the tensor product of the respective feature spaces, which preserves the inner product structure. These properties allow for flexible kernel engineering while maintaining validity.[17]

Positive semi-definite kernels induce a feature map \phi: \mathcal{X} \to \mathcal{H} into a Hilbert space \mathcal{H} (possibly infinite-dimensional), such that k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}. This representation underpins the kernel trick in computation, where inner products are computed directly via the kernel without explicit feature vectors, and connects to the reproducing kernel Hilbert space framework. For any x \in \mathcal{X}, the evaluation functional is continuous, with \langle \phi(x), f \rangle_{\mathcal{H}} = f(x) for f \in \mathcal{H}.[9]

Boundedness of the kernel, meaning \sup_{x,x' \in \mathcal{X}} |k(x, x')| < \infty, implies that the process has uniformly bounded variance \mathrm{Var}[f(x)] = k(x,x) \leq M for some M, limiting the magnitude of function values. Continuity of the kernel k with respect to the input topology ensures mean-square continuity of the Gaussian process, i.e., \mathbb{E}[(f(x) - f(x'))^2] \to 0 as x \to x', which under Kolmogorov's continuity theorem can imply almost sure path continuity and thus regularity properties like Hölder continuity depending on the modulus of continuity of k.[17]
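These closure properties are easy to verify numerically. The sketch below uses a squared exponential and a linear kernel (hypothetical choices for illustration) and checks that sums, positive scalings, and pointwise products of their Gram matrices remain positive semi-definite up to floating-point round-off.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def linear(a, b, variance=1.0):
    return variance * a[:, None] * b[None, :]

rng = np.random.default_rng(3)
x = rng.normal(size=40)                     # arbitrary one-dimensional inputs
K1, K2 = rbf(x, x), linear(x, x)

# Sums, positive scalings, and elementwise (Schur) products of valid kernels
# yield Gram matrices whose smallest eigenvalue is >= 0 up to round-off.
for name, K in [("k1 + k2", K1 + K2), ("3 * k1", 3.0 * K1), ("k1 * k2", K1 * K2)]:
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())
```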
Standard Kernel Families
Standard covariance kernels, also known as kernel functions, define the similarity between input points in a Gaussian process and must be positive semi-definite to ensure valid covariance matrices. These parametric forms allow practitioners to model various assumptions about the underlying function's smoothness, periodicity, or structure.

The squared exponential kernel, often referred to as the radial basis function (RBF) kernel, is one of the most widely used due to its flexibility in capturing smooth functions. It is defined as k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left( -\frac{|\mathbf{x} - \mathbf{x}'|^2}{2\ell^2} \right), where \sigma^2 is the variance parameter controlling the overall scale, and \ell is the length scale parameter governing the rate of correlation decay with distance. This kernel produces infinitely differentiable sample paths, making it suitable for modeling highly smooth processes.[19]

The Matérn family of kernels provides a more flexible alternative, allowing control over the smoothness of the process through a parameter \nu. The general form is k(\tau) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \sqrt{2\nu} \frac{|\tau|}{\ell} \right)^\nu K_\nu \left( \sqrt{2\nu} \frac{|\tau|}{\ell} \right), where \tau = |\mathbf{x} - \mathbf{x}'| is the distance, \Gamma is the gamma function, and K_\nu is the modified Bessel function of the second kind of order \nu. The parameter \nu determines the mean-square differentiability of the paths: for \nu = 1/2, the kernel reduces to the exponential kernel, whose paths are continuous but not mean-square differentiable; \nu = 3/2 allows one derivative; and \nu = 5/2 permits two derivatives, with higher \nu approaching the squared exponential kernel as \nu \to \infty. Specific cases like \nu = p + 1/2 for integer p have closed-form expressions without Bessel functions. This family balances smoothness and computational tractability while avoiding the infinite differentiability of the RBF kernel.[19]

The periodic kernel is designed for functions exhibiting repeating patterns and is given by k(x, x') = \sigma^2 \exp\left( -\frac{2 \sin^2 \left( \pi |x - x'| / p \right)}{\ell^2} \right), where p is the period parameter setting the repetition length, \sigma^2 scales the variance, and \ell controls the decay within each period. This kernel enforces exact periodicity by mapping distances via the sine function, producing infinitely differentiable paths that oscillate regularly. It is particularly effective when the data suggests cyclic behavior, though it assumes global periodicity across the input space.[19]

The linear kernel assumes a simpler, non-stationary structure suitable for modeling low-dimensional linear trends or projections: k(\mathbf{x}, \mathbf{x}') = \sigma^2 \mathbf{x}^\top \mathbf{x}', where \sigma^2 adjusts the variance. This kernel corresponds to Bayesian linear regression in the Gaussian process framework, generating sample functions that are linear in the inputs and pass through the origin. It is computationally efficient and serves as a building block for more complex models.[19]

Composite kernels are formed by combining simpler kernels through operations like addition, multiplication, or exponentiation, enabling the modeling of structured data with multiple characteristics. For example, the product of an RBF kernel and a linear kernel, k(\mathbf{x}, \mathbf{x}') = k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') \cdot k_{\text{linear}}(\mathbf{x}, \mathbf{x}'), captures both smooth non-linear variations and global linear trends.
Such compositions inherit properties from their components—e.g., the product of two stationary kernels remains stationary—and allow for interpretable hyperparameters tailored to hierarchical or multiplicative structures in the data. The validity of composite kernels relies on ensuring the result remains positive semi-definite, which holds for products and sums of positive definite kernels.[19]
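The sketch below collects minimal implementations of the kernels discussed in this section, restricted to one-dimensional inputs and the half-integer Matérn cases, together with an RBF-times-linear composition; the hyperparameter defaults are arbitrary illustrations rather than recommended settings.

```python
import numpy as np

def sq_exp(x1, x2, var=1.0, ls=1.0):
    r2 = (x1[:, None] - x2[None, :]) ** 2
    return var * np.exp(-0.5 * r2 / ls ** 2)

def matern(x1, x2, var=1.0, ls=1.0, nu=1.5):
    """Matern kernel for the common half-integer cases nu = 1/2, 3/2, 5/2 (closed forms, no Bessel functions)."""
    r = np.abs(x1[:, None] - x2[None, :]) / ls
    if nu == 0.5:
        return var * np.exp(-r)
    if nu == 1.5:
        s = np.sqrt(3.0) * r
        return var * (1.0 + s) * np.exp(-s)
    if nu == 2.5:
        s = np.sqrt(5.0) * r
        return var * (1.0 + s + s ** 2 / 3.0) * np.exp(-s)
    raise ValueError("only nu in {0.5, 1.5, 2.5} implemented in this sketch")

def periodic(x1, x2, var=1.0, ls=1.0, period=1.0):
    d = np.pi * np.abs(x1[:, None] - x2[None, :]) / period
    return var * np.exp(-2.0 * np.sin(d) ** 2 / ls ** 2)

def linear(x1, x2, var=1.0):
    return var * x1[:, None] * x2[None, :]

x = np.linspace(0.0, 3.0, 25)
# Composite kernel: smooth local variation (RBF) modulated by a global linear trend.
K_composite = sq_exp(x, x) * linear(x, x)
print(np.linalg.eigvalsh(K_composite).min() >= -1e-9)  # still positive semi-definite
```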
Examples and Special Cases
Wiener Process
The Wiener process, also known as standard Brownian motion, is a canonical example of a Gaussian process that models random fluctuations in various physical and financial systems. It is defined as a continuous-time stochastic process \{W(t) : t \geq 0\} with W(0) = 0, independent increments, and normally distributed increments such that W(t) - W(s) \sim \mathcal{N}(0, t - s) for all t > s \geq 0.[20] This construction ensures that the process has stationary increments, meaning the distribution of W(t + h) - W(t) depends only on h > 0 and not on t, but the process itself is non-stationary because its variance \mathrm{Var}(W(t)) = t grows linearly with time.[20]

As a Gaussian process, the Wiener process is fully specified by its mean function \mu(t) = 0 for all t \geq 0 and its covariance kernel k(s, t) = \min(s, t), which captures the dependence structure where earlier times influence later ones cumulatively.[21] The Wiener process can be interpreted as the continuous-time integral of white Gaussian noise, providing a mathematical representation of idealized random perturbations.[22] This perspective underscores its role in stochastic differential equations, where it serves as the driving noise term. The non-stationarity arises directly from the kernel form, as k(s, t) is not a function solely of |t - s|, contrasting with stationary Gaussian processes whose covariances depend only on time differences.[11] Despite this, the independent and stationary increments property makes it a Lévy process, facilitating analytical tractability in applications like diffusion modeling.[20]

Sample paths of the Wiener process exhibit remarkable regularity properties: they are almost surely continuous functions of time, ensuring no jumps occur with probability one.[23] However, these paths are nowhere differentiable almost surely, meaning no tangent exists at any point, which reflects the infinite variation accumulated over any interval.[23] More precisely, the paths are Hölder continuous with any exponent \gamma < 1/2, but not with exponent 1/2, quantifying their roughness in terms of modulus of continuity.[21]

Historically, the Wiener process is named after Norbert Wiener, who in 1923 provided the first rigorous mathematical construction of Brownian motion as a continuous stochastic process with these properties, laying the foundation for modern stochastic analysis.[24]
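A discrete-grid approximation of the Wiener process can be simulated either by cumulatively summing independent Gaussian increments or by sampling the finite-dimensional marginal with covariance \min(s, t) directly; the grid, seed, and jitter below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0.0, 1.0, 201)
dt = t[1] - t[0]

# Construction 1: cumulative sum of independent N(0, dt) increments, with W(0) = 0.
increments = rng.normal(scale=np.sqrt(dt), size=len(t) - 1)
w_path = np.concatenate([[0.0], np.cumsum(increments)])

# Construction 2: sample the finite-dimensional marginal directly, with
# mean 0 and covariance k(s, t) = min(s, t) (jitter added for numerical stability).
K = np.minimum.outer(t[1:], t[1:]) + 1e-12 * np.eye(len(t) - 1)
w_gp = np.concatenate([[0.0], rng.multivariate_normal(np.zeros(len(t) - 1), K)])

# Both constructions give paths whose variance grows roughly linearly in t.
print(w_path.shape, w_gp.shape)
```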
Ornstein-Uhlenbeck Process
The Ornstein-Uhlenbeck process is defined as the solution to the stochastic differential equation
df(t) = -\theta f(t)\, dt + \sigma\, dW(t),
where \theta > 0 represents the speed of mean reversion, \sigma > 0 is the diffusion coefficient, and W(t) denotes a standard Wiener process; this formulation captures a mean-reverting dynamic where fluctuations decay toward the origin over time.[25][26] Viewed as a Gaussian process on \mathbb{R}, the Ornstein-Uhlenbeck process assumes a zero mean function \mu(t) = 0 and features the stationary covariance kernel
k(s,t) = \frac{\sigma^2}{2\theta} \exp\left(-\theta |t - s|\right),
which corresponds to the Matérn kernel family with smoothness parameter \nu = 1/2.[14][27] This stationarity implies a constant marginal variance of \sigma^2 / (2\theta) for all t, with the covariance between any two points decaying exponentially at rate \theta as the separation |t - s| increases, ensuring temporal homogeneity and the Markov property.[14][26] The sample paths of the Ornstein-Uhlenbeck process are continuous almost surely, reflecting the continuity of the driving Wiener process and the Lipschitz continuity of the drift term; however, they possess limited regularity, being mean-square continuous but not mean-square differentiable, akin to the roughness of Brownian motion paths.[28][14] In physical modeling, the Ornstein-Uhlenbeck process classically describes the velocity component of a particle undergoing Brownian motion under viscous friction, providing a foundational example of stochastic damping in statistical mechanics.[25]
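The sketch below simulates the stationary Ornstein-Uhlenbeck process using the exact Gaussian transition implied by the SDE and compares the empirical lag covariance with the kernel \frac{\sigma^2}{2\theta} e^{-\theta |t - s|}; the parameter values and lag are arbitrary choices for illustration.

```python
import numpy as np

theta, sigma, dt, n = 1.5, 0.8, 0.01, 100_000
rng = np.random.default_rng(5)

stat_var = sigma ** 2 / (2.0 * theta)                  # stationary marginal variance
decay = np.exp(-theta * dt)
noise_sd = np.sqrt(stat_var * (1.0 - decay ** 2))      # exact one-step transition noise scale

# Exact discretisation of dX = -theta X dt + sigma dW, started in stationarity.
x = np.empty(n)
x[0] = rng.normal(scale=np.sqrt(stat_var))
for i in range(1, n):
    x[i] = decay * x[i - 1] + noise_sd * rng.normal()

# The empirical covariance at lag tau should approach stat_var * exp(-theta * tau).
lag = 50                                               # tau = lag * dt = 0.5
empirical = np.mean(x[:-lag] * x[lag:])
theoretical = stat_var * np.exp(-theta * lag * dt)
print(f"empirical {empirical:.4f}  vs  theoretical {theoretical:.4f}")
```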