Gaussian process
A Gaussian process (GP) is a stochastic process in which every finite collection of random variables from the process has a multivariate normal distribution, fully specified by a mean function and a covariance function.[1] This framework defines a probability distribution over functions, enabling flexible modeling of complex relationships in data without assuming a fixed parametric form.[2] In machine learning, Gaussian processes serve as a powerful nonparametric Bayesian approach for supervised learning tasks, particularly regression and probabilistic classification, where they provide not only point predictions but also full posterior distributions to quantify uncertainty.[3] The covariance function, often called a kernel, encodes assumptions about the smoothness and structure of the underlying function, with common choices including the squared exponential kernel for smooth functions and the Matérn kernel for more flexible roughness.[1] GPs excel in scenarios with small datasets, offering interpretable results through their connection to kernel methods and reproducing kernel Hilbert spaces.[4]

Historically rooted in geostatistics as kriging—a method developed in the 1950s for spatial interpolation—Gaussian processes gained prominence in statistics and machine learning during the late 20th century, with foundational theoretical advancements in the 1990s and early 2000s.[1] Their computational tractability for moderate data sizes, via exact inference using the multivariate Gaussian posterior, contrasts with scalable approximations like inducing points or variational methods needed for large-scale applications in fields such as robotics, optimization, and climate modeling.[3]
Definition and Fundamentals
Formal Definition
A Gaussian process is a stochastic process \{f(\mathbf{x}) : \mathbf{x} \in \mathcal{X}\}, where \mathcal{X} is an arbitrary index set, such that for any finite collection of distinct points \mathbf{x}_1, \dots, \mathbf{x}_n \in \mathcal{X}, the random vector (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))^\top follows a multivariate normal distribution.[5][6] This property ensures that the process is fully characterized by its finite-dimensional distributions, all of which are Gaussian.[7] The finite-dimensional distributions of a Gaussian process are specified by a mean vector \boldsymbol{\mu} = (\mu(\mathbf{x}_1), \dots, \mu(\mathbf{x}_n))^\top, where \mu(\mathbf{x}_i) = \mathbb{E}[f(\mathbf{x}_i)], and a covariance matrix \mathbf{K}, with entries K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) = \mathrm{Cov}[f(\mathbf{x}_i), f(\mathbf{x}_j)], such that \mathbf{K} is positive semi-definite.[5][8] Thus, (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))^\top \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K}). This construction extends to any finite subset, guaranteeing consistency across all marginal distributions.[7]

A Gaussian process is commonly denoted as f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')), where m(\cdot) is the mean function and k(\cdot, \cdot) is the covariance kernel (or simply covariance function).[7][9] The mean and covariance functions determine all finite-dimensional distributions and thus fully specify the process.[8] While the terms "Gaussian process" and "Gaussian random field" are sometimes used interchangeably to describe such collections of random variables with joint Gaussian marginals, the former is often applied when the index set \mathcal{X} is one-dimensional (e.g., time), and the latter when \mathcal{X} is higher-dimensional (e.g., spatial coordinates).[10] In both cases, the underlying mathematical structure remains identical.[5]
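To make the finite-dimensional definition concrete, the following sketch (using NumPy, and assuming a zero mean function and a squared exponential kernel with unit hyperparameters, both arbitrary choices for illustration) draws joint samples of (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n)) from the implied multivariate normal distribution.

```python
import numpy as np

def sq_exp_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    """Squared exponential covariance k(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

# A finite collection of input points x_1, ..., x_n.
x = np.linspace(0.0, 5.0, 50)

# Mean vector and covariance matrix of the finite-dimensional marginal.
mu = np.zeros_like(x)                          # zero mean function assumed for this sketch
K = sq_exp_kernel(x, x)                        # K_ij = k(x_i, x_j)
K += 1e-10 * np.eye(len(x))                    # small jitter for numerical stability

# Each row is one joint draw of (f(x_1), ..., f(x_n)) ~ N(mu, K).
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, K, size=3)
print(samples.shape)                           # (3, 50)
```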
Mean and Covariance Functions
A Gaussian process is fully specified by its mean function and covariance function, which together determine the expected value and the dependence structure of the process at any finite collection of points. The mean function, denoted \mu: \mathcal{X} \to \mathbb{R}, is defined as \mu(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})] for any \mathbf{x} \in \mathcal{X}, where f \sim \mathcal{GP}(\mu, k) and \mathcal{X} is the input space.[4] This function captures the overall trend or systematic component of the process, and in many applications, it is assumed to be zero without loss of generality, as non-zero means can often be incorporated into the model through feature transformations or subtracted via centering techniques.[4] The covariance function, or kernel, k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}, is given by k(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}[f(\mathbf{x}), f(\mathbf{x}')], which encodes the similarity or correlation between function values at different inputs.[4] For the covariance function to define a valid Gaussian process, it must be positive semi-definite, meaning that for any finite set of distinct points \mathbf{x}_1, \dots, \mathbf{x}_n \in \mathcal{X} and any coefficients c_1, \dots, c_n \in \mathbb{R}, the inequality \sum_{i=1}^n \sum_{j=1}^n c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0 holds; this ensures that the resulting covariance matrix is positive semi-definite and thus a valid covariance for a multivariate Gaussian distribution.[11]

Given these functions, the joint distribution of the process at a finite set of points \mathbf{x} = [\mathbf{x}_1, \dots, \mathbf{x}_n]^\top follows a multivariate Gaussian: \mathbf{f}(\mathbf{x}) \sim \mathcal{N}(\boldsymbol{\mu}(\mathbf{x}), \mathbf{K}(\mathbf{x}, \mathbf{x})), where \boldsymbol{\mu}(\mathbf{x}) = [\mu(\mathbf{x}_1), \dots, \mu(\mathbf{x}_n)]^\top and \mathbf{K}(\mathbf{x}, \mathbf{x}) is the n \times n covariance matrix with entries K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j).[4] This finite-dimensional distribution fully characterizes the process's behavior at those points. In practice, normalization techniques such as centering the data—subtracting the empirical mean from observations—allow assuming \mu(\mathbf{x}) = \mathbf{0} without loss of generality, simplifying computations and posterior inference in Bayesian settings.[4] For instance, in regression tasks, a non-zero mean can be modeled separately using linear basis functions, leaving the kernel to capture deviations, which streamlines the update formulas for the posterior mean and variance.[4]

A fundamental result is that any two Gaussian processes sharing the same mean and covariance functions are equivalent in terms of their finite-dimensional distributions, rendering them indistinguishable for practical purposes in finite-sample analyses.[11] This uniqueness theorem underscores that the pair (\mu, k) completely specifies the process up to versions that agree almost surely on finite sets.[11]
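The positive semi-definiteness requirement can be checked numerically on any finite set of inputs. The minimal sketch below, again assuming a squared exponential kernel with unit hyperparameters (an arbitrary choice), evaluates the quadratic form \sum_{i,j} c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) for random coefficients and inspects the eigenvalues of the resulting covariance matrix.

```python
import numpy as np

def rbf(x1, x2, variance=1.0, lengthscale=1.0):
    """Squared exponential kernel, used here as an example of a valid covariance function."""
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, size=30)            # an arbitrary finite set of inputs
K = rbf(x, x)                                  # covariance matrix K_ij = k(x_i, x_j)

# The quadratic form sum_ij c_i c_j k(x_i, x_j) = c^T K c is non-negative
# for any coefficients c, and all eigenvalues of K are >= 0 (up to round-off).
c = rng.normal(size=30)
print(c @ K @ c >= -1e-9)                      # True
print(np.linalg.eigvalsh(K).min() >= -1e-9)    # True
```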
Core Properties
Stationarity
In Gaussian processes, strict stationarity refers to the property where all finite-dimensional distributions remain unchanged under translations of the input domain. Specifically, for any finite set of points x_1, \dots, x_n and any shift h, the joint distribution of f(x_1 + h), \dots, f(x_n + h) equals that of f(x_1), \dots, f(x_n).[11] Wide-sense stationarity, a weaker condition, requires the mean function to be constant, \mu(x) = \mu for all x, and the covariance function to depend solely on the lag \tau = x - x', such that k(x, x') = k(\tau).[12] For Gaussian processes, wide-sense stationarity implies strict stationarity, as the multivariate Gaussian distributions are fully specified by their means and covariances.[13]

This stationarity enables a spectral representation of the covariance kernel through Bochner's theorem, which characterizes stationary positive definite functions as the Fourier transforms of positive finite measures.[14] In particular, a continuous function k: \mathbb{R}^d \to \mathbb{R} is the covariance function of a stationary Gaussian process if and only if there exists a positive finite measure \mu such that k(\tau) = \int_{\mathbb{R}^d} e^{i \omega^\top \tau} \, \mu(d\omega). When \mu has a density S(\omega), the power spectral density, this becomes k(\tau) = \int_{\mathbb{R}^d} S(\omega) e^{i \omega^\top \tau} \, d\omega, with S(\omega) \geq 0 for all \omega, ensuring positive definiteness.[14] While stationary Gaussian processes exhibit translation-invariant statistical properties, many real-world applications involve non-stationary processes where the mean or covariance varies with absolute input positions, necessitating alternative kernel constructions.[14]
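As a concrete one-dimensional instance of this correspondence (written in the same angular-frequency convention as the integral above), the squared exponential kernel pairs with a Gaussian spectral density, which is non-negative everywhere as Bochner's theorem requires:

```latex
% One-dimensional squared exponential kernel and its spectral density,
% under the convention k(\tau) = \int_{\mathbb{R}} S(\omega) e^{i\omega\tau}\, d\omega.
k(\tau) = \sigma^2 \exp\!\left(-\frac{\tau^2}{2\ell^2}\right)
\qquad\Longleftrightarrow\qquad
S(\omega) = \frac{\sigma^2 \ell}{\sqrt{2\pi}} \exp\!\left(-\frac{\ell^2 \omega^2}{2}\right) \ge 0
\quad \text{for all } \omega .
```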
Marginal Variance
In a Gaussian process, the marginal distribution of the function value at any single input point x follows a univariate normal distribution, denoted as f(x) \sim \mathcal{N}(\mu(x), \sigma^2(x)), where \mu(x) is the mean function and the marginal variance \sigma^2(x) is given by the covariance kernel evaluated at that point, \sigma^2(x) = k(x, x). This formulation arises because the Gaussian process defines a joint distribution over function values, and the marginal at a single point is simply the diagonal element of the covariance matrix. The marginal variance \sigma^2(x) quantifies the inherent uncertainty or spread of possible function values at x, reflecting the process's variability without conditioning on observations at other points. Taken marginally, this variance does not involve the function values at other locations, emphasizing the pointwise stochastic nature of the process. For stationary Gaussian processes, the marginal variance is constant across the domain, \sigma^2(x) = k(x, x) = \sigma^2 for all x.

When observations are available, the predictive variance at a new point x_* incorporates data dependence, expressed as \mathrm{var}(f(x_*) \mid \mathbf{y}) = k(x_*, x_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*, where \mathbf{k}_* collects covariances between x_* and training inputs, and K is the covariance matrix of the training data; this reduces the prior marginal variance by accounting for information from nearby points.

In regression tasks, the marginal variance plays a key role in the signal-to-noise ratio, where scaling the kernel by a hyperparameter (often \sigma_f^2) balances the amplitude of the signal against noise variance \sigma_n^2, ensuring the model captures meaningful patterns without overfitting. Appropriate scaling, typically learned via maximum likelihood, aligns the process variance with the data's dynamic range.

Degenerate Gaussian processes occur when the marginal variance vanishes, k(x, x) = 0 for all x, implying a deterministic function with zero uncertainty everywhere, such as a constant mean process without stochastic components. This case reduces the Gaussian process to a point mass (Dirac delta) at the mean function, useful for modeling noise-free scenarios but limiting flexibility.
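The reduction from prior to posterior variance can be illustrated with a short sketch; the data, kernel, and hyperparameters below are arbitrary choices for this example rather than any particular reference implementation.

```python
import numpy as np

def rbf(x1, x2, variance=1.0, lengthscale=1.0):
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

rng = np.random.default_rng(2)
x_train = rng.uniform(0.0, 5.0, size=20)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=20)     # synthetic noisy observations
x_star = np.linspace(0.0, 5.0, 100)                       # test inputs
noise_var = 0.1 ** 2                                      # sigma_n^2

K = rbf(x_train, x_train) + noise_var * np.eye(20)        # K + sigma_n^2 I
K_star = rbf(x_train, x_star)                             # k_* for every test point
prior_var = rbf(x_star, x_star).diagonal()                # prior marginal variance k(x_*, x_*)

# Predictive mean and variance (zero prior mean assumed).
mean_star = K_star.T @ np.linalg.solve(K, y_train)
var_star = prior_var - np.einsum("ij,ij->j", K_star, np.linalg.solve(K, K_star))

# Conditioning on data never increases the marginal variance.
print(np.all(var_star <= prior_var + 1e-9))               # True
```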
Sample Path Properties
Sample paths of a Gaussian process, denoted as realizations f(\cdot), exhibit properties such as continuity and differentiability that depend on the covariance kernel and the domain. Almost sure continuity holds under conditions provided by Kolmogorov's continuity theorem, which ensures that the paths are continuous with probability one if the process satisfies a moment condition on increments.[15] For Gaussian processes on \mathbb{R}^d, the paths are Hölder continuous if there exist constants C > 0, \alpha > 0, and \beta > 0 such that \mathbb{E}\left[ |f(x) - f(y)|^\alpha \right] \leq C |x - y|^{d + \beta} for all x, y in a compact set, where the exponent d + \beta accounts for the dimension. This condition is sufficient for Hölder continuity of order \gamma for any \gamma < \beta / \alpha.[15]

Mean-square differentiability of the process occurs when the covariance kernel k(x, x') admits a continuous second mixed partial derivative \frac{\partial^2 k}{\partial x \partial x'}, ensuring that the derivative process is also Gaussian with a well-defined covariance. In this case, the paths are mean-square differentiable, and under additional regularity, sample path differentiable.[14]

The choice of kernel governs the roughness and smoothness of the sample paths; for instance, Matérn kernels provide tunable regularity through the smoothness parameter \nu, where the paths are k-times mean-square differentiable if and only if \nu > k, allowing control over the expected wiggliness of the function. Squared exponential kernels yield infinitely differentiable paths, while rougher kernels such as the exponential kernel (the Matérn kernel with \nu = 1/2) produce continuous but non-differentiable paths.[14]

Gaussian processes can exhibit discontinuous sample paths in certain cases, particularly when defined on discrete domains where the index set lacks a natural topology for continuity, or when the covariance function fails the Kolmogorov condition, such as for kernels that are not continuous. For example, a centered Gaussian process with covariance R(t,s) = 1 if t = s and 0 otherwise has values at distinct points that are independent standard normals, so its sample "paths" are discontinuous by construction; no continuous modification of such a process exists, reflecting the failure of the Kolmogorov condition.[5][16]
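A simple way to see how the kernel controls roughness is through the expected squared increment of a zero-mean stationary process, \mathbb{E}[(f(x+h) - f(x))^2] = 2(k(0) - k(h)). The sketch below (with unit variance and length scale, an arbitrary choice) shows this quantity shrinking linearly in h for the exponential kernel but quadratically for the squared exponential kernel, matching non-differentiable versus infinitely differentiable paths.

```python
import numpy as np

def sq_exp(h, lengthscale=1.0):
    return np.exp(-0.5 * (h / lengthscale) ** 2)

def exponential(h, lengthscale=1.0):
    return np.exp(-np.abs(h) / lengthscale)

# For a zero-mean stationary GP with unit variance,
# E[(f(x + h) - f(x))^2] = 2 * (k(0) - k(h)).
for h in (1e-1, 1e-2, 1e-3):
    rough = 2.0 * (1.0 - exponential(h))   # ~ 2h: linear in h (rough, non-differentiable paths)
    smooth = 2.0 * (1.0 - sq_exp(h))       # ~ h^2: quadratic in h (very smooth paths)
    print(f"h={h:g}   exponential: {rough:.3e}   squared exponential: {smooth:.3e}")
```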
Covariance Kernels
Kernel Properties
A covariance kernel k: \mathcal{X} \times \mathcal{X} \to \mathbb{R} for a Gaussian process must satisfy the positive semi-definiteness condition to ensure that the resulting covariance matrices are valid for multivariate Gaussian distributions. Specifically, for any finite set of distinct points x_1, \dots, x_n \in \mathcal{X} and any coefficients c_1, \dots, c_n \in \mathbb{R}, the quadratic form \sum_{i=1}^n \sum_{j=1}^n c_i c_j k(x_i, x_j) \geq 0, or equivalently, the n \times n Gram matrix K with entries K_{ij} = k(x_i, x_j) has non-negative eigenvalues. This property guarantees that the variance of any linear combination of process values is non-negative, preserving the non-negativity of variances in the finite-dimensional marginal distributions.[17]

Mercer's theorem provides a spectral decomposition for such kernels under mild conditions, such as continuity on a compact domain. It states that there exist non-negative eigenvalues \lambda_m \geq 0 (only finitely many of them non-zero if the kernel is degenerate) and orthonormal functions \{\phi_m\} in the L^2 space such that k(x, x') = \sum_{m=1}^\infty \lambda_m \phi_m(x) \phi_m(x'), where the series converges absolutely and uniformly on compact sets. This expansion represents the kernel as an inner product in a feature space weighted by the eigenvalues, facilitating analysis of the process's spectral content and approximation methods.[18]

The class of positive semi-definite kernels exhibits closure under several operations, enabling the construction of complex kernels from simpler ones. The sum of two kernels k_1 + k_2 is positive semi-definite because the corresponding Gram matrix is the sum of two positive semi-definite matrices. Similarly, scaling by a positive constant c > 0 yields c k, as the eigenvalues are scaled by c. The pointwise product k_1 \cdot k_2 is also positive semi-definite, corresponding to the tensor product of the respective feature spaces, which preserves the inner product structure. These properties allow for flexible kernel engineering while maintaining validity.[17]

Positive semi-definite kernels induce a feature map \phi: \mathcal{X} \to \mathcal{H} into a Hilbert space \mathcal{H} (possibly infinite-dimensional), such that k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}. This representation underpins the kernel trick in computation, where inner products are computed directly via the kernel without explicit feature vectors, and connects to the reproducing kernel Hilbert space framework. For any x \in \mathcal{X}, the evaluation functional is continuous, with \langle \phi(x), f \rangle_{\mathcal{H}} = f(x) for f \in \mathcal{H}.[9]

Boundedness of the kernel, meaning \sup_{x,x' \in \mathcal{X}} |k(x, x')| < \infty, implies that the process has uniformly bounded variance \mathrm{Var}[f(x)] = k(x,x) \leq M for some M, limiting the magnitude of function values. Continuity of the kernel k with respect to the input topology ensures mean-square continuity of the Gaussian process, i.e., \mathbb{E}[(f(x) - f(x'))^2] \to 0 as x \to x', which under Kolmogorov's continuity theorem can imply almost sure path continuity and thus regularity properties like Hölder continuity depending on the modulus of continuity of k.[17]
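These closure properties are easy to verify numerically. The sketch below uses a squared exponential and a linear kernel (hypothetical choices for illustration) and checks that sums, positive scalings, and pointwise products of their Gram matrices remain positive semi-definite up to floating-point round-off.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

def linear(a, b, variance=1.0):
    return variance * a[:, None] * b[None, :]

rng = np.random.default_rng(3)
x = rng.normal(size=40)                     # arbitrary one-dimensional inputs
K1, K2 = rbf(x, x), linear(x, x)

# Sums, positive scalings, and elementwise (Schur) products of valid kernels
# yield Gram matrices whose smallest eigenvalue is >= 0 up to round-off.
for name, K in [("k1 + k2", K1 + K2), ("3 * k1", 3.0 * K1), ("k1 * k2", K1 * K2)]:
    print(name, "min eigenvalue:", np.linalg.eigvalsh(K).min())
```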
Standard Kernel Families
Standard covariance kernels, also known as kernel functions, define the similarity between input points in a Gaussian process and must be positive semi-definite to ensure valid covariance matrices. These parametric forms allow practitioners to model various assumptions about the underlying function's smoothness, periodicity, or structure.

The squared exponential kernel, often referred to as the radial basis function (RBF) kernel, is one of the most widely used due to its flexibility in capturing smooth functions. It is defined as k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left( -\frac{|\mathbf{x} - \mathbf{x}'|^2}{2\ell^2} \right), where \sigma^2 is the variance parameter controlling the overall scale, and \ell is the length scale parameter governing the rate of correlation decay with distance. This kernel produces infinitely differentiable sample paths, making it suitable for modeling highly smooth processes.[19]

The Matérn family of kernels provides a more flexible alternative, allowing control over the smoothness of the process through a parameter \nu. The general form is k(\tau) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \sqrt{2\nu} \frac{|\tau|}{\ell} \right)^\nu K_\nu \left( \sqrt{2\nu} \frac{|\tau|}{\ell} \right), where \tau = |\mathbf{x} - \mathbf{x}'| is the distance, \Gamma is the gamma function, and K_\nu is the modified Bessel function of the second kind of order \nu. The parameter \nu determines the mean-square differentiability of the paths: for \nu = 1/2, the kernel reduces to the exponential kernel, whose paths are continuous but not mean-square differentiable; \nu = 3/2 allows one derivative; and \nu = 5/2 permits two derivatives, with higher \nu approaching the squared exponential kernel as \nu \to \infty. Specific cases like \nu = p + 1/2 for integer p have closed-form expressions without Bessel functions. This family balances smoothness and computational tractability while avoiding the infinite differentiability of the RBF kernel.[19]

The periodic kernel is designed for functions exhibiting repeating patterns and is given by k(x, x') = \sigma^2 \exp\left( -\frac{2 \sin^2 \left( \pi |x - x'| / p \right)}{\ell^2} \right), where p is the period parameter setting the repetition length, \sigma^2 scales the variance, and \ell controls the decay within each period. This kernel enforces exact periodicity by mapping distances via the sine function, producing infinitely differentiable paths that oscillate regularly. It is particularly effective when the data suggests cyclic behavior, though it assumes global periodicity across the input space.[19]

The linear kernel assumes a simpler, non-stationary structure suitable for modeling low-dimensional linear trends or projections: k(\mathbf{x}, \mathbf{x}') = \sigma^2 \mathbf{x}^\top \mathbf{x}', where \sigma^2 adjusts the variance. This kernel corresponds to Bayesian linear regression in the Gaussian process framework, generating sample functions that are linear in the inputs and pass through the origin. It is computationally efficient and serves as a building block for more complex models.[19]

Composite kernels are formed by combining simpler kernels through operations like addition, multiplication, or exponentiation, enabling the modeling of structured data with multiple characteristics. For example, the product of an RBF kernel and a linear kernel, k(\mathbf{x}, \mathbf{x}') = k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') \cdot k_{\text{linear}}(\mathbf{x}, \mathbf{x}'), captures both smooth non-linear variations and global linear trends.
Such compositions inherit properties from their components—e.g., the product of two stationary kernels remains stationary—and allow for interpretable hyperparameters tailored to hierarchical or multiplicative structures in the data. The validity of composite kernels relies on ensuring the result remains positive semi-definite, which holds for products and sums of positive definite kernels.[19]
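The sketch below collects minimal implementations of the kernels discussed in this section, restricted to one-dimensional inputs and the half-integer Matérn cases, together with an RBF-times-linear composition; the hyperparameter defaults are arbitrary illustrations rather than recommended settings.

```python
import numpy as np

def sq_exp(x1, x2, var=1.0, ls=1.0):
    r2 = (x1[:, None] - x2[None, :]) ** 2
    return var * np.exp(-0.5 * r2 / ls ** 2)

def matern(x1, x2, var=1.0, ls=1.0, nu=1.5):
    """Matern kernel for the common half-integer cases nu = 1/2, 3/2, 5/2 (closed forms, no Bessel functions)."""
    r = np.abs(x1[:, None] - x2[None, :]) / ls
    if nu == 0.5:
        return var * np.exp(-r)
    if nu == 1.5:
        s = np.sqrt(3.0) * r
        return var * (1.0 + s) * np.exp(-s)
    if nu == 2.5:
        s = np.sqrt(5.0) * r
        return var * (1.0 + s + s ** 2 / 3.0) * np.exp(-s)
    raise ValueError("only nu in {0.5, 1.5, 2.5} implemented in this sketch")

def periodic(x1, x2, var=1.0, ls=1.0, period=1.0):
    d = np.pi * np.abs(x1[:, None] - x2[None, :]) / period
    return var * np.exp(-2.0 * np.sin(d) ** 2 / ls ** 2)

def linear(x1, x2, var=1.0):
    return var * x1[:, None] * x2[None, :]

x = np.linspace(0.0, 3.0, 25)
# Composite kernel: smooth local variation (RBF) modulated by a global linear trend.
K_composite = sq_exp(x, x) * linear(x, x)
print(np.linalg.eigvalsh(K_composite).min() >= -1e-9)  # still positive semi-definite
```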
Examples and Special Cases
Wiener Process
The Wiener process, also known as standard Brownian motion, is a canonical example of a Gaussian process that models random fluctuations in various physical and financial systems. It is defined as a continuous-time stochastic process \{W(t) : t \geq 0\} with W(0) = 0, independent increments, and normally distributed increments such that W(t) - W(s) \sim \mathcal{N}(0, t - s) for all t > s \geq 0.[20] This construction ensures that the process has stationary increments, meaning the distribution of W(t + h) - W(t) depends only on h > 0 and not on t, but the process itself is non-stationary because its variance \mathrm{Var}(W(t)) = t grows linearly with time.[20]

As a Gaussian process, the Wiener process is fully specified by its mean function \mu(t) = 0 for all t \geq 0 and its covariance kernel k(s, t) = \min(s, t), which captures the dependence structure where earlier times influence later ones cumulatively.[21] The Wiener process can be interpreted as the continuous-time integral of white Gaussian noise, providing a mathematical representation of idealized random perturbations.[22] This perspective underscores its role in stochastic differential equations, where it serves as the driving noise term. The non-stationarity arises directly from the kernel form, as k(s, t) is not a function solely of |t - s|, contrasting with stationary Gaussian processes whose covariances depend only on time differences.[11] Despite this, the independent and stationary increments property makes it a Lévy process, facilitating analytical tractability in applications like diffusion modeling.[20]

Sample paths of the Wiener process exhibit remarkable regularity properties: they are almost surely continuous functions of time, ensuring no jumps occur with probability one.[23] However, these paths are nowhere differentiable almost surely, meaning no tangent exists at any point, which reflects the infinite variation accumulated over any interval.[23] More precisely, the paths are Hölder continuous with any exponent \gamma < 1/2, but not with exponent 1/2, quantifying their roughness in terms of modulus of continuity.[21]

Historically, the Wiener process is named after Norbert Wiener, who in 1923 provided the first rigorous mathematical construction of Brownian motion as a continuous stochastic process with these properties, laying the foundation for modern stochastic analysis.[24]
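A discrete-grid approximation of the Wiener process can be simulated either by cumulatively summing independent Gaussian increments or by sampling the finite-dimensional marginal with covariance \min(s, t) directly; the grid, seed, and jitter below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0.0, 1.0, 201)
dt = t[1] - t[0]

# Construction 1: cumulative sum of independent N(0, dt) increments, with W(0) = 0.
increments = rng.normal(scale=np.sqrt(dt), size=len(t) - 1)
w_path = np.concatenate([[0.0], np.cumsum(increments)])

# Construction 2: sample the finite-dimensional marginal directly, with
# mean 0 and covariance k(s, t) = min(s, t) (jitter added for numerical stability).
K = np.minimum.outer(t[1:], t[1:]) + 1e-12 * np.eye(len(t) - 1)
w_gp = np.concatenate([[0.0], rng.multivariate_normal(np.zeros(len(t) - 1), K)])

# Both constructions give paths whose variance grows roughly linearly in t.
print(w_path.shape, w_gp.shape)
```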
Ornstein-Uhlenbeck Process
The Ornstein-Uhlenbeck process is defined as the solution to the stochastic differential equation
df(t) = -\theta f(t)\, dt + \sigma\, dW(t),
where \theta > 0 represents the speed of mean reversion, \sigma > 0 is the diffusion coefficient, and W(t) denotes a standard Wiener process; this formulation captures a mean-reverting dynamic where fluctuations decay toward the origin over time.[25][26] Viewed as a Gaussian process on \mathbb{R}, the Ornstein-Uhlenbeck process assumes a zero mean function \mu(t) = 0 and features the stationary covariance kernel
k(s,t) = \frac{\sigma^2}{2\theta} \exp\left(-\theta |t - s|\right),
which corresponds to the Matérn kernel family with smoothness parameter \nu = 1/2.[14][27] This stationarity implies a constant marginal variance of \sigma^2 / (2\theta) for all t, with the covariance between any two points decaying exponentially at rate \theta as the separation |t - s| increases, ensuring temporal homogeneity and the Markov property.[14][26] The sample paths of the Ornstein-Uhlenbeck process are continuous almost surely, reflecting the continuity of the driving Wiener process and the Lipschitz continuity of the drift term; however, they possess limited regularity, being mean-square continuous but not mean-square differentiable, akin to the roughness of Brownian motion paths.[28][14] In physical modeling, the Ornstein-Uhlenbeck process classically describes the velocity component of a particle undergoing Brownian motion under viscous friction, providing a foundational example of stochastic damping in statistical mechanics.[25]
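The sketch below simulates the stationary Ornstein-Uhlenbeck process using the exact Gaussian transition implied by the SDE and compares the empirical lag covariance with the kernel \frac{\sigma^2}{2\theta} e^{-\theta |t - s|}; the parameter values and lag are arbitrary choices for illustration.

```python
import numpy as np

theta, sigma, dt, n = 1.5, 0.8, 0.01, 100_000
rng = np.random.default_rng(5)

stat_var = sigma ** 2 / (2.0 * theta)                  # stationary marginal variance
decay = np.exp(-theta * dt)
noise_sd = np.sqrt(stat_var * (1.0 - decay ** 2))      # exact one-step transition noise scale

# Exact discretisation of dX = -theta X dt + sigma dW, started in stationarity.
x = np.empty(n)
x[0] = rng.normal(scale=np.sqrt(stat_var))
for i in range(1, n):
    x[i] = decay * x[i - 1] + noise_sd * rng.normal()

# The empirical covariance at lag tau should approach stat_var * exp(-theta * tau).
lag = 50                                               # tau = lag * dt = 0.5
empirical = np.mean(x[:-lag] * x[lag:])
theoretical = stat_var * np.exp(-theta * lag * dt)
print(f"empirical {empirical:.4f}  vs  theoretical {theoretical:.4f}")
```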