
Gaussian process

A Gaussian process (GP) is a stochastic process in which every finite collection of random variables from the process has a multivariate normal distribution, fully specified by a mean function and a covariance function. This framework defines a probability distribution over functions, enabling flexible modeling of complex relationships in data without assuming a fixed parametric form. In machine learning, Gaussian processes serve as a powerful nonparametric Bayesian approach for supervised learning tasks, particularly regression and classification, where they provide not only point predictions but also full posterior distributions to quantify uncertainty. The covariance function, often called a kernel, encodes assumptions about the smoothness and structure of the underlying function, with common choices including the squared exponential for smooth functions and the Matérn for more flexible roughness. GPs excel in scenarios with small datasets, offering interpretable results through their connection to kernel methods and reproducing kernel Hilbert spaces. Historically rooted in geostatistics as kriging—a method developed in the 1950s for spatial interpolation—Gaussian processes gained prominence in statistics and machine learning during the late 20th century, with foundational theoretical advancements in the 1990s and early 2000s. Their computational tractability for moderate data sizes, via exact inference using the multivariate Gaussian posterior, contrasts with scalable approximations like inducing points or variational methods needed for large-scale applications in fields such as geostatistics, Bayesian optimization, and climate modeling.

Definition and Fundamentals

Formal Definition

A Gaussian process is a stochastic process \{f(\mathbf{x}) : \mathbf{x} \in \mathcal{X}\}, where \mathcal{X} is an arbitrary index set, such that for any finite collection of distinct points \mathbf{x}_1, \dots, \mathbf{x}_n \in \mathcal{X}, the random vector (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))^\top follows a multivariate normal distribution. This property ensures that the process is fully characterized by its finite-dimensional distributions, all of which are Gaussian. The finite-dimensional distributions of a Gaussian process are specified by a mean vector \boldsymbol{\mu} = (\mu(\mathbf{x}_1), \dots, \mu(\mathbf{x}_n))^\top, where \mu(\mathbf{x}_i) = \mathbb{E}[f(\mathbf{x}_i)], and a covariance matrix \mathbf{K}, with entries K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) = \mathrm{Cov}[f(\mathbf{x}_i), f(\mathbf{x}_j)], such that \mathbf{K} is positive semi-definite. Thus, (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))^\top \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K}). This construction extends to any finite subset, guaranteeing consistency across all marginal distributions. A Gaussian process is commonly denoted as f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')), where m(\cdot) is the mean function and k(\cdot, \cdot) is the covariance function (or simply kernel). The mean and covariance functions determine all finite-dimensional distributions and thus fully specify the process. While the terms "Gaussian process" and "Gaussian random field" are sometimes used interchangeably to describe such collections of random variables with joint Gaussian marginals, the former is often applied when the index set \mathcal{X} is one-dimensional (e.g., time), and the latter when \mathcal{X} is higher-dimensional (e.g., spatial coordinates). In both cases, the underlying mathematical structure remains identical.
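
This finite-dimensional characterization translates directly into code: evaluating the mean and covariance functions on a grid of inputs and drawing from the resulting multivariate normal produces realizations of the process at those points. The sketch below is a minimal illustration in NumPy; the squared exponential kernel, its parameters, and the small jitter added for numerical stability are illustrative assumptions rather than part of the definition.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared exponential covariance, used here only as an example kernel."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

# Finite collection of input points x_1, ..., x_n.
x = np.linspace(0.0, 5.0, 50)
mu = np.zeros_like(x)                 # zero mean function
K = rbf_kernel(x, x)                  # covariance matrix K_ij = k(x_i, x_j)

# Any finite collection of GP values follows N(mu, K); draw three sample paths.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, K + 1e-10 * np.eye(len(x)), size=3)
print(samples.shape)                  # (3, 50): three realizations at the 50 points
```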

Mean and Covariance Functions

A Gaussian process is fully specified by its mean function and covariance function, which together determine the location and the dependence structure of the process at any finite collection of points. The mean function, denoted \mu: \mathcal{X} \to \mathbb{R}, is defined as \mu(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})] for any \mathbf{x} \in \mathcal{X}, where f \sim \mathcal{GP}(\mu, k) and \mathcal{X} is the input space. This function captures the overall trend or systematic component of the process, and in many applications it is assumed to be identically zero, as non-zero means can often be incorporated into the model through feature transformations or subtracted via centering techniques. The covariance function, or kernel, k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}, is given by k(\mathbf{x}, \mathbf{x}') = \mathrm{Cov}[f(\mathbf{x}), f(\mathbf{x}')], which encodes the similarity or correlation between function values at different inputs. For the covariance function to define a valid Gaussian process, it must be positive semi-definite, meaning that for any finite set of distinct points \mathbf{x}_1, \dots, \mathbf{x}_n \in \mathcal{X} and any coefficients c_1, \dots, c_n \in \mathbb{R}, the inequality \sum_{i=1}^n \sum_{j=1}^n c_i c_j k(\mathbf{x}_i, \mathbf{x}_j) \geq 0 holds; this ensures that the resulting covariance matrix is positive semi-definite and thus a valid covariance for a multivariate Gaussian distribution. Given these functions, the joint distribution of the process at a finite set of points \mathbf{x} = [\mathbf{x}_1, \dots, \mathbf{x}_n]^\top follows a multivariate Gaussian: \mathbf{f}(\mathbf{x}) \sim \mathcal{N}(\boldsymbol{\mu}(\mathbf{x}), \mathbf{K}(\mathbf{x}, \mathbf{x})), where \boldsymbol{\mu}(\mathbf{x}) = [\mu(\mathbf{x}_1), \dots, \mu(\mathbf{x}_n)]^\top and \mathbf{K}(\mathbf{x}, \mathbf{x}) is the n \times n covariance matrix with entries K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j). This finite-dimensional distribution fully characterizes the process's behavior at those points. In practice, normalization techniques such as centering the data—subtracting the empirical mean from the observations—allow assuming \mu(\mathbf{x}) = 0 without loss of generality, simplifying computations and posterior inference in Bayesian settings. For instance, in regression tasks, a non-zero mean can be modeled separately using linear basis functions, leaving the kernel to capture deviations from the trend, which streamlines the update formulas for the posterior mean and variance. A fundamental result is that any two Gaussian processes sharing the same mean and covariance functions are equivalent in terms of their finite-dimensional distributions, rendering them indistinguishable for practical purposes in finite-sample analyses. This uniqueness underscores that the pair (\mu, k) completely specifies the process up to versions that agree almost surely on finite sets.
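
As a rough numerical sanity check of the positive semi-definiteness requirement, one can build the Gram matrix of a candidate kernel on random inputs and confirm that its eigenvalues and arbitrary quadratic forms are non-negative up to round-off; the snippet below sketches this for an assumed RBF kernel and is not a proof of validity.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, size=30)
K = rbf_kernel(x, x)

# Eigenvalues of a valid covariance matrix are non-negative (up to round-off).
print(np.linalg.eigvalsh(K).min() >= -1e-10)

# Equivalently, sum_ij c_i c_j k(x_i, x_j) >= 0 for arbitrary coefficients c.
for _ in range(5):
    c = rng.normal(size=30)
    assert c @ K @ c >= -1e-10
```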

Core Properties

Stationarity

In Gaussian processes, strict stationarity refers to the property where all finite-dimensional distributions remain unchanged under translations of the input domain. Specifically, for any finite set of points x_1, \dots, x_n and any shift h, the joint distribution of f(x_1 + h), \dots, f(x_n + h) equals that of f(x_1), \dots, f(x_n). Wide-sense stationarity, a weaker condition, requires the mean function to be constant, \mu(x) = \mu for all x, and the covariance function to depend solely on the lag \tau = x - x', such that k(x, x') = k(\tau). For Gaussian processes, wide-sense stationarity implies strict stationarity, as the multivariate Gaussian distributions are fully specified by their means and covariances. Stationarity enables a spectral representation of the covariance kernel through Bochner's theorem, which characterizes stationary positive definite functions as the Fourier transforms of positive finite measures. In particular, a continuous function k: \mathbb{R}^d \to \mathbb{R} is a valid stationary covariance function if and only if there exists a positive finite measure \mu such that k(\tau) = \int_{\mathbb{R}^d} e^{i \omega^\top \tau} \, \mu(d\omega). When \mu has a density S(\omega), the power spectral density, this becomes k(\tau) = \int_{\mathbb{R}^d} S(\omega) e^{i \omega^\top \tau} \, d\omega, with S(\omega) \geq 0 for all \omega, ensuring positive definiteness. While stationary Gaussian processes exhibit translation-invariant statistical properties, many real-world applications involve non-stationary processes whose mean or covariance varies with absolute input position, necessitating alternative constructions.
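
Bochner's theorem can be illustrated with the one-dimensional squared exponential kernel, whose spectral measure is a zero-mean Gaussian with variance 1/\ell^2; the short sketch below, with an arbitrarily chosen length scale and lag, compares the kernel value to a Monte Carlo estimate of the spectral integral.

```python
import numpy as np

lengthscale = 0.7
tau = 1.3                                   # a fixed lag tau = x - x'

# Squared exponential covariance at lag tau.
k_exact = np.exp(-0.5 * tau ** 2 / lengthscale ** 2)

# Bochner: k(tau) = E_omega[cos(omega * tau)] with omega ~ N(0, 1 / lengthscale^2),
# i.e. the spectral measure of this kernel is Gaussian.
rng = np.random.default_rng(2)
omega = rng.normal(0.0, 1.0 / lengthscale, size=200_000)
k_mc = np.cos(omega * tau).mean()

print(k_exact, k_mc)   # the Monte Carlo estimate should agree to a few decimals
```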

Marginal Variance

In a Gaussian process, the marginal distribution of the function value at any single input point x follows a univariate Gaussian distribution, denoted as f(x) \sim \mathcal{N}(\mu(x), \sigma^2(x)), where \mu(x) is the mean function and the marginal variance \sigma^2(x) is given by the covariance function evaluated at that point, \sigma^2(x) = k(x, x). This formulation arises because the Gaussian process defines a joint distribution over function values, and the marginal at a single point is simply the corresponding diagonal element of the covariance matrix. The marginal variance \sigma^2(x) quantifies the inherent uncertainty or spread of possible function values at x, reflecting the process's variability without conditioning on observations at other points. In the marginal sense, this variance does not depend on function values at distinct locations, emphasizing the pointwise stochastic nature of the process. For stationary Gaussian processes, the marginal variance is constant across the domain, \sigma^2(x) = k(x, x) = \sigma^2 for all x. When observations are available, the predictive variance at a new point x_* incorporates data dependence, expressed as \mathrm{var}(f(x_*) \mid \mathbf{y}) = k(x_*, x_*) - \mathbf{k}_*^T (K + \sigma_n^2 I)^{-1} \mathbf{k}_*, where \mathbf{k}_* collects covariances between x_* and the training inputs, and K is the covariance matrix of the training data; this reduces the prior marginal variance by accounting for information from nearby points. In regression tasks, the marginal variance plays a key role in the signal-to-noise trade-off, where scaling the kernel by a signal variance hyperparameter (often \sigma_f^2) balances the amplitude of the signal against the noise variance \sigma_n^2, ensuring the model captures meaningful patterns without fitting noise. Appropriate scaling, typically learned via maximum likelihood, aligns the process variance with the data's observed variability. Degenerate Gaussian processes occur when the marginal variance vanishes, k(x, x) = 0 for all x, implying a deterministic function with zero variance everywhere, such as a constant mean process without stochastic components. This case reduces the Gaussian process to a Dirac delta distribution at the mean, useful for modeling noise-free deterministic components but limiting flexibility.
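
The following minimal sketch, assuming an RBF kernel and a fixed noise variance, illustrates numerically how conditioning on nearby observations shrinks the prior marginal variance k(x_*, x_*) to the smaller predictive variance.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

X_train = np.array([0.0, 1.0, 2.0])   # observed inputs (illustrative)
sigma_n2 = 0.1                        # assumed noise variance
x_star = np.array([1.1])              # test point near the data

K = rbf(X_train, X_train) + sigma_n2 * np.eye(3)
k_star = rbf(x_star, X_train)         # covariances between x_star and training inputs

prior_var = rbf(x_star, x_star)[0, 0]                                 # k(x_*, x_*)
post_var = prior_var - (k_star @ np.linalg.solve(K, k_star.T))[0, 0]
print(prior_var, post_var)            # the posterior variance is strictly smaller
```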

Sample Path Properties

Sample paths of a Gaussian process, denoted as realizations f(\cdot), exhibit properties such as continuity and differentiability that depend on the covariance kernel and the domain. Almost sure continuity holds under conditions provided by Kolmogorov's continuity theorem, which ensures that the paths are continuous with probability one if the process satisfies a moment condition on increments. For Gaussian processes on \mathbb{R}^d, the paths are Hölder continuous if there exist constants C > 0, \alpha > 0, and \beta > 0 such that \mathbb{E}\left[ |f(x) - f(y)|^\alpha \right] \leq C |x - y|^{d + \beta} for all x, y in a compact set, where the exponent d + \beta accounts for the dimension of the domain. This condition is nearly necessary and sufficient for Hölder continuity of order \gamma for any \gamma < \beta / \alpha. Mean-square differentiability of the paths occurs when the covariance kernel k(x, x') admits a continuous second mixed partial derivative \frac{\partial^2 k}{\partial x \partial x'}, ensuring that the derivative process is also Gaussian with a well-defined covariance. In this case, the paths are mean-square differentiable, and under additional regularity, sample path differentiable. The choice of kernel governs the roughness and smoothness of the sample paths; for instance, Matérn kernels provide tunable regularity through the smoothness parameter \nu, where the paths are k-times mean-square differentiable if and only if \nu > k, allowing control over the expected wiggliness of the function. Squared exponential kernels yield infinitely differentiable paths, while less smooth choices such as the Matérn kernel with \nu = 3/2 produce rougher, only once-differentiable paths. Gaussian processes can exhibit discontinuous sample paths in certain cases, particularly when defined on discrete domains where the index set lacks a natural topology for continuity, or when the covariance function fails the Kolmogorov condition, such as for kernels that are not continuous. For example, a centered Gaussian process with covariance R(t,s) = 1 if t = s and 0 otherwise on an uncountable index set has discontinuous "paths" by construction; more generally, two processes can share the same mean and covariance, and hence the same finite-dimensional distributions, while one has continuous paths and the other does not, so path continuity is not determined by the distributional specification alone.
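
The dependence of path roughness on the kernel can be observed empirically by drawing sample paths under different kernels on a common grid and comparing a crude roughness proxy such as the average step size; the sketch below uses illustrative length scales and a finite grid, so it only hints at the limiting regularity results stated above.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 400)
r = np.abs(x[:, None] - x[None, :])
ls = 0.2

kernels = {
    "matern12": np.exp(-r / ls),                                      # nu = 1/2, rough paths
    "matern52": (1 + np.sqrt(5) * r / ls + 5 * r ** 2 / (3 * ls ** 2))
                * np.exp(-np.sqrt(5) * r / ls),                       # nu = 5/2, smoother
    "rbf": np.exp(-0.5 * r ** 2 / ls ** 2),                           # infinitely differentiable
}

rng = np.random.default_rng(3)
for name, K in kernels.items():
    f = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)))
    rough = np.mean(np.abs(np.diff(f)))   # average step size as a crude roughness proxy
    print(name, round(rough, 4))          # matern12 paths wiggle the most, rbf the least
```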

Covariance Kernels

Kernel Properties

A covariance kernel k: \mathcal{X} \times \mathcal{X} \to \mathbb{R} for a Gaussian process must satisfy the positive semi-definiteness condition to ensure that the resulting covariance matrices are valid for multivariate Gaussian distributions. Specifically, for any finite set of distinct points x_1, \dots, x_n \in \mathcal{X} and any coefficients c_1, \dots, c_n \in \mathbb{R}, the quadratic form satisfies \sum_{i=1}^n \sum_{j=1}^n c_i c_j k(x_i, x_j) \geq 0, or equivalently, the n \times n Gram matrix K with entries K_{ij} = k(x_i, x_j) has non-negative eigenvalues. This property guarantees that the variance of any linear combination of process values is non-negative, preserving the non-negativity of variances in the finite-dimensional marginal distributions. Mercer's theorem provides a spectral decomposition for such kernels under mild conditions, such as continuity on a compact domain. It states that there exist non-negative eigenvalues \lambda_m \geq 0 (with only finitely many non-zero if the kernel is degenerate) and orthonormal functions \{\phi_m\} in the L^2 space such that k(x, x') = \sum_{m=1}^\infty \lambda_m \phi_m(x) \phi_m(x'), where the series converges absolutely and uniformly on compact sets. This expansion represents the kernel as an inner product in a feature space weighted by the eigenvalues, facilitating analysis of the process's spectral content and its connection to kernel methods. The class of positive semi-definite kernels is closed under several operations, enabling the construction of complex kernels from simpler ones. The sum of two kernels k_1 + k_2 is positive semi-definite because the corresponding Gram matrix is the sum of two positive semi-definite matrices. Similarly, scaling by a positive constant c > 0 yields a valid kernel c k, as the eigenvalues are scaled by c. The product k_1 \cdot k_2 is also positive semi-definite, corresponding to the tensor product of the respective feature spaces, which preserves the inner product structure. These properties allow for flexible kernel construction while maintaining validity. Positive semi-definite kernels induce a feature map \phi: \mathcal{X} \to \mathcal{H} into a Hilbert space \mathcal{H} (possibly infinite-dimensional), such that k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}. This representation underpins the kernel trick, where inner products are computed directly via the kernel without explicit feature vectors, and connects to the reproducing kernel Hilbert space (RKHS) framework. For any x \in \mathcal{X}, the evaluation functional is continuous, with \langle \phi(x), f \rangle_{\mathcal{H}} = f(x) for f \in \mathcal{H}. Boundedness of the kernel, meaning \sup_{x,x' \in \mathcal{X}} |k(x, x')| < \infty, implies that the process has uniformly bounded variance \mathrm{Var}[f(x)] = k(x,x) \leq M for some M, limiting the magnitude of function values. Continuity of the kernel k with respect to the input topology ensures mean-square continuity of the Gaussian process, i.e., \mathbb{E}[(f(x) - f(x'))^2] \to 0 as x \to x', which under Kolmogorov's continuity theorem can imply almost sure path continuity and further regularity properties such as Hölder continuity depending on the modulus of continuity of k.
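
The closure properties can be checked numerically on a random set of inputs by forming sums, positive scalings, and elementwise products of two Gram matrices and confirming that the smallest eigenvalue stays non-negative up to round-off; the snippet below is such a spot check with two assumed kernels, not a general proof.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-2.0, 2.0, size=40)
r = np.abs(x[:, None] - x[None, :])

k1 = np.exp(-0.5 * r ** 2)          # squared exponential Gram matrix
k2 = np.exp(-r)                     # exponential (Matern nu = 1/2) Gram matrix

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

# Sums, positive scalings, and elementwise (Schur) products of PSD kernels stay PSD.
print(min_eig(k1 + k2) >= -1e-10)
print(min_eig(3.0 * k1) >= -1e-10)
print(min_eig(k1 * k2) >= -1e-10)   # elementwise product corresponds to k1(x, x') * k2(x, x')
```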

Standard Kernel Families

Standard covariance kernels, also known as kernel functions, define the similarity between input points in a Gaussian process and must be positive semi-definite to ensure valid covariance matrices. These parametric forms allow practitioners to encode assumptions about the underlying function's smoothness, periodicity, or structure. The squared exponential kernel, often referred to as the radial basis function (RBF) kernel, is one of the most widely used due to its flexibility in capturing smooth functions. It is defined as k(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2} \right), where \sigma^2 is the variance parameter controlling the overall scale, and \ell is the length scale parameter governing the rate of correlation decay with distance. This kernel produces infinitely differentiable sample paths, making it suitable for modeling highly smooth processes. The Matérn family of kernels provides a more flexible alternative, allowing control over the smoothness of the process through a parameter \nu. The general form is k(\tau) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \sqrt{2\nu} \frac{|\tau|}{\ell} \right)^\nu K_\nu \left( \sqrt{2\nu} \frac{|\tau|}{\ell} \right), where \tau = \|\mathbf{x} - \mathbf{x}'\| is the distance, \Gamma is the gamma function, and K_\nu is the modified Bessel function of the second kind of order \nu. The parameter \nu determines the mean-square differentiability of the paths: \nu = 1/2 recovers the exponential kernel, whose paths are continuous but not differentiable; \nu = 3/2 allows one derivative; and \nu = 5/2 permits two derivatives, with the kernel approaching the squared exponential as \nu \to \infty. The specific cases \nu = p + 1/2 for integer p have closed-form expressions that avoid evaluating Bessel functions. This family balances smoothness and computational tractability while avoiding the infinite differentiability of the squared exponential kernel. The periodic kernel is designed for functions exhibiting repeating patterns and is given by k(x, x') = \sigma^2 \exp\left( -\frac{2 \sin^2 \left( \pi |x - x'| / p \right)}{\ell^2} \right), where p is the period parameter setting the repetition length, \sigma^2 scales the variance, and \ell controls the decay within each period. This kernel enforces exact periodicity by mapping distances through the sine function, producing infinitely differentiable paths that oscillate regularly. It is particularly effective when the data suggest cyclic behavior, though it assumes global periodicity across the input space. The linear kernel assumes a simpler, non-stationary structure suitable for modeling linear trends or projections: k(\mathbf{x}, \mathbf{x}') = \sigma^2 \mathbf{x}^\top \mathbf{x}', where \sigma^2 adjusts the variance. This kernel corresponds to Bayesian linear regression in the Gaussian process framework, generating sample paths that are linear functions passing through the origin, with flexibility limited to linear structure. It is computationally efficient and serves as a building block for more complex models. Composite kernels are formed by combining simpler kernels through operations like addition, multiplication, or exponentiation, enabling the modeling of structured data with multiple characteristics. For example, the product of an RBF kernel and a linear kernel, k(\mathbf{x}, \mathbf{x}') = k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') \cdot k_{\text{linear}}(\mathbf{x}, \mathbf{x}'), captures both smooth non-linear variations and global linear trends.
Such compositions inherit properties from their components—e.g., the product of two stationary kernels remains stationary—and allow for interpretable hyperparameters tailored to hierarchical or multiplicative structures in the data. The validity of composite kernels relies on ensuring the result remains positive semi-definite, which holds for products and sums of positive definite kernels.
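
The sketch below collects simple NumPy implementations of several of these kernel families for one-dimensional inputs and forms a composite product kernel; the specific hyperparameter values and the restriction to 1-D inputs are assumptions made for brevity.

```python
import numpy as np

def rbf(x, y, variance=1.0, lengthscale=1.0):
    """Squared exponential (RBF) kernel."""
    return variance * np.exp(-0.5 * np.subtract.outer(x, y) ** 2 / lengthscale ** 2)

def matern32(x, y, variance=1.0, lengthscale=1.0):
    """Matern kernel with nu = 3/2 (once mean-square differentiable paths)."""
    r = np.abs(np.subtract.outer(x, y))
    a = np.sqrt(3.0) * r / lengthscale
    return variance * (1.0 + a) * np.exp(-a)

def periodic(x, y, variance=1.0, lengthscale=1.0, period=1.0):
    """Exactly periodic kernel."""
    r = np.abs(np.subtract.outer(x, y))
    return variance * np.exp(-2.0 * np.sin(np.pi * r / period) ** 2 / lengthscale ** 2)

def linear(x, y, variance=1.0):
    """Non-stationary linear (dot-product) kernel for 1-D inputs."""
    return variance * np.multiply.outer(x, y)

x = np.linspace(0.0, 3.0, 5)
# Composite kernel: a product of RBF and linear captures a smoothly varying linear trend.
K_composite = rbf(x, x) * linear(x, x)
print(np.round(K_composite, 3))
```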

Examples and Special Cases

Wiener Process

The Wiener process, also known as standard Brownian motion, is a canonical example of a Gaussian process that models random fluctuations in various physical and financial systems. It is defined as a continuous-time stochastic process \{W(t) : t \geq 0\} with W(0) = 0, independent increments, and normally distributed increments such that W(t) - W(s) \sim \mathcal{N}(0, t - s) for all t > s \geq 0. This construction ensures that the process has stationary increments, meaning the distribution of W(t + h) - W(t) depends only on h > 0 and not on t, but the process itself is non-stationary because its variance \mathrm{Var}(W(t)) = t grows linearly with time. As a Gaussian process, the Wiener process is fully specified by its mean function \mu(t) = 0 for all t \geq 0 and its covariance kernel k(s, t) = \min(s, t), which captures the dependence structure where earlier times influence later ones cumulatively. The Wiener process can be interpreted as the continuous-time integral of white noise, providing a mathematical representation of idealized random perturbations. This perspective underscores its role in stochastic differential equations, where it serves as the driving noise term. The non-stationarity arises directly from the kernel form, as k(s, t) is not a function solely of |t - s|, contrasting with stationary Gaussian processes whose covariances depend only on time differences. Despite this, the independent and stationary increments property makes it a Lévy process, facilitating analytical tractability in applications such as financial modeling. Sample paths of the Wiener process exhibit remarkable regularity properties: they are almost surely continuous functions of time, ensuring no jumps occur with probability one. However, these paths are nowhere differentiable almost surely, meaning no derivative exists at any point, which reflects the unbounded variation accumulated over any interval. More precisely, the paths are Hölder continuous with any exponent \gamma < 1/2, but not with exponent 1/2, quantifying their roughness in terms of the modulus of continuity. Historically, the process is named after Norbert Wiener, who in 1923 provided the first rigorous mathematical construction of Brownian motion as a continuous stochastic process with these properties, laying the foundation for modern stochastic analysis.
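
A discrete-time simulation built from independent Gaussian increments gives an empirical view of these properties; the sketch below, with an arbitrary step size and number of paths, checks that the sample variance grows roughly linearly in t and that the empirical covariance approximates min(s, t).

```python
import numpy as np

rng = np.random.default_rng(5)
n_steps, n_paths, dt = 1000, 5000, 1e-3
t = np.arange(1, n_steps + 1) * dt

# W(t) built from independent N(0, dt) increments; W(0) = 0.
increments = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
W = np.cumsum(increments, axis=1)

# Empirical checks of the Gaussian-process description of Brownian motion.
s_idx, t_idx = 299, 799                  # times s = 0.3 and t = 0.8
print(W[:, t_idx].var(), t[t_idx])       # Var W(t) grows linearly: ~0.8
cov_st = np.mean(W[:, s_idx] * W[:, t_idx])
print(cov_st, min(t[s_idx], t[t_idx]))   # Cov(W(s), W(t)) ~ min(s, t) = 0.3
```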

Ornstein-Uhlenbeck Process

The Ornstein-Uhlenbeck process is defined as the solution to the stochastic differential equation
df(t) = -\theta f(t)\, dt + \sigma\, dW(t),
where \theta > 0 represents the speed of mean reversion, \sigma > 0 is the diffusion coefficient, and W(t) denotes a standard Wiener process; this formulation captures a mean-reverting dynamic in which fluctuations decay toward the origin over time.
Viewed as a Gaussian process on \mathbb{R}, the Ornstein-Uhlenbeck process assumes a zero mean function \mu(t) = 0 and features the stationary covariance kernel
k(s,t) = \frac{\sigma^2}{2\theta} \exp\left(-\theta |t - s|\right),
which corresponds to the Matérn kernel family with smoothness parameter \nu = 1/2.
This stationarity implies a constant marginal variance of \sigma^2 / (2\theta) for all t, with the covariance between any two points decaying exponentially at rate \theta as the separation |t - s| increases, ensuring temporal homogeneity and the Markov property. The sample paths of the Ornstein-Uhlenbeck process are continuous almost surely, reflecting the continuity of the driving Wiener process and the smoothness of the drift term; however, they possess limited regularity, being mean-square continuous but not mean-square differentiable, akin to the roughness of Brownian motion paths. In physical modeling, the Ornstein-Uhlenbeck process classically describes the velocity of a particle undergoing Brownian motion under viscous friction, providing a foundational example of noisy damping in statistical mechanics.
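
The exact Gaussian transition of the Ornstein-Uhlenbeck process over a time step allows simulation without discretization error; the sketch below uses illustrative values of \theta, \sigma, and the lag, and compares an empirical covariance against the stationary kernel \frac{\sigma^2}{2\theta} e^{-\theta |t - s|}.

```python
import numpy as np

theta, sigma, dt, n_steps, n_paths = 2.0, 1.0, 0.01, 2000, 2000
stat_var = sigma ** 2 / (2.0 * theta)            # stationary marginal variance

rng = np.random.default_rng(6)
x = rng.normal(0.0, np.sqrt(stat_var), size=n_paths)    # start in the stationary law
paths = np.empty((n_paths, n_steps))
decay = np.exp(-theta * dt)
step_sd = np.sqrt(stat_var * (1.0 - decay ** 2))
for k in range(n_steps):
    x = decay * x + step_sd * rng.normal(size=n_paths)  # exact OU transition over dt
    paths[:, k] = x

lag = 50                                          # time separation tau = 0.5
emp_cov = np.mean(paths[:, 1000] * paths[:, 1000 + lag])
print(emp_cov, stat_var * np.exp(-theta * lag * dt))    # ~ (sigma^2 / 2 theta) e^{-theta tau}
```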

Reproducing Kernel Hilbert Space

RKHS Basics

A reproducing kernel Hilbert space (RKHS) is a Hilbert space \mathcal{H} of functions f: \mathcal{X} \to \mathbb{R} on a set \mathcal{X}, equipped with an inner product \langle \cdot, \cdot \rangle_{\mathcal{H}} such that point evaluation at any x \in \mathcal{X} is a continuous linear functional, meaning there exists a reproducing kernel k: \mathcal{X} \times \mathcal{X} \to \mathbb{R} satisfying f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}} for all f \in \mathcal{H}. This reproducing property ensures that the kernel function k(x, \cdot) acts as the representer of the evaluation functional in \mathcal{H}. The space \mathcal{H} is complete with respect to the norm \|f\|_{\mathcal{H}} = \sqrt{\langle f, f \rangle_{\mathcal{H}}}, and the reproducing property implies that evaluations are bounded by the kernel: |f(x)| \leq \|f\|_{\mathcal{H}} \sqrt{k(x,x)} for all f \in \mathcal{H} and x \in \mathcal{X}. The kernel k is symmetric, positive semi-definite, and uniquely determines the inner product via \langle k(x, \cdot), k(y, \cdot) \rangle_{\mathcal{H}} = k(x,y). The reproducing property induces a feature map \phi: \mathcal{X} \to \mathcal{H} defined by \phi(x) = k(x, \cdot), which is infinite-dimensional in general and embeds \mathcal{X} into \mathcal{H} such that the inner product in \mathcal{H} corresponds to kernel evaluations: \langle \phi(x), \phi(y) \rangle_{\mathcal{H}} = k(x,y). Functions in \mathcal{H} can thus be expressed as f(x) = \langle f, \phi(x) \rangle_{\mathcal{H}}, representing linear combinations in the feature space. A canonical example is the radial basis function (RBF) kernel k(x,y) = \exp\left( -\frac{\|x-y\|^2}{2\sigma^2} \right) on \mathbb{R}^d, whose associated RKHS \mathcal{H} consists of very smooth functions with rapidly decaying Fourier transforms. In general, an RKHS \mathcal{H} is a proper subspace of the L^2 space over \mathcal{X} with respect to a probability measure, as the RKHS norm is stricter; however, under Mercer's theorem conditions (e.g., a continuous kernel on a compact domain), \mathcal{H} embeds continuously into L^2, with \|f\|_{L^2} \leq C \|f\|_{\mathcal{H}} for some constant C. The RKHS \mathcal{H}_k associated with a kernel k serves as the Cameron–Martin space for the Gaussian measure induced by a zero-mean Gaussian process f \sim \mathcal{GP}(0, k), with the covariance operator of the measure given by C_k g = \int k(\cdot, x) g(x) \mu(dx) for a base measure \mu. This ties the probabilistic structure of the GP to the geometry of \mathcal{H}_k, where the reproducing property allows point evaluations f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}_k}. Although sample paths from the GP lie in \mathcal{H}_k with probability zero for most kernels, the induced Gaussian measure concentrates on functions whose regularity is governed by the decay of the eigenvalues of C_k. Upon observing noisy data y = f(X) + \epsilon with X = \{x_1, \dots, x_n\} and i.i.d. Gaussian noise \epsilon \sim \mathcal{N}(0, \sigma^2 I), the GP posterior mean lies in the finite-dimensional subspace V_n = \operatorname{span}\{k(x_i, \cdot) : i=1,\dots,n\} of \mathcal{H}_k. In the noiseless limit it is the minimum-RKHS-norm interpolant of the data, while the noise term regularizes the fit and shrinks the posterior toward the prior mean. The explicit form of the posterior mean is m(x) = \sum_{i=1}^n \alpha_i k(x_i, x), \quad \alpha = (K + \sigma^2 I)^{-1} y, where K_{ij} = k(x_i, x_j) is the Gram matrix.
This formulation highlights how the posterior updates the prior by projecting onto the span of kernel functions centered at the observation points. Furthermore, GP regression exhibits a duality with kernel ridge regression in the RKHS, where the posterior mean solves the optimization problem m = \arg\min_{f \in \mathcal{H}_k} \|f\|_{\mathcal{H}_k}^2 + \frac{1}{\sigma^2} \|y - f(X)\|^2. This equivalence underscores the regularizing effect of the GP prior, equivalent to the RKHS norm penalty in kernel methods, bridging probabilistic and deterministic interpretations. In the infinite-data limit as n \to \infty, under suitable conditions on the kernel and the true function, the GP posterior contracts around the truth at rates governed by the smoothness encoded in \mathcal{H}_k, achieving minimax-optimal rates when that smoothness matches the regularity of the true function.
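
The duality with kernel ridge regression can be verified numerically by comparing the closed-form coefficients \alpha = (K + \sigma^2 I)^{-1} y with the minimizer of the regularized objective found by a generic optimizer; the sketch below uses a small synthetic dataset, an assumed RBF kernel, and SciPy's L-BFGS-B, and compares fitted values rather than raw coefficients to reduce sensitivity to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(a, b, ls=0.5):
    return np.exp(-0.5 * np.subtract.outer(a, b) ** 2 / ls ** 2)

rng = np.random.default_rng(7)
X = np.sort(rng.uniform(0.0, 4.0, size=15))
y = np.sin(X) + 0.1 * rng.normal(size=15)
sigma2 = 0.01
K = rbf(X, X)

# Closed-form GP posterior mean coefficients: alpha = (K + sigma^2 I)^{-1} y.
alpha_gp = np.linalg.solve(K + sigma2 * np.eye(15), y)

# Kernel ridge objective over f = sum_i alpha_i k(x_i, .):
# (1 / sigma^2) * ||y - K alpha||^2  +  alpha^T K alpha   (RKHS-norm penalty).
def objective(alpha):
    resid = y - K @ alpha
    return resid @ resid / sigma2 + alpha @ K @ alpha

alpha_ridge = minimize(objective, np.zeros(15), method="L-BFGS-B").x

# The fitted values K @ alpha agree up to optimizer tolerance.
print(np.max(np.abs(K @ (alpha_gp - alpha_ridge))))
```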

Constrained Processes

Linear Equality Constraints

Linear equality constraints on a Gaussian process f \sim \mathcal{GP}(\mu, k) are imposed to ensure that the process satisfies Af = b almost surely, where A is a linear operator mapping the function space to \mathbb{R}^m and b \in \mathbb{R}^m. This setup is particularly useful for exact interpolation problems, where the constraints enforce specific values or linear relations at certain points, modifying the prior distribution to a constrained Gaussian process \mathcal{GP}(\mu_c, k_c) that incorporates the restrictions directly into its mean and covariance functions. For evaluation-type constraints, the constrained mean function is given by \mu_c(x) = \mu(x) + k(x, Z) [k(Z, Z)]^{-1} (b - A \mu(Z)), where Z denotes the set of constraint points or locations relevant to the operator A, and k(x, Z) represents the cross-covariance between x and Z. Similarly, the constrained covariance function is k_c(x, x') = k(x, x') - k(x, Z) [k(Z, Z)]^{-1} k(Z, x'). These expressions arise from conditioning the original Gaussian process on the linear constraints, analogous to standard Gaussian conditioning but with zero noise to enforce exact satisfaction. The constrained process exhibits degeneracy at the constraint locations, where the variance vanishes: k_c(z, z) = 0 for z \in Z, reflecting the deterministic fixing of the function values or relations imposed by the constraints. This zero-variance property ensures exact interpolation without additional probabilistic uncertainty at those points. For specific kernel choices, such as those corresponding to integrated Wiener processes or Matérn kernels with appropriate smoothness parameters, imposing linear equality constraints—such as zero derivatives at boundaries—results in the constrained process reproducing classical cubic spline interpolants. This connection highlights the interpretability of Gaussian processes as Bayesian analogs to spline methods in nonparametric regression.
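
For the simplest case, where the constraints fix function values at a set of points (A is an evaluation operator), the constrained mean and covariance reduce to noise-free Gaussian conditioning; the sketch below, with an assumed RBF kernel and illustrative constraint values, shows the interpolation of b at Z and the vanishing variance there.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    return np.exp(-0.5 * np.subtract.outer(a, b) ** 2 / ls ** 2)

Z = np.array([0.0, 2.0])        # constraint locations
b = np.array([1.0, -1.0])       # required function values f(Z) = b
x = np.linspace(-1.0, 3.0, 9)   # points at which to evaluate the constrained process

Kzz = rbf(Z, Z) + 1e-10 * np.eye(2)     # small jitter for numerical stability
Kxz = rbf(x, Z)

# Constrained mean and covariance (zero prior mean, evaluation-type constraints).
mu_c = Kxz @ np.linalg.solve(Kzz, b)
K_c = rbf(x, x) - Kxz @ np.linalg.solve(Kzz, Kxz.T)

print(np.round(mu_c, 3))                 # interpolates b exactly at Z
print(np.round(np.diag(K_c), 3))         # variance ~0 at the constraint locations
```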

Posterior under Constraints

In Gaussian processes, the posterior distribution under noisy linear constraints arises when both the standard observations and the constraints are subject to noise, generalizing exact conditioning to a probabilistic setting. Consider noisy observations y = f(X) + \eta, where \eta \sim \mathcal{N}(0, \sigma^2 I), and noisy linear constraints z = A f + \varepsilon, where f \sim \mathcal{GP}(0, k), A is a linear operator mapping the GP to the constraint space, and \varepsilon \sim \mathcal{N}(0, \Sigma). The joint distribution of [y; z] is Gaussian with mean zero and a block covariance incorporating the kernel evaluations at the training points X and constraint points Z (determined by A), augmented by the respective noise terms. The posterior distribution is obtained by conditioning the GP on this joint vector [y; z], yielding a Gaussian process with mean function and covariance kernel derived from standard multivariate Gaussian conditioning. The posterior mean at a test point x is given by \mu_\text{post}(x) = k(x, [X; Z]) K^{-1} \begin{bmatrix} y \\ z \end{bmatrix}, where k(x, [X; Z]) = [k(x, X), k(x, Z)] and K is the augmented covariance K = \begin{pmatrix} K_{XX} + \sigma^2 I & K_{XZ} \\ K_{ZX} & K_{ZZ} + \Sigma \end{pmatrix}, with K_{XX} = [k(x_i, x_j)]_{i,j=1}^n, K_{XZ} = [k(x_i, z_l)]_{i,l}, and similarly for the other blocks. The posterior covariance kernel between points x and x' is k_\text{post}(x, x') = k(x, x') - k(x, [X; Z]) K^{-1} k([X; Z], x'). This effective kernel k_\text{post} integrates the information from both data and constraints into a single covariance structure, ensuring the posterior respects the noisy prior knowledge encoded in the constraints. Sampling from this posterior can be performed by generating paths conditioned on the affine subspace implied by the linear constraints, with the noise \varepsilon broadening the conditioning to a probabilistic neighborhood around that subspace; this allows exploration of plausible trajectories consistent with both sources of information. When \Sigma = 0, the formulation reduces to the exact linear constraints case as a special instance. In geostatistics, this framework finds application in co-kriging for spatial prediction under boundary conditions, such as estimating subsurface properties in environmental or resource modeling.
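
The sketch below illustrates the augmented-covariance construction in the simplest setting where the constraint operator is evaluation at the points Z; the kernel, noise levels, and data are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    return np.exp(-0.5 * np.subtract.outer(a, b) ** 2 / ls ** 2)

rng = np.random.default_rng(8)
X = np.array([0.5, 1.5, 2.5]); y = np.sin(X) + 0.05 * rng.normal(size=3)
Z = np.array([0.0, 3.0]);      z = np.array([0.0, 0.2])   # noisy constraint values
sigma2, Sigma = 0.05 ** 2, 0.01 * np.eye(2)               # observation / constraint noise

XZ = np.concatenate([X, Z])
K = rbf(XZ, XZ)
K[:3, :3] += sigma2 * np.eye(3)          # K_XX + sigma^2 I block
K[3:, 3:] += Sigma                       # K_ZZ + Sigma block

x_test = np.array([1.0, 2.0])
k_test = rbf(x_test, XZ)
rhs = np.concatenate([y, z])

mu_post = k_test @ np.linalg.solve(K, rhs)
cov_post = rbf(x_test, x_test) - k_test @ np.linalg.solve(K, k_test.T)
print(np.round(mu_post, 3), np.round(np.diag(cov_post), 3))
```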

Applications

Kriging for Spatial Prediction

Kriging originated in geostatistics as a method for spatial interpolation and prediction of ore grades in mining, developed by South African mining engineer D. G. Krige in his 1951 master's thesis, where he applied statistical techniques to estimate gold reserves based on limited drill-hole data. The theoretical framework was formalized by Georges Matheron in 1963, who coined the term "kriging" in honor of Krige and established it as a best linear unbiased predictor for random functions in spatial domains. This approach models spatial data as realizations of a Gaussian process, enabling predictions at unsampled locations while quantifying uncertainty through conditional distributions. In ordinary kriging, the prediction at an unobserved location x^* given observations y at locations X assumes a stationary Gaussian process with mean zero and covariance function k(\cdot, \cdot), plus independent noise of variance \sigma^2. The posterior predictive distribution is f(x^*) \mid y \sim \mathcal{N}(m(x^*), v(x^*)), where the mean is m(x^*) = k(x^*, X) (K + \sigma^2 I)^{-1} y and K is the covariance matrix over X. This formulation minimizes the mean squared prediction error under the assumption of second-order stationarity, providing weights that balance proximity and spatial correlation. Universal kriging extends ordinary kriging to account for a non-stationary trend, modeling the process mean as \mu(x) = \sum_j \beta_j p_j(x), where \{p_j(x)\} are known basis functions (e.g., polynomials for linear or quadratic trends) and \beta_j are coefficients. The trend parameters are estimated via generalized least squares, incorporating the spatial covariance structure, before kriging is applied to the residuals. This allows for predictions in regions with systematic drifts, such as elevation effects in environmental data. The variogram is central to kriging, quantifying spatial dependence through the semivariance \gamma(h) = \frac{1}{2} \mathbb{E}[(f(x+h) - f(x))^2], which for a stationary process equals k(0) - k(h). Empirical variograms are fitted to data to estimate the covariance structure k, guiding the choice of variogram model (e.g., exponential, spherical, or Matérn) for prediction. Kriging weights satisfy the best linear unbiased estimator (BLUE) property, ensuring unbiased predictions with minimal variance among all linear combinations of observations, derived from the first-order optimality conditions of the constrained minimization. This optimality traces back to the Wiener–Kolmogorov prediction theory for random fields, which provides the foundational linear filtering approach adapted by Matheron to spatial contexts. Stationary kernels, such as the squared exponential, are commonly used in spatial settings to model isotropic dependence.
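
An empirical semivariogram can be computed by binning pairwise distances and averaging half the squared differences of the field values; the sketch below does this on synthetic one-dimensional data drawn from an assumed exponential covariance and compares against the theoretical \gamma(h) = k(0) - k(h).

```python
import numpy as np

rng = np.random.default_rng(9)
# Synthetic 1-D "spatial" data from a GP with exponential covariance (illustrative).
sites = np.sort(rng.uniform(0.0, 10.0, size=200))
r = np.abs(sites[:, None] - sites[None, :])
K = np.exp(-r / 1.5)
f = rng.multivariate_normal(np.zeros(200), K + 1e-8 * np.eye(200))

# Empirical semivariogram: average 0.5 * (f(x_i) - f(x_j))^2 within distance bins.
bins = np.linspace(0.0, 5.0, 11)
half_sq_diff = 0.5 * (f[:, None] - f[None, :]) ** 2
gamma_hat = []
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (r > lo) & (r <= hi)
    gamma_hat.append(half_sq_diff[mask].mean())

# For this stationary covariance, gamma(h) = k(0) - k(h) = 1 - exp(-h / 1.5).
centers = 0.5 * (bins[:-1] + bins[1:])
print(np.round(gamma_hat, 2))
print(np.round(1.0 - np.exp(-centers / 1.5), 2))
```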

Gaussian Processes in Regression

Gaussian process regression models the output y at input \mathbf{x} as y = f(\mathbf{x}) + \epsilon, where f \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')) specifies a Gaussian process prior over latent functions f with mean function m (often taken as zero) and positive definite covariance kernel k, and \epsilon \sim \mathcal{N}(0, \sigma_n^2) denotes independent Gaussian noise. Given training data \mathbf{X} = \{\mathbf{x}_i\}_{i=1}^n and \mathbf{y} = \{y_i\}_{i=1}^n, the joint distribution over training outputs and a test output f_* at new input \mathbf{x}_* is multivariate Gaussian, yielding a closed-form Gaussian posterior predictive distribution f_* \mid \mathbf{X}, \mathbf{y}, \mathbf{x}_* \sim \mathcal{N}(\mu_*(\mathbf{x}_*),\ \sigma_*^2(\mathbf{x}_*)). The predictive mean and variance are given by \begin{align} \mu_*(\mathbf{x}_*) &= \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}, \\ \sigma_*^2(\mathbf{x}_*) &= k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*, \end{align} where \mathbf{K} is the n \times n covariance matrix with entries K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j), \mathbf{k}_* is the vector of covariances k(\mathbf{x}_*, \mathbf{x}_i) for i=1,\dots,n, and \mathbf{I} is the identity matrix. These expressions enable exact inference for small to moderate datasets (n \lesssim 10^3), providing not only point predictions but also calibrated uncertainty estimates directly from the predictive variance. Kernel hyperparameters \boldsymbol{\theta} (e.g., length-scale parameters controlling smoothness) enter the model through the covariance function k(\cdot, \cdot; \boldsymbol{\theta}), along with the noise variance \sigma_n^2, and are typically optimized by maximizing the marginal log likelihood of the observed data: \log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2} \mathbf{y}^T (\mathbf{K}_\boldsymbol{\theta} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} - \frac{1}{2} \log |\mathbf{K}_\boldsymbol{\theta} + \sigma_n^2 \mathbf{I}| - \frac{n}{2} \log 2\pi. This evidence-based approach mitigates overfitting by integrating over functions weighted by the prior, with optimization often performed via gradient-based methods like conjugate gradients or L-BFGS due to the non-convexity of the objective. A key advantage of Gaussian process regression lies in its non-parametric nature, equivalent to using an infinite number of basis functions whose prior is induced by the kernel, allowing flexible modeling without fixing the model complexity in advance. Automatic relevance determination (ARD) extends this by assigning separate length-scale hyperparameters to each input dimension in separable kernels (e.g., squared exponential), driving irrelevant features' length-scales to infinity during optimization and effectively performing input selection. For uncertainty quantification, the predictive variance \sigma_*^2(\mathbf{x}_*) yields probabilistic intervals, such as 95% credible intervals \mu_*(\mathbf{x}_*) \pm 1.96 \sqrt{\sigma_*^2(\mathbf{x}_*)}, which widen in data-sparse regions to reflect increased predictive uncertainty. Gaussian process regression in machine learning draws on kriging techniques in geostatistics as a foundational precursor.
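
In practice these equations are rarely coded by hand; libraries such as scikit-learn expose them behind a small API. The sketch below, assuming scikit-learn is installed and using an illustrative kernel and synthetic data, fits hyperparameters by maximizing the log marginal likelihood and returns predictive means with standard deviations for credible intervals.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(10)
X = rng.uniform(0.0, 5.0, size=(30, 1))              # training inputs (n x d)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=30)    # noisy observations

# sigma_f^2 * RBF(lengthscale) + noise; hyperparameters are fitted by maximizing
# the log marginal likelihood during .fit().
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5).fit(X, y)

X_test = np.linspace(0.0, 5.0, 7).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)     # predictive mean and std. dev.
lower, upper = mean - 1.96 * std, mean + 1.96 * std  # approximate 95% intervals
print(np.round(mean, 2))
print(np.round(std, 2))
```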

Neural Networks and Deep GPs

In the infinite-width limit, Bayesian neural networks exhibit behavior equivalent to Gaussian processes, providing a theoretical bridge between parametric neural architectures and non-parametric probabilistic models. Specifically, as the width of each layer approaches infinity, the prior distribution over functions induced by a Bayesian neural network converges to a Gaussian process with a kernel known as the Neural Network Gaussian Process (NNGP) kernel. This equivalence arises because the central limit theorem applies recursively across layers, leading to Gaussian marginals at each depth. The NNGP kernel for layer l is defined recursively as k^l(\mathbf{x}, \mathbf{x}') = \mathbb{E}[\phi(f^l(\mathbf{x})) \phi(f^l(\mathbf{x}'))], where \phi is the activation function, f^l denotes the pre-activation at layer l, and the expectation is taken over the random weights of the previous layers. Furthermore, during training with gradient descent in the infinite-width regime, the evolution of the network output also corresponds to a Gaussian process, but governed by the neural tangent kernel (NTK). The NTK captures the evolution of the network's function during optimization, remaining constant in the infinite-width limit and enabling kernel regression-like behavior. This kernel is derived from the gradients of the network outputs with respect to the parameters at initialization and is given by \Theta(\mathbf{x}, \mathbf{x}') = \mathbb{E}\left[ \left( \frac{\partial f(\mathbf{x})}{\partial \theta} \right)^\top \left( \frac{\partial f(\mathbf{x}')}{\partial \theta} \right) \right], where \theta are the network parameters. Thus, the predictions of a wide trained network approach those of kernel regression with the NTK, highlighting how wide neural networks generalize through kernel methods. Deep Gaussian processes extend this correspondence by composing multiple Gaussian processes in a hierarchical manner to model complex, multi-level data structures. A deep Gaussian process defines the overall function as a composition f(\mathbf{x}) = f_L \circ f_{L-1} \circ \cdots \circ f_1(\mathbf{x}), where each layer f_i is drawn independently from a Gaussian process prior, allowing for layered transformations that capture non-stationarities and intricate dependencies not easily modeled by single-layer processes. This architecture draws inspiration from deep neural networks but retains the probabilistic interpretability of Gaussian processes. However, inference in deep Gaussian processes is generally intractable due to the nested integrals required for the likelihood, necessitating approximate methods such as variational inference. These connections, established in seminal works, underscore the shared theoretical foundations of neural networks and Gaussian processes, enabling insights into generalization and scalability in deep learning.
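
For a ReLU activation, the layer-wise expectation in the NNGP recursion has a closed form (the arc-cosine kernel), so the kernel of an infinitely wide ReLU network can be computed by simple iteration; the sketch below is an illustrative implementation with assumed weight and bias variances and depth, not a reference implementation of any particular paper.

```python
import numpy as np

def relu_expectation(k_xx, k_xy, k_yy):
    """E[relu(u) relu(v)] for (u, v) ~ N(0, [[k_xx, k_xy], [k_xy, k_yy]]) (arc-cosine form)."""
    norm = np.sqrt(k_xx * k_yy)
    cos_t = np.clip(k_xy / norm, -1.0, 1.0)
    theta = np.arccos(cos_t)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)

def nngp_kernel(x, y, depth=3, sigma_w2=1.5, sigma_b2=0.1):
    """Recursive NNGP kernel for an infinitely wide ReLU network (illustrative parameters)."""
    d = len(x)
    k_xx = sigma_b2 + sigma_w2 * np.dot(x, x) / d
    k_yy = sigma_b2 + sigma_w2 * np.dot(y, y) / d
    k_xy = sigma_b2 + sigma_w2 * np.dot(x, y) / d
    for _ in range(depth):
        new_xy = sigma_b2 + sigma_w2 * relu_expectation(k_xx, k_xy, k_yy)
        k_xx = sigma_b2 + sigma_w2 * relu_expectation(k_xx, k_xx, k_xx)
        k_yy = sigma_b2 + sigma_w2 * relu_expectation(k_yy, k_yy, k_yy)
        k_xy = new_xy
    return k_xy

x, y = np.array([1.0, -0.5]), np.array([0.3, 0.8])
print(nngp_kernel(x, y))   # prior covariance between network outputs at x and y
```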

Computational Aspects

Exact Inference

Exact inference in Gaussian processes relies on the closed-form expressions for the posterior distribution, leveraging the conjugacy between the Gaussian prior and likelihood. Given observed data points (X, y), where X are the input locations and y the corresponding outputs, and new test inputs X_*, the posterior predictive distribution for the latent function values f_* at X_* is p(f_* | y, X, X_*) \propto \int p(y | f, X) p(f_* | f, X_*) p(f | X) \, df. Due to the Gaussian nature of both the prior p(f | X) = \mathcal{N}(f | 0, K) and the conditional p(f_* | f, X_*) = \mathcal{N}(f_* | K_{*X} K^{-1} f, K_{**} - K_{*X} K^{-1} K_{X*}), as well as the likelihood p(y | f, X) = \mathcal{N}(y | f, \sigma^2 I) for noise-corrupted observations, the integral evaluates to a Gaussian distribution p(f_* | y, X, X_*) = \mathcal{N}(f_* | \mu_*, \Sigma_*), where the mean \mu_* = K_{*X} (K + \sigma^2 I)^{-1} y and covariance \Sigma_* = K_{**} - K_{*X} (K + \sigma^2 I)^{-1} K_{X*}. Computing this posterior requires inverting or solving linear systems involving the n \times n covariance matrix K_y = K + \sigma^2 I, where n is the number of data points. This is efficiently achieved using Cholesky decomposition, factoring K_y = L L^T with L a lower triangular matrix. The inverse operations are then performed via forward and back-substitution: solve L \alpha = y for \alpha, then L^T \beta = \alpha to obtain \beta = K_y^{-1} y, enabling the predictive mean as \mu_* = K_{*X} \beta. The decomposition itself costs \mathcal{O}(n^3) time, with additional \mathcal{O}(n^2) work per test point for the predictive variance, making it the dominant computational bottleneck. The marginal likelihood p(y | X, \theta) = \mathcal{N}(y | 0, K_y), where \theta denotes hyperparameters such as length scales and the noise variance, plays a central role in model selection and hyperparameter estimation. It is computed as \log p(y) = -\frac{1}{2} y^T K_y^{-1} y - \frac{1}{2} \log |K_y| - \frac{n}{2} \log 2\pi, with the log-determinant \log |K_y| = 2 \sum_i \log L_{ii} obtained from the Cholesky factors and the quadratic form evaluated via the solved \beta as above. This evidence-based approach allows direct maximization of \log p(y) over \theta without cross-validation. Gradients of the log marginal likelihood with respect to the hyperparameters are available in closed form, facilitating efficient optimization. Specifically, \frac{\partial \log p(y)}{\partial \theta} = -\frac{1}{2} \text{tr}\left(K_y^{-1} \frac{\partial K_y}{\partial \theta}\right) + \frac{1}{2} y^T K_y^{-1} \frac{\partial K_y}{\partial \theta} K_y^{-1} y, where the traces and quadratic forms reuse the Cholesky factorization, giving \mathcal{O}(n^3) cost per gradient evaluation in the optimization loop. These analytic derivatives enable conjugate gradient or L-BFGS methods for hyperparameter learning. Despite these advantages, exact inference is limited to datasets with n \lesssim 10^4 due to the \mathcal{O}(n^3) scaling, beyond which memory and time requirements become prohibitive on standard hardware.
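
A compact Cholesky-based routine covering the predictive mean, predictive variance, and log marginal likelihood might look like the sketch below; the RBF kernel, fixed hyperparameters, and synthetic data are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve, solve_triangular

def rbf(a, b, ls=1.0):
    return np.exp(-0.5 * np.subtract.outer(a, b) ** 2 / ls ** 2)

def gp_exact(X, y, X_star, noise_var=0.01, ls=1.0):
    n = len(X)
    K_y = rbf(X, X, ls) + noise_var * np.eye(n)
    L = cholesky(K_y, lower=True)                    # K_y = L L^T, O(n^3)
    alpha = cho_solve((L, True), y)                  # alpha = K_y^{-1} y

    K_star = rbf(X_star, X, ls)
    mean = K_star @ alpha                            # predictive mean
    v = solve_triangular(L, K_star.T, lower=True)
    cov = rbf(X_star, X_star, ls) - v.T @ v          # predictive covariance

    log_marg = (-0.5 * y @ alpha
                - np.sum(np.log(np.diag(L)))         # 0.5 * log|K_y| from Cholesky
                - 0.5 * n * np.log(2.0 * np.pi))
    return mean, np.diag(cov), log_marg

rng = np.random.default_rng(11)
X = np.sort(rng.uniform(0.0, 5.0, 25)); y = np.sin(X) + 0.1 * rng.normal(size=25)
mean, var, lml = gp_exact(X, y, np.linspace(0.0, 5.0, 5))
print(np.round(mean, 2), np.round(var, 3), round(lml, 2))
```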

Approximation Methods

Gaussian processes (GPs) provide a powerful framework for probabilistic modeling, but exact inference scales cubically with the number of data points due to the need to invert the full covariance matrix, limiting their applicability to large datasets. Approximation methods address this by reducing computational complexity while preserving much of the GP's expressive power, often achieving linear or sub-quadratic scaling. These techniques are essential for scaling GPs to thousands or millions of observations in applications like spatial statistics and machine learning. One prominent class of approximations uses inducing points (or pseudo-inputs), which introduce a low-rank structure to the covariance by parameterizing the posterior through a smaller set of latent variables \mathbf{u} at inducing locations \mathbf{Z} \in \mathbb{R}^{m \times d}, where m \ll n and n is the number of data points. In sparse variational Gaussian processes, the inducing variables are modeled as \mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{K}_{mm}), where \mathbf{K}_{mm} is the kernel matrix over the inducing points. The approximate posterior is then obtained by maximizing an evidence lower bound (ELBO) on the marginal log-likelihood. Specifically, the variational approximation frames the log marginal likelihood as \log p(\mathbf{y}) \approx \log \int p(\mathbf{y} | \mathbf{u}) q(\mathbf{u}) \, d\mathbf{u}, where q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, \mathbf{S}) is a Gaussian variational distribution optimized jointly with the inducing locations and hyperparameters. This approach, introduced in the variational sparse GP framework, yields a tight bound and enables efficient inference via stochastic optimization. Within the inducing points paradigm, specific constructions such as the Fully Independent Training Conditional (FITC) and Variational Free Energy (VFE) provide distinct approximations to the GP posterior. FITC assumes a factorized conditional distribution over the function values at data points given the inducing variables, leading to a diagonal correction of the low-rank covariance approximation that reduces computations to O(m^2 n + m^3) time. This method, which treats the inducing points as fixed or optimized separately, offers a computationally efficient approximation but can underestimate the noise variance in some regimes. In contrast, VFE maximizes a variational lower bound on the log marginal likelihood by treating the inducing variables as variational parameters, resulting in a non-factorized approximation that better captures dependencies and provides a true lower bound on the marginal likelihood, though at slightly higher cost. Both methods stem from unifying views of sparse approximations and have been widely adopted for their balance of accuracy and scalability. Another key approximation is based on random Fourier features (RFFs), which exploit Bochner's theorem to represent stationary kernels as expectations over random projections. A shift-invariant kernel is approximated as k(\mathbf{x}, \mathbf{x}') \approx \phi(\mathbf{x})^\top \phi(\mathbf{x}'), where \phi(\mathbf{x}) is a finite-dimensional feature map built from frequencies sampled from the kernel's spectral density. This reduces GP regression to Bayesian linear regression in the feature space, with cost linear in n for D features, enabling scalability to massive datasets. RFFs are particularly effective for kernels like the squared exponential and have been shown to approximate GP predictions with error bounds decreasing as O(1/\sqrt{D}). For non-Gaussian likelihoods, such as in classification or count regression, the Laplace approximation replaces the intractable posterior over the latent function values with a Gaussian centered at the mode of the log-posterior.
This involves solving for the mode and the Hessian of the negative log-posterior at that mode, yielding an approximate posterior covariance as the inverse Hessian; the result can be combined with sparse inducing-point methods for scalability. The approach is computationally efficient for moderate-sized problems and forms a baseline for more advanced variational methods in non-conjugate settings.
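
The random Fourier feature construction for the RBF kernel is short enough to sketch directly: frequencies are drawn from the kernel's Gaussian spectral density and the resulting cosine features approximate the kernel through inner products. The feature count, length scale, and data below are illustrative.

```python
import numpy as np

def rff_features(X, n_features=500, lengthscale=1.0, seed=0):
    """Random Fourier features approximating the RBF kernel exp(-||x-x'||^2 / (2 l^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / lengthscale, size=(d, n_features))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(12)
X = rng.normal(size=(200, 3))
Phi = rff_features(X, n_features=2000)

K_approx = Phi @ Phi.T
sqd = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.5 * sqd)

print(np.max(np.abs(K_approx - K_exact)))   # error shrinks roughly as O(1/sqrt(D))
```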

Scalability Challenges

Gaussian processes (GPs) face significant scalability challenges primarily due to the computational demands of exact inference, which requires inverting an n × n kernel matrix at a cost of O(n³) via methods like Cholesky decomposition, alongside O(n²) storage requirements for the matrix itself. This cubic scaling severely limits the application of standard GPs to datasets with more than a few thousand data points, as training times grow prohibitively large even on modern hardware. For instance, exact inference on datasets with tens of thousands of observations typically takes minutes to tens of minutes on standard workstations, becoming increasingly impractical for the larger datasets encountered in fields like spatial statistics and machine learning. Non-stationary kernels, designed to capture varying smoothness or correlations across the input domain, exacerbate these issues by increasing the per-evaluation cost of the kernel function, often from O(1) for simple kernels like the squared exponential to O(d) or higher in input dimension d due to more intricate parameterizations. This added expense propagates through the repeated kernel computations needed for covariance matrix construction, further straining resources in non-i.i.d. or heterogeneous data settings common in real-world applications such as environmental modeling. In high-dimensional input spaces, GPs encounter the curse of dimensionality, where the volume of the space explodes, leading to sparse data coverage and challenges in kernel specification that result in ill-conditioned or near-singular kernel matrices. Effective lengthscales become difficult to estimate reliably, as irrelevant dimensions dilute the signal, often requiring dimensionality reduction or specialized kernels to maintain predictive performance without exponential growth in computational overhead. For non-Gaussian likelihoods, such as those arising in classification or count data, the posterior distribution over latent functions loses conjugacy, necessitating Markov chain Monte Carlo (MCMC) methods or other approximate inference schemes, which introduce additional challenges like slow mixing and high variance due to the infinite-dimensional nature of the prior. These sampling-based approaches can require thousands of iterations per fit, amplifying the overall computational burden beyond even the O(n³) baseline for Gaussian cases. As of 2025, research directions to address GP scalability emphasize structured approximations that exploit low-rank or Kronecker factorizations to approximate full kernel matrices at reduced cost, alongside distributed computing paradigms that parallelize matrix operations across clusters for massive datasets. These advancements aim to enable GPs on scales exceeding millions of points while preserving probabilistic guarantees.
