
Reproducing kernel Hilbert space

A reproducing kernel Hilbert space (RKHS) is a Hilbert space \mathcal{H} of real- or complex-valued functions defined on a nonempty set X in which evaluation at any x \in X defines a continuous linear functional, and there exists a reproducing kernel K: X \times X \to \mathbb{C} such that for every f \in \mathcal{H}, f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}}, where \langle \cdot, \cdot \rangle_{\mathcal{H}} denotes the inner product in \mathcal{H}. This reproducing property ensures that the kernel functions K(\cdot, x) serve as representers for the evaluation functionals, making the space particularly suitable for problems involving function approximation and interpolation.

The concept of RKHS originated in the early twentieth century through work on integral equations and positive definite functions, with foundational contributions from David Hilbert around 1904–1910 on integral equations leading to Hilbert spaces and from Erhard Schmidt in 1908 on integral operators. Early examples of reproducing kernels appeared in the 1907 work of Stanisław Zaremba on boundary value problems for harmonic and biharmonic functions, while James Mercer in 1909 introduced his theorem on the expansion of positive definite kernels via eigenfunctions. E. H. Moore developed related ideas on positive Hermitian matrices and reproducing properties in the 1930s, and Nachman Aronszajn formalized the general theory of RKHS in 1950, establishing its core properties.

A central result is the Moore–Aronszajn theorem, which asserts a one-to-one correspondence between symmetric positive definite kernels on X and RKHS of functions on X: for any such kernel K, there exists a unique RKHS \mathcal{H}_K whose reproducing kernel is K, and conversely, every RKHS has a unique reproducing kernel. Key properties include the positive definiteness of the kernel, ensuring that the Gram matrix (K(x_i, x_j)) is positive semi-definite for any finite set \{x_i\} \subset X, and the density of the span of \{K(\cdot, x) \mid x \in X\} in \mathcal{H}, which implies that functions in the space can be approximated by finite linear combinations of kernel functions.

RKHS have profound applications across machine learning, approximation theory, and statistics, where they provide a framework for regularization and smoothing via kernel-based penalties. In machine learning, where kernel methods were introduced by Aizerman et al. in 1964 and popularized through support vector machines by Vapnik in the 1990s, RKHS enable implicit mappings to high-dimensional feature spaces via the kernel trick, facilitating nonlinear classification, regression, and dimensionality reduction without explicit computation of the features. Common examples include Sobolev spaces with Matérn kernels for smoothing and the RKHS associated with Gaussian processes, where the kernel defines the covariance structure.

Fundamentals

Definition

A reproducing kernel Hilbert space (RKHS) is a special type of Hilbert space consisting of functions defined on a nonempty set X. To establish the context, recall that a Hilbert space \mathcal{H} is a complete inner product space, meaning it is a vector space equipped with an inner product \langle \cdot, \cdot \rangle_{\mathcal{H}} that induces a norm \|f\|_{\mathcal{H}} = \sqrt{\langle f, f \rangle_{\mathcal{H}}}, and every Cauchy sequence in \mathcal{H} converges to an element of \mathcal{H}. In the case of an RKHS, denoted \mathcal{H}, the elements are functions f: X \to \mathbb{C}, and the operations of addition and scalar multiplication are defined pointwise: (f + g)(x) = f(x) + g(x) and (\alpha f)(x) = \alpha f(x) for all x \in X, \alpha \in \mathbb{C}.

A key requirement for \mathcal{H} to qualify as an RKHS is that the point evaluation functionals are continuous. Specifically, for each x \in X, the map \mathrm{ev}_x: \mathcal{H} \to \mathbb{C} defined by \mathrm{ev}_x(f) = f(x) must be a bounded linear functional, meaning there exists a constant c_x > 0 such that |f(x)| \leq c_x \|f\|_{\mathcal{H}} for all f \in \mathcal{H}. This continuity ensures that the functions in \mathcal{H} are sufficiently regular to allow evaluation at points without leaving the space. Formally, \mathcal{H} is an RKHS if it is a Hilbert space of functions on X such that there exists a function K: X \times X \to \mathbb{C}, called the reproducing kernel, satisfying the reproducing property: f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}} for all f \in \mathcal{H} and all x \in X. This inner product representation of point evaluations is the defining characteristic of an RKHS, first systematically developed by Aronszajn. The reproducing kernel K is unique for a given RKHS \mathcal{H}, and for each fixed x \in X, the function K(\cdot, x): X \to \mathbb{C} belongs to \mathcal{H}. This membership ensures that the kernel functions themselves are elements of the space, reinforcing the structural coherence of \mathcal{H}.
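
To make the definition concrete, the following Python sketch (illustrative only; the Gaussian kernel, points, and coefficients are arbitrary choices, not drawn from the text) checks the reproducing property and the boundedness of point evaluation for a function lying in the span of finitely many kernel sections.

import numpy as np

def K(x, y, sigma=1.0):
    # Gaussian kernel K(x, y) = exp(-|x - y|^2 / (2 sigma^2)); an illustrative choice
    return np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
centers = rng.uniform(-2.0, 2.0, size=6)    # points x_i defining the span
coef = rng.normal(size=6)                   # coefficients c_i

def f(x):
    # f = sum_i c_i K(., x_i), an element of the RKHS
    return sum(c * K(x, xi) for c, xi in zip(coef, centers))

# Inner products of kernel sections are kernel values:
# <K(., x_i), K(., x_j)>_H = K(x_i, x_j), hence ||f||_H^2 = c^T G c
G = K(centers[:, None], centers[None, :])
norm_f = np.sqrt(coef @ G @ coef)

x = 0.37                                    # arbitrary evaluation point
lhs = f(x)                                  # pointwise evaluation
rhs = coef @ K(centers, x)                  # <f, K(., x)>_H
print(np.isclose(lhs, rhs))                 # True: reproducing property
print(abs(f(x)) <= norm_f * np.sqrt(K(x, x)) + 1e-12)  # bounded point evaluation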

Reproducing Property and Kernel Function

The reproducing property is the defining characteristic of a reproducing kernel Hilbert space (RKHS), enabling the pointwise evaluation of functions in the space through inner products with specific kernel sections. In an RKHS H over a set X with reproducing kernel K: X \times X \to \mathbb{C}, for every x \in X, there exists a unique element k_x \in H, called the kernel section at x, such that for all f \in H, f(x) = \langle f, k_x \rangle_H, where \langle \cdot, \cdot \rangle_H denotes the inner product in H. The kernel section is given explicitly by k_x(y) = K(x, y) for all y \in X, making K the function that "reproduces" the value of any function in the space at any point via this inner product mechanism. This property ensures that point evaluation is a continuous linear functional on H, as required for the space to be an RKHS.

The Hermitian symmetry of the kernel follows directly from the properties of the inner product in complex Hilbert spaces. To derive this, apply the reproducing property to the kernel section k_y: k_y(x) = \langle k_y, k_x \rangle_H, so K(y, x) = \langle k_y, k_x \rangle_H. By the conjugate symmetry of the inner product, \langle k_x, k_y \rangle_H = \overline{\langle k_y, k_x \rangle_H} = \overline{K(y, x)}. On the other hand, K(x, y) = k_x(y) = \langle k_x, k_y \rangle_H, yielding K(x, y) = \overline{K(y, x)}. In the real-valued case, this simplifies to K(x, y) = K(y, x). This Hermitian symmetry is essential for the kernel to generate a valid inner product structure in the space.

A direct consequence is the inner product between kernel sections: \langle k_x, k_y \rangle_H = K(x, y). This follows immediately from the reproducing property applied to k_x at y: k_x(y) = \langle k_x, k_y \rangle_H, and since k_x(y) = K(x, y), the equality holds. This equation underscores the kernel's role in computing inner products solely through its values, without explicit reference to the underlying functions. In probabilistic interpretations, the kernel K behaves analogously to a covariance function, as it defines the inner product structure much like a covariance operator does in spaces of random functions, such as Gaussian processes. Specifically, K(x, y) plays the role of the covariance between the values of a random function at the points x and y.

The kernel sections \{k_x \mid x \in X\} span a dense subspace of H. To see this, suppose g \in H is orthogonal to all k_x, so \langle g, k_x \rangle_H = 0 for all x. By the reproducing property, this implies g(x) = 0 for all x, hence g = 0 in H. Thus, the closed linear span of \{k_x\} has trivial orthogonal complement and is therefore all of H; equivalently, the RKHS is the completion of the span of the kernel sections under the inner product induced by K. This density ensures that any function in H can be approximated arbitrarily well by finite linear combinations of kernel sections.
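
The Hermitian symmetry and the positive semi-definiteness of Gram matrices can be checked numerically. The sketch below is a minimal illustration with an assumed rank-one complex kernel K(x, y) = e^{i(x - y)} built from the feature e^{ix}; the kernel, points, and coefficients are illustrative choices, not from the text.

import numpy as np

def K(x, y):
    # rank-one positive definite complex kernel K(x, y) = phi(x) * conj(phi(y)), phi(x) = exp(i x)
    return np.exp(1j * (x - y))

pts = np.array([0.0, 0.4, 1.3, 2.7])
G = K(pts[:, None], pts[None, :])          # Gram matrix G_ij = K(x_i, x_j)

print(np.allclose(G, G.conj().T))          # Hermitian symmetry: K(x, y) = conj(K(y, x))
eigvals = np.linalg.eigvalsh(G)            # real eigenvalues of a Hermitian matrix
print(np.all(eigvals > -1e-12))            # Gram matrix is positive semi-definite

# The quadratic form sum_{i,j} c_i conj(c_j) K(x_i, x_j) is real and non-negative:
c = np.array([1.0 + 0.5j, -0.3j, 0.8, -1.1 + 0.2j])
q = c @ G @ c.conj()
print(q.real >= -1e-12, abs(q.imag) < 1e-12)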

Key Theorems

Moore–Aronszajn Theorem

The Moore–Aronszajn theorem asserts that for any positive definite kernel K defined on X \times X, there exists a unique reproducing kernel Hilbert space H_K consisting of functions on X such that K serves as its reproducing kernel. A kernel K: X \times X \to \mathbb{C} is positive definite if it is Hermitian, meaning K(x, y) = \overline{K(y, x)} for all x, y \in X, and for every finite collection of points x_1, \dots, x_n \in X and complex coefficients c_1, \dots, c_n \in \mathbb{C}, \sum_{i=1}^n \sum_{j=1}^n c_i \overline{c_j} K(x_i, x_j) \geq 0, with equality holding precisely when the combination \sum_{i=1}^n c_i k_{x_i} of kernel sections k_{x_i}(\cdot) = K(\cdot, x_i) is the zero function (in particular, whenever all c_i = 0). This condition ensures that the associated Gram matrices are positive semi-definite, forming the foundation for the Hilbert space structure.

The space H_K is constructed explicitly as the completion of the pre-Hilbert space H_0, which is the linear span of the kernel sections \{k_x \mid x \in X\}, under the inner product defined for finite linear combinations f = \sum_{i=1}^n c_i k_{x_i} and g = \sum_{j=1}^m d_j k_{y_j} by \langle f, g \rangle_{H_0} = \sum_{i=1}^n \sum_{j=1}^m c_i \overline{d_j} K(x_i, y_j). This inner product induces a semi-norm on H_0, and H_K is obtained by quotienting out the null space and completing with respect to Cauchy sequences, which converge pointwise on X, ensuring that the reproducing property extends continuously to the completion.

Uniqueness of H_K is established by showing that any two Hilbert spaces sharing the same reproducing kernel K must coincide as sets, with identical inner products. Specifically, for any such space H, the kernel sections k_x satisfy \langle k_x, k_y \rangle_H = K(x, y), and since the linear span of \{k_x\} is dense in H, the kernel values determine the inner product, and hence the space, uniquely. The theorem bears the names of E. H. Moore, who first outlined the correspondence between positive definite forms and associated function spaces in his 1939 work on general analysis, and N. Aronszajn, who formalized the full theory of reproducing kernels in 1950.
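
The construction of the pre-Hilbert space H_0 can be illustrated numerically. The following sketch is illustrative only: the bilinear kernel and the three points are chosen so that the kernel sections are linearly dependent, which exhibits a coefficient vector in the null space of the H_0 semi-norm.

import numpy as np

def K(x, y):
    # bilinear kernel on R^2: K(x, y) = <x, y> (real case, so no conjugation is needed)
    return x @ y

pts = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
G = np.array([[K(xi, xj) for xj in pts] for xi in pts])   # Gram matrix

def h0_inner(c, d):
    # <sum_i c_i k_{x_i}, sum_j d_j k_{x_j}>_{H_0} = sum_{i,j} c_i d_j K(x_i, x_j)
    return c @ G @ d

c = np.array([1.0, 1.0, -1.0])   # k_{x_1} + k_{x_2} - k_{x_3} is the zero function,
                                 # since x_3 = x_1 + x_2 for a bilinear kernel
print(np.isclose(h0_inner(c, c), 0.0))     # zero semi-norm: c lies in the null space
print(np.linalg.matrix_rank(G))            # rank 2 < 3: this H_K is 2-dimensional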

Mercer's Theorem

Mercer's theorem provides a spectral representation for certain reproducing kernels, linking them to the eigenstructure of associated integral operators on L^2 spaces. Specifically, under suitable conditions, a symmetric positive definite kernel admits an expansion in terms of orthonormal eigenfunctions of a compact integral operator. This theorem, originally established by James Mercer in 1909, plays a crucial role in constructing and understanding reproducing kernel Hilbert spaces (RKHS) explicitly.

Consider a compact space X equipped with a positive Borel measure \mu of finite total mass, and let K: X \times X \to \mathbb{C} be a continuous kernel that is symmetric (K(x,y) = \overline{K(y,x)}) and positive definite (meaning \sum_{i,j} c_i \overline{c_j} K(x_i, x_j) \geq 0 for all finite sets \{x_i\} \subset X and coefficients \{c_i\} \subset \mathbb{C}). The associated integral operator T: L^2(X, \mu) \to L^2(X, \mu) is defined by (Tf)(x) = \int_X K(x, z) f(z) \, \mu(dz) for f \in L^2(X, \mu). By the continuity of K and the compactness of X, T is a compact, self-adjoint, positive operator on L^2(X, \mu), admitting a countable orthonormal family of eigenfunctions \{\phi_n\}_{n=1}^\infty \subset L^2(X, \mu) with corresponding positive eigenvalues \{\lambda_n\}_{n=1}^\infty satisfying \lambda_n \searrow 0 and \sum_n \lambda_n < \infty. The kernel then expands as K(x, y) = \sum_{n=1}^\infty \lambda_n \phi_n(x) \overline{\phi_n(y)}, where the series converges absolutely and uniformly on X \times X.

The RKHS H_K associated with K can be explicitly described using this decomposition: it consists of all functions of the form f = \sum_{n=1}^\infty a_n \sqrt{\lambda_n} \phi_n, where \{a_n\} \in \ell^2(\mathbb{N}), equipped with the inner product \langle f, g \rangle_{H_K} = \sum_{n=1}^\infty a_n \overline{b_n} for g = \sum_{n=1}^\infty b_n \sqrt{\lambda_n} \phi_n, so that \|f\|_{H_K}^2 = \sum_{n=1}^\infty |a_n|^2. This representation ensures the reproducing property f(x) = \langle f, K(\cdot, x) \rangle_{H_K} holds, with K(\cdot, x) = \sum_{n=1}^\infty \big(\sqrt{\lambda_n}\, \overline{\phi_n(x)}\big) \sqrt{\lambda_n}\, \phi_n(\cdot).

A proof outline relies on the spectral theorem for compact self-adjoint operators on Hilbert spaces. Continuity of K on the compact set X \times X implies that T is Hilbert-Schmidt (since \|T\|_{HS}^2 = \iint |K(x,y)|^2 \, \mu(dx) \mu(dy) < \infty) and hence compact, while the symmetry of K makes T self-adjoint with discrete spectrum \{\lambda_n\} and orthonormal eigenfunctions \phi_n. Positive definiteness ensures all eigenvalues are non-negative. The expansion follows from the spectral decomposition T = \sum_n \lambda_n \langle \cdot, \phi_n \rangle \phi_n, yielding K(x,y) = \langle T \delta_y, \delta_x \rangle in a distributional sense, with uniform convergence of the kernel series established via Dini's theorem applied on the diagonal.

Mercer's theorem also facilitates a continuous embedding of the RKHS H_K into L^2(X, \mu), defined by i: H_K \to L^2(X, \mu) with i(f) = f. For f = \sum_n a_n \sqrt{\lambda_n} \phi_n, the L^2 norm satisfies \|f\|_{L^2}^2 = \sum_n |a_n|^2 \lambda_n \leq \left( \sup_n \lambda_n \right) \|f\|_{H_K}^2, and since \lambda_n \to 0, the embedding is in fact compact, reflecting the smoothness of functions in H_K relative to L^2. This embedding highlights how positive definiteness of the kernel enables the operator-theoretic construction of H_K.
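
A simple quadrature (Nyström-style) discretization makes the eigen-expansion computable. The sketch below is illustrative only: it assumes a Gaussian kernel on [0, 1] with the uniform measure and a midpoint grid, and checks that a truncated Mercer reconstruction approaches the kernel as more eigenpairs are retained.

import numpy as np

def K(x, y, sigma=0.3):
    return np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))

m = 400
grid = (np.arange(m) + 0.5) / m            # midpoint quadrature nodes on [0, 1]
Kg = K(grid[:, None], grid[None, :])

# Discretized operator: (T f)(x) ~ (1/m) sum_j K(x, z_j) f(z_j)
A = Kg / m
mu, U = np.linalg.eigh(A)                  # eigenvalues mu_n and eigenvectors u_n
mu, U = mu[::-1], U[:, ::-1]               # sort in decreasing order

# Approximate L2-normalized eigenfunctions at the nodes: phi_n(z_j) ~ sqrt(m) * U[j, n]
Phi = np.sqrt(m) * U

# Truncated Mercer reconstruction: K_r(x_i, x_j) = sum_{n < r} mu_n phi_n(x_i) phi_n(x_j)
for r in (5, 10, 20):
    Kr = (Phi[:, :r] * mu[:r]) @ Phi[:, :r].T
    print(r, np.max(np.abs(Kr - Kg)))      # error shrinks as more terms are kept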

Representations and Structures

Feature Maps

In reproducing kernel Hilbert spaces, a feature map provides a geometric realization of the kernel function by embedding the input space into a Hilbert space. Specifically, given a positive definite kernel K: \mathcal{X} \times \mathcal{X} \to \mathbb{R} on a set \mathcal{X}, a feature map \Phi: \mathcal{X} \to \mathcal{H} is a mapping to a Hilbert space \mathcal{H} (possibly infinite-dimensional) such that K(x, y) = \langle \Phi(x), \Phi(y) \rangle_{\mathcal{H}} for all x, y \in \mathcal{X}. This construction interprets the kernel as an inner product in the feature space \mathcal{H}, allowing kernel methods to operate implicitly in high- or infinite-dimensional spaces without explicit computation of \Phi. The reproducing kernel Hilbert space \mathcal{H}_K associated with K is isomorphic to the closure of the linear span of \{\Phi(x) \mid x \in \mathcal{X}\} in \mathcal{H}, equipped with the inner product inherited from \mathcal{H}.

An explicit canonical construction defines \Phi(x) = k_x, where k_x(\cdot) = K(\cdot, x) is the kernel function viewed as an element of \mathcal{H}_K. This canonical feature map satisfies the reproducing property, as \langle f, k_x \rangle_{\mathcal{H}_K} = f(x) for any f \in \mathcal{H}_K, and ensures that \mathcal{H}_K is the completion of the span of such maps under the inner product induced by K. The canonical feature map endows \mathcal{X} with the pseudometric d_K(x, y) = \sqrt{K(x,x) - 2K(x,y) + K(y,y)} = \|\Phi(x) - \Phi(y)\|_{\mathcal{H}_K}, and \Phi is distance-preserving from (\mathcal{X}, d_K) onto its image in \mathcal{H}_K.

Explicit feature maps can be constructed for certain kernels, but their dimensionality depends on the kernel's form. For polynomial kernels, such as K(x, y) = (x^\top y + c)^d with c \geq 0 and integer d \geq 1, \Phi maps to a finite-dimensional space of monomials of degree at most d; for example, in one dimension with d=2, \Phi(x) = (1, \sqrt{2}x, x^2) realizes K(x, y) = (xy + 1)^2. In contrast, universal kernels like the Gaussian radial basis function K(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2)) yield infinite-dimensional feature maps with no closed-form finite expression, as the image spans an infinite-dimensional subspace described by Mercer's expansion, though finite-dimensional approximations are possible. This distinction highlights the practicality of implicit computations via the kernel trick for infinite-dimensional cases.
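
The polynomial example above can be verified directly. The following sketch (illustrative; the choice of two input dimensions and c = 1 is arbitrary) builds the explicit six-dimensional feature map for the inhomogeneous quadratic kernel and checks that inner products of features reproduce kernel values; no such finite map exists for the Gaussian kernel.

import numpy as np

def phi(x):
    # monomials of degree <= 2 in two variables, with square-root-of-multinomial weights
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, y):
    # inhomogeneous quadratic kernel (x.y + 1)^2
    return (np.dot(x, y) + 1.0) ** 2

rng = np.random.default_rng(1)
x, y = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(np.dot(phi(x), phi(y)), K(x, y)))   # feature-space inner product equals kernel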

Integral Operators

In the context of a reproducing kernel Hilbert space (RKHS) associated with a positive definite kernel K: X \times X \to \mathbb{R} on a measure space (X, \mu), the integral operator T_K is defined on L^2(X, \mu) by (T_K f)(x) = \int_X K(x, y) f(y) \, d\mu(y) for all f \in L^2(X, \mu) and x \in X. Assuming X is compact and K is continuous and symmetric, T_K maps L^2(X, \mu) into the continuous functions on X and is a compact operator. Moreover, T_K is self-adjoint because of the symmetry of K, and positive semi-definite due to the positive definiteness of K, admitting a sequence of eigenvalues \lambda_n \geq 0 with \lambda_1 \geq \lambda_2 \geq \cdots \to 0.

The RKHS H_K can be realized as the range of the square root operator T_K^{1/2}, specifically H_K = \{ T_K^{1/2} g \mid g \in L^2(X, \mu) \}, where, for g_1, g_2 orthogonal to the null space of T_K, the inner product on H_K is given by \langle T_K^{1/2} g_1, T_K^{1/2} g_2 \rangle_{H_K} = \langle g_1, g_2 \rangle_{L^2(X, \mu)}. This construction identifies H_K isometrically with the orthogonal complement of the null space of T_K in L^2(X, \mu), with the reproducing property arising from the action of T_K. The eigenvalues \lambda_n from the spectral decomposition of T_K (as per Mercer's theorem) determine the structure of H_K, with the rescaled eigenfunctions \sqrt{\lambda_n}\,\phi_n serving as an orthonormal basis of H_K.

The operator T_K is bounded, with operator norm \|T_K\| = \sup_n \lambda_n = \lambda_1; its trace satisfies \sum_n \lambda_n = \int_X K(x, x) \, d\mu(x), and the kernel sections obey \|K(\cdot, x)\|_{H_K}^2 = K(x, x) for each x \in X. These quantities measure the kernel's capacity and ensure the well-posedness of T_K on L^2(X, \mu). In regularization theory for inverse problems, the pseudo-inverse T_K^{-1/2} (defined on the range of T_K^{1/2}) plays a key role in constructing solutions to interpolation tasks within the RKHS, such as minimizing the RKHS norm subject to data-fitting constraints. This operator facilitates stable approximations by leveraging the spectral regularization inherent to T_K.
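
A discretized version of this construction can be explored numerically. The sketch below is illustrative only (Gaussian kernel, uniform measure on [0, 1], midpoint quadrature): it forms a matrix approximation of T_K, applies its square root to an L^2 function, and checks that the resulting RKHS norm agrees with the L^2 norm of the (projected) preimage, as the isometry above predicts.

import numpy as np

def K(x, y, sigma=0.3):
    return np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2))

m = 300
grid = (np.arange(m) + 0.5) / m
A = K(grid[:, None], grid[None, :]) / m        # discretized integral operator T_K
mu, U = np.linalg.eigh(A)
keep = mu > 1e-10                              # discard numerically null directions
lam, Phi = mu[keep], np.sqrt(m) * U[:, keep]   # eigenvalues and L2-normalized eigenfunctions

g = np.sin(2 * np.pi * grid)                   # an L2 function, sampled on the grid
b = Phi.T @ g / m                              # b_n = <g, phi_n>_{L2}
f = Phi @ (np.sqrt(lam) * b)                   # grid values of f = T_K^{1/2} g

a = Phi.T @ f / m                              # a_n = <f, phi_n>_{L2}
rkhs_norm_sq = np.sum(a ** 2 / lam)            # ||f||_{H_K}^2 = sum_n a_n^2 / lambda_n
l2_norm_sq = np.sum((Phi @ b) ** 2) / m        # ||P g||_{L2}^2, projection onto range of T_K
print(np.isclose(rkhs_norm_sq, l2_norm_sq))    # the isometry T_K^{1/2}: L2 -> H_K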

Properties

Basic Properties

A reproducing kernel K: \mathcal{X} \times \mathcal{X} \to \mathbb{R} on a set \mathcal{X} is positive definite if, for any finite set of distinct points x_1, \dots, x_n \in \mathcal{X} and coefficients c_1, \dots, c_n \in \mathbb{R}, the inequality \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) \geq 0 holds, with equality only for c_1 = \dots = c_n = 0 when the kernel is strictly positive definite. This property ensures that the Gram matrix G_{ij} = K(x_i, x_j) is positive semi-definite, which is equivalent to the existence of an associated reproducing kernel Hilbert space (RKHS) \mathcal{H}_K. Positive definiteness guarantees a valid inner product structure in the feature space induced by the kernel, supporting applications in optimization and covariance representations.

Certain kernels, known as universal kernels, possess the property that the RKHS \mathcal{H}_K is dense in the space C(\mathcal{X}) of continuous functions on a compact metric space \mathcal{X}, equipped with the supremum norm. This density implies that functions in \mathcal{H}_K can approximate any continuous function arbitrarily well, making universal kernels powerful for universal approximation tasks. For example, the Gaussian kernel K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2) is universal on compact subsets of \mathbb{R}^d. If the kernel K is continuous on \mathcal{X} \times \mathcal{X}, then every function f \in \mathcal{H}_K is continuous on \mathcal{X}. This follows from the reproducing property, where f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}_K}, and the continuity of the map x \mapsto K(\cdot, x) in the RKHS norm ensures pointwise continuity of f.

The RKHS \mathcal{H}_K is minimal in the sense that every one of its elements is a limit of finite linear combinations of the kernel sections K(\cdot, x): any Hilbert space of functions on \mathcal{X} that admits K as a reproducing kernel must contain these sections and their limits, and by the Moore–Aronszajn theorem it in fact coincides with \mathcal{H}_K. This minimality arises from constructing \mathcal{H}_K as the completion of the span of \{K(\cdot, x) \mid x \in \mathcal{X}\} under the inner product defined by the kernel. For a bounded kernel K with \sup_{x \in \mathcal{X}} K(x, x) < \infty, the RKHS norm satisfies \|f\|_{\mathcal{H}_K}^2 \geq \sup_{x \in \mathcal{X}} \frac{|f(x)|^2}{K(x, x)} for all f \in \mathcal{H}_K. This inequality provides a lower bound on the smoothness or complexity of functions in \mathcal{H}_K relative to their pointwise values, linking the abstract norm to observable evaluations.

Evaluation and Norms

In a reproducing kernel Hilbert space H with kernel K, the evaluation functional \mathrm{ev}_x: H \to \mathbb{R} defined by \mathrm{ev}_x(f) = f(x) is a bounded linear functional for each x in the domain, with operator norm \|\mathrm{ev}_x\| = \sqrt{K(x,x)}. This follows from the Riesz representation theorem, where \mathrm{ev}_x corresponds to the kernel function k_x(\cdot) = K(\cdot, x), and \|k_x\|_H^2 = \langle k_x, k_x \rangle_H = K(x,x). Consequently, by the Cauchy-Schwarz inequality, pointwise function values satisfy |f(x)| \leq \|f\|_H \sqrt{K(x,x)} for all f \in H. The quantity \sqrt{K(x,x)} thus provides a pointwise bound on function values relative to the RKHS norm and plays a key role in uncertainty quantification. In the Gaussian process perspective, where the kernel K serves as the prior covariance function, \sqrt{K(x,x)} equals the prior standard deviation at x, since \mathrm{Var}(f(x)) = K(x,x).

For interpolation problems, the function f^* \in H that minimizes \|f\|_H subject to the constraints f(x_i) = y_i for distinct points x_1, \dots, x_n and observations y \in \mathbb{R}^n takes the form f^*(\cdot) = \sum_{i=1}^n \alpha_i k_{x_i}(\cdot), where the coefficients satisfy \alpha = \mathbf{K}^{-1} y and \mathbf{K} is the n \times n Gram matrix with entries \mathbf{K}_{ij} = K(x_i, x_j). With regularization to address ill-posedness or noise, the minimizer of \sum_{i=1}^n (f(x_i) - y_i)^2 + \lambda \|f\|_H^2 yields \alpha = (\mathbf{K} + \lambda I)^{-1} y for \lambda > 0, reducing the infinite-dimensional optimization to a finite-dimensional linear system. A lower bound on the RKHS norm in terms of point evaluations at distinct points x_1, \dots, x_n is given by \|f\|_H^2 \geq \mathbf{y}^T \mathbf{K}^{-1} \mathbf{y}, where \mathbf{y}_i = f(x_i) and \mathbf{K}_{ij} = K(x_i, x_j). This bound is the squared norm of the minimum-norm interpolant satisfying the point constraints and arises from the projection of f onto the span of \{K(\cdot, x_i)\}, providing a quantitative measure of how function values constrain the overall smoothness.

The RKHS norm also governs higher-order regularity, such as control over derivatives, through Sobolev embeddings when the RKHS embeds into smoother spaces. For instance, if the kernel induces a Sobolev space of order s > d/2 (where d is the domain dimension), the embedding H \hookrightarrow C^j for j < s - d/2 ensures that \|f\|_{C^j} \lesssim \|f\|_H, bounding derivatives up to order j. This property links the RKHS norm to fractional Sobolev norms via interpolation theory, enabling convergence rates for derivative estimation in learning settings.
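
The finite-dimensional reduction described above is straightforward to implement. The following sketch uses illustrative data and a Gaussian kernel (both arbitrary choices): it computes the minimum-norm interpolant, its squared RKHS norm y^T K^{-1} y, and the regularized (kernel ridge) solution.

import numpy as np

def K(x, y, sigma=0.5):
    # Gaussian kernel evaluated on all pairs of the two input arrays
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / (2.0 * sigma ** 2))

x_train = np.linspace(-1.0, 1.0, 8)
y_train = np.sin(3 * x_train)

G = K(x_train, x_train)                       # Gram matrix G_ij = K(x_i, x_j)

# Minimum-norm interpolant: alpha = G^{-1} y, f*(x) = sum_i alpha_i K(x, x_i)
alpha = np.linalg.solve(G, y_train)
print(np.allclose(G @ alpha, y_train, atol=1e-6))   # interpolates the data
print(y_train @ np.linalg.solve(G, y_train))        # ||f*||_H^2 = y^T G^{-1} y

# Regularized (kernel ridge) solution: alpha = (G + lambda I)^{-1} y
lam = 1e-2
alpha_ridge = np.linalg.solve(G + lam * np.eye(len(x_train)), y_train)

x_test = np.linspace(-1.0, 1.0, 5)
f_test = K(x_test, x_train) @ alpha_ridge     # evaluate the regularized fit
print(f_test)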

Common Examples

Bilinear and Polynomial Kernels

The bilinear kernel is defined as K(\mathbf{x}, \mathbf{y}) = \langle \mathbf{x}, \mathbf{y} \rangle for vectors \mathbf{x}, \mathbf{y} \in \mathbb{R}^d, where \langle \cdot, \cdot \rangle denotes the standard Euclidean inner product. This kernel is positive semi-definite, and its associated reproducing kernel Hilbert space (RKHS) is simply \mathbb{R}^d equipped with the standard inner product, where functions in the RKHS are linear evaluations on the input space. The reproducing property holds directly via the inner product: for any f \in \mathbb{R}^d, f(\mathbf{x}) = \langle f, \mathbf{x} \rangle.

Homogeneous polynomial kernels extend this to higher degrees, defined as K(\mathbf{x}, \mathbf{y}) = \langle \mathbf{x}, \mathbf{y} \rangle^p for integer degree p \geq 1 and \mathbf{x}, \mathbf{y} \in \mathbb{R}^d. These kernels are positive definite and correspond to an explicit feature map \phi: \mathbb{R}^d \to \mathcal{H} that sends inputs to all monomials of exact degree p, each weighted by the square root of its multinomial coefficient (for example \phi(\mathbf{x}) = (x_1^p, x_2^p, \dots, x_d^p, \sqrt{p}\, x_1^{p-1} x_2, \dots)) so that \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle = K(\mathbf{x}, \mathbf{y}). The dimension of this feature space, and thus of the RKHS, is finite and given by the number of monomials of degree p in d variables: \binom{d + p - 1}{p}. Inhomogeneous polynomial kernels generalize further with K(\mathbf{x}, \mathbf{y}) = (\langle \mathbf{x}, \mathbf{y} \rangle + c)^p for constant c > 0, incorporating interactions across degrees up to p. The feature map now includes all monomials from degree 0 to p, such as constants, linear terms, and higher-order products, yielding a finite-dimensional RKHS of dimension \sum_{k=0}^p \binom{d + k - 1}{k}. This structure allows the kernel to capture both linear and nonlinear dependencies without explicit computation in high dimensions.

The explicit RKHS for these polynomial kernels consists of all polynomials in d variables of degree at most p (or exactly p for the homogeneous case), with the inner product defined via the feature map to reproduce the kernel: writing f(\mathbf{x}) = \sum_{\alpha} a_{\alpha} \mathbf{x}^{\alpha} and g(\mathbf{x}) = \sum_{\alpha} b_{\alpha} \mathbf{x}^{\alpha} in the monomial basis \{ \mathbf{x}^{\alpha} \} (where |\alpha| \leq p), the inner product is \langle f, g \rangle_{\mathcal{H}} = \sum_{\alpha} \frac{a_{\alpha} b_{\alpha}}{w_{\alpha}}, where w_{\alpha} is the weight attached to the monomial \mathbf{x}^{\alpha} in the feature map (a multinomial coefficient, scaled by a power of c in the inhomogeneous case), so that the weighted monomials form an orthogonal basis of \mathcal{H}. This finite-dimensional setup ensures that evaluation and norms are computationally tractable, as \langle f, K(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = f(\mathbf{x}) holds for all f \in \mathcal{H}, directly from the reproducing property.
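
The dimension formulas can be tabulated directly, as in the short sketch below (the values of d and p are illustrative).

from math import comb

def dim_homogeneous(d, p):
    # number of monomials of exact degree p in d variables
    return comb(d + p - 1, p)

def dim_inhomogeneous(d, p):
    # number of monomials of degree 0 through p in d variables
    return sum(comb(d + k - 1, k) for k in range(p + 1))

for d, p in [(2, 2), (3, 3), (10, 2)]:
    print(d, p, dim_homogeneous(d, p), dim_inhomogeneous(d, p))
# e.g. d=2, p=2: 3 exact-degree-2 monomials, and 6 monomials of degree <= 2,
# matching the six-dimensional feature map for (x.y + 1)^2 shown earlier.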

Radial Basis Function Kernels

Radial basis function (RBF) kernels are a class of positive definite kernels that are translation-invariant, meaning they depend solely on the Euclidean distance r = \|x - y\| between inputs x, y \in \mathbb{R}^d. These kernels generate reproducing kernel Hilbert spaces (RKHSs) particularly suited for approximation tasks in machine learning and statistics, as their associated function spaces emphasize smoothness controlled by the kernel's decay properties.

The Gaussian RBF kernel is defined as K(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right), where \sigma > 0 is a length-scale parameter. The corresponding RKHS consists of infinitely differentiable functions that decay at infinity faster than any exponential, ensuring strong regularity. This kernel is universal, meaning its RKHS is dense in the space of continuous functions C(X) on any compact X \subset \mathbb{R}^d, enabling uniform approximation of arbitrary continuous functions.

The Matérn kernel provides finer control over function smoothness and is given by K(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \sqrt{2\nu} \frac{r}{\ell} \right)^\nu K_\nu \left( \sqrt{2\nu} \frac{r}{\ell} \right), where \nu > 0 is a smoothness parameter, \ell > 0 is the length scale, \Gamma is the gamma function, and K_\nu is the modified Bessel function of the second kind. Functions in the associated RKHS are mean-square differentiable up to order \lfloor \nu \rfloor, with the case \nu = 1/2 recovering the exponential kernel and \nu \to \infty approaching the Gaussian kernel. This tunability makes it widely used in Gaussian process regression for modeling data with varying regularity. The Laplace kernel, K(x, y) = \exp\left( -\frac{\|x - y\|}{\sigma} \right), yields an RKHS of functions that are continuous but not mean-square differentiable (corresponding to the Matérn kernel with \nu = 1/2), offering less smoothness than the Gaussian. Like the Gaussian, it is universal on compact domains, supporting dense approximations in C(X).

For translation-invariant RBF kernels on \mathbb{R}^d, the RKHS norm of a function f can be expressed via its Fourier transform \hat{f} as \|f\|_{\mathcal{H}}^2 = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \frac{|\hat{f}(\omega)|^2}{\hat{K}(\omega)} \, d\omega, where \hat{K} is the Fourier transform of the kernel K, which serves as a spectral density. This formulation weights higher frequencies inversely to \hat{K}, penalizing rapid oscillations, and aligns the norm with Sobolev norms for Matérn kernels and with far stronger smoothness penalties for the Gaussian. The universality of these RBF kernels extends to density in L^2(\mathbb{R}^d) under suitable conditions, such as integrability of \hat{K}, allowing RBF-based methods to approximate square-integrable functions arbitrarily well.
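
The Matérn formula can be evaluated with standard special-function routines. The sketch below (parameters are illustrative) uses SciPy's modified Bessel function of the second kind and checks two standard limits: ν = 1/2 recovers the exponential (Laplace) kernel exactly, and a large ν comes close to the Gaussian.

import numpy as np
from scipy.special import gamma, kv

def matern(r, nu, ell):
    # Matern kernel K(r) = 2^{1-nu}/Gamma(nu) * (sqrt(2 nu) r / ell)^nu * K_nu(sqrt(2 nu) r / ell)
    r = np.asarray(r, dtype=float)
    scaled = np.sqrt(2.0 * nu) * r / ell
    out = (2.0 ** (1.0 - nu) / gamma(nu)) * scaled ** nu * kv(nu, scaled)
    return np.where(r == 0.0, 1.0, out)       # K(0) = 1 by continuity

r = np.linspace(1e-3, 2.0, 200)
ell = 0.7

print(np.allclose(matern(r, 0.5, ell), np.exp(-r / ell), atol=1e-10))        # exponential kernel
print(np.max(np.abs(matern(r, 50.0, ell) - np.exp(-r ** 2 / (2 * ell ** 2)))))  # approaches the Gaussian as nu grows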

Bergman Kernels

In complex analysis, the Bergman kernel serves as the reproducing kernel for the Bergman space, a canonical reproducing kernel Hilbert space consisting of square-integrable holomorphic functions on a domain in \mathbb{C}^n. For a bounded domain \Omega \subset \mathbb{C}^n equipped with the Lebesgue volume measure dV, the Bergman space A^2(\Omega) is defined as the closed subspace of L^2(\Omega, dV) comprising all holomorphic functions f: \Omega \to \mathbb{C} satisfying \|f\|^2 = \int_\Omega |f(z)|^2 \, dV(z) < \infty. The associated inner product is the standard L^2 pairing \langle f, g \rangle = \int_\Omega f(z) \overline{g(z)} \, dV(z). Point evaluation at any z \in \Omega is a bounded linear functional on A^2(\Omega) due to the subharmonic nature of |f|^2 for holomorphic f, ensuring the space is an RKHS with reproducing kernel K^\Omega(z, w).

Explicitly, if \{\phi_k\}_{k=1}^\infty is any orthonormal basis for A^2(\Omega) consisting of holomorphic functions, the Bergman kernel admits the series expansion K^\Omega(z, w) = \sum_{k=1}^\infty \phi_k(z) \overline{\phi_k(w)}, which converges absolutely and uniformly on compact subsets of \Omega \times \Omega. This kernel is holomorphic in the first argument and anti-holomorphic in the second, and it satisfies the reproducing property f(z) = \langle f, K^\Omega(\cdot, z) \rangle for all f \in A^2(\Omega) and z \in \Omega. A key geometric property is that the diagonal K^\Omega(z, z) quantifies the norm of the evaluation functional: K^\Omega(z, z) = \sup \{ |f(z)|^2 / \|f\|^2 : f \in A^2(\Omega), f \not\equiv 0 \}; equivalently, 1/K^\Omega(z, z) = \min \{ \|f\|^2 : f \in A^2(\Omega),\ f(z) = 1 \}, so the diagonal captures the extremal growth of functions normalized at z. This extremal characterization reflects the space's capacity to approximate delta-like behavior at points while respecting holomorphy and integrability.

Under biholomorphic transformations, the Bergman kernel transforms in a manner that preserves its reproducing character while accounting for the change in volume measure. Specifically, for a biholomorphism \phi: \Omega \to \Omega' between domains, the kernels satisfy K^{\Omega'}(\phi(z), \phi(w)) = \frac{K^\Omega(z, w)}{J_\phi(z) \overline{J_\phi(w)}}, where J_\phi denotes the complex Jacobian determinant \det D\phi. In one complex variable (n=1), this simplifies to K^{\Omega'}(\phi(z), \phi(w)) = K^\Omega(z, w) / (\phi'(z) \overline{\phi'(w)}), highlighting the kernel's role as a complete biholomorphic invariant up to these factors. This transformation law arises from the pullback of the L^2 inner product under \phi, where the volume scales by |\det D\phi|^2, ensuring the reproducing property holds in the transformed space.

A canonical example occurs for the unit disk \mathbb{D} = \{ z \in \mathbb{C} : |z| < 1 \}, where an orthonormal basis is given by \phi_k(z) = \sqrt{(k+1)/\pi} \, z^k for k = 0, 1, 2, \dots. The resulting Bergman kernel is K^\mathbb{D}(z, w) = \frac{1}{\pi (1 - z \overline{w})^2}, which can be derived by summing the series or via the explicit Bergman projection onto holomorphic functions. This formula underscores the kernel's singularity as z \overline{w} \to 1 at the boundary, reflecting the space's boundary behavior, and it plays a central role in studying automorphisms of \mathbb{D}, such as Möbius transformations.
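
The unit-disk formula can be checked numerically by summing the orthonormal-basis series, as in the brief sketch below (the points inside the disk are illustrative choices).

import numpy as np

def bergman_disk(z, w):
    # closed form of the unit-disk Bergman kernel
    return 1.0 / (np.pi * (1.0 - z * np.conj(w)) ** 2)

def bergman_series(z, w, n_terms=2000):
    # orthonormal basis phi_k(z) = sqrt((k + 1) / pi) * z^k, so
    # K(z, w) = sum_k phi_k(z) * conj(phi_k(w)) = (1/pi) * sum_k (k + 1) (z conj(w))^k
    k = np.arange(n_terms)
    return np.sum((k + 1) * (z * np.conj(w)) ** k) / np.pi

z, w = 0.3 + 0.4j, -0.2 + 0.5j            # points inside the unit disk
print(np.isclose(bergman_series(z, w), bergman_disk(z, w)))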

Extensions

Vector-Valued Functions

In the context of reproducing kernel Hilbert spaces (RKHS), the framework can be extended to functions taking values in a Hilbert space Y, rather than in the scalars. Let H be a Hilbert space of functions f: X → Y, where X is the input domain. The space H is an RKHS if, for every x ∈ X and y ∈ Y, the evaluation map f ↦ ⟨f(x), y⟩_Y is a continuous linear functional on H. The reproducing kernel for such an H is an operator-valued kernel K: X × X → L(Y), where L(Y) denotes the space of bounded linear operators from Y to Y. This kernel satisfies the reproducing property: for all f ∈ H, x ∈ X, and y ∈ Y, \langle f(x), y \rangle_Y = \langle f, K(\cdot, x) y \rangle_H, where the inner product on the right is in H.

A kernel K is positive definite if, for every finite n ∈ ℕ, points x_1, \dots, x_n ∈ X, and elements c_1, \dots, c_n ∈ Y, \sum_{i,j=1}^n \langle c_i, K(x_i, x_j) c_j \rangle_Y \geq 0. This condition ensures the existence of an associated RKHS. Moreover, by a generalization of the Moore-Aronszajn theorem to the operator-valued setting, every positive definite operator-valued kernel K determines a unique RKHS H_K (up to isometry) consisting of Y-valued functions on X, with K as its reproducing kernel.

Examples of such kernels include matrix-valued kernels when Y = ℝ^d is finite-dimensional, which arise in multi-output regression tasks. A simple case is the separable kernel K(x, y) = k(x, y) I_d, where k is a positive definite scalar kernel on X and I_d is the d × d identity matrix; this corresponds to solving an independent scalar-kernel problem for each output component. Vector-valued RKHS find applications in spaces like vector-valued Sobolev spaces, which consist of functions f: Ω → Y with finite Sobolev norm and admit an operator-valued reproducing kernel, enabling kernel-based methods for problems involving vector outputs such as in geostatistics or image processing.
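
A minimal numerical sketch of the separable construction follows (illustrative data; the output-coupling matrix B and the scalar Gaussian kernel are arbitrary choices, with B = I_d recovering independent scalar problems).

import numpy as np

def k(x, y, sigma=0.5):
    # scalar Gaussian kernel on all pairs of inputs
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / (2.0 * sigma ** 2))

x_train = np.linspace(-1.0, 1.0, 6)
Y = np.column_stack([np.sin(2 * x_train), np.cos(2 * x_train)])   # n x d outputs
d = Y.shape[1]

B = np.array([[1.0, 0.3], [0.3, 1.0]])        # positive definite output coupling
G = np.kron(k(x_train, x_train), B)           # block Gram matrix for K(x, y) = k(x, y) B

alpha = np.linalg.solve(G, Y.reshape(-1))     # stacked coefficients, one d-block per sample

def predict(x_new):
    Kx = np.kron(k(x_new, x_train), B)        # cross-kernel blocks between new and training points
    return (Kx @ alpha).reshape(len(x_new), d)

print(np.allclose(predict(x_train), Y, atol=1e-6))   # reproduces the training outputs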

Connections to ReLU and Neural Networks

In the context of deep learning, reproducing kernel Hilbert spaces (RKHS) provide a theoretical framework for understanding the behavior of overparameterized neural networks, particularly those using ReLU activations, in the limit of infinite width. As the width of a neural network increases indefinitely, the function space induced by the network's random initialization converges to an RKHS governed by a specific kernel derived from the activation function. For ReLU networks, this kernel corresponds to the arc-cosine kernel of degree one, which captures the homogeneity and angular dependence of the ReLU operation. Specifically, the arc-cosine kernel for inputs x, y \in \mathbb{R}^d is given by K(x, y) = \frac{\|x\| \|y\|}{\pi} \left( \sqrt{1 - \rho^2} + \rho (\pi - \arccos(\rho)) \right), where \rho = \frac{\langle x, y \rangle}{\|x\| \|y\|} encodes the angle between x and y. This kernel arises from the expected inner product of ReLU-activated random features and ensures that the network's prior distribution over functions aligns with a Gaussian process in the infinite-width limit.

A key insight is that wide ReLU networks, when trained via gradient descent, exhibit dynamics equivalent to kernel regression in the RKHS defined by the neural tangent kernel (NTK). The NTK, which parameterizes the evolution of the network's output during training, for a two-layer ReLU network takes the form \Theta(x, y) = \langle x, y \rangle \, \mathbb{E}\left[\sigma'(w^\top x)\, \sigma'(w^\top y)\right] + \mathbb{E}\left[\sigma(w^\top x)\, \sigma(w^\top y)\right], where \sigma(z) = \max(0, z) is the ReLU function, \sigma'(z) is its subgradient (equal to 1 for z > 0 and 0 otherwise), and the expectations are taken over a Gaussian weight vector w. In this regime, gradient descent on the overparameterized network converges globally, and the learned function coincides with a kernel regression solution in the NTK's RKHS, bridging classical kernel methods with modern architectures. This equivalence holds under suitable initialization and learning rate schedules, explaining the strong generalization observed in wide networks despite their massive parameter count.

Post-2018 developments have further elucidated these connections, emphasizing links to Gaussian processes and the benefits of overparameterization. In the infinite-width limit, Bayesian ReLU networks induce posteriors with recursive arc-cosine kernels for multi-layer architectures, enabling exact inference via kernel methods while preserving the network's hierarchical structure. Additionally, analyses of the NTK's spectral properties reveal its inductive biases, such as a preference for smooth functions in the RKHS, which align with the frequency biases observed in ReLU network training and contribute to their sample efficiency in high-dimensional settings. These insights have informed practical approximations, such as random feature expansions of the NTK, to scale kernel methods to large datasets while retaining neural-network-like inductive biases. More recent work as of 2025 has extended these ideas to deep architectures beyond infinite width. For instance, deep neural networks can be viewed as compositions forming reproducing kernel chains or hierarchies of RKHS, where each layer corresponds to a kernel operation, including ReLU activations as special cases. These frameworks provide sparse kernel representations and better characterize the function spaces of finite-width deep networks, enhancing understanding of their generalization and efficiency.
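
The arc-cosine formula can be checked against a Monte Carlo estimate over random ReLU features. The sketch below is illustrative; with the normalization assumed here, the closed form above equals twice the expectation E_w[relu(w·x) relu(w·y)] for w ~ N(0, I), and conventions for this constant differ across references.

import numpy as np

def arccos_kernel_deg1(x, y):
    # degree-one arc-cosine kernel: (||x|| ||y|| / pi) * (sqrt(1 - rho^2) + rho (pi - arccos rho))
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    rho = np.clip(x @ y / (nx * ny), -1.0, 1.0)
    theta = np.arccos(rho)
    return (nx * ny / np.pi) * (np.sqrt(1.0 - rho ** 2) + rho * (np.pi - theta))

rng = np.random.default_rng(0)
d = 5
x, y = rng.normal(size=d), rng.normal(size=d)

W = rng.normal(size=(200_000, d))             # random first-layer weights w ~ N(0, I)
relu = lambda t: np.maximum(t, 0.0)
mc = 2.0 * np.mean(relu(W @ x) * relu(W @ y)) # factor 2 from the normalization noted above

print(arccos_kernel_deg1(x, y), mc)           # close up to Monte Carlo error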
