
Mercer's theorem

Mercer's theorem is a fundamental result in functional analysis asserting that a continuous, symmetric, and positive semi-definite kernel K(t, s) defined on a compact square [0, T] \times [0, T] admits a spectral decomposition K(t, s) = \sum_{m=1}^\infty \lambda_m \phi_m(t) \phi_m(s), where \{\phi_m\} forms an orthonormal basis of L^2([0, T]) and \{\lambda_m\} is a sequence of non-negative eigenvalues of the associated integral operator, with the series converging absolutely and uniformly on the domain. Named after British mathematician James Mercer, who introduced it in 1909 while studying integral equations, the theorem builds on earlier work by David Hilbert on Fredholm operators and provides an explicit eigenfunction expansion for Hilbert-Schmidt kernels, linking them to the theory of positive definite functions. It requires the kernel to satisfy \int_0^T \int_0^T K(t, s) f(t) f(s) \, dt \, ds \geq 0 for all square-integrable functions f, ensuring the associated operator is compact, self-adjoint, and positive on the Hilbert space L^2([0, T]). The theorem's significance extends to reproducing kernel Hilbert spaces (RKHS), where it characterizes positive definite kernels as inner products in a feature space, enabling the construction of function spaces with reproducing properties: for a kernel k, the RKHS contains functions of the form f(x) = \sum_i c_i k(x_i, x) with norm \|f\|^2 = \sum_{i,j} c_i c_j k(x_i, x_j). In modern applications, particularly in machine learning, Mercer's theorem underpins kernel methods such as support vector machines, Gaussian processes, and kernel principal component analysis by justifying the use of infinite-dimensional feature maps without explicit computation, as the kernel trick allows all evaluations to proceed solely through k(x, y). Generalizations have broadened its scope to non-compact domains, matrix-valued kernels, and discontinuous kernels, while maintaining the core eigenexpansion structure.

Overview

Statement

Mercer's theorem concerns the spectral expansion of a continuous, symmetric, positive semi-definite kernel defined on a compact interval. Specifically, consider a kernel K: [a, b] \times [a, b] \to \mathbb{R} that is continuous and symmetric, meaning K(x, y) = K(y, x) for all x, y \in [a, b], and positive semi-definite in the sense that \int_a^b \int_a^b K(x, y) f(x) f(y) \, dx \, dy \geq 0 for every f \in L^2[a, b]. The theorem asserts that such a kernel admits a series expansion of the form K(x, y) = \sum_{n=1}^\infty \lambda_n \phi_n(x) \phi_n(y), where \{\lambda_n\}_{n=1}^\infty is a sequence of positive eigenvalues arranged in decreasing order (\lambda_1 \geq \lambda_2 \geq \cdots > 0), and \{\phi_n\}_{n=1}^\infty is a corresponding sequence of orthonormal eigenfunctions in L^2[a, b] satisfying the eigenvalue equation \int_a^b K(x, y) \phi_n(y) \, dy = \lambda_n \phi_n(x) for each n. When the kernel is strictly positive definite, these eigenfunctions form a complete orthonormal basis for L^2[a, b]. Under the given conditions of continuity and positive semi-definiteness, the series converges absolutely and uniformly on [a, b] \times [a, b]: absolute convergence holds for each fixed (x, y), and the uniform convergence ensures that the partial sums approximate K arbitrarily closely in the supremum norm. This expansion provides a constructive link to feature maps in Hilbert spaces, where the kernel corresponds to an inner product \langle \phi(x), \phi(y) \rangle = K(x, y) via \phi(x) = (\sqrt{\lambda_n} \phi_n(x))_{n=1}^\infty.
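
As an illustrative sketch (not part of the original statement), the expansion can be checked numerically for a standard example: the Brownian-motion kernel K(s, t) = min(s, t) on [0, 1], whose eigenpairs are known in closed form as \lambda_n = 1/((n - 1/2)\pi)^2 and \phi_n(t) = \sqrt{2} \sin((n - 1/2)\pi t). The function names below are ours.

```python
import math

def phi(n, t):
    # Orthonormal eigenfunctions of the operator with kernel min(s, t) on [0, 1]
    return math.sqrt(2.0) * math.sin((n - 0.5) * math.pi * t)

def lam(n):
    # Matching eigenvalues, positive and decreasing to zero
    return 1.0 / ((n - 0.5) * math.pi) ** 2

def mercer_sum(s, t, m):
    # Partial sum of the Mercer expansion with m terms
    return sum(lam(n) * phi(n, s) * phi(n, t) for n in range(1, m + 1))

s, t = 0.3, 0.7
print(min(s, t), mercer_sum(s, t, 500))  # partial sums approach K(s, t) = min(s, t)
```

Increasing m shrinks the truncation error uniformly in (s, t), as the theorem predicts.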

Historical Context

Mercer's theorem emerged from early 20th-century advancements in the theory of integral equations. In 1909, British mathematician James Mercer (1883–1932) published a seminal paper in which he demonstrated that a continuous symmetric positive definite kernel on a compact interval could be represented as a uniformly convergent series involving the eigenfunctions of the associated integral operator. This result, now known as Mercer's theorem, provided a key tool for expanding such kernels and connected directly to the spectral properties of integral operators. Mercer's work built on foundational contributions to integral equations by earlier mathematicians. Henri Poincaré had introduced integral equation methods in the late 19th century to address perturbation problems in celestial mechanics and differential equations, laying early groundwork for treating boundary value problems via integral representations. David Hilbert advanced this field significantly between 1904 and 1906 through a series of papers that developed systematic solutions for linear integral equations of the first and second kinds, emphasizing their role in expanding functions and resolving Fredholm-type problems. Mercer's 1909 expansion theorem extended these ideas by focusing on the uniform convergence of eigenfunction series for positive definite kernels. The theorem gained broader theoretical context in the 1930s with the formalization of spectral theory. John von Neumann's development of the spectral theorem for operators on Hilbert spaces, culminating in his 1932 monograph, provided a rigorous abstract framework that encompassed Mercer's result as a special case for compact integral operators. By 1953, Mercer's theorem was established as a standard tool in applied analysis, prominently featured in Richard Courant and David Hilbert's comprehensive treatment of the methods of mathematical physics. Mercer's broader contributions to analysis included studies on summability methods and function theory, though his early death limited further developments.

Mathematical Foundations

Positive Definite Kernels

A kernel K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}, where \mathcal{X} is a nonempty set, is called positive definite if it is symmetric, i.e., K(x,y) = K(y,x) for all x,y \in \mathcal{X}, and if for every finite collection of points x_1, \dots, x_n \in \mathcal{X} and real coefficients c_1, \dots, c_n \in \mathbb{R}, the quadratic form satisfies \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) \geq 0. \tag{1} This condition ensures that the associated Gram matrix is positive semi-definite, allowing the kernel to represent inner products in some feature space. An equivalent continuous formulation, often referred to as Mercer's condition, applies when K is a measurable function on a measure space (\mathcal{X}, \mu) with \mu(\mathcal{X}) < \infty: K is positive definite if \iint_{\mathcal{X} \times \mathcal{X}} g(x) K(x,y) g(y) \, d\mu(x) \, d\mu(y) \geq 0 for all square-integrable functions g \in L^2(\mathcal{X}, \mu). This integral form establishes the positive semi-definiteness of the quadratic form induced by K over the function space, serving as a prerequisite for the spectral analysis in Mercer's theorem. Common examples of positive definite kernels include the Gaussian kernel, defined as K(x,y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right), for x,y \in \mathbb{R}^d and \sigma > 0, which is universal and radially symmetric. Another simple case is the constant kernel K(x,y) = c for some constant c > 0, which corresponds to a one-dimensional feature space and yields constant similarity measures. These examples satisfy the positive definiteness condition due to their symmetry and the positive semi-definiteness of the resulting Gram matrices. Positive definite kernels exhibit several key properties beyond the defining quadratic-form condition, including the non-negativity of diagonal elements K(x,x) \geq 0 and adherence to the Cauchy-Schwarz inequality |K(x,y)|^2 \leq K(x,x) K(y,y). The set of positive definite kernels forms a convex cone that is closed under positive linear combinations and pointwise products.
In Gaussian process models, positive definite kernels often serve as covariance functions, where K(x,y) = \operatorname{Cov}(f(x), f(y)) for a random function f, ensuring valid probabilistic interpretations via Bochner's theorem in the stationary case.
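
A minimal sketch of the quadratic-form condition (1), checking it empirically for the Gaussian kernel on random points and random coefficient vectors; the point set, seed, and function names are illustrative choices of ours.

```python
import math
import random

def gauss_kernel(x, y, sigma=1.0):
    # Gaussian (RBF) kernel on the real line, a standard positive definite kernel
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

random.seed(0)
points = [random.uniform(-3, 3) for _ in range(8)]

# Check the finite quadratic form (1) for many random coefficient vectors c
for _ in range(100):
    c = [random.uniform(-1, 1) for _ in points]
    q = sum(c[i] * c[j] * gauss_kernel(points[i], points[j])
            for i in range(len(points)) for j in range(len(points)))
    assert q >= -1e-12, "quadratic form must be non-negative"
print("all sampled quadratic forms are non-negative")
```

Passing such spot checks does not prove positive definiteness, of course; for the Gaussian kernel it is guaranteed by theory (e.g., via Bochner's theorem).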

Integral Operators and Compactness

The integral operator T_K associated with a continuous symmetric kernel K on the compact interval [a, b] acts on the Hilbert space L^2[a, b] and is defined by (T_K f)(x) = \int_a^b K(x, y) f(y) \, dy for all f \in L^2[a, b]. This operator arises naturally in the study of integral equations connected to Mercer's theorem. Due to the symmetry of the kernel, K(x, y) = K(y, x) for all x, y \in [a, b], the operator T_K is self-adjoint on L^2[a, b]. That is, \langle T_K f, g \rangle = \langle f, T_K g \rangle for all f, g \in L^2[a, b], where \langle \cdot, \cdot \rangle denotes the inner product on L^2[a, b]. When K is continuous on the compact set [a, b] \times [a, b], the operator T_K is Hilbert–Schmidt and hence compact on the infinite-dimensional Hilbert space L^2[a, b]. Compactness implies that the eigenvalues of T_K (which are real due to self-adjointness) form a sequence that accumulates only at 0. The Hilbert–Schmidt norm of T_K is given by \| T_K \|_{HS}^2 = \int_a^b \int_a^b |K(x, y)|^2 \, dx \, dy, which is finite because K is continuous on a compact domain and thus bounded. For positive definite kernels, where T_K is positive (non-negative eigenvalues), this structure further implies that T_K is trace class, as the trace equals \int_a^b K(x, x) \, dx < \infty.
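
As a hedged numerical sketch (ours, not from the original text), the squared Hilbert–Schmidt norm can be approximated by midpoint-rule quadrature. For the Brownian-motion kernel K(s, t) = min(s, t) on [0, 1], the eigenvalues \lambda_n = 1/((n - 1/2)\pi)^2 satisfy \sum_n \lambda_n^2 = 1/6, which the double integral should reproduce.

```python
import math

def K(s, t):
    # Brownian-motion kernel on [0, 1]: continuous, symmetric, positive definite
    return min(s, t)

n = 400
h = 1.0 / n
nodes = [(i + 0.5) * h for i in range(n)]

# Midpoint-rule approximation of ||T_K||_HS^2 = integral of |K(s, t)|^2
hs_sq = sum(K(s, t) ** 2 for s in nodes for t in nodes) * h * h

# Closed form for this kernel: sum of squared eigenvalues = 1/6
print(hs_sq, 1.0 / 6.0)
```

The agreement illustrates the identity \|T_K\|_{HS}^2 = \sum_n \lambda_n^2 for a Hilbert–Schmidt operator.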

Proof and Derivation

Spectral Decomposition

In the context of Mercer's theorem, the integral operator T_K associated with a positive semi-definite kernel K on a compact domain is compact and self-adjoint on the Hilbert space L^2(X, \mu), where \mu is a suitable measure. By the spectral theorem for compact self-adjoint operators, T_K admits a countable orthonormal basis \{\phi_n\}_{n=1}^\infty of eigenfunctions in L^2(X, \mu) with corresponding eigenvalues \{\lambda_n\}_{n=1}^\infty, satisfying T_K \phi_n = \lambda_n \phi_n for each n. The eigenvalues are non-negative, \lambda_n \geq 0, due to the positive semi-definiteness of K, which ensures that \langle T_K f, f \rangle \geq 0 for all f \in L^2(X, \mu). The eigenvalues can be arranged in non-increasing order, \lambda_1 \geq \lambda_2 \geq \cdots \geq 0, and they accumulate only at zero, \lambda_n \to 0 as n \to \infty, reflecting the compactness of T_K. Each positive eigenvalue \lambda_n > 0 has finite multiplicity, meaning the eigenspace for \lambda_n is finite-dimensional. This spectral structure diagonalizes T_K in the basis \{\phi_n\}, allowing the operator to be expressed as T_K f = \sum_{n=1}^\infty \lambda_n \langle f, \phi_n \rangle_{L^2} \phi_n for any f \in L^2(X, \mu), where the series converges in the L^2-norm. The resolution of the identity follows from the completeness of the eigenfunctions: every f \in L^2(X, \mu) admits the expansion f = \sum_{n=1}^\infty \langle f, \phi_n \rangle_{L^2} \phi_n, with convergence in L^2-norm, completing the orthonormal decomposition of the space. A proof of this decomposition relies on variational arguments for compact self-adjoint operators. One constructs the eigenfunctions iteratively by maximizing |\langle T_K u, u \rangle| over the unit sphere, using compactness to ensure a maximizer exists, and self-adjointness to show it is an eigenvector; repeating the argument on orthogonal complements yields the basis, with eigenvalues decreasing to zero by compactness.
For simple eigenvalues, the eigenfunctions are unique up to phase: if \{\tilde{\phi}_n\} is another such basis, then \tilde{\phi}_n = e^{i\theta_n} \phi_n for some real \theta_n (a sign flip in the real case), preserving orthonormality and the eigenspaces; for repeated eigenvalues, any orthonormal basis of the corresponding eigenspace may be chosen.
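
The variational construction above (maximizing \langle T_K u, u \rangle over the unit sphere) can be mimicked numerically by power iteration on a discretized operator. This sketch, an illustration of ours, discretizes the kernel min(s, t) on [0, 1] with the midpoint rule; the exact largest eigenvalue is 4/\pi^2 \approx 0.405.

```python
import math

n = 200
h = 1.0 / n
nodes = [(i + 0.5) * h for i in range(n)]
# Midpoint (Nystrom) discretization: the matrix A acts like the operator T_K
A = [[min(s, t) * h for t in nodes] for s in nodes]

# Power iteration converges to the eigenvector of the largest eigenvalue
v = [1.0] * n
for _ in range(100):
    w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    nrm = math.sqrt(sum(x * x for x in w))
    v = [x / nrm for x in w]

Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
lam1 = sum(v[i] * Av[i] for i in range(n))  # Rayleigh quotient at the maximizer
print(lam1, 4.0 / math.pi ** 2)
```

The Rayleigh quotient realizes the maximum of \langle T_K u, u \rangle over unit vectors, which is exactly \lambda_1 in the spectral construction.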

Mercer Expansion Construction

The Mercer expansion of a positive semi-definite kernel K(x, y) on a compact domain is constructed by applying the spectral theorem to the associated compact, self-adjoint, Hilbert-Schmidt integral operator T_K: L^2(\Omega, \mu) \to L^2(\Omega, \mu) defined by (T_K f)(x) = \int_\Omega K(x, y) f(y) \, d\mu(y), where \Omega is a compact subset of \mathbb{R}^d and \mu is a finite Borel measure. The spectral theorem guarantees the existence of a countable orthonormal eigenbasis \{\phi_n\}_{n=1}^\infty \subset L^2(\Omega, \mu) and non-negative eigenvalues \{\lambda_n\}_{n=1}^\infty (with \lambda_1 \geq \lambda_2 \geq \cdots \geq 0 and \sum \lambda_n < \infty) satisfying the eigen-equation \int_\Omega K(x, y) \phi_n(y) \, d\mu(y) = \lambda_n \phi_n(x) for each n. Substituting this into the integral representation and leveraging the orthonormality of the eigenfunctions yields the series expansion K(x, y) = \sum_{n=1}^\infty \lambda_n \phi_n(x) \phi_n(y). For a continuous kernel K, the series converges pointwise and absolutely to K(x, y) for all x, y \in \Omega, and uniformly on \Omega \times \Omega; in the general L^2 setting, convergence holds almost everywhere with respect to \mu \times \mu. This follows from the compactness of T_K, which ensures the eigenvalues decay sufficiently fast (\sum \lambda_n^2 < \infty), together with bounds derived via the Cauchy-Schwarz inequality applied to the partial sums. The expansion induces an explicit feature map \psi: \Omega \to \ell^2 given by \psi(x) = \left( \sqrt{\lambda_1} \phi_1(x), \sqrt{\lambda_2} \phi_2(x), \dots \right), such that K(x, y) = \langle \psi(x), \psi(y) \rangle_{\ell^2}, embedding the kernel into an infinite-dimensional sequence space and facilitating computations in the feature space.
For practical approximations, the finite-rank truncation K_m(x, y) = \sum_{n=1}^m \lambda_n \phi_n(x) \phi_n(y) serves as a rank-m approximation with a computable error bound: the remainder r_m(x, y) = \sum_{n=m+1}^\infty \lambda_n \phi_n(x) \phi_n(y) satisfies |r_m(x, y)| \leq \sqrt{ \left( \sum_{n=m+1}^\infty \lambda_n \phi_n^2(x) \right) \left( \sum_{n=m+1}^\infty \lambda_n \phi_n^2(y) \right) } \leq \sqrt{ K(x,x) K(y,y) } by the Cauchy-Schwarz inequality, and the bound decreases monotonically as m increases. The expansion preserves positive semi-definiteness, as the quadratic form \sum_{i,j=1}^N c_i c_j K(x_i, x_j) = \sum_{n=1}^\infty \lambda_n \left( \sum_{i=1}^N c_i \phi_n(x_i) \right)^2 \geq 0 for any finite N, coefficients \{c_i\}, and points \{x_i\}, directly verifying the kernel's positive semi-definiteness through the non-negativity of the eigenvalues.
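
The remainder bound can be verified numerically. The sketch below (ours) again uses the Brownian-motion kernel min(s, t) on [0, 1], for which both the exact remainder r_m and the diagonal tails \sum_{n>m} \lambda_n \phi_n^2(\cdot) are computable from K(x, x) = x minus the partial sums.

```python
import math

def phi(n, t):
    return math.sqrt(2.0) * math.sin((n - 0.5) * math.pi * t)

def lam(n):
    return 1.0 / ((n - 0.5) * math.pi) ** 2

def partial(x, y, m):
    # Rank-m truncation K_m(x, y) of the Mercer expansion
    return sum(lam(n) * phi(n, x) * phi(n, y) for n in range(1, m + 1))

x, y, m = 0.3, 0.7, 20
r = min(x, y) - partial(x, y, m)        # exact remainder r_m(x, y)
tail_x = min(x, x) - partial(x, x, m)   # sum_{n>m} lam_n phi_n(x)^2 = K(x,x) - K_m(x,x)
tail_y = min(y, y) - partial(y, y, m)
bound = math.sqrt(tail_x * tail_y)      # Cauchy-Schwarz bound on |r_m(x, y)|
print(abs(r), bound)
assert abs(r) <= bound + 1e-12
```

Since the diagonal tails shrink to zero as m grows, the bound decays as the truncation rank increases.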

Properties

Trace Formula

One of the important consequences of Mercer's theorem is the trace formula, which establishes an identity between the integral of the kernel along its diagonal and the sum of its eigenvalues. For a continuous symmetric positive definite kernel K on the compact interval [a, b], the associated integral operator T_K f(t) = \int_a^b K(t, s) f(s) \, ds is trace class, and the trace satisfies \int_a^b K(t, t) \, dt = \sum_{n=1}^\infty \lambda_n, where \lambda_n > 0 are the eigenvalues appearing in the Mercer expansion K(t, s) = \sum_{n=1}^\infty \lambda_n \phi_n(t) \phi_n(s), with \{\phi_n\} an orthonormal basis of eigenfunctions. This formula underscores the finite summability of the eigenvalues for such kernels. The proof exploits the properties of trace-class operators and the spectral decomposition guaranteed by Mercer's theorem. Since T_K is compact, self-adjoint, and positive, the spectral theorem yields T_K \phi_n = \lambda_n \phi_n. The trace is then \operatorname{Tr}(T_K) = \sum_{n=1}^\infty \langle T_K \phi_n, \phi_n \rangle = \sum_{n=1}^\infty \lambda_n, by evaluating the trace in the orthonormal eigenbasis. Independently, for an integral operator with continuous kernel on a compact domain, the trace equals the diagonal integral \operatorname{Tr}(T_K) = \int_a^b K(t, t) \, dt. Equating these representations gives the desired identity. The continuity of K ensures the integral is finite, thereby confirming that T_K is indeed trace class. This trace formula has significant implications, including a bound relating the eigenvalues to the Hilbert-Schmidt norm of the operator. Specifically, \|T_K\|_{\mathrm{HS}} = \left( \sum_{n=1}^\infty \lambda_n^2 \right)^{1/2} = \left( \iint_{[a,b]^2} |K(s, t)|^2 \, ds \, dt \right)^{1/2}, which provides an L^2 control on the eigenvalue sequence beyond the mere finiteness of its sum.
In numerical contexts, the trace formula is particularly useful in Karhunen–Loève expansions, where the total variance of a centered stochastic process with covariance kernel K equals \sum \lambda_n = \int_a^b K(t, t) \, dt, offering a direct estimate of the process's overall variability from the kernel alone.
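
The trace formula is easy to test on the Brownian-motion kernel min(s, t) on [0, 1] (an example of ours): the diagonal integral is \int_0^1 t \, dt = 1/2, and the eigenvalue sum \sum_n 1/((n - 1/2)\pi)^2 should converge to the same value.

```python
import math

def lam(n):
    # Eigenvalues of the kernel K(s, t) = min(s, t) on [0, 1]
    return 1.0 / ((n - 0.5) * math.pi) ** 2

diag_integral = 0.5  # integral of K(t, t) = t over [0, 1]
eig_sum = sum(lam(n) for n in range(1, 200001))
print(diag_integral, eig_sum)  # the two sides of the trace formula agree
```

The tail of the eigenvalue sum decays like 1/(\pi^2 N), so a few hundred thousand terms already match the diagonal integral to several digits.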

Eigenfunction Continuity

Under the continuity assumption on the kernel K, the eigenfunctions corresponding to positive eigenvalues in Mercer's theorem exhibit smoothness properties that enhance the theorem's applicability. Specifically, if K is continuous and symmetric on the compact domain [a, b] \times [a, b], then the eigenfunctions \phi_n for eigenvalues \lambda_n > 0 of the associated integral operator T_K f(x) = \int_a^b K(x, y) f(y) \, dy are themselves continuous on [a, b]. The proof relies on the mapping properties of T_K. Since K is continuous on the compact set [a, b] \times [a, b], the operator T_K maps L^2[a, b] into the space of continuous functions C[a, b], as the integral defines a continuous function of x for any f \in L^2[a, b]. For \lambda_n > 0, the eigen-equation T_K \phi_n = \lambda_n \phi_n rearranges to \phi_n = \frac{1}{\lambda_n} T_K \phi_n, placing \phi_n in the range of T_K, which is a subset of C[a, b]. This bootstrap argument, grounded in the continuity of K and the self-adjointness of T_K, confirms the continuity of the eigenfunctions without requiring additional regularity like differentiability. A key consequence is the uniform convergence of the Mercer series. The continuity of the \phi_n ensures that the expansion K(x, y) = \sum_{n=1}^\infty \lambda_n \phi_n(x) \phi_n(y) converges absolutely and uniformly on [a, b] \times [a, b], by an argument in the spirit of Dini's theorem for monotone sequences of continuous functions. However, continuity does not hold universally. For \lambda_n = 0, eigenfunctions reside in the null space of T_K, which may include discontinuous L^2 functions orthogonal to the range of T_K. Similarly, if K is only square-integrable but discontinuous, T_K no longer maps into C[a, b], allowing discontinuous eigenfunctions even for positive eigenvalues. In approximation theory, the continuity of \phi_n for \lambda_n > 0 supports robust finite-dimensional reductions, such as truncating the Mercer expansion to the first N terms, yielding smooth basis functions that improve convergence rates in numerical schemes like the Nyström method.
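
The bootstrap identity \phi_n = (1/\lambda_n) T_K \phi_n is exactly what the Nyström method exploits: a discrete eigenvector, known only on grid points, is extended to a continuous function by one application of the integral operator. The sketch below (ours, for the kernel min(s, t) on [0, 1]) compares the Nyström extension at an off-grid point with the exact leading eigenfunction \sqrt{2} \sin(\pi x / 2).

```python
import math

n = 200
h = 1.0 / n
nodes = [(i + 0.5) * h for i in range(n)]
A = [[min(s, t) * h for t in nodes] for s in nodes]

# Power iteration for the leading discrete eigenpair
v = [1.0] * n
for _ in range(100):
    w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    nrm = math.sqrt(sum(x * x for x in w))
    v = [x / nrm for x in w]
lam1 = sum(v[i] * sum(A[i][j] * v[j] for j in range(n)) for i in range(n))

# Rescale v so that it is normalized as an L^2 function (sum u_j^2 * h = 1)
u = [x / math.sqrt(h) for x in v]

def nystrom_phi(x):
    # Bootstrap identity phi = (1/lambda) T_K phi, evaluated off the grid
    return sum(min(x, nodes[j]) * u[j] * h for j in range(n)) / lam1

x = 0.37  # an off-grid evaluation point
print(nystrom_phi(x), math.sqrt(2.0) * math.sin(math.pi * x / 2.0))
```

Because the extension is an integral of the continuous kernel against a fixed function, it is automatically continuous, mirroring the proof above.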

Generalizations

Discrete Versions

In the finite-dimensional setting, Mercer's theorem finds its analog in the spectral decomposition of symmetric positive semi-definite matrices, which arise naturally as Gram matrices for finite sets of data points under a kernel function. Consider an n \times n matrix K = (K_{ij}) where K_{ij} = k(x_i, x_j) for distinct points x_1, \dots, x_n in an input space and a symmetric positive definite kernel k. Such a K is symmetric by construction and positive semi-definite, meaning that for any vector v \in \mathbb{R}^n, v^\top K v \geq 0. The symmetry of K ensures that all eigenvalues are real, and the positive semidefiniteness guarantees that these eigenvalues \lambda_1, \dots, \lambda_n are non-negative. By the spectral theorem for symmetric matrices, K admits an eigenvalue decomposition K = V \Lambda V^\top, where \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_n) is the diagonal matrix of eigenvalues and V is an orthogonal matrix whose columns v_1, \dots, v_n \in \mathbb{R}^n are the corresponding orthonormal eigenvectors. This decomposition provides the exact Mercer-like expansion for the matrix entries: K_{ij} = \sum_{k=1}^n \lambda_k v_k(i) v_k(j), where v_k(i) denotes the i-th component of the eigenvector v_k. Unlike the continuous case, this sum is finite and exact, with no convergence issues, reflecting the compactness inherent in finite dimensions. A practical example occurs in support vector machines, where the kernel matrix K encodes pairwise similarities between training points, and its eigendecomposition facilitates efficient computations such as projections in the feature space. Furthermore, the expansion connects to feature representations: the scaled eigenvectors \sqrt{\lambda_k} v_k can serve as explicit finite-dimensional feature vectors approximating the implicit kernel mapping, and this is linked to the Cholesky decomposition K = LL^\top, where the columns of the lower-triangular factor L provide an alternative basis for the features when K is positive definite. As the dimension n grows large, the discrete expansion approximates the continuous Mercer series for the underlying kernel operator, bridging finite data approximations to the infinite-dimensional integral representation on compact domains.
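
The discrete expansion K_{ij} = \sum_k \lambda_k v_k(i) v_k(j) can be demonstrated directly. The sketch below (an illustration of ours) diagonalizes a small Gaussian Gram matrix with the classical Jacobi rotation method and reconstructs the matrix from its eigenpairs.

```python
import math
import random

def gauss(x, y, sigma=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

random.seed(1)
pts = [random.uniform(0, 2) for _ in range(4)]
n = len(pts)
K = [[gauss(pts[i], pts[j]) for j in range(n)] for i in range(n)]

# Jacobi rotations: drive off-diagonal entries to zero, so A -> diag(lam), K = V A V^T
A = [row[:] for row in K]
V = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
for _ in range(100):
    p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda ij: abs(A[ij[0]][ij[1]]))
    if abs(A[p][q]) < 1e-12:
        break
    theta = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
    c, s = math.cos(theta), math.sin(theta)
    for k in range(n):  # column rotation A <- A G
        Akp, Akq = A[k][p], A[k][q]
        A[k][p], A[k][q] = c * Akp - s * Akq, s * Akp + c * Akq
    for k in range(n):  # row rotation A <- G^T A
        Apk, Aqk = A[p][k], A[q][k]
        A[p][k], A[q][k] = c * Apk - s * Aqk, s * Apk + c * Aqk
    for k in range(n):  # accumulate eigenvectors V <- V G
        Vkp, Vkq = V[k][p], V[k][q]
        V[k][p], V[k][q] = c * Vkp - s * Vkq, s * Vkp + c * Vkq

lam = [A[i][i] for i in range(n)]
# Reconstruct K_ij = sum_k lam_k v_k(i) v_k(j): the finite, exact Mercer sum
recon = [[sum(lam[k] * V[i][k] * V[j][k] for k in range(n)) for j in range(n)]
         for i in range(n)]
err = max(abs(K[i][j] - recon[i][j]) for i in range(n) for j in range(n))
print(err)
```

The reconstruction error is at the level of floating-point round-off, and all eigenvalues come out non-negative, as the Gram matrix of a positive definite kernel requires.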

Measure-Theoretic Extensions

Mercer's theorem extends to more abstract settings beyond the classical interval case, considering a kernel K defined on a compact Hausdorff space X equipped with a \sigma-finite measure \mu. In this framework, the associated integral operator T_K acts on the Hilbert space L^2(X, \mu) by (T_K f)(x) = \int_X K(x, y) f(y) \, d\mu(y). For T_K to satisfy the conditions of the theorem, it must be compact, positive, and self-adjoint, which typically requires K to be measurable, symmetric, and positive semi-definite in the sense that \int_X \int_X K(x, y) \overline{f(x)} f(y) \, d\mu(x) \, d\mu(y) \geq 0 for all f \in L^2(X, \mu). Under these assumptions, the spectral theorem for compact self-adjoint operators on Hilbert spaces guarantees the existence of an orthonormal system \{\phi_n\}_{n=1}^\infty \subset L^2(X, \mu) of eigenfunctions with corresponding positive eigenvalues \{\lambda_n\}_{n=1}^\infty satisfying \lambda_n \to 0 as n \to \infty and T_K \phi_n = \lambda_n \phi_n. The kernel admits the expansion K(x, y) = \sum_{n=1}^\infty \lambda_n \phi_n(x) \phi_n(y) in the L^2(\mu \times \mu) sense, meaning the partial sums converge to K in the norm of L^2(X \times X, \mu \times \mu). This convergence holds \mu \times \mu-almost everywhere, reflecting the operator's Hilbert-Schmidt structure without requiring stronger topological assumptions on X. Convergence properties vary with additional conditions on K and the space. If K is continuous on the compact space X, the series converges pointwise to K(x, y) for all x, y \in X, and uniformly on compact subsets, leveraging the uniform continuity of K. For Hilbert-Schmidt operators, where \|T_K\|_{HS}^2 = \int_X \int_X |K(x, y)|^2 \, d\mu(x) \, d\mu(y) < \infty, the L^2(\mu \times \mu) convergence is absolute and ensures the eigenfunctions form a complete basis of the closure of the range of T_K in L^2(X, \mu). These variants highlight how the theorem adapts to weaker regularity, prioritizing L^2 norms over pointwise behavior. Further extensions broaden the applicability to non-compact domains or spaces without full compactness.
On first-countable locally compact Hausdorff spaces that are \sigma-compact with a Radon measure \mu, the theorem holds by approximating via an increasing sequence of compact subsets, provided the eigenvalues are summable (\sum \lambda_n < \infty) to ensure trace-class properties. For non-compact X, such as separable metric spaces with locally finite measures, the expansion converges in L^2(\mu \times \mu) if T_K remains compact, often requiring boundedness of K and integrability conditions like \int_X K(x, x) \, d\mu(x) < \infty. However, uniform convergence across the entire space generally fails without additional topological restrictions, such as second countability or uniform boundedness of the eigenfunctions, limiting pointwise guarantees to compact subsets.

Matrix-Valued Kernels

Mercer's theorem generalizes to matrix-valued kernels, which map pairs of inputs to matrices in \mathbb{R}^{N \times N} (or \mathbb{C}^{N \times N}) and are useful for vector-valued reproducing kernel Hilbert spaces. Consider a kernel K: X \times X \to \mathbb{R}^{N \times N} that is bounded, continuous, and symmetric, meaning K(x, y) = K(y, x)^\top, defined on a separable metric space X with a locally finite measure \mu. The associated integral operator acts on vector-valued functions in L^2(X, \mu; \mathbb{R}^N). Under positive definiteness (equivalently, integral positivity \int_X \int_X f(x)^\top K(x, y) f(y) \, d\mu(x) \, d\mu(y) \geq 0 for suitable f), the kernel admits a spectral decomposition entry-wise: K_{\ell m}(x, y) = \sum_{k} \sigma_k \phi_\ell^k(x) \phi_m^k(y), where \{\sigma_k\} are non-negative eigenvalues and \{\phi^k\} are continuous vector-valued eigenfunctions forming an orthonormal system in L^2(X, \mu; \mathbb{R}^N), with the series converging uniformly on compact subsets of X \times X. This extension preserves the core eigenexpansion while accommodating finite-dimensional vector outputs, enabling applications in multi-task learning and operator-valued kernels.

Applications

Reproducing Kernel Hilbert Spaces

A reproducing kernel Hilbert space (RKHS) for a positive definite kernel K: X \times X \to \mathbb{R} on a set X is a Hilbert space \mathcal{H} of real-valued functions on X such that point evaluation at any x \in X is a bounded linear functional on \mathcal{H}. This property implies the existence of a unique reproducing kernel K_x(y) = K(x, y) for each x \in X, satisfying the reproduction formula f(x) = \langle f, K_x \rangle_{\mathcal{H}} for all f \in \mathcal{H}. The kernel K fully determines the space \mathcal{H}, and conversely, every such Hilbert space admits a unique reproducing kernel given by K(x, y) = \langle K_x, K_y \rangle_{\mathcal{H}}. When X is compact and K is continuous and symmetric positive semi-definite, Mercer's theorem provides an explicit construction of the associated RKHS. The kernel admits the expansion K(x, y) = \sum_{n=1}^\infty \lambda_n \phi_n(x) \phi_n(y), where \{\lambda_n\}_{n=1}^\infty is a sequence of non-negative eigenvalues in decreasing order and \{\phi_n\}_{n=1}^\infty is an orthonormal basis of eigenfunctions in L^2(X, \mu) for some measure \mu on X. The RKHS \mathcal{H} is then the completion (closure) of the linear span of \{\sqrt{\lambda_n} \phi_n \mid n \in \mathbb{N}, \lambda_n > 0\} under the inner product defined by \left\langle \sum_{n} a_n \sqrt{\lambda_n} \phi_n, \sum_{n} b_n \sqrt{\lambda_n} \phi_n \right\rangle_{\mathcal{H}} = \sum_{n} a_n b_n, for finite linear combinations. Functions in \mathcal{H} are thus representable as f = \sum_{n} c_n \sqrt{\lambda_n} \phi_n with \sum_{n} c_n^2 < \infty, and the norm is \|f\|_{\mathcal{H}}^2 = \sum_{n} c_n^2. This construction ensures that the reproducing property holds, as \langle f, K_x \rangle_{\mathcal{H}} = f(x). The Mercer construction induces an isometric embedding of the RKHS into \ell^2 via the feature map \psi: X \to \ell^2, defined by \psi(x) = (\sqrt{\lambda_n} \phi_n(x))_{n=1}^\infty. 
This map satisfies \langle \psi(x), \psi(y) \rangle_{\ell^2} = K(x, y), preserving kernel-induced inner products, and extends to an isometry from \mathcal{H} onto a closed subspace of \ell^2 by mapping f = \sum c_n \sqrt{\lambda_n} \phi_n to (c_n)_{n=1}^\infty. Consequently, computations in \mathcal{H} can be performed implicitly through kernel evaluations without explicit feature expansion. Certain kernels, termed universal, yield RKHSs that are dense in the space of continuous functions C(X) on compact X with the uniform norm. For such kernels, the density of the span of \{K_x \mid x \in X\} in C(X) implies universal approximation properties: functions in \mathcal{H} can approximate any continuous function arbitrarily well in the sup-norm. A standard example is the Gaussian kernel K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2). The Mercer expansion also connects directly to the Karhunen-Loève theorem in the context of stochastic processes. For a mean-zero Gaussian process on X with covariance kernel K, the eigenfunctions \phi_n and eigenvalues \lambda_n from Mercer's theorem provide the principal components, yielding the Karhunen-Loève expansion Z(x) = \sum_{n=1}^\infty \sqrt{\lambda_n} Z_n \phi_n(x), where Z_n are uncorrelated standard Gaussian random variables. This representation minimizes the mean-squared error among finite-rank approximations and underpins dimensionality reduction in random fields.
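
The coordinate description of the RKHS can be exercised numerically. In this sketch (ours, using the min(s, t) kernel's known eigenpairs and a finitely supported coefficient sequence), an element f = \sum_n c_n \sqrt{\lambda_n} \phi_n has norm \|f\|_{\mathcal{H}}^2 = \sum_n c_n^2, and boundedness of point evaluation, |f(x)| \leq \|f\|_{\mathcal{H}} \sqrt{K(x, x)}, follows from Cauchy-Schwarz against K_x.

```python
import math

m = 50
phi = lambda n, t: math.sqrt(2.0) * math.sin((n - 0.5) * math.pi * t)
lam = lambda n: 1.0 / ((n - 0.5) * math.pi) ** 2

# An RKHS element f = sum_n c_n sqrt(lam_n) phi_n with ||f||_H^2 = sum_n c_n^2
c = [1.0 / n for n in range(1, m + 1)]
norm_f = math.sqrt(sum(ci ** 2 for ci in c))

def f(x):
    return sum(c[n - 1] * math.sqrt(lam(n)) * phi(n, x) for n in range(1, m + 1))

x = 0.42
# Squared l^2 norm of the (truncated) feature vector psi(x): approximates K(x, x) = x
Kxx = sum(lam(n) * phi(n, x) ** 2 for n in range(1, m + 1))

# Bounded point evaluation: |f(x)| = |<f, K_x>_H| <= ||f||_H * sqrt(K(x, x))
print(abs(f(x)), norm_f * math.sqrt(Kxx))
```

The inequality is exact for the truncated system because f(x) is literally the \ell^2 inner product of the coefficient vector with the feature vector \psi(x).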

Machine Learning and Statistics

Mercer's theorem underpins kernel methods in machine learning by establishing that a continuous, symmetric, positive semi-definite kernel corresponds to an inner product in a reproducing kernel Hilbert space (RKHS), enabling algorithms to implicitly operate in potentially infinite-dimensional feature spaces without computing the features explicitly. This justification is crucial for methods like support vector machines (SVMs), where the kernel trick allows learning in high-dimensional spaces via kernel evaluations alone, avoiding the computational cost of explicit mappings. Similarly, Gaussian processes use kernels as covariance functions, with Mercer's expansion providing a spectral representation that facilitates probabilistic predictions without direct feature computation. In kernel principal component analysis (kernel PCA), Mercer's theorem ensures that the eigen-decomposition of the Gram matrix yields an approximation to the continuous Mercer expansion, allowing dimensionality reduction by projecting data onto the leading eigenfunctions of the integral operator. This approach captures nonlinear structure in data that linear PCA misses, with the eigenvalues indicating the variance explained in the feature space. The method's efficacy relies on the theorem's guarantee of orthogonal eigenfunctions, enabling stable and interpretable low-dimensional representations. For Gaussian processes, the covariance kernel defines the prior over functions, and Mercer's theorem decomposes it into a series of eigenfunctions and eigenvalues, which serve as basis functions for expressing the posterior mean and variance. This spectral view aids in understanding the process's smoothness and sample-path properties, with the decay of the eigenvalues quantifying the effective dimensionality of the model. In practice, this representation supports efficient inference in regression and classification tasks by approximating the infinite sum with finitely many terms. Recent advancements since 2018 have integrated kernels into deep kernel learning, where neural networks parameterize kernels to approximate Mercer expansions, boosting expressivity for complex data while maintaining probabilistic guarantees.
For example, hierarchical deep kernel models compose base kernels in layers, drawing on the theorem to ensure positive definiteness and enable scalable models with neural-like flexibility. Neural tangent kernels (NTKs), which characterize the infinite-width limit of deep networks as kernel methods, rely on Mercer's spectral decomposition to analyze training dynamics and generalization, bridging neural networks and classical kernel methods. To address scalability, the Nyström approximation truncates the Mercer series by subsampling the kernel matrix, reducing computational complexity from cubic to roughly linear in dataset size while preserving approximation quality for large-scale applications. In statistical estimation, empirical Mercer expansions derived from the eigendecomposition of the sample Gram matrix provide consistent estimates of the population eigenfunctions and eigenvalues as the sample size grows, under assumptions such as i.i.d. sampling and a bounded continuous kernel. This consistency ensures that data-driven approximations converge to the true spectral quantities, supporting reliable inference in kernel-based estimators such as kernel density estimation or kernel ridge regression. Theoretical bounds on the convergence rate depend on the eigenvalue decay and the effective dimension of the kernel, guiding practical choices of kernel and regularization.
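
The Nyström idea mentioned above can be sketched in a few lines: approximate a full Gram matrix K by C W^{-1} C^\top, where C collects the kernel columns at a few landmark points and W is the landmark-landmark block. This is an illustration of ours (points, seed, landmark indices, and the small ridge term are arbitrary choices), not a production implementation.

```python
import math
import random

def rbf(x, y, sigma=0.5):
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

def inv(M):
    # Gauss-Jordan inverse of a small square matrix (with partial pivoting)
    k = len(M)
    A = [row[:] + [1.0 if i == j else 0.0 for j in range(k)]
         for i, row in enumerate(M)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        d = A[col][col]
        A[col] = [v / d for v in A[col]]
        for r in range(k):
            if r != col:
                fac = A[r][col]
                A[r] = [v - fac * w for v, w in zip(A[r], A[col])]
    return [row[k:] for row in A]

random.seed(0)
xs = sorted(random.uniform(0, 1) for _ in range(30))
n = len(xs)
K = [[rbf(xs[i], xs[j]) for j in range(n)] for i in range(n)]

# Nystrom: a handful of spread-out landmark indices S
S = [0, 7, 14, 21, 29]
C = [[K[i][s] for s in S] for i in range(n)]
W = [[K[sa][sb] + (1e-10 if sa == sb else 0.0) for sb in S] for sa in S]  # tiny ridge
Winv = inv(W)
approx = [[sum(C[i][a] * Winv[a][b] * C[j][b]
               for a in range(len(S)) for b in range(len(S)))
           for j in range(n)] for i in range(n)]

err = max(abs(K[i][j] - approx[i][j]) for i in range(n) for j in range(n))
print(err)  # small, because the smooth RBF Gram matrix is nearly low-rank
```

The approximation is accurate precisely when the Mercer eigenvalues of the kernel decay quickly, so that a low-rank truncation captures almost all of the spectrum.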

References

  1. [1]
    Mercer, James. "XVI. Functions of positive and negative type, and their connection with the theory of integral equations."
  2. [2]
    "Mercer's Theorem and Related Topics" (PDF). USC Dornsife.
  3. [3]
    "Reproducing Kernel Hilbert Space, Mercer's Theorem ..." (PDF). arXiv, June 15, 2021. A tutorial and survey paper on kernels, kernel methods, and related fields.
  4. [4]
    "Mercer's Theorem, Feature Maps, and Smoothing" (PDF).
  5. [5]
    "James Mercer - Biography". MacTutor History of Mathematics archive, University of St Andrews.
  6. [6]
    "Math Origins: Eigenvectors and Eigenvalues".
  7. [7]
    "On the origin and early history of functional analysis" (PDF). DiVA portal.
  8. [8]
    Mercer, James (1909). "Functions of positive and negative type, and their connection with the theory of integral equations". Proc. R. Soc. Lond. A.
  9. [9]
    "A Generalized Form of Mercer's Theorem" (PDF). HELDA.
  10. [10]
    "Mercer's Theorem" (PDF).
  11. [11]
    "18.102 S2021 Lecture 22. The Spectral Theorem for a Compact Self-Adjoint Operator" (PDF). May 11, 2021.
  12. [12]
    "MATH 590: Meshfree Methods - Chapter 2 — Part 1" (PDF).
  13. [13]
    "CS281B/Stat241B. Statistical Learning Theory. Lecture 20" (PDF).
  14. [14]
    "Support Vector Machines" (PDF). Statistics & Data Science, November 20, 2009.
  15. [15]
    "Linear & Ridge Regression and Kernels" (PDF). People @EECS.
  16. [16]
    "Theory of Positive Definite Kernel and Reproducing Kernel Hilbert ..." (PDF).
  18. [18]
    [PDF] Vector valued reproducing kernel Hilbert spaces of integrable ...
    As a consequence of the above result we have the following version of Mercer theorem. Let ν be a positive σ-finite measure defined on the Borel σ-algebra Σ(σΓ ).
  19. [19]
    [PDF] An extension of Mercer theorem to vector-valued measurable kernels
    Oct 18, 2011 · Abstract. We extend the classical Mercer theorem to reproducing kernel. Hilbert spaces whose elements are functions from a measurable space.
  20. [20]
    [PDF] The Mercer-Young Theorem for Matrix-Valued Kernels on ... - arXiv
    Apr 15, 2025 · To achieve this, we extend. Mercer's original proof (see Part II of [31]) from compact intervals to separable metric spaces. This extension is ...
  21. [21]
    [PDF] Kernel methods in machine learning - arXiv
    We review machine learning methods employing positive definite kernels. These methods formulate learning and estimation problems in a reproducing kernel ...
  22. [22]
    [PDF] Nonlinear Component Analysis as a Kernel Eigenvalue Problem
    We describe a new method for performing a nonlinear form of Principal Component Anal- ysis. By the use of integral operator kernel functions, ...
  23. [23]
    [PDF] Covariance Functions - Gaussian Processes for Machine Learning
    We first define eigenvalues and eigenfunctions and discuss Mercer's theorem which allows us to express the kernel (under certain conditions) in terms of these.
  24. [24]
    [PDF] Gaussian Processes and Kernel Methods - arXiv
    Jul 6, 2018 · For Gaussian processes, positive definite kernels serve as covariance functions of random function values, so they are also called covariance ...
  25. [25]
    [PDF] Hierarchical Kernels in Deep Kernel Learning
    By Mercer's theorem (Mercer,. 1909), any continuous kernel on a compact metric space is a Hilbert-Schmidt kernel. For this sake, we shall first present the ...Missing: post- | Show results with:post-
  26. [26]
    Scalable Gaussian Processes with Low-Rank Deep Kernel ... - arXiv
    May 24, 2025 · Drawing on Mercer's theorem, we introduce a fully data-driven, scalable deep kernel representation where a neural network directly ...
  27. [27]
    Understanding Layer-wise Contributions in Deep Neural Networks ...
    Nov 6, 2021 · Spectral analysis is a powerful tool, decomposing any function into simpler parts. In machine learning, Mercer's theorem generalizes this idea, ...
  28. [28]
    [PDF] Sampling-based Nyström Approximation and Kernel Quadrature
    Abstract. We analyze the Nyström approximation of a posi- tive definite kernel associated with a probability measure. We first prove an improved error bound.