
Representer theorem

The representer theorem is a result in the theory of reproducing kernel Hilbert spaces (RKHS), asserting that the minimizer of a regularized empirical risk functional—comprising a loss term over training data and a strictly monotonic penalty on the norm of the solution—can always be represented as a finite linear combination of kernel evaluations centered at the training points, regardless of the potentially infinite dimensionality of the underlying feature space. Originally developed in the context of spline smoothing and nonparametric regression, the theorem traces its roots to the work of Wahba and colleagues in the early 1970s, who established it for problems involving Sobolev spaces and squared-error loss with quadratic regularization, enabling solutions in the form of reproducing kernel expansions for statistical tasks such as smoothing splines. This classical version, often termed Wahba's representer theorem, laid the groundwork for broader applications in approximation theory and statistics.

In the late 1990s and early 2000s, the theorem was generalized by researchers in machine learning to encompass a wider array of loss functions (such as the hinge loss for classification) and regularizers (beyond quadratics, including strictly increasing functions of the RKHS norm), culminating in a unified statement that applies to positive definite kernels and arbitrary cost structures, provided the regularizer is strictly monotonic. These extensions, such as those incorporating vector-valued outputs or multitask settings, further characterize the conditions under which the solution lies in a data-spanned subspace, using concepts like orthomonotone functions to unify variants across regularization paradigms.

The theorem's practical significance stems from enabling the "kernel trick," which allows computationally efficient algorithms by reducing optimization to finite-dimensional problems in the coefficients of the kernel expansion, even in high- or infinite-dimensional feature spaces; this underpins key kernel-based methods like support vector machines for classification, kernel ridge regression for prediction, and Gaussian processes in both statistics and machine learning. Its influence extends to modern applications such as metric learning, where it ensures tractable solutions for ill-posed inverse problems by constraining the hypothesis space to the span of observed data. Recent extensions have further linked the theorem to neural networks and reproducing kernel Banach spaces, enhancing its relevance in modern machine learning as of 2025.

Background and Motivation

Origins and Historical Development

The foundations of the representer theorem trace back to the theory of reproducing kernel Hilbert spaces (RKHS), formally developed by Nachman Aronszajn in his seminal 1950 paper, which established the mathematical framework for spaces where point evaluations are continuous linear functionals, enabling kernel-based representations of functions. Early applications of kernel ideas emerged in the context of pattern recognition through the work of Mark A. Aizerman, Emmanuel M. Braverman, and Lev I. Rozonoer, who in 1964 introduced the potential function method as a means to classify patterns using sums of kernel-like functions centered at data points, laying groundwork for non-linear decision boundaries without explicit feature mapping. The theorem's direct precursors appeared in statistical smoothing problems through the work of George Kimeldorf and Grace Wahba in the early 1970s, such as their 1970 paper establishing the representer theorem for spline smoothing in a Bayesian context. Wahba's 1990 book on spline models for observational data further articulated and applied these results, showing that optimal smoothing splines could be expressed as finite linear combinations of kernel evaluations at observed points, building on RKHS theory to solve regularization problems in function estimation. The modern formal statement of the representer theorem in machine learning was provided by Bernhard Schölkopf, Ralf Herbrich, and Alexander J. Smola in their 2001 paper "A Generalized Representer Theorem," which extended earlier results to a broad class of regularized risk minimization problems in RKHS and has since shaped kernel-based learning methods. This timeline—from Aronszajn's abstract theory in 1950, through Aizerman et al.'s 1964 practical kernels, Kimeldorf and Wahba's 1970 representer theorem for splines, Wahba's 1990 book summarizing these advances, to the 2001 generalization—highlights the theorem's evolution from abstract Hilbert space theory to computational optimization in machine learning.

Role in Kernel Methods

Kernel methods constitute a fundamental framework in machine learning for addressing nonlinear problems by implicitly mapping input data into high-dimensional feature spaces through kernel functions. These kernels, defined as continuous, symmetric, and positive semi-definite functions k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}, satisfy Mercer's condition, which guarantees their representation as inner products \langle \phi(x), \phi(x') \rangle in some feature space \mathcal{H} via a feature map \phi: \mathcal{X} \to \mathcal{H}. This implicit mapping, often of infinite dimensionality, enables algorithms to capture complex patterns without the need to compute \phi(x) explicitly, relying instead on kernel evaluations for all computations.

At the heart of kernel methods lies the reproducing kernel Hilbert space (RKHS), a complete Hilbert space \mathcal{H}_k of functions on \mathcal{X} associated with the kernel k, where the kernel acts as the reproducing kernel. The defining reproducing property of an RKHS asserts that point evaluations are continuous linear functionals, expressed as f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}_k} for all f \in \mathcal{H}_k and x \in \mathcal{X}, ensuring that function values can be obtained via inner products in the feature space. This property underpins the efficiency of kernel methods, as it allows optimization and inference to proceed through kernel matrices formed from data points, transforming high-dimensional problems into manageable matrix operations.

The representer theorem plays a pivotal role in kernel methods by demonstrating that solutions to regularized learning problems in an RKHS reside in a finite-dimensional subspace spanned by kernel functions centered at the training points. Specifically, for a loss function \mathcal{L} and a regularization term involving the RKHS norm, the minimizer takes the form f^*(\cdot) = \sum_{i=1}^n c_i k(\cdot, x_i), where n is the number of training points \{x_i\} and the coefficients c_i are determined by optimization. This finite representation drastically reduces computational complexity, as it confines the search space from the potentially infinite-dimensional RKHS to an n-dimensional subspace, making kernel methods scalable for practical applications.

Intuitively, the theorem ensures that the optimal function aligns with the data through the reproducing property, confining the solution to the subspace generated by the kernel-centered functions at observed points while enforcing regularization to prevent overfitting. This mechanism is essential for the success of kernel-based learners, such as support vector machines, where the theorem guarantees that decision boundaries or regression functions depend solely on support vectors within this span.
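As a concrete illustration of the finite expansion, the following minimal sketch evaluates a function of the representer form using only kernel computations against the training points; the RBF kernel, toy data, and coefficients are assumptions for demonstration rather than the output of any particular learning algorithm.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian RBF kernel matrix with entries exp(-gamma * ||x - z||^2)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10, 2))   # n = 10 training inputs in R^2
c = rng.normal(size=10)              # coefficients c_i of the representer expansion

def f(x_new):
    """f(x) = sum_i c_i k(x, x_i): evaluation needs only kernel values against the
    training points, never the (possibly infinite-dimensional) feature map."""
    return rbf_kernel(np.atleast_2d(x_new), X_train) @ c

print(f(np.zeros(2)))                # value of the kernel expansion at a test point
```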

Core Theorem

Formal Statement

The representer theorem provides a characterization of the solution to regularized empirical risk minimization problems in a reproducing kernel Hilbert space (RKHS) \mathcal{H} associated with a positive semi-definite kernel K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}. Given a training set of n data points (x_i, y_i)_{i=1}^n with x_i \in \mathcal{X} and y_i \in \mathbb{R}, the theorem considers the optimization problem of finding f \in \mathcal{H} that minimizes the functional \sum_{i=1}^n L(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2, where L: \mathbb{R} \times \mathbb{R} \to \mathbb{R} \cup \{\infty\} is a loss function and \lambda > 0 is a regularization parameter. The theorem asserts that any minimizer f^* \in \mathcal{H} of this functional admits an expansion in the span of the kernel functions centered at the training inputs: f^*(\cdot) = \sum_{i=1}^n \alpha_i K(x_i, \cdot), where the coefficients \alpha_1, \dots, \alpha_n \in \mathbb{R} are scalars obtained by solving an equivalent finite-dimensional optimization problem. In boundary cases, the theorem yields trivial solutions: when n=0 (no data), the minimizer is the zero function f^* = 0; when \lambda = 0, the problem reduces to unregularized empirical risk minimization, where solutions may be non-unique or lie outside the representer form if the loss admits a perfect fit without constraint.
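For the special case of the squared loss, the finite-dimensional problem over the coefficients has a closed form: substituting f = \sum_j \alpha_j K(x_j, \cdot) reduces the functional to \|y - K\alpha\|^2 + \lambda \alpha^\top K \alpha, solved by (K + \lambda I)\alpha = y. The sketch below (kernel choice and synthetic data are assumptions) computes this kernel ridge regression solution in representer form.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(50, 1))                 # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)      # noisy targets

lam = 0.1
K = rbf_kernel(X, X)                                 # Gram matrix K_ij = K(x_i, x_j)
# Squared loss: minimizing ||y - K a||^2 + lam * a^T K a is solved by (K + lam I) a = y.
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

def f_star(X_new):
    """Minimizer in representer form: f*(x) = sum_i alpha_i K(x_i, x)."""
    return rbf_kernel(np.atleast_2d(X_new), X) @ alpha

print(f_star(np.array([[0.0]])))
```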

Assumptions and Prerequisites

The representer theorem operates within the framework of reproducing kernel Hilbert spaces (RKHS), requiring familiarity with basic concepts from functional analysis and optimization. Specifically, prerequisites include understanding convex functions and their minimization, as the theorem relies on the existence of solutions to minimization problems, and knowledge of Hilbert space norms, particularly the squared norm \|f\|_H^2 that measures the complexity or smoothness of functions in the RKHS H.

Central to the theorem's applicability are the assumptions on the optimization problem it addresses: minimizing an empirical risk plus a regularization term over functions in an RKHS. Although a convex loss ensures that the overall objective is convex and admits a unique minimizer, the representer theorem holds for more general losses (possibly non-convex) as long as a minimizer exists; common examples include the squared loss for regression or the hinge loss for classification. The regularization term takes the form \lambda \|f\|_H^2 with \lambda > 0, where \lambda controls the trade-off between fitting the data and controlling function complexity, and the squared RKHS norm promotes solutions with desirable smoothness properties.

The kernel K: \mathcal{X} \times \mathcal{X} \to \mathbb{R} must be positive semi-definite, which guarantees the existence of an associated RKHS H equipped with the reproducing property: for any f \in H and x \in \mathcal{X}, f(x) = \langle f, K(\cdot, x) \rangle_H. This property allows functions in H to be evaluated via inner products with kernel-centered basis elements. The data consist of a finite training set \{(x_i, y_i)\}_{i=1}^n, where x_i \in \mathcal{X} (the input space) and y_i \in \mathbb{R} are scalar targets, enabling the empirical risk to be computed solely on these points.

These assumptions delineate the theorem's scope, and violations lead to failures. If the loss L is such that no minimizer exists, or the problem otherwise lacks suitable properties, the representation may not hold. Similarly, omitting the regularization term (\lambda = 0) results in infinitely many interpolating solutions without a guaranteed finite-dimensional representation, as in unregularized problems where a minimum-RKHS-norm solution is not enforced.
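Positive semi-definiteness of a candidate kernel can be spot-checked numerically on a sample of points by inspecting the eigenvalues of the resulting Gram matrix; the tolerance and toy kernels below are illustrative assumptions, not a substitute for a proof of positive semi-definiteness.

```python
import numpy as np

def is_psd(K, tol=1e-10):
    """Numerically check positive semi-definiteness of a Gram matrix:
    all eigenvalues of the symmetrized matrix should be >= -tol."""
    eigvals = np.linalg.eigvalsh(0.5 * (K + K.T))
    return bool(eigvals.min() >= -tol)

X = np.random.default_rng(2).normal(size=(30, 3))
K_linear = X @ X.T                                                # linear kernel Gram matrix
K_rbf = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # Gaussian kernel Gram matrix
print(is_psd(K_linear), is_psd(K_rbf))                            # True True
```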

Proof and Analysis

Derivation Outline

The derivation of the representer theorem proceeds by decomposing the minimizer in the reproducing kernel Hilbert space (RKHS) and leveraging properties of the loss and regularization terms in the objective functional. Consider the standard setup where the goal is to minimize an objective of the form J(f) = L(f(x_1), \dots, f(x_n)) + \lambda \|f\|_{\mathcal{H}}^2, with f \in \mathcal{H}, a loss L depending only on evaluations at the points x_i, regularization parameter \lambda > 0, and \mathcal{H} an RKHS with reproducing kernel K.

The first step involves representing f as the sum of a component in the span of the kernel sections centered at the training points and an orthogonal remainder: f = f_{\text{span}} + f_{\perp}, where f_{\text{span}} \in \operatorname{span}\{K(x_i, \cdot) \mid i=1,\dots,n\} and f_{\perp} is orthogonal to this finite-dimensional subspace in \mathcal{H}, meaning \langle f_{\perp}, K(x_i, \cdot) \rangle_{\mathcal{H}} = 0 for all i. This decomposition follows from the projection theorem in Hilbert spaces, which guarantees a unique orthogonal projection onto the closed subspace spanned by the K(x_i, \cdot).

Next, observe that the loss term L(f(x_1), \dots, f(x_n)) depends solely on f_{\text{span}}. By the reproducing property of the RKHS, the evaluation f(x_i) = \langle f, K(x_i, \cdot) \rangle_{\mathcal{H}} = \langle f_{\text{span}}, K(x_i, \cdot) \rangle_{\mathcal{H}} + \langle f_{\perp}, K(x_i, \cdot) \rangle_{\mathcal{H}} = \langle f_{\text{span}}, K(x_i, \cdot) \rangle_{\mathcal{H}}, since the orthogonality condition sets the second inner product to zero. Thus, replacing f with f_{\text{span}} leaves the loss unchanged while potentially reducing the regularization term. The norm then decomposes additively due to orthogonality (the Pythagorean theorem): \|f\|_{\mathcal{H}}^2 = \|f_{\text{span}}\|_{\mathcal{H}}^2 + \|f_{\perp}\|_{\mathcal{H}}^2. For \lambda > 0, minimizing J(f) requires setting f_{\perp} = 0, as any nonzero f_{\perp} strictly increases the norm without affecting the loss, thereby increasing the objective. This yields f = \sum_{i=1}^n \alpha_i K(x_i, \cdot) for some coefficients \alpha_i, reducing the infinite-dimensional problem to a finite-dimensional optimization over the \alpha_i.

A fuller derivation frames this within optimization theory, where the minimizer satisfies the representer identity derived from setting the (sub)gradient of J to zero, confirming that the optimality condition aligns with the span representation under convexity assumptions on L and the quadratic regularizer. Alternatively, the proof invokes the representer identity from variational analysis in Hilbert spaces, ensuring the solution lies in the relevant subspace. The analytical core relies on orthogonal projections, with the resulting finite-dimensional problem posed over the n \times n Gram matrix of kernel evaluations (O(n^2) entries), though solving it exactly may scale cubically in practice.
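The key decomposition step can be checked numerically in an explicit finite-dimensional feature space, where the span of the data-centered kernel sections corresponds to the row space of the feature matrix; the feature map, data, and dimensions below are illustrative assumptions chosen so the orthogonal complement is nontrivial.

```python
import numpy as np

rng = np.random.default_rng(3)

def phi(x):
    """An explicit finite-dimensional feature map (illustrative), with more
    features than training points so the orthogonal complement is nontrivial."""
    return np.column_stack([np.ones_like(x), x, x**2, np.sin(x), np.cos(x), x**3])

x_train = rng.uniform(-1, 1, size=4)     # n = 4 training inputs, d = 6 features
Phi = phi(x_train)                       # rows phi(x_i) span the "data subspace"

w = rng.normal(size=Phi.shape[1])        # an arbitrary candidate f, identified with w
P = Phi.T @ np.linalg.pinv(Phi @ Phi.T) @ Phi     # projector onto span{phi(x_i)}
w_span, w_perp = P @ w, w - P @ w

# The loss sees only the values f(x_i) = <w, phi(x_i)>, unchanged by dropping w_perp,
# while the squared norm decomposes additively (Pythagorean identity).
print(np.allclose(Phi @ w, Phi @ w_span))                      # True
print(np.linalg.norm(w_span) ** 2 + np.linalg.norm(w_perp) ** 2,
      np.linalg.norm(w) ** 2)                                  # equal up to round-off
```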

Geometric Interpretation

The reproducing kernel Hilbert space (RKHS) associated with a kernel k can be conceptualized as an infinite-dimensional space of functions, equipped with an inner product \langle f, g \rangle_{\mathcal{H}} = \sum_{i,j} \alpha_i \beta_j k(x_i, x_j) for functions in the span of kernel evaluations, allowing operations analogous to those in finite-dimensional spaces. In this setting, the training data points \{x_i\}_{i=1}^n define "anchors" through the feature maps k(\cdot, x_i), which serve as reproducing basis elements satisfying the reproducing property f(x_i) = \langle f, k(\cdot, x_i) \rangle_{\mathcal{H}}. These anchors embed the data's geometric structure into the function space, transforming the optimization problem from infinite dimensions to interactions among these points.

The optimal function f^* minimizing a regularized empirical risk functional resides exclusively in the finite-dimensional subspace S = \operatorname{span}\{k(\cdot, x_i) : i=1,\dots,n\} of the RKHS. Geometrically, this confinement arises from the Hilbert space projection theorem, which decomposes any function f \in \mathcal{H} as f = f_S + f_\perp where f_S \in S and f_\perp \in S^\perp is orthogonal to S; the orthogonal component f_\perp contributes nothing to the empirical loss (as it vanishes at the data points) but increases the regularization term \|f\|_{\mathcal{H}}^2, making f^* = f_S the minimizer. Thus, f^* is the orthogonal projection of potential solutions onto the data-induced subspace, ensuring the function fits the observations while minimizing complexity in the RKHS norm.

This projection-based view parallels principal component analysis, where data are projected onto a low-dimensional subspace capturing maximal variance; however, in the representer setting, the subspace is defined by the kernel anchors, and regularization biases the solution toward minimal norm within S, akin to a ridge-penalized fit that avoids overfitting by constraining deviation from the origin in function space. For intuition, consider a coordinate view of the RKHS in which the anchors k(\cdot, x_i) align along coordinate axes; f^* then emerges as a weighted combination \sum_i \alpha_i k(\cdot, x_i), with coefficients \alpha_i balancing data fit and norm minimization, resulting in a vector hugging close to the data points without excessive length.

A key insight from this geometry is that the theorem guarantees finite support for solutions even under universal kernels like the Gaussian radial basis function (RBF) kernel, which spans a dense subset of the continuous functions on compact domains and could in principle approximate arbitrary functions. Despite this universality, the projection onto the finite-dimensional subspace S confines f^* to depend solely on the n anchors, yielding a sparse representation that is computationally efficient and determined entirely by the observed data points. The orthogonality argument in the proof underscores this, as any component beyond S would inflate the norm without improving the fit.
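The projection picture can be made concrete: projecting an RKHS function built from kernels centered away from the data onto the data-spanned subspace S preserves its values at the training points while never increasing its RKHS norm. The kernel, centers, and coefficients below are illustrative assumptions.

```python
import numpy as np

def rbf(X, Z, gamma=0.5):
    return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

rng = np.random.default_rng(4)
x_data = rng.uniform(-2, 2, size=5)          # training inputs defining S = span{k(., x_i)}
z = rng.uniform(-2, 2, size=8)               # centers of an arbitrary function f outside S
beta = rng.normal(size=8)                    # f = sum_j beta_j k(., z_j)

K_xx, K_xz, K_zz = rbf(x_data, x_data), rbf(x_data, z), rbf(z, z)

# Orthogonal projection P_S f = sum_i a_i k(., x_i): requiring <f - P_S f, k(., x_i)> = 0
# for all i gives the linear system K_xx a = K_xz beta.
a = np.linalg.solve(K_xx, K_xz @ beta)

f_at_data = K_xz @ beta                      # f(x_i)
proj_at_data = K_xx @ a                      # (P_S f)(x_i): identical to f(x_i)
norm_f = beta @ K_zz @ beta                  # ||f||_H^2
norm_proj = a @ K_xx @ a                     # ||P_S f||_H^2, never larger than ||f||_H^2

print(np.allclose(f_at_data, proj_at_data), norm_proj <= norm_f + 1e-12)
```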

Extensions and Variants

Generalizations to Broader Settings

The representer theorem, originally formulated for scalar-valued functions in reproducing kernel Hilbert spaces (RKHS) with quadratic regularization and least-squares loss, has been extended to broader classes of problems. The classical version, due to Wahba, applied the theorem to semi-norm regularizers in Sobolev spaces for spline estimation. In this setting, the optimization involves minimizing an empirical risk plus a semi-norm penalty J(f) = \int_a^b (L f(t))^2 \, dt, where L is a linear differential operator, ensuring the solution remains a finite linear combination of kernel evaluations at the data points. This framework supports smoothing and interpolation in Hilbert spaces equipped with semi-norm penalties while preserving the finite-dimensional representability of the minimizer.

A significant broadening came with the generalization by Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola, which accommodates arbitrary loss functions and regularization terms, provided the loss depends only on point evaluations of the function. Specifically, for an arbitrary cost function c over training data (x_i, y_i) and a monotonically increasing regularization term g(\|f\|_{\mathcal{H}_k}) in an RKHS \mathcal{H}_k with kernel k, the minimizer f^* satisfies f^*(\cdot) = \sum_{i=1}^m \alpha_i k(x_i, \cdot) for some coefficients \alpha_i \in \mathbb{R}. This extension decouples the loss from the strict quadratic form, enabling applications to robust losses such as the hinge or logistic loss while maintaining computational tractability through kernel methods. When the loss is convex, the resulting problem is amenable to efficient solvers, such as subgradient descent.

Further generalizations address vector-valued outputs, where functions map from the input space \mathcal{X} to \mathbb{R}^m, using matrix-valued or operator-valued kernels. In a vector-valued RKHS, the kernel K: \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathbb{R}^m, \mathbb{R}^m) is a positive definite operator-valued function, and the representer theorem yields a solution of the form f^*(\cdot) = \sum_{i=1}^n K(\cdot, x_i) c_i, where c_i \in \mathbb{R}^m are coefficient vectors acted on by the operator-valued kernel. This structure supports multi-task learning and operator regression, with the finite span arising from evaluations at the finite training points x_i. Seminal developments in this area, including proofs of existence and uniqueness under mild boundedness conditions on the loss, were provided by Carlo A. Micchelli and Massimiliano Pontil.

Despite these advances, the representer theorem and its generalizations inherently rely on a finite number of data points for the finite-dimensional representation; with infinite training data, the span becomes infinite-dimensional, breaking the reduction to a tractable optimization over coefficients. This limitation underscores the theorem's applicability to empirical risk minimization in finite-sample settings, though approximations like Nyström methods can mitigate it in large-data regimes. Recent extensions generalize the theorem to reproducing kernel Banach spaces (RKBS), enabling sparse solutions for \ell_1-type regularization and ridge splines beyond Hilbert-space constraints. These Banach variants provide representer theorems for non-Hilbert norms, supporting efficient learning in high-dimensional or structured data settings.
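A minimal sketch of the vector-valued case uses a separable matrix-valued kernel K(x, x') = k(x, x')B, where B encodes task correlations; the particular B, the data, and the ridge-style objective are assumptions for illustration rather than the general operator-valued formulation.

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

rng = np.random.default_rng(5)
n, m = 20, 2                                       # n training inputs, m output tasks
X = rng.uniform(-2, 2, size=n)
Y = np.column_stack([np.sin(X), np.cos(X)]) + 0.05 * rng.normal(size=(n, m))

B = np.array([[1.0, 0.5],                          # assumed task-correlation matrix
              [0.5, 1.0]])
lam = 0.1

# Block Gram matrix with blocks k(x_i, x_j) * B, acting on outputs stacked point by point.
K_block = np.kron(rbf(X, X), B)
c = np.linalg.solve(K_block + lam * np.eye(n * m), Y.reshape(-1)).reshape(n, m)

def f(x_new):
    """Vector-valued representer form: f(x) = sum_i K(x_i, x) c_i = sum_i k(x_i, x) B c_i."""
    k_vec = rbf(np.atleast_1d(x_new), X)[0]
    return (k_vec[:, None] * (c @ B)).sum(axis=0)  # B symmetric, so B c_i = (c @ B)[i]

print(f(0.3))                                      # m-dimensional prediction at a test input
```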

Adaptations for Non-Standard Settings

In multi-task learning scenarios, the representer theorem is adapted to handle multiple outputs by employing vector-valued or matrix-valued kernels, which encode correlations between tasks. These kernels, often structured as block matrices, allow the solution function f: \mathcal{X} \to Y—where Y is a Hilbert space of output vectors—to be expressed as a finite linear combination of kernel evaluations at the training points, specifically f(x) = \sum_{j=1}^m K(x_j, x) c_j, with c_j \in Y capturing task-specific coefficients. This extension preserves the finite-dimensional representer property while facilitating shared representations across tasks, as demonstrated in foundational work on reproducing kernel Hilbert spaces for vector-valued functions.

For semi-supervised learning, adaptations incorporate unlabeled data by expanding the span of the representer to include kernel functions centered at both labeled and unlabeled points, enabling the exploitation of manifold structure or local invariances. The optimal function minimizing a regularized objective over labeled losses and constraints on unlabeled points (e.g., via bounded linear functionals such as local averages) takes the form f(\cdot) = \sum_{i=1}^l \alpha_i k(x_i, \cdot) + \sum_{i=l+1}^n \beta_i z_i(\cdot), where the first sum spans labeled kernels and the second enforces properties on unlabeled data through representers of the functionals. This maintains the theorem's efficiency for optimization in finite dimensions, particularly for Gaussian and similar kernels.

In non-convex settings, such as those arising in neural networks, approximate versions of the representer theorem address the limitations of exact representations by leveraging neural tangent kernels (NTKs). For finite-width networks, the empirical NTK is approximated via low-rank structures like the "sum of logits" pseudo-NTK, which reduces computational cost while converging to the full kernel's predictions at a rate of O(n^{-1/2}) in the width n, even under non-convex loss landscapes during training. This approximation facilitates understanding representation learning in deep architectures by linking them to kernel methods without assuming infinite width.

Constrained variants extend the theorem to incorporate inequality constraints, such as sparsity or other structural requirements, by formulating the problem as a saddle-point optimization in the RKHS and applying projected gradient methods. The solution remains in the span of kernel functions at the observed points, \hat{f}^*(\cdot) = \sum_{t=1}^T w_t \kappa(x_t, \cdot), with coefficients obtained via projected primal-dual updates that enforce constraints through low-dimensional projections (e.g., orthogonal projections). These methods achieve sublinear rates, like O(\sqrt{T}) for sub-optimality, while preserving the representer's finite structure.

Recent developments in the 2020s have adapted quantum kernels for variational quantum algorithms (VQAs), using them as surrogate models to optimize parameterized quantum circuits more efficiently on noisy intermediate-scale devices. By employing quantum kernels in Gaussian processes, these adaptations reduce the number of circuit evaluations by up to a factor of ten compared to gradient-based methods, leveraging kernel-induced feature maps to approximate the non-convex VQA landscape without explicit reliance on classical representers, though aligning with the same kernel-based principles.
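One way to see the semi-supervised adaptation at work is a manifold-style objective in which the expansion runs over both labeled and unlabeled inputs. The objective below (squared loss on labeled points plus RKHS-norm and graph-Laplacian penalties), the data, and all hyperparameters are assumptions for illustration, in the spirit of manifold regularization rather than the exact formulations cited above.

```python
import numpy as np

def rbf(X, Z, gamma=2.0):
    return np.exp(-gamma * (X[:, None] - Z[None, :]) ** 2)

rng = np.random.default_rng(8)
l, u = 5, 45                                  # few labeled, many unlabeled inputs
X = rng.uniform(-3, 3, size=l + u)
y_l = np.sin(X[:l])                           # labels only for the first l points

K = rbf(X, X)                                 # Gram matrix over labeled + unlabeled points
K_l = K[:l, :]                                # rows entering the labeled squared loss

W = rbf(X, X, gamma=5.0)                      # graph affinities over all points
L_graph = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian

lam, mu = 1e-2, 1e-2
# Minimize ||y_l - K_l a||^2 + lam * a^T K a + mu * (K a)^T L (K a) over the coefficients a
# of an expansion f = sum_{i=1}^{l+u} a_i k(x_i, .); setting the gradient to zero gives:
A = K_l.T @ K_l + lam * K + mu * K @ L_graph @ K
alpha = np.linalg.solve(A, K_l.T @ y_l)

f_all = K @ alpha                             # fitted values at every labeled/unlabeled input
```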

Applications in Machine Learning

Support Vector Machines

The representer theorem plays a central role in the formulation and solution of support vector machines (SVMs), enabling the expression of the optimal decision function in terms of the training data via kernel functions. In SVMs, the goal is to find a hyperplane that separates classes with maximum margin while minimizing classification errors, formulated as a constrained quadratic optimization problem. The dual form of this problem, derived using Lagrange multipliers, is to maximize \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) subject to 0 \leq \alpha_i \leq C for all i and \sum_{i=1}^n \alpha_i y_i = 0, where \alpha_i are the Lagrange multipliers, y_i \in \{-1, 1\} are the labels, C is the regularization parameter, and K is a kernel function. This optimization avoids explicit computation in high-dimensional feature spaces, leveraging the kernel trick to handle non-linear separability implicitly.

Applying the representer theorem to the SVM optimization—minimizing a regularized hinge loss over an RKHS—guarantees that the solution lies in the span of the kernel evaluations at the training points. Specifically, the decision function adopts the form f(\mathbf{x}) = \sum_{i=1}^n \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b, where b is the bias term determined by the Karush-Kuhn-Tucker conditions, and only the points with \alpha_i > 0 contribute meaningfully as support vectors. This finite-dimensional representation ensures that SVMs remain computationally tractable even for non-linear kernels, as the hypothesis class is restricted to expansions over the n training examples without needing to materialize the potentially infinite-dimensional feature map.

The efficiency of SVM training and prediction stems from the sparsity enforced by the representer theorem: only a subset of the \alpha_i are non-zero, corresponding to the support vectors that lie on or beyond the margin boundaries. While solving the dual over the n \times n kernel matrix can cost up to O(n^3), the resulting model uses only s \ll n support vectors for evaluation, reducing prediction time to O(s) per instance and enabling decomposition methods like sequential minimal optimization (SMO) for scalable training. For instance, in the linear SVM case, the kernel simplifies to K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\top \mathbf{x}_j, and the representer form collapses to the primal f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b with \mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i, allowing direct optimization in the input space. In contrast, non-linear extensions employ kernels like the Gaussian RBF kernel K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2), mapping to higher dimensions implicitly while preserving the sparse representer structure for effective modeling of complex boundaries.
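The sparse expansion can be checked with a standard SVM implementation: reconstructing the decision function from the stored dual coefficients and support vectors reproduces the library's output. This is a sketch on synthetic data using scikit-learn's documented attributes; the particular labels and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)       # XOR-like labels: not linearly separable

gamma, C = 0.5, 1.0
clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)

# Reconstruct f(x) = sum_{i in SV} alpha_i y_i K(x_i, x) + b from the fitted model:
# scikit-learn stores the products alpha_i y_i in dual_coef_ for the support vectors only.
X_test = rng.normal(size=(5, 2))
K_sv = rbf_kernel(clf.support_vectors_, X_test, gamma=gamma)
f_manual = (clf.dual_coef_ @ K_sv + clf.intercept_).ravel()

print(np.allclose(f_manual, clf.decision_function(X_test)))      # True
print(f"{clf.support_vectors_.shape[0]} support vectors out of {len(X)} training points")
```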

Gaussian Processes and Regression

Gaussian processes (GPs) provide a Bayesian nonparametric framework for regression, where a GP prior over functions is specified by a covariance (kernel) function k(\cdot, \cdot) with an associated reproducing kernel Hilbert space (RKHS). The posterior mean of the GP, which serves as the predictive function, takes the form of the finite kernel expansion guaranteed by the representer theorem for squared-loss minimization with RKHS-norm regularization. Specifically, given training data \{(x_i, y_i)\}_{i=1}^n with observation noise, the posterior mean \mu_*(x) at a test point x is expressed as a finite linear combination of kernel functions centered at the training inputs: \mu_*(x) = \sum_{i=1}^n \beta_i k(x_i, x), where the coefficients are \beta = [K(X,X) + \sigma^2 I]^{-1} y, and K(x,X) denotes the kernel row vector [k(x, x_1), \dots, k(x, x_n)]. This representation ensures that the infinite-dimensional function space is reduced to an n-dimensional subspace spanned by the data points, aligning with the representer theorem's finite-dimensional solution property.

The regularization parameter \lambda in the representer theorem formulation corresponds directly to the noise variance \sigma^2 in the GP model, which incorporates Gaussian observation noise y_i = f(x_i) + \epsilon_i with \epsilon_i \sim \mathcal{N}(0, \sigma^2). This equivalence highlights how the GP's probabilistic noise handling regularizes the RKHS norm to prevent overfitting, with the posterior mean minimizing the regularized empirical risk.

Exact GP inference requires inverting the n \times n matrix K(X,X) + \sigma^2 I, leading to O(n^3) computational cost, which limits scalability for large datasets. To address this, sparse GP approximations introduce m \ll n inducing points as pseudo-data points, reducing complexity to O(m^3 + nm^2) while approximating the posterior via variational methods; these inducing points effectively act as a low-rank basis in the representer expansion. Hyperparameters of the GP kernel, such as length scales or signal variance, are typically learned by maximizing the marginal likelihood of the observed data, which integrates out the latent function values and provides a principled way to balance fit and model complexity in the representer-based solution.
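The sketch below computes the GP posterior mean in representer form, with the noise variance playing the role of the regularization parameter, and adds the standard posterior variance for completeness; the kernel, data, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def rbf(X, Z, length_scale=1.0):
    return np.exp(-0.5 * ((X[:, None] - Z[None, :]) / length_scale) ** 2)

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=25)
y = np.sin(X) + 0.1 * rng.normal(size=25)
sigma2 = 0.1 ** 2                                   # noise variance; plays the role of lambda

K = rbf(X, X)
beta = np.linalg.solve(K + sigma2 * np.eye(len(X)), y)   # representer coefficients

X_star = np.linspace(-3, 3, 5)
mu_star = rbf(X_star, X) @ beta                     # posterior mean: sum_i beta_i k(x_i, x*)

# Posterior variance at the test points: k(x*, x*) - k(x*, X) (K + sigma^2 I)^{-1} k(X, x*).
V = np.linalg.solve(K + sigma2 * np.eye(len(X)), rbf(X, X_star))
var_star = np.ones_like(X_star) - np.einsum("ij,ij->j", rbf(X, X_star), V)

print(mu_star, var_star)
```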
