
Kernel method

Kernel methods are a class of pattern analysis algorithms in machine learning that implicitly map data from an input space to a high-dimensional feature space using kernel functions, enabling linear models to address nonlinear problems efficiently via the kernel trick, which computes inner products directly without explicit transformations. These methods rely on positive semi-definite kernel functions that satisfy Mercer's condition, ensuring they represent valid inner products in a reproducing kernel Hilbert space (RKHS), a mathematical structure that guarantees the existence of such a space and supports learning algorithms operating solely on kernel evaluations. The kernel trick, central to their efficiency, avoids the computational burden of high-dimensional mappings, making kernel methods scalable for complex datasets where explicit computation would be infeasible. Key to kernel methods is the choice of kernel function, which encodes domain-specific similarities; common examples include the linear kernel K(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top \mathbf{y} for basic linear separations, the polynomial kernel K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^\top \mathbf{y} + c)^d for capturing polynomial interactions, and the radial basis function (RBF) or Gaussian kernel K(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\|^2 / (2\sigma^2)) for modeling local similarities in continuous spaces. This flexibility allows kernel methods to generalize across tasks, with foundational applications in support vector machines (SVMs) for classification and regression, where they maximize margins in the feature space to achieve strong generalization performance. Beyond SVMs, kernel methods underpin techniques like kernel principal component analysis (PCA) for dimensionality reduction, Gaussian processes for probabilistic regression, and kernel ridge regression for regularized learning, all unified by their operation in an RKHS to enforce smoothness and control complexity through regularization. Historically, kernel methods gained prominence in the 1990s through the development of SVMs, building on earlier statistical ideas like regularization and statistical learning theory from the works of Vapnik and Chervonenkis, with comprehensive theoretical foundations established in subsequent reviews and texts. Despite their power, kernel methods can suffer from high computational costs for large datasets due to the quadratic scaling of kernel matrices, prompting ongoing research into approximations and scalable variants.

Introduction

Definition and Overview

Kernel methods constitute a class of algorithms employed in machine learning for pattern analysis tasks, such as classification, regression, and clustering, where kernel functions facilitate the handling of nonlinear structure by implicitly mapping inputs to high-dimensional feature spaces without requiring explicit coordinate computations. This approach leverages the structure of the data through pairwise similarities, allowing algorithms originally designed for linear problems to address nonlinear relationships effectively. At their core, kernel methods enable linear algorithms to operate in a nonlinear manner by replacing explicit feature mappings with kernel functions that compute inner products in the transformed space, thereby avoiding the computational expense of high-dimensional representations. For example, a dataset exhibiting intertwined classes that are not linearly separable in the input space can often be separated by a hyperplane after implicit mapping into a richer feature space defined by the kernel. Kernel methods bear a strong resemblance to instance-based learning paradigms, such as k-nearest neighbors, in that they emphasize local similarities between points quantified via kernel-induced metrics rather than global model parameters. This focus on similarity measures positions kernel methods as versatile tools for nonparametric modeling, particularly in scenarios where the underlying data structure is complex or unknown.

Historical Development

The origins of kernel methods trace back to the early twentieth century with foundational mathematical contributions, notably Mercer's theorem, which established conditions for representing symmetric positive-definite functions as inner products in a Hilbert space, though its application to machine learning emerged much later. A more direct precursor appeared in the 1960s, when Mark A. Aizerman, Evgeniy M. Braverman, and Lev I. Rozonoer introduced the kernel trick in 1964 as part of the potential function method for nonlinear pattern recognition. This approach generalized the linear perceptron by mapping data into a higher-dimensional space via potential functions, enabling the handling of nonlinear decision boundaries without explicit computation, and laid early groundwork for implicit high-dimensional representations in pattern recognition tasks. Kernel methods experienced limited attention during the 1970s and 1980s amid the dominance of neural networks and statistical approaches, but they saw a significant revival in the 1990s through integration with statistical learning theory and support vector machines (SVMs). Vladimir Vapnik and colleagues formulated the SVM framework in 1992, incorporating kernels to extend linear maximum-margin classifiers to nonlinear problems via the kernel trick. This was further refined in the seminal 1995 paper by Corinna Cortes and Vladimir Vapnik, which demonstrated the practical efficacy of kernel SVMs on real-world datasets, propelling widespread adoption in machine learning. The revival capitalized on Mercer's theorem to ensure valid kernel choices, transforming kernel methods from theoretical curiosities into robust tools for classification and regression. Post-2000, kernel methods evolved from their roots in statistical learning theory, pioneered by Vapnik and Alexey Chervonenkis, into broader paradigms, influencing techniques like kernel principal component analysis, Gaussian processes, and kernel ridge regression. This expansion was driven by computational advances and empirical successes, establishing kernels as a cornerstone for handling complex, high-dimensional data in fields such as bioinformatics. Recent developments, particularly from 2023 to 2025, have focused on multi-class extensions, quantum kernel methods for enhanced efficiency, and applications in materials science, including kernel regression for predicting molecular properties from chemical descriptors.

Theoretical Foundations

Reproducing Kernel Hilbert Spaces

A reproducing kernel Hilbert space (RKHS), denoted \mathcal{H}, is a complete inner product space of real-valued functions defined on a nonempty set X such that the point evaluation functional is continuous for every x \in X. This continuity implies the existence of a unique reproducing kernel k: X \times X \to \mathbb{R}, which is symmetric and positive semi-definite, satisfying k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}} for some feature map \phi: X \to \mathcal{H}. The kernel serves as the inner product in this function space, enabling the representation of functions without explicit coordinate systems. The defining feature of an RKHS is the reproducing property, which states that for every f \in \mathcal{H} and x \in X, f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}. This property ensures that the kernel section k(\cdot, x) acts as a representer for the evaluation functional at x, making point evaluations bounded linear operations on the space. As a Hilbert space, \mathcal{H} is complete with respect to the norm induced by the inner product \langle \cdot, \cdot \rangle_{\mathcal{H}}, and the reproducing kernel is unique for the given space. Functions in \mathcal{H} are thus elements whose evaluations can be recovered via inner products with kernel sections, providing a structured way to handle infinite-dimensional spaces in estimation and learning. The Moore-Aronszajn theorem establishes a bijective correspondence between positive definite kernels and RKHSs: for any symmetric positive definite kernel k on X, there exists a unique RKHS \mathcal{H}_k of functions on X whose reproducing kernel is k. This theorem, building on earlier work by E. H. Moore, guarantees that every such kernel induces a well-defined RKHS, with the space constructed as the completion of the span of \{k(\cdot, x) \mid x \in X\} under the inner product defined by the kernel. In kernel methods, the feature map \phi implicitly embeds the input space into the RKHS \mathcal{H}, allowing computations to proceed solely through kernel evaluations without constructing \phi explicitly. This mapping transforms nonlinear problems in the original space into linear ones in \mathcal{H}, where inner products correspond directly to kernel values, facilitating efficient algorithms in high- or infinite-dimensional settings.
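
To make the Moore-Aronszajn construction concrete, the following worked equations (a standard derivation, not tied to any single source) show how the inner product defined on finite combinations of kernel sections immediately yields the reproducing property.

```latex
% Inner product on the pre-Hilbert span of kernel sections, as used in the
% Moore-Aronszajn construction. For f = \sum_i a_i k(\cdot, x_i) and
% g = \sum_j b_j k(\cdot, y_j):
\langle f, g \rangle_{\mathcal{H}} = \sum_{i}\sum_{j} a_i b_j \, k(x_i, y_j).
% Taking g = k(\cdot, x) recovers the reproducing property:
\langle f, k(\cdot, x) \rangle_{\mathcal{H}} = \sum_i a_i \, k(x_i, x) = f(x).
```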

The Kernel Trick

The kernel trick is a fundamental computational technique in kernel methods that enables algorithms to operate implicitly in a high-dimensional feature space without explicitly computing the feature map. It involves substituting the inner product between mapped feature vectors, \langle \phi(x), \phi(y) \rangle, with a kernel function evaluation k(x, y), where \phi: X \to \mathcal{H} maps inputs from the original space X to a Hilbert space \mathcal{H}. This substitution is possible because many algorithms, including those for classification and regression, can be expressed solely in terms of such inner products. For algorithms relying on inner products, such as dual formulations of linear methods, the kernel trick allows implicit computation in potentially infinite-dimensional spaces by replacing every occurrence of \langle \phi(x_i), \phi(x_j) \rangle with k(x_i, x_j). This avoids the prohibitive cost of explicit mapping, as \phi may not even be computable directly. For instance, the squared Euclidean distance in the feature space, which appears in many distance-based computations, can be expressed as \|\phi(x) - \phi(y)\|^2 = k(x, x) + k(y, y) - 2k(x, y), enabling efficient evaluation without materializing the features. The technique was first introduced in the context of pattern recognition by Aizerman, Braverman, and Rozonoer in their work on potential functions. Mercer's theorem establishes the theoretical foundation for the kernel trick by specifying conditions under which a symmetric function k serves as a valid kernel representing an inner product. If k is continuous and positive semi-definite on a compact domain, it admits an expansion k(x, y) = \sum_{i=1}^\infty \lambda_i \phi_i(x) \phi_i(y), where \lambda_i \geq 0 are eigenvalues and \{\phi_i\} forms an orthonormal system of eigenfunctions, ensuring the expansion corresponds to a legitimate inner product. This theorem guarantees that kernel evaluations implicitly perform the mapping and dot product in the associated feature space. In practice, for a dataset \{x_1, \dots, x_n\}, the Gram matrix (or kernel matrix) K \in \mathbb{R}^{n \times n} with entries K_{ij} = k(x_i, x_j) captures all pairwise inner products in the feature space. This matrix is positive semi-definite due to the kernel's properties and serves as the core data structure for optimization in kernel-based algorithms, allowing solutions to be derived without ever constructing \phi(x_i).
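
The following minimal Python sketch illustrates the kernel trick numerically: it builds a Gram matrix for a Gaussian kernel on toy data and evaluates a feature-space distance purely from kernel values. The dataset, bandwidth, and helper names are illustrative assumptions, not part of any reference implementation.

```python
import numpy as np

# Minimal sketch: evaluate an RBF kernel on a toy dataset and use the kernel
# trick to compute squared feature-space distances without any explicit
# feature map. The bandwidth sigma is an arbitrary choice.

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian/RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 points in R^3

# Gram matrix K_ij = k(x_i, x_j): all pairwise inner products in feature space.
n = X.shape[0]
K = np.array([[rbf_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

# Squared distance in feature space via the kernel trick:
# ||phi(x_0) - phi(x_1)||^2 = k(x_0, x_0) + k(x_1, x_1) - 2 k(x_0, x_1)
dist2 = K[0, 0] + K[1, 1] - 2.0 * K[0, 1]
print(f"feature-space squared distance between x_0 and x_1: {dist2:.4f}")
```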

Kernel Functions

Properties of Kernel Functions

A kernel function k: \mathcal{X} \times \mathcal{X} \to \mathbb{R} must satisfy two fundamental properties to be valid for kernel methods in machine learning: symmetry and positive semi-definiteness. Symmetry requires that k(x, y) = k(y, x) for all x, y \in \mathcal{X}, ensuring the resulting Gram matrix is symmetric and facilitating the interpretation as an inner product in some feature space. Positive semi-definiteness (PSD) is the core requirement, stating that for any finite set of points \{x_1, \dots, x_n\} \subset \mathcal{X} and any coefficients c_1, \dots, c_n \in \mathbb{R}, \sum_{i=1}^n \sum_{j=1}^n c_i c_j k(x_i, x_j) \geq 0; for strictly positive definite kernels equality holds only when c_1 = \dots = c_n = 0, whereas semi-definiteness allows equality for nontrivial coefficients in degenerate cases. This condition guarantees that the Gram matrix K with entries K_{ij} = k(x_i, x_j) is positive semi-definite, which is essential for the existence of a corresponding reproducing kernel Hilbert space (RKHS) where kernel evaluations correspond to inner products. Mercer's theorem provides a spectral characterization for continuous kernels, linking PSD to explicit feature expansions. Specifically, for a continuous, symmetric PSD kernel k defined on a compact subset of \mathbb{R}^d \times \mathbb{R}^d, there exist nonnegative eigenvalues \lambda_m \searrow 0 and orthonormal functions \phi_m in L^2 such that k(x, y) = \sum_{m=1}^\infty \lambda_m \phi_m(x) \phi_m(y), with the series converging absolutely and uniformly on compact sets. This expansion justifies the kernel trick by representing the kernel as an infinite dot product in a high-dimensional feature space, and the condition ensures the integral operator induced by k is positive, which is crucial for theoretical analyses in functional analysis and learning theory. Continuity and boundedness further enhance the utility of kernel functions in practical algorithms. A continuous kernel on a compact domain is necessarily bounded, satisfying |k(x, y)| \leq M for some M > 0 and all x, y \in \mathcal{X}, which bounds the operator norm of the associated integral operator and promotes uniform convergence of expansions. These properties imply improved stability and generalization in kernel-based estimators; for instance, bounded kernels yield finite-variance estimators in regression tasks, leading to generalization bounds via algorithmic stability, where perturbations in the training set result in controlled changes in the learned function. Valid kernels can be constructed by combining existing ones while preserving PSD. If k_1 and k_2 are PSD kernels on the same input space, then for any a, b > 0, a k_1 + b k_2 is PSD, as the corresponding Gram matrices add positively; similarly, the pointwise product k_1(x, y) k_2(x, y) is PSD, corresponding to the tensor product of feature spaces. These operations enable flexible design of kernels tailored to data structure while maintaining theoretical guarantees.
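
As a numerical illustration (not a proof) of the closure properties above, the sketch below forms the Gram matrices of a weighted sum and a pointwise product of an RBF kernel and a linear kernel on random data, then checks that their smallest eigenvalues are nonnegative up to rounding; all data and parameter values are arbitrary.

```python
import numpy as np

# Illustration: Gram matrices of a sum and a pointwise (Hadamard) product of
# two PSD kernels remain positive semi-definite. Checked numerically on
# random data via the smallest eigenvalue.

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-sq_dists / 2.0)            # RBF kernel Gram matrix, sigma = 1
K_lin = X @ X.T                            # linear kernel Gram matrix

for name, K in [("sum", 0.5 * K_rbf + 2.0 * K_lin),
                ("product", K_rbf * K_lin)]:
    min_eig = np.linalg.eigvalsh(K).min()
    print(f"{name} kernel: smallest eigenvalue = {min_eig:.2e} (>= 0 up to rounding)")
```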

Common Kernel Functions

Common kernel functions map input data into higher-dimensional spaces to capture nonlinear relationships while satisfying Mercer's condition of positive semi-definiteness. These functions are selected based on the data's structure and the problem's complexity, enabling efficient computation via the kernel trick. The linear kernel, defined as k(\mathbf{x}, \mathbf{y}) = \mathbf{x} \cdot \mathbf{y}, computes the standard dot product and is suitable for linearly separable data where no nonlinear mapping is required. It serves as a baseline for high-dimensional or sparse datasets, avoiding the computational overhead of more complex kernels. The polynomial kernel extends the linear kernel to capture interactions of a specified degree, given by k(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + c)^d, where d is the polynomial degree and c is a constant term controlling the influence of higher-order terms. This kernel is applied when the data exhibits polynomial-like relationships, such as in image recognition tasks with geometric features. The Gaussian radial basis function (RBF) kernel measures local similarities through the formula k(\mathbf{x}, \mathbf{y}) = \exp\left( -\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2} \right), with \sigma as the bandwidth parameter that determines the kernel's sensitivity to distance. It is versatile for datasets with unknown or complex structures, effectively handling nonlinear boundaries by emphasizing nearby points. The sigmoid kernel, inspired by neural network activation functions, is expressed as k(\mathbf{x}, \mathbf{y}) = \tanh(\alpha \mathbf{x} \cdot \mathbf{y} + c), where \alpha scales the input and c is a constant shift. It was popular in early applications due to its similarity to multilayer perceptrons, though it requires careful parameter tuning because it is not positive semi-definite for all parameter choices. String kernels address sequential data, such as text or biological sequences, by comparing substrings rather than explicit alignments. The spectrum kernel, for instance, counts the occurrences of all substrings of length k (k-mers) and computes the inner product of the resulting count vectors, effectively measuring subsequence similarities. This approach is particularly useful in bioinformatics for protein classification, where it captures motif-based patterns without relying on alignments. Guidelines for selecting kernels depend on data characteristics: the linear kernel suits linearly separable or high-dimensional data; polynomial kernels are chosen when interactions follow a known degree; the RBF kernel is a common default for unknown nonlinearities due to its flexibility; sigmoid kernels apply to neural network-like problems; and string kernels are essential for non-vectorial sequence data. Cross-validation is recommended to tune parameters and validate choices empirically.
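
The sketch below (illustrative Python with arbitrary default parameters and a hypothetical spectrum_kernel helper) implements the kernels just described, including a simple k-mer counting version of the spectrum kernel.

```python
import numpy as np
from collections import Counter

# Illustrative implementations of the kernels described above. Parameter
# values (degree, c, sigma, alpha, k-mer length) are arbitrary defaults.

def linear_kernel(x, y):
    return float(np.dot(x, y))

def polynomial_kernel(x, y, degree=3, c=1.0):
    return float((np.dot(x, y) + c) ** degree)

def rbf_kernel(x, y, sigma=1.0):
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2)))

def sigmoid_kernel(x, y, alpha=0.01, c=0.0):
    # Not PSD for all (alpha, c); tune with care, as noted above.
    return float(np.tanh(alpha * np.dot(x, y) + c))

def spectrum_kernel(s, t, k=3):
    """Spectrum kernel: inner product of k-mer count vectors of two strings."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return float(sum(cs[m] * ct[m] for m in cs))

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
print(spectrum_kernel("GATTACA", "ATTACCA", k=3))   # 3 shared 3-mers: ATT, TTA, TAC
```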

Algorithms and Applications

Support Vector Machines

Support vector machines (SVMs) are supervised learning algorithms primarily used for classification tasks, where the goal is to find a hyperplane that separates data points of different classes while maximizing the margin, the distance between the hyperplane and the nearest data points from each class, known as support vectors. This maximization promotes better generalization by reducing sensitivity to noise and outliers in the training data. In the primal formulation, the optimization problem is expressed as minimizing \frac{1}{2} \| \mathbf{w} \|^2 subject to the constraints y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 for all training examples i = 1, \dots, n, where \mathbf{w} is the weight vector normal to the hyperplane, b is the bias term, \mathbf{x}_i are the input features, and y_i \in \{-1, 1\} are the class labels. To handle nonlinearly separable data, kernel methods are incorporated via the dual formulation, which is derived using the Lagrangian and solved through quadratic programming. The dual problem maximizes \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) subject to \sum_{i=1}^n \alpha_i y_i = 0 and 0 \leq \alpha_i \leq C for all i, where \alpha_i are the Lagrange multipliers and K(\cdot, \cdot) is a kernel function that computes inner products in a higher-dimensional feature space without explicitly mapping the data. The kernel trick enables this by replacing inner products with kernel evaluations, allowing SVMs to implicitly operate in high-dimensional spaces for nonlinear decision boundaries. The decision function for a new point \mathbf{x} is then \operatorname{sign} \left( \sum_{i: \alpha_i > 0} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right), relying only on support vectors where \alpha_i > 0, which ensures sparsity and computational efficiency. For real-world datasets with noise or overlapping classes, hard-margin SVMs are impractical, so soft-margin variants introduce slack variables \xi_i \geq 0 to allow some misclassifications. The primal optimization becomes \min \frac{1}{2} \| \mathbf{w} \|^2 + C \sum_{i=1}^n \xi_i subject to y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, where C > 0 is a regularization parameter controlling the trade-off between margin maximization and classification error. In the kernelized dual, the upper bound C on \alpha_i incorporates this softness, enabling robust performance on non-separable data. Kernel SVMs offer key advantages, including the ability to create complex, nonlinear decision boundaries through appropriate kernel choices, such as polynomial or RBF kernels, without increasing training complexity beyond O(n^2) or O(n^3) in the number of samples. The sparsity property means the model depends only on a subset of training points (support vectors), typically a small fraction of the data, which reduces storage and prediction time while maintaining predictive power. A classic illustration is the XOR problem, where linearly inseparable points in 2D, such as classes at (-1,-1), (1,1) versus (1,-1), (-1,1), can be separated using a polynomial kernel of degree 2, K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^2, which maps the data to a quadratic surface in higher dimensions.
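
The XOR example can be reproduced in a few lines with scikit-learn's SVC; the sketch below assumes scikit-learn's parameterization of the polynomial kernel, (\gamma \, \mathbf{x}_i \cdot \mathbf{x}_j + r)^d, so gamma=1 and coef0=1 recover the kernel quoted above, and a large C approximates a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of the XOR example above: four points that are not linearly
# separable in 2D become separable with a degree-2 polynomial kernel
# K(x_i, x_j) = (x_i . x_j + 1)^2 (gamma=1, coef0=1 in scikit-learn's
# parameterization). A large C approximates a hard margin.

X = np.array([[-1, -1], [1, 1], [1, -1], [-1, 1]], dtype=float)
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6)
clf.fit(X, y)

print(clf.predict(X))    # expected: [ 1  1 -1 -1], all four points separated
print(clf.support_)      # indices of the support vectors
```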

Kernel Methods in Other Algorithms

Kernel methods extend beyond classification tasks in support vector machines, enabling nonlinear extensions to a variety of dimensionality reduction, regression, and probabilistic modeling algorithms through the kernel trick. This versatility allows these algorithms to operate implicitly in high-dimensional feature spaces without explicit computation of feature mappings, facilitating the capture of complex data structures. Key adaptations include kernel principal component analysis for dimensionality reduction, kernel ridge regression for predictive modeling, Gaussian processes for probabilistic regression, and kernelized clustering techniques. Kernel principal component analysis (Kernel PCA) provides a nonlinear generalization of classical principal component analysis by performing eigen-decomposition in the feature space induced by a kernel function. Introduced by Schölkopf, Smola, and Müller in 1998, it computes the principal components as projections onto the eigenvectors of the kernel matrix, which approximates the covariance operator in the reproducing kernel Hilbert space. Specifically, for a centered kernel matrix K, the eigenvalue problem \lambda_i \alpha_i = K \alpha_i yields the coefficients \alpha_i for the i-th eigenvector, where the eigenvalues \lambda_i correspond to the variance explained in the feature space. This approach enables nonlinear dimensionality reduction, such as separating intertwined manifolds in datasets like concentric circles or Swiss rolls, outperforming linear PCA in capturing underlying nonlinear geometries. Kernel ridge regression adapts regularized least squares regression to nonlinear settings by solving the problem in the dual form using the kernel Gram matrix. Developed by Saunders, Gammerman, and Vovk in 1998, it minimizes the regularized loss \min_w \| y - \Phi w \|^2 + \lambda \| w \|^2, where \Phi is the matrix of mapped features, leading to the dual solution \alpha = (K + \lambda I)^{-1} y and predictions f(x) = k(x)^\top (K + \lambda I)^{-1} y. This formulation inverts the regularized kernel matrix to obtain coefficients, allowing the model to fit nonlinear relationships while controlling overfitting through the regularization parameter \lambda. In practice, it has been applied to tasks like financial forecasting, where it handles high-dimensional inputs with improved generalization compared to linear ridge regression. Gaussian processes offer a probabilistic framework for regression and classification in which the kernel function directly defines the covariance between function values at different inputs. As detailed in the seminal work by Rasmussen and Williams in 2006, a Gaussian process models the target function as a distribution over functions, with the covariance \text{Cov}(f(x), f(x')) = k(x, x'), enabling exact inference for regression and uncertainty estimates via the posterior. This kernel-based specification allows flexible modeling of smooth, periodic, or other structured functions without parametric assumptions, making it suitable for applications where quantifying prediction intervals is crucial. For instance, using a squared exponential kernel, the process can capture smooth variations while providing calibrated confidence bounds. In clustering, kernel k-means extends the standard k-means algorithm to partition data in the nonlinear feature space by using kernel-induced distances. A key formulation, proposed by Dhillon, Guan, and Kulis in 2004, minimizes the within-cluster sum of squared distances in the feature space, computed via the kernel as \|\phi(x_i) - \phi(x_j)\|^2 = k(x_i, x_i) + k(x_j, x_j) - 2 k(x_i, x_j), without explicit mapping. The algorithm iteratively assigns points to clusters based on distances to cluster centers represented in the span of mapped points, enabling the discovery of non-convex clusters that linear methods cannot separate. This has proven effective in text clustering tasks with sparse, high-dimensional data, where kernels reveal semantic groupings. These kernel adaptations demonstrate the broad applicability of kernel methods in handling nonlinearity across dimensionality reduction, regression, clustering, and probabilistic tasks, often leading to superior performance on complex datasets compared to their linear counterparts. By leveraging positive definite kernels like the RBF, they enable algorithms to implicitly operate in infinite-dimensional spaces, enhancing expressiveness while maintaining computational tractability through Gram matrix operations.
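
As a concrete instance of these adaptations, the following sketch implements kernel ridge regression directly from the closed-form dual solution given above, using an RBF kernel; the toy data, bandwidth, and regularization value are illustrative choices rather than a reference implementation.

```python
import numpy as np

# Minimal kernel ridge regression sketch following the closed form above:
#   alpha = (K + lambda I)^{-1} y,   f(x) = k(x)^T alpha,
# with an RBF kernel; data and hyperparameters are illustrative.

def rbf_gram(A, B, sigma=0.5):
    d2 = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)        # noisy nonlinear target

lam = 1e-2
K = rbf_gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual coefficients

X_test = np.linspace(-3, 3, 5)[:, None]
y_pred = rbf_gram(X_test, X) @ alpha                   # f(x) = sum_i alpha_i k(x, x_i)
print(np.round(y_pred, 3))
```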

Modern Developments

Integration with Neural Networks

Kernel methods have found significant integration with neural networks through the neural tangent kernel (NTK), which reveals that infinitely wide neural networks trained via gradient descent behave analogously to kernel methods. In this regime, the network's evolution during training can be approximated by a fixed kernel that captures the inner product of gradients with respect to the parameters, enabling analytical insights into convergence and generalization. The NTK is defined as \Theta(x, y) = \mathbb{E}\left[ \nabla_\theta f(x; \theta) \cdot \nabla_\theta f(y; \theta) \right], where f denotes the network output and the expectation is taken over random initializations of the parameters \theta. This equivalence highlights how wide networks implicitly perform kernel-like computations, bridging the gap between parametric neural architectures and non-parametric kernel approaches. Deep kernels extend this integration by composing neural networks with kernel functions to enable hierarchical feature learning within Gaussian processes or other kernel-based models. Here, a neural network transforms input data into a feature space where a base kernel (e.g., RBF) is then applied, allowing the model to capture complex, multi-level representations that traditional fixed kernels cannot. This hybrid structure leverages the expressive power of deep architectures for feature extraction while retaining the probabilistic benefits of kernels for uncertainty quantification. Such compositions have been shown to outperform standalone Gaussian processes or shallow kernels on tasks requiring nuanced representations. To address scalability, random feature approximations linearize kernel computations by mapping inputs to a finite-dimensional space via Monte Carlo sampling of Fourier features, making kernel methods compatible with large-scale neural network training pipelines. This technique approximates the kernel matrix explicitly, reducing the quadratic or cubic complexity of exact kernel evaluations to linear time, thus enabling efficient integration with neural architectures for high-dimensional data. For instance, random Fourier features provide unbiased estimates of shift-invariant kernels like the RBF, facilitating faster optimization in hybrid settings. Despite these advances, integrating kernel methods with neural networks faces computational challenges, particularly the O(n^2) or O(n^3) storage and time requirements for kernel matrices on large datasets, which hinder scalability compared to the linear-time forward passes of neural networks. Approximations like random features mitigate this by trading off some accuracy for scalability, allowing hybrid models to handle millions of samples without full matrix inversion. Recent developments from 2023 to 2025 have focused on hybrid models that enhance generalization in vision tasks, such as using NTK-guided training for vision transformers, achieving improved accuracy on image classification benchmarks by stabilizing training dynamics in overparameterized regimes. These hybrids have been reported to yield gains of roughly 2-5% in accuracy over pure neural baselines on benchmark image datasets.
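
A minimal sketch of the random Fourier feature idea is shown below: shift-invariant RBF kernel values are approximated by inner products of explicit randomized cosine features, in the style of the Rahimi-Recht construction; dimensions and the bandwidth are arbitrary assumptions.

```python
import numpy as np

# Random Fourier features sketch (Rahimi-Recht style approximation of the RBF
# kernel): z(x) = sqrt(2/D) * cos(W x + b) with W ~ N(0, I / sigma^2) and
# b ~ Uniform[0, 2*pi], so that z(x) . z(y) ~= exp(-||x - y||^2 / (2 sigma^2)).

rng = np.random.default_rng(0)
d, D, sigma = 5, 2000, 1.0                         # input dim, feature dim, bandwidth

W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def features(X):
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

X = rng.normal(size=(3, d))
Z = features(X)

approx = Z @ Z.T                                   # plain inner products of features
sq_dists = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
exact = np.exp(-sq_dists / (2 * sigma ** 2))       # exact RBF Gram matrix
print(np.max(np.abs(approx - exact)))              # small approximation error
```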

Recent Applications and Advances

In materials science, kernel regression has become a key technique for predicting material properties from chemical descriptors, particularly in data-scarce environments. A 2025 review details how kernel-based methods, such as kernel ridge regression, enable accurate forecasting of molecular and material properties like electronic band gaps and elastic moduli by mapping descriptors into high-dimensional reproducing kernel Hilbert spaces, often outperforming linear models on benchmark datasets. In bioinformatics, kernels facilitate the analysis of molecular structures by quantifying similarities in graph representations of compounds. For example, efficient 3D graph kernels, introduced in 2025, capture molecular geometries for tasks like property prediction on datasets from the MoleculeNet benchmark. Complementing this, multi-class support vector machines with resampling strategies address class imbalance in bioinformatics applications, such as protein subcellular localization; a 2023 approach using synthetic samples generated via data augmentation improves F1-scores on imbalanced multilabel datasets without altering the kernel functions. Scalability remains a challenge for kernel methods on large datasets, but approximate techniques like the Nyström method provide efficient low-rank approximations of kernel matrices. A 2025 development applies Nyström approximation to kernel logistic regression, reducing computational complexity from O(n^3) to O(n^2) for datasets exceeding 1 million samples while maintaining over 95% of full-kernel accuracy in binary classification tasks. This approach, tested with leverage-score sampling for landmark selection, has been extended to clustering and regression, enabling the deployment of kernel methods in big data scenarios. Kernel methods have also found novel use in atmospheric modeling for spatiotemporal analysis, where non-parametric approaches handle irregular grids and temporal dependencies. A 2024 study employs kernel density estimation on a decade of atmospheric temperature and geopotential height data across seven pressure levels, revealing spatiotemporal patterns in temperature extremes. Ethical considerations in kernel methods increasingly focus on bias mitigation within fairness-aware models, particularly for support vector machines. Post-training techniques, such as distribution-based adjustments to kernel outputs, have been shown in 2025 analyses to reduce demographic parity violations by up to 25% in classification tasks while preserving overall accuracy above 90%, addressing disparities in sensitive attribute predictions. Future directions emphasize quantum kernels to boost expressivity beyond classical limits. In 2025 experiments, entanglement-enhanced quantum kernels in photonic systems demonstrated superior performance in support vector classification of respiratory datasets, achieving 5-10% higher accuracies than classical kernels due to richer feature mappings in Hilbert spaces. These advances suggest quantum kernels could handle exponentially complex data structures, paving the way for quantum-classical pipelines.
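
To illustrate the Nyström idea referenced above, the sketch below approximates a full RBF Gram matrix from a small set of uniformly sampled landmark points; this is a generic low-rank construction with illustrative sizes, not the specific kernel logistic regression pipeline of the cited work (which additionally uses leverage-score sampling).

```python
import numpy as np

# Nystrom approximation sketch: approximate the full n x n RBF Gram matrix from
# m << n landmark points as K ~= K_nm K_mm^+ K_mn, avoiding the full kernel
# computation. Uniform landmark sampling is used here for simplicity.

rng = np.random.default_rng(0)
n, m, sigma = 1000, 50, 1.0
X = rng.normal(size=(n, 3))

def rbf(A, B):
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma ** 2))

idx = rng.choice(n, size=m, replace=False)      # landmark indices
K_nm = rbf(X, X[idx])                           # n x m block
K_mm = rbf(X[idx], X[idx])                      # m x m landmark Gram matrix
K_approx = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T

K_exact = rbf(X, X)                             # computed only to check the error
print(np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact))
```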

References

  1. Kernel Methods. CS229 lecture notes (PDF), October 2019.
  2. Kernel methods in machine learning. arXiv review.
  3. Kernel methods: an overview. Berkeley EECS lecture notes (PDF).
  4. Learning with Kernels. MIT Press.
  5. Mercer, J. XVI. Functions of positive and negative type, and their connection with the theory of integral equations.
  6. Aizerman, M. A., Braverman, E. M., and Rozonoer, L. I. Theoretical foundations of the potential function method in pattern recognition learning.
  7. Large Margin Classification Using the Perceptron Algorithm (cites Aizerman, Braverman, and Rozonoer, 1964).
  8. Cortes, C., and Vapnik, V. Support-vector networks. Machine Learning 20, 273–297 (1995). https://doi.org/10.1007/BF00994018.
  9. Kernel regression methods for prediction of materials properties (review, February 2025).
  10. Aronszajn, N. Theory of Reproducing Kernels.
  11. Schölkopf, B., and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond.
  12. The Kernel Trick. Berkeley EECS lecture notes (PDF).
  13. Foundations of Machine Learning. NYU Computer Science lecture notes (PDF).
  14. Stability and Generalization. Journal of Machine Learning Research.
  15. Support Vector Machines. scikit-learn user guide, section 1.4.
  16. A Study on Sigmoid Kernels for SVM and the Training of non-PSD ... (PDF).
  17. The Spectrum Kernel: A String Kernel for SVM Protein Classification (PDF).
  18. Nonlinear forecasting with many predictors using kernel ridge ...
  19. Jacot, A., Gabriel, F., and Hongler, C. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. arXiv, June 2018.
  20. Deep Kernel Learning (PDF).
  21. Rahimi, A., and Recht, B. Random Features for Large-Scale Kernel Machines. NIPS.
  22. Efficient 3D kernels for molecular property prediction. Bioinformatics, July 2025.
  23. Imbalanced classification for protein subcellular localization with ...
  24. Scalable kernel logistic regression with Nyström approximation, February 2025.
  25. arXiv:2505.08146 [cs.DS], May 2025 (approximation of non-linear kernels using random feature maps).
  26. The Kernel Density Estimation Technique for Spatio-Temporal ..., November 2024.
  27. Explainable post-training bias mitigation with distribution-based ..., October 2025.
  28. Entanglement-enabled quantum kernels for enhanced feature ..., February 2025.
  29. Experimental quantum-enhanced kernel-based machine learning ..., June 2025.