Fact-checked by Grok 2 weeks ago

Sequential minimal optimization

Sequential minimal optimization (SMO) is an algorithm designed for efficiently training support vector machines (SVMs) by solving the associated (QP) problem through decomposition into the smallest possible sub-problems, specifically optimizing pairs of Lagrange multipliers analytically at each step. Developed by John C. Platt at and published in 1998, SMO was created to overcome the computational limitations of earlier SVM training methods, such as chunking algorithms, by avoiding the need for numerical QP solvers and ensuring linear equality constraints are maintained throughout the process. The algorithm operates iteratively: it selects two Lagrange multipliers to optimize while heuristically choosing working sets to accelerate , resulting in memory usage that is linear in the size of the set and times that typically scale between linear and with the number of examples (approximately N^{1.6} to N^{2.2}), often outperforming alternatives by factors of up to 1200 on sparse real-world datasets. Due to its simplicity, speed, and effectiveness—evidenced by over 4,000 citations of the original paper—SMO has become a foundational technique in SVM implementation, notably forming the basis of the decomposition method in the widely used LIBSVM library, which adapts SMO principles for various SVM formulations including classification, regression, and one-class problems.

Background

Support Vector Machines

Support vector machines (SVMs) are algorithms primarily used for tasks, where the goal is to find an optimal that separates points of two classes while maximizing the margin between them. This maximum-margin is defined by a weight vector \mathbf{w} and b such that the distance from the to the nearest point (the margin) is maximized, leading to better on unseen . The formulation assumes linearly separable in the hard-margin case, where all training examples \{(\mathbf{x}_i, y_i)\}_{i=1}^n with labels y_i \in \{-1, +1\} satisfy y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1. In practice, data is often not perfectly separable, so the soft-margin SVM introduces slack variables \xi_i \geq 0 to allow some misclassifications or points within the margin, controlled by a regularization C > 0. The hard-margin approach corresponds to C = \infty, enforcing strict separation, while finite C trades off margin maximization against classification errors, enabling handling of noisy or overlapping datasets. This is formulated as the primal : \begin{align*} \min_{\mathbf{w}, b, \xi} \quad & \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^n \xi_i \\ \text{subject to} \quad & y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \forall i \\ & \xi_i \geq 0, \quad \forall i. \end{align*} To solve this , the is converted to its form using Lagrange multipliers \alpha_i \geq 0 and \beta_i \geq 0, forming the and applying the Karush-Kuhn-Tucker (KKT) conditions. After eliminating the primal variables \mathbf{w} and b, the problem emerges as a that maximizes \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) subject to $0 \leq \alpha_i \leq C and \sum_{i=1}^n \alpha_i y_i = 0, highlighting the structure central to SVM training. In the dual formulation, support vectors are the training points where \alpha_i > 0, as these are the only ones contributing to the ; points with \alpha_i = 0 lie outside the margin and do not influence \mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i. The classification decision function is then f(\mathbf{x}) = \sum_{i: \alpha_i > 0} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b, where K(\cdot, \cdot) is a kernel function (e.g., linear or radial basis) that implicitly maps to higher dimensions for non-linear separation via the kernel trick. The bias b is typically computed using any support vector as b = y_k - \sum_{i: \alpha_i > 0} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_k).

Quadratic Programming Formulation

The soft-margin support vector machine (SVM) formulation seeks to minimize the primal objective function \frac{1}{2} \| \mathbf{w} \|^2 + C \sum_{i=1}^n \xi_i, subject to the constraints y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i and \xi_i \geq 0 for i = 1, \dots, n, where \mathbf{w} is the weight vector, b is the , C > 0 is the regularization parameter, \mathbf{x}_i are the input features, y_i \in \{-1, 1\} are the labels, and \xi_i are the slack variables allowing for misclassifications. To derive the dual problem, introduce the Lagrange multipliers \alpha_i \geq 0 for the inequality constraints y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i and \mu_i \geq 0 for \xi_i \geq 0. The is then L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\mu}) = \frac{1}{2} \| \mathbf{w} \|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \right] - \sum_{i=1}^n \mu_i \xi_i. Setting the partial derivatives with respect to the primal variables to zero yields the stationarity conditions: \mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i, \sum_{i=1}^n \alpha_i y_i = 0, and C - \alpha_i - \mu_i = 0 for each i. Substituting these back into the eliminates \mathbf{w}, b, \boldsymbol{\xi}, and \boldsymbol{\mu}, resulting in the optimization problem of maximizing \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) subject to \sum_{i=1}^n \alpha_i y_i = 0 and $0 \leq \alpha_i \leq C for all i. This dual is a program () with a objective in the variables \alpha_i. The of the objective is \mathbf{Q} where Q_{ij} = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) and K(\cdot, \cdot) is the (initially the linear kernel K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j). The QP is because \mathbf{Q} is positive semi-definite, as kernels are defined to ensure this property, guaranteeing a unique global maximum. The constraints consist of box bounds $0 \leq \alpha_i \leq C and a single linear equality \sum \alpha_i y_i = 0. The kernel trick extends this formulation to non-linear SVMs by replacing the dot product with a kernel K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j), where \phi maps inputs to a high-dimensional space, without explicitly computing \phi or storing the feature map, thus enabling efficient computation of the decision \operatorname{sign}\left( \sum \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right). Solving this QP directly poses significant computational challenges for large n, the number of training examples. General QP solvers, such as interior-point or active-set methods, typically require O(n^2) to O(n^3) time due to the need to form, store, and decompose or invert the dense n \times n \mathbf{Q}, leading to prohibitive memory usage (up to O(n^2) space) and runtime for datasets with n > 10^4. For instance, active-set methods iteratively solve reduced subproblems but struggle with the dense Hessian in high dimensions, often failing to beyond moderate-sized problems due to caching inefficiencies and numerical in operations.

Algorithm Description

Core Principles

Sequential minimal optimization (SMO) addresses the scalability challenges of training support vector machines (SVMs) by decomposing the underlying () problem into a series of smaller, analytically solvable subproblems involving only two Lagrange multipliers at a time. This approach avoids the need for general-purpose QP solvers, which are computationally intensive for large datasets, by iteratively updating pairs of multipliers \alpha_i and \alpha_j while exploiting the closed-form solution for two-variable QPs. The algorithm was introduced by John C. Platt in 1998 to enable efficient SVM training on datasets with thousands of examples, overcoming limitations in prior methods like chunking that still required significant computational resources. Central to SMO's foundation are the Karush-Kuhn-Tucker (KKT) conditions, which define optimality in the dual formulation of the SVM . These conditions ensure that the solution satisfies stationarity (the gradient of the is zero), primal and dual feasibility (constraints like $0 \leq \alpha_i \leq C and \sum \alpha_i y_i = 0 hold), and complementarity. Specifically, for each training example i, the complementarity slackness conditions are: \alpha_i = 0 \iff y_i u_i \geq 1, $0 < \alpha_i < C \iff y_i u_i = 1, and \alpha_i = C \iff y_i u_i \leq 1, where u_i is the SVM output for example i, y_i is the label, and C is the regularization parameter. SMO selects pairs of multipliers that violate these KKT conditions to drive the solution toward optimality. SMO guarantees progress toward the global optimum due to the convexity of the QP objective function; each update of a two-variable subproblem maintains feasibility and strictly decreases the objective value, as ensured by , which states that optimizing any subset containing at least one KKT violator reduces the objective. Compared to full QP solvers, SMO achieves linear time complexity per iteration in the number of training examples n, requiring no additional storage beyond the kernel matrix, and demonstrates empirical speedups of 10 to 100 times or more on real-world datasets.

Step-by-Step Procedure

The Sequential Minimal Optimization (SMO) algorithm begins by initializing all Lagrange multipliers \alpha_i = 0 for i = 1, \dots, n, where n is the number of training examples, and the bias b = 0. The initial prediction errors are then computed as E_k = -y_k for all k. This setup ensures the starting point satisfies the non-negativity constraints \alpha_i \geq 0 and prepares for iterative optimization of the dual quadratic programming problem subject to \sum_{i=1}^n \alpha_i y_i = 0. The core iterative process forms the main loop of SMO, which continues until the Karush-Kuhn-Tucker (KKT) conditions are satisfied within a tolerance \epsilon (typically \epsilon = 0.001) for all training examples. In each iteration, two indices i and j are selected to form a two-variable subproblem, which is solved analytically to update \alpha_i and \alpha_j. Following the update, the bias b is recomputed, and the algorithm checks for convergence by evaluating prediction errors E_k = f(\mathbf{x}_k) - y_k for all k, where f(\mathbf{x}_k) = \sum_{i=1}^n \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}_k) + b. The process typically converges in fewer than O(n^2) iterations for many datasets, though worst-case complexity remains quadratic. Solving the two-variable subproblem analytically involves deriving new values for \alpha_i and \alpha_j while respecting the box constraints $0 \leq \alpha_i \leq C (where C is the regularization parameter) and the equality constraint \sum \alpha_i y_i = 0. First, compute the second derivative term \eta = K(\mathbf{x}_i, \mathbf{x}_i) + K(\mathbf{x}_j, \mathbf{x}_j) - 2 K(\mathbf{x}_i, \mathbf{x}_j), and if \eta \neq 0, the unconstrained update for \alpha_j is \alpha_j^{\text{new}} = \alpha_j + \frac{y_j (E_i - E_j)}{\eta}, with \alpha_i^{\text{new}} = \alpha_i + y_i y_j (\alpha_j - \alpha_j^{\text{new}}). These values are then clipped to their feasible bounds [L, H], where the bounds (assuming j is primarily updated) are: if y_i \neq y_j, L = \max(0, \alpha_j - \alpha_i), H = \min(C, C + \alpha_j - \alpha_i); if y_i = y_j, L = \max(0, \alpha_i + \alpha_j - C), H = \min(C, \alpha_i + \alpha_j), ensuring feasibility after the update. Clip \alpha_j^{\text{new}} to [L, H], then compute \alpha_i^{\text{new}}. After updating \alpha_i and \alpha_j, the bias b is adjusted to maintain consistency with the KKT conditions. Compute b_1 = b - E_i - y_i (\alpha_i^{\text{new}} - \alpha_i) K(\mathbf{x}_i, \mathbf{x}_i) - y_j (\alpha_j^{\text{new}} - \alpha_j) K(\mathbf{x}_i, \mathbf{x}_j) and b_2 = b - E_j - y_i (\alpha_i^{\text{new}} - \alpha_i) K(\mathbf{x}_j, \mathbf{x}_i) - y_j (\alpha_j^{\text{new}} - \alpha_j) K(\mathbf{x}_j, \mathbf{x}_j). If $0 < \alpha_i^{\text{new}} < C, set b = b_1; if $0 < \alpha_j^{\text{new}} < C, set b = b_2; if both are interior, set b = (b_1 + b_2)/2; if neither (both on bounds), retain the previous b or use an appropriate bound-based adjustment. Convergence is assessed by verifying that for every example k, the KKT conditions hold approximately: \alpha_k ( \alpha_k - C ) = 0, y_k f(\mathbf{x}_k) \geq 1 - \epsilon if \alpha_k = 0, y_k f(\mathbf{x}_k) \leq 1 + \epsilon if \alpha_k = C, and |y_k f(\mathbf{x}_k) - 1| < \epsilon if $0 < \alpha_k < C. The algorithm terminates when no E_k violates these within \epsilon, ensuring the solution is \epsilon-optimal for the dual problem. The high-level pseudocode for SMO can be outlined as follows:
Initialize α = [0, ..., 0], b = 0
Compute initial errors E_k = -y_k for all k
While there exists a k with |E_k| > ε or KKT violated:
    Select indices i and j
    Compute L and H based on y_i, y_j, α_i, α_j, C:
        if y_i ≠ y_j:
            L = max(0, α_j - α_i)
            H = min(C, C + α_j - α_i)
        else:
            L = max(0, α_i + α_j - C)
            H = min(C, α_i + α_j)
    Compute η = K(x_i, x_i) + K(x_j, x_j) - 2 K(x_i, x_j)
    If η ≠ 0:
        α_j_new = α_j + y_j (E_i - E_j) / η
        Clip α_j_new to [L, H]
        α_i_new = α_i + y_i y_j (α_j - α_j_new)
    Else:
        # Handle degenerate case, e.g., no update
    Update α_i and α_j
    Update errors E_i and E_j (and others if needed)
    Compute b_1 = b - E_i - y_i (α_i_new - α_i) K(x_i, x_i) - y_j (α_j_new - α_j) K(x_i, x_j)
    Compute b_2 = b - E_j - y_i (α_i_new - α_i) K(x_j, x_i) - y_j (α_j_new - α_j) K(x_j, x_j)
    If 0 < α_i_new < C:
        b = b_1
    Elif 0 < α_j_new < C:
        b = b_2
    Else:
        b = (b_1 + b_2) / 2  # or retain if both bound
Return α, b
This structure captures the essential mechanics without delving into selection heuristics.

Implementation Aspects

Heuristic Selection of Variables

In Sequential Minimal Optimization (SMO), the selection of the two Lagrange multipliers to update in each iteration is guided by heuristics designed to maximize progress toward satisfying the Karush-Kuhn-Tucker (KKT) conditions while minimizing computational overhead. The first variable, indexed by i, is chosen as the training example with the maximum KKT violation, quantified by |E_i|, where E_i = f(\mathbf{x}_i) - y_i represents the difference between the current decision function value and the target label for example i. This selection prioritizes non-bound examples, where the multipliers \alpha_i satisfy $0 < \alpha_i < C (with C as the regularization parameter), as these are most likely to yield significant improvements; bounded examples (\alpha_i = 0 or \alpha_i = C) are scanned less frequently once non-bound violations are resolved within a small tolerance \epsilon, such as $10^{-3}. The second variable, indexed by j, is selected to ensure a substantial step size in the optimization. Among non-bound examples, j is picked to maximize |E_i - E_j|, which promotes a positive kernel-derived parameter \eta = 2K(\mathbf{x}_i, \mathbf{x}_j) - K(\mathbf{x}_i, \mathbf{x}_i) - K(\mathbf{x}_j, \mathbf{x}_j) and thus a large update magnitude. If no suitable non-bound j is found (e.g., all errors are sufficiently small), the algorithm falls back to selecting the j with the maximum |E_j|. For cases where non-bound options are exhausted, bounded examples are considered using a criterion that minimizes |E_i - E_j| to target potential boundary adjustments. These choices distinguish between bounded and non-bound cases to balance exploration of the feasible region with efficient violation correction. To enable rapid selection without recomputing errors for all examples, SMO maintains a cache of E values specifically for non-bound examples, updating only those affected by changes in support vectors (where \alpha_k > 0) after each iteration; this avoids the O(n) cost of full recomputation, where n is the number of training examples. Empirical tuning further enhances efficiency: if the cache misses or no progress is made in selecting j, the algorithm scans all non-bound examples first, then the full set, starting from a random index to mitigate from sequential ordering. These heuristics promote fast by greedily reducing the largest KKT violations at each step, ensuring that the objective function decreases monotonically as per Osuna's on pairwise updates in . In practice, this approach yields solutions in time comparable to simpler methods, with empirical tests on datasets like the UCI set demonstrating orders-of-magnitude speedups over general solvers.

Handling Constraints and Updates

In the Sequential Minimal Optimization (SMO) algorithm, constraints are enforced during the solution of the two-variable (QP) subproblem to ensure that the updated Lagrange multipliers \alpha_i and \alpha_j remain within the bounds $0 \leq \alpha \leq C and satisfy the equality constraint \sum \alpha_k y_k = 0. The bounds L and H for the update of \alpha_j (while fixing \alpha_i) are computed based on the labels y_i and y_j: if y_i = y_j, then L = \max(0, \alpha_i + \alpha_j - C) and H = \min(C, \alpha_i + \alpha_j); if y_i \neq y_j, then L = \max(0, \alpha_j - \alpha_i) and H = \min(C, C + \alpha_j - \alpha_i). These bounds derive from the requirement that both updated alphas must stay within [0, C] after the paired adjustment, preventing infeasible solutions without explicit projection onto the . The unconstrained solution for \alpha_j is first computed analytically as \alpha_{j,\text{new}} = \alpha_j + y_j (E_i - E_j) / \eta, where E_i and E_j are the errors on the predictions for examples i and j, and \eta = 2 K(x_i, x_j) - K(x_i, x_i) - K(x_j, x_j) is the curvature (with K denoting the kernel function). This value is then clipped to the interval [L, H] to enforce the box constraints: \alpha_{j,\text{new,clipped}} = \max(L, \min(H, \alpha_{j,\text{new}})). Subsequently, \alpha_i is updated exactly as \alpha_{i,\text{new}} = \alpha_i + y_i y_j (\alpha_j - \alpha_{j,\text{new,clipped}}) to preserve the equality constraint, which is automatically satisfied by this paired adjustment without additional computation. For numerical stability, if |\eta| < \epsilon (typically \epsilon = 10^{-12}), the update is skipped to avoid division by near-zero values that could amplify rounding errors. After the alpha updates, the model parameters are revised efficiently. In the linear SVM case, the weight vector is updated incrementally as \mathbf{w}_{\text{new}} = \mathbf{w} + y_i (\alpha_{i,\text{new}} - \alpha_i) \mathbf{x}_i + y_j (\alpha_{j,\text{new,clipped}} - \alpha_j) \mathbf{x}_j, avoiding full recomputation. For kernel-based SVMs, the decision function is expressed as an expansion over support vectors only (\alpha_k > 0), so changes propagate solely through these terms, with the b adjusted based on the errors of non-bound examples to maintain . To enhance efficiency in large-scale problems, SMO incorporates a shrinking that temporarily fixes alphas at their bounds (0 or C) if they are unlikely to change, reducing the active set size during inner loops. Specifically, during iterations over the non-bound subset, examples at bounds with small KKT violations are shrunk (removed from consideration); restoration occurs by rescanning the full dataset when the non-bound examples satisfy the KKT conditions within tolerance, checking bound examples for violations. This approach, while preserving optimality upon , significantly accelerates by focusing computations on potentially active variables, with empirical between linear and in the number of examples.

Extensions and Applications

Variants and Improvements

Since its introduction, the Sequential Minimal Optimization (SMO) algorithm has been extended in several ways to address limitations such as slow convergence on large datasets and restricted applicability to specific problem formulations. The original SMO, which optimizes two Lagrange multipliers at a time, uses pairwise updates. Early improvements focused on enhancing variable selection heuristics; for instance, Keerthi et al. proposed modifications that incorporate second-order information to predict the step length more accurately, reducing the number of iterations by up to 50% on benchmark datasets compared to the original heuristic. Decomposition methods allowing larger working sets have also emerged as key variants, enabling optimization over more than two variables when it accelerates convergence without violating constraints. SVMlight, developed by Joachims, employs a flexible working set selection based on steepest feasible descent, which dynamically chooses subsets of up to 100 multipliers for update, achieving scalability for text classification tasks with millions of features. This approach maintains the analytical solvability of subproblems while outperforming fixed-size decompositions on sparse, high-dimensional data. Adaptations for specialized problems include extensions to support vector regression (), where SMO is modified to handle epsilon-insensitive loss constraints by adjusting the bounded region for multipliers and incorporating on dual variables. Shevade et al. improved this by introducing a more efficient and for selecting non-bound pairs, leading to faster training for regression tasks like . For probabilistic outputs, integrates SMO-trained SVMs with a posterior fit, transforming decision values into class probabilities via optimized on cross-validated scores, which enhances interpretability in applications like . To tackle large-scale and distributed settings, parallel SMO variants distribute subproblem solving across processors, often approximating block-coordinate descent to minimize communication overhead. For example, the Communication-Avoiding SVM (CA-SVM) partitions data across nodes using random averaging and performs local SMO updates without global synchronization, achieving speedups of 3-16 times (average 7x) on clusters for datasets over 1 million samples. These variants, including second-order working set selection, further reduce iterations through better initial guesses and shrinking techniques. Recent developments as of 2025 include parallel SMO implementations for embedded hardware like FPGAs, enabling batch-wise updates for efficient on-device learning, and extensions to multi-task learning by leveraging SMO for related SVM tasks to improve generalization. Finite Newton methods have been developed as alternatives to SMO for linear SVMs on sparse data, solving primal problems with exact line search and achieving 2-3 times acceleration for text categorization tasks.

Usage in Machine Learning Libraries

LIBSVM, a widely adopted open-source library for support vector machines developed by Chih-Chung Chang and Chih-Jen Lin, employs the sequential minimal optimization (SMO) algorithm as its default solver for training SVM models. It incorporates the shrinking heuristic to accelerate convergence by eliminating inactive variables during optimization, and supports common kernels such as radial basis function (RBF) and linear. Additionally, LIBSVM provides built-in tools for automatic parameter selection through grid search, enabling users to optimize hyperparameters like the regularization parameter C and kernel parameter gamma without extensive manual tuning. In , a popular library, the Support Vector Classifier () implementation leverages the LIBSVM backend to utilize SMO for kernel-based SVM . Users can control SMO behavior via exposed parameters, including cache_size for managing kernel matrix storage in and tol for , which directly influence efficiency and precision. The fit time for scales at least quadratically with the number of samples; for linear kernels, the LinearSVC class using offers better near-linear scaling on moderate datasets, benefiting from heuristics that reduce computational overhead compared to naive approaches. SMO-based SVMs, as implemented in these libraries, facilitate applications in domains requiring handling of large-scale data, such as text classification for spam detection where models train on millions of examples to achieve high accuracy on sparse, high-dimensional spaces. In bioinformatics, SMO enables prediction by classifying structural motifs from sequence data, supporting datasets with thousands to millions of instances while maintaining robust generalization. As of 2025, SMO-pretrained SVMs are increasingly integrated as interpretable classifier heads in (CNN) pipelines, enhancing explainability in models for applications like from physiological signals. For practical deployment with large datasets, LIBSVM and scikit-learn recommend caching the kernel matrix to disk when memory limits are exceeded, preventing out-of-memory errors during SMO iterations. tuning for C (controlling the between margin and misclassification) and gamma (defining the kernel's influence range) is best performed using cross-validation to bias and variance.

References

  1. [1]
    [PDF] A Fast Algorithm for Training Support Vector Machines - Microsoft
    Apr 21, 1998 · ABSTRACT. This paper proposes a new algorithm for training support vector machines: Sequential. Minimal Optimization, or SMO.
  2. [2]
    ‪John C. Platt‬ - ‪Google Scholar‬
    Sequential minimal optimization: A fast algorithm for training support vector machines. J Platt. 4679, 1998 ; Best practices for convolutional neural networks ...
  3. [3]
    [PDF] LIBSVM: A Library for Support Vector Machines
    Aug 23, 2022 · LIBSVM considers an SMO-type decomposition method proposed in Fan et al. (2005). Algorithm 1 (An SMO-type decomposition method in Fan et al ...
  4. [4]
    Support-vector networks | Machine Learning
    About this article. Cite this article. Cortes, C., Vapnik, V. Support-vector networks. Mach Learn 20, 273–297 (1995). https://doi.org/10.1007/BF00994018.
  5. [5]
    [PDF] A Tutorial on Support Vector Machines for Pattern Recognition
    The purpose of this paper is to provide an introductory yet extensive tutorial on the basic ideas behind Support Vector Machines (SVMs). The books (Vapnik, 1995 ...<|control11|><|separator|>
  6. [6]
    Support-vector networks
    Abstract. The support-vector network is a new leaming machine for two-group classification problems. The machine conceptually implements the following idea: ...
  7. [7]
    [PDF] A Fast Algorithm for Training Support Vector Machines
    Apr 21, 1998 · Sequential Minimal Optimization (SMO) is a simple algorithm that can quickly solve the SVM. QP problem without any extra matrix storage and ...
  8. [8]
    [PDF] Improvements to Platt's SMO Algorithm for SVM Classifier Design
    This article points out an important source of inefficiency in Platt's se- quential minimal optimization (SMO) algorithm that is caused by the.
  9. [9]
    SVM-Light: Support Vector Machine - Cornell: Computer Science
    May 29, 2017 · The optimization algorithms used in SVMlight are described in [Joachims, 2002a ]. [Joachims, 1999a]. The algorithm has scalable memory ...
  10. [10]
    [PDF] Improvements to the SMO algorithm for SVM regression
    Abstract—This paper points out an important source of ineffi- ciency in Smola and Schölkopf's sequential minimal optimization. (SMO) algorithm for support ...
  11. [11]
    [PDF] Probabilistic Outputs for Support Vector Machines and Comparisons ...
    Mar 26, 1999 · This chapter compares classification error rate and likelihood scores for an SVM plus sigmoid versus a kernel method trained with a regularized.Missing: SMO | Show results with:SMO
  12. [12]
    [PDF] CA-SVM: Communication-Avoiding Parallel Support Vector ...
    Feb 27, 2015 · As far as we know, it is the best distributed SMO implementation so far. The basic idea is to partition the data among nodes and launch a big.
  13. [13]
    [PDF] A Modified Finite Newton Method for Fast Solution of Large Scale ...
    Abstract. This paper develops a fast method for solving linear SVMs with L2 loss function that is suited for large scale data mining tasks such as text ...Missing: hybrid | Show results with:hybrid
  14. [14]
    SVC — scikit-learn 1.7.2 documentation
    C-Support Vector Classification. The implementation is based on libsvm. The fit time scales at least quadratically with the number of samples.Plot classification boundaries... · Scaling the regularization...
  15. [15]
    1.4. Support Vector Machines - Scikit-learn
    In the binary case, the probabilities are calibrated using Platt scaling [9]: logistic regression on the SVM's scores, fit by an additional cross-validation on ...Missing: SMO | Show results with:SMO
  16. [16]
    Time complexity analysis of support vector machines (SVM) in LibSVM
    Oct 16, 2015 · The results of our research has proved that the complexity of SVM (LibSVM) is O(n3) and the time complexity shown that C++ faster than Java, both in training ...
  17. [17]
    Support Vector Machine Implementations for Classification ...
    Sep 26, 2006 · We describe Support Vector Machine (SVM) applications to classification and clustering of channel current data.
  18. [18]
    A pipeline and comparative study of 12 machine learning models for ...
    Sep 1, 2022 · Experiments showed that SVM was the best performing algorithm out of the 5 algorithms tested, which only included Multinomial Naïve Bayes and ...
  19. [19]
    (PDF) An Interpretable Hybrid CNN-SVM Model for Predicting Stress ...
    Mar 14, 2025 · In this study, a novel hybrid CNN-SVM model is proposed to predict the mental stress among the university students. The rationale here is that ...
  20. [20]
    LIBSVM -- A Library for Support Vector Machines
    LIBSVM is an integrated software for support vector classification, (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM) ...LIBSVM Data Sets. · Libsvm faq · LIBSVM Tools · Other documents