
Multiclass classification

Multiclass classification is a supervised learning task in machine learning that involves assigning an input instance to one of three or more mutually exclusive categories based on its feature vector, extending beyond the binary case of two classes.^[1] This approach is essential in applications such as image recognition, where models distinguish among numerous object types, or natural language processing, for tasks like sentiment analysis across multiple levels (positive, neutral, negative).^[2] Unlike binary classification, which uses a single decision boundary, multiclass problems demand strategies to handle multiple boundaries, often leading to increased computational complexity and the need for specialized evaluation metrics like macro-averaged F1-score to account for class imbalance.^[3]

Introduction

Definition

Multiclass classification is a fundamental task in supervised machine learning, where the objective is to learn a function that maps input instances, described by a set of features, to one of three or more predefined, mutually exclusive discrete classes. Unlike binary classification, which restricts the output to exactly two classes, multiclass problems involve a label space with cardinality greater than two, ensuring that each instance is assigned to precisely one class from this set. This distinguishes it from multilabel classification, in which instances may receive multiple non-exclusive labels simultaneously.^[4]^[5] The task operates within the framework of supervised learning, where a training dataset comprises paired examples of feature vectors and corresponding class labels, enabling the model to approximate the underlying conditional distribution P(Y \mid X) over the discrete label space Y. Binary classification serves as a special case when the number of classes reduces to two. The classes are typically exhaustive and categorical, covering all possible outcomes for any given instance without overlap. 
Multiclass classification traces its roots to early 20th-century statistical methods, notably Ronald Fisher's 1936 introduction of linear discriminant analysis applied to the iris dataset—a multivariate collection of measurements from 150 flowers across three iris species (setosa, versicolor, and virginica)—which provided a seminal example for distinguishing multiple categories based on continuous features.^[6]^[7] In machine learning, the problem gained formal structure in the late 20th century, with algorithms like classification and regression trees (CART) enabling native handling of multiclass outputs through recursive partitioning of feature spaces.^[8] A straightforward illustration is classifying fruits into one of three categories—apple, banana, or orange—using input features such as color and size, where each fruit instance receives exactly one label based on these attributes.

Relation to Binary Classification

Multiclass classification extends the binary classification framework by addressing problems with more than two classes, where the output must represent a probability distribution over K > 2 categories. In binary classification, the logistic sigmoid function maps linear combinations of features to probabilities between 0 and 1 for two classes, often paired with binary cross-entropy loss. In contrast, multiclass settings employ the softmax function to generalize this, transforming a vector of raw scores (logits) into probabilities that sum to 1 across all K classes, ensuring a valid categorical distribution. This shift is essential because binary methods cannot directly handle multiple mutually exclusive outcomes without modification. Unique challenges arise in multiclass problems due to the expanded decision space. With more classes, the potential for prediction errors increases, as misclassifications can occur between any pair of categories, leading to higher overall error rates influenced by inter-class correlations and data geometry.^[9] Computational costs also escalate, as training and inference involve optimizing over larger parameter spaces or multiple subproblems, complicating analysis of model correlations.^[9] Additionally, multiclass methods often rely on assumptions of class separability, such as distinct feature distributions, which are harder to satisfy than in binary cases where separability is simpler to model.^[10] Many multiclass solutions build on binary classifiers by decomposing the problem into binary subproblems, such as comparing one class against others, though direct multiclass approaches avoid this by optimizing jointly over all classes.^[10] A common objective in direct methods is the multiclass cross-entropy loss, which measures divergence between the true one-hot encoded label and predicted probabilities:

L = -\sum_{k=1}^K y_k \log(p_k)

where y_k is 1 for the true class and 0 otherwise, and p_k is the softmax probability for class k. This loss generalizes binary cross-entropy and promotes confident, well-calibrated predictions across multiple classes.
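The relationship between the sigmoid, the softmax, and the cross-entropy loss above can be checked numerically. The following is a minimal sketch in plain Python; the function names `softmax` and `cross_entropy` and the example logits are illustrative assumptions, not part of any cited source.

```python
import math

def softmax(logits):
    """Map raw scores (logits) to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_onehot, probs):
    """L = -sum_k y_k * log(p_k); only the true class contributes."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs) if y > 0)

# Hypothetical logits for K = 3 classes; the true class is index 0.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
loss = cross_entropy([1, 0, 0], probs)

# With K = 2 and the second logit fixed at 0, softmax reduces to the sigmoid.
z = 1.5
sigmoid_z = 1.0 / (1.0 + math.exp(-z))
softmax_z = softmax([z, 0.0])[0]
```

Because the label is one-hot, the loss equals the negative log of the probability assigned to the true class, so it shrinks toward 0 as that probability approaches 1.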

Model Evaluation

Chance and Better-than-Chance Performance

In multiclass classification, random baseline models provide essential benchmarks for assessing whether a classifier performs better than trivial prediction strategies. The uniform random classifier assigns each instance to one of the K classes with equal probability 1/K. Because every guess is then correct with probability 1/K, its expected accuracy is 1/K whatever the class distribution. This baseline represents pure chance without any use of class priors. In contrast, the majority class baseline, often implemented as the ZeroR classifier, predicts the most frequent class for every instance, achieving an accuracy equal to the proportion of the majority class in the dataset.^[11] The majority baseline is particularly informative in imbalanced settings, where the majority class proportion can exceed 1/K significantly.

Intuitively, in binary classification (K=2), the uniform random baseline corresponds to 50% accuracy, serving as a simple threshold for meaningful performance.^[12] This extends to multiclass problems, where chance accuracy is either 1/K for uniform random or the maximum class frequency for the majority baseline; thus, better-than-chance performance requires exceeding these levels to demonstrate learning of discriminative patterns rather than mere frequency matching or overfitting to noise. For instance, consider a dataset with three classes having frequencies 0.4, 0.3, and 0.3; the uniform random baseline yields approximately 0.333 accuracy, while the majority baseline achieves 0.4. A model attaining 0.5 accuracy surpasses both, indicating genuine improvement over chance.

Formally, a multiclass classifier exhibits better-than-chance performance if its expected accuracy exceeds the relevant baseline, often quantified through adaptations of binary diagnostic measures like likelihood ratios or odds ratios.^[13] One such extension is the multiclass likelihood ratio for class k, defined as the ratio of the probability of the data given class k to the probability given not class k:

LR_k = \frac{P(\text{data} \mid \text{class } k)}{P(\text{data} \mid \text{not } k)}

This measure can be computed per class and aggregated (e.g., via geometric mean or pairwise comparisons) to evaluate overall model utility, where LR_k > 1 for all k signals outperformance relative to chance, analogous to binary settings.^[14] In the classification context, pairwise likelihood ratios LR_{i,j} = \frac{P(\hat{y}=j \mid y=j)}{P(\hat{y}=j \mid y=i)} for i \neq j further characterize this, requiring LR_{i,j} \geq 1 with strict inequality for at least one pair to confirm the model as a maximum likelihood estimator superior to random assignment.^[13] These formalizations ensure rigorous assessment beyond raw accuracy comparisons.
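The two baselines discussed in this section can be reproduced for the three-class example with frequencies 0.4, 0.3, and 0.3. This is a minimal sketch; the helper names are hypothetical and the model accuracy of 0.5 is the illustrative figure from the text, not a measured result.

```python
def uniform_baseline_accuracy(class_freqs):
    """Expected accuracy when each class is guessed with probability 1/K."""
    k = len(class_freqs)
    # A guess of class c is right with probability freq(c), so the
    # expectation is sum_c freq(c) * (1/K) = 1/K when frequencies sum to 1.
    return sum(f * (1.0 / k) for f in class_freqs)

def majority_baseline_accuracy(class_freqs):
    """Accuracy of always predicting the most frequent class (ZeroR)."""
    return max(class_freqs)

freqs = [0.4, 0.3, 0.3]  # class frequencies from the example in the text
uniform_acc = uniform_baseline_accuracy(freqs)
majority_acc = majority_baseline_accuracy(freqs)

model_acc = 0.5  # illustrative model accuracy from the text
beats_chance = model_acc > max(uniform_acc, majority_acc)
```

Comparing against the larger of the two baselines is the stricter test, which matters most when classes are imbalanced.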

Key Metrics and Measures

In multiclass classification, accuracy serves as a fundamental metric, defined as the ratio of correctly predicted instances to the total number of instances, providing a straightforward measure of overall performance.^[5] However, accuracy is often critiqued for its sensitivity to class imbalance, where it may yield misleadingly high values by favoring majority classes, thus underrepresenting errors in minority classes.^[5] The error rate, simply one minus the accuracy, complements this by quantifying the proportion of misclassifications.^[5] To address limitations in imbalanced settings, balanced accuracy offers a more equitable evaluation by averaging the recall across all classes, ensuring each class contributes equally regardless of prevalence.^[5] Formally, for K classes, it is computed as:

\text{Balanced Accuracy} = \frac{1}{K} \sum_{k=1}^{K} \frac{\text{TP}_k}{\text{TP}_k + \text{FN}_k}

where \text{TP}_k and \text{FN}_k denote true positives and false negatives for class k, respectively.^[5] This metric, originally formalized in probabilistic terms for posterior distributions, enhances reliability in scenarios with skewed class distributions.^[15] Probabilistic metrics extend evaluation to models outputting probability distributions over classes, penalizing confident but incorrect predictions. Log-loss, also known as cross-entropy loss, quantifies the divergence between predicted probabilities and true labels as:

\text{Log-Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log p_{i,k}

where N is the number of instances, y_{i,k} is the true binary indicator for class k of instance i, and p_{i,k} is the predicted probability.^[5] Similarly, the Brier score measures the mean squared difference between predicted probabilities and actual outcomes, applicable to multiclass via its original formulation for multiple probabilistic events:

\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} (p_{i,k} - y_{i,k})^2

Lower values indicate better calibration and accuracy in probability estimates.^[16] These scores, rooted in meteorological forecasting, promote models that output well-calibrated probabilities beyond hard classifications. Receiver operating characteristic (ROC) analysis, prominent in binary classification via the area under the curve (AUC), extends to multiclass through one-vs-rest decompositions, generating K binary ROC curves by treating each class against all others and plotting true positive rate against false positive rate for varying thresholds. For a holistic measure, the volume under the multidimensional ROC surface (VUS) integrates performance across all classes, generalizing AUC to higher dimensions and providing a single scalar comparable to binary AUC.^[17] These metrics find critical application in imbalanced domains such as medical diagnosis, where rare disease classes demand balanced accuracy or probabilistic scores to avoid overlooking critical errors, outperforming simple accuracy in detecting minority class performance.^[5] In such contexts, multiclass ROC variants enable threshold selection that balances sensitivity across classes, akin to binary AUC but adapted for multi-outcome separability.^[17]
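The formulas for balanced accuracy, log-loss, and the Brier score given in this section translate directly into code. The sketch below uses plain Python with integer class labels 0..K-1; the function names and the toy data are assumptions made for illustration, and `balanced_accuracy` assumes every class occurs at least once among the true labels.

```python
import math

def balanced_accuracy(y_true, y_pred, num_classes):
    """Mean per-class recall: (1/K) * sum_k TP_k / (TP_k + FN_k)."""
    recalls = []
    for k in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        recalls.append(tp / (tp + fn))  # assumes class k appears in y_true
    return sum(recalls) / num_classes

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    return sum(-math.log(max(p[t], eps)) for t, p in zip(y_true, probs)) / len(y_true)

def brier_score(y_true, probs):
    """Mean squared difference between predicted probabilities and one-hot labels."""
    total = 0.0
    for t, p in zip(y_true, probs):
        total += sum((p_k - (1.0 if k == t else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(y_true)

# Imbalanced toy labels: class 1 is never predicted correctly.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 2]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred, num_classes=3)

# Toy probabilistic predictions for two instances with true classes 0 and 1.
labels = [0, 1]
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
ll = log_loss(labels, probs)
bs = brier_score(labels, probs)
```

On the toy labels, plain accuracy is 5/6 while balanced accuracy is 2/3, because the completely missed minority class counts as much as the majority class.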

Algorithmic Strategies

One-vs-Rest and One-vs-One Transformations

In multiclass classification problems with K classes, strategies such as one-vs-rest (OvR) and one-vs-one (OvO) reduce the task to a series of binary classification problems, allowing the use of binary learners like support vector machines or logistic regression. These decomposition methods enable the application of well-established binary algorithms without requiring native multiclass extensions, though they introduce trade-offs in computational complexity and class balance.^[18]^[19] The one-vs-rest (OvR) approach, also known as one-vs-all, trains K binary classifiers, where each classifier treats samples from one specific class as positive and all samples from the remaining K-1 classes as negative. During prediction, the class corresponding to the classifier with the highest output score is selected, often after normalizing the scores to approximate probabilities. This normalization is achieved by dividing each score by the sum of all scores, yielding pseudo-probabilities that sum to one:

\hat{p}(y = k \mid x) = \frac{f_k(x)}{\sum_{j=1}^K f_j(x)},

where f_k(x) is the decision function output for the k-th classifier, and the predicted class is \arg\max_k \hat{p}(y = k \mid x). Dividing by the sum is meaningful only when the scores are non-negative (for example, per-classifier probability estimates); in that case the normalization does not change which class attains the maximum. For algorithms like support vector machines that output uncalibrated scores, Platt scaling, a logistic regression fit on the scores using cross-validation, can further refine these into calibrated probabilities. This method is computationally efficient, since the number of classifiers grows only linearly in K, but it often results in highly imbalanced training sets, as the positive class is typically much smaller than the negative, potentially leading to biased classifiers unless addressed through techniques like class weighting.^[19]^[18]^[16]

In contrast, the one-vs-one (OvO) strategy trains a separate binary classifier for every unique pair of classes, resulting in \binom{K}{2} = K(K-1)/2 classifiers. For prediction, each classifier votes for one of the two classes it was trained on, and the class receiving the most votes across all pairwise decisions is chosen as the final prediction; ties can be resolved by secondary criteria such as confidence scores. This pairwise decomposition, originally proposed through probabilistic coupling of binary estimates, avoids severe class imbalance since each classifier is trained on samples from just two classes, which are roughly equal in number when the original classes are balanced. However, the quadratic growth in the number of models makes OvO less scalable for large K: more models must be trained, stored, and evaluated at test time, although each one sees only the samples of its two classes. Empirical comparisons show that OvO can slightly outperform OvR on datasets with weak binary learners or structured class relationships, but the differences are often negligible when using strong, well-tuned classifiers like SVMs.^[20]^[18]^[19]

The primary trade-off between OvR and OvO lies in simplicity versus balance: OvR requires fewer models (O(K)) and is easier to implement, making it a default choice for moderate K, but its imbalance can degrade performance without mitigation. OvO mitigates imbalance at the cost of O(K^2) models, which becomes increasingly costly as K grows (45 classifiers for K = 10, 4,950 for K = 100), though it may yield marginally higher accuracy in scenarios with overlapping classes. Studies across UCI datasets, such as letter recognition and satellite imagery, indicate that OvR achieves error rates comparable to OvO (e.g., 8.2% vs. 7.8% on satimage) when binary classifiers are properly tuned, with no consistent superiority of one over the other.^[18]^[19]

A representative example is the Iris dataset, which contains 150 samples across three classes (setosa, versicolor, virginica) based on four features. Applying OvR with SVMs trains three binary classifiers: one for setosa vs. others, one for versicolor vs. others, and one for virginica vs. others. In OvO, three pairwise SVMs are trained instead: setosa-vs-versicolor, setosa-vs-virginica, and versicolor-vs-virginica, with the final class determined by majority vote. For K = 3 the two strategies happen to need the same number of classifiers, since K(K-1)/2 = 3. Both approaches yield high accuracy (>95%) on this balanced, low-dimensional data, illustrating their efficacy for small K.^[19]
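The bookkeeping behind the two decompositions can be sketched without committing to a particular binary learner. In the snippet below, the pairwise winners and the per-class scores are hypothetical stand-ins for the outputs of already-trained binary classifiers; only the task enumeration, the voting rule, and the score normalization are illustrated.

```python
from collections import Counter
from itertools import combinations

classes = ["setosa", "versicolor", "virginica"]

# One-vs-rest: one task per class (positive class vs. all remaining classes).
ovr_tasks = [(c, [o for o in classes if o != c]) for c in classes]

# One-vs-one: one task per unordered pair of classes, K(K-1)/2 in total.
ovo_tasks = list(combinations(classes, 2))

def ovo_predict(pairwise_winners):
    """Majority vote over the winners reported by each pairwise classifier."""
    votes = Counter(pairwise_winners.values())
    return votes.most_common(1)[0][0]

def ovr_predict(scores):
    """Normalize non-negative per-class scores and pick the largest."""
    total = sum(scores.values())
    pseudo_probs = {c: s / total for c, s in scores.items()}
    return max(pseudo_probs, key=pseudo_probs.get), pseudo_probs

# Hypothetical outputs for a single test instance.
winners = {
    ("setosa", "versicolor"): "versicolor",
    ("setosa", "virginica"): "virginica",
    ("versicolor", "virginica"): "versicolor",
}
ovo_label = ovo_predict(winners)

ovr_label, ovr_probs = ovr_predict(
    {"setosa": 0.2, "versicolor": 0.9, "virginica": 0.4}
)

# Number of pairwise classifiers needed for K = 10 classes.
ovo_count_k10 = len(list(combinations(range(10), 2)))
```

The vote in `ovo_predict` breaks ties by first occurrence, which is only one of several possible tie-breaking rules; a confidence-based rule would need the classifiers' scores as well.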

Extensions of Binary Algorithms

Many binary classification algorithms can be extended directly to handle multiclass problems by adapting their core mechanisms to accommodate multiple classes without decomposing the problem into binary subproblems. These direct extensions often leverage algorithm-specific formulations for probability estimation, decision boundaries, or voting procedures, enabling efficient handling of K > 2 classes.^[21] In neural networks, the output layer is modified to use a softmax activation function over K classes, which normalizes the logits into a probability distribution summing to 1. Backpropagation then optimizes the multiclass cross-entropy loss, defined as the negative log-likelihood of the true class, to train the network end-to-end. This approach generalizes binary logistic regression seamlessly and is widely used in deep learning for tasks like image recognition.^[22] The k-nearest neighbors (KNN) algorithm extends to multiclass settings by assigning the class label of a new instance based on the majority vote among its k nearest neighbors in the feature space, where distances are typically computed using Euclidean or other metrics. Ties can be resolved by distance-weighted voting, with closer neighbors exerting greater influence. This non-parametric method requires no explicit model training beyond storing the training data.^[23] Naive Bayes classifiers are extended to multiclass by estimating the likelihood P(features|class k) for each of the K classes under assumptions like multinomial for discrete features (e.g., word counts in text) or Gaussian for continuous features, then computing the posterior P(class k|features) ∝ P(features|k) P(k) via Bayes' theorem. The class with the highest posterior probability is selected. 
This probabilistic approach assumes conditional independence of the features given the class and performs well on high-dimensional data such as the word-count features used in spam detection.^[24]

Decision trees adapt to multiclass by selecting splits that minimize multiclass impurity measures, such as Gini impurity defined as 1 - \sum_{k=1}^K p_k^2, where p_k is the proportion of class k in the node. Ensembles like random forests extend this by combining predictions from multiple trees, each grown with multiclass splits, to reduce variance and improve generalization. These methods provide interpretable hierarchies of decisions.^[25]

Support vector machines (SVMs) can be extended directly using formulations like the Crammer-Singer method, which optimizes a single quadratic program to find hyperplanes separating all K classes simultaneously via structural risk minimization, maximizing the margin while penalizing multiclass hinge losses. This avoids binary decompositions and is particularly effective for linearly separable multiclass problems.^[26]

Multi-expression programming (MEP), a variant of genetic programming, extends to multiclass by evolving chromosomes that encode multiple mathematical expressions, each discriminating between classes through fitness evaluation on training data. These expressions are decoded into programs that compute class probabilities or scores, allowing evolutionary optimization for complex, non-linear decision boundaries.^[27]

Direct extensions like these often need only a single trained model, which can make training and prediction cheaper than with transformation methods (e.g., one-vs-rest) that fit and evaluate several binary models, though they require tailored multiclass implementations within the algorithm.^[21]
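The multiclass Gini impurity used by decision trees can be computed directly from the class labels in a node. The following sketch is illustrative only: the helper names and the toy labels are assumptions, and the size-weighted split score shown is a common convention rather than something specified in the text.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 for the class labels in one node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted average impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# A node with three equally represented classes (toy data).
parent = ["a"] * 3 + ["b"] * 3 + ["c"] * 3
parent_gini = gini(parent)

# A candidate split that isolates class "a" perfectly.
left = ["a"] * 3
right = ["b"] * 3 + ["c"] * 3
child_gini = split_impurity(left, right)
impurity_decrease = parent_gini - child_gini
```

A tree learner would evaluate many candidate splits this way and keep the one with the largest impurity decrease.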

Hierarchical Methods

Hierarchical methods in multiclass classification organize classes into tree-like structures or directed acyclic graphs (DAGs), such as biological taxonomies where categories progress from broad (e.g., animal) to specific (e.g., mammal > dog).^[28] This structure enables top-down classification, where decisions at higher levels constrain predictions at lower levels, localizing errors to sub-branches rather than affecting the entire output space.^[28] By exploiting these relationships, hierarchical approaches address the challenges of large-scale multiclass problems, such as exponential growth in decision boundaries for flat classifiers.^[29] Key algorithms train local classifiers at each node of the hierarchy, often using binary or small-multiclass models like support vector machines (SVMs) or naive Bayes.^[28] In hierarchical SVMs, kernel methods incorporate structural constraints, such as through maximum margin Markov networks, to predict paths while respecting parent-child dependencies.^[29] Similarly, hierarchical naive Bayes extends the independence assumption by modeling conditional probabilities along branches, training separate naive Bayes classifiers for each non-leaf node.^[30] During prediction, incompatible paths are pruned based on intermediate decisions, yielding a final class via the most probable trajectory.^[28] These methods offer advantages in reducing the effective number of classes considered at each decision point, which scales better for deep hierarchies and mitigates imbalance by progressing from coarse to fine granularity.^[28] They prove particularly useful in domains like text categorization, such as assigning books to Dewey Decimal classes (e.g., 000 > 500 > 510 for mathematics) or semantic labeling with WordNet synsets.^[31]^[32] However, challenges include error propagation, where a mistake at a high-level node can cascade to invalidate lower-level predictions, and the need for a predefined, accurate hierarchy that may not always align 
with data distributions.^[28] Flat predictions derived from hierarchical outputs can thus underperform if the structure is suboptimal.^[29] Probabilities in hierarchical classification follow the chain rule, computing the joint probability of a full path as the product of local conditional probabilities:

P(y \mid x) = \prod_{i=1}^{L} P(y_i \mid y_{i-1}, x)

where y = (y_1, \dots, y_L) is the path through L levels, and each P(y_i \mid y_{i-1}, x) is estimated by a local classifier at node i given its parent y_{i-1}.^[28] A practical example is web page classification, starting from a root category like "content" and narrowing to "news > sports > soccer," where local classifiers at each node filter documents based on textual features, improving precision over flat multiclass approaches.^[28]
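The chain rule for path probabilities and the top-down prediction procedure can be sketched as follows. The hierarchy, category names other than those in the web page example, and all probability values are hypothetical; in practice each inner dictionary would be produced by a local classifier evaluated on the input x.

```python
# Hypothetical local conditional probabilities P(child | parent, x) for one
# document, keyed by parent node.
local_probs = {
    "content": {"news": 0.7, "shopping": 0.3},
    "news": {"sports": 0.6, "politics": 0.4},
    "sports": {"soccer": 0.8, "tennis": 0.2},
}

def path_probability(root, path):
    """Product of local conditional probabilities along a root-to-leaf path."""
    prob = 1.0
    parent = root
    for node in path:
        prob *= local_probs[parent][node]
        parent = node
    return prob

def greedy_top_down(root):
    """Follow the most probable child at each level until a leaf is reached."""
    path = []
    node = root
    while node in local_probs:
        children = local_probs[node]
        node = max(children, key=children.get)
        path.append(node)
    return path

predicted_path = greedy_top_down("content")
predicted_prob = path_probability("content", predicted_path)
```

The greedy walk commits to one branch per level, which is exactly where error propagation arises: a wrong choice at "content" cannot be undone further down the tree.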

Advanced Considerations

Learning Paradigms

Learning paradigms in multiclass classification encompass the foundational frameworks for training models to predict among three or more classes, building on binary classification techniques while addressing the increased complexity of multiple outputs. These paradigms dictate how data is utilized during training, from full supervision to incorporating partial or no labels, and range from batch processing to incremental updates. Key approaches include supervised, semi-supervised, ensemble, active, and online learning, each tailored to handle the distribution of class probabilities across K classes rather than simple positive-negative distinctions. In supervised learning, the standard paradigm requires complete labeling of all training instances with one of the K possible classes, enabling direct optimization of multiclass loss functions as an extension of binary supervised methods. This full-labeling approach underpins most binary algorithm extensions, such as multiclass support vector machines, where the model learns decision boundaries that separate all classes simultaneously.^[33] Semi-supervised learning extends this by incorporating unlabeled data to improve generalization, particularly when labeled examples are scarce, through multiclass adaptations of techniques like self-training or graph-based propagation. For instance, label propagation constructs a graph over labeled and unlabeled instances and propagates class labels across K classes by solving a harmonic function on the graph manifold, effectively using manifold assumptions to infer labels for unlabeled points. This method has been further refined for multi-class/multi-label settings via dynamic updates that enhance discriminative power by iteratively adjusting propagation based on current predictions.^[34] Ensemble paradigms combine multiple weak learners to form robust multiclass classifiers, with bagging and boosting adapted to handle multi-class errors. 
In boosting, AdaBoost.MH extends the binary AdaBoost by treating the multiclass problem as multiple binary tasks and weighting misclassifications across all classes during iterative training, focusing on Hamming loss minimization. Random forests, a bagging-based ensemble, natively support multiclass classification by growing decision trees on bootstrapped samples with random feature subsets and aggregating predictions via majority vote over K classes, providing inherent handling of multiple outcomes without pairwise decomposition.^[35] Active learning paradigms address labeling costs by selectively querying instances for human annotation, using multiclass-specific strategies to maximize information gain. A common query strategy selects the most uncertain instance based on the entropy of the predicted class probability distribution:

H = -\sum_{k=1}^K p_k \log p_k,

where p_k is the predicted probability for class k, prioritizing samples with high predictive ambiguity across multiple classes to refine the model efficiently. This approach has been shown to outperform random sampling in multi-class settings by focusing on regions of the input space with overlapping class boundaries.^[36] Online learning paradigms enable incremental model updates as data arrives in streams, suitable for multiclass problems in dynamic environments. These methods use convex surrogates like the multiclass hinge loss to penalize errors across all classes in a single update step, as in online perceptron variants. Seminal work formalized efficient online algorithms for multiclass kernel machines, achieving sublinear regret bounds by ultraconservatively updating only when necessary to maintain margins for all classes.^[37] These paradigms evolved from binary classification foundations during the early 2000s, driven by the need to scale theoretical guarantees and practical implementations to multiple classes, with influential contributions like Crammer and Singer's online multiclass framework marking a shift toward unified vector-based optimizations.^[33]
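Entropy-based query selection for active learning can be illustrated in a few lines. This is a sketch under the assumption that a model has already produced a probability vector for each unlabeled instance; the pool values and function names are invented for the example.

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum_k p_k log p_k (terms with p_k = 0 contribute 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(pool_probs):
    """Index of the unlabeled instance whose prediction has the highest entropy."""
    return max(range(len(pool_probs)), key=lambda i: entropy(pool_probs[i]))

# Hypothetical predicted distributions over K = 3 classes for a pool of three instances.
pool = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # ambiguous across all three classes
    [0.60, 0.30, 0.10],  # moderately confident
]
query_index = most_uncertain(pool)

# Entropy is maximal for the uniform distribution, where it equals log K.
max_entropy = entropy([1 / 3, 1 / 3, 1 / 3])
```

The selected instance would be sent to an annotator, added to the labeled set, and the model retrained before the next query round.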

Imbalanced and Specialized Scenarios

In multiclass classification, class imbalance occurs when some classes have significantly fewer instances than others, leading to degraded performance as models tend to favor majority classes and overlook rare ones. This issue is particularly pronounced in real-world datasets where minority classes represent critical but infrequent outcomes, such as rare diseases in medical data.^[38] To address imbalance, resampling techniques like the Synthetic Minority Oversampling Technique (SMOTE) have been extended to multiclass settings by applying it pairwise (one-vs-rest) or through variants that generate synthetic samples for each minority class while preserving inter-class relationships. Cost-sensitive learning assigns higher misclassification costs to minority classes during training, modifying algorithms like support vector machines or decision trees to prioritize errors on rare classes.^[39]^[40] Additionally, threshold tuning adjusts decision boundaries per class post-training, optimizing probability thresholds to improve recall for minorities without altering the underlying model.^[41] Specialized scenarios in multiclass classification include multi-instance learning, where data is structured as bags of instances labeled with one of multiple classes, typically under the assumption that the bag label is determined by at least one instance in the bag, complicating direct classification. 
This paradigm, originally motivated by drug activity prediction, trains models to aggregate instance-level predictions (e.g., via max-pooling) to determine bag labels across multiple classes.^[42] Ordinal classification handles ordered classes, such as severity ratings from 1 to 5, using ordinal regression methods that model cumulative probabilities to respect the natural ordering and reduce errors between adjacent classes.^[43] These techniques find application in medical diagnostics, where multiclass models classify cancer subtypes or rare diseases from imbalanced imaging data, achieving improved detection of minorities through cost-sensitive deep networks. In fault detection, they identify multiple error types in industrial systems, such as harmonic drive failures, using vibration signals to differentiate subtle anomalies in skewed datasets.^[44]^[45] For evaluation in these scenarios, metrics emphasize balance across classes; the per-class F1-score, defined as

F1_k = 2 \times \frac{\text{precision}_k \times \text{recall}_k}{\text{precision}_k + \text{recall}_k}

for class k, is macro-averaged by taking the unweighted mean over all classes to equally penalize poor performance on minorities.^[46] A 2017 deep learning adaptation, focal loss, extends cross-entropy by down-weighting easy examples with a modulating factor (1 - p_t)^\gamma, effectively addressing multiclass imbalance in object detection and segmentation tasks.^[47] In privacy-sensitive domains, federated learning enables multiclass classification across distributed medical datasets without sharing raw data, using aggregated updates to train models for tasks like image-based disease subtyping while preserving patient confidentiality.^[48] More recent advances as of 2025 include the use of large language model prompting for multiclass classification tasks.^[49]
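Macro-averaged F1 and the focal loss modulating factor can both be written compactly. The sketch below makes two assumptions that the text does not fix: a class with undefined precision or recall receives an F1 of 0, and the focal loss is shown for a single example without the optional class-weighting term used in some formulations.

```python
import math

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for k in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        denom = precision + recall
        # Assumption: F1 is taken as 0 when precision + recall is 0.
        f1_scores.append(2 * precision * recall / denom if denom > 0 else 0.0)
    return sum(f1_scores) / num_classes

def focal_loss(p_t, gamma=2.0):
    """-(1 - p_t)^gamma * log(p_t), with p_t the probability of the true class."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Imbalanced toy labels: the single instance of class 1 is misclassified.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 2]
score = macro_f1(y_true, y_pred, num_classes=3)

easy = focal_loss(0.9)  # well-classified example, strongly down-weighted
hard = focal_loss(0.1)  # badly classified example, close to plain cross-entropy
```

Setting `gamma=0` recovers ordinary cross-entropy, which makes the down-weighting of easy examples easy to see by comparison.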

References

[1]
Multiclass Classification - an overview | ScienceDirect Topics
Multiclass classification is defined as a problem in which each sample is assigned to one of several finite, mutually exclusive classes, with the classifier ...
[2]
Multiclass classification in machine learning | DataRobot Blog
Multiclass classification is a machine learning classification task that consists of more than two classes, or outputs. For example, using a model to identify ...
[3]
[2008.05756] Metrics for Multi-Class Classification: an Overview - arXiv
Aug 13, 2020 · In this white paper we review a list of the most promising multi-class metrics, we highlight their advantages and disadvantages and show their possible usages.
[4]
[PDF] In Defense of One-Vs-All Classification
The central thesis of this chapter is that one-vs-all classification using SVMs or RLSC is an excellent choice for multiclass classification. In the past few ...
[5]
Neural networks: Multi-class classification | Machine Learning
Aug 25, 2025 · This document explores multi-class classification models, which predict from multiple possibilities instead of just two, like binary ...
[6]
[PDF] Multiclass Classification Overview 1 Introduction 2 Task setting 3 ...
Multiclass classification, for some aspects, is very simple. There are some interesting issues in multiclass classification that are the stepping stone to ...
[7]
[PDF] Optimal Learners for Multiclass Problems
1. Introduction. Multiclass classification is the problem of learning a classifier h from a domain X to a label space Y, where |Y| > 2 and the error of a ...
[8]
[PDF] metrics for multi-class classification: an overview - arXiv
Aug 13, 2020 · In this white paper we review a list of the most promising multi-class metrics, we highlight their advantages and disadvantages and show their ...
[9]
Iris - UCI Machine Learning Repository
Jun 30, 1988 · Donated on 6/30/1988. A small classic dataset from Fisher, 1936. One of the earliest known datasets used for evaluating classification methods.
[10]
Summary of Differences and Challenges in Multiclass vs. Binary Classification
[11]
[PDF] Reducing Multiclass to Binary: A Unifying Approach for Margin ...
The framework reduces multiclass problems to multiple binary problems, then solves them using a margin-based binary learning algorithm.
[12]
https://scikit-learn.org/stable/modules/model_evaluation.html
[13]
https://openreview.net/pdf?id=VdW9SkALSd
[14]
3.4. Metrics and scoring: quantifying the quality of predictions
accuracy_score is the special case of k = 1. The function covers the binary and multiclass classification cases but not the multilabel case. If ...
[15]
[PDF] Mathematical Characterization of Better-than-Random Multiclass ...
We also obtain a more theoretical formulation: a model does better than chance if and only if it is a maximum likelihood estimator of the target variable. When ...
[16]
A Multiclass Likelihood Ratio Approach for Genetic Risk Prediction ...
Simulation results demonstrated that the new approach had more accurate and robust performance than existing approaches under various underlying disease models.
[17]
[PDF] The Balanced Accuracy and Its Posterior Distribution
Abstract—Evaluating the performance of a classification algorithm critically requires a measure of the degree to which unseen examples have been identified ...
[18]
[PDF] Probability Estimates for Multi-class Classification by Pairwise ...
We then define the classification rule as δ_2 = arg max_i [p_i^2] ... This measurement is called Brier Score (Brier, 1950), which is popular in meteorology.
[19]
[PDF] Volume Under the ROC Surface for Multi-class Problems - ELP
In this paper, we present the real extension to the Area Under the ROC Curve in the form of the Volume Under the ROC Surface (VUS), showing how to compute the ...
[20]
Elements of Statistical Learning: data mining, inference, and ...
The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition February 2009. Trevor Hastie, Robert Tibshirani, Jerome Friedman.
[21]
Classification by pairwise coupling - Project Euclid
... Classification by pairwise coupling. Trevor Hastie, Robert Tibshirani. Ann. Statist. 26(2): 451-471 (April 1998). DOI ...
[22]
[PDF] A Comparison of Methods for Multi-class Support Vector Machines
In this paper we will give a decomposition implementation for two such “all-together” methods: [25], [27] and [7]. We then compare their performance with three ...
[23]
Softmax Regression - Deep Learning
Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes.
[24]
1.6. Nearest Neighbors — scikit-learn 1.7.2 documentation
Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most ...
[25]
1.9. Naive Bayes — scikit-learn 1.7.2 documentation
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the “naive” assumption of conditional independence.
[26]
1.10. Decision Trees — scikit-learn 1.7.2 documentation
DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset. As with other classifiers, DecisionTreeClassifier takes as ...
[27]
[PDF] On the Algorithmic Implementation of Multiclass Kernel-based ...
Unlike most of previous approaches which typically decompose a multiclass ... Stopping criteria of decomposition methods for support vector machines: a theo-.
[28]
Multi Expression Programming for solving classification problems
Mar 16, 2022 · This paper introduces and deeply describes several strategies for solving binary and multi-class classification problems within the multi solutions per ...
[29]
https://www.jmlr.org/papers/volume7/rousu06a/rousu06a.pdf
[30]
[PDF] Kernel-Based Learning of Hierarchical Multilabel Classification ...
Abstract. We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a ...
[31]
Classification using Hierarchical Naïve Bayes models
Mar 3, 2006 · Experimental results show that the learned models can significantly improve classification accuracy as compared to other frameworks.
[32]
Hierarchical Classification of OAI Metadata Using the DDC Taxonomy
To be more specific, we automatically classify scientific documents according to the DDC taxonomy within three levels using a machine learning-based classifier ...
[33]
[PDF] Hierarchical Semantic Classification: Word Sense Disambiguation ...
In NLP the hierarchical structure of WordNet has been used to overcome sparseness data problems for estimating class distributions [Clark and Weir, 2002], and ...
[34]
https://openaccess.thecvf.com/content_iccv_2013/papers/Wang_Dynamic_Label_Propagation_2013_ICCV_paper.pdf
[35]
[PDF] Dynamic Label Propagation for Semi-supervised Multi-class Multi ...
In this paper, we have proposed a novel classification method named dynamic label propagation (DLP), which improves the discriminative power in multi-class/ ...
[36]
[PDF] 1 RANDOM FORESTS Leo Breiman Statistics Department University ...
A recent paper (Breiman [2000]) shows that in distribution space for two class problems, random forests are equivalent to a kernel acting on the true margin.
[37]
[PDF] Multi-Class Active Learning by Uncertainty Sampling with Diversity ...
In this paper, we propose a semi-supervised batch mode multi-class active learning algorithm for visual concept recognition. Our algorithm exploits the ...
[38]
Cost-sensitive learning strategies for high-dimensional and ... - NIH
Dec 24, 2021 · Essentially, cost-sensitive learning involves assigning different misclassification costs to the different classes, based on their importance ...
[39]
[PDF] SMOTE for Learning from Imbalanced Data
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered “de facto” standard in the framework of learning from imbalanced ...
[40]
[PDF] Cost-Sensitive Learning Methods for Imbalanced Data
Moreover, [26] applied synthetic minority oversampling technique (SMOTE [4]) to balance the dataset first, then built the model using SVM with different costs ...
[41]
GHOST: Adjusting the Decision Threshold to Handle Imbalanced ...
Jun 8, 2021 · In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification.
[42]
Solving the multiple instance problem with axis-parallel rectangles
This paper describes and compares three kinds of algorithms that learn axis-parallel rectangles to solve the multiple instance problem.
[43]
A Survey on Ordinal Regression: Applications, Advances and ... - arXiv
Mar 2, 2025 · In this survey, we present a comprehensive examination of advances and applications of ordinal regression.
[44]
Novel multiclass classification machine learning approach for the ...
Jan 31, 2024 · This study aims to develop a multiclass machine learning (ML) model for early-stage SARDs classification using accessible laboratory indicators.
[45]
Feature-Based Multi-Class Classification and Novelty Detection for ...
In particular, this paper uses different ML techniques for fault diagnosis and anomaly detection and evaluates them in terms of the ability to provide, in ...
[46]
[PDF] A structured overview of metrics for multi-class - Heidelberg University
Metrics like Accuracy, macro Precision, macro Recall, macro F1, Matthews Correlation Coefficient, and Kappa are used to evaluate classifiers.
[47]
[PDF] Focal Loss for Dense Object Detection - CVF Open Access
Focal Loss adds a (1 - p_t)^γ factor to cross entropy, reducing loss for well-classified examples, focusing on hard examples and addressing class imbalance.
[48]
Privacy-preserving federated learning for collaborative medical data ...
Apr 11, 2025 · This study investigates the integration of transfer learning and federated learning for privacy-preserving medical image classification