
Classification

In machine learning and statistics, classification is a task where a model is trained on a dataset of input features paired with known output labels to predict the category, or class, for new, unseen instances. This process involves learning a decision boundary or function that maps inputs to one of several predefined classes, enabling automated categorization and decision-making. Classification is fundamental to many applications, distinguishing it from regression, which predicts continuous values. Common classification problems include binary classification, where instances are assigned to one of two classes (e.g., spam vs. not spam in email filtering), and multi-class or multi-label variants for more than two categories or multiple labels per instance (e.g., identifying multiple objects in an image). These tasks underpin diverse fields such as medical diagnosis, natural language processing, and fraud detection, though they face challenges like class imbalance and overfitting, addressed in later sections.

Fundamentals

Definition and Scope

Classification is a core task in supervised machine learning and statistics, where the objective is to train a model on a labeled dataset to predict categories, or classes, for new, unseen instances based on their input features. In this paradigm, the model learns a mapping from feature vectors (numerical representations of the inputs) to predefined class labels by identifying patterns in the training data. This process enables the assignment of categorical outcomes, such as classifying an email as spam or not spam, distinguishing it from regression tasks that predict continuous numerical values instead. The scope of classification encompasses the exploration of decision boundaries in the feature space, which separate regions belonging to different classes, and the hypothesis space of possible functions that approximate these boundaries. Central to this task is the use of a training set consisting of input features (e.g., measurements like dimensions or attributes) paired with corresponding output labels, from which the model generalizes to minimize the misclassification error, the proportion of instances incorrectly assigned to a class. Binary classification, where instances are assigned to one of two classes, represents a foundational case within this broader framework. A classic illustrative example is the Iris dataset, which involves predicting the species of iris flowers (setosa, versicolor, or virginica) using features such as sepal length, sepal width, petal length, and petal width. This multivariate dataset, comprising 150 samples, demonstrates how classification models can learn to delineate classes from morphological measurements, providing a benchmark for evaluating the task's foundational principles.
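As a minimal sketch of this workflow on the Iris dataset (the library choice, scikit-learn, is an assumption for illustration, not prescribed by the text):

```python
# Minimal sketch: train a classifier on the Iris dataset (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the 150-sample Iris dataset: 4 features, 3 species labels.
X, y = load_iris(return_X_y=True)

# Hold out unseen instances to estimate generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Learn a mapping from feature vectors to class labels.
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# Misclassification error: fraction of incorrectly assigned test instances.
error = (model.predict(X_test) != y_test).mean()
print(f"misclassification error: {error:.3f}")
```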

Historical Development

The origins of classification techniques in machine learning and statistics trace back to early 20th-century efforts in statistical discrimination, particularly Ronald A. Fisher's introduction of linear discriminant analysis in 1936. Fisher developed this method to classify iris species using multiple morphological measurements, aiming to find linear combinations of variables that maximize the separation between classes while minimizing within-class variance. This approach laid foundational principles for supervised classification by emphasizing statistical separability in high-dimensional data.

In the mid-20th century, the field advanced with the emergence of computational models inspired by biological systems. Frank Rosenblatt's perceptron, proposed in 1958, represented an early neural network architecture capable of learning decision boundaries through supervised training on patterns. This innovation marked a shift toward algorithmic learning, influencing the development of neural networks by demonstrating how machines could adaptively classify inputs based on weighted connections. Concurrently, David R. Cox formalized logistic regression in 1958 as a probabilistic framework for modeling binary outcomes, using the logistic function to estimate class probabilities from linear predictors and enabling maximum likelihood estimation for classification tasks.

The 1960s through 1980s saw the maturation of tree-based methods and ensemble precursors amid fluctuating interest in artificial intelligence. Decision trees gained prominence as interpretable classifiers, with the Classification and Regression Trees (CART) algorithm introduced by Leo Breiman and colleagues in 1984, which recursively partitions data based on feature splits to minimize impurity measures like the Gini index for both classification and regression. This period was punctuated by AI winters (funding droughts from 1974–1980 and 1987–1993) that slowed research but sustained progress in statistical pattern recognition, as researchers focused on robust, non-connectionist methods less prone to the computational limitations of the era.

From the 1990s onward, classification methodologies shifted toward kernel-based and ensemble techniques, revitalizing the field during AI's resurgence. Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik presented support vector machines in 1992, formulating classification as an optimization problem to find the maximum-margin hyperplane separating classes, which proved highly effective for non-linear problems via kernel tricks. Building on tree ensembles, Breiman's random forests algorithm, introduced in 2001, combined multiple decision trees trained on bootstrapped samples with random feature subsets, reducing overfitting and improving accuracy in classification tasks through bagging and randomized feature selection. These developments underscored a trend toward scalable, high-performance classifiers integral to modern machine learning.

Problem Types

Binary Classification

Binary classification is a fundamental problem in machine learning where the task is to assign input instances to one of two mutually exclusive classes, often denoted as 0 (negative) and 1 (positive). The formulation involves learning a mapping from feature vectors \mathbf{x} to class labels y \in \{0, 1\} using a training dataset of labeled examples. Models typically output either discrete class labels or continuous scores, such as probabilities, which are then thresholded to produce final binary decisions. This setup is prevalent in applications requiring yes/no outcomes, such as spam detection or medical diagnosis.

The decision-making process in binary classification commonly relies on a threshold applied to the estimated posterior probability P(y=1 \mid \mathbf{x}). For instance, an instance is classified as positive if P(y=1 \mid \mathbf{x}) > 0.5, assuming equal misclassification costs; otherwise, it is assigned to the negative class. This rule derives from Bayesian decision theory, minimizing expected risk under the 0-1 loss framework.

Prominent datasets for binary classification include the Breast Cancer Wisconsin (Diagnostic) dataset, which contains 569 instances with 30 features derived from digitized images of breast mass, used to distinguish malignant from benign tumors. Another example is the Titanic dataset, comprising passenger records to predict survival (1) or non-survival (0) based on features like age, sex, and ticket class, with 891 training instances. These datasets illustrate real-world binary tasks and are widely used for benchmarking due to their accessibility and relevance.

Class imbalance, where either the positive or negative class dominates the dataset, is a frequent challenge in binary classification, potentially biasing models toward the majority class. Initial mitigation strategies involve data-level techniques such as oversampling the minority class to increase its representation or undersampling the majority class to reduce it, thereby balancing the dataset before model fitting.

A core evaluation metric for binary classifiers is the 0-1 loss, defined as:

L(y, \hat{y}) = \begin{cases} 1 & \text{if } y \neq \hat{y}, \\ 0 & \text{otherwise}. \end{cases}

This loss directly quantifies misclassifications, with the empirical risk being the fraction of errors on the test set. Binary classification serves as a building block for multi-class problems through one-vs-all decomposition, training separate classifiers for each class against the rest; a sketch of the thresholding rule and 0-1 loss follows.
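A minimal sketch of the thresholding rule and empirical 0-1 risk, using made-up illustrative scores:

```python
import numpy as np

# Hypothetical predicted probabilities P(y=1 | x) from some classifier,
# alongside the true labels; values here are illustrative only.
probs = np.array([0.92, 0.35, 0.61, 0.08, 0.74])
y_true = np.array([1, 0, 0, 0, 1])

# Threshold rule: positive if P(y=1 | x) > 0.5 (equal misclassification costs).
y_pred = (probs > 0.5).astype(int)

# Empirical risk under the 0-1 loss: fraction of misclassified instances.
zero_one_losses = (y_pred != y_true).astype(int)
empirical_risk = zero_one_losses.mean()
print(y_pred, empirical_risk)  # [1 0 1 0 1] 0.2
```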

Multi-Class and Multi-Label Classification

Multi-class classification generalizes the binary case to problems involving K > 2 mutually exclusive classes, where each instance is assigned exactly one label from the set. This setup is common in tasks like digit recognition, where distinguishing between more than two categories requires adaptations to binary-focused algorithms.

To address multi-class problems using binary classifiers, decomposition strategies are widely used, including one-versus-all (OvA) and one-versus-one (OvO). In OvA, also known as one-versus-rest, K separate binary classifiers are trained; each treats one class as positive and all others as negative, with the class corresponding to the highest confidence score selected as the prediction. The OvO approach, in contrast, constructs a binary classifier for every pair of classes, resulting in \binom{K}{2} classifiers, and aggregates predictions via majority voting to determine the final class. OvA is computationally efficient for training but can suffer from class imbalance in the "rest" group, while OvO reduces imbalance but the number of models grows quadratically with K.

For probabilistic multi-class models, such as multinomial logistic regression or neural network outputs, the softmax function normalizes raw scores (logits) into a probability distribution over the K classes, ensuring the probabilities sum to 1. The softmax is defined as:

P(y = k \mid \mathbf{x}) = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)},

where z_k is the logit for class k, and \mathbf{x} is the input feature vector. This function, originating from early neural network interpretations, facilitates maximum likelihood estimation and interpretable outputs. A classic benchmark for multi-class classification is the MNIST dataset, comprising 70,000 grayscale images of handwritten digits labeled into 10 classes, on which modern methods often achieve over 99% accuracy.

Multi-label classification extends further by permitting multiple, non-exclusive labels per instance, modeling settings in which an example such as an image might simultaneously belong to several categories (e.g., "outdoors" together with one or more object labels). Unlike multi-class classification, labels may be independent or correlated, leading to an output space of 2^L possible combinations for L labels, which poses unique challenges in prediction and evaluation. A foundational decomposition method is binary relevance (BR), which simplifies the problem by training L independent binary classifiers, one per label, ignoring inter-label correlations, though extensions like classifier chains address this limitation. Hamming loss serves as a key metric for multi-label evaluation, quantifying the average fraction of labels incorrectly predicted across all instances and labels; it ranges from 0 (perfect) to 1 and is particularly useful for imbalanced label distributions.

Representative datasets for multi-label tasks include MS-COCO, which features over 330,000 images annotated with multiple object categories from 80 classes, enabling applications in scene understanding. Another example is the MovieLens dataset, where movies receive multiple user-applied tags from thousands of categories, supporting recommendation systems with multi-faceted descriptions. These datasets highlight the scalability issues in multi-label settings, where binary relevance often provides a strong baseline despite its independence assumption.
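A minimal numeric sketch of the softmax normalization, using illustrative logits:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Normalize raw scores into a probability distribution over K classes."""
    # Subtract the max logit for numerical stability; the result is unchanged
    # because softmax is invariant to constant shifts of the logits.
    shifted = logits - np.max(logits)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Illustrative logits for K = 3 classes (made-up values).
z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())  # probabilities sum to 1; the largest logit gets the most mass
```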

Algorithms and Techniques

Discriminative Methods

Discriminative methods in classification focus on learning a direct mapping from input features to class labels by estimating the conditional probability P(y \mid x) or the decision boundaries that separate classes, without explicitly modeling the underlying data-generation process P(x \mid y). These approaches prioritize boundary optimization for effective separation, often yielding superior performance on complex datasets compared to generative alternatives when sufficient training data is available.

Linear Classifiers

Linear classifiers form a foundational class of discriminative models that assume a linear decision boundary in the feature space. Logistic regression, a prominent example, models the log-odds of class membership as a linear function of the inputs:

\log \left( \frac{P(y=1 \mid x)}{P(y=0 \mid x)} \right) = \beta_0 + \beta \cdot x,

where \beta_0 is the intercept and \beta are the coefficients. The model applies the logistic (sigmoid) function to map this linear score to probabilities between 0 and 1. Parameters are estimated by maximizing the log-likelihood of the observed data, typically via gradient descent or iteratively reweighted least squares, which minimizes the cross-entropy loss. Originally developed for bio-assay applications, logistic regression has become a staple in machine learning for binary classification due to its interpretability and efficiency.
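As a minimal sketch, the sigmoid maps the linear score \beta_0 + \beta \cdot x to a probability; the coefficient values below are illustrative, not fitted:

```python
import numpy as np

def sigmoid(t: np.ndarray) -> np.ndarray:
    """Logistic function: maps a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# Illustrative, made-up parameters for a two-feature model.
beta0 = -1.0                   # intercept
beta = np.array([0.8, -0.4])   # feature coefficients
x = np.array([2.0, 1.0])       # one input instance

# Log-odds (linear score), then the predicted probability P(y=1 | x).
score = beta0 + beta @ x
p = sigmoid(score)
print(score, p)  # score 0.2 -> probability ~0.55
```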

Support Vector Machines

Support Vector Machines (SVMs) seek the optimal separating hyperplane that maximizes the margin between classes, enhancing generalization by focusing on the most critical data points near the boundary, known as support vectors. The soft-margin objective minimizes the squared norm of the weight vector plus a penalty for margin violations:

\min_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^n \max(0, 1 - y_i (\mathbf{w} \cdot \mathbf{x}_i + b)),

where C controls the trade-off between margin maximization and classification error. For non-linearly separable data, the kernel trick implicitly maps inputs to a higher-dimensional space using functions like the radial basis function (RBF) kernel, K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2), enabling complex boundary construction without explicit feature transformation. Introduced in the early 1990s, SVMs excel in high-dimensional settings, such as text and image classification, due to their robustness to overfitting.

A classic illustration of the SVM's non-linear capability is the XOR problem, where inputs (0,0) and (1,1) map to one class, and (0,1) and (1,0) to another, rendering linear separation impossible. Using an RBF kernel, the SVM maps the data to a space where a linear hyperplane achieves perfect separation, demonstrating the kernel trick's power for handling such exclusive-or patterns.
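A minimal sketch of the XOR example with scikit-learn's SVC (the library and the hyperparameter choices, gamma=1.0 and a large C approximating a hard margin, are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# The four XOR points: (0,0) and (1,1) in class 0; (0,1) and (1,0) in class 1.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y = np.array([0, 0, 1, 1])

# The RBF kernel lets a linear hyperplane in the induced feature space
# separate a pattern that is not linearly separable in the input space.
clf = SVC(kernel="rbf", gamma=1.0, C=1e6).fit(X, y)

print(clf.predict(X))   # [0 0 1 1]: all four points classified correctly
print(clf.score(X, y))  # 1.0
```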

Decision Trees

Decision trees construct hierarchical decision boundaries through recursive partitioning of the feature space, selecting splits that best separate classes based on criteria like Gini impurity or information entropy. Gini impurity measures the probability of misclassifying a randomly chosen element, favoring splits that minimize it:

\text{Gini} = 1 - \sum_{k=1}^K p_k^2,

where p_k is the proportion of class k in the node. Entropy, alternatively, quantifies uncertainty as

H = -\sum_{k=1}^K p_k \log_2 p_k,

with splits chosen to maximize information gain. To prevent overfitting, post-pruning techniques remove branches that do not significantly improve validation performance, balancing tree complexity with accuracy. Pioneered in algorithms like ID3 and CART, decision trees offer intuitive, interpretable models suitable for tabular data.
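A small sketch computing both impurity measures for a node, using illustrative class counts:

```python
import numpy as np

def gini(counts: np.ndarray) -> float:
    """Gini impurity of a node given per-class counts."""
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(counts: np.ndarray) -> float:
    """Shannon entropy (bits) of a node given per-class counts."""
    p = counts / counts.sum()
    p = p[p > 0]  # avoid log2(0); zero-probability classes contribute nothing
    return float(-np.sum(p * np.log2(p)))

# Node holding 40 instances of class A and 10 of class B (made-up counts).
node = np.array([40, 10])
print(gini(node), entropy(node))  # 0.32, ~0.722
```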

Probabilistic Approaches

Probabilistic approaches to classification focus on generative models that estimate the underlying probability distributions of the data, enabling the computation of posterior probabilities for class assignments. These methods model the joint probability of features and classes, contrasting with discriminative techniques by prioritizing the generation of data rather than direct boundary optimization. By leveraging Bayes' theorem, they provide a principled framework for handling uncertainty and incorporating prior knowledge.

Bayesian classifiers form the foundation of these approaches, applying Bayes' theorem to compute the posterior probability of a class given the observed features. The theorem states:

P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y) P(y)}{P(\mathbf{x})}

where P(y) is the prior probability of the class, P(\mathbf{x} \mid y) is the likelihood of the features given the class, and P(\mathbf{x}) is the marginal probability of the features, often serving as a normalizing constant. In practice, classification assigns the instance to the class maximizing this posterior, known as maximum a posteriori (MAP) estimation. This full Bayesian formulation allows for exact inference in simple cases but can be computationally intensive for complex distributions.

The naive Bayes classifier simplifies Bayesian methods by assuming conditional independence among features given the class, which greatly reduces computational demands while often yielding robust performance despite the strong assumption. Under this assumption, the posterior simplifies to:

P(y \mid \mathbf{x}) \propto P(y) \prod_{i=1}^n P(x_i \mid y)

where \mathbf{x} = (x_1, \dots, x_n) represents the feature vector. This approximation enables efficient parameter estimation from training data, typically using maximum likelihood for priors and conditionals. Naive Bayes excels in high-dimensional settings, such as text classification, and remains competitive even when the independence assumption does not hold perfectly.

Variants of naive Bayes adapt to different data types. Gaussian naive Bayes assumes continuous features follow a normal distribution within each class, modeling P(x_i \mid y) as a Gaussian density with class-specific mean and variance estimated from the training data. This variant is suitable for real-valued features, such as physical measurements, and has been shown to improve over discretization-based approximations in continuous domains. For discrete or count-based features, like word frequencies in documents, multinomial naive Bayes uses a multinomial distribution, treating features as term occurrences drawn from a vocabulary.

Hidden Markov Models (HMMs) extend probabilistic classification to sequential data, modeling observations as arising from a hidden Markov process where states represent latent classes and transitions capture dependencies over time. In HMMs, the probability of a sequence is computed via the forward algorithm, and classification often involves finding the most likely state sequence using the Viterbi algorithm. HMMs have been pivotal in applications requiring temporal modeling, such as speech recognition, where acoustic features are classified into phonetic units across utterances.

A representative application is spam detection, where multinomial naive Bayes classifies emails based on word frequency counts from the bag-of-words representation. Training involves estimating class priors from labeled spam and legitimate messages, then computing conditional probabilities for vocabulary terms. This approach effectively discriminates spam by leveraging term likelihoods, achieving high accuracy on email corpora with minimal computational overhead.
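A minimal sketch of this spam-filtering setup with scikit-learn; the library and the toy four-message corpus are assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled corpus (made up): 1 = spam, 0 = legitimate.
emails = [
    "win money now claim your free prize",
    "meeting agenda for the project review",
    "free offer win cash prize now",
    "lunch tomorrow to discuss the report",
]
labels = [1, 0, 1, 0]

# Bag-of-words representation: per-document word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Multinomial naive Bayes: class priors and per-term conditional
# probabilities are estimated from the labeled counts.
clf = MultinomialNB().fit(X, labels)

test = vectorizer.transform(["claim your free cash prize"])
print(clf.predict(test))        # expected: [1] (spam)
print(clf.predict_proba(test))  # posterior probabilities per class
```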

Model Evaluation

Core Metrics

Core metrics for evaluating classification models provide straightforward measures of performance based on comparisons between predicted and actual labels. These metrics are derived from the confusion matrix, a fundamental tool that summarizes prediction errors and correct classifications across all instances in a dataset. For binary classification, the confusion matrix is a 2x2 table, while for multi-class problems it extends to a k×k matrix, where k is the number of classes. Each entry in the matrix represents counts of instances: true positives (TP) for correct positive predictions, true negatives (TN) for correct negative predictions, false positives (FP) for incorrect positive predictions, and false negatives (FN) for incorrect negative predictions. In multi-class settings, TP, TN, FP, and FN are generalized by summing over the diagonal for correct predictions and off-diagonal entries for errors, often computed per class or in aggregated forms. The concept of the confusion matrix, originally termed a contingency table, traces back to statistical analysis in the early 20th century, but its specific application to classifier evaluation was popularized in the late 1990s.

Accuracy, one of the simplest core metrics, quantifies the proportion of correct predictions overall and is calculated as:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

This metric is intuitive for balanced datasets but can be misleading in imbalanced scenarios, where it may overstate performance by favoring the majority class. The error rate, conversely, measures the proportion of incorrect predictions and is simply the complement of accuracy:

\text{Error Rate} = 1 - \text{Accuracy} = \frac{FP + FN}{TP + TN + FP + FN}

These rates provide a high-level view of model reliability, with seminal discussions in machine learning evaluation emphasizing their role in initial assessments.

Building on the confusion matrix, precision and recall offer class-specific insights, particularly useful for understanding trade-offs in positive predictions. Precision, also known as positive predictive value, is the ratio of true positives to the total predicted positives:

\text{Precision} = \frac{TP}{TP + FP}

It indicates the reliability of positive predictions, minimizing false positives. Recall, or sensitivity, measures the ratio of true positives to the total actual positives:

\text{Recall} = \frac{TP}{TP + FN}

This focuses on capturing all relevant instances, reducing false negatives. Originating in information retrieval literature from the mid-20th century, these metrics were adapted to classification tasks to evaluate how well models identify relevant cases amid noise. In multi-class problems, precision and recall are typically computed per class using one-vs-rest strategies, then averaged (e.g., macro or micro averaging) for an overall score.

To contextualize these metrics, baseline comparisons are essential; a common baseline is the majority-class accuracy, where the model simply predicts the most frequent class for all instances, yielding an accuracy equal to the proportion of that class in the dataset. This trivial baseline helps gauge whether a classifier provides meaningful improvement over random or naive strategies, as highlighted in foundational machine learning glossaries. For instance, on a dataset with 90% negative samples, a majority baseline achieves 90% accuracy, underscoring the need for metrics beyond raw accuracy in skewed distributions. Advanced techniques, such as receiver operating characteristic (ROC) curves, build on these basics for threshold-independent analysis but are explored separately.
| Metric | Formula | Interpretation |
| --- | --- | --- |
| Accuracy | \frac{TP + TN}{TP + TN + FP + FN} | Overall correctness proportion |
| Error Rate | \frac{FP + FN}{TP + TN + FP + FN} | Proportion of misclassifications |
| Precision | \frac{TP}{TP + FP} | Accuracy of positive predictions |
| Recall | \frac{TP}{TP + FN} | Coverage of actual positives |
| Majority Baseline | Proportion of largest class | Simplest non-informative predictor |
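A brief sketch deriving these metrics from binary predictions; the labels are illustrative, and scikit-learn's confusion_matrix is assumed for the counts:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative true and predicted labels for a binary task.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
error_rate = 1 - accuracy
precision = tp / (tp + fp)
recall = tp / (tp + fn)
majority_baseline = max(np.mean(y_true), 1 - np.mean(y_true))

print(accuracy, error_rate, precision, recall, majority_baseline)
```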

Advanced Assessment Techniques

Advanced assessment techniques in classification extend beyond basic metrics by incorporating threshold variations, handling class imbalances, and evaluating probabilistic outputs to provide a more nuanced understanding of model performance. These methods are particularly valuable in scenarios where simple accuracy or error rates fail to capture trade-offs under class imbalance or the reliability of probabilistic predictions.

The receiver operating characteristic (ROC) curve is a graphical tool that evaluates a classifier's performance across all possible classification thresholds by plotting the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 - specificity). This curve allows visualization of the model's ability to discriminate between classes, with the area under the ROC curve (AUC-ROC) serving as a single scalar summary: an AUC of 0.5 indicates random guessing, while 1.0 represents perfect separation. Originating from signal detection theory and adapted for machine learning, ROC analysis is robust to threshold selection and class distribution changes, making it suitable for comparing models irrespective of operating points.

For datasets with severe class imbalance, where positive instances are rare, the precision-recall (PR) curve offers a more informative alternative to the ROC curve by plotting precision (positive predictive value) against recall (TPR) at varying thresholds. The area under the PR curve (AUC-PR) emphasizes the model's performance on the minority class, as precision and recall both focus on true positives relative to false positives and false negatives. A related scalar metric, the F1-score, balances these by computing the harmonic mean of precision and recall, defined as
F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},
which penalizes models that excel in one but fail in the other, and is particularly useful for imbalanced settings or when equal importance is assigned to precision and recall.
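A short sketch computing AUC-ROC and F1 from scores and labels; the data are illustrative and scikit-learn is assumed:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

# Illustrative labels and classifier scores (made-up values).
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# AUC-ROC is threshold-free: it is computed from the raw scores directly.
auc = roc_auc_score(y_true, scores)

# The F1-score requires hard labels, so a threshold must be chosen first.
y_pred = (scores >= 0.5).astype(int)
f1 = f1_score(y_true, y_pred)

print(auc, f1)  # ~0.889, 0.8
```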
To obtain robust estimates of model performance and reduce overfitting risks associated with single train-test splits, resampling techniques like k-fold cross-validation and bootstrapping are employed. In k-fold cross-validation, the dataset is partitioned into k equal-sized folds, with the model trained on k-1 folds and tested on the remaining one iteratively; the average performance across folds provides an unbiased estimate, with 10-fold often recommended for real-world datasets due to its balance of variance and bias. Bootstrapping complements this by repeatedly sampling the dataset with replacement to generate multiple training sets, enabling variance estimation through the distribution of performance scores on out-of-bag samples, which is especially effective for small datasets or assessing confidence intervals. Probabilistic classifiers output confidence scores rather than hard labels, necessitating calibration to ensure predicted probabilities align with true outcome frequencies. Calibration assesses this reliability, for instance, by verifying that samples with a predicted probability of 0.8 exhibit the event approximately 80% of the time, often visualized via reliability diagrams. The measures the accuracy of probabilistic predictions as the mean squared difference between predicted probabilities and actual binary outcomes,
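A compact sketch of 10-fold cross-validation; scikit-learn is assumed, and the dataset and model here are placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder dataset and model for illustration.
X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 10-fold cross-validation: train on 9 folds, test on the held-out fold,
# repeating so every fold serves as the test set exactly once.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(scores.mean(), scores.std())  # average estimate and its spread
```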
BS = \frac{1}{N} \sum_{i=1}^N (f_i - o_i)^2,
where f_i is the predicted probability and o_i the observed outcome for instance i; lower scores indicate better probabilistic accuracy, with 0 representing perfect predictions. In practice, techniques like Platt scaling or isotonic regression are applied post-training to improve calibration without altering the underlying classifier.
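A minimal sketch of the Brier score computation, using illustrative values:

```python
import numpy as np

def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared difference between predicted probabilities and outcomes."""
    return float(np.mean((probs - outcomes) ** 2))

# Illustrative predicted probabilities and observed binary outcomes.
f = np.array([0.9, 0.2, 0.7, 0.4])
o = np.array([1, 0, 1, 1])
print(brier_score(f, o))  # (0.01 + 0.04 + 0.09 + 0.36) / 4 = 0.125
```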

Applications and Challenges

Practical Uses

In healthcare, classification techniques play a pivotal role in automating disease diagnosis, enhancing accuracy and efficiency in clinical settings. A prominent example is the use of convolutional neural networks (CNNs) to detect diabetic retinopathy from retinal fundus photographs, where a deep learning algorithm demonstrated 87.0% sensitivity at 98.5% specificity on the Messidor-2 dataset (high-specificity operating point) and 97.5% sensitivity at 93.4% specificity on the EyePACS-1 dataset (high-sensitivity operating point), as reported in a 2016 validation study. This approach allows for early intervention in diabetic patients, reducing the burden on ophthalmologists and improving outcomes in resource-limited environments.

In the financial sector, classification models are essential for credit scoring and fraud detection, processing vast transaction datasets to flag anomalies in real time. For credit card fraud detection, classifiers such as random forests and neural networks achieve high detection accuracies, as evidenced in comprehensive surveys of deployed systems that handle imbalanced data through techniques like resampling. These applications minimize financial losses, with global fraud prevention systems safeguarding billions in annual transactions.

Natural language processing leverages classification for tasks like sentiment analysis, categorizing user-generated text to gauge opinions and emotions. Seminal work on movie reviews introduced machine learning methods that classify sentiments as positive or negative with accuracies around 80-90%, laying the foundation for modern applications in customer feedback analysis across e-commerce and social media.

In computer vision, classification enables object detection critical to autonomous vehicles, where models identify and categorize elements like pedestrians, vehicles, and traffic signs from camera feeds. The YOLO framework, designed for real-time performance, processes images at over 45 frames per second while classifying objects with a mean average precision of 63.4% on the PASCAL VOC 2007 benchmark dataset, supporting safe navigation in dynamic environments.

Classification also underpins recommendation systems, such as Netflix's, by tagging content into genres and predicting user preferences through multi-label approaches integrated with collaborative filtering. This results in personalized suggestions that drive about 75% of viewer activity, optimizing discovery and retention. In recent years, transformer-based models have advanced classification applications, such as multimodal healthcare tasks combining text and images for disease prediction, achieving state-of-the-art performance as of 2024.

Common Issues and Solutions

One prevalent challenge in classification tasks is class imbalance, where the minority class is underrepresented relative to the majority class, leading to biased models that perform poorly on rare but critical instances. To address this, the Synthetic Minority Over-sampling Technique (SMOTE) generates synthetic examples of the minority class by interpolating between existing minority instances and their nearest neighbors, thereby balancing the dataset without the simple duplication that could exacerbate overfitting. Additionally, cost-sensitive learning assigns higher misclassification costs to minority-class errors during training, prompting the model to prioritize their correct prediction through modified loss functions; see the sketch at the end of this section.

Overfitting occurs when a classification model learns noise and idiosyncrasies in the training data, resulting in strong performance on training sets but poor generalization to unseen data. Regularization techniques mitigate this by adding penalty terms to the loss function; L2 regularization (ridge) constrains the magnitude of model weights via their squared sum, while L1 regularization (lasso) promotes sparsity by penalizing the absolute sum of weights, both discouraging excessive complexity. Early stopping complements these by halting training when validation performance begins to degrade, typically monitored via a patience parameter that tracks epochs without improvement.

The curse of dimensionality arises in high-dimensional feature spaces, where data becomes sparse, increasing computational demands and the risk of irrelevant features dominating the model. Feature selection methods, such as filter-based approaches (e.g., chi-squared tests) or wrapper methods (e.g., recursive feature elimination), identify and retain only the most informative features to reduce noise. Dimensionality reduction techniques like principal component analysis (PCA) transform the original features into a lower-dimensional space by projecting data onto principal components that capture maximum variance, preserving essential information while alleviating sparsity.

Interpretability issues stem from complex models like neural networks, which act as black boxes, making it difficult to understand decision processes and trust predictions in high-stakes domains. Simpler models, such as decision trees, offer inherent interpretability through their hierarchical structure of if-then rules, allowing inspection of decision paths and feature importance, often at the cost of slightly lower accuracy compared to neural networks.

Ethical concerns in classification frequently involve bias in training data, which propagates unfair predictions, as seen in facial recognition systems where datasets skewed toward lighter skin tones lead to higher error rates for darker-skinned individuals. Remedies include auditing datasets for demographic representation and applying debiasing techniques, such as reweighting samples or adversarial training to minimize disparate impacts across groups.
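Returning to the imbalance remedies above: a minimal sketch of SMOTE-based oversampling, assuming the third-party imbalanced-learn library, alongside a cost-sensitive alternative via class weighting in scikit-learn:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE  # imbalanced-learn package (assumed)

# Synthetic imbalanced dataset: roughly 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # e.g. {0: ~900, 1: ~100}

# Data-level remedy: SMOTE interpolates new minority samples
# between existing minority instances and their nearest neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # classes now balanced

# Algorithm-level alternative: cost-sensitive learning via class weights,
# which raises the penalty on minority-class errors instead of resampling.
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```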

    Abstract. Validation can be used to detect when overfitting starts dur- ing supervised training of a neural network; training is then stopped.