
Binary classification

Binary classification is a core task in supervised learning where the objective is to categorize input data points, represented by feature vectors in \mathbb{R}^d, into one of two distinct classes, typically labeled as -1 or +1, or 0 or 1, using a model trained on labeled examples to learn an optimal decision boundary that separates the classes in the feature space. The goal is to minimize the classification error, defined as the probability P(g(X) \neq Y) that the classifier g assigns the input X a label different from its true label Y, with the Bayes classifier achieving the lowest possible error by thresholding the regression function \eta(x) = P(Y=1 \mid X=x) at 0.5. This task forms the foundation for many practical applications, including spam email detection, where emails are classified as spam or non-spam based on word frequencies and metadata; medical diagnosis, such as distinguishing between benign and malignant tumors from imaging features; and fraud detection in financial transactions, identifying suspicious activities versus legitimate ones. Common algorithms for binary classification include logistic regression, which models class probabilities using the sigmoid function and optimizes via maximum likelihood estimation; support vector machines (SVMs), which maximize the margin between classes using hyperplanes and kernel tricks for non-linear separability; decision trees and ensemble methods like random forests, which recursively partition the feature space based on impurity measures; Naive Bayes classifiers, applying Bayes' theorem under independence assumptions; k-nearest neighbors (k-NN), predicting based on the majority vote of nearest training examples; and neural networks, which learn hierarchical representations through layered activations and backpropagation. Performance evaluation relies on metrics beyond simple accuracy, such as precision (TP / (TP + FP)), recall (TP / (TP + FN)), F1-score (2 × (precision × recall) / (precision + recall)), specificity (TN / (TN + FP)), and the area under the ROC curve, which assess trade-offs between true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) to handle imbalanced datasets effectively. In practice, binary classifiers are trained on independent and identically distributed (i.i.d.) samples and aim for consistency, where the empirical risk minimizer converges to the optimal classifier in the function class with high probability, as analyzed in convex risk minimization frameworks.
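In symbols, and as a standard restatement of the claim above rather than a formula drawn from any one source cited here, the Bayes classifier and its optimality can be written as

g^*(x) = \mathbf{1}\{\eta(x) \geq 0.5\}, \qquad R(g^*) = \mathbb{E}\left[\min\{\eta(X),\, 1 - \eta(X)\}\right] \leq R(g) \quad \text{for every classifier } g,

where R(g) = P(g(X) \neq Y) is the classification error defined above.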

Fundamentals

Definition and Overview

Binary classification is a fundamental task in supervised learning where the goal is to assign each instance from a dataset to one of two mutually exclusive categories, typically labeled as positive (1) or negative (0), based on its features. This predictive modeling approach treats the problem as learning a decision boundary that separates the two classes in the feature space, enabling the classifier to make probabilistic or deterministic predictions on new, unseen data. At its core, a binary classifier can be conceptualized as a function f: \mathcal{X} \to \{0, 1\}, where \mathcal{X} represents the input space of feature vectors, mapping inputs to class labels. The origins of binary classification trace back to early 20th-century statistical methods, notably Ronald A. Fisher's development of linear discriminant analysis in 1936, which sought to find linear combinations of features that best separate two classes in taxonomic problems. This work laid foundational principles for discriminant functions in statistics. By the 1950s, the field evolved into machine learning with the introduction of the perceptron by Frank Rosenblatt in 1958, an early model capable of learning binary decisions through supervised training on labeled examples. As a prerequisite for binary classification, supervised learning requires a dataset consisting of labeled examples, where each instance includes a vector of features (e.g., numerical or categorical attributes) paired with a binary label indicating the true class. Training typically involves splitting the dataset into training and test sets to fit the model on the former and evaluate generalization on the latter, ensuring the classifier learns patterns without overfitting. Binary classification's simplicity—focusing on just two outcomes—makes it a ubiquitous building block for more complex problems, such as multiclass classification reduced to multiple binary decisions, while finding widespread applications in areas like spam email detection, medical diagnosis of diseases (e.g., presence or absence of a condition), and fraud detection. These domains benefit from its efficiency in handling imbalanced datasets and providing interpretable decisions that directly impact real-world outcomes.
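To make the labeled-data, train/test workflow concrete, the following minimal sketch assumes a scikit-learn environment and synthetic data; the generator settings and the choice of logistic regression are illustrative assumptions, not part of the account above.

```python
# Minimal binary-classification workflow: labeled examples, train/test split, fit, evaluate.
# Illustrative sketch only; assumes scikit-learn is available and uses synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Feature vectors in R^d paired with binary labels in {0, 1}
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a test set so generalization is measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # learn f: X -> {0, 1}
print("test accuracy:", clf.score(X_test, y_test))
```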

Classification Outcomes

In binary classification, the outcomes of a model's predictions are determined by comparing the predicted label ŷ to the true label y, where both are binary values in {0, 1}, and the positive class (1) typically represents the occurrence of an event or condition of interest. These outcomes categorize whether the prediction correctly or incorrectly identifies the positive or negative class, forming the foundational building blocks for evaluating classifier performance. The four possible classification outcomes are as follows:
  • True Positive (TP): The model correctly predicts the positive class when the true class is positive. For instance, in medical diagnostics, a TP occurs when a test accurately identifies a patient with a disease, enabling timely treatment.
  • False Positive (FP): The model incorrectly predicts the positive class when the true class is negative. In the same medical context, an FP represents a false alarm, such as flagging a healthy individual as diseased, which may lead to unnecessary procedures.
  • True Negative (TN): The model correctly predicts the negative class when the true class is negative. Continuing the example, a TN is when the test correctly rules out the disease in a healthy patient, avoiding undue concern.
  • False Negative (FN): The model incorrectly predicts the negative class when the true class is positive. This is particularly critical in disease detection, as an FN might miss a sick patient, delaying intervention and potentially worsening outcomes.
These outcomes can be visually represented in a 2×2 confusion matrix, which organizes the results based on the true and predicted labels:

| | Predicted Positive (ŷ = 1) | Predicted Negative (ŷ = 0) |
|---|---|---|
| Actual Positive (y = 1) | True Positive (TP) | False Negative (FN) |
| Actual Negative (y = 0) | False Positive (FP) | True Negative (TN) |
This layout highlights the alignment or mismatch between predictions and reality without deriving any aggregate measures. Although the four outcomes are conceptually symmetric—each representing a match or mismatch in one of the two classes—their implications are often asymmetric in real-world applications due to differing costs. For example, in high-stakes scenarios like medical testing, the cost of a false negative (missing a positive case) can far exceed that of a false positive, influencing how models are designed and tuned.
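The four outcomes can be tallied directly from paired true and predicted labels; the short sketch below is a plain-Python illustration with made-up labels, not code from any cited source.

```python
# Count TP, FP, TN, FN from paired true/predicted binary labels.
def outcome_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    return tp, fp, tn, fn

y_true = [1, 0, 1, 1, 0, 0]   # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 1, 0]   # hypothetical model predictions
print(outcome_counts(y_true, y_pred))  # (2, 1, 2, 1)
```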

Performance Evaluation

Confusion Matrix and Basic Metrics

In binary classification, the confusion matrix serves as a foundational tool for summarizing the performance of a model by tabulating the counts of correct and incorrect predictions against the true labels. It is structured as a 2×2 table where the rows represent the actual classes (positive or negative) and the columns represent the predicted classes (positive or negative). The four cells correspond to true positives (TP), where the model correctly identifies positive instances; false positives (FP), where negative instances are incorrectly predicted as positive; false negatives (FN), where positive instances are missed and predicted as negative; and true negatives (TN), where negative instances are correctly identified.

To construct the confusion matrix, predictions are generated by applying the trained model to a held-out dataset, and each instance is categorized based on its true label and the model's output. For example, consider a hypothetical test set of 100 samples where the positive class is the event of interest (e.g., disease presence). Suppose the model yields 40 TP (correctly detected positives), 10 FP (false alarms on negatives), 5 FN (missed positives), and 45 TN (correctly identified negatives). These counts populate the matrix as follows:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | 40 (TP) | 5 (FN) |
| Actual Negative | 10 (FP) | 45 (TN) |
This table provides a direct visual summary of the model's decision outcomes, enabling the derivation of key performance ratios. From the confusion matrix, basic metrics are computed as ratios of these counts, offering interpretable measures of performance. Accuracy, defined as the proportion of correct predictions overall, is calculated as:

\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}

In the example, this yields \frac{40 + 45}{100} = 0.85, or 85%, indicating the model's overall correctness. The error rate, simply the complement, is 1 - \text{Accuracy} = 0.15, or 15%, representing the fraction of misclassifications.

Sensitivity, also known as recall or the true positive rate, measures the model's ability to identify positive instances and is given by:

\text{Sensitivity (Recall)} = \frac{\text{TP}}{\text{TP} + \text{FN}}

For the example, \frac{40}{40 + 5} = 0.889, showing that 88.9% of actual positives were detected. Specificity, the true negative rate, assesses performance on negative instances:

\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}

Here, \frac{45}{45 + 10} = 0.818, so 81.8% of negatives were correctly classified. These two metrics together evaluate the balance between detecting the target class and avoiding errors on the other.

Precision, or positive predictive value (PPV), quantifies the reliability of positive predictions:

\text{Precision (PPV)} = \frac{\text{TP}}{\text{TP} + \text{FP}}

In the example, \frac{40}{40 + 10} = 0.800, meaning 80% of predicted positives were truly positive. Negative predictive value (NPV) similarly evaluates negative predictions:

\text{NPV} = \frac{\text{TN}}{\text{TN} + \text{FN}}

This computes to \frac{45}{45 + 5} = 0.900, or 90% reliability for negative predictions. The false positive rate (FPR) and false negative rate (FNR) capture error proportions:

\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} = 1 - \text{Specificity}, \quad \text{FNR} = \frac{\text{FN}}{\text{TP} + \text{FN}} = 1 - \text{Sensitivity}

For this dataset, FPR = 0.182 and FNR = 0.111, highlighting the rates of specific errors. These ratios are derived directly by dividing the relevant cell counts by the marginal totals of their rows or columns in the matrix.

These metrics are empirical estimates obtained by evaluating the model on holdout data, such as a validation or test set, separate from the training data to prevent overfitting and ensure generalizability. However, accuracy can be misleading in imbalanced datasets; for instance, if 99 samples are negative and only 1 is positive, a model predicting all as negative achieves 99% accuracy but fails to detect the positive case entirely.
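The ratios above follow mechanically from the four counts; the sketch below recomputes them for the hypothetical 40/5/10/45 example used in this section.

```python
# Basic metrics from the worked example: TP=40, FN=5, FP=10, TN=45.
TP, FN, FP, TN = 40, 5, 10, 45

accuracy    = (TP + TN) / (TP + FP + FN + TN)  # 0.85
error_rate  = 1 - accuracy                     # 0.15
sensitivity = TP / (TP + FN)                   # ~0.889 (recall, TPR)
specificity = TN / (TN + FP)                   # ~0.818 (TNR)
precision   = TP / (TP + FP)                   # 0.80  (PPV)
npv         = TN / (TN + FN)                   # 0.90
fpr         = FP / (FP + TN)                   # ~0.182 = 1 - specificity
fnr         = FN / (TP + FN)                   # ~0.111 = 1 - sensitivity

print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}, "
      f"specificity={specificity:.3f}, precision={precision:.3f}")
```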

Advanced Metrics and Considerations

The receiver operating characteristic (ROC) curve plots the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 - specificity) at various thresholds, providing a threshold-independent view of a model's trade-off between detecting positives and avoiding false alarms. The area under the ROC curve (AUC) quantifies this performance as the integral of the ROC curve, ranging from 0 to 1, where an AUC of 0.5 indicates random guessing and 1.0 represents perfect discrimination; higher AUC values enable robust model comparisons across datasets. Originating in signal detection theory during the 1940s for radar applications and later formalized in psychophysics, the ROC framework was adapted to machine learning evaluations in the late 1980s, with one of the earliest adoptions by Spackman in 1989, to assess probabilistic classifiers beyond simple accuracy. For datasets with class imbalance, where the positive class is rare, the precision-recall (PR) curve offers a more informative alternative by plotting precision (positive predictive value) against recall (TPR) across thresholds, emphasizing the model's ability to handle sparse positives without dilution by the majority class. The average precision (AP) is computed as the area under the PR curve, providing a single scalar summary that prioritizes high-precision retrievals at increasing recall levels and is particularly sensitive to performance on the minority class. Beyond curve-based metrics, composite scores like the F1-score balance precision and recall through their harmonic mean, defined as
F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},
which penalizes imbalances between the two and is useful for single-threshold evaluations in balanced scenarios. The Matthews correlation coefficient (MCC) extends this by incorporating all confusion matrix quadrants equally, yielding
\text{MCC} = \frac{\text{TP} \cdot \text{TN} - \text{FP} \cdot \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}},
a value between -1 and 1 that remains balanced even under severe class imbalance or when true negatives dominate, making it preferable for comprehensive assessments.
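As a sketch of how these curve-based and composite metrics are computed in practice, the snippet below assumes scikit-learn and uses small made-up label and score vectors; the specific numbers are illustrative only.

```python
# Threshold-free (AUC, AP) and single-threshold (F1, MCC) metrics for scored predictions.
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             f1_score, matthews_corrcoef)

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                          # hypothetical labels
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90]  # continuous classifier scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]            # hard labels at the 0.5 threshold

print("ROC AUC          :", roc_auc_score(y_true, scores))
print("Average precision:", average_precision_score(y_true, scores))
print("F1 (at 0.5)      :", f1_score(y_true, y_pred))
print("MCC (at 0.5)     :", matthews_corrcoef(y_true, y_pred))
```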
Class imbalance poses challenges for standard metrics, as models may overfit to the majority class; one conceptual approach is oversampling the minority class using techniques like SMOTE, which generates synthetic examples along line segments between existing minority instances to augment the dataset without mere duplication. Alternatively, cost-sensitive learning assigns unequal penalties to false positives (FP) and false negatives (FN) during training, weighting errors based on domain-specific consequences—such as higher costs for FN in medical diagnostics—to optimize decision boundaries for asymmetric risks. To ensure metrics reflect generalization rather than overfitting, they are typically averaged across k-fold cross-validation folds, where the dataset is partitioned into k subsets, each used once as a holdout while training on the rest, yielding a stable estimate of expected performance on unseen data.
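A brief sketch of two of these remedies, cost-sensitive class weighting and stratified k-fold cross-validation, is shown below; the 1:10 weight ratio and the synthetic imbalanced dataset are assumptions chosen for illustration (SMOTE itself lives in the separate imbalanced-learn package and is omitted here).

```python
# Cost-sensitive weighting plus stratified k-fold cross-validation on an imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data with roughly 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Up-weight the rare positive class so false negatives are penalized more heavily
clf = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 10})

# Stratification preserves the class ratio in every fold; averaging stabilizes the estimate
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
print("mean F1 across folds:", scores.mean())
```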

Classification Techniques

Statistical and Probabilistic Methods

Statistical and probabilistic methods for binary classification model the probability of class membership given input features, often under assumptions about the data distribution that enable parameter estimation via likelihood maximization. These approaches, rooted in statistical theory, provide interpretable probabilistic outputs and are foundational for understanding more complex classifiers. They typically fall into generative models, which estimate the joint distribution P(x, y) by modeling P(x \mid y) and P(y), contrasting with discriminative methods that directly approximate P(y \mid x).

Logistic regression, a seminal probabilistic method, models the probability of the positive class as p(y=1 \mid x) = \frac{1}{1 + \exp(-(\beta_0 + \beta \cdot x))}, where \beta_0 is the intercept and \beta the coefficient vector, using the logit link function to map linear combinations to the [0,1] interval. Popularized by David Cox in 1958 for analyzing binary outcomes like disease presence, it assumes linearity in the logit space. Parameters are estimated via maximum likelihood estimation (MLE), maximizing the log-likelihood \sum_i [y_i \log p_i + (1-y_i) \log(1-p_i)], often optimized using gradient-based methods or the Newton-Raphson method for iterative convergence. The model's performance is commonly evaluated using log-loss, defined as -\left[ y \log p + (1-y) \log(1-p) \right], which quantifies prediction uncertainty.

Naive Bayes classifiers apply Bayes' theorem to compute posterior probabilities as P(y \mid x) \propto P(x \mid y) P(y), assuming conditional independence among features given the class label to simplify joint likelihood computation. This generative approach models class-conditional densities P(x \mid y); for continuous features, the Gaussian variant assumes each feature follows a normal distribution parameterized by class-specific means and variances. Despite the strong independence assumption, Naive Bayes excels in high-dimensional settings, such as text classification for spam filtering, where it efficiently handles sparse data with minimal computational cost. Training involves estimating the prior P(y) from class frequencies and the likelihoods from feature distributions, typically via closed-form MLE without iterative optimization.

Linear discriminant analysis (LDA), introduced by Ronald A. Fisher in 1936, assumes features follow multivariate Gaussian distributions per class with equal covariance matrices, deriving a linear decision boundary that maximizes the ratio of between-class to within-class variance. The posterior probability for class k is given by P(y=k \mid x) = \frac{\exp\left( -\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) + \log \pi_k \right)}{\sum_{j} \exp\left( -\frac{1}{2}(x - \mu_j)^T \Sigma^{-1} (x - \mu_j) + \log \pi_j \right)}, where \mu_k is the class mean, \Sigma the shared covariance matrix, and \pi_k the class prior. Parameter estimation uses MLE on pooled sample statistics, enabling dimensionality reduction alongside classification by projecting onto discriminant directions. As a generative method, LDA contrasts with discriminative alternatives by explicitly modeling the data-generation process, though it requires Gaussian assumptions for optimal performance.
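The sketch below fits the three methods discussed above side by side; the synthetic dataset and the use of scikit-learn's implementations are illustrative assumptions rather than part of the statistical exposition.

```python
# Logistic regression, Gaussian naive Bayes, and LDA compared on the same synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),  # discriminative, iterative MLE
    "Gaussian naive Bayes": GaussianNB(),                      # generative, closed-form estimates
    "LDA": LinearDiscriminantAnalysis(),                       # generative, shared covariance
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]                        # posterior P(y=1 | x)
    print(f"{name:22s} log-loss = {log_loss(y_te, p):.3f}")
```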

Thresholding Continuous Predictions

Many binary classification models that provide probabilistic outputs, such as logistic regression or support vector machines with calibration (e.g., via Platt scaling), output continuous scores s(x) \in [0,1], which represent estimated probabilities of the positive class. To convert these into binary decisions, a threshold \theta is applied such that the prediction \hat{y} = 1 if s(x) > \theta, and \hat{y} = 0 otherwise. The conventional choice is \theta = 0.5, which assumes balanced classes and equal misclassification costs.

Threshold selection strategies aim to optimize performance based on specific criteria. One common approach maximizes Youden's index, defined as J = \text{sensitivity} + \text{specificity} - 1, which identifies the threshold balancing true positive and true negative rates. In cost-sensitive scenarios, the threshold minimizes expected cost, formulated as C_{FP} \cdot \text{FPR} + C_{FN} \cdot \text{FNR}, where C_{FP} and C_{FN} denote the costs of false positives and false negatives, respectively, and FPR and FNR are the corresponding error rates.

Adjusting \theta directly impacts key metrics by trading off precision (positive predictive value) against recall (sensitivity). Lowering \theta below 0.5 typically boosts recall by classifying more instances as positive but reduces precision due to increased false positives; conversely, raising \theta enhances precision at the expense of recall. This tradeoff is often illustrated in precision-recall curves, where points correspond to different \theta values, showing how metric values shift along the curve. In imbalanced datasets, where the positive class is rare, the optimal \theta frequently shifts below 0.5 to prioritize minority-class detection and avoid overwhelming false negatives. This adjustment is particularly relevant in production systems like credit scoring, where thresholds are tuned to minimize costly errors such as approving high-risk loans (false negatives).

A related but distinct process involves binarizing continuous input features prior to modeling, converting them into binary indicators via methods like median splits (dividing at the feature's median value) or domain-specific thresholds (e.g., age > 18 for adulthood). Such transformations can simplify algorithms but may lead to information loss if not carefully chosen. For reliable thresholding, output scores should be calibrated to reflect true probabilities; Platt scaling achieves this by fitting a sigmoid (logistic) model that maps raw classifier outputs to calibrated probabilities using a held-out dataset.
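The following sketch illustrates threshold selection by maximizing Youden's index over ROC operating points; the label and score vectors are made up, and any probabilistic classifier's outputs could be substituted.

```python
# Pick the decision threshold that maximizes Youden's J = TPR - FPR.
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])   # hypothetical labels
scores = np.array([0.05, 0.20, 0.30, 0.80, 0.40, 0.70, 0.60, 0.90, 0.55, 0.10])

fpr, tpr, thresholds = roc_curve(y_true, scores)
j = tpr - fpr                          # Youden's index at each candidate threshold
theta = thresholds[np.argmax(j)]
print("threshold maximizing Youden's J:", theta)

# Binarize with the selected threshold (roc_curve uses the score >= threshold convention)
y_pred = (scores >= theta).astype(int)
```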

References

  1. Binary Classification - an overview | ScienceDirect Topics
  2. Binary Classification (PDF) | Maxim Raginsky
  3. Chapter 4: Classification (PDF) | 6.390 IntroML, Fall 2025
  4. Binary Classification (PDF) | CS 2750 Machine Learning
  5. Linear Discriminant Analysis (LDA) | DTREG Solutions
  6. The Perceptron — A Perceiving and Recognizing Automaton | Rosenblatt, 1958
  7. 10 Classification | STAT 508: Applied Data Mining and Statistical Learning
  8. Chapter 4: Binary Classification | Theory of Machine Learning
  9. Classification Algorithm in Machine Learning - Types & Examples (2025)
  10. Binary Classification Algorithm - an overview | ScienceDirect Topics
  11. Confusion Matrix - an overview | ScienceDirect Topics
  12. Confusion Matrix in Binary Classification Problems: A Step-by-Step Guide (2025)
  13. confusion_matrix | scikit-learn 1.7.2 documentation
  14. Evaluation metrics and statistical tests for machine learning | Nature, 2024
  15. Classification: Accuracy, recall, precision, and related metrics
  16. Challenges in the real world use of classification accuracy metrics (2023)
  17. The use of the area under the ROC curve in the evaluation of machine learning algorithms
  18. The history of the ROC curve | Rik Huijzer, 2024
  19. The Relationship Between Precision-Recall and ROC Curves (PDF)
  20. Evaluation in information retrieval (PDF) | Stanford NLP Group
  21. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy (2020)
  22. SMOTE: Synthetic Minority Over-sampling Technique (2002)
  23. The Foundations of Cost-Sensitive Learning (PDF) | UCSD CSE
  24. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (PDF)
  25. A comparison of logistic regression and naive Bayes | Ng and Jordan, NIPS
  26. The Regression Analysis of Binary Sequences | JSTOR
  27. A Comparison of Event Models for Naive Bayes Text Classification (PDF)
  28. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods (PDF) | John C. Platt, 1999
  29. A Gentle Introduction to Threshold-Moving for Imbalanced Classification (2021)
  30. Finding the Best Classification Threshold in Imbalanced Classification
  31. Thresholding for Making Classifiers Cost-sensitive (PDF)
  32. Precision-Recall | scikit-learn 1.7.2 documentation
  33. Estimation of optimum thresholds for binary classification (2022)
  34. A researcher's guide to regression, discretization, and median splits ... (PDF, 2015)