Precision and recall

Precision and recall are fundamental performance metrics in information retrieval and machine learning, particularly for evaluating classification and search systems. Precision quantifies the accuracy of a retrieval or classification system by measuring the fraction of retrieved items that are relevant, calculated as P = \frac{tp}{tp + fp}, where tp denotes true positives and fp false positives. Recall, also known as sensitivity, assesses completeness by measuring the fraction of relevant items that are successfully retrieved, given by R = \frac{tp}{tp + fn}, where fn represents false negatives. These metrics originated in the evaluation of information retrieval systems in the mid-20th century, with early formalization by Kent et al. in 1955, and have since become standard for assessing classification models where class imbalance or the cost of errors varies.

In information retrieval, precision and recall evaluate how well a search system returns relevant documents from a collection in response to a query, using test collections with predefined relevance judgments. A key trade-off exists between the two: efforts to maximize recall, such as retrieving more documents, often reduce precision by including irrelevant results, and vice versa, leading to precision-recall curves that visualize this balance across varying thresholds. The F1-score, the harmonic mean of precision and recall (F_1 = 2 \frac{P \cdot R}{P + R}), provides a single composite measure balancing both when equal importance is desired, as introduced by van Rijsbergen in 1979.

In machine learning, precision and recall are applied to binary classifiers to address limitations of accuracy on imbalanced datasets, where one class (e.g., positives) is rarer. High precision minimizes false positives, crucial in applications like spam detection to avoid misclassifying legitimate emails, while high recall minimizes false negatives, vital in medical diagnostics to ensure few cases are missed. The precision-recall curve, often preferred over ROC curves for imbalanced data, yields the area under the curve (AUC-PR) as a robust summary of model performance. These metrics extend to multi-class and multi-label problems via macro- or micro-averaging, enabling comprehensive evaluation across diverse domains.
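
The basic quantities can be computed directly from confusion-matrix counts. The following Python sketch is a minimal illustration of the formulas above; the counts are arbitrary example values, not taken from any particular dataset.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive: tp / (tp + fp)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that are retrieved: tp / (tp + fn)."""
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Arbitrary example counts from a hypothetical classifier.
tp, fp, fn = 80, 20, 40

p, r = precision(tp, fp), recall(tp, fn)
print(f"precision={p:.2f}, recall={r:.2f}, F1={f1(p, r):.2f}")
# precision=0.80, recall=0.67, F1=0.73
```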

Fundamental Concepts

Definition of Precision

Precision is a key performance metric in binary classification tasks, evaluating the accuracy of a model's positive predictions by measuring the proportion of true positives among all instances predicted as positive. This metric emphasizes the reliability of positive classifications, helping to assess how often a positive prediction is correct, which is crucial in applications where false positives carry significant costs, such as fraud detection or disease screening. Formally, precision is defined using elements from the confusion matrix as: \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} where TP represents true positives (correctly predicted positives) and FP represents false positives (incorrectly predicted positives). This formulation highlights precision's focus on the purity of the positive class predictions. To illustrate, consider a spam detection classifier applied to a dataset of emails. Suppose the model predicts 100 emails as spam, with 80 of them actually being spam (TP = 80) and 20 being legitimate (FP = 20). The precision is then calculated as 80 / (80 + 20) = 0.80, or 80%, indicating that 80% of the predicted spam emails were correctly identified. This can be visualized using a confusion matrix:
|                 | Predicted Spam | Predicted Not Spam |
|-----------------|----------------|--------------------|
| Actual Spam     | TP = 80        | FN = (unknown)     |
| Actual Not Spam | FP = 20        | TN = (unknown)     |
Precision depends only on the predicted positive column and remains invariant to changes in true negatives or false negatives. The concept of precision originated in information retrieval during the 1950s, introduced by Kent et al. in their foundational work on operational criteria for designing information retrieval systems using machine literature searching. It was adopted and formalized as a core evaluation metric in machine learning classification by the 1990s, aligning with the growth of data-driven predictive models. Precision is often evaluated alongside recall, the complementary metric assessing the coverage of actual positive instances.
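
The same computation can be done with scikit-learn; a minimal sketch, assuming labels encoded as 1 for spam and 0 for legitimate, that mirrors the example above:

```python
import numpy as np
from sklearn.metrics import precision_score

# 100 emails predicted as spam: 80 truly spam (TP) and 20 legitimate (FP).
y_pred = np.ones(100, dtype=int)
y_true = np.array([1] * 80 + [0] * 20)

print(precision_score(y_true, y_pred))  # 0.8
```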

Definition of Recall

Recall, also known as sensitivity or the true positive rate, is a performance metric in binary classification that measures the proportion of actual positive instances correctly identified as positive by a classifier. It quantifies the model's ability to capture all relevant positives, emphasizing the minimization of false negatives. Formally, recall is defined as the ratio of true positives (TP) to the total number of actual positives, which includes both true positives and false negatives (FN): \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} This formula, derived from the confusion matrix, ranges from 0 to 1, where a value of 1 indicates perfect identification of all positives. In the context of disease diagnosis, recall assesses how effectively a diagnostic test identifies patients with the condition. For instance, in a study evaluating prostate-specific antigen (PSA) density ≥0.08 ng/mL/cc for clinically significant prostate cancer, the recall (sensitivity) was 98%, calculated as 489 true positives divided by (489 true positives + 10 false negatives), meaning 98% of patients with the disease were correctly detected. In machine learning and statistical applications, recall is particularly valued in scenarios where missing positives is costly, such as medical screening, and it complements precision by focusing on coverage rather than the avoidance of false positives.
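
A minimal sketch reproducing the diagnostic example with scikit-learn, using synthetic labels for the 499 patients who actually have the condition:

```python
import numpy as np
from sklearn.metrics import recall_score

# 499 patients with the condition: 489 detected (TP) and 10 missed (FN).
y_true = np.ones(499, dtype=int)
y_pred = np.array([1] * 489 + [0] * 10)

print(recall_score(y_true, y_pred))  # ≈ 0.98
```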

Precision-Recall Trade-off

In probabilistic classification models, such as logistic regression, the precision-recall trade-off emerges when adjusting the decision threshold applied to the model's predicted probabilities. Raising the threshold classifies fewer instances as positive, which reduces false positives and thereby increases precision, but it also increases false negatives, decreasing recall. Conversely, lowering the threshold expands positive classifications, improving recall by capturing more true positives at the cost of additional false positives and reduced precision. This inverse relationship is inherent to threshold-based classifiers and requires careful tuning to balance the relative costs of prediction errors. For illustration, consider a model evaluated at the default threshold of 0.5 that achieves a precision of 0.7 and a recall of 0.8. Increasing the threshold to 0.8 might shift the balance to a precision of 0.9 but reduce recall to 0.5, demonstrating how threshold adjustments directly trade off the two metrics to suit domain-specific priorities. Such examples highlight the need for empirical threshold tuning during model deployment. The precision-recall curve provides a comprehensive view of this trade-off by plotting precision against recall for all possible thresholds, typically generated by sorting model scores and computing the metrics at each point. The area under the precision-recall curve (AUC-PR) serves as a threshold-independent summary metric, where values closer to 1 indicate superior model performance, especially in scenarios with class imbalance, as it emphasizes the positive class more than the area under the ROC curve does. The preferred operating point along the precision-recall curve varies by application, reflecting differing error costs. In fraud detection, high precision is often prioritized to minimize false positives, which could disrupt legitimate transactions and erode user trust. In contrast, search engines in information retrieval typically emphasize high recall to retrieve as many relevant documents as possible, accepting some irrelevant results to ensure comprehensiveness.
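
A short sketch with scikit-learn shows how sweeping the threshold changes precision and recall and how a single number summarizes the curve; the synthetic dataset and logistic regression model are illustrative assumptions, and average precision is used as the usual estimate of AUC-PR.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset (~10% positives).
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]  # predicted P(Y=1 | X)

# Precision and recall at every threshold, plus a threshold-free summary.
precision, recall, thresholds = precision_recall_curve(y_te, scores)
print("average precision (≈ AUC-PR):", average_precision_score(y_te, scores))

# Raising the threshold: fewer predicted positives, higher precision, lower recall.
for t in (0.3, 0.5, 0.8):
    pred = (scores >= t).astype(int)
    tp = np.sum((pred == 1) & (y_te == 1))
    fp = np.sum((pred == 1) & (y_te == 0))
    fn = np.sum((pred == 0) & (y_te == 1))
    p = tp / (tp + fp) if (tp + fp) else 0.0
    print(f"threshold={t}: precision={p:.2f}, recall={tp / (tp + fn):.2f}")
```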

Theoretical Foundations

Probabilistic Interpretation

In probabilistic terms, recall is the conditional probability that a positive instance is correctly predicted as positive, denoted as P(\hat{Y}=1 \mid Y=1), where Y is the true label and \hat{Y} is the predicted label. This directly corresponds to the true positive rate, capturing the model's ability to identify all actual positives. Precision, in probabilistic terms, is the conditional probability that a predicted positive instance is truly positive, given by P(Y=1 \mid \hat{Y}=1). By Bayes' theorem, this expands to P(Y=1 \mid \hat{Y}=1) = \frac{P(\hat{Y}=1 \mid Y=1) \cdot P(Y=1)}{P(\hat{Y}=1)}, linking precision to recall (as P(\hat{Y}=1 \mid Y=1)), the prior probability of the positive class P(Y=1), and the overall probability of a positive prediction P(\hat{Y}=1). These definitions emerge from the joint distribution of true and predicted labels in a binary classifier's output. Specifically, the joint probability P(Y=1, \hat{Y}=1) represents the probability of both true and predicted positives, which factors as P(\hat{Y}=1 \mid Y=1) \cdot P(Y=1). Precision then follows by normalizing this joint probability by the marginal P(\hat{Y}=1) = P(\hat{Y}=1 \mid Y=1) \cdot P(Y=1) + P(\hat{Y}=1 \mid Y=0) \cdot P(Y=0), while recall follows by normalizing it by the class prior P(Y=1). In probabilistic models such as Naive Bayes, which compute posterior probabilities P(Y=1 \mid X) for features X, precision and recall are derived by thresholding these posteriors to assign \hat{Y}; for instance, setting \hat{Y}=1 when P(Y=1 \mid X) > 0.5, with precision approximating the average posterior over predicted positives when the model is well calibrated. This framework allows for Bayesian estimation of the metrics' distributions, treating them as random variables informed by the classifier's probabilistic outputs.
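
The Bayes relation can be checked numerically. The sketch below assumes illustrative values for the recall (true positive rate), the false positive rate, and the class prior, and recovers the implied precision:

```python
# Assumed illustrative values, not taken from any dataset.
recall = 0.80   # P(Yhat=1 | Y=1), the true positive rate
fpr = 0.05      # P(Yhat=1 | Y=0), the false positive rate
prior = 0.10    # P(Y=1), prevalence of the positive class

# Marginal probability of a positive prediction.
p_pred_pos = recall * prior + fpr * (1 - prior)

# Precision via Bayes' theorem: P(Y=1 | Yhat=1).
precision = recall * prior / p_pred_pos
print(precision)  # 0.08 / 0.125 = 0.64
```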

Baseline Classifiers

Baseline classifiers serve as fundamental benchmarks in evaluating precision and recall, representing simplistic strategies that ignore input features and rely solely on class distribution statistics. These baselines help determine whether a learned model provides meaningful improvements over trivial approaches, particularly in establishing lower bounds for performance metrics in classification tasks. A no-skill classifier, often realized through random guessing independent of the input features, yields precision and recall values equal to the proportion of positive instances in the dataset (assuming it predicts positive at that same rate). This occurs because the expected proportion of true positives among predicted positives aligns with the prevalence of the positive class when predictions are independent of the labels. For instance, in a dataset where positive instances constitute 10% of the samples, both precision and recall for the positive class are 0.10 for this baseline. The majority-class baseline, also known as the ZeroR or most-frequent classifier, always predicts the dominant class in the training data. When the negative class is the majority (e.g., 90% of instances), this strategy achieves recall of 1 for the negative class but recall of 0 for the positive (minority) class; precision for the negative class equals the proportion of negative instances (0.90), while precision for the positive class is undefined due to no positive predictions. This underscores the importance of targeting the minority class in imbalanced scenarios, as it highlights zero performance on the positive class without any modeling effort. A calibrated random baseline adjusts uniform random predictions to match the dataset's class priors, effectively predicting the positive class with a probability equal to its prevalence. In a dataset with 90% negative instances (10% positive), this baseline results in precision and recall of approximately 0.10 for the positive class, mirroring the no-skill outcome but ensuring predictions reflect the underlying distribution for fairer benchmarking. These baselines emphasize the necessity of surpassing class proportion levels to claim skillful precision and recall in model evaluation.
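
These baselines can be reproduced with scikit-learn's DummyClassifier; a minimal sketch on an assumed dataset with roughly 10% positive instances:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score, recall_score

# Illustrative dataset with roughly 10% positive instances.
X, y = make_classification(n_samples=10000, weights=[0.9, 0.1], random_state=0)

for strategy in ("most_frequent", "stratified"):
    baseline = DummyClassifier(strategy=strategy, random_state=0).fit(X, y)
    y_pred = baseline.predict(X)
    p = precision_score(y, y_pred, zero_division=0)
    r = recall_score(y, y_pred)
    print(f"{strategy}: precision={p:.2f}, recall={r:.2f}")
# most_frequent: no positive predictions, so recall is 0 and precision is
#   undefined (reported as 0 here via zero_division=0)
# stratified: precision and recall both hover near the ~0.10 prevalence
```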

Handling Dataset Imbalances

Impact on Evaluation

Class imbalance in classification datasets, where the negative class vastly outnumbers the positive class, distorts the evaluation of precision and recall by favoring models that ignore the minority positive instances. A classifier that predicts all instances as negative achieves high accuracy—often close to the proportion of negatives—but yields zero recall for the positive class, as no true positives are identified. This occurs because recall, defined as the ratio of true positives to all actual positives, is inherently low when positives are scarce and easily overlooked. Meanwhile, precision for the positive class becomes undefined in such naive cases (zero true positives and zero false positives), but in general, severe imbalance tends to deflate precision for equivalent classifier discriminability, because the absolute number of false positives is amplified relative to the low prevalence of positives. For instance, consider a fraud detection dataset with a 99:1 ratio of legitimate to fraudulent transactions; a model classifying everything as legitimate attains 99% accuracy yet 0% recall for fraud, rendering accuracy an unreliable proxy for performance on the critical minority class. Precision suffers similarly in practice: writing r for the positive-to-negative ratio, precision can be expressed in terms of the true positive rate (TPR) and false positive rate (FPR) as \frac{TPR}{TPR + FPR/r}, which incorporates the inverse of r and therefore drops as r decreases even if TPR and FPR remain fixed.

Statistically, class imbalance biases threshold selection in probabilistic classifiers, shifting the optimal cutoff away from the default 0.5 (which assumes balanced priors) toward values that better balance the costs of false negatives versus false positives in rare-event scenarios. This bias arises because the predicted probability distribution is influenced by training-time class prevalence, potentially leading to suboptimal precision-recall operating points if thresholds are not adjusted. In precision-recall (PR) curve interpretation, imbalance further complicates assessment: the baseline precision for a random classifier lies at the positive-class prevalence (e.g., near zero for rare positives), providing a more realistic gauge of improvement potential compared to ROC curves, which can remain optimistic due to the dominance of easy negatives. A prominent real-world example is credit card fraud detection, where fraudulent transactions represent only about 0.17% of the data, making naive evaluation unreliable: models may appear performant by conservatively predicting few frauds (high precision but low recall), thus failing to capture most actual frauds and incurring significant unmitigated losses.
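
The all-negative failure mode is easy to demonstrate; a sketch on an assumed synthetic dataset with roughly a 99:1 class ratio:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(0)

# Assumed ~99:1 ratio, e.g. legitimate vs. fraudulent transactions.
y_true = (rng.random(100_000) < 0.01).astype(int)
y_naive = np.zeros_like(y_true)  # naive model: label everything legitimate

print("accuracy :", accuracy_score(y_true, y_naive))                    # ≈ 0.99
print("recall   :", recall_score(y_true, y_naive))                      # 0.0
print("precision:", precision_score(y_true, y_naive, zero_division=0))  # undefined → 0
```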

Mitigation Strategies

Data-level approaches to mitigating the impact of class imbalance on precision and recall involve resampling techniques that adjust the distribution of classes in the training set. Oversampling the minority class, such as through random duplication, can increase recall by providing more examples for the model to learn from, though it risks overfitting if not combined with other methods. A seminal technique is the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic minority-class samples by interpolating between existing minority instances and their nearest neighbors, thereby improving recall and the precision-recall balance without simply replicating data. Undersampling the majority class, by randomly removing instances, reduces the dominance of the prevalent class and can enhance recall, but it may lead to loss of information if the dataset is small. Hybrid methods, like combining SMOTE with majority-class undersampling (e.g., using Tomek links or Edited Nearest Neighbors to clean noisy samples), further balance the class distribution while preserving discriminative features.

Algorithm-level approaches modify the learning process to account for imbalance directly. Cost-sensitive learning assigns higher misclassification costs to errors on the minority class, such as penalizing false negatives more heavily in the loss function, which encourages models to prioritize recall without altering the data distribution. For instance, in support vector machines or decision trees, class-specific weights can be incorporated into the optimization objective, leading to classifiers that achieve better trade-offs between precision and recall on imbalanced data. This approach is particularly effective in domains like fraud detection, where missing a positive instance (false negative) is costlier than a false positive.

Evaluation-level adjustments focus on robust assessment rather than changing the data or the algorithm. Stratified sampling ensures that train-test splits and cross-validation folds maintain the original class proportions, preventing biased estimates of precision and recall that could arise from uneven class representation in evaluation sets. Additionally, the area under the precision-recall curve (PR-AUC) serves as a more reliable metric than accuracy or ROC-AUC for imbalanced settings, as it emphasizes performance on the minority class and is less sensitive to the abundance of true negatives. For example, in reported experiments with a C4.5 classifier on an imbalanced detection dataset, applying SMOTE at 500% combined with majority-class undersampling improved recall to 98.0% from the 76.0% achieved without it, while precision adjusted to 35.5%, yielding a more balanced trade-off overall.
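
The sketch below combines two of these strategies in scikit-learn—class weighting during training and a stratified split for evaluation—and optionally adds SMOTE oversampling, which assumes the third-party imbalanced-learn package is installed; the dataset and model are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset (~5% positives).
X, y = make_classification(n_samples=20000, weights=[0.95, 0.05], random_state=0)

# Evaluation-level: stratified split preserves the class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Algorithm-level: cost-sensitive weights penalize minority-class errors more.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced")
models = {"class_weight": weighted.fit(X_tr, y_tr)}

# Data-level (optional): SMOTE oversampling via the imbalanced-learn package.
try:
    from imblearn.over_sampling import SMOTE
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
    models["SMOTE"] = LogisticRegression(max_iter=1000).fit(X_res, y_res)
except ImportError:
    pass

for name, model in models.items():
    y_pred = model.predict(X_te)
    print(name, precision_score(y_te, y_pred), recall_score(y_te, y_pred))
```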

Extensions to Complex Scenarios

Multi-Class Evaluation

In multi-class classification problems, where instances are assigned to one of several mutually exclusive categories, precision and recall are extended from their binary formulations by treating each class independently through a one-vs-rest binarization approach. For each class C_i, the true positives (TP_i), false positives (FP_i), and false negatives (FN_i) are computed by considering predictions for C_i as positive and all others as negative, allowing per-class precision and recall to be calculated as P_i = \frac{TP_i}{TP_i + FP_i} and R_i = \frac{TP_i}{TP_i + FN_i}, respectively. To obtain overall metrics for the multi-class setting, per-class values are aggregated using methods such as macro-averaging or micro-averaging. Macro-averaging computes the unweighted mean across all classes, giving equal importance to each: P_{macro} = \frac{1}{L} \sum_{i=1}^L P_i and R_{macro} = \frac{1}{L} \sum_{i=1}^L R_i, where L is the number of classes; this approach is useful for evaluating performance without bias toward class frequency. In contrast, micro-averaging pools the contributions globally by summing numerators and denominators across classes before dividing: P_{micro} = \frac{\sum_{i=1}^L TP_i}{\sum_{i=1}^L (TP_i + FP_i)} and R_{micro} = \frac{\sum_{i=1}^L TP_i}{\sum_{i=1}^L (TP_i + FN_i)}, which effectively weights classes by their support (number of instances) and, in single-label multi-class settings, equals the overall accuracy. Consider a sentiment classification task with three classes—positive, neutral, and negative—where per-class recalls are 0.8 for positive, 0.6 for neutral, and 0.9 for negative. The macro-recall would be the average: (0.8 + 0.6 + 0.9)/3 ≈ 0.77, treating each class equally. However, if neutral instances are far fewer than the others, micro-recall would weight toward the majority classes, potentially yielding a higher value closer to the overall accuracy, such as 0.82 if positive and negative dominate the dataset. Class imbalance in multi-class settings particularly affects macro-averaging, as it equally emphasizes rare classes, which may have high variance in precision and recall due to limited samples, leading to metrics that do not reflect the model's behavior on the majority of data. Micro-averaging mitigates this by prioritizing prevalent classes but can mask poor performance on minorities, making the choice of aggregation dependent on whether balanced or instance-weighted evaluation is desired.
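
Both averaging schemes are available in scikit-learn; a small sketch on hypothetical three-class labels (0 = negative, 1 = neutral, 2 = positive), where the minority neutral class drags the macro average below the micro average:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 0 = negative, 1 = neutral, 2 = positive.
y_true = [2, 2, 2, 2, 2, 1, 1, 0, 0, 0]
y_pred = [2, 2, 2, 2, 1, 1, 0, 0, 0, 0]

for avg in ("macro", "micro"):
    p = precision_score(y_true, y_pred, average=avg, zero_division=0)
    r = recall_score(y_true, y_pred, average=avg)
    print(f"{avg}: precision={p:.2f}, recall={r:.2f}")
# macro: precision=0.75, recall=0.77  (unweighted mean over classes)
# micro: precision=0.80, recall=0.80  (equals overall accuracy here)

# Per-class (one-vs-rest) recalls, for comparison: [1.0, 0.5, 0.8]
print(recall_score(y_true, y_pred, average=None))
```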

Multi-Label Evaluation

In multi-label classification, instances can belong to multiple classes or labels simultaneously, necessitating adaptations of precision and recall to evaluate predictions across non-exclusive categories. Unlike single-label scenarios, these metrics assess the correctness of label assignments per instance or per label, often treating the problem as multiple binary classification tasks. Precision measures the proportion of predicted positive labels that are correct, while recall measures the proportion of true positive labels that are retrieved, aggregated in ways that respect the multi-label structure.

The predominant label-wise variant computes precision and recall independently for each label across all instances. For label j, true positives TP_j count instances where both the true and predicted label sets include j, false positives FP_j count instances predicted with j but not truly labeled, and false negatives FN_j count instances truly labeled with j but not predicted. Precision for label j is given by P_j = \frac{TP_j}{TP_j + FP_j}, and recall by R_j = \frac{TP_j}{TP_j + FN_j}. These per-label metrics are then aggregated: micro-averaging sums TP, FP, and FN globally across labels before computing overall precision and recall, emphasizing total counts; macro-averaging takes the unweighted mean of per-label values, treating labels equally regardless of prevalence. This approach integrates with threshold-based decisions, where continuous scores are binarized (e.g., above 0.5), and relates to Hamming loss, which averages prediction errors per label-instance pair as \frac{1}{N L} \sum_{i=1}^N \sum_{j=1}^L \mathbb{I}(y_{ij} \neq \hat{y}_{ij}), where N is the number of instances, L the number of labels, and \mathbb{I} the indicator function—low Hamming loss often aligns with high threshold-optimized precision and recall.

Alternative variants include instance-wise (or example-based) evaluation, which computes metrics per instance before averaging. For instance i with true label set Y_i and predicted label set \hat{Y}_i, precision is P_i = \frac{|Y_i \cap \hat{Y}_i|}{|\hat{Y}_i|}, and recall is R_i = \frac{|Y_i \cap \hat{Y}_i|}{|Y_i|}, yielding overall values as the mean across all instances; this captures per-sample accuracy in label set overlap. Subset-based evaluation, in contrast, assesses exact matches of the entire predicted label set to the true set per instance, with subset accuracy defined as \frac{1}{N} \sum_{i=1}^N \mathbb{I}(Y_i = \hat{Y}_i), a stricter measure that penalizes any discrepancy in the label set. For instance, in image tagging where an image truly has tags {cat, dog} but is predicted as {cat, dog, outdoor}, instance-wise precision would be 2/3 (with recall of 1) while subset accuracy is 0, highlighting partial correctness.

Key challenges arise in thresholding prediction scores to binary labels, where per-label thresholds allow customization to label-specific score distributions but increase tuning complexity, versus global thresholds that simplify computation yet may bias results toward prevalent labels. Moreover, label correlations—such as co-occurring tags in tagging tasks—complicate evaluation, as label-wise methods assume label independence and may undervalue models exploiting dependencies, potentially leading to overly optimistic or pessimistic scores without correlation-aware adjustments.
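
scikit-learn exposes the label-wise and instance-wise averaging schemes through the average parameter ('micro', 'macro', 'samples'); a sketch on a hypothetical three-label tagging example whose first row mirrors the {cat, dog} versus {cat, dog, outdoor} case above:

```python
import numpy as np
from sklearn.metrics import hamming_loss, precision_score, recall_score

# Binary indicator matrices over the labels [cat, dog, outdoor].
Y_true = np.array([[1, 1, 0],   # true {cat, dog}
                   [0, 1, 1],
                   [1, 0, 0]])
Y_pred = np.array([[1, 1, 1],   # predicted {cat, dog, outdoor}
                   [0, 1, 0],
                   [1, 0, 0]])

for avg in ("micro", "macro", "samples"):
    p = precision_score(Y_true, Y_pred, average=avg, zero_division=0)
    r = recall_score(Y_true, Y_pred, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.2f}, recall={r:.2f}")

print("Hamming loss   :", hamming_loss(Y_true, Y_pred))               # 2/9
print("Subset accuracy:", np.mean(np.all(Y_true == Y_pred, axis=1)))  # 1/3
```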

Integrated Metrics and Limitations

F-Measure and Variants

The F-measure, also known as the F1-score, is defined as the harmonic mean of precision and recall with equal weighting, providing a single metric that balances the two when they are of comparable importance. It is calculated using the formula F_1 = 2 \times \frac{P \times R}{P + R}, where P denotes precision and R denotes recall. This formulation, introduced in the context of information retrieval, yields a value between 0 and 1, with 1 indicating perfect precision and recall. Generalizations of the F-measure, known as F_{\beta}-scores, allow for adjustable weighting between precision and recall through a parameter \beta > 0. The formula is F_{\beta} = (1 + \beta^2) \times \frac{P \times R}{\beta^2 \times P + R}, where \beta = 1 recovers the standard F1-score, \beta < 1 emphasizes precision more heavily, and \beta > 1 prioritizes recall. For instance, the F2-score (\beta = 2) places twice as much weight on recall relative to precision. To illustrate, consider a classifier with precision P = 0.8 and recall R = 0.6. The F1-score is then F_1 = 2 \times (0.8 \times 0.6) / (0.8 + 0.6) \approx 0.69. For the F2-score, F_2 = (1 + 4) \times (0.8 \times 0.6) / (4 \times 0.8 + 0.6) = 5 \times 0.48 / 3.8 \approx 0.63, reflecting the greater penalty the lower recall incurs under recall-oriented weighting. The F1-score finds application in scenarios requiring equal emphasis on precision and recall, such as evaluating models on balanced datasets where false positives and false negatives carry similar costs. In such cases, it offers a concise summary of performance without favoring one metric over the other.
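
The worked numbers can be verified with scikit-learn; the labels below are constructed (TP = 12, FP = 3, FN = 8) so that precision is 0.8 and recall is 0.6:

```python
import numpy as np
from sklearn.metrics import f1_score, fbeta_score

# TP = 12, FN = 8, FP = 3  =>  precision = 12/15 = 0.8, recall = 12/20 = 0.6.
y_true = np.array([1] * 12 + [1] * 8 + [0] * 3)
y_pred = np.array([1] * 12 + [0] * 8 + [1] * 3)

print(round(f1_score(y_true, y_pred), 2))             # 0.69
print(round(fbeta_score(y_true, y_pred, beta=2), 2))  # 0.63
```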

Optimization Challenges

Precision and recall, along with their harmonic mean, the F-measure, emerged as preferred evaluation metrics as researchers recognized the limitations of accuracy in handling imbalanced datasets, where minority classes can be swamped in overall performance figures. This shift was driven by foundational work highlighting how accuracy often misled model assessment in real-world scenarios like fraud detection, prompting a focus on metrics that better capture the trade-off between precision and recall.

A primary challenge in optimizing precision and recall directly during model training stems from their non-differentiable and non-convex nature, which hinders the use of gradient-based methods prevalent in modern machine learning frameworks. Precision and recall involve discrete counts of true positives, false positives, and false negatives, making them unsuitable as loss functions without approximations, as gradients cannot be reliably computed across classification thresholds. This non-convexity also leads to multiple local optima, complicating convergence in the optimization landscape. Compounding this issue is the threshold dependency of precision and recall: models are typically trained by minimizing surrogate losses such as log-loss, with thresholds applied post hoc on validation sets to achieve the desired precision-recall balance. This two-stage process can result in suboptimal performance, as the initial optimization does not directly target the final metric, often requiring extensive hyperparameter and threshold tuning. Furthermore, maximizing the F-measure as a single objective may conflict with domain-specific goals, such as when the cost of false positives (e.g., unnecessary medical treatments) far exceeds that of false negatives, necessitating cost-sensitive adjustments rather than equal weighting of precision and recall.

To address these challenges, alternatives like the area under the precision-recall curve (AUC-PR) enable threshold-independent optimization by summarizing performance across all thresholds, with specialized algorithms providing guarantees in these non-convex settings. Loss functions that weight errors according to false positive and false negative costs offer another approach, allowing domain priorities to be incorporated during training without relying solely on post-processing. The F-measure remains a common optimization target but is imperfect due to its assumption of balanced error costs.
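
The two-stage workaround described above—train on a differentiable surrogate loss, then tune the decision threshold on held-out data—can be sketched as follows; the synthetic dataset, logistic regression model, and F1 as the target metric are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Stage 1: optimize a differentiable surrogate (log-loss), not F1 itself.
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_val)[:, 1]

# Stage 2: choose the threshold that maximizes F1 on the validation set.
precision, recall, thresholds = precision_recall_curve(y_val, scores)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
best = int(np.argmax(f1[:-1]))  # final PR point has no associated threshold
print(f"best threshold={thresholds[best]:.3f}, validation F1={f1[best]:.3f}")
```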
