Precision and recall
Precision and recall are fundamental performance metrics in information retrieval and machine learning, particularly for evaluating binary classification and search systems. Precision quantifies the accuracy of a retrieval or prediction process by measuring the fraction of retrieved items that are relevant, calculated as P = \frac{tp}{tp + fp}, where tp denotes true positives and fp false positives.[1] Recall, also known as sensitivity, assesses completeness by measuring the fraction of relevant items that are successfully retrieved, given by R = \frac{tp}{tp + fn}, where fn represents false negatives.[1] These metrics originated in the evaluation of information retrieval systems in the mid-20th century, with early formalization by Kent et al. in 1955, and have since become standard for assessing models where classes are imbalanced or the costs of different errors vary.[1]

In information retrieval, precision and recall evaluate how well a system returns relevant documents from a collection in response to a query, using test collections with predefined relevance judgments.[1] A key trade-off exists between the two: efforts to maximize recall, such as retrieving more documents, often reduce precision by including irrelevant results, and vice versa; precision-recall curves visualize this balance across varying thresholds.[1] The F1-score, the harmonic mean of precision and recall (F_1 = 2 \frac{P \cdot R}{P + R}), provides a single composite measure that balances both when they are given equal importance, as introduced by van Rijsbergen in 1979.[1]

In machine learning, precision and recall are applied to binary classifiers to address the limitations of accuracy on imbalanced datasets, where one class (e.g., the positive class) is rarer.[2] High precision minimizes false positives, which is crucial in applications like spam detection to avoid misclassifying legitimate emails, while high recall minimizes false negatives, which is vital in medical diagnostics to ensure few cases are missed. The precision-recall curve, often preferred over the ROC curve for imbalanced data, summarizes classifier behavior across decision thresholds, and the area under it (AUC-PR) offers a robust single-number summary of model performance. These metrics extend to multi-class problems via macro- or micro-averaging, enabling comprehensive evaluation across diverse domains such as natural language processing and computer vision.[2]
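To make the formulas above concrete, the following Python sketch computes precision, recall, and the F1-score from raw true-positive, false-positive, and false-negative counts; the specific counts are purely illustrative and not drawn from any cited dataset.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are correct: P = tp / (tp + fp)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that are retrieved: R = tp / (tp + fn)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts from a single classification or retrieval run.
tp, fp, fn = 45, 15, 30
p, r = precision(tp, fp), recall(tp, fn)
print(f"P = {p:.2f}, R = {r:.2f}, F1 = {f1(p, r):.2f}")
# P = 0.75, R = 0.60, F1 = 0.67
```

The zero-denominator guards reflect one common convention (reporting 0 when no positives are predicted or present); other conventions exist.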
Fundamental Concepts

Definition of Precision
Precision is a key performance metric in binary classification tasks, evaluating the accuracy of a model's positive predictions by measuring the proportion of true positives among all instances predicted as positive. It reflects the reliability of positive classifications, i.e., how often a positive prediction is correct, which is crucial in applications where false positives carry significant costs, such as fraud detection or disease screening.[3] Formally, precision is defined using elements of the confusion matrix as \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, where TP represents true positives (correctly predicted positives) and FP represents false positives (incorrectly predicted positives). This formulation highlights precision's focus on the purity of the positive-class predictions.[4]

To illustrate, consider a spam detection classifier applied to a dataset of emails. Suppose the model predicts 100 emails as spam, of which 80 are actually spam (TP = 80) and 20 are legitimate (FP = 20). The precision is then 80 / (80 + 20) = 0.80, or 80%, indicating that 80% of the predicted spam emails were correctly identified. This can be visualized using a confusion matrix:

| | Predicted Spam | Predicted Not Spam |
|---|---|---|
| Actual Spam | TP = 80 | FN = (unknown) |
| Actual Not Spam | FP = 20 | TN = (unknown) |
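The precision of 0.80 in the example above can be reproduced by tallying the confusion matrix from paired labels. In the sketch below, the actual/predicted label lists are hypothetical and constructed to yield TP = 80 and FP = 20; since the table leaves FN and TN unspecified, the values they happen to take here (10 and 50) are arbitrary.

```python
# Hypothetical labels matching the spam example: 80 true positives, 20 false positives.
# The FN and TN counts (10 and 50) are arbitrary placeholders; precision ignores them.
actual    = ["spam"] * 80 + ["ham"] * 20 + ["spam"] * 10 + ["ham"] * 50
predicted = ["spam"] * 80 + ["spam"] * 20 + ["ham"] * 10 + ["ham"] * 50

tp = sum(a == "spam" and p == "spam" for a, p in zip(actual, predicted))
fp = sum(a == "ham" and p == "spam" for a, p in zip(actual, predicted))
fn = sum(a == "spam" and p == "ham" for a, p in zip(actual, predicted))
tn = sum(a == "ham" and p == "ham" for a, p in zip(actual, predicted))

precision = tp / (tp + fp)
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}, precision={precision:.2f}")
# TP=80, FP=20, FN=10, TN=50, precision=0.80
```

Because only TP and FP appear in the formula, any choice of FN and TN leaves the result at 0.80; those counts matter only for recall and related metrics.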