
Bayes classifier

The Bayes classifier is a statistical method in machine learning that applies Bayes' theorem to assign class labels to instances by calculating the posterior probability of each possible class given the observed features, selecting the class with the highest probability as the prediction. This approach treats classification as a probabilistic problem, where the goal is to minimize the expected misclassification rate under the true underlying data distribution. Named after the 18th-century mathematician Thomas Bayes, who formulated the foundational theorem in his posthumously published 1763 essay, the classifier has roots in early statistical inference but gained prominence in modern machine learning during the late 20th century as computational power enabled its practical implementation. In its general form, the Bayes classifier assumes access to the complete joint distribution of the data and computes the optimal decision boundary, making it theoretically the best possible classifier if the model is correctly specified; in practice, however, estimating these distributions accurately is challenging, leading to approximations. The most widely used variant, known as the naive Bayes classifier, simplifies the model by assuming conditional independence among the input features given the class label, a "naive" but often effective assumption that reduces computational complexity and performs well even when violated. This independence assumption allows the joint probability to be factored into a product of marginals, enabling efficient training on large datasets via maximum likelihood estimation or Bayesian updates. Bayes classifiers are particularly valued for their simplicity, interpretability, and speed, making them suitable for high-dimensional data where more complex models such as neural networks may overfit. Common implementations include Gaussian Naive Bayes for continuous features assuming normal distributions, Multinomial Naive Bayes for discrete count data such as word frequencies in text, and Bernoulli Naive Bayes for binary features. Applications span diverse fields, including spam email detection (where reported precision on benchmark datasets reaches about 97.1%), medical diagnosis, sentiment analysis, and document categorization. Despite their strengths, limitations include sensitivity to the independence assumption and the need for smoothing techniques such as Laplace smoothing to handle zero probabilities in sparse data.

Fundamentals

Definition

The Bayes classifier is the theoretically optimal statistical classifier in the sense that it achieves the minimal possible misclassification error for a given joint distribution over the feature space and labels. It operates as a decision rule that assigns an observed instance, represented by a feature vector x, to the class r that maximizes the posterior probability P(Y = r \mid X = x). In the standard supervised classification setup, the input is a random feature vector X taking values in a measurable space \mathcal{X} (often \mathbb{R}^d), and the class label Y is a random variable taking values in a finite set \{1, \dots, K\}, where K \geq 2 is the number of classes. The joint distribution of (X, Y) is assumed known, and the goal is to construct a classifier C: \mathcal{X} \to \{1, \dots, K\} that minimizes the misclassification risk under the 0-1 loss function, defined as R(C) = P(C(X) \neq Y) = \mathbb{E}[1\{C(X) \neq Y\}]. This risk is the expected probability of error over the data-generating distribution. The minimal achievable risk, known as the Bayes risk R^*, is attained by the Bayes classifier, and any other classifier C incurs an excess risk R(C) - R^* \geq 0, which quantifies its additional error relative to this optimum. Conceptually, the Bayes classifier relies on Bayes' theorem to obtain the required posterior probabilities from the class-conditional densities and prior class probabilities, enabling the optimal assignment without additional modeling assumptions.
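The quantities just defined can be collected in display form for reference; this is a restatement of this section's definitions, with no new assumptions beyond the notation C^{\text{Bayes}} for the Bayes classifier.

```latex
% Display-form summary of the quantities defined in this section.
\begin{align*}
  C^{\mathrm{Bayes}}(x) &= \arg\max_{r \in \{1,\dots,K\}} P(Y = r \mid X = x)
      && \text{(Bayes decision rule)} \\
  R(C) &= P\bigl(C(X) \neq Y\bigr) = \mathbb{E}\bigl[\mathbf{1}\{C(X) \neq Y\}\bigr]
      && \text{(misclassification risk, 0-1 loss)} \\
  R^{*} &= R\bigl(C^{\mathrm{Bayes}}\bigr) = \inf_{C} R(C),
      \qquad R(C) - R^{*} \geq 0
      && \text{(Bayes risk and excess risk)}
\end{align*}
```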

Historical Development

The foundations of the Bayes classifier trace back to the work of Thomas Bayes, an English mathematician and Presbyterian minister, whose essay "An Essay towards solving a Problem in the Doctrine of Chances" was published posthumously in 1763. In this seminal paper, Bayes introduced a theorem for updating probabilities based on new evidence, providing the probabilistic framework that later underpinned Bayesian inference methods. The essay, communicated by Richard Price, addressed inverse inference problems, laying the groundwork for reasoning from observed effects to underlying causes, though Bayes did not explicitly apply it to classification.

In the late 18th and early 19th centuries, the French mathematician Pierre-Simon Laplace expanded on these ideas through his development of inverse probability, independently deriving an equivalent form of Bayes' theorem around 1774 and integrating it into broader statistical theory. Laplace's 1774 memoir on inverse probability applied the rule to estimate causes from observed effects, notably in the analysis of observational errors, popularizing the approach despite later resistance from frequentist statisticians. His work shifted focus toward practical probabilistic inference, influencing subsequent statistical methodologies that would inform classification techniques.

In the early 20th century, Ronald A. Fisher advanced practical approximations to such discrimination problems with his 1936 introduction of linear discriminant functions for taxonomic classification. In his paper "The Use of Multiple Measurements in Taxonomic Problems," Fisher proposed methods to separate groups based on multivariate measurements, deriving functions that maximized between-group variance relative to within-group variance, effectively serving as non-probabilistic surrogates for posterior probability calculations. This approach, applied to data such as iris flower measurements, bridged statistical discrimination and pattern separation without direct reliance on prior distributions.

Post-World War II developments in statistical decision theory further solidified the classifier's theoretical basis, particularly through Abraham Wald's 1945 paper on statistical decision functions that minimize the maximum risk. Wald formalized decision rules under uncertainty, incorporating Bayesian priors into strategies for hypothesis testing and estimation, thus linking classification to optimal statistical decisions in finite-sample settings. His framework treated classification as a game against nature, where actions minimize the worst-case expected loss, influencing the integration of Bayes' methods into robust decision-making.

By the 1950s and 1960s, the Bayes classifier gained prominence in the emerging field of pattern recognition, as computational advances enabled probabilistic models for classifying complex data such as images and signals. Researchers applied Bayesian decision theory to automate feature-based decisions in areas such as speech and character recognition, marking the transition from theoretical statistics to practical machine-based learning systems. This period saw Bayesian methods formalized as optimal classifiers under zero-one loss, though practical implementations often required approximations due to computational limits.

Mathematical Formulation

Bayes' Theorem

Bayes' theorem provides a fundamental framework for updating probabilities based on new evidence, expressed in its general form as P(Y \mid X) = \frac{P(X \mid Y) \, P(Y)}{P(X)}, where P(Y \mid X) is the posterior probability of the hypothesis Y given the evidence X, P(X \mid Y) is the likelihood of observing X under Y, P(Y) is the prior probability of Y, and P(X) is the marginal probability of X, which serves as a normalizing constant or evidence. This formulation originates from the work of Thomas Bayes, who addressed the inversion of conditional probabilities to infer causes from effects.

The theorem derives directly from the definition of conditional probability. By definition, P(A \mid B) = \frac{P(A \cap B)}{P(B)} for events A and B with P(B) > 0, and the joint probability satisfies P(A \cap B) = P(B \cap A). Substituting yields P(A \mid B) \, P(B) = P(B \mid A) \, P(A), and rearranging gives P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}. This symmetry-based derivation holds under the standard axioms of probability theory, assuming the events are well defined within a probability space.

In the context of classification, Bayes' theorem enables the computation of posterior probabilities by combining prior beliefs about class labels with the likelihood of observed features, thereby supporting decisions that favor the most probable class. This process assumes knowledge of the joint distribution P(X, Y), which allows the prior, likelihood, and evidence to be evaluated or estimated accurately.
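As a minimal numerical illustration of the theorem, the sketch below computes the posterior of each class by multiplying prior and likelihood and normalizing by the evidence P(X); the priors and likelihood values are hypothetical numbers chosen for the example, not taken from any dataset.

```python
# Minimal illustration of Bayes' theorem with two classes.
# The numbers are hypothetical; they are not drawn from any real dataset.

priors = {"spam": 0.4, "ham": 0.6}         # P(Y)
likelihoods = {"spam": 0.08, "ham": 0.01}  # P(X = x | Y) for one observed x

# Evidence: P(X = x) = sum over y of P(X = x | Y = y) * P(Y = y)
evidence = sum(likelihoods[y] * priors[y] for y in priors)

# Posterior P(Y = y | X = x) for each class
posteriors = {y: likelihoods[y] * priors[y] / evidence for y in priors}
print(posteriors)  # {'spam': ~0.842, 'ham': ~0.158}
```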

Classification Rule

The Bayes classifier employs a decision rule that assigns an observation x to the class r that maximizes the posterior probability P(Y = r \mid X = x). Formally, this is expressed as C^{\text{Bayes}}(x) = \arg\max_r P(Y = r \mid X = x), where Y is the class label and X is the feature vector. The rule selects the most probable class given the observed features, leveraging the posterior probabilities derived from the data distribution. By Bayes' theorem, the posterior can be rewritten as P(Y = r \mid X = x) = \frac{P(X = x \mid Y = r) P(Y = r)}{P(X = x)}, so the decision rule is equivalent to C^{\text{Bayes}}(x) = \arg\max_r P(X = x \mid Y = r) P(Y = r), since the marginal probability P(X = x) is constant across classes and can be omitted from the argmax operation. Here, P(X = x \mid Y = r) is the class-conditional likelihood, and P(Y = r) is the prior probability of class r. This formulation highlights the balance between how well the features fit each class and the baseline prevalence of the classes. For multi-class problems with K > 2 classes, the rule extends directly by evaluating the posterior for all K classes and selecting the one with the maximum value, ensuring assignment to the most likely class among multiple options. Under the 0-1 loss, where misclassification incurs a cost of 1 and correct classification costs 0, this rule minimizes the overall probability of error by choosing the class that maximizes the chance of a correct decision. In cases where multiple classes yield equal maximum posterior probabilities (ties), the Bayes classifier typically assigns the observation arbitrarily to one of them, for example by convention to the lowest-indexed class or via randomization, as the expected risk remains unchanged.
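A minimal sketch of this decision rule in code, under the assumption that the priors P(Y = r) and a function returning the class-conditional likelihood P(X = x \mid Y = r) are already available; the names bayes_classify, priors, likelihood, and table are introduced here purely for illustration.

```python
# Sketch of the Bayes decision rule: argmax over classes of prior * likelihood.
# `priors` and `likelihood` are assumed given; obtaining them (exactly or by
# estimation) is the hard part of building a Bayes classifier in practice.

def bayes_classify(x, priors, likelihood):
    """Return the class r maximizing P(X = x | Y = r) * P(Y = r).

    priors:     dict mapping class label -> prior probability P(Y = r)
    likelihood: function (x, r) -> class-conditional likelihood P(X = x | Y = r)
    """
    # The evidence P(X = x) is omitted: it is the same for every class,
    # so it does not change which class attains the maximum.
    return max(priors, key=lambda r: likelihood(x, r) * priors[r])

# Example with three classes and a toy likelihood table (hypothetical numbers).
priors = {1: 0.5, 2: 0.3, 3: 0.2}
table = {(0, 1): 0.1, (0, 2): 0.4, (0, 3): 0.3}  # P(X = 0 | Y = r)
print(bayes_classify(0, priors, lambda x, r: table[(x, r)]))  # -> 2
```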

Theoretical Properties

Optimality

The Bayes classifier achieves the lowest possible expected misclassification risk among all classifiers when the true underlying probability distributions are known. This optimality is established under the 0-1 loss function, where the risk R(C) = \mathbb{P}(C(X) \neq Y) measures the probability of error for a classifier C. The proof relies on the law of total probability to decompose the risk conditionally on X, combined with the non-negativity of certain expectations.

In the binary classification case with labels \{0, 1\}, let \eta(x) = \mathbb{P}(Y = 1 \mid X = x) denote the regression function, and let C^\ast(x) = 1 if \eta(x) \geq 1/2 and C^\ast(x) = 0 otherwise be the Bayes classifier. For any classifier C, the risk difference decomposes as R(C) - R(C^\ast) = 2\,\mathbb{E}\left[ \left| \eta(X) - \frac{1}{2} \right| \cdot \mathbf{1}\{ C(X) \neq C^\ast(X) \} \right] \geq 0, where the inequality follows from the non-negativity of the absolute value and the indicator function (which is zero when C(X) = C^\ast(X)). Equality holds if and only if C = C^\ast almost everywhere on the set where \eta(X) \neq 1/2. This decomposition arises by expressing the conditional risks and applying the law of total expectation, showing that no other classifier can outperform the Bayes rule.

The result extends to the multi-class setting with K \geq 2 labels \{1, \dots, K\}. Here, the Bayes classifier assigns C^\ast(x) = \arg\max_k \mathbb{P}(Y = k \mid X = x). For any classifier C, the risk is R(C) = \mathbb{E}_X \left[ 1 - \mathbb{P}(Y = C(X) \mid X) \right]. Since \mathbb{P}(Y = C(X) \mid X) \leq \max_k \mathbb{P}(Y = k \mid X) pointwise for each X, it follows that R(C) \geq \mathbb{E}_X \left[ 1 - \max_k \mathbb{P}(Y = k \mid X) \right] = R(C^\ast), with the inequality preserved under expectation by the monotonicity of expectation. This demonstrates that the Bayes classifier minimizes the risk across all possible decision rules.

The minimal risk R(C^\ast), termed the Bayes risk, serves as the irreducible error floor inherent to the problem, arising from the probabilistic overlap between class-conditional distributions even with perfect knowledge of the model. Optimality requires access to the true distributions P(X, Y), which in practice are unknown and must be estimated.
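The inequality can also be checked empirically. The following sketch simulates a binary problem under an assumed model (X uniform on [0, 1] with \eta(x) = x, chosen only for illustration) and compares the empirical error of the Bayes rule with that of a deliberately mis-thresholded rule.

```python
# Empirical check of Bayes optimality on a synthetic binary problem.
# The data-generating model is an assumption made for this illustration:
# X ~ Uniform(0, 1) and eta(x) = P(Y = 1 | X = x) = x.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

x = rng.uniform(0.0, 1.0, size=n)
eta = x                                  # known posterior P(Y = 1 | X = x)
y = (rng.uniform(size=n) < eta).astype(int)

bayes_pred = (eta >= 0.5).astype(int)    # Bayes rule: predict 1 iff eta(x) >= 1/2
other_pred = (eta >= 0.7).astype(int)    # suboptimal rule with a shifted threshold

bayes_risk = np.mean(bayes_pred != y)    # approx. E[min(eta, 1 - eta)] = 0.25 here
other_risk = np.mean(other_pred != y)    # strictly larger in expectation (about 0.29)

print(f"empirical Bayes risk: {bayes_risk:.3f}")
print(f"empirical risk of shifted rule: {other_risk:.3f}")
assert other_risk >= bayes_risk - 0.01   # holds up to sampling noise
```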

Bayes Error Rate

The Bayes error rate, often denoted as R^* or L^*, represents the minimal possible error probability achievable by any classifier for a given joint distribution of features X and class labels Y. It is defined as the infimum over all classifiers C of the risk R(C) = \mathbb{E}[\mathbf{1}\{C(X) \neq Y\}], which simplifies to R^* = \inf_C R(C) = \mathbb{E}\left[1 - \max_r P(Y = r \mid X)\right], where the expectation is taken over the distribution of X and the maximum is over the possible class labels r. In the binary case, this equals \mathbb{E}\left[\min_r P(Y = r \mid X)\right]. This quantity captures the irreducible error due to inherent uncertainty in the data, assuming perfect knowledge of the underlying distributions.

In the binary classification setting with classes \{0, 1\}, the Bayes error rate admits the explicit integral form R^* = \int \min\{\eta(x), 1 - \eta(x)\} \, dP_X(x), where \eta(x) = P(Y = 1 \mid X = x) is the regression function and P_X is the marginal distribution of X. Equivalently, letting p = P(Y = 1) denote the prior probability of class 1 and f_0, f_1 the class-conditional densities of X given Y = 0 and Y = 1, respectively, it can be expressed as R^* = \int \min\{(1 - p) f_0(x), p f_1(x)\} \, dx. The Bayes classifier, which assigns X = x to class 1 if \eta(x) > 1/2 and to class 0 otherwise, achieves this error rate exactly, establishing it as the theoretical lower bound on classification performance.

The value of the Bayes error rate is fundamentally determined by the degree of overlap between the class-conditional distributions in the feature space and the prior probabilities of the classes. Greater overlap, measured by how much the supports or densities f_0 and f_1 intersect, increases R^*, as it heightens the ambiguity in the posterior probabilities \eta(x) near decision boundaries. Unequal priors p \neq 1/2 also affect R^* by skewing the weighting of the densities, though the primary driver remains the separability of the classes; for perfectly separable distributions (e.g., disjoint supports), R^* = 0.

Bounds on the Bayes error rate provide useful approximations without requiring full distributional knowledge. For the binary case with equal priors p = 1/2, it equals \frac{1}{2}(1 - V(\mu_0, \mu_1)), where V(\mu_0, \mu_1) is the total variation distance between the class-conditional measures \mu_0 and \mu_1, defined as V(\mu_0, \mu_1) = \sup_A |\mu_0(A) - \mu_1(A)|. For densities, this distance is \frac{1}{2} \int |f_0(x) - f_1(x)| \, dx, yielding R^* = \frac{1 - V}{2}. More generally, R^* is bounded above by expressions involving divergences such as the Bhattacharyya coefficient, for example R^* \leq \sqrt{p(1-p)} \cdot BC(f_0, f_1), where BC(f_0, f_1) = \int \sqrt{f_0(x) f_1(x)} \, dx \leq 1, with equality to 1 only for identical distributions. These bounds highlight how distributional similarity directly limits optimal accuracy.
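As a concrete illustration of the integral formula, the sketch below numerically integrates \min\{(1 - p) f_0(x), p f_1(x)\} for two one-dimensional Gaussian class-conditional densities; the particular densities and the equal prior are assumptions made for this example, not values taken from the text.

```python
# Numerical approximation of the Bayes error rate for two one-dimensional
# Gaussian class-conditional densities. The densities and the equal prior
# are assumptions made for this example.
import numpy as np
from scipy.stats import norm

p = 0.5                           # prior P(Y = 1), assumed equal priors
f0 = norm(loc=0.0, scale=1.0)     # density of X given Y = 0
f1 = norm(loc=2.0, scale=1.0)     # density of X given Y = 1

x = np.linspace(-8.0, 10.0, 200_001)                     # fine grid covering both densities
integrand = np.minimum((1 - p) * f0.pdf(x), p * f1.pdf(x))
bayes_error = float(np.sum(integrand) * (x[1] - x[0]))   # Riemann-sum approximation

# For equal-variance Gaussians with equal priors the exact value is
# Phi(-|mu1 - mu0| / (2 * sigma)) = Phi(-1), roughly 0.1587.
print(f"numerical R*: {bayes_error:.4f}, exact: {norm.cdf(-1.0):.4f}")
```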
Practical Approximations

Naive Bayes Assumption

The naive Bayes classifier simplifies the full Bayesian approach by assuming that the features X_1, \dots, X_n are conditionally independent given the class label Y. Under this assumption, the joint conditional probability factors as P(\mathbf{X} \mid Y) = \prod_{i=1}^n P(X_i \mid Y), which reduces the computational complexity of estimating the likelihood from an exponential number of parameters to a linear one. This independence assumption leads to a simplified expression for the posterior probability, P(Y \mid \mathbf{X}) \propto P(Y) \prod_{i=1}^n P(X_i \mid Y), allowing classification by selecting the class that maximizes this product without needing the full joint distribution.

The assumption is termed "naive" because it is often unrealistic in real-world data, where features typically exhibit dependencies given the class, such as correlated attributes in text or images. Despite this limitation, naive Bayes frequently achieves strong empirical performance, often rivaling more complex models, because the classifier remains optimal under zero-one loss for a broad range of dependency structures, including conjunctions and disjunctions that violate independence. The assumption holds most effectively in scenarios with low feature correlations or near-deterministic dependencies, where the information loss from ignoring interactions is minimal, leading to accurate probability estimates and classifications.

Parametric and Non-Parametric Estimation

In the Bayes classifier, the true underlying probability distributions, including the class-conditional densities P(\mathbf{X} \mid Y) and class priors P(Y), are typically unknown and must be approximated from a finite training sample to enable practical implementation. These approximations introduce estimation error, leading to classifiers that may deviate from the theoretical optimum. Parametric and non-parametric methods provide distinct strategies for this estimation, balancing assumptions about data structure with flexibility in modeling complex distributions.

Parametric estimation assumes a predefined functional form for the densities, such as the multivariate Gaussian distribution for P(\mathbf{X} \mid Y = k) in each class k. Under this assumption, parameters such as the class-specific mean vectors \boldsymbol{\mu}_k and covariance matrices \boldsymbol{\Sigma}_k (or a shared covariance for linear variants) are estimated via maximum likelihood from the training data, where the sample mean and covariance serve as plug-in estimators for their population counterparts. This approach reduces the problem to estimating a fixed number of parameters, making it computationally efficient for moderate sample sizes, as seen in linear and quadratic discriminant analysis.

Non-parametric methods avoid strong distributional assumptions, instead estimating the densities directly from the data. Kernel density estimation (KDE) constructs P(\mathbf{X} \mid Y = k) by placing a kernel function, such as a Gaussian, at each training point in class k and smoothing to form a continuous estimate, with bandwidth selection controlling the trade-off between smoothness and fidelity to the data. Similarly, the k-nearest neighbors (k-NN) algorithm approximates the posterior P(Y \mid \mathbf{X}) non-parametrically by considering the labels of the k closest training points to a query \mathbf{X}, weighting them proportionally to proximity for a local density estimate.

The resulting estimates are incorporated into the Bayes decision rule via the plug-in classifier, which replaces the true posteriors P(Y = k \mid \mathbf{X}) with their empirical counterparts to assign \mathbf{X} to the class maximizing the approximated probability. This substitution yields a workable predictor but incurs excess risk relative to the Bayes risk, quantified as the difference between the plug-in classifier's error rate and the irreducible minimum. Parametric methods generally exhibit lower variance due to fewer effective parameters but higher bias if the assumed form mismatches the data, while non-parametric approaches reduce bias at the cost of increased variance and computational demands, influencing overall excess risk through the bias-variance decomposition. A common parametric simplification, as in the naive Bayes classifier, further assumes conditional independence among features to ease estimation.
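To make the plug-in idea concrete, the sketch below fits a Gaussian naive Bayes model by maximum likelihood (per-class feature means and variances plus empirical class priors) and substitutes the estimates into the Bayes decision rule; the synthetic two-class data are an assumption made purely for illustration.

```python
# Plug-in Bayes classifier with a parametric (Gaussian naive Bayes) estimate.
# Per-class means/variances and class priors are estimated from training data
# and substituted into the Bayes decision rule. Synthetic data for illustration.
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training data: two classes with different feature means (assumed).
n_per_class = 500
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_per_class, 2))
X1 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(n_per_class, 2))
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

classes = np.unique(y)
priors = np.array([np.mean(y == k) for k in classes])           # estimated P(Y = k)
means = np.array([X[y == k].mean(axis=0) for k in classes])     # estimated mu_k
variances = np.array([X[y == k].var(axis=0) for k in classes])  # per-feature sigma_k^2

def log_gaussian(x, mu, var):
    """Sum of per-feature Gaussian log densities (the naive independence assumption)."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var), axis=-1)

def predict(x):
    """Plug-in Bayes rule: argmax over classes of log prior + log likelihood."""
    scores = [
        np.log(priors[i]) + log_gaussian(x, means[i], variances[i])
        for i, _ in enumerate(classes)
    ]
    return classes[int(np.argmax(scores))]

print(predict(np.array([1.8, 0.9])))   # expected to be assigned to class 1
print(predict(np.array([-0.5, 0.2])))  # expected to be assigned to class 0
```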
Applications

Common Domains

Bayes classifiers, particularly their naive approximations, find widespread application across diverse domains due to their probabilistic foundation and efficiency in handling uncertain data. In text classification tasks, naive Bayes classifiers are commonly employed for spam detection in email systems, where they analyze word frequencies and patterns to distinguish legitimate messages from unsolicited ones. Similarly, in sentiment analysis, these classifiers categorize textual content as positive, negative, or neutral by estimating conditional probabilities of features such as n-grams, achieving robust performance on large corpora such as product reviews or social media posts.

In medical diagnosis, Bayes classifiers support disease prediction by integrating symptoms, patient history, and biomarkers to compute posterior probabilities of conditions such as Alzheimer's disease or cardiovascular disease.
For instance, naive Bayes classifiers have been validated in studies predicting outcomes from heterogeneous health data, such as genetic markers and imaging results, enabling early detection in clinical settings with limited labeled data.

Early pattern recognition systems in image recognition have approximated Bayes rules to classify visual features, such as shapes or textures, by treating pixel intensities or edge detections as probabilistic evidence. These methods underpin foundational work in object detection, where Bayesian inference guides search and recognition in cluttered scenes, offering scalability for real-time processing.

In finance, Bayes classifiers facilitate credit scoring through probabilistic risk assessment, evaluating borrower profiles based on variables such as income, debt ratios, and transaction history to predict default probabilities. Naive Bayes variants, in particular, construct scorecards that classify applicants as low or high risk, aiding lenders in decision-making with interpretable probability outputs.

A key advantage of Bayes classifiers, especially naive versions, lies in their scalability to high-dimensional spaces with small datasets, as they estimate parameters independently without requiring extensive training samples, making them suitable for sparse data environments such as genomics or text mining. This efficiency stems from the independence assumption, which reduces computational demands while maintaining predictive accuracy in resource-constrained applications.

Computational Example

To illustrate the application of the Bayes classifier in a binary classification setting, consider a spam detection task with two classes: spam (S) and not spam (ham, H). The features are binary indicators for the presence of specific words, modeled using the Bernoulli distribution, which is suitable for binary/boolean features such as word occurrence in text. Suppose we have a small training dataset of four emails, with two features: X₁ (presence of the word "free", 1 if present, 0 otherwise) and X₂ (presence of the word "offer", 1 if present, 0 otherwise).
The dataset is as follows:

| Email | Class | X₁ ("free") | X₂ ("offer") |
|-------|-------|-------------|--------------|
| 1     | S     | 1           | 0            |
| 2     | S     | 0           | 1            |
| 3     | H     | 1           | 1            |
| 4     | H     | 0           | 0            |

The priors are estimated from the data: P(S) = 2/4 = 0.5 and P(H) = 0.5.

For the exact Bayes classifier, the class-conditional probabilities are the empirical joint probabilities for each feature combination, without assuming independence:

- P(X₁=1, X₂=0 | S) = 1/2 = 0.5 (from email 1)
- P(X₁=0, X₂=1 | S) = 1/2 = 0.5 (from email 2)
- P(X₁=1, X₂=1 | S) = 0
- P(X₁=0, X₂=0 | S) = 0
- P(X₁=1, X₂=1 | H) = 1/2 = 0.5 (from email 3)
- P(X₁=0, X₂=0 | H) = 1/2 = 0.5 (from email 4)
- P(X₁=1, X₂=0 | H) = 0
- P(X₁=0, X₂=1 | H) = 0

Now consider a test email with features x = (X₁=1, X₂=1), i.e., containing both "free" and "offer". The posterior probabilities are computed using Bayes' theorem:

P(S \mid x) = \frac{P(x \mid S) P(S)}{P(x)}, \quad P(H \mid x) = \frac{P(x \mid H) P(H)}{P(x)},

where P(x \mid S) = P(X₁=1, X₂=1 \mid S) = 0 and P(x \mid H) = 0.5. Thus P(S \mid x) = 0 and P(H \mid x) = 1. The exact Bayes classifier assigns the test email to H (ham), as the combination (1,1) never occurs in spam but does in ham.

For the naive Bayes approximation, features are assumed conditionally independent given the class, so P(x \mid S) = P(X₁=1 \mid S) \times P(X₂=1 \mid S). The marginals are:

- P(X₁=1 | S) = 1/2 = 0.5
- P(X₂=1 | S) = 1/2 = 0.5
- P(X₁=1 | H) = 1/2 = 0.5
- P(X₂=1 | H) = 1/2 = 0.5

(To avoid zero probabilities in general practice, Laplace smoothing could be applied as (count + 1)/(n + 2), but here it yields the same values.) Thus P(x \mid S) = 0.5 \times 0.5 = 0.25 and P(x \mid H) = 0.25. The unnormalized posteriors are equal, P(x \mid S) P(S) = 0.125 and P(x \mid H) P(H) = 0.125, leading to P(S \mid x) = 0.5. The naive Bayes classifier results in a tie, potentially assigning the email to either class (e.g., via a tie-breaking rule), but it fails to recognize the dependence that makes the combination (1,1) impossible under S. This demonstrates the approximation effect: naive Bayes overestimates the likelihood under S by ignoring the mutual exclusivity of "free" and "offer" in spam emails in the training data.
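The arithmetic above can be reproduced in a few lines of code. The sketch below is a direct transcription of the toy dataset rather than a general-purpose implementation; it contrasts the empirical joint likelihoods used by the exact Bayes classifier with the factored marginals used by naive Bayes for the test email x = (1, 1).

```python
# Reproducing the worked spam/ham example: exact empirical joint likelihoods
# versus the naive Bayes factorization, for the test email x = (1, 1).
from collections import Counter

# Toy training data from the table above: (x1, x2, class)
data = [(1, 0, "S"), (0, 1, "S"), (1, 1, "H"), (0, 0, "H")]
classes = ["S", "H"]
x_test = (1, 1)

n_class = Counter(c for _, _, c in data)
priors = {c: n_class[c] / len(data) for c in classes}             # P(S) = P(H) = 0.5

# Exact Bayes: empirical joint likelihood P(x1, x2 | class)
joint = {c: Counter((x1, x2) for x1, x2, cc in data if cc == c) for c in classes}
exact_like = {c: joint[c][x_test] / n_class[c] for c in classes}  # {'S': 0.0, 'H': 0.5}

# Naive Bayes: product of per-feature marginals P(x1 | class) * P(x2 | class)
def marginal(c, idx, value):
    return sum(1 for row in data if row[2] == c and row[idx] == value) / n_class[c]

naive_like = {c: marginal(c, 0, x_test[0]) * marginal(c, 1, x_test[1]) for c in classes}
# naive_like == {'S': 0.25, 'H': 0.25}

for name, like in [("exact", exact_like), ("naive", naive_like)]:
    scores = {c: like[c] * priors[c] for c in classes}
    total = sum(scores.values())
    posteriors = {c: scores[c] / total for c in classes}
    print(name, posteriors)
# exact {'S': 0.0, 'H': 1.0}  -> assigns ham
# naive {'S': 0.5, 'H': 0.5}  -> tie: the dependence between the two words is lost
```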

References

  1. [1]
    [1404.0933] Bayes and Naive Bayes Classifier - arXiv
    Apr 3, 2014 · Abstract:The Bayesian Classification represents a supervised learning method as well as a statistical method for classification.
  2. [2]
    Bayes Classifier - an overview | ScienceDirect Topics
    A Bayes classifier is defined as a statistical classifier based on Bayes' theorem that predicts class membership probabilities by assuming class-conditional ...
  3. [3]
    [PDF] Naïve Bayes - Stanford University
    Naïve Bayes is a probabilistic model that defines distributions for random variables, used for prediction based on observations.
  4. [4]
    [PDF] Lecture 6 Classification and Decision Theory - Brown CS
    Definition: We say f : X → Y is a Bayes optimal classifier if f minimizes E[L(y, f(x))] where (x, y) ∼ p(x, y). 2. Page 3.
  5. [5]
    [PDF] The Bayes Classifier 1 Introduction 2 Properties of the Bayes Risk
    Recall that a Bayes classifier is a classifier whose risk R(h) is minimal among all possible classifiers, and the minimum risk R∗ is called the Bayes risk.
  6. [6]
    [PDF] Bayes Classifiers - Matthieu R. Bloch
    May 23, 2020 · The classifier hB is called the Bayes classifier and RB ≜ R(hB) is called the Bayes risk. 2 Alternative forms of the Bayes classifier. You might ...
  7. [7]
    LII. An essay towards solving a problem in the doctrine of chances ...
    Bayes Thomas. 1763LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, F. R. S. communicated by Mr. Price, in a ...
  8. [8]
    [PDF] IX. Thomas Bayes's Essay Towards Solving a Problem ... - Mark Irwin
    Feb 1, 2005 · * Thomas Bayes's famous Essay is so often referred to in current statistical literature, but so rarely studied because of the difficulty of ...
  9. [9]
    Laplace's 1774 Memoir on Inverse Probability - jstor
    Abstract. Laplace's first major article on mathematical statistics was pub- lished in 1774. It is arguably the most influential article in this field to.
  10. [10]
    Pierre-Simon Laplace, Inverse Probability, and the Central Limit ...
    Mar 4, 2024 · On Laplace's brilliant solution to inverse probability and his discovery of the Central Limit Theorem · In the late 1600s, · In 1733, forty years ...
  11. [11]
    THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC ...
    THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS. R. A. FISHER Sc.D., F.R.S.,. R. A. FISHER Sc. ... First published: September 1936. https://doi.org ...
  12. [12]
    [PDF] THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC ...
    In the present paper the application of the same principle will be illustrated on a taxonomic problem; some questions connected with the precision of the ...
  13. [13]
    Statistical Decision Functions Which Minimize the Maximum Risk
    STATISTICAL DECISION FUNCTIONS WHICH MINIMIZE THE. MAXIMUM RISK. By ABRAHAM WALD. (Received November 7, 1944). 1. Introduction. In some previous publications ...
  14. [14]
    Pattern recognition - Holmström - Wiley Interdisciplinary Reviews
    Jul 15, 2010 · Pattern recognition has a long history. It had its beginnings in the statistical literature of the 1930s. The advent of computers in 1950s and ...
  15. [15]
    [PDF] Probability Theory 1 Sample spaces and events - MIT Mathematics
    Feb 10, 2015 · Bayes' rule. For any two events A and B, one has. P(B|A) = P(A|B). P(B). P(A) . The proof of Bayes' rule is straightforward. Replacing the ...
  16. [16]
    10.1 - Bayes Rule and Classification Problem | STAT 505
    The classification rule is to assign observation to the population for which the posterior probability is the greatest.
  17. [17]
    Summary of Bayes Decision Rule for Classification (ECE408 Lecture Notes)
  18. [18]
    [PDF] Lecture 5: Classification 5.1 Introduction
    The excess risk is a quantity that measures how the quality of c is away from the optimal/Bayes classifier. If we cannot find the Bayes classifier, we will ...
  19. [19]
    [PDF] Proof that the Bayes Decision Rule is Optimal
    Proof that the Bayes Decision Rule is Optimal. Theorem For ... First we concentrate the attention on the error rate (probability of classification error).
  20. [20]
    [PDF] The Bayes Classifier
    If we have full knowledge of the distribution, then we can design an optimal classifier without seeing any data at all.
  21. [21]
    [PDF] An empirical study of the naive Bayes classifier
    The naive Bayes classifier greatly simplifies learning by assuming that features are independent given class. Although independence is generally a poor ...
  22. [22]
    [PDF] On the Optimality of the Simple Bayesian Classifier under Zero-One ...
    In practice, attributes are seldom independent given the class, which is why this assump- tion is “naive.” However, the question arises of whether the Bayesian ...
  23. [23]
    On the Optimality of the Simple Bayesian Classifier under Zero-One ...
    This article shows that, although the Bayesian classifier's probability estimates are only optimal under quadratic loss if the independence assumption holds,
  24. [24]
    [PDF] Naive Bayes, Text Classifica- tion, and Sentiment - Stanford University
    Text categorization, in which an entire text is assigned a class from a finite set, includes such tasks as sentiment analysis, spam detection, language identi-.
  25. [25]
    [PDF] Spam Detection using Naive Bayes Classifier
    Jul 7, 2018 · It analyses the text written in a natural language and classify them as positive or negative based on the human's sentiments, emotions, opinions.
  26. [26]
    [PDF] Text Classification: Naïve Bayes Classifier with Sentiment Lexicon
    May 27, 2019 · Abstract— This paper proposes a method of linguistic classification based on the analysis of positive, negative and neutral sentiments ...
  27. [27]
    Applying Naive Bayesian Networks to Disease Prediction - NIH
    Naive Bayesian networks (NBNs) are one of the most effective and simplest Bayesian networks for prediction. This paper aims to review published evidence ...
  28. [28]
    A Bayesian Model for the Prediction and Early Diagnosis ... - Frontiers
    In the current method, all the known AD biomarkers are combined in a complex Bayesian Network to establish a medical diagnostic decision system for AD, not as a ...
  29. [29]
    Pattern Recognition by Bayesian Inference - J-Stage
    Bayesian inference uses Bayes' theorem to estimate the cause of an outcome based on results, and is discussed for pattern recognition.
  30. [30]
    A Bayesian model for efficient visual search and recognition
    Jun 25, 2010 · We describe a new model of attention guidance for efficient and scalable first-stage search and recognition with many objects.
  31. [31]
    Assessing naive Bayes as a method for screening credit applicants
    Aug 10, 2025 · This study examines the effectiveness of NBR as a method for constructing classification rules (credit scorecards) in the context of screening ...
  32. [32]
    Class dependent feature scaling method using naive Bayes ...
    The naive Bayes classifier has been extensively used in text categorization. We have developed a new feature scaling method, called class–dependent–feature– ...
  33. [33]
    What Are Naïve Bayes Classifiers? - IBM
    These probabilities are denoted as the prior probability and the posterior probability. The prior probability is the initial probability of an event before it ...
  34. [34]
    [PDF] Naive Bayes and Text Classification I - arXiv
    Feb 14, 2017 · In the following sections, we will take a closer look at the probability model of the naive Bayes classifier and apply the concept to a simple ...
  35. [35]
    Lecture 5: Bayes Classifier and Naive Bayes
    Naive Bayes is a linear classifier. Naive Bayes leads to a linear decision boundary in many common cases. Illustrated here is the case where P(xα|y) is ...