
Additive smoothing

Additive smoothing, also known as Laplace smoothing or add-one smoothing, is a simple probabilistic technique that adjusts estimates derived from count data by adding a small positive constant (pseudocount) to each observed frequency, thereby avoiding zero probabilities for unseen events and providing a more robust estimate in sparse datasets. This method is particularly useful in scenarios where the sample size is limited relative to the number of possible outcomes, ensuring that all categories receive a non-zero probability allocation. The technique originates from Pierre-Simon Laplace's rule of succession, introduced in his 1812 treatise Théorie Analytique des Probabilités, where it was applied to infer the probability of future events based on past observations, such as the likelihood of the sun rising the next day. In its standard form, for a categorical distribution with vocabulary size |V| and observed count c for an event, the smoothed probability is given by P = \frac{c + \alpha}{N + \alpha |V|}, where N is the total number of observations and \alpha > 0 is the smoothing parameter, often set to 1 for the classic Laplace variant. This Bayesian-inspired approach can be viewed as incorporating a uniform Dirichlet prior over the categories, which shrinks maximum likelihood estimates toward uniformity. Additive smoothing finds extensive application in natural language processing, particularly for estimating n-gram probabilities in language models to mitigate data sparsity, and in algorithms like the naive Bayes classifier for text categorization, where it improves generalization by handling unseen features. While effective for small vocabularies or binary/multiclass problems, it is often outperformed in large-scale settings by more advanced methods like Good-Turing or Kneser-Ney smoothing due to its tendency to over-allocate probability mass to unseen events. Variants such as Lidstone smoothing generalize \alpha to values between 0 and 1, allowing finer control over the degree of smoothing.

Introduction

Definition and Purpose

Additive smoothing is a statistical technique employed to refine probability estimates derived from count data, particularly in categorical or multinomial distributions, by incorporating a positive constant, often denoted as α and referred to as a pseudocount, into the observed counts for each category. This adjustment ensures that no outcome, even one unobserved in the sample, is assigned a zero probability, thereby addressing limitations in raw frequency-based estimation. The core purpose of additive smoothing lies in resolving the zero-frequency problem, where events absent from the training data receive zero probability under maximum likelihood estimation, potentially resulting in zero posteriors or undefined predictions in downstream models. By adding pseudocounts, the method allocates a minimal but positive probability to all possible outcomes, fostering more reliable and less overconfident inferences, especially when dealing with sparse or incomplete datasets. At a conceptual level, additive smoothing redistributes probability mass evenly across all categories, effectively borrowing strength from the overall sample to bolster estimates for underrepresented events without relying on complex adjustments. This uniform redistribution promotes stability in probability distributions, making it particularly valuable for handling estimation difficulties in small samples or scenarios with high-dimensional sparse data. Historically, this approach traces back to Pierre-Simon Laplace's efforts to tackle inductive inference challenges from limited observations, as outlined in his foundational work on probability theory.

Illustrative Example

To illustrate additive smoothing, consider a biased coin flipped 4 times, resulting in 3 heads and 1 tail. The unsmoothed maximum likelihood estimates of the probabilities are P(\text{heads}) = \frac{3}{4} = 0.75 and P(\text{tails}) = \frac{1}{4} = 0.25. Applying additive smoothing with smoothing parameter \alpha = 1, the classic Laplace variant, adds one pseudocount to each outcome and two to the total number of observations. This yields the smoothed probabilities P(\text{heads}) = \frac{3 + 1}{4 + 2} = \frac{4}{6} \approx 0.667 and P(\text{tails}) = \frac{1 + 1}{4 + 2} = \frac{2}{6} \approx 0.333. The following table compares the unsmoothed and smoothed probability estimates:
| Outcome | Unsmoothed Probability | Smoothed Probability |
|---------|------------------------|----------------------|
| Heads   | 0.75                   | 0.667                |
| Tails   | 0.25                   | 0.333                |
This adjustment pulls the probability estimates toward the uniform distribution (0.5 for each outcome), mitigating overconfidence and reducing the variance that arises from limited data.
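
The numbers above can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic, not a library call, and the variable names are illustrative.

```python
# Reproduce the coin example: 3 heads, 1 tail, alpha = 1 (Laplace smoothing).
counts = {"heads": 3, "tails": 1}
alpha = 1.0
N = sum(counts.values())          # total observations (4)
d = len(counts)                   # number of possible outcomes (2)

unsmoothed = {k: c / N for k, c in counts.items()}
smoothed = {k: (c + alpha) / (N + alpha * d) for k, c in counts.items()}

print(unsmoothed)  # {'heads': 0.75, 'tails': 0.25}
print(smoothed)    # {'heads': 0.666..., 'tails': 0.333...}
```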

Mathematical Foundations

Core Formula

Additive smoothing, also known as Laplace smoothing, provides a smoothed estimate for the parameters of a categorical (multinomial) distribution based on observed counts. For a distribution with d categories, where x_i denotes the observed count for category i, N = \sum_{i=1}^d x_i is the total number of observations, and \alpha > 0 is the smoothing parameter, the estimated probability for category i is given by \hat{\theta}_i = \frac{x_i + \alpha}{N + \alpha d}. This formula incorporates \alpha as a pseudocount, equivalent to adding \alpha imaginary observations uniformly across all categories. The normalization inherent in the formula ensures that the estimated probabilities sum to 1, as \sum_{i=1}^d \hat{\theta}_i = \frac{\sum_{i=1}^d (x_i + \alpha)}{N + \alpha d} = \frac{N + \alpha d}{N + \alpha d} = 1. In the special case of a binomial distribution (d = 2), where x is the count of successes in n trials, the smoothed probability of success simplifies to \hat{p} = \frac{x + \alpha}{n + 2\alpha}. Key properties of additive smoothing include its behavior in limiting cases: as \alpha \to 0, the estimate \hat{\theta}_i approaches the maximum likelihood estimate \frac{x_i}{N}; as \alpha \to \infty, it converges to the uniform probability \frac{1}{d} for all categories.
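
The estimator and its limiting behavior can be written as a short function. The sketch below uses NumPy and an illustrative name (`additive_smoothing`); it simply evaluates \hat{\theta}_i = (x_i + \alpha)/(N + \alpha d).

```python
import numpy as np

def additive_smoothing(counts, alpha):
    """Return smoothed probability estimates (x_i + alpha) / (N + alpha * d)."""
    counts = np.asarray(counts, dtype=float)
    N, d = counts.sum(), counts.size
    return (counts + alpha) / (N + alpha * d)

x = np.array([7, 2, 1, 0])             # one category unobserved

print(additive_smoothing(x, 1.0))       # no zero probabilities, sums to 1
print(additive_smoothing(x, 1e-8))      # alpha -> 0: approaches the MLE x / N
print(additive_smoothing(x, 1e8))       # alpha -> inf: approaches uniform 1/d
```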

Derivation and Properties

Additive smoothing can be derived in the Bayesian framework as the posterior mean under a symmetric Dirichlet prior on the multinomial probability parameters. Specifically, for a multinomial distribution with d categories and observed counts n_1, \dots, n_d from N = \sum n_i trials, the symmetric Dirichlet prior has all concentration parameters equal to \alpha > 0. The posterior is Dirichlet with updated parameters n_i + \alpha, and the posterior mean takes the form \hat{\theta}_i = \frac{n_i + \alpha}{N + d\alpha}, which is exactly the additive smoothing formula. The maximum a posteriori (MAP) estimate, or mode of the posterior, is \hat{\theta}_i = \frac{n_i + \alpha - 1}{N + d(\alpha - 1)} for \alpha > 1, which differs from the posterior mean in general; for add-one smoothing (\alpha = 1) the prior is uniform, the MAP estimate coincides with the maximum likelihood estimate, and it is the posterior mean that yields the additive smoothing formula. From a frequentist perspective, additive smoothing acts as a correction to the maximum likelihood estimate (MLE) \hat{\theta}_i = n_i / N for the multinomial parameters, particularly in sparse regimes where the MLE may yield zero estimates for unobserved categories, resulting in infinite log-loss or poor generalization. By adding \alpha to each count, the estimator shrinks probabilities toward the uniform value 1/d, mitigating overfitting to noise in limited samples while ensuring all probabilities are positive. The key properties of additive smoothing include a trade-off between bias and variance. It introduces a bias toward uniformity, with \mathbb{E}[\hat{\theta}_i] - \theta_i = \frac{d\alpha (1/d - \theta_i)}{N + d\alpha}, which is positive for categories with \theta_i < 1/d and negative otherwise, pulling estimates closer to the center of the simplex; this bias is O(\alpha / N) and vanishes as N grows. The variance is reduced relative to the MLE, with the exact expression \mathrm{Var}(\hat{\theta}_i) = \frac{N \theta_i (1 - \theta_i)}{(N + d\alpha)^2} = \frac{\theta_i (1 - \theta_i)}{N} \cdot \frac{1}{(1 + d\alpha / N)^2}. Consequently, the mean squared error (MSE) \mathbb{E}[(\hat{\theta}_i - \theta_i)^2] is lower than the MLE's MSE for small N (e.g., when N \ll d), as the variance reduction dominates the added bias, but the MLE becomes preferable for large N. Asymptotically, as N \to \infty, the additive smoothing estimator converges in probability to the true parameter \theta_i for any fixed \alpha > 0, achieving consistency equivalent to the MLE since the smoothing term becomes negligible.
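
The bias-variance trade-off can be checked numerically. The Monte Carlo sketch below uses assumed parameters (d = 20 categories, N = 30 draws, α = 1, a randomly drawn "true" distribution) and compares the average MSE of the MLE and the smoothed estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, alpha, trials = 20, 30, 1.0, 20_000
theta = rng.dirichlet(np.ones(d))                # a fixed "true" distribution

counts = rng.multinomial(N, theta, size=trials)  # repeated samples of size N
mle = counts / N
smoothed = (counts + alpha) / (N + alpha * d)

mse_mle = np.mean((mle - theta) ** 2)
mse_smooth = np.mean((smoothed - theta) ** 2)
print(f"MSE(MLE)      = {mse_mle:.6f}")
print(f"MSE(smoothed) = {mse_smooth:.6f}")       # typically lower when N is small relative to d
```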

Historical Context

Origins with Laplace

The origins of additive smoothing trace back to Pierre-Simon Laplace's foundational work in probability theory, particularly his 1774 memoir, where he developed methods for inferring causes from observed events using inverse probability. In this publication, Laplace addressed the challenge of estimating the probability of future events when past observations suggest certainty, introducing a technique that avoids absolute conclusions by incorporating prior uncertainty. This approach laid the groundwork for what would later be recognized as additive smoothing, through the application of uniform priors equivalent to adding fictitious observations. A classic illustration of the rule is the "sunrise problem," which considers the probability that the sun will rise tomorrow given that it has risen without failure for n = 5000 days; this illustration appears in Laplace's later essay Essai philosophique sur les probabilités. Without smoothing, the empirical estimate would yield a probability of 1, implying certainty, but Laplace proposed adjusting for unknown possibilities by adding one imaginary success and one imaginary failure, corresponding to an additive parameter α = 1. This results in the estimated probability P(sunrise tomorrow) = (n + 1)/(n + 2), yielding approximately 0.9998 for n = 5000, thus acknowledging residual uncertainty. These imaginary observations function as pseudocounts, preventing overconfidence in limited data. Laplace termed this the "rule of succession," motivated by a philosophical commitment to account for potential future deviations or unknown causes that could alter observed patterns. By assuming an equal likelihood for all possible underlying probabilities (a uniform prior over [0, 1]), the method ensures that even perfect historical success does not preclude the possibility of failure, promoting a balanced view of uncertainty in probabilistic reasoning. This principle, first articulated in the 1774 memoir Mémoire sur la probabilité des causes par les événements, marked a pivotal shift toward Bayesian-like inference in estimating event probabilities from data.
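
As a quick check of the arithmetic, the rule of succession for n = 5000 consecutive sunrises can be evaluated directly (a minimal sketch):

```python
n = 5000
p = (n + 1) / (n + 2)   # Laplace's rule of succession with alpha = 1
print(p)                 # 0.99960..., close to but not exactly 1
```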

Later Advancements

During the 20th century, the application of additive smoothing extended to multinomial distributions, building on its binomial foundations to handle multiple categories in statistical estimation. A notable refinement came in 1961 with Harold Jeffreys' recommendation of α = 0.5, corresponding to the Jeffreys prior (Beta(0.5, 0.5)) for binomial and multinomial models, as detailed in the third edition of his Theory of Probability. Edwin Jaynes further advanced subjective probability interpretations, framing additive smoothing as a maximum entropy prior that encodes ignorance without introducing bias, emphasizing its role in logical inference from incomplete data. In the 1990s and 2000s, additive smoothing saw widespread popularization in machine learning through its integration into Naive Bayes classifiers, particularly for text classification tasks where it mitigated zero-probability issues in high-dimensional feature spaces. This era also marked its adoption in statistical models for speech recognition and machine translation, including implementations in early computational systems. No major theoretical updates to additive smoothing have emerged since then, though it has been seamlessly integrated into modern software libraries, such as scikit-learn's MultinomialNB class, where the alpha parameter enables configurable additive (Laplace/Lidstone) smoothing for practical probabilistic modeling.
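
For illustration, a minimal scikit-learn usage sketch is shown below; the toy count matrix and labels are invented, and only the documented `alpha` parameter of `MultinomialNB` is relied on.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy bag-of-words counts (rows: documents, columns: vocabulary terms).
X = np.array([[2, 1, 0, 0],
              [3, 0, 1, 0],
              [0, 0, 2, 3],
              [0, 1, 1, 2]])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB(alpha=1.0)   # alpha controls additive (Laplace/Lidstone) smoothing
clf.fit(X, y)
print(clf.predict_proba(np.array([[1, 0, 0, 1]])))
```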

Interpretations

Pseudocount Mechanism

In additive smoothing, the smoothing parameter α denotes the pseudocount added to each category of a categorical distribution with d categories; this is equivalent to incorporating αd additional hypothetical trials in which each outcome is observed exactly α times. This mechanism modifies the estimation process by incrementing each category's observed count x_i to x_i + α while expanding the overall sample size from N to N + αd, thereby yielding smoothed probability estimates that avoid zeros for unobserved categories. Intuitively, pseudocounts simulate the inclusion of fictional data in which every possible outcome appears, offering a simple mechanism to mitigate the unreliability of maximum likelihood estimates in sparse datasets where certain categories lack empirical support. For instance, in a natural language processing task involving a vocabulary of 10,000 words, where some terms remain unseen in the training data, applying α = 1 introduces a minimal positive probability for each absent word, ensuring the model assigns non-zero likelihoods during inference. This pseudocount interpretation aligns with the core formula of additive smoothing by treating the added values as extra observations rather than probabilistic priors.
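
The effect on an unseen vocabulary item can be made concrete. The sketch below assumes a hypothetical vocabulary size of 10,000 and invented token counts.

```python
vocab_size = 10_000          # |V|, assumed for illustration
alpha = 1.0
N = 250_000                  # total observed tokens (invented)

count_seen = 120             # a word observed in training
count_unseen = 0             # a word absent from the training data

p_seen = (count_seen + alpha) / (N + alpha * vocab_size)
p_unseen = (count_unseen + alpha) / (N + alpha * vocab_size)
print(p_seen, p_unseen)      # the unseen word receives a small but non-zero probability
```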

Bayesian Viewpoint

Additive smoothing arises naturally in Bayesian inference as the posterior mean estimator for the parameters of a multinomial distribution when a symmetric Dirichlet prior is employed. In this framework, the multinomial likelihood models the observed counts x_1, \dots, x_d from N = \sum_i x_i trials, with unknown probabilities \mathbf{p} = (p_1, \dots, p_d). The Dirichlet distribution \mathrm{Dir}(\alpha, \dots, \alpha) for \alpha > 0 serves as the conjugate prior, encoding a belief in uniform probabilities with strength proportional to \alpha. The posterior is then \mathrm{Dir}(x_1 + \alpha, \dots, x_d + \alpha), and its mean gives the estimate \hat{p}_i = \frac{x_i + \alpha}{N + \alpha d} for each category i. This formulation directly yields the additive smoothing probabilities, where \alpha acts as a pseudocount added to each category, preventing zero estimates for unobserved events while shrinking empirical frequencies toward uniformity. The parameter \alpha quantifies the prior's influence, functioning as a hyperparameter that controls the effective sample size of the prior; for instance, \alpha = 1 corresponds to Laplace smoothing under a uniform Dirichlet prior, equivalent to assuming one prior observation per category. In Bayesian models, varying \alpha allows tuning the balance between data-driven likelihood and prior regularization, with larger values emphasizing the uniform baseline. This Bayesian lens extends to prediction, where the smoothed estimates represent the posterior predictive probabilities for a new observation: P(\text{next} = i \mid \mathbf{x}) = \frac{x_i + \alpha}{N + \alpha d}. Thus, additive smoothing not only regularizes estimates but also delivers coherent probabilistic forecasts under uncertainty.
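
The correspondence with the Dirichlet posterior can be verified numerically. The sketch below samples from the posterior Dir(x_1 + α, ..., x_d + α) and compares the Monte Carlo mean with the closed-form additive smoothing estimate (the counts are invented).

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([5, 3, 0, 2])                 # observed counts (invented)
alpha = 1.0
N, d = x.sum(), x.size

posterior_samples = rng.dirichlet(x + alpha, size=100_000)
print(posterior_samples.mean(axis=0))      # empirical posterior mean
print((x + alpha) / (N + alpha * d))       # closed-form additive smoothing estimate
```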

Parameter Selection

Common Choices

In additive smoothing, the parameter α determines the strength of the pseudocount addition to observed frequencies, with several conventional values selected based on theoretical and practical considerations. One standard choice is α = 1, corresponding to Laplace's rule of succession, which introduces a uniform pseudocount of 1 for each possible outcome, thereby assigning non-zero probabilities to unseen events. This value serves as the default in many implementations of probabilistic models, such as scikit-learn's MultinomialNB classifier. Another common selection is α = 0.5, which aligns with the Jeffreys prior and provides invariance under reparameterization, making it suitable for binary outcome scenarios where parameter transformations might otherwise affect estimates. For constructing confidence intervals around binomial proportions, α = 2 is frequently employed via the "plus four" rule, which approximates the Wilson score interval by adding two pseudocounts to both successes and failures to improve coverage, especially with sparse data. These fixed choices reflect specific Bayesian priors that balance observed data with prior beliefs. For small sample sizes, higher α values generally enhance the uniformity of probability estimates by pulling them closer to the uniform distribution, mitigating overfitting to limited observations. The following table illustrates this effect in a binomial setting using the estimator \hat{p} = \frac{x + \alpha}{n + 2\alpha}, where x is the number of successes and n is the number of trials:
| Observations (x/n) | α = 0.5 (\hat{p}) | α = 1 (\hat{p}) | α = 2 (\hat{p}) |
|--------------------|-------------------|-----------------|-----------------|
| 0/1                | 0.25              | 0.33            | 0.40            |
| 1/1                | 0.75              | 0.67            | 0.60            |
In both cases, increasing α reduces the deviation from the uniform probability of 0.5, promoting smoother and less overconfident estimates, as the short computation below confirms.
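
The table values follow directly from \hat{p} = (x + α)/(n + 2α); a minimal sketch:

```python
def smoothed_binomial(x, n, alpha):
    """Smoothed estimate of a binomial proportion: (x + alpha) / (n + 2 * alpha)."""
    return (x + alpha) / (n + 2 * alpha)

for x, n in [(0, 1), (1, 1)]:
    row = [round(smoothed_binomial(x, n, a), 2) for a in (0.5, 1, 2)]
    print(f"{x}/{n}: {row}")
# 0/1: [0.25, 0.33, 0.4]
# 1/1: [0.75, 0.67, 0.6]
```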

Advanced Criteria

In Bayesian frameworks, weakly informative priors for the multinomial parameters can be used to select small positive values of α, providing minimal regularization that allows the data to primarily drive the posterior estimates while avoiding zero probabilities and overly restrictive assumptions. A frequentist criterion for choosing α aims to align the smoothed estimate's uncertainty with the Wilson score interval's width for 95% coverage, often approximated by the "plus four" rule in binomial settings, where α = 2 adds pseudocounts to both success and failure categories for robust interval estimation, particularly effective in small samples. When external knowledge about expected event rates is available, the hyperparameters of the Dirichlet prior (generalizing the uniform α) can be set to incorporate this information as a prior mean, such as by scaling pseudocounts proportionally to anticipated frequencies for rare events in specialized domains. Cross-validation provides a data-driven technique to optimize α by evaluating performance on held-out data, minimizing metrics like log-loss to balance smoothing against overfitting, as demonstrated in text classification tasks where a tuned α improves model generalization beyond fixed values like α = 1.
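
A cross-validation search over α can be expressed with standard scikit-learn tooling. This is a hedged sketch using randomly generated counts and labels in place of a real document-term matrix; it relies only on documented APIs (`GridSearchCV`, `MultinomialNB`, and the `neg_log_loss` scorer).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Toy data standing in for document-term counts and class labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 50))
y = rng.integers(0, 2, size=200)

search = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": [0.01, 0.1, 0.5, 1.0, 2.0]},
    scoring="neg_log_loss",   # minimize held-out log-loss
    cv=5,
)
search.fit(X, y)
print(search.best_params_)    # the alpha with the best cross-validated log-loss
```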

Applications

Probabilistic Classification

Additive smoothing plays a crucial role in probabilistic classification models, particularly in the Naive Bayes classifier, where it is applied to estimate conditional probabilities P(\text{feature} \mid \text{class}) and prevent zero likelihoods for unseen feature-class combinations during prediction. In the multinomial variant of Naive Bayes, commonly used for discrete features, additive smoothing adjusts the frequency counts by adding a small positive value (often denoted as \alpha) to both the numerator and denominator of the probability estimates, ensuring that even features absent from the training data for a specific class receive a non-zero probability. This approach is essential for maintaining the validity of the posterior probability calculations under the independence assumption of Naive Bayes. A representative application occurs in text classification tasks using a bag-of-words representation, where documents are modeled as multinomial distributions over terms. For instance, with Laplace smoothing (\alpha = 1), the probability of an unseen word in a given class is adjusted to avoid assigning zero likelihood to test documents containing novel terms, thereby enabling the classifier to generalize beyond the training vocabulary. This is particularly beneficial in spam detection, where datasets are often sparse due to the high dimensionality of word features and the rarity of certain terms; smoothing enhances the model's ability to handle such sparsity by providing robust probability estimates that improve overall accuracy. From a computational perspective, after applying additive smoothing, the resulting probabilities are frequently transformed into log-probabilities to mitigate numerical underflow when multiplying numerous small values in the likelihood computation for high-dimensional inputs. This logarithmic formulation not only stabilizes the arithmetic but also aligns with the additive nature of log sums, facilitating efficient implementation in practice.
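
The smoothing and log-probability steps can be sketched from scratch. The code below estimates P(term | class) with pseudocounts and scores a document in log space; the class names, counts, and vocabulary are invented for illustration.

```python
import math

# Per-class term counts from a toy training set (invented).
class_term_counts = {
    "spam": {"free": 30, "win": 20, "meeting": 1},
    "ham":  {"free": 2, "win": 1, "meeting": 40},
}
vocabulary = {"free", "win", "meeting", "prize"}   # includes a term unseen in both classes
alpha = 1.0

def log_likelihood(doc_terms, cls):
    """Sum of smoothed log P(term | cls); summing logs avoids numerical underflow."""
    counts = class_term_counts[cls]
    total = sum(counts.values())
    score = 0.0
    for term in doc_terms:
        p = (counts.get(term, 0) + alpha) / (total + alpha * len(vocabulary))
        score += math.log(p)
    return score

doc = ["free", "prize", "win"]
print({cls: log_likelihood(doc, cls) for cls in class_term_counts})
```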

Natural Language Processing

In statistical language models, additive smoothing is applied to estimate n-gram probabilities, particularly to address data sparsity where certain word sequences do not appear in the training corpus. For an n-gram model, the smoothed conditional probability of a word w_i given its preceding context is P(w_i \mid \text{context}) = \frac{\text{count}(\text{context}, w_i) + \alpha}{\text{count}(\text{context}) + \alpha V}, where \alpha > 0 is the smoothing parameter and V is the size of the vocabulary. This adjustment assigns a small positive probability to unseen n-grams, preventing zero probabilities that could otherwise degrade model performance. Additive smoothing plays a key role in applications such as speech recognition and machine translation by enabling robust handling of unseen n-grams during inference. In speech recognition systems, it ensures that the language model can assign probabilities to novel word sequences encountered in real-time transcription, improving overall decoding accuracy. Similarly, in statistical machine translation, it supports the estimation of target language probabilities, allowing the system to generate fluent outputs even for low-frequency or absent phrases in the training data. A practical example is computing bigram probabilities, where \alpha = 1 corresponds to Laplace smoothing as a simple baseline. For a sentence like "I want to eat Chinese food," the probability is the product of smoothed bigram probabilities, such as P(\text{to} \mid \text{want}) = \frac{\text{count}(\text{want to}) + 1}{\text{count}(\text{want}) + V}, which avoids underestimating the likelihood due to sparse counts.
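
The bigram case can be written out directly. The sketch below uses a tiny invented corpus and α = 1; it is illustrative rather than a full language model.

```python
from collections import Counter

corpus = "i want to eat chinese food i want to sleep".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)        # vocabulary size of the toy corpus
alpha = 1.0

def bigram_prob(prev, word):
    """Additively smoothed P(word | prev)."""
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * V)

print(bigram_prob("want", "to"))     # a seen bigram
print(bigram_prob("want", "food"))   # an unseen bigram, still non-zero
```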

Comparisons

Versus Other Smoothing Methods

Additive smoothing redistributes probability mass uniformly across all outcomes by adding a fixed pseudocount to observed frequencies; this contrasts with Good-Turing smoothing, which adjusts probabilities for unseen events based on the empirical frequencies of low-count events, thereby providing a more targeted allocation of mass to rare occurrences. This makes Good-Turing particularly effective for handling the long tail of sparse data distributions, as it estimates the total probability for unobserved items using the proportion of singletons and other low-frequency patterns, outperforming additive smoothing in scenarios with many unseen categories. In applications involving sparse n-grams, additive smoothing is simpler to apply but less effective than Kneser-Ney smoothing, which employs absolute discounting to subtract a fixed amount from higher-order probabilities and redistributes the mass using continuation counts that reflect the diversity of word contexts rather than uniform addition. Kneser-Ney thus better captures linguistic regularities in sparse data by prioritizing the diversity of contexts in which a word appears over equal pseudocount increments, leading to improved probability estimates for infrequent sequences. From a Bayesian perspective, additive smoothing relies on a Dirichlet prior with a fixed concentration parameter and symmetric pseudocounts, limiting it to a predefined number of categories, whereas the Dirichlet process offers a non-parametric alternative that adaptively infers an unbounded number of categories through a stick-breaking construction or the Chinese restaurant process, allowing for more flexible clustering in evolving discrete distributions. Empirically, additive smoothing yields reasonable performance on small datasets with limited sparsity, such as corpora under 50,000 sentences, where perplexity differences among methods are modest, but it underperforms on large, sparse corpora like the million-word Brown Corpus, where advanced techniques more effectively model rare events.
| Aspect | Additive Smoothing | Other Methods (e.g., Good-Turing, Kneser-Ney, Dirichlet Process) |
|--------|--------------------|------------------------------------------------------------------|
| Probability Redistribution | Uniform across all outcomes | Targeted (frequency-based, context-diverse, or adaptive) |
| Handling Rare Events | Over-smooths tails, overestimates unseen | Allocates mass efficiently to low-frequency/unseen items |
| Complexity | Simple implementation, highly interpretable | More computationally intensive, requires parameter estimation |
| Performance on Sparse Data | Adequate for small datasets; poor for large | Superior perplexity on large corpora, better generalization |

Limitations and Extensions

One key limitation of additive smoothing is its tendency toward over-uniform distribution of probability mass, which distorts estimates for rare or tail events by assigning identical non-zero probabilities to all unseen outcomes, irrespective of their contextual likelihood. This can lead to suboptimal performance on sparse datasets, as demonstrated in empirical evaluations of language modeling where additive methods underperform more nuanced techniques in capturing long-tail behavior. The method is also sensitive to the selection of the smoothing parameter α, where small changes can substantially alter probability estimates and model accuracy, and it lacks a theoretically grounded default beyond the conventional choice of 1. Furthermore, additive smoothing imposes a uniformity assumption by applying a single α across all categories or features, which overlooks variations in data scarcity or importance among them, potentially biasing results in heterogeneous datasets. In contemporary machine learning, additive smoothing sees reduced adoption in deep learning paradigms, where neural networks rely on specialized regularization strategies such as dropout and weight penalties to manage overfitting and zero-probability issues. Nonetheless, it persists as a foundational technique in interpretable probabilistic models, such as naive Bayes classifiers, particularly in resource-constrained or low-data scenarios where simplicity and explainability are prioritized. Extensions to additive smoothing address these shortcomings by allowing variable α values tailored to individual categories or features, for instance scaling pseudocounts by estimated feature importance, effectively generalizing to Dirichlet priors with non-uniform parameters. Hybrid variants integrate additive smoothing with backoff mechanisms, recursively deferring to lower-order models for unseen events to preserve more accurate higher-order estimates in sparse contexts.
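
The non-uniform extension amounts to replacing the scalar α with a vector of per-category pseudocounts, i.e., a Dirichlet prior with unequal parameters. A minimal sketch with invented counts and weights:

```python
import numpy as np

counts = np.array([12, 0, 3, 0, 1], dtype=float)
# Per-category pseudocounts, e.g. proportional to prior expectations (invented here).
alphas = np.array([0.2, 1.0, 0.5, 2.0, 0.5])

smoothed = (counts + alphas) / (counts.sum() + alphas.sum())
print(smoothed, smoothed.sum())   # still a valid probability distribution summing to 1
```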
