
Bayes factor

The Bayes factor is a key quantity in Bayesian statistics that compares the relative support for two competing hypotheses or models given observed data, defined as the ratio of the marginal likelihood of the data under one model to that under the other. It serves as an updating factor on prior odds, where a Bayes factor BF_{10} = 6, for instance, indicates that the data are six times more likely under the alternative hypothesis H_1 than under the null hypothesis H_0. Originating in the work of Harold Jeffreys in 1935, the concept built on earlier contributions from Dorothy Wrinch and J.B.S. Haldane in the 1920s and 1930s, with Jeffreys formalizing it as a tool for scientific inference in his influential book Theory of Probability. The term "Bayes factor" was popularized by Robert E. Kass and Adrian E. Raftery in their 1995 review paper, which established its modern use for model selection and hypothesis testing. Unlike frequentist p-values, which only assess evidence against a null hypothesis, the Bayes factor provides a symmetric measure that can quantify evidence in favor of either hypothesis, distinguishing between absence of evidence and evidence of absence. It is particularly advantageous for comparing non-nested models and is robust to optional stopping in data collection, making it suitable for sequential experimental designs. A widely used heuristic scale in the spirit of Jeffreys interprets Bayes factor magnitudes as strength of evidence: values between 1 and 3 indicate "anecdotal" support for H_1, 3 to 10 offer "moderate" evidence, 10 to 30 provide "strong" evidence, 30 to 100 yield "very strong" evidence, and values greater than 100 represent "extreme" evidence, with reciprocals applying for support of H_0. This scale, while subjective, has been widely adopted and refined across scientific fields for model comparison tasks. Bayes factors often require numerical approximation because marginal likelihoods are intractable in complex models, but they remain central to Bayesian model comparison and evidence accumulation.

Mathematical Foundations

Definition

The Bayes factor is a statistical measure used in Bayesian inference to quantify the relative evidence provided by observed data for one model over another competing model. It was introduced by Harold Jeffreys as a tool for objective hypothesis testing within a Bayesian framework. Mathematically, the Bayes factor in favor of model M_1 over model M_0 given data D, denoted BF_{10}, is defined as the ratio of the marginal likelihoods under each model: BF_{10} = \frac{p(D \mid M_1)}{p(D \mid M_0)}. The marginal likelihood p(D \mid M) for a model M with parameters \theta is obtained by integrating the likelihood over the prior distribution of the parameters: p(D \mid M) = \int p(D \mid \theta, M) \, p(\theta \mid M) \, d\theta. This integration averages the model's predictive performance across all plausible parameter values weighted by the prior, providing a summary of the model's overall fit to the data independent of specific parameter estimates. A common notation convention is BF_{01} = 1 / BF_{10}, which reverses the comparison to favor M_0 (often the null model) over M_1. The Bayes factor plays a central role in Bayesian model comparison by directly comparing the predictive adequacy of competing models based on the observed data, facilitating decisions about model selection without relying on point estimates or frequentist criteria.
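As a concrete illustration of the integral above, the marginal likelihood of a binomial outcome under a uniform prior can be approximated by simple numerical averaging. The sketch below (plain Python, midpoint rule; the helper names are illustrative) compares a uniform-prior model against a point null this way.

```python
from math import comb

def likelihood(theta, k, n):
    # binomial likelihood p(D | theta) for k successes in n trials
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def marginal_likelihood(k, n, grid=100_000):
    # p(D | M) = integral of likelihood * prior; with a uniform prior on [0, 1]
    # this is just the average likelihood, approximated here by the midpoint rule
    total = sum(likelihood((i + 0.5) / grid, k, n) for i in range(grid))
    return total / grid

# BF_10 for "theta unknown, uniform prior" (M_1) versus the point null theta = 0.5 (M_0)
bf10 = marginal_likelihood(8, 10) / likelihood(0.5, 8, 10)
print(round(bf10, 3))  # close to the exact value 1024/495 ~ 2.069
```

For this conjugate example the integral also has a closed form (a beta function ratio), so the quadrature result can be checked exactly; in higher dimensions this direct gridding becomes infeasible, which motivates the approximation methods discussed later.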

Relationship to Bayes' Theorem

The Bayes factor emerges directly from Bayes' theorem as a key component in updating the probabilities of competing models based on observed data. Bayes' theorem states that the posterior probability of a model M_i given data D is P(M_i | D) \propto P(D | M_i) P(M_i), where P(D | M_i) is the marginal likelihood under the model and P(M_i) is the prior probability. For two models M_1 and M_2, the ratio of posterior model probabilities, known as the posterior odds, is therefore \frac{P(M_1 | D)}{P(M_2 | D)} = \frac{P(M_1)}{P(M_2)} \times \frac{P(D | M_1)}{P(D | M_2)}, with the second factor on the right-hand side defining the Bayes factor BF_{12}. This formulation demonstrates that the Bayes factor serves as a multiplier that adjusts the prior odds to yield the posterior odds, encapsulating how the data shift belief between models. By isolating \frac{P(D | M_1)}{P(D | M_2)}, the Bayes factor measures the relative support for each model provided solely by the data, disentangling this evidential contribution from subjective beliefs about the models' plausibility. This separation allows the Bayes factor to function as an objective summary of the data's evidential value within the Bayesian updating process, applicable across diverse modeling contexts. The derivation of the Bayes factor highlights a fundamental distinction in handling point-null hypotheses versus composite models. Under a point-null hypothesis, such as M_0: \theta = \theta_0, the marginal likelihood P(D | M_0) simplifies to the likelihood evaluated directly at the fixed value, as there is no parameter uncertainty to integrate over. In contrast, for a composite model M_1 with parameters varying over a continuous space, P(D | M_1) requires integrating the likelihood over a prior distribution on the parameters to average out uncertainty, as previously outlined in the definition of the marginal likelihood. This difference affects the computational form of the Bayes factor but preserves its role in the posterior odds equation.
Harold Jeffreys pioneered the application of the Bayes factor within this framework of Bayesian model comparison in his 1939 monograph Theory of Probability (first edition).

Interpretation

Evidence Scales

The interpretation of the Bayes factor (BF) relies on standardized scales that categorize its magnitude into qualitative levels of evidence for one model (say, the alternative M_1) over another (say, the null M_0). These scales provide a framework for assessing evidential strength, though they are not universally fixed. A seminal scale was proposed by Harold Jeffreys, which divides BF values into grades based on orders of magnitude, emphasizing decisive evidence for large values. Jeffreys' scale, as commonly referenced, is as follows:
BF_{10}      Evidence against M_0
> 100        Decisive
30–100       Very strong
10–30        Strong
3–10         Substantial
1–3          Barely worth mentioning
This classification interprets BF_{10} > 1 as favoring M_1 over M_0, with the strength increasing as the value grows; reciprocally, BF_{10} < 1 (or BF_{01} > 1) supports M_0. Kass and Raftery later modified this scale to align more closely with logarithmic transformations of the Bayes factor, adjusting thresholds for practicality in empirical applications and incorporating a deviance-like measure (2 ln(BF)). Their revised scale is:
BF_{10}    2 ln(BF_{10})    Evidence against M_0
> 150      > 10             Very strong
20–150     6–10             Strong
3–20       2–6              Positive
1–3        0–2              Barely worth mentioning
This adjustment extends the "very strong" category to higher values while broadening the "positive" range, facilitating interpretation in empirical applications. Both scales maintain the directional guideline that BF_{10} > 1 indicates data more compatible with M_1 than M_0, with the ratio quantifying the relative evidential support. Despite their utility, these thresholds are inherently arbitrary, serving as rough guides rather than strict cutoffs, and have varied across implementations (e.g., some adaptations use "extreme" instead of "decisive" for BF > 100). Moreover, Bayes factors are sensitive to model specification, as changes in how competing models are parameterized or nested can substantially alter the marginal likelihoods and thus the BF value, underscoring the need for careful model formulation.
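These verbal categories are easy to automate. The following sketch (hypothetical function name; thresholds taken from the Kass–Raftery table above) maps a Bayes factor to its evidence label.

```python
def kass_raftery_label(bf10):
    """Kass-Raftery evidence category for BF_10 >= 1; values below 1
    favor M_0 and should be inverted (interpret 1/BF_10) first."""
    if bf10 < 1:
        return "favors M_0 (interpret 1/BF_10 on this scale)"
    if bf10 <= 3:
        return "barely worth mentioning"
    if bf10 <= 20:
        return "positive"
    if bf10 <= 150:
        return "strong"
    return "very strong"

print(kass_raftery_label(5.0))    # positive
print(kass_raftery_label(200.0))  # very strong
```

Treating the boundaries as soft guides rather than hard cutoffs, as the text cautions, is advisable in any real analysis.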

Posterior Odds Connection

The Bayes factor connects directly to posterior odds through Bayes' theorem applied to model comparison. Specifically, the posterior odds in favor of model M_1 over model M_2 given data D are obtained by multiplying the prior odds by the Bayes factor: \frac{P(M_1 \mid D)}{P(M_2 \mid D)} = BF_{12} \times \frac{P(M_1)}{P(M_2)}, where BF_{12} is the Bayes factor comparing M_1 to M_2. This relationship highlights the Bayes factor's role as the multiplicative update factor representing the evidence contributed solely by the data, independent of prior beliefs. Expanding this to posterior probabilities, let \pi_1 = P(M_1) and \pi_2 = P(M_2) = 1 - \pi_1; then P(M_1 \mid D) = \frac{BF_{12} \pi_1}{BF_{12} \pi_1 + \pi_2}. In Bayesian model comparison, when prior probabilities are fixed, the Bayes factor serves as a sufficient summary of the evidential content of the data, allowing direct quantification of how the observed data shift belief between competing models without needing to recompute full posteriors for each prior adjustment. This makes it particularly valuable for objective comparisons, as it isolates the data's influence while priors handle subjective elements. A common default assumption in Bayes factor applications is equal prior probabilities (\pi_1 = \pi_2 = 0.5), which simplifies the posterior odds to equal the Bayes factor itself and the posterior probability to P(M_1 \mid D) = \frac{BF_{12}}{1 + BF_{12}}. This assumption rests on the premise that the models are a priori equally plausible, often justified in exploratory analyses or when domain knowledge lacks strong preferences, though it can be sensitive to model complexity if not carefully considered. In contrast to non-Bayesian approaches, where likelihood ratios compare likelihoods at point estimates of the parameters under each model, the Bayes factor employs marginal likelihoods that integrate over parameter priors, providing a fuller evidential measure that accounts for model complexity.
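The identity above is a one-liner in code. This sketch (hypothetical function name) converts a Bayes factor and a prior model probability into a posterior model probability for the two-model case.

```python
def posterior_prob_m1(bf, prior_m1=0.5):
    """P(M_1 | D) = BF * pi_1 / (BF * pi_1 + pi_2) for two competing models,
    where bf is the Bayes factor in favor of M_1 over its competitor."""
    pi1, pi2 = prior_m1, 1.0 - prior_m1
    return bf * pi1 / (bf * pi1 + pi2)

# Equal priors: posterior probability is BF / (1 + BF), so odds equal the BF
print(posterior_prob_m1(6.0))                 # 6/7 ~ 0.857
# A skeptical prior (pi_1 = 0.1) tempers the same evidence
print(posterior_prob_m1(6.0, prior_m1=0.1))   # 0.6 / 1.5 = 0.4
```

The second call illustrates the base-rate point made later in the article: the same Bayes factor yields very different posterior beliefs depending on the prior odds.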

Computation Methods

Exact Calculation

Exact calculation of the Bayes factor is possible in cases where the marginal likelihoods under each model can be derived analytically or evaluated via direct numerical methods, particularly for models with low-dimensional parameter spaces or conjugate prior distributions. These approaches avoid the need for simulation-based approximations and provide precise values, though they are limited to relatively simple model structures. In models employing conjugate priors, such as the normal distribution with known variance or the binomial distribution with a beta prior, the marginal likelihoods admit closed-form expressions, enabling straightforward computation of the Bayes factor. For instance, consider testing a point null hypothesis H_0: \mu = \mu_0 against an alternative H_1: \mu \sim \mathcal{N}(\mu_0, \sigma_0^2) for data x_1, \dots, x_n drawn i.i.d. from \mathcal{N}(\mu, \sigma^2) with known \sigma^2. The Bayes factor BF_{01} favoring the null is given by BF_{01} = \sqrt{ \frac{ \sigma_0^2 + \frac{\sigma^2}{n} }{ \frac{\sigma^2}{n} } } \exp\left( -\frac{ n (\bar{x} - \mu_0)^2 \sigma_0^2 }{ 2 \sigma^2 \left( \sigma_0^2 + \frac{\sigma^2}{n} \right) } \right). This formula arises from the ratio of the normal marginal likelihood under the null to the integrated likelihood under the alternative prior. Similarly, for a binomial model testing H_0: p = p_0 against H_1: p \sim \text{Beta}(\alpha, \beta), the marginal likelihood under H_1 is the beta-binomial probability mass function, p(k \mid n, \alpha, \beta) = \binom{n}{k} \frac{ B(\alpha + k, \beta + n - k) }{ B(\alpha, \beta) }, where k is the number of successes and B is the beta function, yielding an exact Bayes factor as the ratio to the null binomial probability. When analytical solutions are unavailable but the parameter dimensionality remains low (e.g., one or two parameters), techniques such as numerical quadrature can evaluate the required integrals for the marginal likelihoods with high precision.
These methods discretize the integral over the parameter space using carefully chosen nodes and weights to approximate the exact value, making them suitable for near-exact computation in feasible cases. For slightly more complex low-dimensional settings, Laplace approximations provide near-exact results by expanding the integrand around its mode, though they rely on asymptotic assumptions for accuracy. For nested models, where the null model is a special case of the alternative (e.g., imposing a point restriction \theta = \theta_0), the Savage-Dickey density ratio offers an exact computational shortcut under specific conditions. The Bayes factor BF_{01} is then BF_{01} = \frac{p(\theta_0 \mid D, M_1)}{p(\theta_0 \mid M_1)}, provided the prior for the nuisance parameters under M_1 matches that under M_0 as \theta \to \theta_0, and the posterior and prior densities are continuous at \theta_0. This ratio recovers the ratio of marginal likelihoods without full integration over the parameter space. Software tools facilitate these exact methods for standard models. The R package BayesFactor implements analytical results and numerical integration (via Monte Carlo with adjustable iterations for precision) to compute Bayes factors precisely for basic designs, including one-sample t-tests (equivalent to normal means with known variance under certain priors) and linear models.
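The closed-form normal-mean Bayes factor above translates directly into code. The sketch below (hypothetical function name) also exposes the Occam penalty built into marginal likelihoods: even when \bar{x} = \mu_0 exactly, BF_{01} exceeds 1, because the alternative spreads its prior mass over many values of \mu.

```python
from math import sqrt, exp

def bf01_normal_known_var(xbar, n, mu0, sigma2, tau2):
    """BF_01 for H0: mu = mu0 versus H1: mu ~ N(mu0, tau2),
    with x_i ~ N(mu, sigma2) and sigma2 known (formula from the text)."""
    v = sigma2 / n                                   # variance of the sample mean
    ratio = (tau2 + v) / v
    quad = n * (xbar - mu0) ** 2 * tau2 / (2 * sigma2 * (tau2 + v))
    return sqrt(ratio) * exp(-quad)

# Sample mean exactly at the null value: BF_01 = sqrt(26) ~ 5.099, favoring H0
print(round(bf01_normal_known_var(xbar=0.0, n=25, mu0=0.0, sigma2=1.0, tau2=1.0), 3))
# A large deviation (xbar = 1 with n = 25) swings the evidence toward H1
print(bf01_normal_known_var(xbar=1.0, n=25, mu0=0.0, sigma2=1.0, tau2=1.0) < 1)  # True
```

The same two-line structure (variance ratio times an exponential quadratic term) appears whenever both marginal likelihoods are normal densities in the sample mean.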

Approximations and Algorithms

Computing Bayes factors exactly becomes infeasible for complex, high-dimensional models where the marginal likelihood integral cannot be evaluated analytically. Monte Carlo methods provide scalable approximations by estimating the marginal likelihood through simulation. Importance sampling draws samples from a proposal distribution to approximate the posterior, reweighting them to estimate the evidence; for instance, schemes tailored to mixture models use maximum likelihood estimates or Rao-Blackwellized dual sampling to mitigate bias from poor posterior mode exploration, enabling reliable Bayes factor computation in such settings. The harmonic mean estimator, derived from posterior samples, inverts the identity \hat{p}(y) = \left( \frac{1}{S} \sum_{s=1}^S \frac{1}{p(y \mid \theta^{(s)})} \right)^{-1}, where \theta^{(s)} are MCMC draws from the posterior, offering a simple yet notoriously variance-prone route to marginal likelihoods for Bayes factors. Markov chain Monte Carlo (MCMC) techniques extend these approximations for more robust estimation in nested or non-nested models. Bridge sampling leverages samples from a proposal distribution q and from the posterior to estimate the normalizing constant via \hat{p}(y) = \frac{ \frac{1}{S_2} \sum_{s=1}^{S_2} h(\tilde{\theta}^{(s)}) \, p(y \mid \tilde{\theta}^{(s)}) \, p(\tilde{\theta}^{(s)}) }{ \frac{1}{S_1} \sum_{s=1}^{S_1} h(\theta^{(s)}) \, q(\theta^{(s)}) }, where \tilde{\theta}^{(s)} \sim q, \theta^{(s)} are posterior draws, and the bridge function h connects the two distributions, yielding accurate Bayes factors with reduced variance compared to importance sampling alone. Thermodynamic integration approximates the marginal likelihood by integrating the expected log-likelihood along a power-posterior path indexed by \beta \in [0,1], \log p(y) = \int_0^1 \mathbb{E}_{\pi_\beta(\theta \mid y)} [\log p(y \mid \theta)] \, d\beta, where \pi_\beta(\theta \mid y) \propto p(y \mid \theta)^\beta p(\theta), often implemented with MCMC at discrete \beta levels; this method excels for comparing phylogenetic or cognitive models, providing stable Bayes factors even in high dimensions.
Recent enhancements, such as differential evolution MCMC for thermodynamic integration, further improve efficiency by requiring fewer samples per path rung, with reported convergence 5-8 times faster than standard implementations. Nested sampling is another class of algorithms for approximating marginal likelihoods, particularly effective in high-dimensional spaces. It transforms the evidence integral into a one-dimensional integral over prior mass, using sequential sampling to estimate it efficiently without extensive tuning, as implemented in tools like MultiNest or diffusive nested sampling. This method is popular in fields like cosmology and astrophysics and provides reliable Bayes factors for complex models. Information criteria offer asymptotic approximations to Bayes factors without simulation. The Bayesian information criterion (BIC) estimates \log p(y \mid M) \approx -\frac{1}{2} \mathrm{BIC} = L - \frac{k}{2} \log n, where L is the maximized log-likelihood, k the number of parameters, and n the sample size, deriving from a Laplace approximation under a unit information prior; thus, \log \mathrm{BF}_{12} \approx \frac{\mathrm{BIC}_2 - \mathrm{BIC}_1}{2}. This approximation holds asymptotically for large n and fixed k, assuming regularity conditions and correct model specification, but falters in small samples or high dimensions where the unit information prior mismatches the analyst's intended prior, potentially biasing model comparisons. For large datasets, variational methods and integrated nested Laplace approximations (INLA) enable faster estimates. Variational methods optimize a lower bound on the log marginal likelihood, \log p(y) \geq \mathbb{E}_{q(\theta)} [\log p(y \mid \theta)] - \mathrm{KL}(q(\theta) \| p(\theta)), approximating the posterior with a tractable q to derive Bayes factors in mixture and other complex settings, though they may underestimate the evidence because the bound is conservative.
INLA targets latent Gaussian models, combining Laplace approximations for conditional modes with numerical integration over the hyperparameters to compute marginal posteriors and likelihoods efficiently; it supports Bayes factor estimation via model averaging in spatial or time-series contexts, scaling to thousands of observations without MCMC. Post-2020 advances refine these methods for broader applicability. Path sampling, an extension of thermodynamic integration, estimates evidence ratios by simulating paths between models, improving accuracy in non-nested comparisons for hydrological or evolutionary models. Generalized estimators, such as the learnt harmonic mean variant, employ machine learning to optimize the importance target from posterior samples, reducing variance by orders of magnitude and yielding scalable Bayes factors in dimensions up to 10^3, outperforming traditional methods in speed and precision for cosmological and statistical applications.
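Among these approximations, the BIC route is the simplest to implement. This sketch (toy log-likelihood numbers, hypothetical function names) shows the two-line computation of an approximate log Bayes factor from the identity above.

```python
from math import log

def bic(loglik, k, n):
    # BIC = -2 * maximized log-likelihood + k * log(n); smaller is better
    return -2.0 * loglik + k * log(n)

def approx_log_bf12(bic1, bic2):
    # asymptotic identity from the text: log BF_12 ~ (BIC_2 - BIC_1) / 2
    return (bic2 - bic1) / 2.0

# Toy comparison: model 1 fits better (higher log-likelihood) at the cost
# of one extra parameter, with n = 100 observations
b1 = bic(loglik=-120.0, k=3, n=100)
b2 = bic(loglik=-123.0, k=2, n=100)
print(round(approx_log_bf12(b1, b2), 3))  # -> 0.697, mildly favoring model 1
```

Note that the fit advantage (3 log-likelihood units) is partly offset by the extra parameter's penalty of \tfrac{1}{2}\log 100 \approx 2.3, which is exactly the complexity penalty the text attributes to marginal likelihoods.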

Examples and Applications

Basic Coin Flip Example

Consider a simple hypothesis testing scenario involving coin flips to illustrate the Bayes factor. Suppose we observe data D consisting of 8 heads in 10 independent flips. We compare two models: M_0, the null hypothesis that the coin is fair with fixed bias \theta = 0.5; and M_1, the alternative hypothesis that the coin is biased, with \theta following a uniform Beta(1,1) prior on [0,1]. The marginal likelihood under M_0 is the binomial probability of the data given \theta = 0.5: p(D \mid M_0) = \binom{10}{8} (0.5)^{10} = \frac{45}{1024} \approx 0.0439. Under M_1, the marginal likelihood integrates the binomial likelihood over the prior: p(D \mid M_1) = \int_0^1 \binom{10}{8} \theta^8 (1 - \theta)^2 \, d\theta = \binom{10}{8} \frac{B(9,3)}{B(1,1)} = \frac{45}{495} = \frac{1}{11} \approx 0.0909, where B(a,b) is the beta function. The Bayes factor in favor of M_1 over M_0 is then BF_{10} = \frac{p(D \mid M_1)}{p(D \mid M_0)} \approx \frac{0.0909}{0.0439} \approx 2.07. This calculation shows how the Bayes factor quantifies the relative evidential support for the biased model. On the interpretive scales above, a value of BF_{10} between 1 and 3 is "barely worth mentioning"—in this case, only weak evidence for a biased coin. To visualize the models, consider a plot of the prior and posterior distributions under M_1 alongside the point mass at \theta = 0.5 under M_0. The prior under M_1 is flat (uniform on [0,1]). The posterior under M_1 is Beta(9,3), which concentrates around \theta \approx 0.75-0.8 after observing 8 heads. The point mass under M_0 remains fixed at 0.5, contrasting the sharp hypothesis of fairness with the spread of plausible biases under the alternative. The relative heights of the predictive distributions at the observed data further illustrate why M_1 receives more support here.
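The arithmetic in this example can be verified with exact rational arithmetic. A minimal check in Python (hypothetical helper name; the beta function is computed exactly for integer arguments):

```python
from fractions import Fraction
from math import comb, factorial

def beta_fn(a, b):
    # exact Beta function for positive integers: B(a, b) = (a-1)!(b-1)! / (a+b-1)!
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

k, n = 8, 10
p_m0 = Fraction(comb(n, k), 2 ** n)            # C(10,8) * 0.5^10 = 45/1024
p_m1 = comb(n, k) * beta_fn(1 + k, 1 + n - k)  # B(1,1) = 1, so no division needed
bf10 = p_m1 / p_m0

print(p_m1)         # 1/11
print(bf10)         # 1024/495
print(float(bf10))  # ~2.069
```

Exact fractions make it obvious that the 2.07 in the text is the rational number 1024/495 rounded to two decimals.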

Model Comparison in Regression

In regression analysis, Bayes factors enable the comparison of nested or non-nested models by quantifying the relative evidence the data provide for each model, facilitating decisions on predictor inclusion. A typical scenario contrasts a null model, which includes only an intercept, against an alternative model incorporating a single predictor. For instance, consider simulated data with 50 observations where the response variable follows a linear relationship with the predictor under a moderate effect size, such as a standardized coefficient of approximately 0.5, reflecting realistic conditions in applied research. This setup allows researchers to evaluate whether the predictor explains a meaningful portion of the variance beyond chance. To compute the Bayes factor in this regression context, Zellner's g-prior is commonly employed for the coefficients, specifying a normal prior centered at zero with covariance proportional to the inverse cross-product matrix (X^\top X)^{-1} scaled by the hyperparameter g, which tunes the prior's informativeness. The marginal likelihood under this prior admits a closed form, enabling direct calculation of the Bayes factor between models. For practical implementation, approximations such as the Bayesian information criterion (BIC) can be used, where the difference in BIC scores between the null and alternative models approximates twice the log Bayes factor in favor of the alternative, offering computational efficiency for initial assessments. Alternatively, Markov chain Monte Carlo (MCMC) methods provide more precise estimates by sampling from the posterior, as facilitated by tools like the BayesFactor R package, which defaults to a g-prior with g integrated over a hyperprior for robustness. In the simulated example, this might yield a hypothetical BF_{10} = 5, signifying substantial evidence favoring the inclusion of the predictor according to established interpretive guidelines.
Bayes factors find practical application in fields like psychology, where they support model comparisons in replication studies to assess the reliability of predictor effects across datasets, and in genetics, aiding the evaluation of whether genetic markers enhance linear models of quantitative traits in association analyses. For illustration, applying this framework to the Iris dataset—predicting sepal length from petal length—demonstrates how Bayes factors can quantify evidence for predictor utility in a real, multivariate biological context. However, results in regression settings are sensitive to prior specifications; for example, small values of g impose tighter shrinkage on coefficients, potentially reducing evidence for the alternative model, while very large g values approach non-informative priors and can paradoxically favor the null model (Bartlett's paradox), underscoring the need for careful prior justification based on domain knowledge.
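A self-contained version of this null-versus-one-predictor comparison can be run with ordinary least squares and the BIC approximation described above. This pure-Python sketch simulates data with a true standardized slope of about 0.5; the helper names and the seed are illustrative, and the resulting evidence varies with the simulated sample.

```python
import random
from math import log, pi

def gaussian_loglik(resid_ss, n):
    # maximized Gaussian log-likelihood given the residual sum of squares
    s2 = resid_ss / n
    return -0.5 * n * (log(2 * pi * s2) + 1)

def bic(loglik, k, n):
    return -2 * loglik + k * log(n)

random.seed(1)
n = 50
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]   # moderate true effect

# Null model: intercept only
ybar = sum(y) / n
rss0 = sum((yi - ybar) ** 2 for yi in y)

# Alternative: simple linear regression via closed-form OLS
xbar = sum(x) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
beta = sxy / sxx
rss1 = sum((yi - ybar - beta * (xi - xbar)) ** 2 for xi, yi in zip(x, y))

# BIC approximation to the log Bayes factor (alternative over null);
# parameter counts include the error variance
bic0 = bic(gaussian_loglik(rss0, n), k=2, n=n)
bic1 = bic(gaussian_loglik(rss1, n), k=3, n=n)
log_bf10 = (bic0 - bic1) / 2
print(round(log_bf10, 2))
```

A positive printed value indicates evidence for including the predictor; a full analysis would replace the BIC shortcut with a g-prior marginal likelihood or MCMC, as the text describes.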

Historical Development

Origins and Key Contributors

The concept of the Bayes factor emerged in the early twentieth century as a tool for Bayesian hypothesis testing, building on foundational principles of probability updating and on earlier contributions from Dorothy Wrinch and J.B.S. Haldane in the 1920s and 1930s, which introduced ideas of evidence accumulation and likelihood-based testing. Harold Jeffreys first introduced the Bayes factor in his 1935 paper "Some tests of significance, treated by the theory of probability," and further developed it in his 1939 book Theory of Probability, where he defined it as the ratio of the likelihood of data under competing hypotheses to quantify evidence in favor of one model over another. Jeffreys applied this approach to geophysical problems, such as testing hypotheses about the rigidity of the Earth's core against seismic data, demonstrating its utility in scientific inference beyond subjective priors. Although Jeffreys's work was explicitly Bayesian, it drew indirect influence from Sir Ronald Fisher's development of likelihood ratios in the 1920s, which provided a non-Bayesian framework for comparing models based on data likelihoods alone. Fisher's likelihood ratio tests emphasized evidential strength without prior probabilities, laying groundwork that Bayesians later extended by incorporating priors to form the full Bayes factor. In the 1950s, I. J. Good extended these ideas by formalizing "odds factors" as measures of evidential weight, particularly in contexts linking probability to information theory, where he explored how data update prior odds through logarithmic transformations akin to entropy. Good's contributions emphasized the Bayes factor's role in weighing evidence objectively, influencing its application in decision theory and cryptography-related statistical problems. Key advancements in the 1960s highlighted the Bayes factor's advantages over frequentist methods, notably in the 1963 paper by Ward Edwards, Harold Lindman, and Leonard J.
Savage, which advocated Bayesian methods for psychological research by contrasting posterior odds derived from Bayes factors with p-values. This work argued that Bayes factors provide a direct measure of evidential support for hypotheses, addressing limitations in significance testing by integrating prior beliefs with data. Early recognition of computational challenges in Bayes factor evaluation came from Lindley in his 1957 paper "A Statistical Paradox," which illustrated difficulties in calculating posterior probabilities for point null hypotheses under vague priors, especially in large-sample settings where integrals become intractable. Lindley's analysis underscored the paradox in which significant frequentist evidence fails to shift Bayesian posteriors substantially, pointing to the need for careful prior specification and numerical methods to make Bayes factors practical.

Evolution in Statistical Practice

The adoption of Bayes factors in statistical practice gained significant momentum in the 1980s and 1990s, building on Jeffreys' foundational theoretical work from the mid-20th century. A pivotal advancement came with the 1995 paper by Robert E. Kass and Adrian E. Raftery, which provided a comprehensive review and standardization of Bayes factor interpretation and computation, particularly through practical guidelines for assessing evidence strength in model comparison. Published in the Journal of the American Statistical Association, this work addressed computational challenges and proposed interpretive scales (e.g., Bayes factors between 1 and 3 indicating evidence "barely worth mentioning"), making the method more accessible to applied researchers across many fields. The 2000s marked a rise in practical implementation through accessible software tools, facilitating broader use in applied research. The BayesFactor R package, developed by Richard D. Morey and Jeffrey N. Rouder and first released in 2012, enabled straightforward computation of Bayes factors for common designs such as t-tests, ANOVA, and regression, with default priors chosen for objective analysis. Complementing this, JASP—an open-source graphical interface launched around 2015 by Eric-Jan Wagenmakers and collaborators—integrated Bayes factor analyses with user-friendly defaults, promoting adoption in psychological research workflows by automating prior specification and output visualization. These tools democratized Bayes factor use, shifting it from theoretical discussions to routine hypothesis testing in software environments like R. In the 2010s, Bayes factors experienced a surge in psychology and the social sciences, particularly following the reproducibility crisis highlighted by low replication rates in landmark studies around 2011–2015.
Eric-Jan Wagenmakers played a key role in this advocacy, promoting Bayes factors as a superior alternative to p-values for quantifying evidence and addressing selective reporting biases, as detailed in a 2016 Bayesian reanalysis of the Reproducibility Project: Psychology, where Bayes factors provided nuanced assessments of original versus replication findings. This period saw increased publications and guidelines emphasizing Bayes factors for robust inference, with journals like Psychological Methods featuring tutorials on their application to mitigate crisis-driven skepticism toward significance testing. From 2020 to 2025, trends have focused on integrating Bayes factors with machine learning techniques to handle computational challenges, such as scalable evidence estimation in high-dimensional settings. Methods like deep Bayes factors, proposed in 2023, leverage neural networks to approximate marginal likelihoods efficiently for large datasets, enabling applications in machine learning and predictive modeling where traditional computations falter. Concurrently, efforts to resolve the Jeffreys–Lindley paradox—where Bayes factors overly favor null hypotheses in large samples—have advanced through objective Bayesian approaches and interval-null testing, with 2021 and 2025 works demonstrating prior adjustments that align Bayesian evidence with practical effect sizes in applied contexts. An influential approximation enhancing this evolution has been the Bayesian information criterion (BIC), originally formulated by Gideon Schwarz in 1978 as a frequentist tool for penalizing model dimensionality but later recognized as an asymptotic approximation to the Bayes factor under certain priors. Its Bayesian reinterpretation in the 1990s and after, as elaborated by Kass and Raftery, has made BIC a computationally efficient proxy for Bayes factors in large-sample scenarios, widely adopted in software for quick model comparisons without full posterior sampling.

Comparisons and Limitations

Versus Frequentist Approaches

The Bayes factor (BF) differs fundamentally from frequentist approaches like p-values in its conceptualization of evidence. Whereas p-values measure the probability of observing data as extreme or more extreme than the observed under the null hypothesis, assuming long-run frequency properties, they often result in binary decisions (e.g., reject at α = 0.05) that do not quantify support for the alternative. In contrast, the BF provides a continuous measure of relative evidence between two hypotheses, H₀ and H₁, defined as the ratio of their marginal likelihoods, explicitly incorporating prior distributions to update beliefs via posterior odds. This allows the BF to assess evidence in favor of either hypothesis, avoiding the asymmetry inherent in p-values that only test against the null. A striking example of these differences is Lindley's paradox, which arises in large-sample settings where a small deviation from the null yields a statistically significant p-value (e.g., rejecting H₀), but the BF may strongly favor the null if the prior on the alternative hypothesis is sufficiently diffuse. This occurs because the BF averages evidence over the entire parameter space under each model, diluting support for the alternative as sample size grows without strong effect magnitude, whereas p-values focus on tail probabilities that become sensitive to minor discrepancies. The paradox underscores calibration issues, where frequentist procedures amplify evidence against the null in high-power scenarios, while Bayesian methods require substantive prior specification to align conclusions. Relative to likelihood ratio tests (LRT), which compare maximized likelihoods between models and can favor overly complex nested models by fitting noise in the data, the BF integrates out nuisance parameters using priors, yielding marginal likelihoods that inherently penalize excessive complexity.
This averaging prevents the LRT's tendency to select models that fit sample idiosyncrasies without generalizing, providing a more stable comparison, especially when models are nested. Empirical calibrations further highlight the BF's conservatism; for instance, Sellke et al. (2001) showed that a p-value of 0.05 corresponds to a maximum Bayes factor in favor of the alternative (BF_{10}) of about 2.5:1 (equivalently, a minimum BF_{01} of about 0.4), indicating at best modest evidence against the null, while p = 0.01 corresponds to a maximum BF_{10} of approximately 8:1, revealing how p-values often overstate evidence against the null compared to BFs. Hybrid methods bridge these paradigms by embedding BFs within frequentist-inspired frameworks via objective Bayes approaches, which select non-informative priors to achieve frequentist properties such as coverage while retaining Bayesian coherence. For example, reference priors ensure BFs approximate classical confidence procedures in limiting cases, facilitating their use in settings demanding both evidential quantification and error-rate control.
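The p-value calibration cited here follows the Sellke-Bayarri-Berger bound BF_{01} \geq -e \, p \ln p (valid for p < 1/e). A quick check in Python (hypothetical function name) reproduces the ~2.5:1 and ~8:1 figures.

```python
from math import e, log

def min_bf01(p):
    """Sellke-Bayarri-Berger lower bound on BF_01 from a p-value:
    -e * p * ln(p), valid for p < 1/e."""
    assert 0 < p < 1 / e
    return -e * p * log(p)

for p in (0.05, 0.01):
    bound = min_bf01(p)
    # the maximum BF_10 is the reciprocal of the minimum BF_01
    print(p, round(bound, 3), round(1 / bound, 2))
```

This reproduces the figures in the text: the maximum BF_{10} is about 2.46 at p = 0.05 and about 7.99 at p = 0.01, far weaker evidence than the p-values alone suggest.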

Common Criticisms

One major criticism of the Bayes factor is its sensitivity to the choice of prior distributions, which can lead to varying conclusions depending on subjective or arbitrary specifications. For instance, the use of unit information priors, intended to provide an objective benchmark by incorporating the information content of a single observation, has been debated for potentially overstating or understating evidence in certain models. Sensitivity analyses are often recommended to assess how alterations in priors affect the Bayes factor, as even modest changes can substantially shift the results. Another limitation is the computational intractability of exact Bayes factor calculations, particularly in high-dimensional settings where the marginal likelihood becomes difficult to evaluate due to the curse of dimensionality. Even approximation methods, such as those based on variational inference or sampling, can introduce biases in such scenarios, requiring extensive computational resources for reliable estimates. These approximations serve as partial solutions but do not fully resolve the challenges in complex models. Bayes factors are also prone to misinterpretation, with researchers sometimes treating them as direct probabilities of a hypothesis rather than ratios of marginal likelihoods, which can foster overconfidence in conclusions. This risk is heightened in applied research, where common errors include conflating the Bayes factor with posterior probabilities or ignoring its dependence on model assumptions. A related issue is base rate neglect, where the Bayes factor is reported in isolation without integrating prior probabilities, potentially misleading inference in scenarios with low base rates for the hypotheses under consideration. This occurs because the Bayes factor represents only the update from the data (a likelihood ratio of marginal likelihoods), and neglecting the prior odds can exaggerate evidence against intuitive base rates.
In recent critiques from the 2020s, particularly within open-science and reproducibility initiatives, the Bayes factor has faced scrutiny for over-reliance as a decision tool, prompting alternatives such as the region of practical equivalence (ROPE) for hypothesis testing. These critiques argue that Bayes factors may not adequately address practical equivalence hypotheses or parameter estimation, leading to calls for more robust Bayesian approaches over null hypothesis testing via Bayes factors. Studies from 2024-2025 have further documented practical misuses, including selective application of Bayes factors that overstates the available evidence and biases the resulting estimates, reinforcing calls for careful implementation in applied research.
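The ROPE alternative mentioned above can be sketched in a few lines (an illustrative example, assuming a normal posterior for a standardized effect; the ±0.1 ROPE bounds follow a common convention for a negligible effect size, and the function name is hypothetical):

```python
from statistics import NormalDist

def rope_decision(post_mean, post_sd, rope=(-0.1, 0.1), hdi_mass=0.95):
    """HDI + ROPE rule for a normal posterior: accept the null value if the
    95% HDI lies entirely inside the ROPE, reject it if the HDI lies
    entirely outside, and otherwise remain undecided."""
    # For a symmetric (normal) posterior, the HDI is the central interval.
    half = post_sd * NormalDist().inv_cdf((1 + hdi_mass) / 2)
    lo, hi = post_mean - half, post_mean + half
    if rope[0] <= lo and hi <= rope[1]:
        return "accept (practically equivalent to null)"
    if hi < rope[0] or lo > rope[1]:
        return "reject (practically different from null)"
    return "undecided"

print(rope_decision(0.02, 0.03))  # tight posterior near zero
print(rope_decision(0.40, 0.05))  # posterior well outside the ROPE
```

Unlike a Bayes factor, this rule operates on the posterior of the parameter itself, so it directly addresses practical significance, though the choice of ROPE bounds is itself a subjective input.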

References

  1. [1]
    A tutorial on Bayes Factor Design Analysis using an informed prior
  2. [2]
    [PDF] Bayes Factors - Robert E. Kass; Adrian E. Raftery
  3. [3]
  4. [4]
  5. [5]
    [PDF] Bayes Factors - Statistics & Data Science
  6. [6]
    [PDF] A tutorial on the Savage–Dickey method - Eric-Jan Wagenmakers
  7. [7]
    Using the 'BayesFactor' package, version 0.9.2+ - GitHub Pages
  8. [8]
    Importance Sampling Schemes for Evidence Approximation in Mixture Models
  9. [9]
    Estimating the Integrated Likelihood via Posterior Simulation Using ...
  10. [10]
    A tutorial on bridge sampling - ScienceDirect
  11. [11]
    Thermodynamic integration via differential evolution: A method for ...
  12. [12]
    Estimating the Dimension of a Model
  13. [13]
    Fast Variational Inference for Bayesian Factor Analysis in Single and ...
  14. [14]
    Bayesian model averaging with the integrated nested Laplace approximation (INLA) - arXiv:1911.00797
  15. [15]
    Selecting a conceptual hydrological model using Bayes' factors ...
  16. [16]
    Machine learning assisted Bayesian model comparison: learnt harmonic mean estimator
  17. [17]
    [PDF] Bayesian analysis of factorial designs - Eric-Jan Wagenmakers
  18. [18]
    [PDF] Mixtures of g-priors for Bayesian Variable Selection - Stat@Duke
  19. [19]
    Using the 'BayesFactor' package, version 0.9.2+
  20. [20]
    [PDF] A Review of Applications of the Bayes Factor in Psychological ...
  21. [21]
    [PDF] Harold Jeffreys's Default Bayes Factor Hypothesis Tests
  22. [22]
    [PDF] Bayes Factors - Robert E. Kass; Adrian E. Raftery, Journal of the ...
  23. [23]
    [PDF] Probability and the Weighing of Evidence - Gwern
  24. [24]
    [PDF] Psychological Review - Error Statistics Philosophy
  25. [25]
    [PDF] A Statistical Paradox - Gwern.net
  26. [26]
    A Bayesian Perspective on the Reproducibility Project: Psychology
  27. [27]
    Morey, R. D., & Rouder, J. N. (2011). Bayes factor approaches for testing interval null hypotheses.
  28. [28]
    Resolving Jeffreys-Lindley Paradox - arXiv:2503.14650
  29. [29]
    [PDF] The Bayes Factor vs. P-Value
  30. [30]
    [PDF] Calibration of p Values for Testing Precise Null Hypotheses
  31. [31]
    [PDF] The Case for Objective Bayesian Analysis - Stat@Duke
  32. [32]
    On the Sensitivity of Bayes Factors to the Prior Distributions
  33. [33]
    The Importance of Prior Sensitivity Analysis in Bayesian Statistics
  34. [34]
    Make the most of your samples: Bayes factor estimators for high ...
  35. [35]
    New Estimators of the Bayes Factor for Models with High ... - MDPI
  36. [36]
    Diagnosing the Misuse of the Bayes Factor in Applied Research
  37. [37]
    Preventing common misconceptions about Bayes Factors
  38. [38]
    Chapter 11 Odds and Bayes Factors - An Introduction to Bayesian ...
  39. [39]
    With Bayesian estimation one can get all that Bayes factors offer ...
  40. [40]
    The HDI + ROPE decision rule is logically incoherent but we can fix it.