
Bayesian inference

Bayesian inference is a method of statistical inference that employs Bayes' theorem to update the probability of a hypothesis or model as new evidence becomes available, by combining prior beliefs with the likelihood of observed data to produce a posterior distribution. This approach treats probabilities as degrees of belief rather than long-run frequencies, allowing for the explicit incorporation of uncertainty and prior knowledge into the inference process. The foundations of Bayesian inference trace back to the 18th century and Thomas Bayes, an English mathematician and Presbyterian minister, who developed the core theorem in an essay published posthumously in 1763 by Richard Price. Pierre-Simon Laplace, a French mathematician, independently derived and expanded upon the theorem in the late 1700s, applying it to problems in astronomy, physics, and probability, thereby establishing early applied Bayesian methods such as the normal-normal conjugate model. Although the approach waned in popularity during the early 20th century due to the rise of frequentist statistics, it experienced a revival in the mid-20th century through works on hierarchical modeling and decision theory, and further advanced in the late 20th and 21st centuries with computational innovations enabling complex nonconjugate models and posterior predictive checking. At its core, Bayesian inference revolves around three fundamental elements: the prior distribution, which encodes initial beliefs or knowledge about the parameters before observing data; the likelihood, which quantifies the probability of the data given those parameters; and the posterior distribution, obtained by proportionally multiplying the prior and likelihood via Bayes' theorem. This framework contrasts with frequentist methods, which treat parameters as fixed unknowns and rely solely on data-derived estimates like confidence intervals, whereas Bayesian approaches yield credible intervals that directly interpret the probability of parameter values. 
Beyond the theorem itself, Bayesian inference incorporates the law of total probability for marginalization over nuisance parameters, enabling robust handling of uncertainties in composite hypotheses and systematic errors. Bayesian inference has broad applications across disciplines, including cognitive science for modeling cognitive processes, astronomy for analyzing survey data and inferring cosmic properties, and statistics for hierarchical modeling and model comparison. Its emphasis on probabilistic predictions and uncertainty quantification makes it particularly valuable in fields requiring decision-making under incomplete information.

Fundamentals

Bayes' Theorem

Bayes' theorem is a fundamental result in probability theory that describes how to update the probability of a hypothesis based on new evidence. It is derived from the basic definition of conditional probability. The conditional probability P(A \mid B) of event A given event B (with P(B) > 0) is defined as the ratio of the joint probability P(A \cap B) to the marginal probability P(B): P(A \mid B) = \frac{P(A \cap B)}{P(B)}. Similarly, the reverse conditional probability is P(B \mid A) = \frac{P(A \cap B)}{P(A)}, assuming P(A) > 0. Equating the two expressions for the joint probability yields P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A), and solving for P(A \mid B) gives P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}. This is Bayes' theorem, where P(B) in the denominator is the marginal probability of B, often computed as P(B) = \sum_i P(B \mid A_i) P(A_i) over a partition of events \{A_i\}. In terms of inference, Bayes' theorem formalizes the process of updating the probability of a hypothesis H in light of evidence E, yielding the posterior probability P(H \mid E) as proportional to the product of the prior probability P(H) and the likelihood P(E \mid H), normalized by the total probability of the evidence P(E). This framework enables the revision of initial beliefs about causes or states based on observed effects or data. A useful verbal interpretation of the theorem uses odds ratios. The posterior odds in favor of hypothesis A over alternative B given evidence D are the prior odds \frac{P(A)}{P(B)} multiplied by the likelihood ratio \frac{P(D \mid A)}{P(D \mid B)}, which quantifies how much more (or less) likely the evidence is under A than under B. If the likelihood ratio exceeds 1, the evidence strengthens support for A; if below 1, it weakens it. The theorem is named after Thomas Bayes (c. 1701–1761), an English mathematician and Presbyterian minister, who formulated it in an essay likely written in the late 1740s but published posthumously in 1763 as "An Essay Towards Solving a Problem in the Doctrine of Chances" in the Philosophical Transactions of the Royal Society, edited and communicated by his friend Richard Price. Independently, the French mathematician Pierre-Simon Laplace rediscovered the result around 1774 and developed its applications in probability theory, with his 1812 treatise giving it wider prominence; Bayes's name was attached only retroactively, the term "Bayesian" first appearing in print with R. A. Fisher in 1950.
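The odds formulation can be checked numerically. The sketch below (with illustrative priors and likelihoods, not taken from the text) computes posterior probabilities over a two-hypothesis partition and confirms that the posterior odds equal the prior odds times the likelihood ratio:

```python
# Discrete Bayes' theorem: P(A_i | B) = P(B | A_i) P(A_i) / P(B), with
# P(B) computed over a partition via the law of total probability.
# The numbers are hypothetical, chosen only for illustration.

def posterior(priors, likelihoods):
    """Posterior probabilities: priors[i] = P(A_i), likelihoods[i] = P(B | A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)              # P(B) = sum_i P(B | A_i) P(A_i)
    return [j / evidence for j in joint]

priors = [0.25, 0.75]                  # prior odds 1:3
likelihoods = [0.8, 0.2]               # evidence is 4x more likely under A_1

post = posterior(priors, likelihoods)  # [4/7, 3/7]

# Odds form: posterior odds = prior odds * likelihood ratio
prior_odds = priors[0] / priors[1]             # 1/3
likelihood_ratio = likelihoods[0] / likelihoods[1]  # 4
posterior_odds = prior_odds * likelihood_ratio      # 4/3
assert abs(post[0] / post[1] - posterior_odds) < 1e-12
```

Since the likelihood ratio exceeds 1, the evidence shifts belief toward A_1, exactly as the verbal odds interpretation predicts.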

Prior, Likelihood, and Posterior

In Bayesian inference, the prior distribution encodes the initial beliefs or knowledge about the parameters θ before any data are observed. It is a probability distribution assigned to the parameters, which can incorporate expert opinion, historical data, or theoretical considerations. Subjective priors reflect the personal degrees of belief of the analyst, as emphasized in the subjectivist interpretation of probability, where probabilities are coherent previsions that avoid Dutch books. Objective priors, on the other hand, aim to be minimally informative and free from subjective input, such as uniform priors over a bounded parameter space or the Jeffreys prior, which is derived from the Fisher information matrix to ensure invariance under reparameterization. The likelihood function quantifies the probability of observing the data y given a specific value of the parameters θ, denoted as p(y \mid \theta). It arises from the probabilistic model of the data-generating process and is typically specified based on the assumed sampling distribution, such as a normal or binomial likelihood depending on the nature of the data. Unlike in frequentist statistics, where the likelihood is used to estimate point values of θ, in Bayesian inference it serves to update the prior by weighting parameter values according to how well they explain the observed data. The posterior distribution represents the updated beliefs about the parameters after incorporating the data, given by Bayes' theorem as p(\theta \mid y) \propto p(y \mid \theta) p(\theta). This proportionality holds because the full expression includes a normalizing constant, the marginal likelihood p(y) = \int p(y \mid \theta) p(\theta) \, d\theta, which integrates over all possible parameter values to ensure the posterior is a valid probability distribution. The marginal likelihood, also known as the evidence or model probability, plays a crucial role in comparing different models, as it measures the overall predictive adequacy of the model without conditioning on specific parameters.
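To see prior, likelihood, and posterior interact concretely, a minimal grid approximation can multiply prior by likelihood pointwise and normalize. The setup below is an assumption for illustration only: a Beta(2,2) prior and 7 successes in 10 Bernoulli trials.

```python
import math

# Grid approximation of posterior ∝ prior × likelihood for a coin's success
# probability theta. Illustrative setup: Beta(2,2) prior, 7 heads in 10 flips.
n, k = 10, 7
grid = [i / 200 for i in range(1, 200)]          # interior grid over (0, 1)
prior = [t * (1 - t) for t in grid]              # Beta(2,2), up to a constant
like = [math.comb(n, k) * t**k * (1 - t)**(n - k) for t in grid]

unnorm = [p * l for p, l in zip(prior, like)]
z = sum(unnorm)                                  # discrete stand-in for p(y)
posterior = [u / z for u in unnorm]              # sums to 1 on the grid

post_mean = sum(t * p for t, p in zip(grid, posterior))
# Conjugacy gives the exact posterior Beta(2+7, 2+3), with mean 9/14 ≈ 0.643
```

The grid mean matches the closed-form conjugate answer, showing that the normalizing constant p(y) only rescales and never changes the shape of the posterior.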

Updating Beliefs

In Bayesian inference, the process of updating beliefs begins with a prior distribution that encodes an agent's initial state of knowledge or subjective beliefs about an uncertain parameter or hypothesis. As new information in the form of observed data arrives, this prior is systematically revised to produce a posterior distribution that integrates the evidence from the data, weighted by its likelihood under different possible values of the parameter. This dynamic revision reflects a coherent approach to learning, where beliefs evolve rationally in response to evidence, allowing for the quantification and propagation of uncertainty throughout the inference process. The mathematical basis for this updating is Bayes' theorem, which formalizes the combination of prior beliefs and observed evidence into updated posteriors. An insightful reformulation expresses the process in terms of odds ratios: the posterior odds in favor of one hypothesis over another equal the prior odds multiplied by the likelihood ratio, a quantity that captures solely the evidential impact of the data by comparing the likelihoods under the competing hypotheses. This odds-based view separates the roles of initial beliefs and data-driven evidence, facilitating the assessment of how strongly observations support or refute particular models. While Bayesian updating relies on probabilistic priors and likelihoods, alternative frameworks offer contrasting approaches to belief revision. Logical probability methods, as developed by Rudolf Carnap, derive degrees of confirmation from the structural relations between evidence and hypotheses using purely logical principles, eschewing subjective priors in favor of inductive rules. In a different vein, the Dempster-Shafer theory extends beyond additive probabilities by employing belief functions that distribute mass over subsets of hypotheses, enabling the representation of both uncertainty and ignorance without committing to precise point probabilities; this allows for more flexible combination of evidence sources compared to strict Bayesian conditioning. 
These alternatives highlight limitations in Bayesian methods, such as sensitivity to prior specification, but often sacrifice the full coherence and normalization properties of probability. A fundamental principle for effective Bayesian updating is Cromwell's rule, which cautions against assigning probabilities of exactly zero to logically possible events or one to logically impossible ones, as such extremes can immunize beliefs against contradictory evidence—for example, a zero prior ensures the posterior remains zero irrespective of the strength of the evidence. Articulated by Dennis Lindley and inspired by Oliver Cromwell's plea to "think it possible you may be mistaken," this rule promotes priors that remain responsive to evidence, fostering robust inference even under incomplete initial knowledge.
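Cromwell's rule can be demonstrated in a few lines: a prior of exactly zero never moves, however strong the evidence, while a small positive prior recovers quickly. The likelihood values below are made up for the sketch.

```python
def update(prior, like_h, like_not_h):
    """One Bayesian update of P(H) given evidence with the stated likelihoods."""
    num = like_h * prior
    return num / (num + like_not_h * (1 - prior))

# A zero prior is immune to any amount of supporting evidence (Cromwell's rule):
p = 0.0
for _ in range(10):
    p = update(p, 0.99, 0.01)   # evidence strongly favors H each round
assert p == 0.0                 # the posterior stays pinned at zero

# A small but nonzero prior converges toward certainty under the same evidence:
q = 0.001
for _ in range(10):
    q = update(q, 0.99, 0.01)   # posterior odds multiply by 99 per update
```

After ten updates the nonzero prior has essentially reached probability 1, while the zero prior has learned nothing, which is exactly the pathology the rule warns against.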

Bayesian Updating

Single Observation

In Bayesian inference, updating the belief about a parameter \theta upon observing a single data point x follows directly from Bayes' theorem, yielding the posterior distribution p(\theta \mid x) \propto p(x \mid \theta) p(\theta), where p(\theta) denotes the prior distribution and p(x \mid \theta) the likelihood. The symbol \propto indicates proportionality, as the posterior is the unnormalized product of the likelihood and prior; to obtain the proper posterior density, it must be scaled by the marginal likelihood (or evidence) p(x) = \int p(x \mid \theta) p(\theta) \, d\theta for continuous \theta, ensuring the posterior integrates to 1. This framework is particularly straightforward when \theta represents discrete hypotheses that are mutually exclusive and exhaustive, such as a finite set \{\theta_1, \dots, \theta_k\}. In this case, the posterior probability for each hypothesis is P(\theta_i \mid x) = \frac{P(x \mid \theta_i) P(\theta_i)}{\sum_{j=1}^k P(x \mid \theta_j) P(\theta_j)}, where the denominator serves as the marginal likelihood, explicitly computable as the sum of the joint probabilities over all hypotheses. For simple cases with few hypotheses, such as binary outcomes (e.g., two competing explanations), the computation is direct: if the prior odds are P(\theta_1)/P(\theta_2) and the likelihood ratio is P(x \mid \theta_1)/P(x \mid \theta_2), the posterior odds become their product, with the marginal likelihood P(x) following as P(x \mid \theta_1) P(\theta_1) + P(x \mid \theta_2) P(\theta_2). To illustrate, consider updating the prior probability of rain tomorrow (0.1) based on a single weather reading, such as a cloudy morning, where the likelihood of clouds given rain is 0.8 and the marginal probability of clouds is 0.4; the posterior probability of rain then shifts upward to 0.2 to reflect this evidence, computed via the discrete formula above. Such single-observation updates form the foundation for incorporating additional data through repeated application of Bayes' theorem.
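The rain example works out as a direct transcription of the numbers in the text:

```python
# Single-observation update: P(rain | clouds) = P(clouds | rain) P(rain) / P(clouds)
prior_rain = 0.1            # prior probability of rain tomorrow
p_clouds_given_rain = 0.8   # likelihood of a cloudy morning given rain
p_clouds = 0.4              # marginal probability of a cloudy morning

posterior_rain = p_clouds_given_rain * prior_rain / p_clouds
# 0.8 * 0.1 / 0.4 = 0.2: the cloudy morning doubles the probability of rain
```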

Multiple Observations

In Bayesian inference, the framework for incorporating multiple observations extends the single-observation case by combining evidence from several data points to update the prior distribution on the parameter \theta. For n independent and identically distributed (i.i.d.) observations x_1, \dots, x_n, the posterior distribution is given by p(\theta \mid x_1, \dots, x_n) \propto \left[ \prod_{i=1}^n p(x_i \mid \theta) \right] p(\theta), where the likelihood term factors into a product due to the i.i.d. assumption, reflecting how each observation contributes multiplicatively to the evidence for \theta. This formulation scales the single-observation update, where the posterior is proportional to the prior times one likelihood, to a batch of data, enabling efficient incorporation of accumulated evidence. The i.i.d. assumption—that the observations are independent conditional on \theta—simplifies the joint likelihood to the product form, making analytical or computational inference tractable in many models, such as those from the exponential family. This conditional independence is a modeling choice, often justified by the data-generating process, but it can be relaxed when observations exhibit dependence; in such cases, the full joint likelihood p(x_1, \dots, x_n \mid \theta) is used instead of the product, which may require specifying covariance structures or hierarchical models to capture correlations. For example, in time-series data, autoregressive components can model temporal dependence while still applying Bayes' theorem to the joint distribution. The marginal likelihood for the multiple observations, which normalizes the posterior, is p(x_1, \dots, x_n) = \int \left[ \prod_{i=1}^n p(x_i \mid \theta) \right] p(\theta) \, d\theta under the i.i.d. assumption, representing the predictive probability of the data averaged over the prior. This quantity, also known as the evidence, plays a key role in model comparison via Bayes factors but can be challenging to compute exactly, often approximated using simulation methods like Markov chain Monte Carlo. 
When accumulating data from multiple sources or repeated experiments, the batch posterior formula allows direct computation using the full product of likelihoods and the initial prior, avoiding the need to iteratively re-derive intermediate posteriors for subsets of the data. This approach is particularly advantageous in large datasets, where the evidence from all observations is combined proportionally without stepwise adjustments, preserving the coherence of belief updating while scaling to practical applications.
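A short sketch of the batch update under the i.i.d. assumption shows the likelihood factoring into a per-observation product. The Bernoulli data and the uniform Beta(1,1) prior are illustrative assumptions.

```python
# Batch posterior with i.i.d. data: the joint likelihood is the product of
# per-observation likelihoods. Illustrative Bernoulli data, uniform prior.
data = [1, 0, 1, 1, 0, 1, 1, 1]          # 6 successes in 8 trials

def joint_likelihood(theta, xs):
    """Product form of p(x_1, ..., x_n | theta) for i.i.d. Bernoulli data."""
    out = 1.0
    for x in xs:
        out *= theta if x == 1 else (1 - theta)
    return out

grid = [i / 500 for i in range(1, 500)]
unnorm = [joint_likelihood(t, data) * 1.0 for t in grid]   # uniform prior = 1
z = sum(unnorm)                                            # evidence on the grid
post = [u / z for u in unnorm]
post_mean = sum(t * p for t, p in zip(grid, post))
# Conjugate closed form: Beta(1+6, 1+2), whose mean is 7/10
```

Because the data enter only through the product of likelihoods, the 6 successes and 2 failures could have arrived in any order, or from several sources, without changing the posterior.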

Sequential Updating

Sequential updating in Bayesian inference involves iteratively refining the posterior distribution as new observations arrive over time, enabling a dynamic incorporation of evidence. The core mechanism is the recursive application of Bayes' theorem, where the posterior at time t, p(\theta \mid y_{1:t}), is proportional to the likelihood of the new observation y_t given the parameter \theta, multiplied by the posterior from the previous step p(\theta \mid y_{1:t-1}). Formally, p(\theta \mid y_{1:t}) \propto p(y_t \mid \theta, y_{1:t-1}) \cdot p(\theta \mid y_{1:t-1}), assuming the observations are conditionally independent given \theta. This recursive form treats the previous posterior as the prior for the current update, allowing beliefs to evolve incrementally without recomputing from the initial prior each time. For independent and identically distributed observations, this sequential process yields the same result as a single batch update using all data at once. The advantages of this recursive approach are pronounced in streaming environments, where data arrives continuously and computational efficiency is paramount, as it avoids the need to store or reprocess the entire dataset. It supports real-time decision-making by providing updated inferences after each new observation, which is essential for adaptive algorithms that respond to evolving conditions. Additionally, sequential updating is well-suited to dynamic models, where parameters or states change over time, facilitating the tracking of temporal variations through successive refinements of the posterior. These benefits have been demonstrated in large-scale applications, such as cognitive modeling with high-velocity data streams, where incremental updates preserve inferential accuracy while managing computational constraints. A conceptual example arises in time series filtering, where sequential updating estimates latent states underlying observed data, such as inferring a system's hidden state from noisy sequential measurements. 
At each time step, the current posterior—representing beliefs about the latent state—serves as the prior, which is then updated with the new observation's likelihood to produce a sharper estimate, progressively reducing uncertainty as more evidence accumulates. This mirrors general Bayesian updating in sequential data contexts, emphasizing how each update builds on prior knowledge to form a coherent evolving picture. Despite these strengths, sequential updating presents challenges, particularly in eliciting an appropriate initial prior for long sequences of observations. The choice of starting prior can influence early updates disproportionately if data is sparse initially, and even as subsequent data dominates, misspecification may introduce subtle biases that propagate through the chain. Careful expert elicitation is thus crucial to ensure the prior reflects genuine uncertainty without unduly skewing long-term posteriors, a task that requires structured methods to aggregate expert judgment reliably.
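For conjugate models the equivalence of sequential and batch updating is easy to verify; the sketch below uses a Beta-Bernoulli pair with illustrative data.

```python
# Sequential updating: yesterday's posterior is today's prior. For i.i.d.
# data this recursion reproduces the batch result exactly.
a, b = 1.0, 1.0                    # Beta(1,1) starting prior
data = [1, 1, 0, 1, 0, 1, 1]       # illustrative Bernoulli observations

for x in data:                     # one observation per time step
    a, b = a + x, b + (1 - x)      # conjugate update of the running posterior

# Batch update from the original prior using all data at once:
a_batch = 1.0 + sum(data)
b_batch = 1.0 + len(data) - sum(data)
assert (a, b) == (a_batch, b_batch)   # same posterior either way
```

The recursion stores only the two current hyperparameters, never the raw history, which is exactly the storage advantage claimed for streaming settings.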

Formal Framework

Definitions and Notation

In the Bayesian framework for statistical models, the unknown parameters are elements θ of a parameter space Θ, typically a subset of ℝᵖ for some p, while the observed data consist of realizations x from an observable space X, which may be discrete, continuous, or mixed. The prior encodes initial uncertainty about θ via a probability measure π on Θ, which in the continuous case is specified by a density π(θ) with respect to a dominating measure (such as Lebesgue measure), and in the discrete case by a probability mass function. The likelihood is the conditional distribution of x given θ, denoted f(x|θ), which serves as the density or mass function of the sampling model x ~ f(·|θ). Distinctions between densities and probabilities arise depending on the nature of the spaces: for continuous X and Θ, π(θ) and f(x|θ) are probability density functions, integrating to 1 over their respective spaces, whereas for discrete cases they are probability mass functions summing to 1. In scenarios involving point masses, such as degenerate priors or components in mixed distributions, the Dirac delta δ_τ(θ) represents a unit point mass at a specific value τ ∈ Θ, defined such that for any function g continuous at τ, ∫ g(θ) δ_τ(θ) dθ = g(τ). The posterior distribution π(θ|x) then combines the prior and likelihood to reflect updated beliefs about θ after observing x, with Bayes' theorem providing the linkage in the form π(θ|x) ∝ f(x|θ) π(θ). This general setup underpins Bayesian inference in parametric models, where Θ parameterizes the family of distributions {f(·|θ) : θ ∈ Θ}.

Posterior Distribution

In Bayesian inference, the posterior represents the updated state of knowledge about the unknown parameters \theta after observing the data x, synthesizing prior beliefs with the information provided by the likelihood. This distribution, denoted \pi(\theta \mid x), quantifies the relative plausibility of different values of \theta conditional on x, serving as the foundation for all parameter-focused inferences such as estimating \theta or assessing its uncertainty. The posterior is formally derived from the product rule of probability, which states that the joint density of \theta and x factors as p(\theta, x) = f(x \mid \theta) \pi(\theta), where f(x \mid \theta) is the likelihood and \pi(\theta) is the prior distribution. The posterior then follows as the conditional density: \pi(\theta \mid x) = \frac{f(x \mid \theta) \pi(\theta)}{m(x)}, with the marginal likelihood m(x) = \int f(x \mid \theta) \pi(\theta) \, d\theta acting as the normalizing constant to ensure \pi(\theta \mid x) integrates to 1 over \theta. This update rule, originating in the work of Bayes and Laplace, proportionally weights the prior by the likelihood and normalizes to produce a proper probability distribution. Bayesian posteriors can be parametric or non-parametric, differing in the dimensionality and flexibility of the parameter space. Parametric posteriors assume \theta lies in a finite-dimensional space, constraining the form of the model (e.g., a normal likelihood with unknown mean yielding a normal posterior under a normal prior), which facilitates computation but may impose overly rigid assumptions on the data-generating process. In contrast, non-parametric posteriors operate over infinite-dimensional spaces, such as spaces of distributions indexed by functions or measures (e.g., via Dirichlet process priors), enabling adaptive modeling of complex, unspecified structures while maintaining coherent uncertainty quantification. The posterior's role in inference centers on its use to draw conclusions about \theta given x, such as computing expectations \mathbb{E}[\theta \mid x] for point summaries or integrating over it for decision-making under uncertainty, thereby providing a complete probabilistic framework for parameter estimation and hypothesis evaluation.

Predictive Distribution

In Bayesian inference, the predictive distribution for new, unobserved data x^* given observed data x is obtained by integrating the likelihood of the new data over the posterior distribution of the parameters \theta. This is known as the posterior predictive distribution, formally expressed as p(x^* \mid x) = \int p(x^* \mid \theta) \, \pi(\theta \mid x) \, d\theta, where p(x^* \mid \theta) is the sampling distribution (likelihood) for the new data and \pi(\theta \mid x) is the posterior density of the parameters. This formulation marginalizes over the uncertainty in \theta, providing a full probabilistic description of future observations that accounts for both data variability and parameter estimation error. The computation of the posterior predictive distribution involves marginalization, which integrates out the parameters from the joint posterior predictive density p(x^*, \theta \mid x) = p(x^* \mid \theta) \, \pi(\theta \mid x). In practice, this integral is rarely tractable analytically and is typically approximated using simulation methods, such as drawing samples \theta^{(s)} from the posterior \pi(\theta \mid x) and then generating replicated data x^{*(s)} \sim p(x^* \mid \theta^{(s)}) for s = 1, \dots, S, yielding an empirical approximation to the distribution. These simulations enable the estimation of predictive quantities like means, variances, or quantiles directly from the sample of x^{*(s)}. Unlike frequentist predictions, which substitute a point estimate (e.g., the maximum likelihood estimate) for \theta into the likelihood to obtain a predictive p(x^* \mid \hat{\theta}), the Bayesian posterior predictive averages over the entire posterior, incorporating parameter uncertainty and prior information. This leads to wider predictive intervals in small samples and better calibration for forecasting, as the plug-in approach underestimates variability by treating \hat{\theta} as fixed. 
The posterior predictive distribution is central to forecasting new data in applications such as election outcomes or environmental modeling, where it generates probabilistic predictions by propagating parameter uncertainty forward. It also facilitates model checking through posterior predictive checks, which compare observed data to simulated replicates from the posterior predictive to assess fit, such as by evaluating discrepancies via test statistics like means or extremes.
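The simulation recipe described above can be sketched in a few lines, assuming an illustrative Beta(8, 4) posterior (e.g., 7 successes in 10 Bernoulli trials under a uniform prior):

```python
import random

# Posterior predictive by simulation: draw theta^(s) from the posterior,
# then x*^(s) from the likelihood. The Beta(8, 4) posterior is illustrative.
random.seed(0)
S = 20000
a_post, b_post = 8, 4

draws = []
for _ in range(S):
    theta = random.betavariate(a_post, b_post)        # theta^(s) ~ pi(theta | x)
    draws.append(1 if random.random() < theta else 0) # x*^(s) ~ p(x* | theta^(s))

p_pred = sum(draws) / S
# Exact predictive P(x* = 1 | x) = E[theta | x] = 8/12, so p_pred ≈ 0.667
```

The Monte Carlo estimate converges to the exact predictive probability, and the same set of replicates x^{*(s)} would serve directly for posterior predictive checks against the observed data.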

Mathematical Properties

Marginalization and Conditioning

In Bayesian inference, marginalization is the process of obtaining the marginal distribution of a subset of variables by integrating out the others from their joint distribution, effectively accounting for uncertainty in those excluded variables. This operation is essential for focusing on quantities of interest while treating others as nuisance parameters. For instance, the marginal likelihood, also known as the evidence, for observed data \mathbf{x} under a model parameterized by \theta is given by m(\mathbf{x}) = \int f(\mathbf{x} \mid \theta) \, \pi(\theta) \, d\theta, where f(\mathbf{x} \mid \theta) is the sampling distribution or likelihood of the data given the parameters, and \pi(\theta) is the prior distribution on \theta. This integral represents the predictive probability of the data under the prior model and serves as a normalizing constant in Bayes' theorem. The law of total probability provides the foundational justification for marginalization in the Bayesian context, stating that the unconditional density of a variable is the expected value of its conditional density with respect to the marginal density of the conditioning variables. In continuous form, this is p(\mathbf{x}) = \int p(\mathbf{x} \mid \theta) \, p(\theta) \, d\theta, which directly corresponds to the evidence computation and extends naturally to discrete cases via summation. By performing marginalization, Bayesian analyses can reduce the dimensionality of high-dimensional parameter spaces, making inference more tractable and interpretable without losing the uncertainty encoded in the integrated variables. Conditioning complements marginalization by restricting probabilities to scenarios consistent with observed or specified conditions, thereby updating beliefs about remaining uncertainties. In Bayesian inference, conditioning on \mathbf{x} transforms the prior \pi(\theta) into the posterior \pi(\theta \mid \mathbf{x}) via \pi(\theta \mid \mathbf{x}) = \frac{f(\mathbf{x} \mid \theta) \, \pi(\theta)}{m(\mathbf{x})}, where the denominator is the marginalized evidence. 
This operation can also apply to subsets of the parameters or auxiliary variables, allowing for targeted updates that incorporate partial information. Together, marginalization and conditioning enable the decomposition of complex joint distributions into manageable components, facilitating tractable and precise probabilistic reasoning in Bayesian models.
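The evidence integral can be checked numerically. For a Binomial likelihood under a uniform Beta(1,1) prior the exact value is known in closed form, which makes a convenient sanity test; the n and k below are illustrative.

```python
import math

# Riemann-sum check of the evidence m(x) = ∫ f(x|θ) π(θ) dθ for a Binomial
# likelihood (n=10, k=7) with a uniform Beta(1,1) prior. With this prior the
# exact evidence is 1/(n+1) for every k (a classical closed-form result).
n, k = 10, 7
c = math.comb(n, k)
N = 20000                                # number of midpoint cells
h = 1.0 / N
m = sum(c * ((i + 0.5) * h) ** k * (1 - (i + 0.5) * h) ** (n - k)
        for i in range(N)) * h
# Exact value: 1/11 ≈ 0.0909
```

Under the uniform prior every success count is equally probable a priori, which is why the evidence does not depend on k at all.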

Conjugate Priors

In Bayesian inference, a conjugate prior is defined as a family of prior distributions for which the posterior distribution belongs to the same family after updating with data from a specified likelihood. This property ensures that the posterior can be obtained by simply updating the parameters of the prior, without requiring changes in the distributional form. The concept is particularly useful for distributions in the exponential family, where conjugate priors can be constructed to match the sufficient statistics of the likelihood. A classic example is the Beta-Binomial model, where the parameter \theta of a Binomial likelihood represents the success probability. The prior is taken as \theta \sim \text{Beta}(\alpha, \beta), with density proportional to \theta^{\alpha-1}(1-\theta)^{\beta-1}. For n independent observations yielding k successes, the posterior is \theta \mid \text{data} \sim \text{Beta}(\alpha + k, \beta + n - k). This update interprets \alpha and \beta as pseudocounts of successes and failures, respectively. Another prominent case is the Normal-Normal conjugate pair, applicable when estimating the mean of a normal distribution with known variance. The prior is \mu \sim \mathcal{N}(\mu_0, \sigma_0^2). Given n i.i.d. observations x_1, \dots, x_n \sim \mathcal{N}(\mu, \sigma^2) with sample mean \bar{x}, the posterior is: \mu \mid \text{data} \sim \mathcal{N}\left( \frac{\frac{n}{\sigma^2} \bar{x} + \frac{1}{\sigma_0^2} \mu_0}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}}, \ \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}} \right). The posterior mean is a precision-weighted average of the prior mean and sample mean, while the posterior variance is reduced relative to both. For count data, the Gamma-Poisson model provides conjugacy, with the Poisson rate \lambda having prior \lambda \sim \text{Gamma}(\alpha, \beta), density proportional to \lambda^{\alpha-1} e^{-\beta \lambda}. For n i.i.d. observations summing to s = \sum x_i, the posterior is \lambda \mid \text{data} \sim \text{Gamma}(\alpha + s, \beta + n). 
Here, \alpha and \beta act as shape and rate parameters updated by the total counts and exposure. The primary advantage of conjugate priors lies in their analytical tractability: posteriors, marginal likelihoods, and predictive distributions can often be derived in closed form, avoiding numerical integration and enabling efficient sequential updating in dynamic models. This is especially beneficial for evidence calculation via marginalization, where the marginal likelihood is straightforward to compute. However, conjugate families impose restrictions on the form of prior beliefs, potentially limiting flexibility in capturing complex or data-driven uncertainties, which may require sensitivity analyses to assess robustness.
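The Normal-Normal update above reduces to a few lines of arithmetic; the helper below is a sketch with illustrative numbers, not a library routine.

```python
# Normal-Normal conjugate update for a mean with known variance: posterior
# precision = prior precision + n/sigma^2, and the posterior mean is the
# precision-weighted average of prior mean and sample mean.
def normal_update(mu0, var0, xbar, sigma2, n):
    prec = 1.0 / var0 + n / sigma2                     # posterior precision
    mean = (xbar * n / sigma2 + mu0 / var0) / prec     # precision-weighted mean
    return mean, 1.0 / prec                            # (posterior mean, variance)

# Illustrative numbers: vague N(0, 4) prior, 10 observations averaging 2.0
mean, var = normal_update(mu0=0.0, var0=4.0, xbar=2.0, sigma2=1.0, n=10)
# mean = 20 / 10.25 ≈ 1.95: pulled nearly all the way to the data,
# var ≈ 0.098: smaller than both the prior variance and sigma^2 / n alone plus prior
```

With ten observations the data precision (n/σ² = 10) dwarfs the prior precision (1/σ₀² = 0.25), so the posterior mean sits close to the sample mean, illustrating how conjugate updates trade off the two sources of information.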

Asymptotic Behavior

As the sample size n increases, the Bayesian posterior distribution exhibits desirable asymptotic properties under suitable regularity conditions, ensuring that inference becomes increasingly reliable. A fundamental result is the consistency of the posterior, which states that the posterior probability concentrates on the true parameter value \theta_0 with respect to the data-generating measure, provided the model is well-specified and the prior assigns positive mass to neighborhoods of \theta_0. This property, first established by Doob, implies that the posterior mean and other summaries converge to \theta_0, justifying the use of Bayesian methods for large datasets. Under additional smoothness and regularity assumptions, the Bernstein-von Mises theorem provides a more precise characterization: the posterior \pi(\theta \mid y) asymptotically approximates a normal distribution centered at the maximum likelihood estimator \hat{\theta}_n, with covariance n^{-1} I_n(\hat{\theta}_n)^{-1} given by the scaled inverse observed information. Specifically, the total variation distance \left\| \pi(\cdot \mid y) - N\!\left(\hat{\theta}_n, n^{-1} I_n(\hat{\theta}_n)^{-1}\right) \right\|_{TV} \to 0 in probability as n \to \infty. This approximation holds for i.i.d. data from a correctly specified parametric model and priors that are sufficiently smooth and non-degenerate near \theta_0, as detailed in standard treatments of asymptotic statistics. The rate of convergence in the Bernstein-von Mises theorem is typically \sqrt{n}, reflecting the parametric efficiency of the posterior, which matches the frequentist central limit theorem for the MLE. Asymptotically, the influence of the prior diminishes, with the posterior becoming increasingly dominated by the likelihood; the prior's effect is of higher order, o_p(n^{-1/2}), ensuring that posterior credible intervals align closely with confidence intervals based on the observed information. This vanishing prior influence underscores the robustness of Bayesian inference to prior choice in large samples. 
In cases of model misspecification, where the true data-generating distribution lies outside the assumed model, these asymptotic behaviors adapt accordingly. The posterior remains consistent but concentrates on a pseudo-true value \theta^* that minimizes the Kullback-Leibler divergence from the true distribution to the model, rather than the true \theta_0. The Bernstein-von Mises approximation still holds, now centered at the MLE \hat{\theta}_n converging to \theta^*, with asymptotic normality preserved under regularity conditions on the misspecified likelihood. However, the rate may degrade in severely misspecified scenarios, and prior influence can persist if the prior favors regions away from \theta^*.
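The \sqrt{n} concentration rate can be illustrated with conjugate Beta posteriors, whose standard deviation is available in closed form; the success fraction and sample sizes below are illustrative.

```python
import math

# Posterior standard deviation shrinking at the parametric 1/sqrt(n) rate:
# Beta posteriors after observing a fixed success fraction p at growing n.
# Quadrupling n should roughly halve the posterior sd.
def beta_sd(a, b):
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

p = 0.3
sds = []
for n in (100, 400, 1600):
    a, b = 1 + p * n, 1 + (1 - p) * n     # Beta(1,1) prior plus pseudo-counts
    sds.append(beta_sd(a, b))

ratios = [sds[i] / sds[i + 1] for i in range(2)]
# Each ratio is close to 2, i.e. sd ~ sqrt(p(1-p)/n)
```

The near-constant halving also shows the prior's shrinking influence: the Beta(1,1) pseudo-counts become negligible relative to the data counts as n grows.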

Estimation and Inference

Point Estimates

In Bayesian inference, point estimates provide a single summary value for the parameter of interest, derived from the posterior distribution \pi(\theta | x), where \theta is the parameter and x represents the observed data. These estimates balance prior beliefs with the likelihood of the data, offering a way to condense the full posterior into a practical representative value. The choice of point estimate depends on the decision-theoretic framework, particularly the loss function that quantifies the cost of estimation errors. The posterior mean, also known as the Bayes estimator under squared error loss, is given by \hat{\theta} = \mathbb{E}[\theta | x] = \int \theta \, \pi(\theta | x) \, d\theta. This estimate minimizes the expected posterior loss \mathbb{E}[(\theta - \hat{\theta})^2 | x], making it suitable when errors are symmetrically penalized proportional to their squared magnitude. For instance, in estimating a normal mean with a normal prior, the posterior mean is a weighted average of the prior mean and the sample mean, reflecting the precision of each. The posterior mean is often preferred in applications requiring unbiased summaries under quadratic penalties, as it coincides with the minimum mean squared error estimator in the posterior sense. The posterior median minimizes the expected absolute error loss \mathbb{E}[|\theta - \hat{\theta}| | x] and serves as a robust point estimate, particularly when the posterior is skewed or outliers are a concern. It is defined as the value \hat{\theta} such that \int_{-\infty}^{\hat{\theta}} \pi(\theta | x) \, d\theta = 0.5. This property makes the median less sensitive to extreme posterior tails compared to the mean. 
In contrast, the maximum a posteriori (MAP) estimate, which is the posterior mode \hat{\theta}_{\text{MAP}} = \arg\max_\theta \pi(\theta | x), minimizes the 0-1 loss function \mathbb{E}[\mathbb{I}(\theta \neq \hat{\theta}) | x], where \mathbb{I} is the indicator function; it is ideal for scenarios penalizing any deviation equally, regardless of size, and often aligns with maximizing the posterior density, equivalent to penalized maximum likelihood. The MAP can be computed via optimization techniques and is computationally convenient when the posterior is unimodal. The selection among these estimates hinges on the assumed loss function: squared loss favors the mean for its emphasis on large errors, absolute loss suits the median for robustness, and 0-1 loss highlights the mode as the single most plausible value. Unlike frequentist point estimates, such as the maximum likelihood estimator, which rely solely on the likelihood and exhibit properties like consistency in large samples without prior input, Bayesian point estimates incorporate prior information, potentially improving accuracy in small-sample or informative-prior settings but introducing dependence on prior choice.
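The three estimates can be computed side by side for a skewed posterior; the sketch below assumes an illustrative Beta(3, 7) posterior and finds the median by bisection on a numerically integrated CDF.

```python
import math

# Point estimates from an illustrative Beta(3, 7) posterior: the mean
# minimizes squared error loss, the median absolute error loss, and the
# mode (MAP) corresponds to 0-1 loss.
a, b = 3, 7
mean = a / (a + b)                       # 0.3
map_est = (a - 1) / (a + b - 2)          # 0.25 (posterior mode)

const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))

def beta_cdf(x, steps=2000):
    """Midpoint-rule integral of the Beta(a, b) density from 0 to x."""
    h = x / steps
    return sum(const * ((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1)
               for i in range(steps)) * h

lo, hi = 0.0, 1.0
for _ in range(40):                      # bisection for the posterior median
    mid = (lo + hi) / 2
    if beta_cdf(mid) < 0.5:
        lo = mid
    else:
        hi = mid
median = (lo + hi) / 2
# Right-skewed posterior: mode (0.25) < median (≈ 0.286) < mean (0.3)
```

The ordering mode < median < mean is characteristic of a right-skewed posterior, making concrete how the choice of loss function moves the reported point estimate.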

Credible Intervals

In Bayesian inference, a credible interval provides a range for an unknown parameter \theta such that the posterior probability that \theta lies within the interval, given the observed data x, equals 1 - \alpha. Formally, a (1 - \alpha) credible interval I satisfies P(\theta \in I \mid x) = 1 - \alpha, where the probability is computed with respect to the posterior distribution \pi(\theta \mid x). This direct probabilistic statement contrasts with frequentist confidence intervals, which quantify the long-run frequency with which a procedure produces intervals containing the fixed true parameter, without assigning probability to \theta itself given the data. Two primary types of credible intervals are the equal-tail interval and the highest posterior density (HPD) interval. The equal-tail interval is defined by the central (1 - \alpha) portion of the posterior, specifically the interval between the \alpha/2 and 1 - \alpha/2 quantiles of \pi(\theta \mid x); it is symmetric in probability mass but may not be the shortest possible interval. In contrast, the HPD interval is the shortest interval achieving the coverage 1 - \alpha, consisting of the set \{\theta : \pi(\theta \mid x) \geq k\} where k is chosen such that the integral over this set equals 1 - \alpha; this makes it particularly suitable for skewed posteriors, as it prioritizes regions of highest density. The equal-tail approach performs well for symmetric unimodal posteriors, where the two types coincide, but the HPD interval is generally shorter for asymmetric cases. Computation of credible intervals depends on the posterior form. For models with conjugate priors, where the posterior belongs to a known family (e.g., beta or normal), credible intervals can be obtained analytically using the quantile or distribution functions of that family. In non-conjugate or complex cases, numerical methods are required, such as Markov chain Monte Carlo (MCMC) sampling to approximate \pi(\theta \mid x), followed by quantile estimation for equal-tail intervals or optimization algorithms to find the HPD region.
These numerical approaches ensure reliable interval construction even for high-dimensional parameters.
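Both interval types can be estimated directly from posterior draws. The following sketch (function names are illustrative, not from the text) computes the equal-tail interval from empirical quantiles and approximates the HPD interval as the shortest window containing a (1 - \alpha) fraction of the sorted samples; skewed gamma draws stand in for MCMC output.

```python
import random

def equal_tail_interval(samples, alpha=0.05):
    # Empirical alpha/2 and 1 - alpha/2 quantiles of the draws
    s = sorted(samples)
    lo = s[int(alpha / 2 * len(s))]
    hi = s[int((1 - alpha / 2) * len(s)) - 1]
    return lo, hi

def hpd_interval(samples, alpha=0.05):
    # Shortest window containing a (1 - alpha) fraction of the sorted draws
    s = sorted(samples)
    n = len(s)
    k = int((1 - alpha) * n)
    width, i = min((s[i + k - 1] - s[i], i) for i in range(n - k + 1))
    return s[i], s[i + k - 1]

random.seed(0)
# Right-skewed posterior draws (stand-in for MCMC output)
draws = [random.gammavariate(2.0, 1.0) for _ in range(20000)]
et = equal_tail_interval(draws)
hp = hpd_interval(draws)
print(et)
print(hp)  # shorter than the equal-tail interval for these skewed draws
```

For a symmetric posterior the two intervals nearly coincide; the gap between them grows with the skewness of the draws.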

Hypothesis Testing

In Bayesian hypothesis testing, hypotheses are evaluated through the comparison of posterior probabilities derived from Bayes' theorem, providing a direct measure of relative evidence in favor of competing models or hypotheses. Unlike frequentist approaches that rely on long-run frequencies, this framework incorporates prior beliefs and updates them with observed data to assess the plausibility of each hypothesis. A central tool for this purpose is the Bayes factor, which quantifies the relative support for one hypothesis over another based on the data alone. The Bayes factor (BF) is defined as the ratio of the marginal likelihoods under two competing hypotheses, BF_{10} = \frac{m(\mathbf{x} | H_1)}{m(\mathbf{x} | H_0)}, where m(\mathbf{x} | H_i) is the marginal probability of the data under hypothesis H_i, obtained by integrating the likelihood over the prior distribution for the parameters under that hypothesis. This ratio arises from the work of Harold Jeffreys, who developed it as a method for objective model comparison in scientific inference. Values of BF_{10} greater than 1 indicate evidence in favor of H_1, while values less than 1 support H_0; for instance, BF values between 3 and 10 are often interpreted as substantial evidence according to Jeffreys' scale. The marginal likelihoods can be challenging to compute analytically, particularly for complex models, but approximations such as the Laplace approximation or Monte Carlo estimators are commonly employed. Posterior odds for the hypotheses are then obtained by multiplying the Bayes factor by the prior odds: \frac{P(H_1 | \mathbf{x})}{P(H_0 | \mathbf{x})} = BF_{10} \times \frac{P(H_1)}{P(H_0)}. This relationship, a direct consequence of Bayes' theorem, allows the incorporation of subjective or objective prior probabilities on the hypotheses themselves, yielding posterior probabilities that can guide decisions.
For point null hypotheses, such as H_0: \theta = \theta_0, the posterior odds can be linked to credible intervals by examining the posterior density at the null value, though this is typically a secondary consideration to the Bayes factor approach. For testing equivalence or practical null hypotheses, where the goal is to determine whether a parameter lies within a predefined region of negligible effect (e.g., no meaningful difference), the region of practical equivalence (ROPE) provides a complementary Bayesian procedure. The ROPE is specified as an interval around the null value, such as [-\delta, \delta], reflecting domain-specific notions of practical insignificance. Evidence for the null is declared if a high-probability region (e.g., the 95% highest density interval) of the posterior falls entirely within the ROPE, while evidence against equivalence occurs if that region lies entirely outside it. This method, advocated by John Kruschke, addresses limitations in traditional testing by explicitly framing decisions in terms of ranges of parameter values rather than point estimates. Despite these advantages, Bayesian hypothesis testing via Bayes factors and related tools exhibits sensitivity to the choice of priors on model parameters and hypotheses, which can substantially alter the marginal likelihoods and thus the evidential conclusions. This dependence underscores the need for robustness checks, such as varying the prior and reporting the range of resulting Bayes factors, to ensure inferences are not overly influenced by prior specifications.

Examples

Coin Toss Problem

The coin toss problem exemplifies Bayesian inference in a simple discrete setting, where the goal is to estimate the unknown probability p of the landing heads, assuming tosses. This scenario models situations like estimating success probabilities in trials, such as defect rates or outcomes. Observations of heads and tails update an initial belief () about p to form a posterior that quantifies updated uncertainty. The setup begins with a binomial likelihood for the data: given n tosses, the number of heads y follows y \sim \text{[Binomial](/page/Binomial)}(n, p), with P(y \mid p) = \binom{n}{y} p^y (1-p)^{n-y}. The distribution for p \in [0,1] is chosen as the , p \sim \text{[Beta](/page/Beta)}(\alpha, \beta), with density f(p) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} p^{\alpha-1} (1-p)^{\beta-1}, where \alpha > 0 and \beta > 0 act as counts of heads and tails, respectively. This choice is convenient because the beta family is conjugate to the , ensuring the posterior remains beta-distributed: p \mid y \sim \text{[Beta](/page/Beta)}(\alpha + y, \beta + n - y). The conjugacy of the - pair facilitates analytical updates and was systematically developed in early frameworks. The posterior mean provides a point estimate for p: \mathbb{E}[p \mid y] = \frac{\alpha + y}{\alpha + \beta + n}, which blends the prior mean \frac{\alpha}{\alpha + \beta} and the maximum likelihood estimate \frac{y}{n}, with weights proportional to their effective sample sizes \alpha + \beta and n. For after n tosses, a 95% is the 0.025 and 0.975 quantiles of the posterior , which can be obtained via the quantile function q_{\text{Beta}}(\cdot; \alpha + y, \beta + n - y). As an illustration, consider a uniform (\alpha = 1, \beta = 1) and data of 437 heads in 980 tosses; the posterior \text{Beta}(438, 544) has mean approximately 0.446 and 95% [0.415, 0.477], showing contraction around the data while influenced by the . 
Visualization of the prior and posterior densities reveals the updating process: the prior beta density starts as a broad curve (e.g., uniform for \alpha = \beta = 1), and successive data incorporation shifts the mode toward y/n while reducing variance, as seen in overlaid density plots. For small n, the posterior retains substantial prior shape; with large n, it approximates a normal density centered at the sample proportion. These plots, often generated using statistical software, underscore the gradual dominance of data over prior beliefs.

Medical Diagnosis

In medical diagnosis, Bayesian inference enables the calculation of the probability that a patient has after receiving a test result, by combining the disease's (typically its in the population) with the test's likelihood properties. , defined as the probability of a positive test given the presence of the disease, and specificity, the probability of a negative test given the absence of the disease, serve as the key likelihood ratios in this updating process. These parameters allow clinicians to compute the posterior probability using Bayes' theorem, which formally is P(D \mid +) = \frac{P(+ \mid D) \, P(D)}{P(+ \mid D) \, P(D) + P(+ \mid \neg D) \, P(\neg D)}, where D denotes the presence of the disease, + a positive test result, and P(+ \mid \neg D) = 1 - specificity; an analogous formula applies for a negative test result. A classic illustrative example involves a rare disease with a 1% prevalence (P(D) = 0.01) and a diagnostic test exhibiting 99% sensitivity (P(+ \mid D) = 0.99) and 99% specificity (P(- \mid \neg D) = 0.99). To compute the posteriors, consider a hypothetical cohort of 10,000 individuals screened for the disease. The resulting contingency table breaks down the outcomes as follows:
Disease Present (D)Disease Absent (\neg D)Total
Positive Test (+ )99 (true positives)99 (false positives)198
Negative Test (- )1 (false negative)9,801 (true negatives)9,802
Total1009,90010,000
From this table, the posterior probability of disease given a positive test is P(D \mid +) = 99 / 198 \approx 0.50 or 50%, meaning half of positive results are false positives due to the low prior prevalence outweighing the test's high accuracy. Conversely, P(D \mid -) = 1 / 9,802 \approx 0.0001, confirming the test's strong ability to rule out the disease in this scenario. This example highlights the , where individuals— including medical professionals—often neglect the and overestimate the posterior based solely on the test's accuracy, such as assuming P(D \mid +) \approx 99\%. In a seminal study, Casscells et al. surveyed physicians, medical students, and house officers using a similar with 0.1% prevalence and 5% ; most respondents incorrectly estimated the posterior at around 95%, ignoring the and leading to potential . This bias, part of broader heuristics in probabilistic judgment, underscores the need for explicit Bayesian updating to avoid misinterpreting test results in low-prevalence settings.

Linear Regression

Bayesian linear regression applies to model the of a response \mathbf{y} given predictors \mathbf{X}, assuming a linear relationship with additive . The model is specified as \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}, where \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n) and \sigma^2 is known. This setup allows for exact posterior inference when a conjugate is used on the regression coefficients \boldsymbol{\beta}. A natural conjugate prior for \boldsymbol{\beta} is the , \boldsymbol{\beta} \sim \mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Lambda}_0^{-1}), where \boldsymbol{\Lambda}_0 is the prior precision matrix. The likelihood is p(\mathbf{y} \mid \boldsymbol{\beta}) = (2\pi \sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \right). The resulting posterior distribution is also , p(\boldsymbol{\beta} \mid \mathbf{y}) = \mathcal{N}(\boldsymbol{\mu}_n, \boldsymbol{\Lambda}_n^{-1}), with updated precision \boldsymbol{\Lambda}_n = \boldsymbol{\Lambda}_0 + \frac{1}{\sigma^2} \mathbf{X}^\top \mathbf{X} and mean \boldsymbol{\mu}_n = \boldsymbol{\Lambda}_n^{-1} \left( \boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0 + \frac{1}{\sigma^2} \mathbf{X}^\top \mathbf{y} \right). This conjugate update combines the prior information with the data evidence in a closed form, enabling straightforward computation of posterior summaries such as the mean and credible intervals for \boldsymbol{\beta}. The predictive distribution for a new response y_* at covariate values \mathbf{x}_* follows from integrating over the posterior, p(y_* \mid \mathbf{y}, \mathbf{x}_*) = \mathcal{N}\left( \mathbf{x}_*^\top \boldsymbol{\mu}_n, \sigma^2 + \mathbf{x}_*^\top \boldsymbol{\Lambda}_n^{-1} \mathbf{x}_* \right). 
This quantifies in predictions, incorporating both the residual variance \sigma^2 and the posterior variability in \boldsymbol{\beta}, which widens for extrapolations where \mathbf{x}_* lies far from the training data . In comparison to ordinary least squares (OLS), which yields the point estimate \hat{\boldsymbol{\beta}}_{\text{OLS}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}, the Bayesian posterior \boldsymbol{\mu}_n acts as a shrinkage estimator that pulls estimates toward the \boldsymbol{\mu}_0. The strength of shrinkage depends on the precision relative to the data precision \mathbf{X}^\top \mathbf{X} / \sigma^2; a weakly informative (large \boldsymbol{\Lambda}_0^{-1}) results in \boldsymbol{\mu}_n \approx \hat{\boldsymbol{\beta}}_{\text{OLS}}, while stronger priors regularize against , particularly in low-data regimes. Unlike OLS, which lacks inherent for coefficients, the full posterior provides a over possible \boldsymbol{\beta} values.

Comparisons with Frequentist Methods

Key Differences

Bayesian inference fundamentally differs from frequentist statistics in its philosophical foundations and methodological approaches, particularly in how is quantified and incorporated into statistical reasoning. At the core of this distinction lies the interpretation of probability: in the Bayesian paradigm, probability represents a degree of belief about unknown parameters, treating them as random variables with distributions that reflect subjective or . In contrast, the frequentist view regards parameters as fixed but unknown constants, with probability defined as the long-run frequency of events in repeated sampling, applying only to random data generation processes. This epistemological divide shapes all subsequent aspects of inference, emphasizing belief updating in Bayesian methods versus sampling properties in frequentist ones. A primary methodological contrast arises in the process of inference. Bayesian inference derives the posterior of parameters by combining the likelihood of the observed with a , yielding direct probabilistic statements about parameter values, such as the probability that a parameter lies within a certain . , however, relies on the of statistics under fixed parameters, producing measures like p-values or that describe the behavior of estimators over hypothetical repeated samples rather than probabilities for the parameters themselves. For instance, a Bayesian quantifies the plausible range for a parameter given the and , while a frequentist indicates the method's long-run coverage reliability, not a direct probability statement. The role of prior information further delineates these paradigms. Bayesian methods explicitly incorporate distributions to represent pre-existing or assumptions about parameters, allowing for the subjective of expert opinion or historical into the analysis, which can be updated sequentially as new emerges. 
Frequentist approaches eschew priors entirely, aiming for objectivity by basing inferences solely on the observed and likelihood, without accommodating prior beliefs, which proponents argue avoids but limits flexibility in incorporating . This inclusion of priors in Bayesian inference is often contentious, as it introduces elements of subjectivity, yet it enables more nuanced modeling in complex scenarios where alone may be insufficient. Regarding repeatability and the nature of statistical conclusions, frequentist statistics emphasizes long-run frequency properties, such as the of intervals approaching the nominal level over infinite repetitions of the experiment under the true parameter. Bayesian inference, by contrast, focuses on updating beliefs through the posterior, providing a coherent framework for sequential learning where conclusions evolve with accumulating data, without reliance on hypothetical repetitions. This belief-updating mechanism allows Bayesian methods to offer interpretable probabilities for hypotheses directly, fostering a dynamic approach to uncertainty that aligns with in scientific inquiry.

Model Selection

In Bayesian inference, model selection involves comparing multiple competing models to determine which best explains the observed , accounting for both fit and complexity. The posterior probability of a model M_k given x is given by P(M_k \mid x) \propto p(x \mid M_k) P(M_k), where P(M_k) is the of the model and p(x \mid M_k) is the , also known as the evidence or predictive density of the data under the model. This formulation naturally incorporates prior beliefs about model plausibility and favors models that balance goodness-of-fit with . The marginal likelihood p(x \mid M_k) is computed as the integral \int p(x \mid \theta_k, M_k) p(\theta_k \mid M_k) \, d\theta_k, integrating out the model parameters \theta_k with respect to their distribution. This integral quantifies the average predictive performance of the model across its parameter space, penalizing overly complex models whose prior probability mass is dispersed over a larger volume, thus making it less likely to concentrate on the data under a point null or simple alternative. For comparing two models M_1 and M_2, the B_{12} = p(x \mid M_1) / p(x \mid M_2) serves as the ratio of their marginal likelihoods, providing a measure of relative ; values greater than 1 indicate support for M_1. This approach embodies through the inherent complexity penalty in the : simpler models assign higher density to parameter regions compatible with the data, yielding higher evidence, while complex models dilute this density across implausible regions, reducing their posterior odds unless the data strongly favors the added flexibility. Posterior model probabilities can then be obtained by normalizing over all models, enabling probabilistic statements about model uncertainty, such as the probability that the true model is among a . 
Computing the marginal likelihood exactly is often intractable for high-dimensional models, leading to approximations like the (BIC), which provides an asymptotic estimate: \mathrm{BIC}_k = -2 \log L(\hat{\theta}_k \mid x) + d_k \log n, where L is the maximized likelihood, d_k is the number of parameters in M_k, and n is the sample size; lower BIC values approximate higher log marginal likelihoods. Similarly, the (AIC), \mathrm{AIC}_k = -2 \log L(\hat{\theta}_k \mid x) + 2 d_k, can be interpreted in a Bayesian context as a rough to the relative expected Kullback-Leibler divergence, though it applies a milder penalty and is less consistent for in large samples compared to BIC. These criteria facilitate practical model by approximating the Bayesian without full .

Decision Theory Integration

Bayesian decision theory integrates the principles of Bayesian inference with decision-making under uncertainty, providing a framework for selecting actions that minimize expected losses based on posterior beliefs. In this approach, a loss function L(\theta, a) quantifies the penalty for taking action a when the true parameter \theta is the case, allowing decisions to be evaluated relative to probabilistic assessments of uncertainty. The posterior expected loss, or posterior risk, for an action a given data x is then computed as \rho(\pi, a \mid x) = \int L(\theta, a) \pi(\theta \mid x) \, d\theta, where \pi(\theta \mid x) is the posterior distribution; the optimal Bayes action \delta^*(x) minimizes this quantity for each observed x. This setup ensures that decisions are coherent with the subjective or objective probabilities encoded in the prior and updated via Bayes' theorem. The overall performance of a decision rule \delta is assessed through the Bayes risk, which averages the risk function R(\theta, \delta) = \mathbb{E}[L(\theta, \delta(X)) \mid \theta] over the prior distribution: r(\pi, \delta) = \int R(\theta, \delta) \pi(\theta) \, d\theta. A Bayes rule \delta_\pi, which minimizes the posterior risk for prior \pi, in turn minimizes the Bayes risk among all decision rules, establishing it as the optimal procedure under the chosen prior. For specific loss functions, such as squared error loss L(\theta, a) = (\theta - a)^2, the Bayes rule corresponds to the posterior mean as a point estimate, linking directly to common Bayesian summaries. Within the Bayesian framework, admissibility requires that no other decision rule has a risk function that is smaller or equal everywhere and strictly smaller for some \theta; Bayes rules are generally admissible, particularly under conditions like bounded loss functions or compact parameter spaces, as they achieve the minimal possible risk in a neighborhood of the prior. 
The minimax criterion, which seeks to minimize the maximum risk \sup_\theta R(\theta, \delta), can be attained by Bayes rules when the risk is constant over \theta, providing a robust alternative when priors are uncertain. This Bayesian minimax approach contrasts with non-Bayesian versions by incorporating prior information to stabilize decisions. Bayesian is fundamentally connected to maximization, where the loss function is the negative of a function U(\theta, a) = -L(\theta, a), so that selecting the action maximizing the posterior expected \int U(\theta, a) \pi(\theta \mid x) \, d\theta yields the same optimal decisions. This linkage, axiomatized in subjective expected , ensures that rational choices under align with coherent probability assessments, as developed in foundational works on personal probability.

Computational Methods

Markov Chain Monte Carlo

Markov chain Monte Carlo (MCMC) methods are essential computational techniques in Bayesian inference for approximating posterior distributions when analytical solutions are intractable, particularly for complex models with high-dimensional parameter spaces. These methods generate a sequence of samples from a whose stationary distribution matches the target posterior, allowing estimation of posterior expectations, credible intervals, and other summaries through . By simulating dependent samples that converge to the posterior, MCMC enables in scenarios where direct sampling is impossible, such as non-conjugate priors where the posterior lacks a closed form. The Metropolis-Hastings algorithm, a foundational MCMC method, constructs the through a proposal distribution and an acceptance mechanism to ensure the chain targets the desired posterior. At each iteration, a candidate state \theta' is proposed from a distribution q(\theta' \mid \theta^{(t)}), where \theta^{(t)} is the current state. The acceptance probability is then computed as \alpha = \min\left(1, \frac{p(\theta') q(\theta^{(t)} \mid \theta')}{p(\theta^{(t)}) q(\theta' \mid \theta^{(t)})}\right), where p(\cdot) denotes the unnormalized posterior density; the proposal is accepted with probability \alpha, otherwise the chain remains at \theta^{(t)}. This general framework, introduced by et al. in for symmetric proposals and extended by in to arbitrary proposals, guarantees and thus convergence to the posterior under mild conditions. Gibbs sampling, a special case of Metropolis-Hastings, simplifies the process for multivariate posteriors by iteratively sampling from full conditional distributions, avoiding explicit acceptance steps. For a parameter vector \theta = (\theta_1, \dots, \theta_d), the algorithm updates each component \theta_j^{(t+1)} \sim p(\theta_j \mid \theta_{-j}^{(t)}, y) sequentially or in random order, where \theta_{-j} denotes all components except j and y is the data. 
This method, originally proposed by Geman and Geman in 1984 for image restoration, exploits conditional independence to explore the posterior efficiently, particularly in hierarchical models where conditionals are tractable despite an intractable joint. Assessing MCMC convergence is crucial, as chains may mix slowly or fail to explore the posterior adequately. Trace plots visualize sample paths over iterations, revealing trends, , or stationarity; effective sample size, accounting for dependence, quantifies the relative to draws. The Gelman-Rubin diagnostic compares variability across multiple chains started from overdispersed initials, estimating the potential scale reduction factor \hat{R}, where values near 1 indicate ; originally developed by Gelman and Rubin in , it monitors both within- and between-chain variances to detect lack of equilibration. In high-dimensional Bayesian inference, MCMC excels at handling posteriors with thousands of parameters, such as in genomic models or spatial statistics, by iteratively navigating complex geometries that defy analytical tractability. For instance, in large-scale , Metropolis-Hastings with adaptive proposals or in conjugate-like blocks scales to dimensions where direct fails, providing asymptotically exact approximations whose accuracy improves with chain length. These methods underpin applications in fields requiring over vast parameter spaces, though computational cost grows with dimension, motivating efficient implementations.

Variational Inference

Variational inference is a deterministic optimization-based approach to approximating the intractable posterior in Bayesian models by selecting a simpler variational q(\theta) that minimizes the to the true posterior p(\theta \mid x). This method transforms the problem into an optimization task, making it suitable for large-scale models where exact computation is infeasible. The , defined as \KL(q(\theta) \parallel p(\theta \mid x)) = \E_{q(\theta)}[\log q(\theta) - \log p(\theta \mid x)], measures the information loss when using q to approximate p, and minimizing it yields a tight when q is flexible enough. In variational Bayes, the approximation is achieved by maximizing the (ELBO), which provides a tractable lower bound on the log \log p(x): \ELBO(q) = \E_{q(\theta)}[\log p(x, \theta)] - \E_{q(\theta)}[\log q(\theta)] = \log p(x) - \KL(q(\theta) \parallel p(\theta \mid x)). This objective decomposes the KL divergence and can be optimized directly, as the term is constant with respect to q. The ELBO balances model fit (via the expected log joint) and regularization (via the of q), ensuring the approximation remains close to the while explaining the data. A common choice for q is the mean-field approximation, which assumes full among the parameters, factorizing as q(\theta) = \prod_j q_j(\theta_j). This simplifies computations in high-dimensional spaces, such as graphical models, by the updates for each factor. Optimization often proceeds via coordinate ascent, iteratively maximizing the ELBO with respect to each q_j while holding others fixed, leading to closed-form updates in conjugate models. Compared to (MCMC) methods, variational inference offers significant speed advantages, scaling to millions of data points through efficient optimization, but it introduces bias due to the restrictive form of q, potentially underestimating posterior uncertainty. 
In practice, this trade-off favors variational methods for applications requiring , while MCMC is preferred when unbiased estimates are critical despite longer computation times.

Probabilistic Programming

Probabilistic programming languages facilitate the specification and of Bayesian models by allowing users to define probabilistic structures in code, separating model declaration from inference algorithms. These tools enable statisticians and data scientists to express complex hierarchical models intuitively, often using declarative syntax where the focus is on the rather than implementation details. JAGS (Just Another Gibbs Sampler), introduced in 2003, exemplifies declarative modeling through a BUGS-like language that represents Bayesian hierarchical models as directed acyclic graphs, specifying nodes' distributions and dependencies. , released in 2012, employs an imperative approach in its , defining a function over parameters and data with blocks for transformed parameters and generated quantities, offering greater expressiveness for custom computations. PyMC, evolving from its 2015 version to a comprehensive framework by 2023, uses Python-based declarative syntax to build models with distributions like pm.[Normal](/page/Normal) and supports hierarchical structures seamlessly. These languages integrate engines such as (MCMC) methods—including in JAGS and in Stan and PyMC—and variational (VI) approximations, allowing automatic posterior sampling or optimization without manual coding of samplers. Key benefits of these frameworks include enhanced , as models and configurations can be version-controlled and shared via repositories, ensuring identical results with fixed seeds and software versions. (AD) further accelerates ; Stan computes exact gradients using reverse-mode AD for efficient MCMC, while PyMC leverages PyTensor for gradient-based VI and sampling. JAGS, though lacking native AD, promotes through its scripting interface and compatibility with R for transparent analysis pipelines. 
By 2025, has evolved to incorporate integrations, exemplified by , a -based language introduced in 2018 that unifies neural networks with Bayesian modeling for scalable deep probabilistic programs. supports MCMC and engines with via , enabling hybrid models like variational autoencoders within Bayesian frameworks, and its NumPyro extension provides JAX-accelerated inference for large-scale applications. This progression reflects a broader trend toward universal , bridging traditional Bayesian tools with modern ecosystems.

Applications

Machine Learning and AI

Bayesian neural networks (BNNs) extend traditional neural networks by placing prior distributions over the weights, enabling the quantification of epistemic uncertainty in predictions. This approach treats the network parameters as random variables, allowing the posterior distribution to capture both data fit and model uncertainty, which is particularly useful in safety-critical applications where overconfidence can be detrimental. The foundational work on BNNs was developed in Radford Neal's 1996 thesis, which demonstrated how Bayesian methods can regularize neural networks and provide principled uncertainty estimates through integration over the posterior. In practice, priors such as Gaussian distributions are commonly imposed on weights to encode assumptions about their magnitude and correlations, leading to more robust models that avoid compared to . Gaussian processes (GPs) serve as a cornerstone of for non-parametric regression and tasks, modeling as distributions over possible mappings from inputs to outputs. In , GPs use a kernel to define the structure, yielding predictive distributions that naturally incorporate uncertainty, with the providing point estimates and the variance reflecting intervals. For , GPs extend this framework via latent approximations, such as through Laplace methods or variational techniques, to handle or multi-class problems while maintaining probabilistic outputs. The seminal formulation of GPs for was advanced in the 2006 book by and Williams, which established GPs as a flexible to models, especially effective for small-to-medium datasets where exact is feasible. GPs excel in scenarios requiring interpretable uncertainty, such as time-series forecasting or spatial interpolation, and their Bayesian nature ensures that predictions update coherently with new data. Active learning leverages Bayesian methods to select the most informative data points for labeling, reducing the annotation burden in pipelines. 
By querying instances that maximize expected information gain—often measured via between predictions and model efficiently explores the data space, particularly when integrated with GPs or BNNs as models. A influential approach, BALD (), uses the between predictions and posterior to prioritize queries that resolve . This , building on earlier information-theoretic frameworks, has shown substantial gains in image classification tasks. Complementing , employs GPs to model objective functions in black-box settings, iteratively selecting points via acquisition functions like expected to balance and . The expected criterion, introduced in Jones et al.'s 1998 work, has become a standard for hyperparameter tuning and experimental , achieving faster convergence than search or random sampling in high-dimensional spaces. In the 2020s, advancements have focused on scalable for BNNs in large-scale models, addressing the computational challenges of exact posterior approximation through variational (VI) and related techniques. VI approximates the posterior by optimizing a lower bound on the , enabling efficient of BNNs with millions of parameters by amortizing inference across mini-batches. Notable progress includes rank-1 factorizations that reduce the parameter space while preserving uncertainty calibration, as demonstrated in Dusenberry et al.'s 2020 method, which improved scalability on datasets like without sacrificing predictive performance. These developments have facilitated the integration of Bayesian principles into architectures, enhancing reliability in domains like autonomous systems and . Predictive distributions in these models provide calibrated uncertainties that guide under limited data.

Bioinformatics and Healthcare

Bayesian inference plays a pivotal role in phylogenetic analysis by incorporating priors on evolutionary trees to estimate relationships among species or sequences from genomic data. In this framework, priors such as the birth-death sampling process model the tree topology and branch lengths, accounting for incomplete sampling and extinction events to produce posterior distributions of phylogenies. Seminal software like MrBayes implements Markov chain Monte Carlo (MCMC) sampling to explore these posteriors under mixed substitution models, enabling robust inference even with sparse data. Similarly, BEAST extends this by integrating time-calibrated trees with molecular clock priors, facilitating divergence time estimation in evolutionary biology and epidemiology. These methods have revolutionized phylogenetics by quantifying uncertainty in tree topologies and supporting hypotheses like adaptive radiations through posterior probabilities.

In drug development, Bayesian adaptive trials optimize clinical programs by dynamically adjusting enrollment, dosing, or treatment arms based on interim data, incorporating historical priors to enhance efficiency and ethical patient allocation. For instance, multi-arm multi-stage designs use posterior probabilities to drop ineffective treatments early, reducing sample sizes while maintaining power, as demonstrated in trials where priors from preclinical studies inform efficacy thresholds. High-impact applications include the I-SPY 2 trial, which employed Bayesian hierarchical models to predict pathological complete response rates, accelerating the identification of promising therapies for breast cancer subtypes. This approach minimizes exposure to futile regimens and integrates real-time learning, contrasting with fixed frequentist designs by leveraging accumulating evidence for dose escalation or futility stopping.

Genomic data analysis benefits from Bayesian hierarchical models to detect and characterize genetic variants, such as single nucleotide polymorphisms (SNPs), by pooling information across loci or populations to shrink effect estimates and control false positives.
These models place hyperpriors on variant effects, enabling variable selection in genome-wide association studies (GWAS) where thousands of markers are tested simultaneously, as in shrinkage-prior approaches that penalize small effects while highlighting likely causal variants. For structural variants like copy number variations (CNVs), hierarchical priors model probe-level noise and allelic imbalance, inferring segment boundaries and copy-number states from next-generation sequencing reads with improved resolution over non-Bayesian methods. Such frameworks have identified population-specific selection signals in genomes, quantifying admixture and population structure through posterior credible intervals.

During the COVID-19 pandemic, Bayesian extensions of the susceptible-infected-recovered (SIR) model incorporated informative priors on transmission rates and reporting biases to forecast epidemics and evaluate interventions across regions. These models used time-varying priors derived from early outbreak data to update effective reproduction numbers (R_t) sequentially, capturing multiple waves and non-pharmaceutical intervention effects like lockdowns with spatiotemporal hierarchies. Influential analyses, such as those integrating changepoint detection, estimated underreporting factors and intervention impacts in the United Kingdom, providing probabilistic forecasts that informed policy with uncertainty bands. Through sequential updating with incoming case data, these approaches allowed real-time refinement of parameters without refitting from scratch.
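As a toy illustration of this kind of sequential updating (not any specific published COVID model), a conjugate gamma-Poisson pair shows how a posterior over a daily case rate can absorb each new count without refitting earlier data; the prior settings and counts below are hypothetical:

```python
# Sequential conjugate updating: a Gamma(alpha, beta) prior on a Poisson
# daily case rate is refreshed by each incoming count, so yesterday's
# posterior becomes today's prior -- no refitting from scratch.
alpha, beta = 2.0, 1.0            # hypothetical Gamma(shape, rate) prior
for count in [4, 7, 5, 6]:        # hypothetical daily case counts
    alpha += count                # shape absorbs the observed count
    beta += 1.0                   # rate absorbs one observation interval
posterior_mean = alpha / beta     # point estimate of the daily rate
posterior_var = alpha / beta**2   # uncertainty shrinks as data accumulate
# posterior_mean = (2 + 22) / (1 + 4) = 4.8
```

Because the gamma prior is conjugate to the Poisson likelihood, each update is exact and order-independent: processing the counts one at a time yields the same posterior as fitting all of them at once.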

Astrophysics and Cosmology

In astrophysics and cosmology, Bayesian inference plays a central role in parameter estimation for the standard ΛCDM model, particularly through analyses of cosmic microwave background (CMB) data from the Planck satellite. The Planck Collaboration employed Markov chain Monte Carlo (MCMC) methods within a Bayesian framework to derive constraints on cosmological parameters such as the Hubble constant, matter density, and amplitude of scalar perturbations, yielding precise posteriors that confirm the flatness of the universe and the presence of dark matter at approximately 26% of the energy density. These inferences integrate likelihoods from temperature and polarization anisotropies, incorporating priors informed by previous missions like WMAP, to quantify uncertainties and tensions, such as the Hubble constant discrepancy.

Bayesian model comparison has been instrumental in evaluating hypotheses about dark matter, such as comparing cold dark matter (CDM) profiles against cored or warm dark matter (WDM) alternatives using kinematic data. For instance, analyses of dwarf spheroidal satellites like Fornax and Sculptor applied Bayesian evidence calculations to assess Navarro-Frenk-White (NFW) cuspy profiles versus Burkert cored models, finding strong preference for cored profiles in some systems due to the Occam penalty favoring simpler fits to rotation curves and stellar kinematics. In broader cosmological contexts, such comparisons extend to WDM models constrained by Lyman-α forest data, where Bayesian evidence disfavors pure WDM over CDM but allows mixed scenarios to alleviate small-scale structure issues.

Hierarchical Bayesian modeling enhances inference from large galaxy surveys by accounting for population-level variations and selection effects. In surveys like the Baryon Oscillation Spectroscopic Survey (BOSS), hierarchical approaches model galaxy clustering and redshift distributions, treating individual galaxy redshifts as draws from a shared cosmological power spectrum while marginalizing over astrophysical nuisance parameters like galaxy bias.
This framework propagates uncertainties through the full analysis pipeline, enabling robust constraints on parameters like the growth rate of structure, and has been adapted for forward modeling in upcoming surveys to forecast achievable parameter constraints. Recent advancements leverage Bayesian methods for James Webb Space Telescope (JWST) data, enabling inference on high-redshift galaxy properties and early universe cosmology. Post-2022 analyses of JWST's NIRCam and NIRSpec observations use simulation-based Bayesian inference to fit spectral energy distributions of galaxies at z > 10, constraining star formation histories and ionizing-photon escape fractions while incorporating JWST-specific systematics like point-spread function variations. These efforts challenge ΛCDM by probing reionization-era galaxy formation, with hierarchical models integrating JWST photometry to infer global parameters like the ionizing photon budget.

In gravitational-wave astronomy, Bayesian inference underpins signal detection and parameter estimation by the LIGO and Virgo collaborations. For events like GW150914, nested sampling algorithms compute posteriors on source masses, spins, and sky locations by comparing waveform models against detector noise, achieving sub-percent precision on chirp masses through marginalization over calibration errors. Hierarchical extensions further infer population properties, such as merger rates, from multiple detections, informing astrophysical models of compact binary formation. As datasets grow, asymptotic approximations facilitate efficient inference on large-scale catalogs.
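At their core, these MCMC analyses reduce to drawing samples from an unnormalized posterior. The sketch below is a deliberately simplified random-walk Metropolis sampler for a single parameter (a normal mean under a wide normal prior), using made-up data rather than any survey likelihood:

```python
import math
import random

def log_post(theta, data, prior_sd=10.0):
    """Unnormalized log posterior: N(theta, 1) likelihood, N(0, prior_sd^2) prior."""
    ll = -0.5 * sum((x - theta) ** 2 for x in data)
    lp = -0.5 * (theta / prior_sd) ** 2
    return ll + lp

def metropolis(data, n_steps=20000, step=0.5, seed=1):
    """Random-walk Metropolis: propose, then accept with prob min(1, ratio)."""
    random.seed(seed)
    theta, samples = 0.0, []
    lp = log_post(theta, data)
    for _ in range(n_steps):
        prop = theta + random.gauss(0.0, step)
        lp_prop = log_post(prop, data)
        if random.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop          # accept the proposal
        samples.append(theta)
    return samples[n_steps // 2:]              # discard burn-in

data = [1.2, 0.8, 1.1, 0.9, 1.0]   # hypothetical observations
draws = metropolis(data)
post_mean = sum(draws) / len(draws)  # posterior mean, near the data mean
```

Real cosmological pipelines replace `log_post` with expensive survey likelihoods and use far more sophisticated proposals, but the accept/reject logic is the same.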

Philosophical and Historical Context

Bayesian Epistemology

Bayesian epistemology posits that rational degrees of belief, or credences, must satisfy the axioms of probability to ensure coherence among one's opinions. This theory requires that credences be non-negative, sum to one over complementary propositions, and be additive for disjoint events, thereby avoiding internal inconsistencies in belief structures. Probabilism, as this norm is known, serves as a foundational requirement of rationality, dictating that beliefs ought to cohere probabilistically on pain of incoherence.

Dutch book arguments provide a pragmatic justification for treating subjective probabilities as coherent degrees of belief, demonstrating that violations of the probability axioms expose an agent to guaranteed losses in fair betting scenarios. A Dutch book consists of a set of bets that appear individually acceptable based on the agent's credences but collectively result in a sure loss, such as assigning a credence greater than 1 to an event or failing additivity for mutually exclusive outcomes. These arguments, rooted in the idea that rational agents avoid sure losses, compel subjective probabilities to align with probabilistic coherence, though critics note that agents might rationally decline certain bets or that incoherence does not always lead to exploitation.

In Bayesian confirmation theory, evidence confirms a hypothesis if it increases the agent's credence in that hypothesis upon updating beliefs, while disconfirmation occurs if the credence decreases. Specifically, evidence e confirms hypothesis h when the posterior probability P(h|e) exceeds the prior probability P(h), often measured by the Bayesian multiplier \frac{P(e|h)}{P(e)} > 1, where P(e|h) is the likelihood and P(e) the marginal probability of the evidence. This framework quantifies evidential support through ratios or differences in probabilities, allowing hypotheses to be incrementally strengthened or weakened by data, such as a black raven observation mildly confirming the hypothesis that all ravens are black under uniform priors.
Updating beliefs via conditionalization preserves these confirmation relations, ensuring that new evidence coherently revises the probability distribution.

Critiques of Bayesian epistemology often center on the tension between subjective and objective variants. Subjective Bayesianism permits any coherent credence assignment, emphasizing personal degrees of belief without further constraints, which allows for diverse but potentially biased inferences. In contrast, objective Bayesianism imposes additional norms, such as the principle of indifference or maximum entropy priors, to derive unbiased probabilities from available evidence, aiming for intersubjective agreement and scientific objectivity. Detractors of subjective Bayesianism argue it leads to practical inconsistencies, like marginalization paradoxes, and relies on unverifiable personal priors, while objective approaches face challenges like Bertrand's paradox in uniform prior selection, potentially undermining claims to objectivity.
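The confirmation condition in this section is simple arithmetic; with hypothetical numbers, Bayes' theorem and the law of total probability verify that evidence raises the credence exactly when the multiplier exceeds 1:

```python
# Toy confirmation check: evidence e confirms h when P(h|e) > P(h),
# equivalently when the Bayesian multiplier P(e|h)/P(e) exceeds 1.
p_h = 0.3              # prior credence in hypothesis h (hypothetical)
p_e_given_h = 0.9      # likelihood of the evidence under h
p_e_given_not_h = 0.2  # likelihood of the evidence under not-h
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)  # total probability
p_h_given_e = p_e_given_h * p_h / p_e                  # Bayes' theorem
multiplier = p_e_given_h / p_e
# p_e = 0.27 + 0.14 = 0.41; p_h_given_e = 0.27/0.41 ~ 0.659 > 0.3, so e confirms h
```

Note that the multiplier equals the ratio P(h|e)/P(h), so "multiplier greater than 1" and "posterior exceeds prior" are the same condition stated two ways.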

Historical Development

The foundations of Bayesian inference trace back to the posthumous publication in 1763 of Thomas Bayes's essay "An Essay towards solving a Problem in the Doctrine of Chances," which introduced a method for inverting conditional probabilities that forms the basis of what is now known as Bayes's theorem. This work, communicated by Richard Price after Bayes's death, laid the groundwork for updating beliefs in light of new evidence, though it remained relatively obscure for decades. In 1812, Pierre-Simon Laplace independently developed a similar framework in his Théorie Analytique des Probabilités, where he explicitly formulated the rule for inverse probabilities and applied it to problems in astronomy and physics, effectively popularizing the approach without reference to Bayes. Laplace's contributions emphasized the theorem's utility in scientific inference, marking an early expansion of its scope beyond Bayes's initial probabilistic formulation.

Amid the dominance of frequentist statistics in the early 20th century, Bayesian methods were defended in debates over statistical inference, as seen in Harold Jeffreys' 1939 publication of Theory of Probability, which advocated objective priors to resolve issues of subjectivity in Bayesian analysis and applied the approach to geophysical problems, thereby defending its role in scientific hypothesis testing. This work countered criticisms by proposing reference priors that minimized prior influence on posteriors. Complementing this, Leonard J. Savage's 1954 book The Foundations of Statistics provided a subjective decision-theoretic foundation, axiomatizing personal probability and utility within a Bayesian framework, which unified utility and belief updating. Savage's axioms demonstrated how coherent behavior implies Bayesian updating, influencing decision-theoretic applications.

Bayesian inference experienced a major revival in the 1990s, driven by computational advances that addressed longstanding integration challenges in posterior estimation. The 1990 paper by Alan E. Gelfand and Adrian F. M. Smith introduced sampling-based methods using Markov chain Monte Carlo (MCMC) to approximate marginal densities, enabling practical Bayesian analysis for complex models. This innovation, particularly Gibbs sampling variants, facilitated the method's widespread adoption in statistics and beyond, marking the shift from theoretical foundations to computationally feasible inference.

Thomas Bayes and Beyond

Thomas Bayes (1701–1761), an English mathematician and Presbyterian minister, developed the foundational concept of inverse probability, which allows updating beliefs about causes based on observed effects. His key contribution, detailed in an unpublished essay discovered after his death, addressed the probability of causes from known effects, providing the mathematical framework now recognized as Bayes' theorem. This work, edited and published posthumously by Richard Price in 1763 as "An Essay towards solving a Problem in the Doctrine of Chances," introduced a uniform prior distribution for unknown probabilities and proposed a method using imaginary outcomes to approximate posterior distributions, though it remained largely overlooked for decades.

Pierre-Simon Laplace (1749–1827), a prominent French mathematician and astronomer, independently derived and generalized the theorem in the late 18th century, framing it as the principle of inverse probability for reasoning from effects to causes. In his 1774 memoir "Mémoire sur la probabilité des causes par les événements," Laplace applied this principle to estimate probabilities in astronomical observations, such as planetary perturbations, demonstrating its utility in scientific contexts. He further expanded its applications in his 1812 treatise Théorie Analytique des Probabilités, integrating the approach with error analysis, which helped establish inverse probability as a tool for empirical inference across disciplines.

Harold Jeffreys (1891–1989), a British mathematician, statistician, and geophysicist, played a pivotal role in reviving Bayesian methods during the early 20th century amid the rise of frequentist approaches. In his influential book Theory of Probability (1939), first published by Oxford University Press, Jeffreys articulated a systematic theory of scientific inference grounded in Bayesian principles, emphasizing the use of probability for hypothesis testing and parameter estimation.
He proposed objective prior distributions, such as the Jeffreys prior, which is invariant under reparameterization, and applied Bayesian techniques to geophysical problems like earthquake seismology, arguing that the Bayesian approach provided a more coherent basis for scientific inference than purely likelihood-based methods. The book, revised in multiple editions through 1961, became a cornerstone for Bayesian advocates in scientific fields.

Following the Second World War, Bayesian inference gained renewed momentum through key figures who advanced its theoretical foundations and computational feasibility. Dennis Lindley (1923–2013), a British statistician, became a leading proponent of Bayesian decision theory, integrating utility maximization with probability updating to guide rational choice under uncertainty; he co-authored seminal works on Bayesian statistics and championed the Valencia International Meetings on Bayesian Statistics, begun in 1979, to foster global collaboration. Bruno de Finetti (1906–1985), an Italian probabilist, formalized subjective probability as degrees of belief coherent under Dutch book arguments, rejecting objective frequencies in favor of personal probabilities updated via Bayes' rule, as detailed in his multi-volume Teoria delle Probabilità (1974–1975).

In the computational era, Radford Neal advanced practical Bayesian analysis by developing Markov chain Monte Carlo (MCMC) methods, particularly in his 1993 technical report "Probabilistic Inference Using Markov Chain Monte Carlo Methods," which demonstrated efficient sampling from complex posterior distributions, enabling applications in machine learning and neural networks. These contributions transformed Bayesian thought from philosophical abstraction to a computationally viable methodology for modern statistics.

References

  1. [1]
    A Gentle Introduction to Bayesian Analysis - PubMed Central - NIH
    The Bayesian framework offers a more direct expression of uncertainty, including complete ignorance. A major difference between frequentist and Bayesian methods ...
  2. [2]
    [PDF] The Development of Bayesian Statistics - Columbia University
    Jan 13, 2022 · Bayes' theorem is a mathematical identity of conditional probability, and applied Bayesian inference dates back to Laplace in the late 1700s, so ...
  3. [3]
    Bayesian inference: more than Bayes's theorem - Frontiers
    Bayesian inference gets its name from Bayes's theorem, expressing posterior probabilities for hypotheses about a data generating process as the (normalized) ...
  4. [4]
    Bayes' theorem | The Book of Statistical Proofs
    Sep 27, 2019 · Proof: Bayes' theorem​​ p(A|B)=p(B|A)p(A)p(B). (1) Proof: The conditional probability is defined as the ratio of joint probability, i.e. the ...
  5. [5]
    Bayes' Theorem: What It Is, Formula, and Examples - Investopedia
    Deriving the Bayes' Theorem Formula. Bayes' Theorem follows from the axioms of conditional probability, which is the probability of an event given that another ...
  6. [6]
    6. Odds and Addends — Think Bayes
    odds ( A | D ) = odds ( A ) P ( D | A ) P ( D | B ). This is Bayes's Rule, which says that the posterior odds are the prior odds times the likelihood ratio.Missing: explanation | Show results with:explanation
  7. [7]
    Bayes's Theorem for Calculating Inverse Probabilities
    Apr 7, 2014 · Ten years later, the Frenchman Pierre Simon Laplace independently discovered the rules of inverse probability, and although he later learned ...
  8. [8]
    [PDF] Probability Theory: The Logic of Science
    PROBABILITY THEORY – THE LOGIC OF SCIENCE. VOLUME I – PRINCIPLES AND ELEMENTARY APPLICATIONS. Chapter 1 Plausible Reasoning. 1. Deductive and Plausible ...
  9. [9]
    [PDF] Harold Jeffreys's default Bayes factor hypothesis tests
    Aug 28, 2015 · With equal prior odds, the posterior probability for M0 remains an arguably non-negligible 17%. For nested models, the Bayes factor can be ...
  10. [10]
    [PDF] Logical Foundations of Probability
    PREFACE. The purpose of this work. This book presents a new approach to the old problem of induction and probability. The theory here developed is.
  11. [11]
    [PDF] The Bayesian Approach to Statistics. - DTIC
    Cromwell's rule would have avoided the difficulties. A realistic position seems to be that a coherent view must not assign density zero to any possibility.
  12. [12]
    [PDF] Chapter 12 Bayesian Inference - Statistics & Data Science
    The posterior mean can be viewed as smoothing out the maximum likelihood estimate by allocating some additional probability mass to low frequency observations.Missing: seminal | Show results with:seminal
  13. [13]
    [PDF] Bayesian models of perception: a tutorial introduction
    Bayesian inference for discrete hypotheses. The simplest type of Bayesian inference involves a finite number of distinct hypothe- ses H1 ...Hn, each of which ...
  14. [14]
    [PDF] Lecture 16: Bayesian inference - MS&E 226: “Small” Data
    The posterior is the distribution of the parameter, given the data. Bayes' rule says: posterior ∝ likelihood × prior. Here “∝” means “proportional to”; the ...
  15. [15]
    Lecture 2 - CSCI S-80
    Applying Bayes' rule, we compute (0.1)(0.8)/(0.4) = 0.2. That is, the probability that it rains in the afternoon given that it was cloudy in the morning is 20%.Missing: updating | Show results with:updating
  16. [16]
    [PDF] Bayesian Data Analysis Third edition (with errors fixed as of 20 ...
    This book is intended to have three roles and to serve three associated audiences: an introductory text on Bayesian inference starting from first principles, a ...
  17. [17]
    Bayesian inference | Introduction with explained examples - StatLect
    Bayesian inference uses subjective probabilities to assign prior distributions, then updates them to posterior distributions after observing data.
  18. [18]
    [PDF] Sequential Bayesian Updating - Oxford statistics department
    for a Bayesian updating scheme posterior ∝ prior × likelihood with revised ∝ current × new likelihood represented by the formula πn+1(θ) ∝ πn(θ) × Ln+1(θ) ...
  19. [19]
    [PDF] 2 Sequential Bayesian updating for Big Data - UC Irvine
    We introduce sequential Bayesian updating as a tool to mine these three core properties. In the Bayesian approach, we summarize the current state of knowledge ...
  20. [20]
    Chapter 4 Balance and Sequentiality in Bayesian Analyses
    In a sequential Bayesian analysis, a posterior model is updated incrementally as more data come in. With each new piece of data, the previous posterior model ...
  21. [21]
    Whose Judgement? Reflections on Elicitation in Bayesian Analysis
    Apr 18, 2024 · Subjective probabilities need eliciting either in their entirety or partially via prior distributions that are updated in the light of data ...
  22. [22]
    [PDF] APTS: Statistical Inference - University of Warwick
    Thus, the model specifies the sample space X of the quantity to be observed X, the parameter space Θ, and a family of distributions, F say, where fX(x | θ) is ...<|control11|><|separator|>
  23. [23]
    [PDF] Bayes Methods - SC7 lecture notes, HT25
    The probability density or mass function π(θ) determines a probability distribution when taken ... Dirac delta-function giving the density of a.
  24. [24]
    LII. An essay towards solving a problem in the doctrine of chances ...
    An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, FRS communicated by Mr. Price, in a letter to John Canton, AMFR S.
  25. [25]
  26. [26]
    [PDF] CS 287 Lecture 11 (Fall 2019) Probability Review, Bayes Filters ...
    ▫ Law of Total Probability. ▫ Conditioning (Bayes' rule). Disclaimer: lots ... Marginalization: p(x) = ? We integrate out over y to find the marginal ...
  27. [27]
    Conjugate Priors for Exponential Families - Project Euclid
    March, 1979 Conjugate Priors for Exponential Families. Persi Diaconis, Donald Ylvisaker · DOWNLOAD PDF + SAVE TO MY LIBRARY. Ann. Statist. 7(2): 269-281 (March, ...
  28. [28]
    [PDF] 18.05 S22 Reading 15: Conjugate priors: Beta and normal
    Our main goal here is to introduce the idea of conjugate priors and look at some specific conjugate pairs. These simplify the job of Bayesian updating to ...Missing: inference | Show results with:inference
  29. [29]
    [PDF] Conjugate Bayesian analysis of the Gaussian distribution
    Oct 3, 2007 · The use of conjugate priors allows all the results to be derived in closed form. Unfortunately, different books use different conventions on how ...
  30. [30]
    [PDF] Chapter 9 The exponential family: Conjugate priors - People @EECS
    For exponential families the likelihood is a simple standarized function of the parameter and we can define conjugate priors by mimicking the form of the ...
  31. [31]
    The Bernstein-Von-Mises theorem under misspecification
    Abstract: We prove that the posterior distribution of a parameter in mis- specified LAN parametric models can be approximated by a random normal.
  32. [32]
    Statistical Decision Theory and Bayesian Analysis - SpringerLink
    Book Title · Statistical Decision Theory and Bayesian Analysis ; Authors · James O. Berger ; Series Title · Springer Series in Statistics ; DOI · https://doi.org/ ...
  33. [33]
    Bayesian Theory | Wiley Series in Probability and Statistics
    May 3, 1994 · Bayesian Theory ; Author(s):. José M. Bernardo, Adrian F. M. Smith, ; First published · May 1994 ; Print ISBN:9780471494645 | ; Online ISBN: ...
  34. [34]
    [PDF] Interval Estimation - Arizona Math
    A Bayesian interval estimate is called a credible interval. Recall that for the Bayesian approach to statistics, both the data and the parameter are random ...Missing: interpretation | Show results with:interpretation
  35. [35]
    [PDF] Bayesian Inference: Posterior Intervals
    HPD Intervals / Regions. ▻ The equal-tail credible interval approach is ideal when the posterior distribution is symmetric. ▻ But what if π(θ|x) is skewed ...
  36. [36]
    Bayes Factors: Journal of the American Statistical Association
    Journal of the American Statistical Association Volume 90, 1995 - Issue 430 ... Citation & references. Download citations. Information for. Authors · R&D ...
  37. [37]
    Theory of Probability - Harold Jeffreys - Oxford University Press
    Jeffreys' Theory of Probability, first published in 1939, was the first attempt to develop a fundamental theory of scientific inference based on Bayesian ...
  38. [38]
    Rejecting or Accepting Parameter Values in Bayesian Estimation
    May 8, 2018 · This range is called the region of practical equivalence (ROPE). The decision rule, which I refer to as the HDI+ROPE decision rule, is ...Bayesian Parameter... · More About The Rope · Specifying Rope Limits
  39. [39]
    The Importance of Prior Sensitivity Analysis in Bayesian Statistics
    In this paper, we discuss the importance of examining prior distributions through a sensitivity analysis. We argue that conducting a prior sensitivity ...What Is a Sensitivity Analysis... · Proof of Concept Simulation... · Conclusion
  40. [40]
    [PDF] Applied Statistical Decision Theory - Gwern
    In the field of statistical decision theory Professors Raiffa and Schlaifer have sought to develop new analytical tech niques by which the modern theory of ...
  41. [41]
    Bayes' rule in diagnosis - ScienceDirect.com
    Common properties of the quality of diagnostic tests include sensitivity and specificity. Sensitivity refers to the probability of a true-positive test result ...Missing: seminal paper
  42. [42]
    Interpretation by Physicians of Clinical Laboratory Results
    Authors: Ward Casscells, B.S., Arno Schoenberger, M.D., and Thomas B. Graboys, M.D.Author Info & Affiliations. Published November 2, 1978.
  43. [43]
    [PDF] Judgment under Uncertainty: Heuristics and Biases Author(s)
    Biases in judgments reveal some heuristics of thinking under uncertainty. Amos Tversky and Daniel Kahneman. The authors are members of the department of.
  44. [44]
    [PDF] Conjugate Bayesian analysis of the Gaussian distribution
    Oct 3, 2007 · Conjugate Bayesian analysis of Gaussian distribution uses conjugate priors, allowing closed-form results. A natural conjugate prior has the ...Missing: seminal | Show results with:seminal
  45. [45]
  46. [46]
    Bayes or not Bayes, is this the question? - PMC - NIH
    Frequentist statistics never uses or calculates the probability of the hypothesis, while Bayesian uses probabilities of data and probabilities of both ...
  47. [47]
    [PDF] Bayes Factors - Robert E. Kass; Adrian E. Raftery
    Oct 14, 2003 · These help connect hypothesis testing with model selection and introduce several problems that Bayesian methodology can solve, including the ...
  48. [48]
    [PDF] Estimating the Dimension of a Model Gideon Schwarz The Annals of ...
    Apr 5, 2007 · The problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading ...
  49. [49]
    [PDF] A New Look at the Statistical Model Identification
    AIC and numerical comparisons of JLUCE with other procedures in various specific applications will be the subjects of further stud>-. VIII. COWLUSION. The ...
  50. [50]
    [PDF] MCMC-Based Inference in the Era of Big Data - arXiv
    Aug 28, 2015 · Abstract. Markov chain Monte Carlo (MCMC) lies at the core of modern Bayesian methodol- ogy, much of which would be impossible without it.
  51. [51]
    [PDF] metropolis-et-al-1953.pdf - aliquote.org
    The paper describes a method using modified Monte Carlo integration for calculating properties of interacting molecules, using classical statistics and two- ...
  52. [52]
    [PDF] Monte Carlo sampling methods using Markov chains and their ...
    Biometrika (1970), 57, 1, p. 97. 9 7. Printed in Great Britain. Monte Carlo sampling methods using Markov chains and their applications. BY W. K. HASTINGS.
  53. [53]
    [PDF] Stochastic Relaxation, Gibbs Distributions, and the Bayesian ...
    GEMAN AND GEMAN: STOCHASTIC RELAXATION, GIBBS DISTRIBUTIONS, AND BAYESIAN RESTORATION.
  54. [54]
    [PDF] Inference from Iterative Simulation Using Multiple Sequences
    457. Page 2. 458. A. GELMAN AND D. B. RUBIN. Our focus is on Bayesian posterior distributions aris- ing from relatively complicated practical models, often with ...
  55. [55]
    [PDF] An Introduction to Variational Methods for Graphical Models
    This paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (Bayesian networks and Markov ...
  56. [56]
    [PDF] Variational Inference: A Review for Statisticians - arXiv
    May 9, 2018 · ... (Jordan et al., 1999; Wainwright and Jordan, 2008). Variational inference is widely used to approximate posterior densities for Bayesian models,.
  57. [57]
    [PDF] Stochastic Variational Inference
    For good reviews of variational inference see Jordan et al. (1999) and Wainwright and Jordan (2008). In this paper, we develop scalable methods for generic ...
  58. [58]
    [PDF] JAGS: A Program for Analysis of Bayesian Graphical Models Using ...
    BUGS (Bayesian inference Using Gibbs Sampling) is a program for analyzing Bayesian graphical models via Markov Chain Monte Carlo (MCMC) simulation ...
  59. [59]
    PyMC: a modern, and comprehensive probabilistic programming ...
    Sep 1, 2023 · PyMC is a probabilistic programming library for Python that provides tools for constructing and fitting Bayesian models.
  60. [60]
    [PDF] Stan: A Probabilistic Programming Language - Columbia University
    Abstract. Stan is a probabilistic programming language for specifying statistical models. A Stan program imperatively defines a log probability function ...Missing: original | Show results with:original
  61. [61]
    [PDF] Past, Present, and Future of Software for Bayesian Inference
    This review aims to summarize the most popular software and provide a useful map for a reader to navigate the world of Bayesian computation. We anticipate a ...
  62. [62]
    [1810.09538] Pyro: Deep Universal Probabilistic Programming - arXiv
    Oct 18, 2018 · Pyro is a probabilistic programming language built on Python as a platform for developing advanced probabilistic models in AI research.Missing: original | Show results with:original
  63. [63]
    Pyro
    Pyro enables flexible and expressive deep probabilistic modeling, unifying the best of modern deep learning and Bayesian modeling. It was designed with these ...Tutorials, How-to Guides and... · Pyro Discussion Forum · DocumentationMissing: integration 2025
  64. [64]
    [PDF] Bayesian Methods for Neural Networks - UBC Computer Science
    Bayesian learning for neural networks creates a flexible framework for regression, density estimation, prediction, and classification, using probabilities to ...
  65. [65]
    [PDF] Gaussian Processes for Machine Learning
    ... Gaussian processes in regression and classification tasks. They also show how Gaussian processes can be interpreted as a Bayesian version of the well-known ...
  66. [66]
    Bayesian Active Learning for Classification and Preference Learning
    Dec 24, 2011 · Bayesian Active Learning for Classification and Preference Learning. Authors:Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, Máté Lengyel.
  67. [67]
    Efficient Global Optimization of Expensive Black-Box Functions
    Jones, D.R., Schonlau, M. & Welch, W.J. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization 13, 455–492 (1998).
  68. [68]
    Adaptive Designs for Clinical Trials | New England Journal of Medicine
    Jul 7, 2016 · Adaptive trial design has been proposed as a means to increase the efficiency of randomized clinical trials, potentially benefiting trial participants and ...Missing: seminal | Show results with:seminal
  69. [69]
  70. [70]
    Bayesian SIR model with change points with application to ... - Nature
    Dec 2, 2022 · The multi-wave SIR model proposed by Ghosh and Ghosh10 allows researchers to investigate the nonperiodicity of COVID-19 pandemic waves, while ...
  71. [71]
    [1807.06209] Planck 2018 results. VI. Cosmological parameters - arXiv
    Jul 17, 2018 · Abstract: We present cosmological parameter results from the final full-mission Planck measurements of the CMB anisotropies.
  72. [72]
    Model comparison of the dark matter profiles of Fornax, Sculptor ...
    Apr 10, 2013 · Our goal is to compare dark matter profile models of these four systems using Bayesian evidence. We consider NFW, Einasto and several cored profiles for their ...
  73. [73]
    Bayesian epistemology - Stanford Encyclopedia of Philosophy
    Jun 13, 2022 · Probabilism is often regarded as a coherence norm, which says how one's opinions ought to fit together on pain of incoherence. So, if ...
  74. [74]
    Dutch Book Arguments - Stanford Encyclopedia of Philosophy
    Jun 15, 2011 · The Dutch Book argument (DBA) for probabilism (namely the view that an agent's degrees of belief should satisfy the axioms of probability).
  75. [75]
    [PDF] Notes on Bayesian Confirmation Theory - Michael Strevens
    The degree to which a piece of evidence confirms a hypothesis can be quantified in various ways using the Bayesian framework of subjective probabilities. The ...
  76. [76]
    The Case for Objective Bayesian Analysis - Project Euclid
    Note that, in practice, I view both objective Bayesian analysis and subjective Bayesian analysis to be indispensable, and to be complementary parts of the ...
  77. [77]
    [PDF] LII. An Essay towards solving a Problem in the Doctrine of Chances ...
    Mr. Bayes has thought fit to begin his work with a brief demonstration of the general laws of chance. His reason for doing this, as he says ...
  78. [78]
    [PDF] BERNOULLI, BAYES, AND LAPLACE 2.3 FREQUENTIST ...
    Bayesian probability theory offers unique and demonstrably optimal solutions to well-posed statistical problems, and is historically the original approach to ...
  79. [79]
    Theory Of Probability : Jeffreys Harold - Internet Archive
    Jan 17, 2017 · Theory Of Probability. by: Jeffreys Harold. Publication date: 1948. Topics: C-DAC. Collection: digitallibraryindia; JaiGyan. Language: English.
  80. [80]
    [PDF] Harold Jeffreys's Theory of Probability Revisited - arXiv
    The posterior probabilities of the hypotheses are proportional to the products of the prior probabilities and the likelihoods. H. Jeffreys, Theory of ...
  81. [81]
    [PDF] The Foundations of Statistics (Second Revised Edition)
    They are in full harmony with the ideas in this book but are more down to earth and less spellbound by tradition. L. J. SAVAGE. Yale University. June, 1971.
  82. [82]
    [PDF] Foundations - of Statistics
    LEONARD J. SAVAGE. Associate Professor of Statistics. University of Chicago. New York John Wiley & Sons, Inc. London. Chapman & Hall, Limited. Page 2. COPYRIGHT ...
  83. [83]
    Sampling-Based Approaches to Calculating Marginal Densities
    In particular, the relevance of the approaches to calculating Bayesian posterior densities for a variety of structured models will be discussed and illustrated.
  84. [84]
    [PDF] Sampling-Based Approaches to Marginal Densities
    Nov 17, 2007 · Gelfand; Adrian F. M. Smith. Journal of the American Statistical Association, Vol. 85, No. 410 (Jun., 1990), pp. 398-409 ...
  85. [85]
    When Did Bayesian Inference Become “Bayesian”? - Project Euclid
    While Bayes' theorem has a 250-year history, and the method of inverse probability that flowed from it dominated statistical thinking into the twentieth ...
  86. [86]
    Thomas Bayes (1701-1761)
    The idea of inverse probability is manifested today in both Bayesian statistics and the more common Bayes' Rule , and it is possible that Bayes never published ...
  87. [87]
    Thomas Bayes's Bayesian Inference - jstor
    published posthumously in 1764, by virtue of the efforts of Richard Price, Bayes's intellectual executor. The theorem it presented, though ignored by all ...
  88. [88]
    Laplace's 1774 Memoir on Inverse Probability - jstor
    Abstract. Laplace's first major article on mathematical statistics was published in 1774. It is arguably the most influential article in this field to ...
  89. [89]
    Obituary: Dennis V. Lindley 1923–2013
    May 15, 2014 · He was a founder of the regular Valencia International Meetings on Bayesian statistics, in which he participated enthusiastically for many ...
  90. [90]
    De Finetti's Contribution to Probability and Statistics - Project Euclid
    Abstract. This paper summarizes the scientific activity of de Finetti in probability and statistics. It falls into three sections: Section 1 includes.
  91. [91]
    [PDF] Probabilistic Inference Using Markov Chain Monte Carlo Methods
    Sep 25, 1993 · Probabilistic Inference Using Markov Chain Monte Carlo Methods. Radford M. Neal ... Bayesian learning for neural networks.