Prior probability
In Bayesian statistics, the prior probability, often referred to simply as the prior, is the probability distribution assigned to an uncertain quantity or hypothesis before any relevant evidence is taken into account.[1] It represents the initial degree of belief in a parameter or event, derived from existing knowledge, expert opinion, or assumptions of ignorance.[2] This distribution serves as the starting point for inference and is updated through Bayes' theorem by incorporating observed data to yield the posterior probability.[3]

The concept of prior probability originated in the work of Thomas Bayes, an English mathematician and Presbyterian minister, whose "An Essay towards solving a Problem in the Doctrine of Chances" was published posthumously in 1763.[3] Bayes' ideas were developed further by Pierre-Simon Laplace in the late 18th century, who applied them to inverse probability problems using uniform priors and conjugate models, such as the beta-binomial for proportions.[4] By the 19th century the approach was prominent among probabilists, but criticism of the perceived subjectivity in selecting priors led to a decline in favor of frequentist methods in the early 20th century.[3] The modern Bayesian revival began mid-century, driven by figures such as Bruno de Finetti, Harold Jeffreys, Leonard J. Savage, and Dennis Lindley, who emphasized subjective probability and decision theory; computational advances in the late 1980s and 1990s, including Markov chain Monte Carlo methods, made prior-based inference practical for complex models.[4]

Priors can be classified as informative, drawing on historical data or domain expertise to encode specific beliefs (e.g., a beta distribution centered around an expected success rate), or non-informative, such as Jeffreys priors, which reflect minimal prior knowledge by being invariant to reparameterization and often yielding flat or weakly informative distributions.[5] Conjugate priors, like the beta for binomial likelihoods, are particularly useful because they yield posteriors from the same family, simplifying analytical computations.[4] The choice and impact of priors remain central to Bayesian practice, balancing prior beliefs with data to quantify uncertainty, though debates persist over their subjectivity and sensitivity in analyses.[1]

Fundamentals
Definition and Interpretation
In Bayesian statistics, the prior probability refers to the probability distribution assigned to an unknown parameter θ or hypothesis before observing any data, denoted p(θ), which encapsulates the initial state of knowledge or belief about θ.[6] This distribution serves as the starting point for inference, representing the uncertainty or information available prior to data collection.[7]

The interpretation of prior probabilities can be subjective or objective. In the subjective view, priors reflect the personal beliefs or expert knowledge of the analyst, allowing relevant prior information to be incorporated into the analysis.[8] Conversely, the objective approach seeks priors that are minimally informative or derived from formal principles, to ensure reproducibility and freedom from personal bias, as discussed in foundational works on Bayesian methodology.[9] Historically, Pierre-Simon Laplace articulated an early objective perspective through the principle of insufficient reason (also known as the principle of indifference), stated in his 1812 treatise, which holds that in the absence of information favoring one outcome over another, equal probabilities should be assigned to all possibilities.[10]

For continuous parameters, the prior is typically expressed as a probability density function p(θ) that integrates to 1 over the parameter space, while for discrete cases it is a probability mass function specifying a probability for each possible value.[11] A simple example illustrates this: consider estimating the bias θ (probability of heads) of a coin before any flips are observed; a Beta(1,1) prior, which is uniform over [0,1], represents complete ignorance about θ by assigning equal density to all values.[12]

Role in Bayesian Inference
In Bayesian inference, the prior probability plays a central role: it is the initial distribution over possible parameter values or hypotheses before the data are observed, and it is updated through Bayes' theorem to form the posterior distribution.[13] Bayes' theorem states that the posterior density is proportional to the product of the likelihood and the prior, p(\theta \mid y) \propto p(y \mid \theta) \cdot p(\theta), where \theta represents the parameters, y the observed data, p(\theta) the prior, and p(y \mid \theta) the likelihood; this multiplicative structure means the prior directly weights the likelihood in yielding updated beliefs about \theta.[14] The full posterior is obtained by normalizing this product, p(\theta \mid y) = \frac{p(y \mid \theta) \cdot p(\theta)}{p(y)}, with the evidence p(y) = \int p(y \mid \theta) p(\theta) \, d\theta acting as the marginal likelihood that ensures the posterior integrates to 1, thereby quantifying the total probability of the data under the model and facilitating model comparison.[15]

The posterior distribution encapsulates the updated beliefs, combining the prior's information with the data's evidential content via the likelihood. For predictive purposes, one can marginalize over \theta to obtain the predictive distribution p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) p(\theta \mid y) \, d\theta, where \tilde{y} denotes future observations.[14] This integration shows how the prior shapes not only parameter inference but also forecasts, by propagating initial uncertainties through the model.[15] Conceptually, Bayesian updating follows a sequential flow: begin with the prior p(\theta) encoding pre-data knowledge, incorporate the likelihood p(y \mid \theta) to reflect the data's compatibility with each parameter value, and arrive at the posterior p(\theta \mid y) as the synthesis, with the evidence serving as the normalizing constant.[16]

A simple discrete example illustrates this in disease testing: suppose the prior probability of having a rare disease is 0.01 (1% prevalence), and a test with 99% sensitivity (true positive rate) and 99% specificity (true negative rate) yields a positive result. The likelihood of a positive test given the disease is 0.99, and given no disease is 0.01 (the false positive rate). Applying Bayes' theorem, the posterior probability of having the disease is exactly 0.50, since p(\text{disease} \mid +) = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = 0.50, demonstrating how the low prior tempers the test's evidential strength and guards against overconfidence.[17]

Informative Priors
Strong Priors
Strong priors, also known as highly informative priors, are probability distributions concentrated around specific values, typically with low variance (high precision), which allows them to dominate the posterior distribution, particularly when data are sparse. For instance, a normal prior N(\mu_0, \sigma^2) with a small \sigma^2 places substantial weight near \mu_0, effectively constraining the posterior mean toward this value even with limited observations. This concentration reflects strong expert beliefs or accumulated evidence from prior studies, enabling the prior to act as a robust anchor in Bayesian updating.[18]

The primary advantage of strong priors lies in small-sample studies or settings where reliable domain knowledge is available, as they incorporate substantive information to improve estimation precision and reduce overfitting. In clinical trials, for example, historical data from previous experiments can inform a strong prior on treatment effects, allowing efficient borrowing of information to enhance power without requiring large new samples. This approach is particularly beneficial in pharmaceutical research, where a strong prior derived from earlier studies of drug efficacy (such as a normal prior centered on an expected response rate from Phase II trials) can shift the posterior toward the prior mean when new Phase III data are limited, leading to more stable inferences about efficacy.[19]

However, strong priors carry the disadvantage of introducing bias if misspecified, as their dominant influence can pull the posterior away from the true parameter value and potentially lead to misleading conclusions. Sensitivity analyses are essential to assess how posterior inferences change under prior perturbations, highlighting the need for careful validation against domain expertise.
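The pull of a strong prior can be seen in a minimal sketch of the conjugate normal-normal update; the prior variances, observation variance, and data below are illustrative, not drawn from any study.

```python
# Minimal sketch (illustrative numbers): conjugate normal-normal update for
# a mean with known observation variance, comparing a strong (low-variance)
# prior against a weak one on the same three data points.

def normal_posterior(prior_mean, prior_var, obs, obs_var):
    """Posterior mean and variance for a normal mean with known obs variance."""
    post_prec = 1.0 / prior_var + len(obs) / obs_var   # precisions add
    post_var = 1.0 / post_prec
    post_mean = post_var * (prior_mean / prior_var + sum(obs) / obs_var)
    return post_mean, post_var

data = [1.8, 2.2, 2.0]  # sparse data with sample mean 2.0

# Strong prior centered at 0: it dominates, holding the posterior near 0.
m_strong, _ = normal_posterior(0.0, 0.01, data, 1.0)

# Weak prior with the same center: the posterior tracks the data instead.
m_weak, _ = normal_posterior(0.0, 100.0, data, 1.0)

print(round(m_strong, 3), round(m_weak, 3))  # prints 0.058 1.993
```

Rerunning such an update with perturbed prior variances is the simplest form of the sensitivity analysis described above.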
As a milder alternative, weakly informative priors can provide regularization with less risk of overriding the data.[18]

Weakly Informative Priors
Weakly informative priors are probability distributions that incorporate a minimal amount of prior knowledge, combining broad spread with some structural constraint to ensure computational stability and reasonable posterior inferences without substantially overriding the data. They typically employ heavy-tailed distributions such as the Cauchy or Student's t-distribution with low degrees of freedom and a large scale parameter, centered on plausible values such as zero for regression coefficients while still allowing extreme outcomes if the evidence supports them. For instance, a Cauchy prior with location 0 and scale 2.5 (or 10 for intercepts) down-weights implausibly large parameter values, yet remains diffuse enough for the likelihood to dominate in most scenarios.[20][21]

The primary purpose of weakly informative priors is to mitigate pathological issues in Bayesian inference, such as improper posteriors or infinite-variance estimates that can arise with fully non-informative alternatives, while staying close to objectivity by exerting only light regularization. They stabilize estimates in challenging settings such as small sample sizes, high-dimensional models, or parameter non-identifiability (e.g., complete separation in logistic regression), where the data alone might yield unstable or extreme results. By introducing just enough structure, such as finite variance and tail decay, these priors promote robust modeling without assuming strong domain-specific beliefs, making them suitable defaults for exploratory analyses or when prior elicitation is difficult.[20][21]

Weakly informative priors gained prominence in the 2010s through the advocacy of Andrew Gelman and collaborators, who emphasized their role in robust Bayesian data analysis via tools like Stan and detailed methodological guidance. In their work, Gelman et al. recommended these priors for hierarchical and regression models to balance flexibility and reliability, influencing their adoption in fields requiring reproducible inference. A representative example occurs in linear regression, where a normal prior on coefficients with mean 0 and a large standard deviation (e.g., 10) provides mild shrinkage toward zero, regularizing the model against overfitting multicollinear predictors while permitting data-driven deviations for truly important effects, thus avoiding the erratic predictions that flat priors can produce.[21][20]

Non-Informative Priors
Objective Priors
Objective priors in Bayesian statistics are probability distributions selected to exert minimal influence on the posterior distribution, thereby allowing the observed data to predominantly determine the inference. They aim to represent a state of ignorance or objectivity about the parameter values, ensuring that the posterior closely approximates the normalized likelihood when sufficient data are available.[22][23]

A prominent example is the Jeffreys prior, defined as the square root of the determinant of the Fisher information matrix, \pi(\theta) \propto \sqrt{\det \mathbf{I}(\theta)}, where \mathbf{I}(\theta) quantifies the amount of information the data provide about \theta. This construction, originally proposed by Harold Jeffreys, is invariant under smooth reparameterizations of the model: the prior transforms so as to yield the same inferences regardless of how the parameter is expressed.[22][24] For instance, in a normal distribution with known variance, the Jeffreys prior for the mean \mu is uniform, \pi(\mu) \propto 1, while for the standard deviation \sigma with known mean it is \pi(\sigma) \propto 1/\sigma.[22][25]

Another key type is the uniform prior, often used for bounded parameters to express uniformity over the possible range. For a proportion parameter \theta in a binomial model, a uniform prior on [0,1] corresponds to a Beta(1,1) distribution, which integrates to 1 and yields a posterior that is simply the likelihood normalized over the parameter space.[25][23] If y successes are observed in n trials, the posterior is Beta(1 + y, 1 + n - y), directly reflecting the data's evidential content without additional prior weighting.[25]

Reference priors extend this objectivity asymptotically, maximizing the expected Kullback-Leibler divergence between the prior and the posterior so that the prior contributes the least possible information relative to the data. Developed by José M. Bernardo, they coincide with Jeffreys priors in one-dimensional cases but provide a more robust approach for multiparameter models by prioritizing parameters of interest.[26][22] This property makes reference priors particularly suitable for achieving consistent inference as sample sizes grow, preserving the data's dominance in the limit.[26]

Improper Priors
Improper priors are prior distributions that do not integrate to a finite value over their domain, meaning ∫ p(θ) dθ = ∞, so they are not normalizable as formal probability densities.[27] Classic examples include the uniform distribution over the entire real line, (-∞, ∞), and the prior proportional to 1/θ for a positive scale parameter θ > 0, both of which assign equal weight across unbounded spaces but fail to integrate to unity.[27] Despite their mathematical impropriety, these priors can serve as limiting cases of proper distributions with increasingly diffuse support, facilitating non-informative Bayesian analysis.

For inference to be valid, an improper prior must yield a proper posterior distribution, which requires that the integral ∫ p(data|θ) p(θ) dθ be finite, so that the posterior can be normalized to integrate to 1.[28] This condition holds when the likelihood p(data|θ) decays fast enough to dominate the prior's divergence. A prominent example is the Haldane prior, Beta(0,0), which is proportional to p^{-1}(1-p)^{-1} for a binomial success probability p ∈ (0,1) and is improper due to singularities at the boundaries.[29] When combined with binomial data containing at least one success and one failure, the resulting Beta(n, m) posterior, where n and m are the numbers of successes and failures, is proper, and its mean n/(n+m) coincides with the maximum likelihood estimate, highlighting the prior's utility in objective settings.[29]

These priors offer computational simplicity, as they often lead to analytically tractable posteriors without imposing subjective beliefs, making them appealing for default analyses.[30] However, risks arise if the posterior remains improper, which can occur with insufficient data or ill-posed models, leading to paradoxes such as undefined marginal likelihoods that invalidate model comparisons.[30] Careful verification of posterior propriety is therefore essential to avoid misleading inferences.[28]

In linear regression, a flat improper prior on the coefficients β, p(β) ∝ 1, paired with the standard scale prior p(σ²) ∝ 1/σ², exemplifies these dynamics. Without data, the posterior stays improper, reflecting the model's underidentification.[31] With sufficient observations, however, the likelihood regularizes the posterior into a proper multivariate normal for β (conditional on σ²) and an inverse-gamma for σ², enabling standard Bayesian estimates akin to least squares but with uncertainty quantification.[31] This setup underscores how data can salvage inference from improper priors in well-posed problems.[30]

Prior Selection
Elicitation Techniques
Elicitation techniques for prior probabilities involve systematic processes for incorporating expert knowledge or existing data into prior distributions in a Bayesian analysis. One primary method is expert elicitation through structured questionnaires, which quantify the subjective beliefs of domain specialists. The Delphi method, for instance, conducts iterative rounds of anonymous surveys in which experts provide probability assessments, followed by feedback on the group's responses, to converge toward consensus and reduce individual biases.[32] This approach is particularly useful when direct data are scarce, allowing experts to express uncertainty via quantiles or intervals that can be aggregated into a prior distribution.[33]

Another technique aggregates historical data to inform priors, drawing on past observations or similar studies to construct distributions that reflect accumulated evidence. This often employs hierarchical models to borrow strength across related datasets, ensuring the prior captures patterns without overfitting to any single source.[34] Empirical Bayes methods further refine this by using past datasets to estimate the hyperparameters of the prior, treating the marginal likelihood as the basis for selecting a data-driven yet regularized distribution.[33] These data-informed approaches balance objectivity with the need for prior specification in new analyses.

Formal approaches to elicitation often encode elicited beliefs into distributional moments, such as the mean and variance, before fitting a parametric family like the normal or Beta distribution to match those characteristics.
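This moment-matching step can be sketched for a Beta prior as follows; the elicited mean and spread below are hypothetical values chosen for illustration.

```python
# Sketch of moment matching for elicitation: fit a Beta(alpha, beta) prior
# to an expert's stated mean and variance for a probability parameter.
# The elicited values below are hypothetical.

def beta_from_moments(mean, var):
    """Solve for (alpha, beta) so the Beta prior has the given moments."""
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("need 0 < var < mean * (1 - mean)")
    s = mean * (1.0 - mean) / var - 1.0  # acts like a prior sample size
    return mean * s, (1.0 - mean) * s

# Expert judges the success rate to be about 0.3, give or take roughly 0.1.
alpha, beta = beta_from_moments(0.3, 0.1 ** 2)
print(round(alpha, 2), round(beta, 2))  # prints 6.0 14.0
```

The implied "prior sample size" alpha + beta makes it easy to check with the expert whether the elicited prior is stronger than intended.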
For continuous parameters, experts might provide judgments on expected values and spreads, which are then used to parameterize the prior via maximum likelihood or moment matching.[35] In discrete cases, such as success probabilities, Beta distributions are commonly fitted to elicited quantiles or odds ratios.[34]

Challenges in these techniques include avoiding cognitive biases, such as overconfidence or anchoring, which can distort elicited probabilities and lead to overly narrow priors.[33] Handling uncertainty is also critical, as experts may struggle to quantify second-order uncertainties, necessitating robust aggregation methods to propagate variability into the final prior. Protocols like the SHELF framework address this by incorporating feedback loops and sensitivity checks during elicitation.[36]

A representative example involves eliciting a prior for earthquake magnitude from seismologists, who provide quantile judgments on expected magnitudes in a region. These assessments are then fitted to a log-normal distribution to capture the skewed nature of seismic events, yielding a prior that informs probabilistic hazard models.[37]

Conjugate Priors
In Bayesian statistics, a conjugate prior for a parameter \theta is a prior distribution p(\theta) such that, when combined with a likelihood p(x|\theta) via Bayes' theorem, the resulting posterior p(\theta|x) belongs to the same distributional family as the prior. This conjugacy ensures analytical tractability, as the posterior is obtained simply by updating the prior's hyperparameters from the observed data. The formalization of conjugate priors traces back to Raiffa and Schlaifer (1961), who emphasized their role in decision-theoretic contexts where sufficient statistics of fixed dimension enable closed-form solutions.[38]

Conjugate priors are most naturally defined for likelihoods from exponential families, where the prior is constructed to mimic the likelihood's kernel. Key examples include the beta distribution as conjugate to the binomial or Bernoulli likelihood for modeling success probabilities, the gamma distribution as conjugate to the Poisson likelihood for rates, and the normal distribution as conjugate to the normal likelihood for mean estimation with known variance. For multivariate settings, the inverse-Wishart distribution serves as the conjugate prior for the covariance matrix of a multivariate normal likelihood. These pairs are summarized in the following table of common conjugate relationships:

| Likelihood Model | Parameter(s) | Conjugate Prior Family | Hyperparameters |
|---|---|---|---|
| Bernoulli/Binomial | p | Beta | \alpha > 0, \beta > 0 |
| Poisson | \lambda | Gamma | \alpha > 0, \beta > 0 |
| Normal (known variance) | \mu | Normal | Mean m, precision \rho |
| Normal (known mean) | \sigma^2 | Inverse-gamma | Shape \alpha, scale \beta |
| Multivariate normal | \Sigma | Inverse-Wishart | Degrees of freedom \nu, scale matrix S |
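The first row of the table can be sketched in a few lines: for a Beta prior and binomial data, updating amounts to adding the observed counts to the hyperparameters. The prior and counts below are arbitrary illustrative values.

```python
# Sketch of the table's first row: a Beta(alpha, beta) prior on a binomial
# success probability updates in closed form, with counts added to the
# hyperparameters. Values here are illustrative.

def beta_binomial_update(alpha, beta, successes, failures):
    """Return the posterior hyperparameters of the conjugate Beta prior."""
    return alpha + successes, beta + failures

# Uniform Beta(1, 1) prior, then 7 successes and 3 failures are observed.
a_post, b_post = beta_binomial_update(1.0, 1.0, 7, 3)
post_mean = a_post / (a_post + b_post)  # posterior mean of p
print(a_post, b_post, round(post_mean, 3))  # prints 8.0 4.0 0.667
```

The same add-the-sufficient-statistics pattern underlies the other rows, e.g. gamma-Poisson updates add the event count and the exposure.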