In Bayesian statistics, the Jeffreys prior is a non-informative prior distribution designed to reflect a lack of prior knowledge about model parameters while ensuring objectivity and invariance under reparameterization.[1] It is mathematically defined as the prior density \pi_J(\theta) \propto \sqrt{\det \mathbf{I}(\theta)}, where \mathbf{I}(\theta) is the Fisher information matrix, which measures the amount of information that an observable random variable carries about an unknown parameter \theta.[1] This formulation arises from the expected value of the negative Hessian of the log-likelihood, \mathbf{I}(\theta) = -\mathbb{E}_\theta \left[ \frac{\partial^2}{\partial \theta \, \partial \theta^T} \log p(X \mid \theta) \right], so the prior scales with the local precision the model provides about its parameters.[2]

Proposed by the British mathematician, physicist, and statistician Sir Harold Jeffreys (1891–1989), the prior was introduced in his influential book Theory of Probability (first published in 1939) as part of a broader framework for objective Bayesian inference.[3] Jeffreys developed it to provide a systematic, transformation-invariant method for selecting priors in scientific applications, countering the subjectivity often associated with earlier Bayesian approaches and promoting the use of inverse probability for inductive reasoning.[4] In the book, he emphasized priors that maximize ignorance while aligning with the geometry of the parameter space, drawing on concepts from information theory and differential geometry avant la lettre.[4]

A defining property of the Jeffreys prior is its invariance under reparameterization: if \phi = g(\theta) is a one-to-one transformation, then applying the change-of-variables Jacobian to \pi_J(\theta) yields exactly \pi_J(\phi) \propto \sqrt{\det \mathbf{I}(\phi)}, ensuring consistent posterior inferences regardless of how parameters are labeled.[2] For univariate cases it simplifies to \pi_J(\theta) \propto \sqrt{I(\theta)}, where I(\theta) is the scalar Fisher information; common examples include the Beta(0.5, 0.5) distribution for the binomial success probability \theta (arising from I(\theta) = n / [\theta(1-\theta)]), a uniform prior on the mean \mu of a normal distribution with known variance, and a prior proportional to 1/\sigma for the standard deviation \sigma.[1] This invariance distinguishes it from simpler non-informative priors such as the uniform distribution, which can lead to paradoxes under nonlinear transformations.[2]

The Jeffreys prior has been foundational in objective Bayesian methods, facilitating hypothesis testing, parameter estimation, and model comparison in fields from physics to machine learning, often serving as a default choice when substantive prior information is unavailable.[4] However, it is frequently improper (integrating to infinity), which can complicate posterior normalization in low-data regimes, and in multiparameter settings it may concentrate probability in undesirable regions, prompting refinements such as reference priors.[1] It also underlies the Jeffreys-Lindley paradox, highlighting tensions between Bayesian and frequentist inference for point null hypotheses as sample sizes grow large.[5] Despite these limitations, its role in promoting rigorous, geometry-aware prior selection endures, influencing modern developments in statistical decision theory.[2]
Historical Background and Motivation
Development by Harold Jeffreys
Harold Jeffreys, a British mathematician and statistician, introduced the Jeffreys prior in the second edition of his influential book Theory of Probability, published in 1948 (building on his 1946 paper "An Invariant Form for the Prior Probability in Estimation Problems"), with the third edition of 1961 further refining and expanding its presentation.[6][7] The prior emerged as a cornerstone of his effort to establish a rigorous foundation for Bayesian inference, specifically designed to provide an objective and non-informative starting point for parameter estimation and hypothesis testing.

Jeffreys' formulation was driven by the need for priors that maintain objectivity while being invariant to changes in parameterization, thereby avoiding the arbitrary outcomes produced by simple uniform priors, which could favor certain scales over others depending on how parameters were defined.[6] This approach built directly on earlier Bayesian traditions, including Pierre-Simon Laplace's principle of insufficient reason from the early 19th century, but Jeffreys advanced it by prioritizing invariance as a key criterion for non-informative priors, ensuring consistency across different representations of the same problem.[8]

The development occurred against the backdrop of intense methodological debates in the 1930s, as frequentist statistics gained prominence through the work of Jerzy Neyman and Egon Pearson, who emphasized error rates and long-run frequencies in hypothesis testing, often criticizing Bayesian reliance on priors as subjective.[9] Jeffreys countered these critiques by advocating Bayesian methods as more logically coherent for scientific inference, particularly in addressing the limitations of p-values and fixed significance levels; his prior played a central role in this defense, enabling principled assessments of evidence in testing point null hypotheses and estimating parameters.[6]
Motivation for Invariant Non-Informative Priors
Non-informative priors in Bayesian statistics are designed to express a state of ignorance about the parameters, allowing the data to drive the inference without introducing subjective beliefs from the analyst. These priors seek to minimize the influence of prior assumptions, particularly in objective Bayesian approaches where the goal is to base conclusions solely on the observed evidence. However, simpler choices like uniform priors often fail to achieve this neutrality, as they do not remain consistent under changes in parameterization. For instance, assuming a uniform prior on a parameter θ yields different posterior inferences compared to a uniform prior on log θ, which can alter the effective weighting of the data and lead to contradictory results depending on the chosen scale.[10][11]

The requirement for invariance addresses this fundamental issue by ensuring that the prior distribution produces equivalent posterior inferences regardless of how the parameters are parameterized, thereby resolving scale-dependent ambiguities that arise in scientific modeling. This principle is crucial for maintaining consistency in inductive reasoning, as different parameterizations, such as switching between additive and multiplicative scales, should not affect the logical conclusions drawn from the data. Without invariance, inferences become artifacts of arbitrary choices in model formulation, undermining the reliability of Bayesian methods in empirical applications. Jeffreys emphasized that such priors should reflect a principled notion of ignorance, derived from the intrinsic geometry of the parameter space rather than ad-hoc selections, to avoid these inconsistencies.[11][10]

Early illustrations of these problems appear in location-scale families: a uniform prior on the scale parameter, such as the standard deviation σ in a normal distribution, is improper and effectively places all of its mass on σ exceeding any fixed finite value, and it can bias estimation. In these cases, the uniform prior fails to account for the natural metric of the parameter space, leading to unacceptable consequences such as infinite ranges or skewed inferences. Jeffreys' approach, building on his earlier work in probability theory and his 1946 paper on invariant priors, advocates priors that are invariant under group transformations inherent to the model, providing a systematic way to construct non-informative distributions that preserve the intended objectivity.[11][10][7]
Definition and Derivation
Fisher Information Matrix
The Fisher information serves as a fundamental measure in statistical inference, quantifying the amount of information that an observable random variable provides about an unknown parameter through the likelihood function. For a single parameter \theta and likelihood f(x|\theta), the Fisher information I(\theta) is defined as the expected value of the negative second derivative of the log-likelihood:

I(\theta) = -\mathbb{E}_\theta\left[ \frac{\partial^2 \log f(x|\theta)}{\partial \theta^2} \right],

which is equivalent to the variance of the score function, the first derivative of the log-likelihood:

I(\theta) = \mathrm{Var}_\theta\left[ \frac{\partial \log f(x|\theta)}{\partial \theta} \right].

This equivalence holds under regularity conditions ensuring the interchangeability of differentiation and expectation.[12][13]

In the multiparameter case, where \theta = (\theta_1, \dots, \theta_k)^\top is a k-dimensional parameter vector, the Fisher information is represented by a k \times k symmetric matrix \mathbf{I}(\theta) with elements given by

I_{ij}(\theta) = \mathbb{E}_\theta\left[ \frac{\partial \log f(x|\theta)}{\partial \theta_i} \frac{\partial \log f(x|\theta)}{\partial \theta_j} \right] = -\mathbb{E}_\theta\left[ \frac{\partial^2 \log f(x|\theta)}{\partial \theta_i \partial \theta_j} \right].

The matrix \mathbf{I}(\theta) is positive semi-definite, reflecting its role as a measure of the data's sensitivity to changes in \theta; higher values indicate greater precision in estimating the parameter from the data. In frequentist settings, the inverse of the Fisher information matrix provides the asymptotic covariance matrix of the maximum likelihood estimator, such that \sqrt{n} (\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, \mathbf{I}(\theta)^{-1}) as the sample size n grows, establishing the efficiency bound for estimators.[13][14]

Within Bayesian inference, the Fisher information matrix captures the local curvature of the log-likelihood surface, enabling second-order approximations to the posterior distribution and informing the construction of non-informative priors that respect the model's information structure. This curvature interpretation underscores its utility in assessing parameter identifiability and sensitivity, where regions of low information correspond to flat likelihoods and higher posterior uncertainty.[13]
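As a concrete illustration of the score-variance identity above, the following Python sketch (not from the source; the Poisson model, seed, and sample size are illustrative choices) estimates the Fisher information by Monte Carlo and compares it with the analytic value 1/\lambda.

```python
# Minimal sketch (illustrative, not from the source): estimate the Fisher
# information of a Poisson(lam) model via the identity I(lam) = Var[score],
# and compare with the analytic value 1/lam.
import numpy as np

rng = np.random.default_rng(0)
lam = 3.0
x = rng.poisson(lam, size=200_000)   # draws from the model at the true parameter

# Score of one observation: d/dlam [x*log(lam) - lam - log(x!)] = x/lam - 1
score = x / lam - 1.0

fisher_mc = score.var()              # variance of the score approximates I(lam)
fisher_exact = 1.0 / lam
print(fisher_mc, fisher_exact)       # both close to 1/3
```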
One-Parameter Case
In the one-parameter case, the Jeffreys prior for a parameter \theta is defined as \pi(\theta) \propto \sqrt{I(\theta)}, where I(\theta) denotes the scalar Fisher information evaluated at \theta.[11] This construction arises from the requirement that the prior should be non-informative in a way that respects the geometry of the parameter space, leading to a form that scales with the local sensitivity of the likelihood to changes in \theta.

The key motivation for this form is its invariance under one-to-one reparameterizations of \theta. Suppose \phi = \phi(\theta) is a differentiable one-to-one transformation with inverse \theta = \theta(\phi) and Jacobian derivative d\theta/d\phi \neq 0. The Fisher information transforms according to

I'(\phi) = I(\theta) \left( \frac{d\theta}{d\phi} \right)^2,

where I'(\phi) is the Fisher information in the \phi-parameterization.[11] Under a change of variables, the prior density transforms as \pi'(\phi) = \pi(\theta(\phi)) \left| d\theta/d\phi \right|. Substituting the Jeffreys form \pi(\theta) \propto \sqrt{I(\theta)} yields

\pi'(\phi) \propto \sqrt{I(\theta)} \left| \frac{d\theta}{d\phi} \right| = \sqrt{ I(\theta) \left( \frac{d\theta}{d\phi} \right)^2 } = \sqrt{I'(\phi)},

demonstrating that \pi'(\phi) \propto \sqrt{I'(\phi)}, the Jeffreys prior in the new parameterization.[11] This property holds for any smooth one-to-one reparameterization, ensuring the prior is invariant under the group of diffeomorphisms on the parameter space.

The Jeffreys prior in the one-parameter case is often improper, meaning \int \pi(\theta) \, d\theta = \infty, as seen in cases like uniform priors for location parameters.[15] However, under standard regularity conditions, such as the existence of the Fisher information, twice differentiability of the log-likelihood, and integrability of the likelihood, the resulting posterior distribution remains proper (normalizable) when combined with observed data.[15]

Up to multiplication by a constant, this prior is unique among densities that satisfy invariance under all one-to-one reparameterizations, as the invariance condition uniquely determines the functional form proportional to \sqrt{I(\theta)}.[11]
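The invariance argument can be checked numerically. The sketch below (illustrative, not from the source) uses the Bernoulli model with I(p) = 1/[p(1-p)] and the log-odds reparameterization \phi = \log(p/(1-p)), verifying that the Jeffreys prior computed directly in \phi coincides with the p-space prior pushed through the change of variables.

```python
# Minimal sketch (illustrative, not from the source): numerical check of
# reparameterization invariance for the Bernoulli model, p vs. log-odds phi.
import numpy as np

def fisher_p(p):
    # I(p) = 1 / (p (1 - p)) for a single Bernoulli observation
    return 1.0 / (p * (1.0 - p))

phi = np.linspace(-4.0, 4.0, 9)
p = 1.0 / (1.0 + np.exp(-phi))       # inverse transform p(phi)
dp_dphi = p * (1.0 - p)              # Jacobian of the transformation

# Jeffreys prior computed directly in the phi parameterization ...
prior_phi_direct = np.sqrt(fisher_p(p) * dp_dphi**2)   # sqrt(I'(phi))

# ... versus the p-space Jeffreys prior pushed through the change of variables.
prior_phi_transformed = np.sqrt(fisher_p(p)) * np.abs(dp_dphi)

assert np.allclose(prior_phi_direct, prior_phi_transformed)
```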
Multi-Parameter Case
In the multi-parameter case, the Jeffreys prior extends the scalar formulation by incorporating the full Fisher information matrix I(\theta), where \theta = (\theta_1, \dots, \theta_k)^\top is a k-dimensional parameter vector. The prior density is given by

\pi(\theta) \propto \sqrt{|\det I(\theta)|},

with the Fisher information matrix defined as

I(\theta)_{ij} = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta_i \partial\theta_j} \log f(X|\theta)\right],

where f(X|\theta) is the likelihood function. The square root of the determinant arises because it measures the volume distortion in the parameter space under reparameterization, ensuring the prior reflects the geometry of the information provided by the data. This form generalizes the one-parameter case, where \sqrt{|\det I(\theta)|} = \sqrt{I(\theta)}, and was originally proposed by Jeffreys to achieve invariance in joint inference over multiple parameters.[13]

The prior's invariance under reparameterization is demonstrated through the transformation properties of the Fisher information matrix. Consider a differentiable reparameterization \phi = \phi(\theta) with Jacobian matrix J = \partial\phi / \partial\theta. The transformed Fisher information matrix I'(\phi) satisfies

\det I'(\phi) = \frac{\det I(\theta)}{|\det J|^2}.

Consequently, the transformed prior density \pi'(\phi) includes the Jacobian factor from the change of variables, yielding

\pi'(\phi) \propto \sqrt{|\det I'(\phi)|} = \frac{\sqrt{|\det I(\theta)|}}{|\det J|},

which, when multiplied by |\det J|, recovers \pi(\theta), preserving the prior measure across parameterizations. This property ensures that inferences remain consistent regardless of the chosen parameterization.[16]

Despite these advantages, the multi-parameter Jeffreys prior faces challenges, particularly in its optimality for inference on subsets of parameters. It may not yield satisfactory marginal posteriors for individual parameters or groups, as the joint structure can lead to inconsistencies, such as biased estimators in problems like the Neyman-Scott paradox. These issues arise because the prior optimizes joint information but neglects parameter ordering or conditional inference, prompting generalizations like reference priors for targeted marginalization.

Computationally, evaluating the prior requires calculating the determinant of I(\theta), typically by estimating the expected Hessian matrix of the log-likelihood, \mathbf{H}(\theta) = -\partial^2 \log f(X|\theta) / \partial\theta \, \partial\theta^\top, and taking I(\theta) = \mathbb{E}[\mathbf{H}(\theta)]. For complex models, this often involves numerical integration or Monte Carlo methods to approximate the expectation, especially when analytical forms are unavailable.[13]
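For a concrete two-parameter case, the sketch below (illustrative, not from the source; the normal model and Monte Carlo sizes are assumptions for the example) approximates the Fisher information matrix of a normal model with unknown (\mu, \sigma) from the outer product of scores and recovers \sqrt{\det I} \propto 1/\sigma^2, the joint Jeffreys prior for this model.

```python
# Minimal sketch (illustrative, not from the source): Monte Carlo estimate of
# the Fisher matrix for a normal model with theta = (mu, sigma), and the
# corresponding Jeffreys prior density sqrt(det I) ~ sqrt(2) / sigma^2.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.5, 2.0
x = rng.normal(mu, sigma, size=500_000)

# Per-observation scores with respect to mu and sigma
s_mu = (x - mu) / sigma**2
s_sigma = ((x - mu) ** 2 - sigma**2) / sigma**3
scores = np.stack([s_mu, s_sigma], axis=1)

I_mc = scores.T @ scores / len(x)            # E[score score^T] ~ Fisher matrix
I_exact = np.array([[1 / sigma**2, 0.0],
                    [0.0, 2 / sigma**2]])

print(np.sqrt(np.linalg.det(I_mc)))          # ~ sqrt(2) / sigma^2 = 0.3536...
print(np.sqrt(np.linalg.det(I_exact)))
```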
Properties
Invariance under Reparameterization
One key property of the Jeffreys prior is its invariance under reparameterization, meaning that if the parameter space is transformed via a diffeomorphism, the prior density adjusts by the appropriate Jacobian factor and retains the same functional form in the new coordinates. This ensures that the prior transforms as a density should under such changes of variables, unlike a uniform prior, which generally does not transform invariantly and can lead to inconsistent inferences depending on the chosen parameterization.[2]

To see this formally, consider a one-parameter model with parameter \theta and Jeffreys prior \pi(\theta) \propto \sqrt{I(\theta)}, where I(\theta) is the Fisher information. Now reparameterize to \phi = h(\theta), where h is a diffeomorphism with inverse \theta = h^{-1}(\phi) and Jacobian determinant |J| = \left| \frac{d\theta}{d\phi} \right|. The Fisher information transforms as

I(\phi) = I(\theta) \left( \frac{d\theta}{d\phi} \right)^2,

since the expected value of the squared score function scales by the chain rule. The Jeffreys prior in the new parameterization is then \pi(\phi) \propto \sqrt{I(\phi)} = \sqrt{I(\theta)} \left| \frac{d\theta}{d\phi} \right|. Transforming the original prior to the new scale gives \pi(\phi) = \pi(\theta) \left| \frac{d\theta}{d\phi} \right| \propto \sqrt{I(\theta)} \left| \frac{d\theta}{d\phi} \right|, which matches exactly. Thus, the prior is equivariant under reparameterization, preserving its structure.[2][16]

In the multi-parameter case, the Jeffreys prior \pi(\boldsymbol{\theta}) \propto \sqrt{\det \mathbf{I}(\boldsymbol{\theta})} follows analogously, with the transformation law for the Fisher information matrix \mathbf{I}(\phi) = \mathbf{J}^T \mathbf{I}(\theta) \mathbf{J}, where \mathbf{J} is the Jacobian matrix of the inverse transformation, leading to \det \mathbf{I}(\phi) = \det \mathbf{I}(\theta) \cdot (\det \mathbf{J})^2. The prior then transforms as \pi(\phi) = \pi(\theta) |\det \mathbf{J}|, confirming global invariance. This property implies that posterior inferences, such as credible intervals or posterior means, remain consistent across parameterizations, promoting scientific objectivity by avoiding artifacts from arbitrary choices of coordinates.[16][1]

This contrasts with group-invariant priors, which ensure invariance only for specific transformation groups (e.g., location-scale) but may fail more broadly.[1]

Historically, this invariance addressed concerns Harold Jeffreys raised about earlier proposals for non-informative priors, such as the one due to J.B.S. Haldane, which were tied to particular transformations and lacked general applicability across diffeomorphisms. Jeffreys introduced the Fisher-information-based form in 1946 to provide a systematic, fully invariant solution for estimation problems.[11]
Uniqueness and Other Attributes
The Jeffreys prior stands out among non-informative priors as one that maintains invariance under reparameterization; in invariant statistical models, where the parameter space admits a transitive group action, it is one of the relatively invariant priors available. This property arises from its construction as the density proportional to the square root of the determinant of the Fisher information matrix, ensuring consistency across different coordinate systems without introducing extraneous structure. In the broader context of information geometry, this uniqueness extends to the prior inducing the only invariant volume element (up to scale) compatible with the diffeomorphism group acting on the parameter manifold, as the underlying Fisher-Rao metric is itself the unique smooth, weak Riemannian metric invariant under such transformations on spaces of densities.[17]

A key attribute contributing to the perceived objectivity of the Jeffreys prior is its derivation exclusively from the intrinsic geometry of the sampling model, via the Fisher information matrix, which defines a natural Riemannian metric on the parameter space. This approach encodes "ignorance" about the parameter by scaling the prior according to the model's sensitivity to parameter changes, thereby avoiding subjective choices tied to specific units or parameterizations that could bias inference. Unlike uniform priors, which may inadvertently favor certain scales, the Jeffreys prior aligns with the statistical manifold's geometry, promoting inferences that are robust to such arbitrary decisions and reflecting a form of model-based neutrality.[15]

The Jeffreys prior is frequently improper, with an infinite integral over the parameter space, particularly when parameters range over unbounded domains like the positive reals or the entire real line; however, it typically produces proper posterior distributions when combined with data from standard likelihoods satisfying regularity conditions, such as those ensuring the marginal likelihood is finite. Propriety of the prior itself occurs in cases where the parameter space is compact, confining the support to a bounded region and yielding a finite normalizing constant. Despite its impropriety, the prior does not compromise inference validity in asymptotic regimes.

Asymptotically, the posterior under the Jeffreys prior exhibits desirable concentration properties: it localizes around the maximum likelihood estimator at the parametric rate of \sqrt{n}, where n is the sample size, under mild regularity assumptions on the model. Furthermore, the Bernstein-von Mises theorem guarantees that this posterior converges in total variation to a normal distribution centered at the maximum likelihood estimator with covariance matrix equal to the inverse of the Fisher information matrix (scaled by 1/n), facilitating frequentist-like confidence statements from Bayesian procedures even with this non-informative prior. These attributes underscore the prior's utility in large-sample settings, where it bridges Bayesian and classical inference paradigms.
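The Bernstein-von Mises behaviour can be seen directly in a simple conjugate case. The sketch below (illustrative, not from the source; the counts are made up) compares the Beta posterior obtained from the Jeffreys Beta(1/2, 1/2) prior for Bernoulli data with the normal approximation centred at the MLE with variance \hat{p}(1-\hat{p})/n.

```python
# Minimal sketch (illustrative, not from the source): Bernstein-von Mises check
# for Bernoulli data under the Jeffreys prior - for large n the Beta posterior
# is close to N(p_hat, p_hat (1 - p_hat) / n).
import numpy as np
from scipy import stats

n, s = 2000, 1240                    # hypothetical sample size and success count
p_hat = s / n

posterior = stats.beta(s + 0.5, n - s + 0.5)
gaussian = stats.norm(loc=p_hat, scale=np.sqrt(p_hat * (1 - p_hat) / n))

grid = np.linspace(0.58, 0.66, 9)
print(posterior.pdf(grid))
print(gaussian.pdf(grid))            # the two densities nearly coincide
```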
Information-Theoretic Connections
Minimum Description Length Principle
The Minimum Description Length (MDL) principle, developed by Jorma Rissanen, advocates selecting models or priors that minimize the total expected length required to encode both the observed data and the model parameters, thereby capturing regularities in the data through optimal compression.[18] In this framework, the "description length" measures the uncertainty in data generation and parameter specification, where priors play a crucial role in encoding the parameters efficiently given the data's information content.[19]

Within MDL, the Jeffreys prior emerges as the optimal choice for minimizing the asymptotic expected code length in parametric models. This prior, proportional to the square root of the determinant of the Fisher information matrix, quantifies the local "volume" of the parameter space needed to encode the parameter \theta after observing the data, as the Fisher information I(\theta) inversely scales with estimation precision.[18] Specifically, the derivation shows that the worst-case redundancy in code length grows as \frac{k}{2} \log n + O(1) for k-dimensional parameters and large sample size n, and the Jeffreys prior asymptotically achieves this minimax bound by aligning the prior density with the geometric structure induced by the information matrix.[19]

The explicit link to the Jeffreys prior formula, \pi(\theta) \propto \sqrt{|\det I(\theta)|}, arises through the asymptotic (Laplace-type) approximation of the normalized maximum likelihood (NML) code in MDL, under which the Bayesian marginal likelihood with this invariant prior matches the NML code length up to vanishing terms.[18] This approximation ensures that the total code length, comprising the negative log-likelihood plus a penalty term for parameter complexity, is asymptotically optimal for universal coding.[19]

This MDL perspective offers a frequentist justification for the Jeffreys prior, demonstrating its role in achieving the information-theoretic lower bounds on redundancy and thus bridging Bayesian inference with coding theory by emphasizing compression efficiency under parametric uncertainty.[20]
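To make the coding connection concrete, the sketch below (illustrative, not from the source) computes the exact parametric complexity of the Bernoulli model, i.e. the log-normalizer of the NML code, and compares it with the asymptotic expression \frac{k}{2}\log\frac{n}{2\pi} + \log\int\sqrt{\det I(\theta)}\,d\theta, using k = 1 and \int_0^1 [p(1-p)]^{-1/2}\,dp = \pi.

```python
# Minimal sketch (illustrative, not from the source): exact vs. asymptotic
# parametric complexity of the Bernoulli model, showing where the Jeffreys
# prior integral enters the MDL/NML code length.
import math

n = 1000

# Exact parametric complexity: log of the sum over datasets of the maximized
# likelihood; for Bernoulli it depends only on the success count s
# (note 0.0 ** 0 evaluates to 1.0 in Python, matching the 0 log 0 = 0 convention).
comp_exact = math.log(sum(
    math.comb(n, s) * (s / n) ** s * ((n - s) / n) ** (n - s)
    for s in range(n + 1)
))

# Asymptotic approximation: (k/2) log(n / 2pi) + log( integral of sqrt(I(p)) ),
# with k = 1 and the integral equal to pi for the Bernoulli model.
comp_asym = 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)

print(comp_exact, comp_asym)         # the two values agree closely for large n
```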
Relation to Reference Priors
Reference priors, introduced by José M. Bernardo in 1979, represent a generalization of noninformative priors designed to maximize the expected Kullback-Leibler divergence between the posterior distribution and the prior, thereby optimizing the information gained from the data in an asymptotic sense. This criterion leads to priors that are particularly suitable for Bayesian inference, emphasizing the parameter of interest while accounting for experimental design and sampling properties.

In the single-parameter case, the reference prior coincides exactly with the Jeffreys prior under standard regularity conditions, confirming its role as a foundational special case for full-parameter inference. However, the key distinction arises in multiparameter models, where the Jeffreys prior, derived from the square root of the determinant of the Fisher information matrix, often fails to provide noninformative marginal posteriors for subsets of parameters, such as when nuisance parameters are present. Reference priors address this by employing an iterative procedure that groups parameters hierarchically, conditioning on nuisance parameters to derive a prior that is asymptotically optimal for the parameters of interest.

Historically, the reference prior framework builds directly on Harold Jeffreys' work from the 1960s, particularly his revisions to multiparameter priors in the third edition of Theory of Probability (1961), which highlighted issues like dependence on parameterization for marginal inference but did not fully resolve them. Bernardo's approach, further refined by James O. Berger and collaborators in subsequent decades, overcomes these shortcomings by prioritizing inference for specific parameters through sequential maximization of missing information, ensuring greater robustness in complex models.

Under regularity conditions, such as continuous parameter spaces and well-behaved likelihoods, the reference prior asymptotically approximates the Jeffreys prior when all parameters are of equal interest, underscoring their shared invariance properties while extending applicability to targeted inference scenarios.
Examples
Gaussian Mean Parameter
Consider the model where independent observations X_i \sim \mathcal{N}(\mu, \sigma^2) for i = 1, \dots, n, with the variance \sigma^2 known and the mean \mu the parameter of interest.[2] The likelihood function is given by

f(\mathbf{x} \mid \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right),

which simplifies to a form proportional to \exp\left( -\frac{n}{2\sigma^2} (\bar{x} - \mu)^2 \right).[2][21]

The Fisher information for the parameter \mu based on n observations is I(\mu) = \frac{n}{\sigma^2}, a constant that does not depend on \mu.[2][21] According to Jeffreys' rule, the prior is proportional to the square root of the Fisher information, yielding \pi(\mu) \propto \sqrt{I(\mu)} \propto 1.[2][22] This results in a flat improper uniform prior over the entire real line, reflecting complete prior ignorance about the location of \mu.[21][22]

This prior is non-informative for the location parameter \mu, ensuring that inferences remain invariant under reparameterization.[2][21] The resulting posterior distribution is \mu \mid \mathbf{x} \sim \mathcal{N}\left( \bar{x}, \frac{\sigma^2}{n} \right), which is proper and independent of the arbitrary constant in the prior.[2] In practice, this posterior leads to standard Bayesian credible intervals, such as the 100(1-\alpha)% interval \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, which asymptotically coincide with frequentist confidence intervals.[2]
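A short sketch of this example (illustrative, not from the source; the simulated data, seed, and true mean are assumptions) computes the posterior \mathcal{N}(\bar{x}, \sigma^2/n) and the corresponding 95% credible interval.

```python
# Minimal sketch (illustrative, not from the source): Jeffreys (flat) prior
# analysis of a normal mean with known variance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sigma = 2.0                                  # known standard deviation
x = rng.normal(1.5, sigma, size=50)          # simulated data, true mean 1.5

n, xbar = len(x), x.mean()
post = stats.norm(loc=xbar, scale=sigma / np.sqrt(n))

print(xbar)
print(post.interval(0.95))                   # equals xbar +/- 1.96 * sigma / sqrt(n)
```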
Gaussian Variance Parameter
Consider a sample of independent observations X_1, \dots, X_n drawn from a normal distribution \mathcal{N}(\mu, \sigma^2), where the mean \mu is known and the variance \sigma^2 > 0 is the unknown parameter of interest.[2]

The log-likelihood function for this model is

\ell(\sigma^2 \mid \mathbf{X}) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2.

Differentiating twice with respect to \sigma^2 yields the observed information, and taking the expectation gives the (scalar) Fisher information I(\sigma^2) = \frac{n}{2 (\sigma^2)^2}.[23][2]

Jeffreys prior is then proportional to the square root of this Fisher information: \pi(\sigma^2) \propto \sqrt{I(\sigma^2)} \propto \frac{1}{\sigma^2}. This improper prior arises directly from the invariance principle underlying Jeffreys' rule.[23][24]

For convenience in some analyses, reparameterize to the precision \tau = 1/\sigma^2 > 0. The transformation of the prior density under the Jacobian |d\sigma^2 / d\tau| = 1/\tau^2 yields \pi(\tau) \propto 1/\tau. Equivalently, the Fisher information in terms of \tau is I(\tau) = n/(2 \tau^2), so \pi(\tau) \propto \sqrt{I(\tau)} \propto 1/\tau.[23][24]

This prior is scale-invariant: if the data and parameter are rescaled by a constant k > 0 (so the new variance is k^2 \sigma^2), the prior density transforms appropriately to maintain the same form, ensuring consistency under changes of units. In contrast, a uniform prior on \sigma^2 (i.e., \pi(\sigma^2) \propto 1) lacks this invariance, as rescaling would distort the prior to favor unrealistically small variances in the new scale.[24][2]

When combined with the normal likelihood, the posterior for \sigma^2 under this prior is inverse-gamma distributed: \sigma^2 \mid \mathbf{X} \sim \text{IG}(n/2, S/2), where S = \sum_{i=1}^n (X_i - \mu)^2. This posterior is proper for any n \geq 1, providing a well-defined inference despite the impropriety of the prior itself.[23][24]
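The corresponding computation (illustrative sketch, not from the source; the simulated data and seed are assumptions) uses the inverse-gamma posterior \mathrm{IG}(n/2, S/2) directly.

```python
# Minimal sketch (illustrative, not from the source): Jeffreys prior 1/sigma^2
# for a normal variance with known mean; posterior is IG(n/2, S/2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, true_var = 0.0, 4.0
x = rng.normal(mu, np.sqrt(true_var), size=40)

n = len(x)
S = np.sum((x - mu) ** 2)
post = stats.invgamma(a=n / 2, scale=S / 2)  # shape n/2, scale S/2

print(post.mean())                           # S / (n - 2), near the true variance
print(post.interval(0.95))                   # 95% credible interval for sigma^2
```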
Poisson Rate Parameter
Consider the Poisson distribution for independent observations X_1, \dots, X_n \sim \text{Poisson}(\lambda), where \lambda > 0 is the unknown rate parameter. The likelihood function is given by

L(\lambda \mid \mathbf{x}) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} \propto \lambda^{\sum x_i} e^{-n \lambda},

which arises from the probability mass function of the Poisson distribution.[23]

The Fisher information for a single observation X \sim \text{Poisson}(\lambda) is derived from the second derivative of the log-likelihood: the score function is (X - \lambda)/\lambda, and its negative expected second derivative yields I(\lambda) = 1/\lambda. For n independent observations, the total Fisher information scales to I_n(\lambda) = n / \lambda.[25][23]

Jeffreys prior is then proportional to the square root of the Fisher information, \pi(\lambda) \propto \sqrt{I_n(\lambda)} \propto \lambda^{-1/2}, which is independent of n up to a constant factor. This improper prior corresponds to the limit of a Gamma distribution with shape parameter 1/2 and rate parameter approaching 0, \pi(\lambda) \propto \lambda^{1/2 - 1} e^{-0 \cdot \lambda}.[23][26]

Combining this prior with the likelihood yields a posterior distribution that is Gamma with updated shape \sum x_i + 1/2 and rate n, so \pi(\lambda \mid \mathbf{x}) \sim \text{Gamma}\left( \sum x_i + \frac{1}{2}, n \right). The posterior mean is \left( \sum x_i + 1/2 \right) / n, which introduces a slight upward shift from the maximum likelihood estimator \sum x_i / n.[26][23]

To illustrate invariance under reparameterization, consider the transformation \eta = \log \lambda, so \lambda = e^\eta. The Fisher information transforms as I(\eta) = I(\lambda) \left( \frac{d\lambda}{d\eta} \right)^2 = (1/\lambda) e^{2\eta} = e^\eta. Thus, the Jeffreys prior in \eta is \pi(\eta) \propto \sqrt{e^\eta} = e^{\eta/2}. Transforming the original prior gives \pi(\eta) \propto \lambda^{-1/2} \frac{d\lambda}{d\eta} = e^{-\eta/2} e^\eta = e^{\eta/2}, confirming consistency. In contrast, a uniform prior on \lambda transforms to \pi(\eta) \propto e^\eta, which is not uniform on \eta.[23]
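A brief sketch (illustrative, not from the source; the simulated counts, seed, and true rate are assumptions) carries out the conjugate update to the Gamma posterior described above.

```python
# Minimal sketch (illustrative, not from the source): Jeffreys analysis of a
# Poisson rate; prior lam^(-1/2), posterior Gamma(sum(x) + 1/2, rate = n).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.poisson(3.0, size=30)                 # simulated counts, true rate 3.0

n, s = len(x), x.sum()
post = stats.gamma(a=s + 0.5, scale=1.0 / n)  # scipy's scale is 1 / rate

print(post.mean())                            # (s + 0.5) / n, slightly above the MLE s / n
print(post.interval(0.95))
```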
Bernoulli Probability Parameter
In the Bernoulli model, each observation X_i (for i = 1, \dots, n) is independently distributed as \mathrm{Bernoulli}(p), where p \in (0,1) is the success probability. The likelihood function is L(p \mid \mathbf{x}) = \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i} = p^s (1-p)^{n-s}, with s = \sum_{i=1}^n x_i denoting the observed number of successes.[11]

To derive Jeffreys prior, compute the Fisher information based on a single observation, as the prior is constructed independently of sample size. The log-likelihood for one observation is \ell(p \mid x) = x \log p + (1-x) \log(1-p). The score function is \frac{\partial \ell}{\partial p} = \frac{x}{p} - \frac{1-x}{1-p}, and the observed information is -\frac{\partial^2 \ell}{\partial p^2} = \frac{x}{p^2} + \frac{1-x}{(1-p)^2}. The expected Fisher information is

I(p) = \mathbb{E}\left[-\frac{\partial^2 \ell}{\partial p^2}\right] = \frac{1}{p(1-p)}.

Thus, Jeffreys prior is \pi(p) \propto \sqrt{I(p)} = [p(1-p)]^{-1/2}.[11]

This prior density corresponds to a \mathrm{Beta}(1/2, 1/2) distribution, also known as the arcsine distribution, with normalized density \frac{1}{\pi \sqrt{p(1-p)}}. The Beta form arises because the kernel p^{1/2 - 1} (1-p)^{1/2 - 1} matches the general \mathrm{Beta}(\alpha, \beta) density for \alpha = \beta = 1/2. Given the conjugacy of the Beta prior with the Bernoulli likelihood (equivalent to a binomial update), the posterior distribution is \mathrm{Beta}(s + 1/2, n - s + 1/2), effectively adding 0.5 pseudo-counts to both successes and failures for regularization.[11]

Compared to the uniform prior \mathrm{Beta}(1,1), which assumes constant density across [0,1], the Jeffreys prior has a U-shaped density with peaks near the boundaries p=0 and p=1. This addresses boundary issues in sparse-data scenarios, where the uniform prior can lead to overly confident inferences (e.g., posterior probability mass concentrating too sharply at extremes after few observations), by assigning higher prior weight to regions of higher parameter uncertainty as measured by the Fisher information. Additionally, it satisfies invariance under reparameterization, ensuring consistency across transformations of p.[11]
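The sparse-data contrast with the uniform prior can be made explicit; the sketch below (illustrative, not from the source; the counts are made up) compares the two posteriors after ten trials with no successes.

```python
# Minimal sketch (illustrative, not from the source): Jeffreys Beta(1/2, 1/2)
# versus uniform Beta(1, 1) posteriors for sparse Bernoulli data.
from scipy import stats

n, s = 10, 0                                  # ten trials, zero successes

jeffreys = stats.beta(s + 0.5, n - s + 0.5)   # posterior under Beta(1/2, 1/2)
uniform = stats.beta(s + 1.0, n - s + 1.0)    # posterior under Beta(1, 1)

print(jeffreys.mean(), uniform.mean())        # 0.5/11 ~ 0.045 vs 1/12 ~ 0.083
print(jeffreys.interval(0.95))
print(uniform.interval(0.95))
```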
Multinomial Probabilities for Biased Die
The multinomial distribution provides a natural model for the probabilities of outcomes when rolling a biased N-sided die, where the parameter vector \vec{\pi} = (\pi_1, \dots, \pi_N) represents the probabilities of landing on each face, satisfying \sum_{j=1}^N \pi_j = 1 and \pi_j > 0 for all j. Given observed counts n_j for each face over n = \sum n_j independent rolls, the likelihood is

p(\vec{n} \mid \vec{\pi}) = \frac{n!}{\prod_{j=1}^N n_j!} \prod_{j=1}^N \pi_j^{n_j}.

This model has a constrained parameter space, the (N-1)-dimensional probability simplex.

The Fisher information matrix I(\vec{\pi}) for the multinomial likelihood, computed with respect to a parameterization of the simplex by N-1 free parameters, has determinant satisfying \det I(\vec{\pi}) \propto n^{N-1} \prod_{j=1}^N \pi_j^{-1}. Treating the coordinates as unconstrained would give diagonal elements I_{jj} = n / \pi_j and off-diagonal elements I_{jk} = 0 for j \neq k; accounting for the constraint \sum_j \pi_j = 1 yields the proportionality above for the determinant on the simplex.

The Jeffreys prior is then derived as \pi(\vec{\pi}) \propto \sqrt{\det I(\vec{\pi})} \propto \prod_{j=1}^N \pi_j^{-1/2}, which corresponds to the density of a Dirichlet distribution with all shape parameters equal to 1/2, denoted \text{Dir}(1/2, \dots, 1/2). Because every shape parameter is positive, this prior is proper, and it yields a proper Dirichlet posterior for any observed counts.

As a generalization of the Beta(1/2, 1/2) prior for the two-category Bernoulli case, the Dirichlet(1/2, \dots, 1/2) prior adds pseudo-counts of 1/2 to each observed n_j, resulting in a posterior distribution \text{Dir}(n_1 + 1/2, \dots, n_N + 1/2). This formulation ensures invariance under relabeling of the die faces, reflecting the symmetry of the multinomial model.
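A small sketch (illustrative, not from the source; the face counts are hypothetical) performs the Dirichlet update for a six-sided die.

```python
# Minimal sketch (illustrative, not from the source): Jeffreys Dirichlet(1/2,...,1/2)
# posterior for the face probabilities of a biased six-sided die.
import numpy as np
from scipy import stats

counts = np.array([12, 7, 9, 15, 5, 12])      # hypothetical observed face counts
alpha_post = counts + 0.5                     # add 1/2 pseudo-count per face

post = stats.dirichlet(alpha_post)
print(post.mean())                            # posterior mean face probabilities
print(post.rvs(3, random_state=0))            # a few posterior draws
```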
Generalizations and Extensions
Probability-Matching Priors
Probability-matching priors are prior distributions chosen so that posterior probabilities of certain sets, typically one-sided credible intervals for a parameter of interest, agree with their frequentist coverage probabilities to a higher order of approximation than holds for an arbitrary prior. This matching gives Bayesian statements an approximate frequentist validity, providing a noninformative foundation for interval estimation and prediction.[27]

In relation to the Jeffreys prior, probability-matching priors coincide with it in regular one-parameter models, where the Jeffreys prior is the unique prior achieving first-order matching. However, they can diverge in multi-parameter scenarios, such as the joint estimation of mean and variance in a normal distribution, where the Jeffreys prior need not satisfy the matching condition for the parameter of interest in the presence of nuisance parameters.[27]

These priors are typically constructed by solving equations derived from asymptotic expansions of posterior quantiles or predictive probabilities, with the solution tailored to the parameterization, the symmetry of the parameter space, and the data generating process.[28]

A key advantage of probability-matching priors lies in their improved finite-sample calibration: credible intervals and Bayesian p-values derived under them align more closely with frequentist coverage properties, enhancing reliability in small-sample Bayesian analyses.[29]
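The coverage-matching idea can be checked by simulation. The sketch below (illustrative, not from the source; the true proportion, sample size, and replication count are assumptions) estimates the frequentist coverage of 95% equal-tailed credible intervals obtained from the Jeffreys Beta(1/2, 1/2) prior for a binomial proportion.

```python
# Minimal sketch (illustrative, not from the source): frequentist coverage of
# 95% equal-tailed Jeffreys credible intervals for a binomial proportion.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
p_true, n, reps = 0.3, 30, 20_000

s = rng.binomial(n, p_true, size=reps)        # simulated success counts
lo = stats.beta.ppf(0.025, s + 0.5, n - s + 0.5)
hi = stats.beta.ppf(0.975, s + 0.5, n - s + 0.5)

coverage = np.mean((lo <= p_true) & (p_true <= hi))
print(coverage)                               # close to the nominal 0.95
```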
Alpha-Parallel Priors
Alpha-parallel priors constitute a family of noninformative prior distributions in Bayesian statistics, introduced within the framework of information geometry by Takeuchi and Amari. These priors are parameterized by a scalar α, such that when α = 0, the prior recovers the standard Jeffreys prior; for other values of α, they generalize Jeffreys' rule by incorporating adjustments derived from α-connections on the statistical manifold. This construction leverages the geometry of the parameter space, where geodesics offset from the original manifold define the prior's form, ensuring invariance under reparameterization while addressing limitations in multi-parameter settings.

The explicit form of an α-parallel prior is given by

\pi_\alpha(\theta) \propto \sqrt{|\det I(\theta)|} \exp(\alpha h(\theta)),

where I(\theta) denotes the Fisher information matrix and h(\theta) is an adjustment function obtained via parallel transport with respect to the α-connection. In statistically equiaffine models, h(\theta) relates to a scalar potential \phi(\theta) satisfying certain differential conditions, such as T_a = \partial_a \phi, which preserves the geometric structure.[30] Existence of the α-parallel prior for α ≠ 0 is not guaranteed and depends on the curvature properties of the model manifold, unlike the Jeffreys prior, which always exists.

Relative to the Jeffreys prior, α-parallel priors offer a continuum of invariant options that facilitate sensitivity analysis in complex models, allowing practitioners to explore how posterior inferences vary with the choice of α. The α-parallel curves inherent in this framework maintain orthogonality with respect to the Fisher-Rao metric, enhancing robustness in scenarios involving nuisance parameters or multiparameter inference.[31] This geometric invariance extends the location-scale properties of Jeffreys priors to higher-order approximations.

In applications, α-parallel priors have been employed to refine marginal posteriors in hierarchical and multiparameter models, often yielding more accurate frequentist coverage compared to the standard Jeffreys prior by mitigating sensitivity to model geometry.[32] For instance, asymptotic expansions of marginal posterior densities under these priors demonstrate improved higher-order accuracy in Bayesian predictions.[32]
Modern Reference Priors
Modern reference priors represent a significant advancement in the construction of noninformative priors for Bayesian inference, particularly in multiparameter models where certain parameters are of primary interest while others are treated as nuisances. Building on earlier information-theoretic ideas from Rissanen in the 1980s, which emphasized minimizing description length to derive objective priors, Berger and Bernardo formalized iterative algorithms in the 1990s to generalize Jeffreys priors for partial inference.[33] These algorithms prioritize parameters of interest by sequentially maximizing the expected Kullback-Leibler (KL) divergence between the prior and posterior distributions, ensuring that the prior conveys minimal information about the parameters of interest while accounting for nuisance parameters.

The construction of a modern reference prior for a parameter of interest \psi in the presence of nuisance parameters \lambda involves an iterative process. First, a conditional prior \pi(\lambda \mid \psi) is derived by maximizing the KL divergence within compact subsets of the parameter space, often leading to a form proportional to the square root of the determinant of the Fisher information matrix conditioned on \psi. This conditional prior is then combined with a marginal prior \pi(\psi) obtained similarly, yielding the joint prior \pi(\psi, \lambda) = \pi(\lambda \mid \psi) \pi(\psi). The iteration proceeds over ordered groups of parameters if there are multiple, ensuring asymptotic optimality in terms of missing information.[34] A key result is that when all parameters are treated as of equal interest (no nuisances distinguished), the reference prior coincides with the Jeffreys prior. For instance, in the normal distribution where the mean \mu is the parameter of interest and the standard deviation \sigma > 0 is a nuisance, the reference prior is \pi(\mu, \sigma) \propto 1/\sigma.

These modern reference priors address limitations of earlier approaches by handling non-regular models, such as those with discontinuities or unbounded parameters, through a rigorous limiting process over increasing compact sets. Berger, Bernardo, and Sun provided a formal general definition in 2009, clarifying the conditions under which the priors exist and are unique.[34] Post-2000 developments have extended the framework to high-dimensional settings, establishing posterior consistency under reference priors for sparse regression and covariance estimation, where the priors adapt to dimensionality while maintaining frequentist validity.[35][36] This ensures robust inference even as the number of parameters grows with sample size.
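As a small illustration of the normal example above (sketch, not from the source; the simulated data and seed are assumptions), under the reference prior \pi(\mu, \sigma) \propto 1/\sigma the marginal posterior of \mu is a scaled Student-t distribution centred at the sample mean, which reproduces the classical t interval.

```python
# Minimal sketch (illustrative, not from the source): marginal posterior of mu
# under the reference prior pi(mu, sigma) ~ 1/sigma for normal data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(10.0, 3.0, size=25)            # simulated data

n, xbar, sd = len(x), x.mean(), x.std(ddof=1)
post_mu = stats.t(df=n - 1, loc=xbar, scale=sd / np.sqrt(n))

print(post_mu.interval(0.95))                 # matches the classical t interval for the mean
```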