Jeffreys prior

In Bayesian statistics, the Jeffreys prior is a non-informative prior distribution designed to reflect a lack of prior knowledge about model parameters while ensuring objectivity and invariance under reparameterization. It is mathematically defined as the density \pi_J(\theta) \propto \sqrt{\det \mathbf{I}(\theta)}, where \mathbf{I}(\theta) is the Fisher information matrix, which measures the amount of information that an observable random variable carries about an unknown parameter \theta. This formulation arises from the expected value of the negative Hessian of the log-likelihood, specifically \mathbf{I}(\theta) = -\mathbb{E}_\theta \left[ \frac{\partial^2}{\partial \theta \, \partial \theta^T} \log p(X \mid \theta) \right], making the prior proportional to the volume element induced by the Fisher information metric on the parameter space.

Proposed by the British mathematician, physicist, and statistician Sir Harold Jeffreys (1891–1989), the prior was introduced as part of the broader framework for objective Bayesian inference developed in his influential book Theory of Probability (first published in 1939). Jeffreys developed it to provide a systematic, transformation-invariant method for selecting priors in scientific applications, countering the subjectivity often associated with earlier Bayesian approaches and promoting the use of inverse probability for inductive reasoning. In the book, he emphasized priors that maximize ignorance while aligning with the geometry of the parameter space, drawing on ideas that anticipated information theory and differential geometry.

A defining property of the Jeffreys prior is its invariance under reparameterization: if \phi = g(\theta) is a smooth one-to-one transformation, then applying the Jacobian factor of the change of variables to \pi_J(\theta) yields exactly \pi_J(\phi) \propto \sqrt{\det \mathbf{I}(\phi)}, ensuring consistent posterior inferences regardless of how parameters are labeled. For univariate cases, it simplifies to \pi_J(\theta) \propto \sqrt{I(\theta)}, where I(\theta) is the scalar Fisher information; common examples include the Beta(0.5, 0.5) distribution for the binomial success probability \theta (arising from I(\theta) = n / [\theta(1-\theta)]), a flat prior on the mean \mu of a normal distribution with known variance, and a prior proportional to 1/\sigma for the standard deviation \sigma. This invariance distinguishes it from simpler non-informative priors like the uniform distribution, which can lead to paradoxes under nonlinear transformations. The Jeffreys prior has been foundational in objective Bayesian methods, facilitating hypothesis testing, parameter estimation, and model comparison across the sciences, often serving as a default choice when substantive prior information is unavailable. However, it is frequently improper (integrating to infinity), which can complicate posterior propriety in low-data regimes, and in multiparameter settings it may concentrate probability in undesirable regions, prompting refinements like reference priors. It also underlies the Jeffreys–Lindley paradox, highlighting tensions between Bayesian and frequentist conclusions for point null hypotheses as sample sizes grow large. Despite these limitations, its role in promoting rigorous, geometry-aware prior selection endures, influencing modern developments in objective Bayesian statistics.

Historical Background and Motivation

Development by Harold Jeffreys

Sir Harold Jeffreys introduced the Jeffreys prior in the second edition of his influential Theory of Probability, published in 1948 (building on his 1946 paper "An Invariant Form for the Prior Probability in Estimation Problems"), with the third edition, released in 1961, further refining and expanding its presentation. The prior emerged as a cornerstone of his effort to establish a rigorous foundation for Bayesian inference, specifically designed to provide an objective and non-informative starting point for parameter estimation and hypothesis testing. Jeffreys' formulation was driven by the need for priors that maintain objectivity while being invariant to changes in parameterization, thereby avoiding the arbitrary outcomes produced by simple uniform priors, which could favor certain scales over others depending on how parameters were defined. This approach built directly on earlier Bayesian traditions, including Pierre-Simon Laplace's principle of insufficient reason from the early nineteenth century, but Jeffreys advanced it by prioritizing invariance as a key criterion for non-informative priors, ensuring consistency across different representations of the same problem. The development occurred against the backdrop of intense methodological debates in the early twentieth century, as frequentist statistics gained prominence through the work of Ronald Fisher and of Jerzy Neyman and Egon Pearson, who emphasized error rates and long-run frequencies in hypothesis testing, often criticizing Bayesian reliance on priors as subjective. Jeffreys countered these critiques by advocating Bayesian methods as more logically coherent for scientific inference, particularly in addressing the limitations of p-values and fixed significance levels; his prior played a central role in this defense, enabling principled assessments of evidence in testing point null hypotheses and estimating parameters.

Motivation for Invariant Non-Informative Priors

Non-informative priors in Bayesian analysis are designed to express a state of ignorance about the parameters, allowing the data to drive the inference without introducing subjective beliefs from the analyst. These priors seek to minimize the influence of prior assumptions, particularly in objective Bayesian approaches where the goal is to base conclusions solely on the observed data. However, simpler choices like uniform priors often fail to achieve this neutrality, as they do not remain consistent under changes in parameterization. For instance, assuming a uniform prior on a parameter θ yields different posterior inferences compared to a uniform prior on log θ, which can alter the effective weighting of the data and lead to contradictory results depending on the chosen scale. The requirement for invariance addresses this fundamental issue by ensuring that the prior distribution produces equivalent posterior inferences regardless of how the parameters are parameterized, thereby resolving scale-dependent ambiguities that arise in scientific modeling. This principle is crucial for maintaining consistency in inductive reasoning, as different parameterizations—such as switching between additive and multiplicative scales—should not affect the logical conclusions drawn from the data. Without invariance, inferences become artifacts of arbitrary choices in model formulation, undermining the reliability of Bayesian methods in empirical applications. Jeffreys emphasized that such priors should reflect a principled notion of ignorance, derived from the intrinsic geometry of the parameter space rather than ad-hoc selections, to avoid these inconsistencies. Early illustrations of these problems appear in location-scale families, where a uniform prior on the scale parameter, such as the standard deviation σ in a normal model, can yield badly behaved posteriors that place essentially all prior mass on arbitrarily large values of σ, or introduce biases in estimation. In these cases, the uniform prior fails to account for the natural metric of the parameter space, leading to unacceptable consequences like infinite ranges or skewed inferences. Jeffreys' approach, building on his earlier work on scientific inference and his 1946 paper on invariant priors, advocates for priors that are invariant under group transformations inherent to the model, providing a systematic way to construct non-informative distributions that preserve the intended objectivity.
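
The scale-dependence of a flat prior can be made concrete with a small numerical sketch (not from the original text): for binomial data, a uniform prior on the success probability p and a uniform prior on its log-odds correspond to different Beta posteriors, so the two "non-informative" choices yield different inferences. The counts below are illustrative.

```python
# Minimal sketch: a "flat" prior is not parameterization-free.
# For s successes in n Bernoulli trials, a uniform prior on p gives a
# Beta(s+1, n-s+1) posterior, while a uniform prior on the log-odds
# log(p/(1-p)) corresponds to pi(p) proportional to 1/(p(1-p)) on p
# and gives a Beta(s, n-s) posterior.
from scipy import stats

n, s = 10, 3  # illustrative counts

post_flat_p = stats.beta(s + 1, n - s + 1)        # uniform prior on p
post_flat_logodds = stats.beta(s, n - s)          # uniform prior on the log-odds

print("posterior mean of p, flat on p:       ", post_flat_p.mean())        # 4/12 = 0.333
print("posterior mean of p, flat on log-odds:", post_flat_logodds.mean())  # 3/10 = 0.300
# The two "non-informative" choices give different inferences, illustrating
# the scale-dependence that Jeffreys' invariance criterion removes.
```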

Definition and Derivation

Fisher Information Matrix

The Fisher information serves as a fundamental measure in statistical theory, quantifying the amount of information that an observable random variable provides about an unknown parameter through the likelihood function. For a single scalar parameter \theta and likelihood f(x|\theta), the Fisher information I(\theta) is defined as the expected value of the negative second derivative of the log-likelihood: I(\theta) = -\mathbb{E}\left[ \frac{\partial^2 \log f(x|\theta)}{\partial \theta^2} \bigg| \theta \right], which is equivalent to the variance of the score function, the first derivative of the log-likelihood: I(\theta) = \mathrm{Var}\left[ \frac{\partial \log f(x|\theta)}{\partial \theta} \bigg| \theta \right]. This equivalence holds under regularity conditions ensuring the interchangeability of differentiation and expectation. In the multiparameter case, where \theta = (\theta_1, \dots, \theta_k)^\top is a k-dimensional parameter vector, the Fisher information is represented by a k \times k symmetric matrix \mathbf{I}(\theta) with elements given by I_{ij}(\theta) = \mathbb{E}\left[ \frac{\partial \log f(x|\theta)}{\partial \theta_i} \frac{\partial \log f(x|\theta)}{\partial \theta_j} \bigg| \theta \right] = -\mathbb{E}\left[ \frac{\partial^2 \log f(x|\theta)}{\partial \theta_i \partial \theta_j} \bigg| \theta \right]. The matrix \mathbf{I}(\theta) is positive semi-definite, reflecting its role as a measure of the data's sensitivity to changes in \theta; higher values indicate greater precision in estimating the parameter from the data. In frequentist settings, the inverse of the Fisher information matrix provides the asymptotic covariance matrix of the maximum likelihood estimator, such that \sqrt{n} (\hat{\theta}_n - \theta) \xrightarrow{d} \mathcal{N}(0, \mathbf{I}(\theta)^{-1}) as the sample size n grows, establishing the efficiency bound for estimators. Within Bayesian analysis, the matrix captures the local curvature of the log-likelihood surface, enabling second-order approximations to the posterior distribution and informing the construction of non-informative priors that respect the model's structure. This interpretation underscores its utility in assessing parameter identifiability and sensitivity, where regions of low information correspond to flat likelihoods and higher posterior uncertainty.
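
As an illustration, the following sketch estimates the Fisher information of a Bernoulli model in the two equivalent ways described above (variance of the score and expected negative second derivative) and compares them with the analytic value 1/(p(1-p)); the simulation settings are arbitrary.

```python
# Sketch: estimating Fisher information two equivalent ways for Bernoulli(p).
# Analytically, I(p) = Var[score] = -E[d^2/dp^2 log f] = 1/(p(1-p)).
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
x = rng.binomial(1, p, size=200_000)              # simulated observations

score = x / p - (1 - x) / (1 - p)                 # d/dp log f(x|p)
neg_hessian = x / p**2 + (1 - x) / (1 - p)**2     # -d^2/dp^2 log f(x|p)

print("Var[score]          :", score.var())
print("E[-d2 log f / dp2]  :", neg_hessian.mean())
print("analytic 1/(p(1-p)) :", 1 / (p * (1 - p)))
```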

One-Parameter Case

In the one-parameter case, the Jeffreys prior for a scalar parameter \theta is defined as \pi(\theta) \propto \sqrt{I(\theta)}, where I(\theta) denotes the scalar Fisher information evaluated at \theta. This construction arises from the requirement that the prior should be non-informative in a way that respects the intrinsic geometry of the statistical model, leading to a form that scales with the local sensitivity of the likelihood to changes in \theta. The key motivation for this form is its invariance under one-to-one reparameterizations of \theta. Suppose \phi = \phi(\theta) is a differentiable one-to-one transformation with inverse \theta = \theta(\phi) and Jacobian derivative d\theta/d\phi \neq 0. The Fisher information transforms according to I'(\phi) = I(\theta) \left( \frac{d\theta}{d\phi} \right)^2, where I'(\phi) is the Fisher information in the \phi-parameterization. Under a change of variables, the prior density transforms as \pi'(\phi) = \pi(\theta(\phi)) \left| d\theta/d\phi \right|. Substituting the Jeffreys form \pi(\theta) \propto \sqrt{I(\theta)} yields \pi'(\phi) \propto \sqrt{I(\theta)} \left| \frac{d\theta}{d\phi} \right| = \sqrt{ I(\theta) \left( \frac{d\theta}{d\phi} \right)^2 } = \sqrt{I'(\phi)}, demonstrating that \pi'(\phi) \propto \sqrt{I'(\phi)}, the Jeffreys prior in the new parameterization. This property holds for any smooth one-to-one reparameterization, ensuring the prior is invariant under the group of diffeomorphisms on the parameter space. The Jeffreys prior in the one-parameter case is often improper, meaning \int \pi(\theta) \, d\theta = \infty, as in the uniform prior it assigns to location parameters. However, under standard regularity conditions—such as the existence of the Fisher information, twice differentiability of the log-likelihood, and integrability of the likelihood—the resulting posterior distribution is typically proper (normalizable) when combined with observed data. Up to multiplication by a constant, this is the unique prior constructed from the Fisher information whose form is preserved under all one-to-one reparameterizations, as the invariance condition determines the functional form proportional to \sqrt{I(\theta)}.
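
A quick numerical check of this invariance, using the Bernoulli model with the log-odds transformation as an assumed example (values are illustrative):

```python
# Sketch verifying one-parameter invariance for the Bernoulli model with the
# log-odds reparameterization phi = log(p/(1-p)); densities are up to a constant.
import numpy as np

p = np.linspace(0.05, 0.95, 10)
I_p = 1.0 / (p * (1 - p))                   # Fisher information in p
jeffreys_p = np.sqrt(I_p)                   # pi(p) proportional to sqrt(I(p))

dp_dphi = p * (1 - p)                       # d theta / d phi for phi = logit(p)
I_phi = I_p * dp_dphi**2                    # transformed Fisher information
jeffreys_phi_direct = np.sqrt(I_phi)        # Jeffreys rule applied directly in phi
jeffreys_phi_transformed = jeffreys_p * dp_dphi   # change of variables of pi(p)

# The two constructions agree pointwise, demonstrating invariance.
assert np.allclose(jeffreys_phi_direct, jeffreys_phi_transformed)
print(jeffreys_phi_direct[:3], jeffreys_phi_transformed[:3])
```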

Multi-Parameter Case

In the multi-parameter case, the Jeffreys prior extends the scalar formulation by incorporating the full Fisher information matrix I(\theta), where \theta = (\theta_1, \dots, \theta_k)^\top is a k-dimensional vector. The prior density is given by \pi(\theta) \propto \sqrt{|\det I(\theta)|}, with the matrix defined as I(\theta)_{ij} = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta_i \partial\theta_j} \log f(X|\theta)\right], where f(X|\theta) is the likelihood. The square root of the determinant arises because it measures the volume distortion in the parameter space under reparameterization, ensuring the prior reflects the geometry of the information provided by the data. This form generalizes the one-parameter case, where \sqrt{|\det I(\theta)|} = \sqrt{I(\theta)}, and was originally proposed by Jeffreys to achieve invariance in joint inference over multiple parameters. The prior's invariance under reparameterization is demonstrated through the transformation properties of the information matrix. Consider a differentiable reparameterization \phi = \phi(\theta) with Jacobian matrix J = \partial\phi / \partial\theta. The transformed matrix I'(\phi) satisfies \det I'(\phi) = \frac{\det I(\theta)}{|\det J|^2}. Consequently, the transformed prior density \pi'(\phi) includes the Jacobian factor from the change of variables, yielding \pi'(\phi) \propto \sqrt{|\det I'(\phi)|} = \frac{\sqrt{|\det I(\theta)|}}{|\det J|}, which, when multiplied by |\det J|, recovers \pi(\theta), preserving the prior measure across parameterizations. This property ensures that inferences remain consistent regardless of the chosen parameterization. Despite these advantages, the multi-parameter Jeffreys prior faces challenges, particularly in its suitability for inference on subsets of parameters. It may not yield satisfactory marginal posteriors for individual parameters or groups, as the joint structure can lead to inconsistencies, such as biased or inconsistent estimators in problems like the Neyman–Scott problem of many nuisance means. These issues arise because the prior optimizes joint information but neglects parameter ordering or conditional structure, prompting generalizations like reference priors for targeted marginalization. Computationally, evaluating the prior requires calculating the determinant of I(\theta), typically by estimating the expected Hessian of the log-likelihood, \mathbf{H}(\theta) = -\partial^2 \log f(X|\theta) / \partial\theta \partial\theta^\top, and taking I(\theta) = \mathbb{E}[\mathbf{H}(\theta)]. For complex models, this often involves numerical integration or Monte Carlo methods to approximate the expectation, especially when analytical forms are unavailable.
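
For a concrete multi-parameter case, the sketch below approximates the expected Hessian of a single-observation Normal(\mu, \sigma) log-likelihood by Monte Carlo and recovers \sqrt{\det I(\theta)} \propto 1/\sigma^2; the derivative expressions and numerical settings are worked out here for illustration and are not from the original text.

```python
# Sketch: joint Jeffreys prior for Normal(mu, sigma) from the expected Hessian,
# approximated by Monte Carlo; the analytic result is I = diag(1/sigma^2, 2/sigma^2),
# so pi(mu, sigma) is proportional to sqrt(det I) = sqrt(2)/sigma^2.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, size=500_000)

# Second derivatives of log f(x | mu, sigma) for a single observation.
d2_mumu = -np.full_like(x, 1.0 / sigma**2)
d2_musig = -2.0 * (x - mu) / sigma**3
d2_sigsig = 1.0 / sigma**2 - 3.0 * (x - mu)**2 / sigma**4

I_hat = -np.array([[d2_mumu.mean(), d2_musig.mean()],
                   [d2_musig.mean(), d2_sigsig.mean()]])   # expected information

print("estimated I:\n", I_hat)
print("sqrt(det I):", np.sqrt(np.linalg.det(I_hat)),
      " vs analytic sqrt(2)/sigma^2 =", np.sqrt(2) / sigma**2)
```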

Properties

Invariance under Reparameterization

One key property of the Jeffreys prior is its invariance under reparameterization, meaning that if the parameter space is transformed via a smooth one-to-one mapping, the prior adjusts by the appropriate Jacobian factor to maintain the same form in the new coordinates. This ensures that the prior transforms as a proper density under such transformations, unlike a uniform prior, which generally does not transform invariantly and can lead to inconsistent inferences depending on the chosen parameterization. To see this formally, consider a one-parameter model with parameter \theta and Jeffreys prior \pi(\theta) \propto \sqrt{I(\theta)}, where I(\theta) is the Fisher information. Now reparameterize to \phi = h(\theta), where h is a diffeomorphism with inverse \theta = h^{-1}(\phi) and Jacobian determinant |J| = \left| \frac{d\theta}{d\phi} \right|. The Fisher information transforms as I(\phi) = I(\theta) \left( \frac{d\theta}{d\phi} \right)^2, since the expected value of the squared score function scales by the chain rule. The Jeffreys prior in the new parameterization is then \pi(\phi) \propto \sqrt{I(\phi)} = \sqrt{I(\theta)} \left| \frac{d\theta}{d\phi} \right|. Transforming the original prior to the new scale gives \pi(\phi) = \pi(\theta) \left| \frac{d\theta}{d\phi} \right| \propto \sqrt{I(\theta)} \left| \frac{d\theta}{d\phi} \right|, which matches exactly. Thus, the prior is equivariant under reparameterization, preserving its structure. In the multi-parameter case, the Jeffreys prior \pi(\boldsymbol{\theta}) \propto \sqrt{\det \mathbf{I}(\boldsymbol{\theta})} behaves analogously, with the transformation law for the Fisher information matrix \mathbf{I}(\phi) = \mathbf{J}^T \mathbf{I}(\theta) \mathbf{J}, where \mathbf{J} is the Jacobian matrix of the map \theta(\phi), leading to \det \mathbf{I}(\phi) = \det \mathbf{I}(\theta) \cdot (\det \mathbf{J})^2. The prior then transforms as \pi(\phi) = \pi(\theta) |\det \mathbf{J}|, confirming global invariance. This property implies that posterior inferences, such as credible intervals or posterior means, remain consistent across parameterizations, promoting scientific objectivity by avoiding artifacts from arbitrary choices of coordinates. This contrasts with group-invariant priors, which ensure invariance only for specific transformation groups (e.g., location-scale) but may fail more broadly. Historically, this invariance addressed long-standing criticisms of earlier non-informative priors based on the principle of indifference, which were limited to particular transformations and lacked general applicability across diffeomorphisms. Jeffreys introduced the Fisher-information-based form in his 1946 paper to provide a systematic, fully invariant default for estimation problems.
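
The multi-parameter transformation law can be checked directly; the sketch below uses the analytic Normal(\mu, \sigma) information matrix and the reparameterization \tau = \log \sigma as an assumed example.

```python
# Sketch checking the multi-parameter transformation law for (mu, sigma) -> (mu, tau)
# with tau = log(sigma), using the analytic Normal information I(theta) = diag(1/s^2, 2/s^2).
import numpy as np

sigma = 2.0
I_theta = np.diag([1.0 / sigma**2, 2.0 / sigma**2])   # Fisher information in (mu, sigma)

# Jacobian J = d theta / d phi for phi = (mu, tau), theta = (mu, exp(tau)).
J = np.diag([1.0, sigma])                             # d sigma / d tau = exp(tau) = sigma

I_phi = J.T @ I_theta @ J                             # information in the new coordinates
assert np.isclose(np.linalg.det(I_phi),
                  np.linalg.det(I_theta) * np.linalg.det(J)**2)

# Prior densities match after the change of variables: pi(phi) = pi(theta) * |det J|.
pi_theta = np.sqrt(np.linalg.det(I_theta))
pi_phi_direct = np.sqrt(np.linalg.det(I_phi))
assert np.isclose(pi_phi_direct, pi_theta * abs(np.linalg.det(J)))
print(pi_theta, pi_phi_direct)
```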

Uniqueness and Other Attributes

The Jeffreys prior stands out among non-informative priors because it retains its functional form under reparameterization; in statistical models whose parameter space admits a natural group structure, it is distinguished in this respect among relatively invariant priors. This property arises from its construction as the density proportional to the square root of the determinant of the Fisher information matrix, ensuring consistency across different coordinate systems without introducing extraneous structure. In the broader context of information geometry, this uniqueness extends to the prior inducing the only volume measure (up to scale) compatible with the transformations acting on the parameter manifold, as the underlying Fisher–Rao metric is itself, up to scale, the unique smooth Riemannian metric invariant under such transformations on spaces of probability densities. A key attribute contributing to the perceived objectivity of the Jeffreys prior is its derivation exclusively from the intrinsic geometry of the sampling model, via the Fisher information matrix, which defines a natural Riemannian metric on the parameter space. This approach encodes "ignorance" about the parameter by scaling the prior density according to the model's sensitivity to parameter changes, thereby avoiding subjective choices tied to specific units or parameterizations that could bias inference. Unlike uniform priors, which may inadvertently favor certain scales, the Jeffreys prior aligns with the statistical manifold's geometry, promoting inferences that are robust to such arbitrary decisions and reflecting a form of model-based neutrality. The Jeffreys prior is frequently improper, with an infinite integral over the parameter space, particularly when parameters range over unbounded domains like the positive reals or the entire real line; however, it typically produces proper posterior distributions when combined with data from standard likelihoods satisfying regularity conditions, such as those ensuring the likelihood integrates to a finite value. Propriety of the prior itself occurs in cases where the parameter space is compact, confining the support to a bounded region and yielding a finite normalizing constant. Despite its impropriety, this prior does not compromise validity in asymptotic regimes. Asymptotically, the posterior under the Jeffreys prior exhibits desirable concentration properties: it localizes around the maximum likelihood estimate at the parametric \sqrt{n} rate, where n is the sample size, under mild regularity assumptions on the model. Furthermore, the Bernstein–von Mises theorem guarantees that this posterior converges in total variation to a normal distribution centered at the maximum likelihood estimate with covariance matrix equal to the inverse of the Fisher information matrix (scaled by the sample size), facilitating frequentist-like statements from Bayesian procedures even with this non-informative prior. These attributes underscore the prior's utility in large-sample settings, where it bridges Bayesian and classical paradigms.

Information-Theoretic Connections

Minimum Description Length Principle

The Minimum Description Length (MDL) principle, developed by Jorma Rissanen, advocates selecting models or priors that minimize the total expected code length required to encode both the observed data and the model parameters, thereby capturing regularities in the data through optimal compression. In this framework, the "description length" measures the cost of specifying the data and the parameters, where priors play a crucial role in encoding the parameters efficiently given the data's information content. Within MDL, the Jeffreys prior emerges as the optimal choice for minimizing the asymptotic expected code length in regular parametric models. This prior, proportional to the square root of the determinant of the Fisher information matrix, quantifies the local "volume" of the parameter space needed to encode the parameter \theta after observing the data, since the inverse of the information I(\theta) sets the local scale of estimation uncertainty. Specifically, classical results show that the coding redundancy grows as \frac{k}{2} \log n, up to additive constants, for k-dimensional parameters and large sample size n, and the mixture code based on the Jeffreys prior attains this rate by aligning the prior density with the geometric structure induced by the information matrix. The explicit link to the Jeffreys prior formula, \pi(\theta) \propto \sqrt{|\det I(\theta)|}, arises through the asymptotic expansion of the normalized maximum likelihood (NML) code in MDL, whose parametric complexity behaves as \frac{k}{2} \log \frac{n}{2\pi} + \log \int \sqrt{\det I(\theta)} \, d\theta + o(1), so that approximating the NML distribution yields this invariant prior density. This ensures that the total description length, comprising the negative log-likelihood plus a penalty term for model complexity, is asymptotically optimal for universal coding. This MDL perspective offers a frequentist, information-theoretic justification for the Jeffreys prior, demonstrating its role in achieving the lower bounds on coding redundancy and thus bridging Bayesian and coding-theoretic viewpoints by emphasizing compression efficiency under parametric uncertainty.
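
As a hedged illustration of this link, the sketch below compares the exact Bernoulli NML complexity \log C_n with the Jeffreys-based asymptotic approximation \frac{1}{2}\log\frac{n}{2\pi} + \log\pi, where \pi = \int_0^1 \sqrt{I(p)}\,dp is the normalizer of the Jeffreys prior; the computation is standard, but the specific script is a minimal sketch, not from the original text.

```python
# Sketch: for the Bernoulli model, the exact NML (stochastic complexity) penalty
# log C_n is well approximated by (k/2) log(n / (2*pi)) + log(integral of sqrt(I(p))),
# where k = 1 and the integral of sqrt(1/(p(1-p))) over (0, 1) equals pi,
# the normalizing constant of the Jeffreys prior.
import math

def log_nml_complexity(n):
    # C_n = sum over s of binom(n, s) * (s/n)^s * ((n-s)/n)^(n-s), with 0^0 := 1.
    total = 0.0
    for s in range(n + 1):
        log_term = math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
        if 0 < s < n:
            log_term += s * math.log(s / n) + (n - s) * math.log((n - s) / n)
        total += math.exp(log_term)
    return math.log(total)

for n in (10, 100, 1000):
    exact = log_nml_complexity(n)
    approx = 0.5 * math.log(n / (2 * math.pi)) + math.log(math.pi)
    print(n, round(exact, 4), round(approx, 4))
```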

Relation to Reference Priors

Reference priors, introduced by José M. Bernardo in 1979, represent a class of noninformative priors designed to maximize the expected Kullback–Leibler divergence between the posterior and the prior, thereby optimizing the information gained from the data in an asymptotic sense. This criterion leads to priors that are particularly suitable for objective inference, emphasizing the parameter of interest while accounting for experimental design and sampling properties. In the single-parameter case, the reference prior coincides exactly with the Jeffreys prior under standard regularity conditions, confirming its role as a foundational special case for full-parameter inference. However, the key distinction arises in multiparameter models, where the Jeffreys prior—derived from the square root of the determinant of the Fisher information matrix—often fails to provide noninformative marginal posteriors for subsets of parameters, such as when nuisance parameters are present. Reference priors address this by employing an iterative procedure that groups parameters hierarchically, conditioning on higher-priority groups to derive a prior that is asymptotically optimal for the parameters of interest. Historically, the reference prior framework builds directly on Jeffreys' work from the 1960s, particularly his revisions to multiparameter priors in the third edition of Theory of Probability (1961), which highlighted issues like dependence on parameterization for marginal inferences but did not fully resolve them. Bernardo's approach, further refined by James O. Berger and collaborators in subsequent decades, overcomes these shortcomings by prioritizing inference for specific parameters through sequential maximization of missing information, ensuring greater robustness in complex models. Under regularity conditions, such as continuous parameter spaces and well-behaved likelihoods, the reference prior asymptotically approximates the Jeffreys prior when all parameters are of equal interest, underscoring their shared invariance properties while extending applicability to targeted inference scenarios.

Examples

Gaussian Mean Parameter

Consider the model where independent observations X_i \sim \mathcal{N}(\mu, \sigma^2) for i = 1, \dots, n, with the variance \sigma^2 known and the mean \mu the parameter of interest. The likelihood is given by f(\mathbf{x} \mid \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right), which simplifies to a form proportional to \exp\left( -\frac{n}{2\sigma^2} (\bar{x} - \mu)^2 \right). The Fisher information for the parameter \mu based on n observations is I(\mu) = \frac{n}{\sigma^2}, a constant that does not depend on \mu. According to Jeffreys' rule, the prior is proportional to the square root of the Fisher information, yielding \pi(\mu) \propto \sqrt{I(\mu)} \propto 1. This results in a flat improper prior over the entire real line, reflecting complete prior ignorance about the location of \mu. This prior is non-informative for the location parameter \mu, ensuring that inferences remain invariant under reparameterization. The resulting posterior is \mu \mid \mathbf{x} \sim \mathcal{N}\left( \bar{x}, \frac{\sigma^2}{n} \right), which is proper and independent of the arbitrary constant in the prior. In practice, this posterior leads to standard Bayesian credible intervals, such as the 100(1-\alpha)\% interval \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, which coincide exactly with the corresponding frequentist confidence intervals in this model.
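
A minimal sketch of this example (simulated data, arbitrary settings), showing that the 95% credible interval from the flat-prior posterior matches the classical z-interval:

```python
# Sketch: posterior for the Normal mean under the flat Jeffreys prior pi(mu) = const,
# with known sigma; the 95% credible interval equals the classical z-interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sigma, mu_true, n = 1.5, 3.0, 40
x = rng.normal(mu_true, sigma, size=n)

xbar = x.mean()
posterior = stats.norm(loc=xbar, scale=sigma / np.sqrt(n))   # mu | x ~ N(xbar, sigma^2/n)

lo, hi = posterior.ppf([0.025, 0.975])
z = stats.norm.ppf(0.975)
print("credible interval :", (lo, hi))
print("frequentist z-int :", (xbar - z * sigma / np.sqrt(n), xbar + z * sigma / np.sqrt(n)))
```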

Gaussian Variance Parameter

Consider a sample of independent observations X_1, \dots, X_n drawn from a normal distribution \mathcal{N}(\mu, \sigma^2), where the mean \mu is known and the variance \sigma^2 > 0 is the unknown parameter of interest. The log-likelihood function for this model is \ell(\sigma^2 \mid \mathbf{X}) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (X_i - \mu)^2. Differentiating twice with respect to \sigma^2 yields the observed information, and taking the expectation gives the Fisher information (a scalar here) I(\sigma^2) = \frac{n}{2 (\sigma^2)^2}. The Jeffreys prior is then proportional to the square root of this information: \pi(\sigma^2) \propto \sqrt{I(\sigma^2)} \propto \frac{1}{\sigma^2}. This improper prior arises directly from the invariance principle underlying Jeffreys' rule. For convenience in some analyses, reparameterize to the precision \tau = 1/\sigma^2 > 0. Transforming the prior with the Jacobian |d\sigma^2 / d\tau| = 1/\tau^2 yields \pi(\tau) \propto 1/\tau. Equivalently, the Fisher information in terms of \tau is I(\tau) = n/(2 \tau^2), so \pi(\tau) \propto \sqrt{I(\tau)} \propto 1/\tau. This prior is scale-invariant: if the data and model are rescaled by a constant k > 0 (so the new variance is k^2 \sigma^2), the prior transforms appropriately to maintain the same form, ensuring consistency under changes of units. In contrast, a uniform prior on \sigma^2 (i.e., \pi(\sigma^2) \propto 1) lacks this invariance under nonlinear reparameterizations such as \sigma^2 \mapsto \sigma or \sigma^2 \mapsto \log \sigma^2, where it ceases to be uniform and implicitly favors particular scales. When combined with the normal likelihood, the posterior for \sigma^2 under this prior is inverse-gamma distributed: \sigma^2 \mid \mathbf{X} \sim \text{IG}(n/2, S/2), where S = \sum_{i=1}^n (X_i - \mu)^2. This posterior is proper for any n \geq 1, providing a well-defined inference despite the impropriety of the prior itself.
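
A short sketch of the resulting inverse-gamma posterior, using scipy's parameterization (shape a = n/2, scale = S/2) and simulated data with arbitrary settings:

```python
# Sketch: posterior for sigma^2 under the Jeffreys prior pi(sigma^2) ~ 1/sigma^2
# with known mean; the posterior is Inverse-Gamma(n/2, S/2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, sigma_true, n = 0.0, 2.0, 30
x = rng.normal(mu, sigma_true, size=n)

S = np.sum((x - mu)**2)
posterior = stats.invgamma(a=n / 2, scale=S / 2)    # sigma^2 | x ~ IG(n/2, S/2)

print("posterior mean of sigma^2:", posterior.mean())   # equals S / (n - 2) for n > 2
print("95% credible interval    :", posterior.interval(0.95))
print("true sigma^2             :", sigma_true**2)
```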

Poisson Rate Parameter

Consider the Poisson model for independent observations X_1, \dots, X_n \sim \mathrm{Poisson}(\lambda), where \lambda > 0 is the unknown rate parameter. The likelihood is given by L(\lambda \mid \mathbf{x}) = \prod_{i=1}^n \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} \propto \lambda^{\sum x_i} e^{-n \lambda}, which arises from the probability mass function of the Poisson distribution. The Fisher information for a single observation X \sim \mathrm{Poisson}(\lambda) is derived from the curvature of the log-likelihood: the score function is (X - \lambda)/\lambda, and the negative expected second derivative yields I(\lambda) = 1/\lambda. For n independent observations, the total information scales to I_n(\lambda) = n / \lambda. The Jeffreys prior is then proportional to the square root of the information, \pi(\lambda) \propto \sqrt{I_n(\lambda)} \propto \lambda^{-1/2}, which is independent of n up to a constant factor. This improper prior corresponds to a Gamma distribution with shape parameter 1/2 and rate parameter approaching 0, \pi(\lambda) \propto \lambda^{1/2 - 1} e^{-0 \cdot \lambda}. Combining this with the likelihood yields a posterior that is Gamma with updated shape \sum x_i + 1/2 and rate n, so \pi(\lambda \mid \mathbf{x}) \sim \text{Gamma}\left( \sum x_i + \frac{1}{2}, n \right). The posterior mean is \left( \sum x_i + 1/2 \right) / n, which introduces a slight upward shift from the maximum likelihood estimate \sum x_i / n. To illustrate invariance under reparameterization, consider the transformation \eta = \log \lambda, so \lambda = e^\eta. The Fisher information transforms as I(\eta) = I(\lambda) \left( \frac{d\lambda}{d\eta} \right)^2 = (1/\lambda) e^{2\eta} = e^\eta. Thus, the Jeffreys prior in \eta is \pi(\eta) \propto \sqrt{e^\eta} = e^{\eta/2}. Transforming the original prior gives \pi(\eta) \propto \lambda^{-1/2} \frac{d\lambda}{d\eta} = e^{-\eta/2} e^\eta = e^{\eta/2}, confirming consistency. In contrast, a uniform prior on \lambda transforms to \pi(\eta) \propto e^\eta, which is not uniform on \eta.
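
A brief sketch of the Gamma posterior under the Jeffreys prior for simulated Poisson data (settings are illustrative; scipy's gamma uses a scale parameter equal to one over the rate):

```python
# Sketch: Poisson rate under the Jeffreys prior pi(lambda) ~ lambda^(-1/2);
# the posterior is Gamma(sum(x) + 1/2, rate = n).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
lam_true, n = 3.5, 25
x = rng.poisson(lam_true, size=n)

shape = x.sum() + 0.5
posterior = stats.gamma(a=shape, scale=1.0 / n)       # lambda | x ~ Gamma(shape, rate=n)

print("posterior mean:", posterior.mean(), " MLE:", x.mean())
print("95% credible interval:", posterior.interval(0.95))
```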

Bernoulli Probability Parameter

In the Bernoulli model, each observation X_i (for i = 1, \dots, n) is independently distributed as \mathrm{Bernoulli}(p), where p \in (0,1) is the success probability. The likelihood is L(p \mid \mathbf{x}) = \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i} = p^s (1-p)^{n-s}, with s = \sum_{i=1}^n x_i denoting the observed number of successes. To derive the Jeffreys prior, compute the Fisher information based on a single observation, as the prior is constructed independently of sample size. The log-likelihood for one observation is \ell(p \mid x) = x \log p + (1-x) \log(1-p). The score function is \frac{\partial \ell}{\partial p} = \frac{x}{p} - \frac{1-x}{1-p}, and the negative second derivative is -\frac{\partial^2 \ell}{\partial p^2} = \frac{x}{p^2} + \frac{1-x}{(1-p)^2}. The expected value is I(p) = \mathbb{E}\left[-\frac{\partial^2 \ell}{\partial p^2}\right] = \frac{1}{p(1-p)}. Thus, the Jeffreys prior is \pi(p) \propto \sqrt{I(p)} = [p(1-p)]^{-1/2}. This prior density corresponds to a \mathrm{Beta}(1/2, 1/2) distribution, also known as the arcsine distribution, with density \frac{1}{\pi \sqrt{p(1-p)}}. The Beta form arises because the kernel p^{1/2 - 1} (1-p)^{1/2 - 1} matches the general Beta(\alpha, \beta) density for \alpha = \beta = 1/2. Given the conjugacy of the Beta prior with the likelihood (equivalent to a binomial update), the posterior distribution is \mathrm{Beta}(s + 1/2, n - s + 1/2), effectively adding 0.5 pseudo-counts to both successes and failures for regularization. Compared to the uniform prior \mathrm{Beta}(1,1), which assumes constant density across [0,1], the Jeffreys prior exhibits a U-shaped density with peaks near the boundaries p=0 and p=1. This addresses boundary issues in sparse data scenarios, where the uniform prior can lead to overly confident inferences (e.g., posterior probability mass concentrating too sharply at extremes after few observations), by assigning higher prior weight to regions of higher parameter uncertainty as measured by the Fisher information. Additionally, it satisfies invariance under reparameterization, ensuring consistency across transformations of p.
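
A small sketch comparing the Jeffreys Beta(1/2, 1/2) and uniform Beta(1, 1) posteriors for an illustrative sparse sample (the counts are hypothetical):

```python
# Sketch: Bernoulli success probability under the Jeffreys Beta(1/2, 1/2) prior,
# compared with the uniform Beta(1, 1) prior, for a small sparse sample.
from scipy import stats

n, s = 5, 0                                    # five trials, zero successes (hypothetical)

post_jeffreys = stats.beta(s + 0.5, n - s + 0.5)
post_uniform = stats.beta(s + 1.0, n - s + 1.0)

print("Jeffreys posterior mean :", post_jeffreys.mean())   # 0.5 / 6
print("Uniform posterior mean  :", post_uniform.mean())    # 1 / 7
print("Jeffreys 95% upper bound:", post_jeffreys.ppf(0.95))
```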

Multinomial Probabilities for Biased Die

The multinomial distribution provides a natural model for the probabilities of outcomes when rolling a biased N-sided die, where the parameter vector \vec{\pi} = (\pi_1, \dots, \pi_N) represents the probabilities of landing on each face, satisfying \sum_{j=1}^N \pi_j = 1 and \pi_j > 0 for all j. Given observed counts n_j for each face over n = \sum n_j independent rolls, the likelihood is p(\vec{n} \mid \vec{\pi}) = \frac{n!}{\prod_{j=1}^N n_j!} \prod_{j=1}^N \pi_j^{n_j}. This model captures the constrained parameter space of the (N-1)-dimensional probability simplex. The Fisher information matrix I(\vec{\pi}) for the multinomial likelihood, computed with respect to a suitable parameterization of the simplex (e.g., using N-1 free parameters), is such that its determinant satisfies \det I(\vec{\pi}) \propto n^{N-1} \prod_{j=1}^N \pi_j^{-1}. In the unconstrained coordinates the diagonal elements are I_{jj} = n / \pi_j with off-diagonal elements I_{jk} = 0 for j \neq k, but the constraint \sum \pi_j = 1 requires adjustment, leading to the effective proportionality of the determinant after restriction to the simplex. The Jeffreys prior is then derived as \pi(\vec{\pi}) \propto \sqrt{\det I(\vec{\pi})} \propto \prod_{j=1}^N \pi_j^{-1/2}, which corresponds to the density of a Dirichlet distribution with all shape parameters equal to 1/2, denoted \text{Dir}(1/2, \dots, 1/2). This prior is proper on the simplex, and combining it with the multinomial likelihood yields a proper Dirichlet posterior for any observed counts. As a generalization of the Beta(1/2, 1/2) prior for the two-category case, the Dirichlet(1/2, \dots, 1/2) adds pseudo-counts of 1/2 to each observed n_j, resulting in a posterior distribution \text{Dir}(n_1 + 1/2, \dots, n_N + 1/2). This formulation ensures invariance under relabeling of the die faces, reflecting the symmetry of the multinomial model.
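
A short sketch of the Dirichlet posterior for hypothetical die-roll counts (the counts below are made up for illustration):

```python
# Sketch: posterior for biased-die face probabilities under the Jeffreys
# Dirichlet(1/2, ..., 1/2) prior, given observed counts for N = 6 faces.
import numpy as np
from scipy import stats

counts = np.array([12, 8, 15, 9, 6, 10])                # hypothetical roll counts
alpha_post = counts + 0.5                                # add 1/2 pseudo-count per face

posterior = stats.dirichlet(alpha_post)
print("posterior mean probabilities:", posterior.mean())
print("one posterior draw          :", posterior.rvs(random_state=0)[0])
```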

Generalizations and Extensions

Probability-Matching Priors

Probability-matching priors are prior distributions designed so that Bayesian probability statements derived from them also carry approximate frequentist validity, with posterior (or predictive) probabilities of suitably chosen sets matching the corresponding sampling probabilities under the model. This property ensures that Bayesian inference reflects a form of calibration against the model's sampling behaviour, providing a noninformative default for objective analysis. In relation to the Jeffreys prior, probability-matching priors coincide with it for one-parameter models under regularity conditions, where both emphasize invariance under reparameterization. However, they diverge in multi-parameter scenarios, such as the joint estimation of mean and variance in a normal model, where the Jeffreys prior may not satisfy the matching condition because it targets joint information rather than the marginal behaviour of the quantity of interest. These priors are often constructed by deriving the functional form of \pi(\theta) that satisfies the relevant matching condition, which typically reduces to solving differential or integral equations tailored to the structure of the model and the parameter of interest. A key advantage of probability-matching priors lies in their improved finite-sample performance for calibrating Bayesian p-values and interval estimates, as the matching ensures that posterior probabilities align more closely with frequentist coverage properties, enhancing reliability in small-sample Bayesian analyses.

Alpha-Parallel Priors

Alpha-parallel priors constitute a family of noninformative prior distributions in Bayesian statistics, introduced within the framework of information geometry by Takeuchi and Amari. These priors are parameterized by a scalar α, such that when α = 0, the prior recovers the standard Jeffreys prior; for other values of α, they generalize Jeffreys' rule by incorporating adjustments derived from α-connections on the statistical manifold. This construction leverages the geometry of the parameter space, where volume elements that are parallel with respect to the chosen connection define the prior's form, ensuring invariance under reparametrization while addressing limitations in multi-parameter settings. The explicit form of an α-parallel prior is given by \pi_\alpha(\theta) \propto \sqrt{|\det I(\theta)|} \exp(\alpha h(\theta)), where I(\theta) denotes the Fisher information matrix and h(\theta) is an adjustment function determined by the α-connection. In statistically equiaffine models, h(\theta) relates to a potential function ϕ(θ) satisfying certain differential conditions, such as T_a = \partial_a \phi, which preserves the geometric structure. Existence of the α-parallel prior for α ≠ 0 is not guaranteed and depends on the curvature properties of the model manifold, unlike the Jeffreys prior, which always exists. Relative to the Jeffreys prior, α-parallel priors offer a continuum of invariant options that facilitate sensitivity analysis in complex models, allowing practitioners to explore how posterior inferences vary with the choice of α. The α-parallel curves inherent in this framework maintain orthogonality with respect to the Fisher-Rao metric, enhancing robustness in scenarios involving nuisance parameters or multiparameter inference. This geometric invariance extends the location-scale properties of Jeffreys priors to higher-order approximations. In applications, α-parallel priors have been employed to refine marginal posteriors in hierarchical and multiparameter models, often yielding more accurate frequentist coverage compared to the standard Jeffreys prior by mitigating sensitivity to model geometry. For instance, asymptotic expansions of marginal posterior densities under these priors demonstrate improved higher-order accuracy in Bayesian predictions.

Modern Reference Priors

Modern reference priors represent a significant advancement in the construction of noninformative priors for Bayesian inference, particularly in multiparameter models where certain parameters are of primary interest while others are treated as nuisances. Building on earlier information-theoretic ideas from Rissanen, which emphasized minimizing description length to derive objective priors, Berger and Bernardo formalized iterative algorithms in the late 1980s and early 1990s to generalize Jeffreys priors to settings where only some parameters are of direct interest. These algorithms prioritize parameters of interest by sequentially maximizing the expected Kullback-Leibler (KL) divergence between the prior and posterior distributions, ensuring that the prior conveys minimal information about the parameters of interest while accounting for nuisance parameters. The construction of a modern reference prior for a parameter of interest \psi in the presence of nuisance parameters \lambda involves an iterative process. First, a conditional prior \pi(\lambda \mid \psi) is derived by maximizing the KL divergence within compact subsets of the parameter space, often leading to a form proportional to the square root of the determinant of the Fisher information matrix conditioned on \psi. This conditional prior is then combined with a marginal prior \pi(\psi) obtained similarly, yielding the joint prior \pi(\psi, \lambda) = \pi(\lambda \mid \psi) \pi(\psi). The iteration proceeds over ordered groups of parameters if there are multiple, ensuring asymptotic optimality in terms of missing information. A key result is that when all parameters are treated as of equal interest (no nuisances distinguished), the reference prior coincides with the Jeffreys prior. For instance, in the normal distribution where the mean \mu is the parameter of interest and the standard deviation \sigma > 0 is a nuisance, the reference prior is \pi(\mu, \sigma) \propto 1/\sigma. These modern reference priors address limitations of earlier approaches by handling non-regular models, such as those with discontinuities or unbounded parameters, through a rigorous limiting process over increasing compact sets. Berger, Bernardo, and Sun provided a formal general definition in 2009, clarifying the conditions under which the priors exist and are unique. Post-2000 developments have extended the framework to high-dimensional settings, establishing posterior consistency under reference priors for sparse and high-dimensional models, where the priors adapt to dimensionality while maintaining frequentist validity. This ensures robust inference even as the number of parameters grows with sample size.
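
As a small illustration of the normal-model example above, the sketch below uses the standard result that, under \pi(\mu, \sigma) \propto 1/\sigma, the marginal posterior of \mu is a Student-t distribution centred at the sample mean; the data are simulated and the settings arbitrary.

```python
# Sketch: under the reference prior pi(mu, sigma) ~ 1/sigma for the Normal model,
# the marginal posterior of mu satisfies (mu - xbar) / (s / sqrt(n)) ~ t_{n-1},
# so reference-prior credible intervals for mu reproduce the classical t-interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(10.0, 2.0, size=15)          # simulated data, arbitrary settings

n, xbar, s = len(x), x.mean(), x.std(ddof=1)
posterior_mu = stats.t(df=n - 1, loc=xbar, scale=s / np.sqrt(n))

print("95% credible interval for mu:", posterior_mu.interval(0.95))
```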