
Normal-gamma distribution

The Normal-gamma distribution is a joint probability distribution over the mean \mu and precision \lambda = 1/\sigma^2 (or equivalently, variance \sigma^2) of a univariate normal distribution, serving as the conjugate prior in Bayesian inference for normal data with both parameters unknown. It is parameterized by four hyperparameters: \mu_0 (prior mean of \mu), \kappa_0 (prior strength or effective sample size for \mu), \alpha_0 (shape parameter for the gamma prior on \lambda), and \beta_0 (rate parameter for the gamma prior on \lambda). The density is given by the product p(\mu, \lambda) = \mathcal{N}(\mu \mid \mu_0, (\kappa_0 \lambda)^{-1}) \cdot \text{Gamma}(\lambda \mid \alpha_0, \beta_0), where the conditional normal distribution reflects uncertainty in the mean scaling with the inverse precision, and the marginal gamma encodes beliefs about the precision itself. This distribution arises naturally in hierarchical Bayesian models for normal likelihoods, enabling closed-form posterior updates that preserve the Normal-gamma family. Given n independent observations x_1, \dots, x_n \sim \mathcal{N}(\mu, \lambda^{-1}) with sample mean \bar{x} and sum of squared deviations s^2 = \sum (x_i - \bar{x})^2, the posterior is Normal-gamma with updated parameters: \mu_n = (\kappa_0 \mu_0 + n \bar{x}) / (\kappa_0 + n), \kappa_n = \kappa_0 + n, \alpha_n = \alpha_0 + n/2, and \beta_n = \beta_0 + \frac{1}{2} \left[ s^2 + \kappa_0 n (\bar{x} - \mu_0)^2 / (\kappa_0 + n)\right]. These updates interpret \kappa_0 as the prior's effective sample size for the mean, \alpha_0 as prior "pseudo-observations" for the precision, and \beta_0 as a prior scale related to the expected variance. The marginal posterior for \mu is a non-standardized Student's t-distribution, while for \lambda it is gamma, facilitating exact inference without numerical approximation in many cases. Key applications include Bayesian inference for normal and linear regression models, where it extends to multivariate forms, and predictive distributions for new data, which follow a Student's t-distribution with location \mu_n, squared scale \beta_n (1 + 1/\kappa_n) / \alpha_n, and 2\alpha_n degrees of freedom. The choice of hyperparameters can reflect vague priors (e.g., small \kappa_0 and \alpha_0) or informative ones based on prior knowledge, such as setting \beta_0 \approx \alpha_0 \cdot \mathbb{E}[\sigma^2] for an expected variance. Despite slight notational variations across formulations (e.g., using rate vs. scale for the gamma), the Normal-gamma remains a cornerstone for tractable Bayesian analysis of normal models due to its conjugacy and interpretability.

Definition

Probability density function

The normal-gamma distribution specifies a joint prior over a normal distribution's mean \mu \in \mathbb{R} and precision \lambda > 0 (where the precision is the reciprocal of the variance). It parameterizes \mu and \lambda such that the conditional distribution of \mu \mid \lambda is normal with mean hyperparameter \mu_0 and precision \kappa \lambda (equivalently, variance (\kappa \lambda)^{-1}), while the marginal distribution of \lambda is gamma with shape hyperparameter \alpha > 0 and rate hyperparameter \beta > 0. The joint probability density function (PDF) is the product of these conditional and marginal densities: f(\mu, \lambda \mid \mu_0, \kappa, \alpha, \beta) = \mathcal{N}(\mu \mid \mu_0, (\kappa \lambda)^{-1}) \cdot \Gamma(\lambda \mid \alpha, \beta), where \mathcal{N}(\cdot \mid \mu_0, (\kappa \lambda)^{-1}) denotes the normal PDF \mathcal{N}(\mu \mid \mu_0, (\kappa \lambda)^{-1}) = \sqrt{\frac{\kappa \lambda}{2\pi}} \exp\left( -\frac{\kappa \lambda (\mu - \mu_0)^2}{2} \right) and \Gamma(\cdot \mid \alpha, \beta) denotes the gamma PDF \Gamma(\lambda \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} \exp(-\beta \lambda), \quad \lambda > 0. This product form reflects the hierarchical structure, with \lambda serving as a shared precision parameter that scales the variance of the conditional normal. Substituting the component densities yields the explicit joint PDF: \begin{aligned} f(\mu, \lambda \mid \mu_0, \kappa, \alpha, \beta) &= \sqrt{\frac{\kappa \lambda}{2\pi}} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} \exp\left( -\lambda \left( \beta + \frac{\kappa (\mu - \mu_0)^2}{2} \right) \right) \\ &= \frac{\beta^\alpha}{\Gamma(\alpha)} \left( \frac{\kappa \lambda}{2\pi} \right)^{1/2} \lambda^{\alpha - 1/2} \exp\left( -\lambda \left( \beta + \frac{\kappa (\mu - \mu_0)^2}{2} \right) \right). \end{aligned} The distribution is defined over the support \mu \in \mathbb{R} and \lambda > 0.
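The factorized form above maps directly onto standard library routines. The following sketch (not from a referenced implementation; function and parameter names are illustrative) evaluates the joint log-density by summing the conditional normal and marginal gamma log-densities with SciPy, using the shape-rate convention for the gamma.

```python
import numpy as np
from scipy import stats

def normal_gamma_logpdf(mu, lam, mu0, kappa, alpha, beta):
    """log f(mu, lambda | mu0, kappa, alpha, beta) under the shape-rate gamma."""
    # Conditional normal: mu | lambda ~ N(mu0, (kappa * lambda)^{-1})
    log_norm = stats.norm.logpdf(mu, loc=mu0, scale=1.0 / np.sqrt(kappa * lam))
    # Marginal gamma: lambda ~ Gamma(alpha, rate=beta); SciPy uses scale = 1/rate
    log_gamma = stats.gamma.logpdf(lam, a=alpha, scale=1.0 / beta)
    return log_norm + log_gamma

# Example: log-density at (mu, lambda) = (0.5, 2.0) under an assumed weak prior
print(normal_gamma_logpdf(0.5, 2.0, mu0=0.0, kappa=1.0, alpha=2.0, beta=1.0))
```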

Parameter interpretation

The hyperparameter \mu_0 serves as the a priori best guess for the mean \mu of the underlying normal distribution, representing the center of the prior belief about the mean. The hyperparameter \kappa quantifies the strength of this belief in \mu_0 by acting as the number of pseudo-observations that contribute to the concentration of \mu around \mu_0; larger values of \kappa imply greater confidence in the prior mean, equivalent to having observed \kappa data points centered at \mu_0. The hyperparameters \alpha and \beta define the shape and rate, respectively, of the gamma prior on the precision \lambda = 1/\sigma^2, where \alpha reflects the prior strength (with 2\alpha providing a degrees-of-freedom interpretation), and \beta scales the expected precision via E[\lambda] = \alpha / \beta, influencing the anticipated variability in the normal distribution. Collectively, these hyperparameters allow the normal-gamma prior to be viewed as derived from a hypothetical set of pseudo-observations: specifically, \kappa observations with sample mean \mu_0 for informing the mean, combined with 2\alpha pseudo-observations whose sum of squared deviations totals 2\beta for shaping the precision prior.

Properties

Marginal distributions

The marginal distribution of the precision parameter \lambda in the normal-gamma distribution is a gamma distribution with shape parameter \alpha and rate parameter \beta, denoted as \lambda \sim \mathrm{Gamma}(\alpha, \beta). This marginal is independent of the mean parameter \mu, as the joint density factors into the conditional normal density for \mu given \lambda and the gamma density for \lambda. To derive this marginal, integrate the joint probability density function over \mu. The conditional density p(\mu \mid \lambda) is normal with mean \mu_0 and precision \kappa \lambda, which integrates to unity over \mu, leaving the marginal density of \lambda unchanged from its gamma form. The marginal distribution of the mean parameter \mu is a Student's t-distribution with location parameter \mu_0, scale parameter \sqrt{\beta / (\alpha \kappa)}, and 2\alpha degrees of freedom. This arises from the scale mixture representation, where the normal-gamma joint can be viewed as a normal distribution for \mu mixed over a gamma-distributed precision \lambda, a construction known to produce a t-distribution. The derivation involves integrating the joint density over \lambda, which requires evaluating the integral of a normal density times a gamma density. This product integrates to the density of a non-standardized Student's t-distribution through recognition of the gamma as an inverse-chi-squared scale mixture or direct computation using gamma-function identities.
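The t marginal for \mu can be checked numerically by simulating the scale mixture. The snippet below is an assumed illustration: it mixes the conditional normal over gamma-distributed precisions and compares empirical quantiles against a Student's t with 2\alpha degrees of freedom, location \mu_0, and scale \sqrt{\beta/(\alpha \kappa)}.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu0, kappa, alpha, beta = 1.0, 2.0, 3.0, 4.0

# Scale mixture: lambda ~ Gamma(alpha, rate beta), then mu | lambda ~ N(mu0, 1/(kappa*lambda))
lam = rng.gamma(shape=alpha, scale=1.0 / beta, size=200_000)
mu = rng.normal(loc=mu0, scale=1.0 / np.sqrt(kappa * lam))

# Theoretical marginal: Student's t with 2*alpha df, location mu0, scale sqrt(beta/(alpha*kappa))
t_marginal = stats.t(df=2 * alpha, loc=mu0, scale=np.sqrt(beta / (alpha * kappa)))
for q in (0.05, 0.5, 0.95):
    print(q, np.quantile(mu, q), t_marginal.ppf(q))  # empirical vs. theoretical quantiles
```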

Conditional distributions

The conditional distribution of the mean parameter \mu given the precision parameter \lambda follows a normal distribution: \mu \mid \lambda \sim \mathcal{N}(\mu_0, (\kappa \lambda)^{-1}). This form arises directly from the structure of the normal-gamma joint density, where the precision \lambda scales the prior precision factor \kappa, resulting in a variance that decreases as \lambda increases. Consequently, higher values of \lambda (indicating lower variance in the underlying normal model) lead to a tighter distribution around the prior mean \mu_0, emphasizing reduced uncertainty in \mu when the data precision is high. Conversely, the conditional distribution of the precision \lambda given \mu is a gamma distribution: \lambda \mid \mu \sim \mathrm{Gamma}(\alpha + 1/2, \beta + \kappa (\mu - \mu_0)^2 / 2), where the gamma is parameterized by shape and rate. The shape parameter incorporates an additional 1/2 from the dimensionality of the univariate normal component in the joint density, while the rate parameter adjusts the prior rate \beta by a term proportional to the squared deviation of \mu from \mu_0, weighted by \kappa. This adjustment reflects greater penalization (higher rate, narrower distribution) when \mu strays far from the prior mean, capturing the interdependence between the mean and precision parameters. These conditionals facilitate Gibbs sampling from the joint normal-gamma distribution, as each is available in closed form and conjugate to the other, allowing iterative draws: sample \mu from its conditional given the current \lambda, then sample \lambda from its conditional given the new \mu. This approach is particularly useful for posterior inference in Bayesian models where the normal-gamma serves as a conjugate prior, enabling efficient exploration of the parameter space without direct normalization of the joint density.
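For illustration, the two closed-form conditionals can be alternated in a Gibbs sampler. The sketch below uses hypothetical hyperparameter values; for the plain bivariate normal-gamma, direct sampling is preferable, but the alternation shows how the conditionals described above are used.

```python
import numpy as np

def gibbs_normal_gamma(mu0, kappa, alpha, beta, n_iter=5000, seed=0):
    """Gibbs sampler alternating the two closed-form conditionals."""
    rng = np.random.default_rng(seed)
    mu, lam = mu0, alpha / beta          # start at the prior means
    draws = np.empty((n_iter, 2))
    for i in range(n_iter):
        # mu | lambda ~ N(mu0, (kappa * lambda)^{-1})
        mu = rng.normal(mu0, 1.0 / np.sqrt(kappa * lam))
        # lambda | mu ~ Gamma(alpha + 1/2, rate = beta + kappa*(mu - mu0)^2 / 2)
        rate = beta + 0.5 * kappa * (mu - mu0) ** 2
        lam = rng.gamma(alpha + 0.5, 1.0 / rate)
        draws[i] = mu, lam
    return draws

samples = gibbs_normal_gamma(0.0, 2.0, 3.0, 4.0)
print(samples.mean(axis=0))  # roughly (mu0, alpha/beta)
```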

Moments and expectations

The expected value of the mean parameter μ in the normal-gamma distribution is E[μ] = μ₀. This follows from the symmetry of the conditional normal distribution around μ₀ and the law of total expectation. The variance of μ is Var(μ) = β / (κ (α - 1)) for α > 1. To derive this, note that the marginal distribution of μ is a non-standardized Student-t distribution with 2α degrees of freedom, location μ₀, and scale parameter √(β / (α κ)); the variance of a Student-t random variable with ν degrees of freedom and scale s is s² ν / (ν - 2), yielding [β / (α κ)] ⋅ 2α / (2α - 2) = β / (κ (α - 1)). Alternatively, using the law of total variance, Var(μ) = E[Var(μ | λ)] + Var(E[μ | λ]) = E[1 / (κ λ)] + 0 = (1/κ) E[1/λ], where 1/λ follows an inverse-gamma distribution with shape α and scale β, so E[1/λ] = β / (α - 1) for α > 1. The precision parameter λ follows a gamma distribution with shape α and rate β, so its expected value is E[λ] = α / β and its variance is Var(λ) = α / β². Higher moments, such as the second moment of μ, can be computed as E[μ²] = Var(μ) + (E[μ])² = μ₀² + β / (κ (α - 1)). This uses the law of total expectation: E[μ²] = E[E[μ² | λ]] = E[μ₀² + Var(μ | λ)] = μ₀² + E[1 / (κ λ)] = μ₀² + (1/κ) ⋅ β / (α - 1). Similarly, the cross-moment E[μ λ] = E[E[μ | λ] ⋅ λ] = E[μ₀ λ] = μ₀ E[λ] = μ₀ α / β. These follow from the conditional independence structure and the known moments of the gamma distribution. The mode of the joint distribution occurs at μ = μ₀ and λ = (α - 1/2) / β for α > 1/2. To arrive at this, consider the log-density (up to constants): log p(μ, λ) = (α - 1/2) log λ - β λ - (κ λ (μ - μ₀)²)/2. The partial derivative with respect to μ is -κ λ (μ - μ₀), which is zero at μ = μ₀. Substituting into the partial derivative with respect to λ gives (α - 1/2)/λ - β = 0, solving to λ = (α - 1/2) / β.
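A quick Monte Carlo check of these closed-form moments, under assumed hyperparameter values, can be run as follows; the empirical averages should approximate E[μ] = μ₀, Var(μ) = β/(κ(α - 1)), and E[μλ] = μ₀ α/β.

```python
import numpy as np

rng = np.random.default_rng(1)
mu0, kappa, alpha, beta = 1.0, 2.0, 3.0, 4.0

# Joint draws: lambda ~ Gamma(alpha, rate beta), mu | lambda ~ N(mu0, 1/(kappa*lambda))
lam = rng.gamma(alpha, 1.0 / beta, size=1_000_000)
mu = rng.normal(mu0, 1.0 / np.sqrt(kappa * lam))

print(mu.mean(), mu0)                           # E[mu] = mu0
print(mu.var(), beta / (kappa * (alpha - 1)))   # Var(mu) = beta / (kappa*(alpha-1))
print((mu * lam).mean(), mu0 * alpha / beta)    # E[mu*lambda] = mu0*alpha/beta
```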

Exponential family form

The normal-gamma distribution is a member of the four-parameter exponential family, allowing it to be expressed in the canonical form p(\mu, \lambda \mid \boldsymbol{\eta}) = h(\mu, \lambda) \exp\left( \boldsymbol{\eta}^\top t(\mu, \lambda) - A(\boldsymbol{\eta}) \right), where the base measure is h(\mu, \lambda) = 1, the sufficient statistics are the vector t(\mu, \lambda) = (\mu^2 \lambda, \mu \lambda, \lambda, \log \lambda), and the natural parameters are \boldsymbol{\eta} = \left( -\frac{\kappa}{2}, \kappa \mu_0, -\beta - \frac{\kappa \mu_0^2}{2}, \alpha - \frac{1}{2} \right). This representation arises by taking the logarithm of the joint density, expanding the quadratic form (\mu - \mu_0)^2 in the conditional normal component to yield terms \mu^2 \lambda and \mu \lambda, and combining with the gamma terms for \lambda and \log \lambda. The log-partition function A(\boldsymbol{\eta}) ensures normalization and can be derived explicitly using properties of the gamma function. The moments of the sufficient statistics follow from derivatives of A(\boldsymbol{\eta}) with respect to the natural parameters. Specifically, the expectation E[\lambda] = \frac{\partial A}{\partial \eta_3} = \frac{\alpha}{\beta} corresponds to the mean of the marginal gamma distribution on \lambda. For the quadratic term, E[\lambda (\mu - \mu_0)^2] = \frac{1}{\kappa}, obtained by conditioning on \lambda and using the conditional variance \mathrm{Var}(\mu \mid \lambda) = (\kappa \lambda)^{-1}, which simplifies independently of the gamma hyperparameters. These moments highlight the separation between the location-scale structure for \mu and the shape-rate parameterization for \lambda. Membership in the exponential family enables straightforward conjugate Bayesian updates, where the posterior natural parameters are the sum of the prior natural parameters and the data's sufficient statistics scaled appropriately. This structure supports efficient inference in hierarchical models with normal likelihoods, as the posterior remains normal-gamma with updated hyperparameters that linearly incorporate sample sums like \sum x_i and \sum x_i^2, facilitating scalable computation without iterative approximation.
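The mapping between the hyperparameters and the natural parameters stated above is a simple change of variables; the sketch below (illustrative helper names, not a library API) converts in both directions and confirms the round trip.

```python
def to_natural(mu0, kappa, alpha, beta):
    """(mu0, kappa, alpha, beta) -> natural parameters (eta1, eta2, eta3, eta4)."""
    return (-kappa / 2.0,
            kappa * mu0,
            -beta - kappa * mu0 ** 2 / 2.0,
            alpha - 0.5)

def from_natural(eta1, eta2, eta3, eta4):
    """Invert the mapping back to the standard hyperparameters."""
    kappa = -2.0 * eta1
    mu0 = eta2 / kappa
    beta = -eta3 - kappa * mu0 ** 2 / 2.0
    alpha = eta4 + 0.5
    return mu0, kappa, alpha, beta

print(from_natural(*to_natural(1.0, 2.0, 3.0, 4.0)))  # recovers (1.0, 2.0, 3.0, 4.0)
```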

Bayesian usage

Conjugate prior for normal parameters

In Bayesian inference, the normal-gamma distribution serves as a conjugate prior for the mean \mu and precision \lambda of independent and identically distributed observations x_1, \dots, x_n from a normal distribution, where each x_i \sim \mathrm{Normal}(\mu, \lambda^{-1}). The prior is parameterized as (\mu, \lambda) \sim \mathrm{Normal\text{-}Gamma}(\mu_0, \kappa, \alpha, \beta), with \mu_0 representing the prior location for the mean, \kappa scaling the prior precision around \mu_0, and \alpha, \beta as the shape and rate parameters for the marginal gamma prior on \lambda. The conjugacy property stems from the form of the likelihood, which is proportional to \lambda^{n/2} \exp\left\{ -\frac{\lambda}{2} \sum_{i=1}^n (x_i - \mu)^2 \right\}, matching the kernel of the normal-gamma prior distribution and ensuring the posterior remains in the same family. This setup enables closed-form updates for the posterior hyperparameters, preserving analytical tractability. The normal-gamma prior was introduced for normal models with unknown variance, overcoming the limitations of the earlier normal-normal conjugate model, which requires a known variance and thus cannot jointly infer the mean and precision. It was referenced in foundational work on applied statistical decision theory and later formalized in treatments of optimal statistical decisions. By maintaining conjugacy, the normal-gamma prior supports exact Bayesian inference through simple parameter updates, avoiding the computational demands of non-conjugate alternatives that often require approximation methods like Markov chain Monte Carlo.

Posterior distribution derivation

The normal-gamma distribution serves as a conjugate prior for the mean \mu and precision \lambda = 1/\sigma^2 of a normal likelihood x_i \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \lambda^{-1}), i=1,\dots,n. The prior is parameterized as \mu \mid \lambda \sim \mathcal{N}(\mu_0, (\kappa \lambda)^{-1}) and \lambda \sim \text{Gamma}(\alpha, \beta), where the gamma distribution uses the shape-rate parameterization with density proportional to \lambda^{\alpha-1} e^{-\beta \lambda}. The posterior distribution p(\mu, \lambda \mid \mathbf{x}) is derived by multiplying the prior density by the likelihood and recognizing the resulting form as another normal-gamma distribution. The likelihood is p(\mathbf{x} \mid \mu, \lambda) = (2\pi)^{-n/2} \lambda^{n/2} \exp\left( -\frac{\lambda}{2} \sum_{i=1}^n (x_i - \mu)^2 \right), where \mathbf{x} = (x_1, \dots, x_n) and \bar{x} = n^{-1} \sum_{i=1}^n x_i. The prior density is p(\mu, \lambda) \propto \lambda^{1/2} \exp\left( -\frac{\kappa \lambda}{2} (\mu - \mu_0)^2 \right) \cdot \lambda^{\alpha - 1} \exp(-\beta \lambda). Combining these yields p(\mu, \lambda \mid \mathbf{x}) \propto \lambda^{\alpha + n/2 - 1/2} \exp\left( -\lambda \left[ \beta + \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2 + \frac{\kappa}{2} (\mu - \mu_0)^2 \right] \right). To complete the derivation, expand the sum of squares as \sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + n (\mu - \bar{x})^2, denoting S = \sum_{i=1}^n (x_i - \bar{x})^2. The exponent becomes -\lambda \left[ \beta + \frac{S}{2} + \frac{n}{2} (\mu - \bar{x})^2 + \frac{\kappa}{2} (\mu - \mu_0)^2 \right]. The quadratic terms in \mu are \frac{n}{2} (\mu - \bar{x})^2 + \frac{\kappa}{2} (\mu - \mu_0)^2 = \frac{\kappa + n}{2} \left( \mu - \mu_n \right)^2 + C, where \mu_n = (\kappa \mu_0 + n \bar{x}) / (\kappa + n) and C = \frac{\kappa n}{2(\kappa + n)} (\bar{x} - \mu_0)^2 is the constant term independent of \mu. Substituting back gives p(\mu, \lambda \mid \mathbf{x}) \propto \lambda^{(\alpha + n/2) - 1} \exp\left( -\lambda \left[ \beta + \frac{S}{2} + C \right] \right) \cdot \lambda^{1/2} \exp\left( -\frac{(\kappa + n) \lambda}{2} (\mu - \mu_n)^2 \right). This matches the kernel of a normal-gamma posterior with updated hyperparameters \mu_n = (\kappa \mu_0 + n \bar{x}) / (\kappa + n), \kappa_n = \kappa + n, \alpha_n = \alpha + n/2, and \beta_n = \beta + \frac{1}{2} \left[ S + \frac{\kappa n (\bar{x} - \mu_0)^2}{\kappa + n} \right]. The marginal posterior predictive distribution for a new observation x^* \mid \mathbf{x} integrates out \mu and \lambda from the posterior, yielding a non-standardized Student's t-distribution: x^* \mid \mathbf{x} \sim t_{2\alpha_n} \left( \mu_n, \frac{\beta_n (1 + 1/\kappa_n)}{\alpha_n} \right). This follows from the marginal posterior for \mu being a t-distribution and the conditional x^* \mid \mu, \mathbf{x} \sim \mathcal{N}(\mu, \lambda^{-1}) with \lambda integrated out.
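The updated hyperparameters and the posterior predictive t can be computed directly from the sufficient statistics. The following sketch (illustrative function names and synthetic data, with the second argument of the t treated as the squared scale per the expression above) implements the closed-form update.

```python
import numpy as np
from scipy import stats

def posterior_update(x, mu0, kappa, alpha, beta):
    """Return (mu_n, kappa_n, alpha_n, beta_n) for the normal-gamma posterior."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    S = np.sum((x - xbar) ** 2)
    kappa_n = kappa + n
    mu_n = (kappa * mu0 + n * xbar) / kappa_n
    alpha_n = alpha + n / 2.0
    beta_n = beta + 0.5 * (S + kappa * n * (xbar - mu0) ** 2 / kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

def posterior_predictive(mu_n, kappa_n, alpha_n, beta_n):
    """x* | x ~ t_{2 alpha_n}(mu_n, beta_n * (1 + 1/kappa_n) / alpha_n), squared-scale form."""
    scale2 = beta_n * (1.0 + 1.0 / kappa_n) / alpha_n
    return stats.t(df=2 * alpha_n, loc=mu_n, scale=np.sqrt(scale2))

x = np.array([1.2, 0.7, 1.9, 1.4, 0.9])          # synthetic observations
post = posterior_update(x, mu0=0.0, kappa=1.0, alpha=2.0, beta=2.0)
print(post)
print(posterior_predictive(*post).interval(0.95))  # 95% predictive interval
```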

Parameter update rules

The parameter update rules for the normal-gamma distribution enable efficient Bayesian inference on the mean and precision of a normal likelihood, particularly in scenarios involving sequential data arrival. These rules leverage the conjugacy property, ensuring that the posterior remains in the normal-gamma family after incorporating new observations. For a single new observation x_n from a normal distribution with unknown mean \mu and precision \lambda, the hyperparameters are updated incrementally as follows:

\kappa_n = \kappa_{n-1} + 1, \quad \mu_n = \frac{\kappa_{n-1} \mu_{n-1} + x_n}{\kappa_n} = \mu_{n-1} + \frac{x_n - \mu_{n-1}}{\kappa_n}, \quad \alpha_n = \alpha_{n-1} + \frac{1}{2}, \quad \beta_n = \beta_{n-1} + \frac{\kappa_{n-1} (x_n - \mu_{n-1})^2}{2 \kappa_n}.

These formulas adjust the prior precision multiplier \kappa, location \mu, gamma shape \alpha, and gamma rate \beta to reflect the new evidence, with the update for \beta accounting for the squared deviation weighted by the relative prior strength \kappa_{n-1}/\kappa_n. In batch processing, the updates aggregate sufficient statistics from n independent and identically distributed observations: the sample size n, the sample mean \bar{x}, and the sum of squared deviations SS = \sum_{i=1}^n (x_i - \bar{x})^2. The batch formulas are then

\kappa_n = \kappa_0 + n, \quad \mu_n = \frac{\kappa_0 \mu_0 + n \bar{x}}{\kappa_n}, \quad \alpha_n = \alpha_0 + \frac{n}{2}, \quad \beta_n = \beta_0 + \frac{SS}{2} + \frac{\kappa_0 n (\bar{x} - \mu_0)^2}{2 \kappa_n},

where the subscript 0 denotes initial prior values. This batch approach is mathematically equivalent to repeated application of the sequential rules for i.i.d. data, as the order of observations does not affect the final posterior due to the exchangeability of the likelihood. The sequential update rules offer significant advantages in streaming or online scenarios, where data arrives incrementally and computational resources must be managed efficiently. By avoiding the need to recompute sufficient statistics over the entire dataset each time, these rules facilitate real-time Bayesian updating, which is prevalent in applications such as adaptive filtering and online probabilistic modeling.
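As a check of the stated equivalence, the one-observation sequential rules can be applied repeatedly and compared with the batch formulas; the sketch below uses assumed prior values and a small synthetic sample.

```python
import numpy as np

def sequential_update(x_new, mu, kappa, alpha, beta):
    """One-observation update of the normal-gamma hyperparameters."""
    kappa_n = kappa + 1.0
    mu_n = mu + (x_new - mu) / kappa_n
    alpha_n = alpha + 0.5
    beta_n = beta + kappa * (x_new - mu) ** 2 / (2.0 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

def batch_update(x, mu0, kappa0, alpha0, beta0):
    """Batch update from the sufficient statistics (n, xbar, SS)."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    SS = np.sum((x - xbar) ** 2)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + SS / 2.0 + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

x = np.array([0.3, -1.1, 0.8, 2.0, 0.5])
params = (0.0, 1.0, 2.0, 2.0)            # assumed prior (mu0, kappa0, alpha0, beta0)
for xi in x:
    params = sequential_update(xi, *params)
print(params)
print(batch_update(x, 0.0, 1.0, 2.0, 2.0))  # identical numbers, up to rounding
```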

Sampling and computation

Random variate generation

To generate random variates from the normal-gamma distribution NG(μ, λ | μ₀, κ, α, β), where the joint density is given by the product of a conditional normal distribution for the mean μ given the precision λ and a marginal gamma distribution for λ, one efficient approach is to use the known marginal and conditional forms. Specifically, first draw λ from a gamma distribution with shape parameter α and rate parameter β, Gamma(α, β). Then, conditional on this λ, draw μ from a normal distribution with mean μ₀ and variance (κ λ)^{-1}, denoted N(μ₀, (κ λ)^{-1}). This direct method is exact and computationally straightforward, leveraging the marginal-conditional factorization of the parameterization, and requires only standard univariate random number generators for the gamma and normal distributions. For cases where direct sampling is impractical or when integrating into more complex models, a Gibbs sampler can be employed to simulate from the joint distribution. The Gibbs sampler alternates between sampling from the full conditional distributions: μ | λ ~ N(μ₀, (κ λ)^{-1}) and λ | μ ~ Gamma(α + 1/2, β + [κ (μ - μ₀)^2]/2). The updated shape α + 1/2 arises from the λ^{1/2} factor contributed by the conditional normal density, while the quadratic term in μ contributes to the rate. This Markov chain Monte Carlo (MCMC) procedure converges to the target joint distribution under standard regularity conditions, though for the bivariate normal-gamma alone, the direct method is typically preferred due to its simplicity and lack of autocorrelation between draws. In practice, random variate generation from the normal-gamma distribution is supported in probabilistic programming libraries used for Bayesian workflows. For instance, in Stan, users can implement the direct sampler using the built-in gamma_rng and normal_rng functions in the generated quantities block. Similarly, PyMC provides components like Gamma and Normal distributions that allow straightforward definition of the joint via the marginal-conditional approach. These implementations facilitate efficient simulation, especially when the normal-gamma serves as a conjugate prior in larger hierarchical models.
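A minimal direct sampler following the marginal-then-conditional recipe, using only NumPy's standard generators (function name and values are illustrative), might look like this:

```python
import numpy as np

def sample_normal_gamma(mu0, kappa, alpha, beta, size, rng=None):
    """Draw (mu, lambda) pairs from NG(mu0, kappa, alpha, beta)."""
    rng = np.random.default_rng() if rng is None else rng
    # lambda ~ Gamma(alpha, rate beta); NumPy's gamma takes a scale = 1/rate
    lam = rng.gamma(shape=alpha, scale=1.0 / beta, size=size)
    # mu | lambda ~ N(mu0, (kappa * lambda)^{-1})
    mu = rng.normal(loc=mu0, scale=1.0 / np.sqrt(kappa * lam))
    return mu, lam

mu, lam = sample_normal_gamma(0.0, 2.0, 3.0, 4.0, size=100_000,
                              rng=np.random.default_rng(42))
print(mu.mean(), lam.mean())  # roughly mu0 and alpha/beta
```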

Scale invariance properties

The normal-gamma distribution, when parameterized using precision \lambda = 1/\sigma^2, possesses scale-invariance properties that preserve its functional form under linear rescaling of the data or parameters. This arises because precision transforms inversely with the square of the scaling factor, ensuring that the joint prior on the mean \mu and precision \lambda remains within the same family after appropriate adjustments to the hyperparameters. Such invariance is a direct consequence of the conjugate structure and makes the normal-gamma particularly suitable for Bayesian analyses where the measurement scale is arbitrary or unknown, as it avoids introducing artificial dependencies on units of measurement. Consider a dataset \{x_i\}_{i=1}^n drawn from a normal distribution with unknown mean \mu and precision \lambda. If the data are scaled by a positive constant c, yielding \{x_i' = c x_i\}, the corresponding model parameters transform as \mu' = c \mu and \lambda' = \lambda / c^2, reflecting the scaling of the variance by c^2. For a prior (\mu, \lambda) \sim \text{Normal-Gamma}(\mu_0, \kappa_0, \alpha_0, \beta_0), where the gamma distribution on \lambda uses shape \alpha_0 and rate \beta_0 (such that the pdf is \frac{\beta_0^{\alpha_0}}{\Gamma(\alpha_0)} \lambda^{\alpha_0 - 1} e^{-\beta_0 \lambda}), the transformed variables (\mu', \lambda') follow \text{Normal-Gamma}(c \mu_0, \kappa_0, \alpha_0, \beta_0 c^2). This adjustment maintains the prior's interpretability: \kappa_0 (prior sample size) remains unchanged, \alpha_0 (prior degrees of freedom for precision) is invariant, while \beta_0 (prior rate for precision) scales with c^2 to compensate for the data's expanded variability, reducing the prior mean precision \alpha_0 / \beta_0 by a factor of 1/c^2. To verify this mathematically, start with the joint pdf of the prior: p(\mu, \lambda) = \sqrt{\frac{\kappa_0 \lambda}{2\pi}} \exp\left( -\frac{\kappa_0 \lambda}{2} (\mu - \mu_0)^2 \right) \cdot \frac{\beta_0^{\alpha_0} \lambda^{\alpha_0 - 1} e^{-\beta_0 \lambda}}{\Gamma(\alpha_0)}. Under the transformation \mu' = c \mu and \lambda' = \lambda / c^2, the inverse is \mu = \mu'/c and \lambda = c^2 \lambda', with Jacobian determinant |J| = c. Substituting yields: p(\mu', \lambda') = p\left(\frac{\mu'}{c}, c^2 \lambda'\right) \cdot c. The normal component becomes: \sqrt{\frac{\kappa_0 (c^2 \lambda')}{2\pi}} \exp\left( -\frac{\kappa_0 (c^2 \lambda')}{2} \left(\frac{\mu'}{c} - \mu_0\right)^2 \right) = c \sqrt{\frac{\kappa_0 \lambda'}{2\pi}} \exp\left( -\frac{\kappa_0 \lambda'}{2} (\mu' - c \mu_0)^2 \right), accounting for the \sqrt{c^2} = c factor. The gamma component is: \frac{\beta_0^{\alpha_0} (c^2 \lambda')^{\alpha_0 - 1} e^{-\beta_0 (c^2 \lambda')} }{\Gamma(\alpha_0)} = \frac{\beta_0^{\alpha_0} c^{2(\alpha_0 - 1)} \lambda'^{\alpha_0 - 1} e^{- (\beta_0 c^2) \lambda'} }{\Gamma(\alpha_0)}. Multiplying by the Jacobian c gives overall prefactors c \cdot c \cdot c^{2(\alpha_0 - 1)} \beta_0^{\alpha_0} = \beta_0^{\alpha_0} c^{2\alpha_0}, and the exponential e^{-(\beta_0 c^2) \lambda'}, matching the pdf of \text{Normal-Gamma}(c \mu_0, \kappa_0, \alpha_0, \beta_0 c^2). This closure under scaling confirms the invariance. In contrast, the normal-inverse-gamma distribution, which parameterizes the prior using the variance \sigma^2 instead of the precision, preserves its form under scaling but with the scale hyperparameter of the inverse-gamma multiplied by c^2, leading to transformations that may introduce sensitivities in improper or weakly informative settings, particularly near zero variance.
The scale invariance of the normal-gamma has key implications for robustness in statistical models with unknown or varying scales, such as hierarchical Bayesian models where parameters at different levels may involve scaled observations (e.g., in multilevel regression or growth curve modeling). By maintaining the prior family without re-specification, it facilitates stable posterior updates and reduces sensitivity to unit choices, enhancing the prior's non-informativeness in scale-ambiguous scenarios.
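The closure under scaling can also be verified empirically: transforming draws from the original prior by \mu' = c\mu and \lambda' = \lambda/c^2 should be distributionally indistinguishable from draws taken directly from Normal-Gamma(c\mu_0, \kappa_0, \alpha_0, \beta_0 c^2). The sketch below (assumed hyperparameter and c values) compares first moments of the two sets of draws.

```python
import numpy as np

rng = np.random.default_rng(7)
mu0, kappa0, alpha0, beta0, c = 1.0, 2.0, 3.0, 4.0, 10.0

# Draws from the original prior, then transformed by mu' = c*mu, lambda' = lambda/c^2
lam = rng.gamma(alpha0, 1.0 / beta0, size=500_000)
mu = rng.normal(mu0, 1.0 / np.sqrt(kappa0 * lam))
mu_t, lam_t = c * mu, lam / c**2

# Direct draws from the scaled prior Normal-Gamma(c*mu0, kappa0, alpha0, beta0*c^2)
lam2 = rng.gamma(alpha0, 1.0 / (beta0 * c**2), size=500_000)
mu2 = rng.normal(c * mu0, 1.0 / np.sqrt(kappa0 * lam2))

print(np.mean(mu_t), np.mean(mu2))    # both close to c*mu0
print(np.mean(lam_t), np.mean(lam2))  # both close to alpha0/(beta0*c^2)
```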

Normal-inverse-gamma distribution

The normal-inverse-gamma distribution serves as a conjugate prior for the mean \mu and variance \sigma^2 of a normal distribution, specified as \mu \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2 / \kappa) and \sigma^2 \sim \text{Inverse-Gamma}(\alpha, \beta), where \mu_0, \kappa > 0, \alpha > 0, and \beta > 0 are hyperparameters representing the location, prior strength for the mean, shape, and scale, respectively. The joint density is given by p(\mu, \sigma^2 \mid \mu_0, \kappa, \alpha, \beta) \propto (\sigma^2)^{-(\alpha + 3/2)} \exp\left( -\frac{\beta}{\sigma^2} - \frac{\kappa (\mu - \mu_0)^2}{2 \sigma^2} \right), where the additional factor of (\sigma^2)^{-1/2} relative to the inverse-gamma kernel arises from the conditional normal density. This distribution is mathematically equivalent to the normal-gamma distribution when reparameterized in terms of the precision \lambda = 1/\sigma^2, as the inverse-gamma prior on \sigma^2 corresponds to a gamma prior on \lambda. However, the inverse-gamma parameterization on the variance introduces distinct scaling behaviors compared to the normal-gamma's focus on precision, particularly in how hyperparameters influence moments and posterior updates. The normal-inverse-gamma prior is prevalent in Bayesian literature for analyses emphasizing variance components, such as hierarchical normal models and meta-analyses. In contrast, the normal-gamma is frequently favored for models formulated in terms of precision, leveraging the gamma distribution's direct conjugacy with precision parameters. The hyperparameters align directly across the two forms, with the expected variance given by \mathbb{E}[\sigma^2] = \beta / (\alpha - 1) for \alpha > 1.
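The stated equivalence between the two parameterizations can be seen operationally: drawing \lambda from a shape-rate gamma and inverting it yields inverse-gamma draws for \sigma^2. The snippet below (assumed shape and scale values) checks the expected variance \beta/(\alpha - 1).

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 3.0, 4.0

# lambda ~ Gamma(alpha, rate beta)  <=>  sigma^2 = 1/lambda ~ Inverse-Gamma(alpha, beta)
lam = rng.gamma(alpha, 1.0 / beta, size=1_000_000)
sigma2 = 1.0 / lam

print(sigma2.mean(), beta / (alpha - 1))  # E[sigma^2] = beta/(alpha-1) for alpha > 1
```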

Multivariate extensions

The multivariate extension of the normal-gamma distribution is the normal-Wishart distribution, which provides a conjugate prior for the mean vector \mu \in \mathbb{R}^p and precision matrix \Lambda \in \mathbb{R}^{p \times p} of a p-dimensional multivariate normal distribution with unknown mean and covariance. In the normal-Wishart distribution, the precision matrix follows a Wishart distribution, \Lambda \sim \mathcal{W}_p(\nu, S), where \nu > p-1 denotes the degrees of freedom and S is a p \times p positive definite scale matrix, while the mean vector is conditionally normal given the precision matrix: \mu \mid \Lambda \sim \mathcal{N}_p(\mu_0, (\kappa \Lambda)^{-1}), with hyperparameters consisting of the prior mean vector \mu_0 \in \mathbb{R}^p and scalar \kappa > 0 that controls the strength of the prior belief around \mu_0. These hyperparameters allow flexible specification of prior knowledge about the location, scale, and shape of the multivariate parameters. For n independent and identically distributed observations x_1, \dots, x_n \stackrel{\text{iid}}{\sim} \mathcal{N}_p(\mu, \Lambda^{-1}), the posterior distribution remains normal-Wishart, with updated hyperparameters including \nu_n = \nu + n for the degrees of freedom and \kappa_n = \kappa + n for the prior strength, along with analogous updates to \mu_n and S_n that incorporate the sample mean and scatter. This conjugacy mirrors the univariate case but extends it to account for correlations across dimensions. The normal-Wishart distribution finds application in multivariate Bayesian regression, where it enables closed-form posterior updates for regression coefficients and error structures in models with multiple response variables. It is also employed in Bayesian factor analysis to specify priors on factor means and precision matrices, facilitating inference in latent variable models with multivariate observations. Due to the higher dimensionality of the precision matrix, computational demands for sampling from the posterior and evaluating marginal likelihoods increase substantially relative to univariate settings.
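A draw from the normal-Wishart prior follows the same marginal-then-conditional pattern as the univariate case, with the Wishart replacing the gamma. The sketch below (illustrative dimensions and hyperparameters, using SciPy's Wishart) samples a precision matrix and then the conditional mean vector.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p = 3
mu0 = np.zeros(p)                 # prior mean vector
kappa, nu = 2.0, p + 2.0          # prior strength; degrees of freedom (nu > p - 1)
S = np.eye(p)                     # Wishart scale matrix

# Lambda ~ Wishart_p(nu, S), then mu | Lambda ~ N_p(mu0, (kappa * Lambda)^{-1})
Lambda = stats.wishart(df=nu, scale=S).rvs(random_state=rng)
cov_mu = np.linalg.inv(kappa * Lambda)
mu = rng.multivariate_normal(mu0, cov_mu)

print(mu)
print(np.linalg.eigvalsh(Lambda))  # positive eigenvalues: Lambda is positive definite
```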
