Normal-gamma distribution
The Normal-gamma distribution is a joint probability distribution over the mean \mu and precision \lambda = 1/\sigma^2 (the reciprocal of the variance \sigma^2) of a univariate normal distribution, serving as the conjugate prior in Bayesian inference for normal data with both parameters unknown.[1] It is parameterized by four hyperparameters: \mu_0 (prior mean of \mu), \kappa_0 (prior strength or effective sample size for \mu), \alpha_0 (shape parameter of the gamma prior on \lambda), and \beta_0 (rate parameter of the gamma prior on \lambda).[2] The density is the product p(\mu, \lambda) = \mathcal{N}(\mu \mid \mu_0, (\kappa_0 \lambda)^{-1}) \cdot \text{Gamma}(\lambda \mid \alpha_0, \beta_0), where the conditional normal distribution reflects uncertainty in the mean scaling with the inverse precision, and the marginal gamma encodes beliefs about the precision itself.[2]

This distribution arises naturally in hierarchical Bayesian models with normal likelihoods, enabling closed-form posterior updates that stay within the Normal-gamma family.[1] Given n independent observations x_1, \dots, x_n \sim \mathcal{N}(\mu, \lambda^{-1}) with sample mean \bar{x} and sum of squared deviations s^2 = \sum (x_i - \bar{x})^2, the posterior is Normal-gamma with updated parameters \mu_n = (\kappa_0 \mu_0 + n \bar{x}) / (\kappa_0 + n), \kappa_n = \kappa_0 + n, \alpha_n = \alpha_0 + n/2, and \beta_n = \beta_0 + \frac{1}{2} \left[ s^2 + \kappa_0 n (\bar{x} - \mu_0)^2 / (\kappa_0 + n)\right].[2] These updates interpret \kappa_0 as the prior's effective sample size, \alpha_0 as prior "pseudo-observations" for the precision, and \beta_0 as a prior scale related to the expected variance.[3] The marginal posterior for \mu is a non-standardized Student's t-distribution, while that for \lambda is gamma, so exact inference is possible without simulation in many cases.[2]

Key applications include Bayesian linear regression, where the distribution extends to multivariate forms, and predictive distributions for new data, which follow a Student's t-distribution with location \mu_n, squared scale \beta_n (\kappa_n + 1) / (\alpha_n \kappa_n), and 2\alpha_n degrees of freedom.[1] The choice of hyperparameters can reflect vague priors (e.g., small \kappa_0 and \alpha_0) or informative ones based on domain knowledge, such as setting \beta_0 \approx \alpha_0 \cdot \mathbb{E}[\sigma^2] for an expected variance.[3] Despite slight notational variations across formulations (e.g., rate versus scale parameterizations of the gamma component), the Normal-gamma distribution remains a cornerstone of tractable Bayesian analysis of normal models due to its conjugacy and interpretability.[2]
Definition
Probability density function
The normal-gamma distribution specifies a joint probability distribution over a normal distribution's mean parameter \mu \in \mathbb{R} and precision parameter \lambda > 0 (where the precision is the reciprocal of the variance). It parameterizes \mu and \lambda such that the conditional distribution \mu \mid \lambda is normal with mean hyperparameter \mu_0 and precision \kappa \lambda (equivalently, variance (\kappa \lambda)^{-1}), while the marginal distribution of \lambda is gamma with shape hyperparameter \alpha > 0 and rate hyperparameter \beta > 0.[1][3] The joint probability density function (PDF) is the product of these conditional and marginal densities:

f(\mu, \lambda \mid \mu_0, \kappa, \alpha, \beta) = \mathcal{N}(\mu \mid \mu_0, (\kappa \lambda)^{-1}) \cdot \Gamma(\lambda \mid \alpha, \beta),

where \mathcal{N}(\cdot \mid \mu_0, (\kappa \lambda)^{-1}) denotes the normal PDF

\mathcal{N}(\mu \mid \mu_0, (\kappa \lambda)^{-1}) = \sqrt{\frac{\kappa \lambda}{2\pi}} \exp\left( -\frac{\kappa \lambda (\mu - \mu_0)^2}{2} \right)

and \Gamma(\cdot \mid \alpha, \beta) denotes the gamma PDF

\Gamma(\lambda \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} \exp(-\beta \lambda), \quad \lambda > 0.

This product form reflects the hierarchical structure, with \lambda serving as a shared precision parameter that scales the variance of the conditional normal.[1][3][4] Substituting the component densities yields the explicit joint PDF:

\begin{aligned} f(\mu, \lambda \mid \mu_0, \kappa, \alpha, \beta) &= \sqrt{\frac{\kappa \lambda}{2\pi}} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} \exp\left( -\lambda \left( \beta + \frac{\kappa (\mu - \mu_0)^2}{2} \right) \right) \\ &= \frac{\beta^\alpha}{\Gamma(\alpha)} \left( \frac{\kappa \lambda}{2\pi} \right)^{1/2} \lambda^{\alpha - 1/2} \exp\left( -\lambda \left( \beta + \frac{\kappa (\mu - \mu_0)^2}{2} \right) \right). \end{aligned}

The distribution is defined over the support \mu \in \mathbb{R} and \lambda > 0.[3][4]
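The factorized form translates directly into code. As a minimal sketch (the function name normal_gamma_logpdf and the example values are illustrative), the joint log-density can be evaluated with SciPy as the sum of the conditional normal and marginal gamma log-densities; since the formulas above use the rate parameterization, SciPy's gamma scale argument is set to 1/\beta:

```python
import numpy as np
from scipy.stats import norm, gamma

def normal_gamma_logpdf(mu, lam, mu0, kappa, alpha, beta):
    """Joint log-density of (mu, lambda): conditional normal times marginal gamma."""
    # Conditional: mu | lambda ~ N(mu0, (kappa * lambda)^{-1})
    log_normal = norm.logpdf(mu, loc=mu0, scale=1.0 / np.sqrt(kappa * lam))
    # Marginal: lambda ~ Gamma(shape=alpha, rate=beta); SciPy uses a scale parameter, so scale = 1/beta
    log_gamma = gamma.logpdf(lam, a=alpha, scale=1.0 / beta)
    return log_normal + log_gamma

# Example evaluation at a single point
print(normal_gamma_logpdf(mu=0.3, lam=1.2, mu0=0.0, kappa=2.0, alpha=3.0, beta=2.0))
```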
Parameter interpretation
The hyperparameter \mu_0 serves as the a priori best guess for the mean \mu of the underlying normal distribution, representing the central tendency of the prior belief about the location parameter.[3] The hyperparameter \kappa quantifies the strength of this prior belief in \mu_0 by acting as a count of pseudo-observations contributing to the precision of \mu around \mu_0; larger values of \kappa imply greater confidence in the prior mean, equivalent to having observed \kappa data points centered at \mu_0.[5] The hyperparameters \alpha and \beta define the shape and rate, respectively, of the gamma prior on the precision \lambda = 1/\sigma^2, where \alpha reflects the prior strength (with 2\alpha playing a role analogous to degrees of freedom) and \beta scales the expected precision via E[\lambda] = \alpha / \beta, influencing the anticipated variability of the normal distribution.[6] Collectively, these hyperparameters allow the normal-gamma prior to be viewed as derived from a hypothetical set of pseudo-observations: \kappa observations with sample mean \mu_0 informing the mean, combined with 2\alpha pseudo-observations whose sum of squared deviations totals 2\beta shaping the precision prior.[7]
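As a concrete illustration of this pseudo-observation reading (a sketch under one simple convention; the helper elicit_normal_gamma and the specific numbers are illustrative, not a standard API), a hypothetical prior sample can be mapped to hyperparameters as follows:

```python
def elicit_normal_gamma(n_pseudo, prior_mean, prior_variance):
    """Map a hypothetical prior sample into normal-gamma hyperparameters.

    n_pseudo       -- number of pseudo-observations backing the prior
    prior_mean     -- their sample mean (becomes mu0)
    prior_variance -- their sample variance (divisor n convention)
    """
    mu0 = prior_mean
    kappa = n_pseudo                # kappa pseudo-observations inform the mean
    alpha = n_pseudo / 2.0          # 2*alpha pseudo-observations inform the precision
    beta = alpha * prior_variance   # sum of squared deviations is 2*beta = n_pseudo * variance
    return mu0, kappa, alpha, beta

# Prior belief equivalent to 10 observations with mean 5 and variance 4
print(elicit_normal_gamma(10, 5.0, 4.0))  # (5.0, 10, 5.0, 20.0)
```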
Properties
Marginal distributions
The marginal distribution of the precision parameter \lambda in the normal-gamma distribution is a gamma distribution with shape parameter \alpha and rate parameter \beta, denoted \lambda \sim \mathrm{Gamma}(\alpha, \beta).[2] This marginal does not depend on the mean parameter \mu, as the joint density factors into the conditional normal density for \mu given \lambda and the gamma density for \lambda.[2] To derive it, integrate the joint probability density function over \mu: the conditional density p(\mu \mid \lambda) is normal with mean \mu_0 and precision \kappa \lambda, which integrates to unity over \mu, leaving the gamma density for \lambda unchanged.[2]

The marginal distribution of the mean parameter \mu is a Student's t-distribution with location parameter \mu_0, scale parameter \sqrt{\beta / (\alpha \kappa)}, and 2\alpha degrees of freedom.[2] This arises from the scale mixture representation, in which the normal-gamma joint can be viewed as a normal distribution for \mu mixed over a gamma-distributed precision \lambda, a construction known to produce a t-distribution.[2] The derivation integrates the joint density over \lambda, which requires evaluating the integral of a normal density times a gamma density; this product integrates to the density of a non-standardized Student's t-distribution, either by recognizing the gamma mixture on the precision (equivalently, an inverse-gamma mixture on the variance) or by direct computation using properties of the gamma function.[2]
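The marginal t result can be spot-checked by simulation. A minimal sketch (arbitrary hyperparameter values) draws from the joint via the marginal-conditional factorization and compares the empirical distribution of \mu to the stated Student's t-distribution with a Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu0, kappa, alpha, beta = 1.0, 2.0, 3.0, 4.0

# Draw from the joint: lambda ~ Gamma(alpha, rate=beta), then mu | lambda ~ N(mu0, 1/(kappa*lambda))
lam = rng.gamma(shape=alpha, scale=1.0 / beta, size=200_000)
mu = rng.normal(loc=mu0, scale=1.0 / np.sqrt(kappa * lam))

# The marginal of mu should be Student's t with 2*alpha df, location mu0, scale sqrt(beta/(alpha*kappa))
t_marginal = stats.t(df=2 * alpha, loc=mu0, scale=np.sqrt(beta / (alpha * kappa)))
print(stats.kstest(mu, t_marginal.cdf))  # KS statistic near zero when the marginal matches
```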
Conditional distributions
The conditional distribution of the mean parameter \mu given the precision parameter \lambda follows a normal distribution:

\mu \mid \lambda \sim \mathcal{N}(\mu_0, (\kappa \lambda)^{-1}).[8]

This form arises directly from the structure of the normal-gamma joint density, where the precision \lambda scales the prior precision factor \kappa, so that the conditional variance decreases as \lambda increases. Consequently, higher values of \lambda (indicating lower variance in the underlying normal model) produce a tighter distribution around the prior mean \mu_0, reflecting reduced uncertainty about \mu when the precision is high.[9]

Conversely, the conditional distribution of the precision \lambda given \mu is a gamma distribution:

\lambda \mid \mu \sim \mathrm{Gamma}\left(\alpha + \tfrac{1}{2},\; \beta + \tfrac{\kappa (\mu - \mu_0)^2}{2}\right),

where the gamma is parameterized by shape and rate.[8] The shape parameter gains an additional 1/2 from the \lambda^{1/2} factor contributed by the univariate normal component of the joint density, while the rate parameter augments the prior rate \beta by a term proportional to the squared deviation of \mu from \mu_0, weighted by \kappa. This adjustment penalizes values of \mu far from the prior mean (a higher rate concentrates the conditional on smaller precisions), capturing the interdependence between the mean and precision parameters.[9]

These conditionals facilitate Gibbs sampling from the joint normal-gamma distribution, as each is available in closed form and conjugate to the other, allowing iterative draws: sample \mu from its conditional given the current \lambda, then sample \lambda from its conditional given the new \mu.[10] This approach is particularly useful for posterior inference in Bayesian models in which the normal-gamma serves as a prior, enabling efficient exploration of the parameter space without direct integration of the joint density.[8]
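The two conditionals can be alternated directly as described. A minimal NumPy sketch of this Gibbs scheme for the prior itself (hyperparameter values and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
mu0, kappa, alpha, beta = 0.0, 1.5, 2.0, 2.0

n_iter = 50_000
mu, lam = mu0, alpha / beta            # arbitrary starting values
samples = np.empty((n_iter, 2))

for i in range(n_iter):
    # mu | lambda ~ N(mu0, (kappa * lambda)^{-1})
    mu = rng.normal(mu0, 1.0 / np.sqrt(kappa * lam))
    # lambda | mu ~ Gamma(alpha + 1/2, rate = beta + kappa * (mu - mu0)^2 / 2)
    rate = beta + 0.5 * kappa * (mu - mu0) ** 2
    lam = rng.gamma(alpha + 0.5, 1.0 / rate)   # NumPy's gamma takes shape and scale = 1/rate
    samples[i] = mu, lam

# After burn-in, sample means should approach E[mu] = mu0 and E[lambda] = alpha / beta
print(samples[5_000:].mean(axis=0))
```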
Moments and expectations
The expected value of the mean parameter \mu in the normal-gamma distribution is E[\mu] = \mu_0. This follows from the symmetry of the conditional normal distribution around \mu_0 and the law of total expectation.[11] The variance of \mu is \mathrm{Var}(\mu) = \beta / (\kappa (\alpha - 1)) for \alpha > 1. To derive this, note that the marginal distribution of \mu is a non-standardized Student's t-distribution with 2\alpha degrees of freedom, location \mu_0, and scale parameter \sqrt{\beta / (\alpha \kappa)}; the variance of a Student's t random variable with \nu degrees of freedom and scale s is s^2 \nu / (\nu - 2), yielding [\beta / (\alpha \kappa)] \cdot 2\alpha / (2\alpha - 2) = \beta / (\kappa (\alpha - 1)). Alternatively, by the law of total variance, \mathrm{Var}(\mu) = E[\mathrm{Var}(\mu \mid \lambda)] + \mathrm{Var}(E[\mu \mid \lambda]) = E[1 / (\kappa \lambda)] + 0 = (1/\kappa) E[1/\lambda], where 1/\lambda follows an inverse-gamma distribution with shape \alpha and scale \beta, so E[1/\lambda] = \beta / (\alpha - 1).[11][3]

The precision parameter \lambda follows a gamma distribution with shape \alpha and rate \beta, so its expected value is E[\lambda] = \alpha / \beta and its variance is \mathrm{Var}(\lambda) = \alpha / \beta^2; both are finite for any \alpha > 0.[10]

Higher moments, such as the second moment of \mu, can be computed as E[\mu^2] = \mathrm{Var}(\mu) + (E[\mu])^2 = \mu_0^2 + \beta / (\kappa (\alpha - 1)). This uses the law of total expectation: E[\mu^2] = E[E[\mu^2 \mid \lambda]] = E[\mu_0^2 + \mathrm{Var}(\mu \mid \lambda)] = \mu_0^2 + E[1 / (\kappa \lambda)] = \mu_0^2 + (1/\kappa) \cdot \beta / (\alpha - 1). Similarly, the cross-moment E[\mu \lambda] = E[E[\mu \mid \lambda] \cdot \lambda] = E[\mu_0 \lambda] = \mu_0 E[\lambda] = \mu_0 \alpha / \beta. These follow from the hierarchical conditional structure and the known moments of the gamma distribution.[11][10]

The mode of the joint distribution occurs at \mu = \mu_0 and \lambda = (\alpha - 1/2) / \beta for \alpha > 1/2. To see this, consider the log-density (up to constants): \log p(\mu, \lambda) = (\alpha - 1/2) \log \lambda - \beta \lambda - \kappa \lambda (\mu - \mu_0)^2 / 2. The partial derivative with respect to \mu is -\kappa \lambda (\mu - \mu_0), which is zero at \mu = \mu_0. Substituting into the partial derivative with respect to \lambda gives (\alpha - 1/2)/\lambda - \beta = 0, which solves to \lambda = (\alpha - 1/2) / \beta.[11]
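These closed-form moments can be verified by Monte Carlo, again sampling via the marginal-conditional factorization (a sketch with arbitrary hyperparameter values):

```python
import numpy as np

rng = np.random.default_rng(2)
mu0, kappa, alpha, beta = 2.0, 3.0, 4.0, 5.0

lam = rng.gamma(alpha, 1.0 / beta, size=1_000_000)
mu = rng.normal(mu0, 1.0 / np.sqrt(kappa * lam))

print(mu.mean(), mu0)                          # E[mu] = mu0
print(mu.var(), beta / (kappa * (alpha - 1)))  # Var(mu) = beta / (kappa (alpha - 1))
print(lam.mean(), alpha / beta)                # E[lambda] = alpha / beta
print(lam.var(), alpha / beta**2)              # Var(lambda) = alpha / beta^2
print((mu * lam).mean(), mu0 * alpha / beta)   # E[mu lambda] = mu0 alpha / beta
```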
Exponential family form
The normal-gamma distribution is a member of the four-parameter exponential family, allowing it to be expressed in the canonical form

p(\mu, \lambda \mid \boldsymbol{\eta}) = h(\mu, \lambda) \exp\left( \boldsymbol{\eta}^\top t(\mu, \lambda) - A(\boldsymbol{\eta}) \right),

where the base measure is h(\mu, \lambda) = 1, the sufficient statistic is the vector t(\mu, \lambda) = (\mu^2 \lambda, \mu \lambda, \lambda, \log \lambda), and the natural parameters are

\boldsymbol{\eta} = \left( -\frac{\kappa}{2},\; \kappa \mu_0,\; -\beta - \frac{\kappa \mu_0^2}{2},\; \alpha - \frac{1}{2} \right).

This representation arises by taking the logarithm of the joint density, expanding the quadratic form (\mu - \mu_0)^2 in the conditional normal component to yield terms in \mu^2 \lambda and \mu \lambda, and combining these with the gamma terms in \lambda and \log \lambda. The log-partition function A(\boldsymbol{\eta}) ensures normalization and can be derived explicitly using properties of the gamma function.[12]

The moments of the sufficient statistics follow from derivatives of A(\boldsymbol{\eta}) with respect to the natural parameters. Specifically, the expectation E[\lambda] = \frac{\partial A}{\partial \eta_3} = \frac{\alpha}{\beta} corresponds to the mean of the marginal gamma distribution on \lambda. For the quadratic term, E[\lambda (\mu - \mu_0)^2] = \frac{1}{\kappa}, obtained by conditioning on \lambda and using the conditional variance \mathrm{Var}(\mu \mid \lambda) = (\kappa \lambda)^{-1}, which simplifies independently of the gamma hyperparameters. These moments highlight the separation between the location-scale structure for \mu and the precision parameterization for \lambda.[13]

Membership in the exponential family enables straightforward conjugate Bayesian updates, in which the posterior natural parameters are the sum of the prior natural parameters and the data's sufficient statistics, scaled appropriately. This structure supports efficient inference in hierarchical models with normal likelihoods, as the posterior remains normal-gamma with updated hyperparameters that linearly incorporate sample sums such as \sum x_i and \sum x_i^2, facilitating scalable computation without numerical integration.[13]
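As a numerical illustration (a sketch; the explicit log-partition expression A = \log \Gamma(\alpha) - \alpha \log \beta - \tfrac{1}{2} \log \kappa + \tfrac{1}{2} \log 2\pi below is worked out from the normalizing constant of the joint density rather than quoted from a reference), the canonical form reproduces the joint log-density computed from the product form:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import norm, gamma

mu0, kappa, alpha, beta = 0.5, 2.0, 3.0, 1.5
mu, lam = 1.1, 0.7

# Natural parameters and sufficient statistics
eta = np.array([-kappa / 2.0, kappa * mu0, -beta - kappa * mu0**2 / 2.0, alpha - 0.5])
t = np.array([mu**2 * lam, mu * lam, lam, np.log(lam)])

# Log-partition function, written in terms of the original hyperparameters
A = gammaln(alpha) - alpha * np.log(beta) - 0.5 * np.log(kappa) + 0.5 * np.log(2 * np.pi)

canonical = eta @ t - A
product = (norm.logpdf(mu, mu0, 1.0 / np.sqrt(kappa * lam))
           + gamma.logpdf(lam, alpha, scale=1.0 / beta))
print(canonical, product)  # the two log-densities agree
```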
Bayesian usage
Conjugate prior for normal parameters
In Bayesian statistics, the normal-gamma distribution serves as a conjugate prior for the mean \mu and precision \lambda of independent and identically distributed observations x_1, \dots, x_n from a normal distribution, where each x_i \sim \mathrm{Normal}(\mu, \lambda^{-1}). The prior is parameterized as (\mu, \lambda) \sim \mathrm{Normal\text{-}Gamma}(\mu_0, \kappa, \alpha, \beta), with \mu_0 representing the prior location for the mean, \kappa scaling the prior precision around \mu_0, and \alpha, \beta as the shape and rate parameters of the marginal gamma prior on \lambda.[2] The conjugacy property stems from the form of the likelihood, which is proportional to \lambda^{n/2} \exp\left\{ -\frac{\lambda}{2} \sum_{i=1}^n (x_i - \mu)^2 \right\}, matching the kernel of the normal-gamma prior and ensuring that the posterior remains in the same family.[1] This setup enables closed-form updates of the posterior hyperparameters, preserving analytical tractability.[2]

The normal-gamma prior was introduced in Bayesian decision theory for normal models with unknown variance, overcoming a limitation of the earlier normal-normal conjugate prior, which requires a known variance and therefore cannot jointly infer the mean and precision.[12] It appeared in foundational work on applied statistical decision theory and was later formalized in treatments of optimal statistical decisions.[12] By maintaining conjugacy, the normal-gamma prior supports exact Bayesian inference through simple parameter updates, avoiding the computational demands of non-conjugate alternatives that often require approximation methods such as Markov chain Monte Carlo.[1]
Posterior distribution derivation
The normal-gamma distribution serves as a conjugate prior for the mean \mu and precision \lambda = 1/\sigma^2 of a normal likelihood x_i \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \lambda^{-1}), i = 1, \dots, n. The prior is parameterized as \mu \mid \lambda \sim \mathcal{N}(\mu_0, (\kappa \lambda)^{-1}) and \lambda \sim \text{Gamma}(\alpha, \beta), where the gamma distribution uses the shape-rate parameterization with density proportional to \lambda^{\alpha-1} e^{-\beta \lambda}.[2]

The posterior distribution p(\mu, \lambda \mid \mathbf{x}) is derived by multiplying the prior density by the likelihood and recognizing the resulting form as another normal-gamma distribution. The likelihood is

p(\mathbf{x} \mid \mu, \lambda) = (2\pi)^{-n/2} \lambda^{n/2} \exp\left( -\frac{\lambda}{2} \sum_{i=1}^n (x_i - \mu)^2 \right),

where \mathbf{x} = (x_1, \dots, x_n) and \bar{x} = n^{-1} \sum_{i=1}^n x_i. The prior density is

p(\mu, \lambda) \propto \lambda^{1/2} \exp\left( -\frac{\kappa \lambda}{2} (\mu - \mu_0)^2 \right) \cdot \lambda^{\alpha - 1} \exp(-\beta \lambda).

Combining these yields

p(\mu, \lambda \mid \mathbf{x}) \propto \lambda^{\alpha + n/2 - 1/2} \exp\left( -\lambda \left[ \beta + \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^2 + \frac{\kappa}{2} (\mu - \mu_0)^2 \right] \right).

To complete the derivation, expand the sum of squares as \sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + n (\mu - \bar{x})^2, denoting S = \sum_{i=1}^n (x_i - \bar{x})^2. The exponent becomes

-\lambda \left[ \beta + \frac{S}{2} + \frac{n}{2} (\mu - \bar{x})^2 + \frac{\kappa}{2} (\mu - \mu_0)^2 \right].

The quadratic terms in \mu combine as

\frac{n}{2} (\mu - \bar{x})^2 + \frac{\kappa}{2} (\mu - \mu_0)^2 = \frac{\kappa + n}{2} \left( \mu - \mu_n \right)^2 + C,

where \mu_n = (\kappa \mu_0 + n \bar{x}) / (\kappa + n) and C = \frac{\kappa n}{2(\kappa + n)} (\bar{x} - \mu_0)^2 is the term independent of \mu. Substituting back gives

p(\mu, \lambda \mid \mathbf{x}) \propto \lambda^{(\alpha + n/2) - 1} \exp\left( -\lambda \left[ \beta + \frac{S}{2} + C \right] \right) \cdot \lambda^{1/2} \exp\left( -\frac{(\kappa + n) \lambda}{2} (\mu - \mu_n)^2 \right).

This matches the kernel of a normal-gamma posterior with updated hyperparameters \mu_n = (\kappa \mu_0 + n \bar{x}) / (\kappa + n), \kappa_n = \kappa + n, \alpha_n = \alpha + n/2, and \beta_n = \beta + \frac{1}{2} \left[ S + \frac{\kappa n (\bar{x} - \mu_0)^2}{\kappa + n} \right].[2]

The posterior predictive distribution for a new observation x^* \mid \mathbf{x} integrates out \mu and \lambda from the posterior, yielding a non-standardized Student's t-distribution:

x^* \mid \mathbf{x} \sim t_{2\alpha_n} \left( \mu_n,\; \frac{\beta_n (1 + 1/\kappa_n)}{\alpha_n} \right).

This follows from the marginal posterior for \mu being a t-distribution and the conditional x^* \mid \mu, \mathbf{x} \sim \mathcal{N}(\mu, \lambda^{-1}) with \lambda integrated out.[2]
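These closed-form updates are straightforward to implement. A sketch (function names are illustrative) using NumPy and SciPy that returns the posterior hyperparameters and the posterior predictive t-distribution for a data vector:

```python
import numpy as np
from scipy import stats

def posterior_update(x, mu0, kappa, alpha, beta):
    """Return the normal-gamma posterior hyperparameters given data x."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    S = np.sum((x - xbar) ** 2)                       # sum of squared deviations
    kappa_n = kappa + n
    mu_n = (kappa * mu0 + n * xbar) / kappa_n
    alpha_n = alpha + n / 2.0
    beta_n = beta + 0.5 * (S + kappa * n * (xbar - mu0) ** 2 / kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

def posterior_predictive(mu_n, kappa_n, alpha_n, beta_n):
    """Posterior predictive for a new observation: non-standardized Student's t."""
    scale = np.sqrt(beta_n * (1.0 + 1.0 / kappa_n) / alpha_n)
    return stats.t(df=2.0 * alpha_n, loc=mu_n, scale=scale)

x = np.array([1.2, 0.7, 1.9, 1.4, 0.8])
params = posterior_update(x, mu0=0.0, kappa=1.0, alpha=2.0, beta=2.0)
print(params, posterior_predictive(*params).interval(0.95))
```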
Parameter update rules
The parameter update rules for the normal-gamma distribution enable efficient Bayesian inference on the mean and precision of a normal likelihood, particularly when data arrive sequentially. These rules leverage the conjugacy property, ensuring that the posterior remains in the normal-gamma family after incorporating new observations. For a single new observation x_n from a normal distribution with unknown mean \mu and precision \lambda, the hyperparameters are updated incrementally as follows:

\kappa_n = \kappa_{n-1} + 1

\mu_n = \frac{\kappa_{n-1} \mu_{n-1} + x_n}{\kappa_n} = \mu_{n-1} + \frac{x_n - \mu_{n-1}}{\kappa_n}

\alpha_n = \alpha_{n-1} + \frac{1}{2}

\beta_n = \beta_{n-1} + \frac{\kappa_{n-1} (x_n - \mu_{n-1})^2}{2 \kappa_n}

These formulas adjust the prior precision multiplier \kappa, location \mu, gamma shape \alpha, and gamma rate \beta to reflect the new evidence, with the update for \beta accounting for the squared deviation weighted by the relative prior strength \kappa_{n-1}/\kappa_n.[8]

In batch processing, the updates aggregate sufficient statistics from n independent and identically distributed observations: the sample size n, the sample mean \bar{x}, and the sum of squared deviations SS = \sum_{i=1}^n (x_i - \bar{x})^2. The batch formulas are then

\kappa_n = \kappa_0 + n

\mu_n = \frac{\kappa_0 \mu_0 + n \bar{x}}{\kappa_n}

\alpha_n = \alpha_0 + \frac{n}{2}

\beta_n = \beta_0 + \frac{SS}{2} + \frac{\kappa_0 n (\bar{x} - \mu_0)^2}{2 \kappa_n},

where the subscript 0 denotes the initial prior values. This batch approach is mathematically equivalent to repeated application of the sequential rules for i.i.d. data, as the order of observations does not affect the final posterior due to the exchangeability of the likelihood.[12][8]

The sequential update rules offer significant advantages in streaming or online settings, where data arrive incrementally and computational resources must be managed efficiently. By avoiding recomputation of sufficient statistics over the entire dataset at each step, they enable real-time Bayesian updating, which is common in machine learning applications such as adaptive filtering and online probabilistic modeling.[8]
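A short numerical check (a sketch; the sequential_update helper is illustrative) confirms that folding observations in one at a time with the incremental rules reproduces the batch formulas:

```python
import numpy as np

def sequential_update(x_new, mu, kappa, alpha, beta):
    """One-observation normal-gamma update."""
    kappa_n = kappa + 1.0
    mu_n = mu + (x_new - mu) / kappa_n
    alpha_n = alpha + 0.5
    beta_n = beta + kappa * (x_new - mu) ** 2 / (2.0 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

rng = np.random.default_rng(3)
x = rng.normal(1.0, 0.5, size=100)
mu0, kappa0, alpha0, beta0 = 0.0, 1.0, 2.0, 2.0

# Sequential: fold in observations one at a time
params = (mu0, kappa0, alpha0, beta0)
for xi in x:
    params = sequential_update(xi, *params)

# Batch: aggregate sufficient statistics once
n, xbar = x.size, x.mean()
SS = np.sum((x - xbar) ** 2)
kappa_n = kappa0 + n
batch = ((kappa0 * mu0 + n * xbar) / kappa_n, kappa_n, alpha0 + n / 2.0,
         beta0 + SS / 2.0 + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n))

print(np.allclose(params, batch))  # True: the two routes agree
```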
Sampling and computation
Random variate generation
To generate random variates from the normal-gamma distribution \mathrm{NG}(\mu, \lambda \mid \mu_0, \kappa, \alpha, \beta), whose joint density is the product of a conditional normal distribution for the mean \mu given the precision \lambda and a marginal gamma distribution for \lambda, one efficient approach uses the known marginal and conditional forms. First draw \lambda from a gamma distribution with shape parameter \alpha and rate parameter \beta, \mathrm{Gamma}(\alpha, \beta). Then, conditional on this \lambda, draw \mu from a normal distribution with mean \mu_0 and variance (\kappa \lambda)^{-1}, denoted \mathcal{N}(\mu_0, (\kappa \lambda)^{-1}). This direct method is exact and computationally straightforward, leveraging the marginal-conditional factorization of the joint density, and requires only standard univariate random number generators for the gamma and normal distributions.

For cases where direct sampling is impractical or the distribution is embedded in more complex models, a Gibbs sampler can be used to simulate from the joint distribution. The Gibbs sampler alternates between the full conditional distributions \mu \mid \lambda \sim \mathcal{N}(\mu_0, (\kappa \lambda)^{-1}) and \lambda \mid \mu \sim \mathrm{Gamma}(\alpha + 1/2, \beta + \kappa (\mu - \mu_0)^2 / 2). The updated shape parameter \alpha + 1/2 arises from the \lambda^{1/2} factor contributed by the normal conditional, while the quadratic term in \mu augments the rate. This Markov chain Monte Carlo (MCMC) procedure converges to the target joint distribution under standard regularity conditions, though for the bivariate normal-gamma prior on its own the direct method is typically preferred because of its simplicity and lack of autocorrelation.

In practice, random variate generation from the normal-gamma distribution is supported in probabilistic programming libraries used for Bayesian workflows. For instance, in Stan, users can implement the direct sampler with the built-in gamma_rng and normal_rng functions in the generated quantities block or in user-defined functions. Similarly, PyMC provides Gamma and Normal distribution classes that allow the joint to be defined straightforwardly via the marginal-conditional approach. These implementations facilitate efficient simulation, especially when the normal-gamma serves as a conjugate prior in larger hierarchical models.
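A minimal NumPy sketch of the direct method (the function name and hyperparameter values are illustrative; note that NumPy's gamma generator is parameterized by shape and scale, so the rate \beta enters as scale 1/\beta):

```python
import numpy as np

def sample_normal_gamma(mu0, kappa, alpha, beta, size, rng=None):
    """Draw (mu, lambda) pairs from NG(mu0, kappa, alpha, beta) via the direct method."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.gamma(shape=alpha, scale=1.0 / beta, size=size)   # marginal gamma draw
    mu = rng.normal(loc=mu0, scale=1.0 / np.sqrt(kappa * lam))  # conditional normal draw
    return mu, lam

mu, lam = sample_normal_gamma(mu0=0.0, kappa=2.0, alpha=3.0, beta=2.0, size=5)
print(np.column_stack([mu, lam]))
```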