Normal-inverse-gamma distribution
The Normal-inverse-gamma distribution is a four-parameter continuous probability distribution that jointly specifies a normal distribution for the mean conditional on the variance and an inverse-gamma distribution for the variance itself, serving as the conjugate prior for the mean and variance of a univariate normal likelihood when both parameters are unknown.[1] This conjugacy ensures that the posterior distribution after observing data remains in the same family, enabling closed-form Bayesian updates without numerical approximation.[2] The distribution is commonly parameterized by hyperparameters μ₀ (prior mean of the mean), κ₀ > 0 (prior precision scaling factor, akin to a virtual sample size), α₀ > 0 (shape parameter of the inverse-gamma), and β₀ > 0 (scale parameter of the inverse-gamma; equivalently, the rate of the corresponding gamma prior on the precision).[1] The joint probability density function is given by p(\mu, \sigma^2 \mid \mu_0, \kappa_0, \alpha_0, \beta_0) = \mathcal{N}(\mu \mid \mu_0, \sigma^2 / \kappa_0) \cdot \text{IG}(\sigma^2 \mid \alpha_0, \beta_0),
where \mathcal{N} denotes the normal density and IG the inverse-gamma density.[2] Alternative notations, such as using precision \lambda = 1/\sigma^2 with a gamma prior on \lambda, are equivalent through reparameterization.[1] Key properties include the marginal distribution of the mean \mu, which follows a non-standardized Student's t-distribution with location \mu_0, scale \sqrt{\beta_0 / (\alpha_0 \kappa_0)}, and 2\alpha_0 degrees of freedom; this arises from integrating out the variance.[2]

The posterior updates are straightforward: for n i.i.d. observations from \mathcal{N}(\mu, \sigma^2), the posterior hyperparameters become \mu_n = (\kappa_0 \mu_0 + n \bar{x}) / \kappa_n, \kappa_n = \kappa_0 + n, \alpha_n = \alpha_0 + n/2, and \beta_n = \beta_0 + (1/2) \sum_{i=1}^n (x_i - \bar{x})^2 + (\kappa_0 n / (2 \kappa_n)) (\bar{x} - \mu_0)^2, where \bar{x} is the sample mean.[1] Consequently, the posterior predictive distribution for a new observation is a Student's t-distribution with location \mu_n, scale \sqrt{\frac{\beta_n}{\alpha_n} \left(1 + \frac{1}{\kappa_n}\right)}, and 2\alpha_n degrees of freedom.[2]

In Bayesian statistics, the normal-inverse-gamma prior is foundational for modeling uncertainty in normal data with unknown parameters, particularly in hierarchical models, linear regression, and empirical Bayes estimation, where it facilitates tractable inference and robust handling of vague priors through careful hyperparameter choice.[1] It extends naturally to multivariate settings via the normal-inverse-Wishart distribution, but remains distinct in its univariate form for scalar normal models.[2]
Fundamentals
Definition
The normal-inverse-gamma (NIG) distribution is a four-parameter family of joint continuous probability distributions defined over the mean \mu and variance \sigma^2 (or equivalently, precision \tau = 1/\sigma^2) of a univariate normal distribution, making it a key tool in Bayesian statistics for modeling uncertainty in these parameters. It arises as the product of a conditional normal distribution for the mean given the variance and a marginal inverse-gamma distribution for the variance itself, ensuring analytical tractability in posterior computations.[3] Formally, a random vector (\mu, \sigma^2) follows an NIG distribution with parameters \mu_0 (location), \lambda > 0 (precision scaling for the mean), \alpha > 0 (shape), and \beta > 0 (scale), denoted (\mu, \sigma^2) \sim \text{NIG}(\mu_0, \lambda, \alpha, \beta), if the conditional distribution is \mu \mid \sigma^2 \sim \mathcal{N}\left(\mu_0, \frac{\sigma^2}{\lambda}\right) and the marginal distribution is \sigma^2 \sim \text{IG}(\alpha, \beta), where \text{IG}(\alpha, \beta) denotes the inverse-gamma distribution with shape \alpha and scale \beta. This parameterization reflects the scaling of the variance in the conditional normal by the precision factor \lambda, which controls the strength of belief around \mu_0.[3] The NIG distribution originated in the context of conjugate prior families for Bayesian decision theory, with foundational work by Raiffa and Schlaifer (1961), who developed analytical forms for updating beliefs under normal likelihoods with unknown mean and variance.[4] It was later formalized and popularized in computational Bayesian statistics. Its definition builds on two familiar distributions: the normal distribution \mathcal{N}(\mu, \sigma^2), and the inverse-gamma distribution, which is supported on the positive reals and serves as the conjugate prior for the variance of a normal with known mean. The NIG plays a central role as the conjugate prior for normal models, enabling straightforward posterior updates when combined with data from a normal likelihood.
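In compact hierarchical notation, the definition reads
(\mu, \sigma^2) \sim \mathrm{NIG}(\mu_0, \lambda, \alpha, \beta) \quad\Longleftrightarrow\quad \sigma^2 \sim \mathrm{IG}(\alpha, \beta) \ \text{ and } \ \mu \mid \sigma^2 \sim \mathcal{N}\!\left(\mu_0, \frac{\sigma^2}{\lambda}\right),
so a draw from the joint distribution is obtained by first drawing the variance and then drawing the mean given that variance.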
Parameter Interpretation
The normal-inverse-gamma distribution is parameterized by four values: \mu_0, \lambda, \alpha, and \beta, which encode prior beliefs about the mean \mu and variance \sigma^2 of an underlying normal distribution in Bayesian models. The parameter \mu_0 represents the prior mean of \mu, serving as a location estimate or best guess for the central tendency based on expert knowledge or historical data.[5] It acts as the expected value around which the conditional normal distribution for \mu is centered, given \sigma^2.[5] The parameter \lambda > 0 scales the variance of the conditional normal distribution for \mu (specifically, the conditional variance is \sigma^2 / \lambda), thereby quantifying prior confidence in \mu_0. A larger \lambda implies tighter concentration around \mu_0, equivalent to greater prior precision or an effective prior sample size, reflecting stronger belief in the prior mean; conversely, small \lambda (e.g., approaching 0) yields diffuse priors with less influence from \mu_0.[5] This interpretation allows \lambda to mimic the information content of hypothetical prior observations.[5] For the inverse-gamma component on \sigma^2, \alpha > 0 is the shape parameter, which controls the effective prior sample size or degrees of freedom for the variance estimate, with larger \alpha indicating a more informative prior concentrated around a specific scale. The scale parameter \beta > 0 sets the prior scale of \sigma^2, influencing its expected magnitude; for \alpha > 1, the prior mean of \sigma^2 is \beta / (\alpha - 1), providing a direct way to specify anticipated variance levels based on domain expertise.[5] Together, \alpha and \beta shape the marginal prior for \sigma^2, balancing tail heaviness and central tendency.[5] In practice, parameter selection draws from substantive knowledge, such as setting \mu_0 to a historical average and \lambda to reflect equivalent prior data points (e.g., \lambda = 1 for modest confidence). For \alpha and \beta, values can be chosen to match desired prior moments, like targeting a specific mean and variance for \sigma^2; a sketch of this moment matching appears below. Noninformative priors, such as \alpha = 0 and \beta = 0, aim for minimal influence but are improper (integrating to infinity), potentially leading to improper posteriors if data are insufficient; thus, weakly informative alternatives (e.g., small positive \alpha, \beta) are recommended to ensure stability while avoiding undue regularization.[5] Sensitivity analyses to prior choices are essential for robust inference.[5] In multivariate settings, the normal-inverse-gamma extends to the normal-inverse-Wishart distribution, where the inverse-gamma on \sigma^2 generalizes to an inverse-Wishart on the covariance matrix \Sigma, accommodating correlations among dimensions that the univariate form cannot capture.[6]
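As an illustration of such moment matching, the following sketch (the helper name is illustrative, not from any library) inverts the inverse-gamma identities \mathbb{E}[\sigma^2] = \beta/(\alpha - 1) and \mathrm{Var}(\sigma^2) = \beta^2 / ((\alpha - 1)^2 (\alpha - 2)) to recover \alpha and \beta from a target prior mean and variance of \sigma^2:

```python
def ig_hyperparams_from_moments(mean_sigma2, var_sigma2):
    """Solve for inverse-gamma (alpha, beta) matching a target prior mean and
    variance of sigma^2; requires alpha > 2, i.e. a finite prior variance."""
    alpha = mean_sigma2 ** 2 / var_sigma2 + 2.0   # from Var = mean^2 / (alpha - 2)
    beta = mean_sigma2 * (alpha - 1.0)            # from mean = beta / (alpha - 1)
    return alpha, beta

# Example: prior belief that sigma^2 is about 4 with prior variance 8
alpha0, beta0 = ig_hyperparams_from_moments(4.0, 8.0)   # gives alpha0 = 4.0, beta0 = 12.0
```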
Characterization
Probability Density Function
The normal-inverse-gamma distribution models the joint distribution of a normal mean \mu and variance \sigma^2 > 0, where \mu \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2 / \lambda) conditionally and \sigma^2 \sim \text{InvGamma}(\alpha, \beta) marginally, with parameters \mu_0 \in \mathbb{R} (location), \lambda > 0 (precision scale), \alpha > 0 (shape), and \beta > 0 (scale).[1] The joint probability density function is f(\mu, \sigma^2 \mid \mu_0, \lambda, \alpha, \beta) = \frac{\lambda^{1/2}}{\sigma \sqrt{2\pi}} \exp\left\{ -\frac{\lambda (\mu - \mu_0)^2}{2 \sigma^2} \right\} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)} (\sigma^2)^{-\alpha - 1} \exp\left\{ -\frac{\beta}{\sigma^2} \right\}, defined over the support \mu \in \mathbb{R}, \sigma^2 > 0.[1] This form follows directly from multiplying the conditional normal density (with variance \sigma^2 / \lambda) and the marginal inverse-gamma density (with shape \alpha and scale \beta).[1] To confirm normalization, integrate the joint density over the support: first, \int_{-\infty}^{\infty} f(\mu, \sigma^2 \mid \mu_0, \lambda, \alpha, \beta) \, d\mu = \frac{\beta^\alpha}{\Gamma(\alpha)} (\sigma^2)^{-\alpha - 1} \exp\left\{ -\frac{\beta}{\sigma^2} \right\}, since the inner normal integral equals 1; then, \int_0^{\infty} \frac{\beta^\alpha}{\Gamma(\alpha)} (\sigma^2)^{-\alpha - 1} \exp\left\{ -\frac{\beta}{\sigma^2} \right\} \, d\sigma^2 = 1, as this is the normalizing integral of the inverse-gamma density.[7] An alternative parameterization expresses the joint in terms of the mean \mu and precision \tau = 1 / \sigma^2 > 0, where \mu \mid \tau \sim \mathcal{N}(\mu_0, 1 / (\lambda \tau)) and \tau \sim \text{Gamma}(\alpha, \beta) (rate parameterization), yielding f(\mu, \tau \mid \mu_0, \lambda, \alpha, \beta) = \frac{(\lambda \tau)^{1/2}}{\sqrt{2\pi}} \exp\left\{ -\frac{\lambda \tau (\mu - \mu_0)^2}{2} \right\} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)} \tau^{\alpha - 1} \exp\left\{ -\beta \tau \right\}. This follows from the change of variables \tau = 1 / \sigma^2, with Jacobian |\frac{d\sigma^2}{d\tau}| = 1 / \tau^2, transforming the inverse-gamma to gamma.[7] A related reparameterization for the variance uses the scale-inverse-chi-squared distribution, Inv-\chi^2(\nu, s^2), which is inverse-gamma with \alpha = \nu / 2 and \beta = \nu s^2 / 2; this form is common in modern software implementations such as Stan for its interpretability in terms of degrees of freedom \nu and scale s^2.
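A minimal numerical sketch of this factorization, assuming SciPy's parameterization in which invgamma with scale=\beta matches \text{IG}(\alpha, \beta) above (the function name is illustrative, not a library API):

```python
import numpy as np
from scipy.stats import norm, invgamma

def nig_logpdf(mu, sigma2, mu0, lam, alpha, beta):
    """log f(mu, sigma^2) = log N(mu | mu0, sigma^2/lam) + log IG(sigma^2 | alpha, beta)."""
    log_conditional_normal = norm.logpdf(mu, loc=mu0, scale=np.sqrt(sigma2 / lam))
    log_marginal_invgamma = invgamma.logpdf(sigma2, a=alpha, scale=beta)
    return log_conditional_normal + log_marginal_invgamma

# Example: evaluate the joint log-density at (mu, sigma^2) = (0.5, 1.2)
print(nig_logpdf(0.5, 1.2, mu0=0.0, lam=1.0, alpha=2.0, beta=3.0))
```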
Cumulative Distribution Function
The cumulative distribution function (CDF) of the normal-inverse-gamma distribution, which jointly models a mean parameter \mu and variance parameter \sigma^2, is defined as F(x, s) = \int_{-\infty}^{x} \int_{0}^{s} f(u, v) \, dv \, du, where f(u, v) denotes the joint probability density function of the distribution, supported on \mu \in (-\infty, \infty) and \sigma^2 \in (0, \infty). This represents the probability P(\mu \leq x, \sigma^2 \leq s). The joint PDF f(u, v) takes the form of a normal distribution for \mu conditional on \sigma^2 = v, multiplied by an inverse-gamma density for v, with hyperparameters typically denoted as location m, precision scalar k > 0, shape \nu/2 > 0, and scale s_0^2/2 > 0: f(u, v) = \frac{(k/ (2\pi v))^{1/2} \exp\left\{ -\frac{k}{2v} (u - m)^2 \right\} \cdot (s_0^2/2)^{\nu/2} v^{-(\nu/2 + 1)} \exp\left( - \frac{s_0^2}{2v} \right)}{\Gamma(\nu/2)}. Unlike the joint PDF, the normal-inverse-gamma CDF lacks a closed-form expression in terms of elementary functions or standard special functions, because the product of the conditional normal and marginal inverse-gamma components cannot be integrated analytically over the rectangular region. Evaluation therefore requires numerical approximation. Common computational approaches include quadrature methods for deterministic integration or Monte Carlo simulation, often via Markov chain Monte Carlo (MCMC) sampling from the joint distribution to estimate the enclosed probability mass empirically. Such methods are implemented in statistical libraries, where the CDF is evaluated by numerically integrating the joint density at the current (for example, posterior-updated) hyperparameters. For instance, software packages provide functions to evaluate the CDF at specified points (x, s) by leveraging the updated hyperparameters from conjugate prior updates.[8] In special cases, the joint CDF simplifies by conditioning or marginalizing. Fixing \sigma^2 = s, the conditional distribution of \mu \mid \sigma^2 = s is normal with mean m and variance s/k, so the conditional CDF reduces to the standard normal CDF \Phi\left( \sqrt{k/s} (x - m) \right). Fixing \mu = x, the conditional distribution of \sigma^2 given \mu = x is inverse-gamma with shape (\nu + 1)/2 and scale (s_0^2 + k(x - m)^2)/2, and its CDF evaluated at s is F(s) = \frac{\Gamma\left( \frac{\nu + 1}{2}, \frac{s_0^2 + k(x - m)^2}{2s} \right)}{\Gamma\left( \frac{\nu + 1}{2} \right)}, where \Gamma(\cdot, \cdot) is the upper incomplete gamma function. This joint CDF plays a role in Bayesian inference for computing multidimensional tail probabilities, such as 1 - F(x, s), which inform hypothesis tests on both location and scale parameters simultaneously, as seen in confirmatory clinical trial designs incorporating historical data.
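A minimal Monte Carlo sketch of such numerical evaluation, written in the (\mu_0, \lambda, \alpha, \beta) parameterization of earlier sections rather than the (m, k, \nu, s_0^2) notation above; it exploits the conditional structure by averaging the exact conditional normal CDF over inverse-gamma draws of \sigma^2 (the function name is illustrative):

```python
import numpy as np
from scipy.stats import norm, invgamma

def nig_cdf_mc(x, s, mu0, lam, alpha, beta, n_draws=100_000, seed=0):
    """Monte Carlo estimate of F(x, s) = P(mu <= x, sigma^2 <= s).
    Using the exact conditional normal CDF given each sampled sigma^2
    reduces variance compared with also sampling mu."""
    rng = np.random.default_rng(seed)
    sigma2 = invgamma.rvs(a=alpha, scale=beta, size=n_draws, random_state=rng)
    inside = sigma2 <= s                                   # indicator of sigma^2 <= s
    cond_cdf = norm.cdf(x, loc=mu0, scale=np.sqrt(sigma2 / lam))
    return np.mean(np.where(inside, cond_cdf, 0.0))

# Example: P(mu <= 0.5, sigma^2 <= 2) under NIG(0, 1, 2, 3)
print(nig_cdf_mc(0.5, 2.0, mu0=0.0, lam=1.0, alpha=2.0, beta=3.0))
```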
Properties
Marginal and Conditional Distributions
The normal-inverse-gamma distribution, denoted as NIG(μ₀, λ, α, β), specifies a joint distribution over the mean μ and variance σ², where the conditional distribution of μ given σ² is normal, μ | σ² ∼ N(μ₀, σ² / λ), and the marginal distribution of σ² is inverse-gamma, σ² ∼ IG(α, β).[1] The marginal distribution of μ is obtained by integrating the joint density over σ². The joint density is p(μ, σ²) ∝ (σ²)^{-1/2} exp\left( -\frac{λ (μ - μ₀)^2}{2 σ²} \right) (σ²)^{-α-1} exp\left( -\frac{β}{σ²} \right), where the constant of proportionality includes the normalization for the normal and inverse-gamma components. Integrating with respect to σ² yields a kernel proportional to \left[ β + \frac{λ (μ - μ₀)^2}{2} \right]^{-(α + 1/2)}, which is the kernel of a non-standardized (location-scale) Student's t-distribution with location parameter μ₀, degrees of freedom 2α, and scale parameter \sqrt{\frac{β}{α λ}}. Thus, μ ∼ t_{2α}\left(μ₀, \sqrt{\frac{β}{α λ}}\right). This result follows from recognizing the integral over σ² as the normalizing constant of an inverse-gamma distribution with shape α + 1/2 and scale β + \frac{λ (μ - μ₀)^2}{2}.[1][9] The marginal distribution of σ² remains unchanged from its component in the joint specification, σ² ∼ IG(α, β), since integrating the conditional density of μ | σ² over μ contributes only a factor of 1, leaving the inverse-gamma density intact.[1] The conditional distribution μ | σ² is directly given by the hierarchical structure as μ | σ² ∼ N(μ₀, σ² / λ), representing a normal distribution centered at the prior mean μ₀ with precision λ / σ².[1] The conditional distribution σ² | μ is derived from the joint by p(σ² | μ) ∝ p(μ | σ²) p(σ²), which simplifies to an inverse-gamma form: σ² | μ ∼ IG\left(α + \frac{1}{2}, β + \frac{λ (μ - μ₀)^2}{2}\right). This arises from combining the exponents in the joint density, yielding (σ²)^{-(α + 3/2)} exp\left( -\frac{β + \frac{λ (μ - μ₀)^2}{2}}{σ²} \right) up to normalization, and recognizing the standard inverse-gamma kernel. Equivalently, this can be expressed as a scaled inverse-chi-squared distribution with 2α + 1 degrees of freedom and scale parameter \frac{2β + λ (μ - μ₀)^2}{2α + 1}. The shape parameter increases by one half because conditioning on μ contributes a single additional squared-deviation term to the inverse-gamma kernel.[1]
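Written out, the integration over σ² uses the inverse-gamma normalizing constant \int_0^\infty v^{-a-1} e^{-b/v} \, dv = \Gamma(a)\, b^{-a}:
p(\mu) \;\propto\; \int_0^\infty (\sigma^2)^{-(\alpha + 1/2) - 1} \exp\left( -\frac{\beta + \frac{\lambda (\mu - \mu_0)^2}{2}}{\sigma^2} \right) d\sigma^2 \;=\; \Gamma\!\left(\alpha + \tfrac{1}{2}\right) \left[ \beta + \frac{\lambda (\mu - \mu_0)^2}{2} \right]^{-(\alpha + 1/2)} \;\propto\; \left[ 1 + \frac{1}{2\alpha} \cdot \frac{\alpha \lambda}{\beta} (\mu - \mu_0)^2 \right]^{-\frac{2\alpha + 1}{2}},
which is exactly the kernel of the t_{2α}(μ₀, \sqrt{β/(αλ)}) density quoted above.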
Moments and Arithmetic Operations
The normal-inverse-gamma distribution, parameterized as (\mu, \sigma^2) \sim \text{NIG}(\mu_0, \lambda, \alpha, \beta), exhibits the following key moments. The expected value of the mean parameter \mu is \mathbb{E}[\mu] = \mu_0, provided \alpha > 1/2 (so that the marginal t-distribution of \mu has more than one degree of freedom). The variance of \mu is \text{Var}(\mu) = \frac{\beta}{\lambda (\alpha - 1)} for \alpha > 1. The expected value of the variance parameter \sigma^2 is \mathbb{E}[\sigma^2] = \frac{\beta}{\alpha - 1} for \alpha > 1. Higher moments of \sigma^2 follow from its marginal inverse-gamma distribution and can be expressed using the gamma function: for integer k > 0 with \alpha > k, \mathbb{E}[(\sigma^2)^k] = \frac{\beta^k \Gamma(\alpha - k)}{\Gamma(\alpha)}.[10][11] The distribution is a member of the exponential family, with natural parameters derived from \alpha, \beta, \mu_0, and \lambda, and sufficient statistics \mu / \sigma^2, \mu^2 / \sigma^2, 1 / \sigma^2, and \log \sigma^2. This structure facilitates analytical computations for certain properties.[10] The information entropy of the normal-inverse-gamma distribution, which quantifies its uncertainty, is given by H = \frac{1}{2} + \frac{1}{2} \log \frac{2\pi}{\lambda} + \frac{3}{2} \log \beta + \log \Gamma(\alpha) + \alpha - \left(\alpha + \frac{3}{2}\right) \psi(\alpha), where \psi denotes the digamma function; it is the sum of the inverse-gamma entropy of \sigma^2 and the expected conditional entropy of \mu given \sigma^2.[12] A closed-form expression for the Kullback-Leibler divergence between two normal-inverse-gamma distributions exists, computed as the expectation under the first distribution of the log-ratio of their densities; it incorporates digamma functions and logarithmic terms on the parameters, analogous to the form for the related normal-gamma distribution.[13] Regarding arithmetic operations, the distribution of the sum of independent normal-inverse-gamma variates—corresponding to the joint prior on means and variances for summed normal random variables—is not a normal-inverse-gamma in closed form, requiring convolutions or numerical methods for exact evaluation; approximations such as moment matching are commonly employed in Bayesian analyses involving multiple components.[14] For scaling, consider a transformation to (k\mu, k^2 \sigma^2) for k > 0, which arises in rescaling normal variates with the prior. This yields another normal-inverse-gamma distribution with adjusted parameters: location k \mu_0, precision scale \lambda, shape \alpha, and scale k^2 \beta. The scaling of the inverse-gamma marginal for \sigma^2 by the factor k^2 directly supports this adjustment.[15]
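A small simulation sketch (with arbitrary parameter values) that checks the first two of these moments and \mathbb{E}[\sigma^2] against Monte Carlo averages:

```python
import numpy as np
from scipy.stats import norm, invgamma

mu0, lam, alpha, beta = 1.0, 2.0, 3.0, 4.0
rng = np.random.default_rng(0)
n_draws = 200_000

# Joint draws via the conditional structure: sigma^2 first, then mu given sigma^2
sigma2 = invgamma.rvs(a=alpha, scale=beta, size=n_draws, random_state=rng)
mu = norm.rvs(loc=mu0, scale=np.sqrt(sigma2 / lam), size=n_draws, random_state=rng)

print(mu.mean(), mu0)                          # E[mu] = mu0 = 1.0
print(mu.var(), beta / (lam * (alpha - 1)))    # Var(mu) = 4 / (2 * 2) = 1.0
print(sigma2.mean(), beta / (alpha - 1))       # E[sigma^2] = 4 / 2 = 2.0
```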
Inference and Estimation
Conjugacy and Posterior Updates
The normal-inverse-gamma distribution is a conjugate prior for the parameters of a normal likelihood when both the mean \mu and variance \sigma^2 are unknown. For independent and identically distributed observations x_1, \dots, x_n \sim \mathcal{N}(\mu, \sigma^2), if the prior is (\mu, \sigma^2) \sim \mathrm{NIG}(\mu_0, \kappa, \alpha, \beta), then the posterior distribution is also normal-inverse-gamma: (\mu, \sigma^2 \mid \mathbf{x}) \sim \mathrm{NIG}(\mu_n, \kappa_n, \alpha_n, \beta_n). This conjugacy ensures that posterior inference remains analytically tractable within the same parametric family.[1] The posterior parameters are updated as follows: \begin{align*} \kappa_n &= \kappa + n, \\ \mu_n &= \frac{\kappa \mu_0 + n \bar{x}}{\kappa_n}, \\ \alpha_n &= \alpha + \frac{n}{2}, \\ \beta_n &= \beta + \frac{n s^2}{2} + \frac{\kappa n (\bar{x} - \mu_0)^2}{2 \kappa_n}, \end{align*} where \bar{x} = n^{-1} \sum_{i=1}^n x_i is the sample mean and s^2 = n^{-1} \sum_{i=1}^n (x_i - \bar{x})^2 is the (biased) sample variance. These updates reflect the incorporation of data evidence: \kappa_n accumulates effective sample size, \mu_n weights the prior mean against the sample mean, \alpha_n increases the shape by half the number of observations, and \beta_n augments the scale with the data dispersion and the prior-data discrepancy.[1] This conjugacy arises from Bayes' theorem, where the posterior kernel is the product of the likelihood and prior densities. The normal likelihood expands to p(\mathbf{x} \mid \mu, \sigma^2) \propto (\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \left[ \sum_{i=1}^n (x_i - \bar{x})^2 + n (\bar{x} - \mu)^2 \right] \right), and multiplying by the normal-inverse-gamma prior—which has a kernel involving \exp\left( -\frac{\kappa (\mu - \mu_0)^2}{2 \sigma^2} - \frac{\beta}{\sigma^2} \right) (\sigma^2)^{-(\alpha + 1)}—yields terms that complete the square in \mu and match the inverse-gamma form for \sigma^2, confirming the posterior family.[1][16] The update rules support sequential Bayesian inference, where parameters evolve incrementally with each new observation. Starting from the prior (\mu_0, \kappa, \alpha, \beta), adding the first datum updates to (\mu_1, \kappa + 1, \alpha + 1/2, \beta_1), and subsequent points apply the same formulas with n incremented by 1, enabling efficient online learning without refitting the entire dataset.[1] The choice of prior parameters \kappa and \alpha can enhance robustness to outliers in the data. A larger \kappa assigns greater weight to the prior mean \mu_0 in \mu_n, mitigating the impact of aberrant observations on the posterior mean. Meanwhile, \alpha determines the degrees of freedom 2\alpha_n in the posterior predictive distribution, a Student's t-distribution; smaller initial \alpha yields heavier tails post-update, providing greater downweighting of outliers compared to the lighter tails of a normal predictive.[1]
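The batch update rules translate directly into code; a minimal sketch (the function name is illustrative, not from a library):

```python
import numpy as np

def nig_posterior(mu0, kappa, alpha, beta, x):
    """Closed-form NIG posterior hyperparameters for i.i.d. normal data x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    ss = np.sum((x - xbar) ** 2)                 # equals n * s^2
    kappa_n = kappa + n
    mu_n = (kappa * mu0 + n * xbar) / kappa_n
    alpha_n = alpha + n / 2.0
    beta_n = beta + 0.5 * ss + kappa * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

# Sequential use: calling this with one observation at a time and feeding the
# returned hyperparameters back in reproduces the single batch update.
```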
Maximum Likelihood Estimation
The maximum likelihood estimation (MLE) for the parameters of the normal-inverse-gamma (NIG) distribution, denoted as \mu_0, \kappa, \alpha, and \beta, proceeds from a sample of n independent and identically distributed observations (\mu_i, \sigma_i^2)_{i=1}^n drawn from the joint NIG distribution. The likelihood function is given by the product of the individual joint densities: L(\mu_0, \kappa, \alpha, \beta \mid \{(\mu_i, \sigma_i^2)\}) = \prod_{i=1}^n \mathcal{N}(\mu_i \mid \mu_0, \sigma_i^2 / \kappa) \cdot \text{IG}(\sigma_i^2 \mid \alpha, \beta), where \mathcal{N}(\cdot \mid m, v) is the normal density with mean m and variance v, and \text{IG}(\cdot \mid a, b) is the inverse-gamma density with shape a and scale b. To obtain the MLE (\hat{\mu}_0, \hat{\kappa}, \hat{\alpha}, \hat{\beta}), one maximizes the log-likelihood \ell(\mu_0, \kappa, \alpha, \beta) = \sum_{i=1}^n \log \mathcal{N}(\mu_i \mid \mu_0, \sigma_i^2 / \kappa) + \sum_{i=1}^n \log \text{IG}(\sigma_i^2 \mid \alpha, \beta), which amounts to solving the score equations \partial \ell / \partial \theta = 0 for each parameter \theta \in \{\mu_0, \kappa, \alpha, \beta\}. Because the log-likelihood separates into a normal part depending only on (\mu_0, \kappa) and an inverse-gamma part depending only on (\alpha, \beta), the first two estimators have closed forms: \hat{\mu}_0 is the precision-weighted mean \sum_i (\mu_i / \sigma_i^2) / \sum_i (1 / \sigma_i^2), and \hat{\kappa} = \frac{n}{\sum_{i=1}^n (\mu_i - \hat{\mu}_0)^2 / \sigma_i^2}. In contrast, the score equations for \alpha and \beta involve the digamma function and, as with the shape parameter of the gamma distribution, admit no closed-form solution; numerical optimization techniques such as Newton-type or gradient-based methods are required. Under standard regularity conditions—such as the parameter space being an open subset of \mathbb{R}^4 with positive \kappa, \alpha, \beta, the existence of finite moments, and identifiability—the MLE is consistent (i.e., \hat{\theta} \to_p \theta_0 as n \to \infty) and asymptotically normal (\sqrt{n} (\hat{\theta} - \theta_0) \to_d \mathcal{N}(0, \mathcal{I}(\theta_0)^{-1}), where \mathcal{I}(\theta_0) is the Fisher information matrix). These properties follow from general MLE theory for regular parametric models. In relation to estimation in the normal model, the MLE of \mu and \sigma^2 from observed data y_i \sim \mathcal{N}(\mu, \sigma^2) is simply the sample mean \bar{y} and the biased sample variance s^2 = \sum (y_i - \bar{y})^2 / n, respectively; the NIG distribution extends this framework as a joint prior on (\mu, \sigma^2) in Bayesian settings, where frequentist MLE of its hyperparameters contrasts with the simpler closed-form conjugate posterior updates.[1]
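A sketch of this two-part estimation under the assumptions above, with the normal part solved in closed form and the inverse-gamma part handled numerically via scipy.optimize.minimize (the function name and starting values are illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import invgamma

def nig_mle(mu_obs, sigma2_obs):
    """MLE of (mu0, kappa, alpha, beta) from observed (mu_i, sigma2_i) pairs."""
    mu_obs = np.asarray(mu_obs, dtype=float)
    sigma2_obs = np.asarray(sigma2_obs, dtype=float)

    # Normal part: closed-form estimators
    w = 1.0 / sigma2_obs
    mu0_hat = np.sum(w * mu_obs) / np.sum(w)                         # precision-weighted mean
    kappa_hat = mu_obs.size / np.sum(w * (mu_obs - mu0_hat) ** 2)    # from d(loglik)/d(kappa) = 0

    # Inverse-gamma part: numerical optimization over log(alpha), log(beta)
    def neg_loglik(log_params):
        alpha, beta = np.exp(log_params)                             # enforce positivity
        return -np.sum(invgamma.logpdf(sigma2_obs, a=alpha, scale=beta))

    res = minimize(neg_loglik, x0=np.log([2.0, np.mean(sigma2_obs)]), method="Nelder-Mead")
    alpha_hat, beta_hat = np.exp(res.x)
    return mu0_hat, kappa_hat, alpha_hat, beta_hat
```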
Computation and Relations
Sampling Methods
Generating random variates from the normal-inverse-gamma (NIG) distribution is crucial for Monte Carlo simulations in Bayesian inference, particularly when the NIG serves as a conjugate prior for the mean and variance of a normal likelihood.[17] The most efficient and straightforward method is conditional sampling, which leverages the joint structure of the distribution. First, sample the variance \sigma^2 from an inverse-gamma distribution \text{IG}(\alpha, \beta). Then, conditional on \sigma^2, sample the mean \mu from a normal distribution \mathcal{N}(\mu_0, \sigma^2 / \lambda). This approach is computationally efficient because both the inverse-gamma and normal distributions have well-established random number generators in standard statistical software.[7][18] An alternative is the marginal-conditional approach, which begins by sampling \mu from its marginal distribution, a Student's t-distribution with 2\alpha degrees of freedom, location \mu_0, and scale \sqrt{\beta / (\alpha \lambda)}. Then, conditional on \mu, sample \sigma^2 from an adjusted inverse-gamma distribution \text{IG}(\alpha + 1/2, \beta + \lambda (\mu - \mu_0)^2 / 2). This method is useful when direct sampling from the t-distribution is preferred or when verifying properties of the marginal. For more complex scenarios, such as hierarchical models where the NIG parameters are themselves random, rejection sampling or Markov chain Monte Carlo (MCMC) methods may be employed. In particular, the conditional distributions enable Gibbs sampling, where iterates alternate between sampling \sigma^2 \mid \mu from the inverse-gamma and \mu \mid \sigma^2 from the normal, facilitating exploration of the joint posterior in Bayesian hierarchical models.[19][20] Software implementations simplify these methods. In Python, the SciPy library provides the normal_inverse_gamma.rvs function, which generates joint samples (\mu, \sigma^2) using the conditional approach internally;[17] the same conditional scheme can also be coded directly:

```python
from scipy.stats import norm, invgamma
import numpy as np

def sample_nig(mu0, lambda_, alpha, beta, size=1):
    # Step 1: draw the variance from its inverse-gamma marginal
    sigma2 = invgamma.rvs(a=alpha, scale=beta, size=size)
    # Step 2: draw the mean from the conditional normal given each variance
    mu = norm.rvs(loc=mu0, scale=np.sqrt(sigma2 / lambda_), size=size)
    return mu, sigma2

# Example: 1000 samples
mu_samples, sigma2_samples = sample_nig(mu0=0, lambda_=1, alpha=2, beta=3, size=1000)
```

In R, no built-in function exists for the joint NIG, but the conditional method can be implemented using base functions or the extraDistr package for inverse-gamma sampling:[21][22]

```r
library(extraDistr)

sample_nig <- function(mu0, lambda, alpha, beta, n = 1) {
  # Step 1: variance from the inverse-gamma marginal
  sigma2 <- rinvgamma(n, alpha, beta)
  # Step 2: mean from the conditional normal given each variance
  mu <- rnorm(n, mu0, sqrt(sigma2 / lambda))
  return(list(mu = mu, sigma2 = sigma2))
}

samples <- sample_nig(mu0 = 0, lambda = 1, alpha = 2, beta = 3, n = 1000)
```

To validate generated samples, quantile-quantile (QQ) plots can compare the empirical quantiles of the sampled \mu against the theoretical quantiles of the marginal t-distribution, and similarly for \sigma^2 against the inverse-gamma, ensuring the sampler accurately reproduces the target distribution.
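A Python sketch of this check (parameter values arbitrary), generating samples with the conditional method and pairing empirical quantiles with the theoretical marginal quantiles for plotting:

```python
import numpy as np
from scipy.stats import invgamma, norm, t

# Draw joint NIG samples with the conditional method
mu0, lam, alpha, beta = 0.0, 1.0, 2.0, 3.0
rng = np.random.default_rng(1)
sigma2 = invgamma.rvs(a=alpha, scale=beta, size=10_000, random_state=rng)
mu = norm.rvs(loc=mu0, scale=np.sqrt(sigma2 / lam), size=sigma2.size, random_state=rng)

# QQ data: empirical quantiles of the draws vs. theoretical marginal quantiles
probs = np.linspace(0.01, 0.99, 99)
qq_mu = np.column_stack([
    np.quantile(mu, probs),
    t.ppf(probs, df=2 * alpha, loc=mu0, scale=np.sqrt(beta / (alpha * lam))),
])
qq_sigma2 = np.column_stack([
    np.quantile(sigma2, probs),
    invgamma.ppf(probs, a=alpha, scale=beta),
])
# Plotting qq_mu[:, 0] against qq_mu[:, 1] (and likewise for qq_sigma2)
# should give points close to the 45-degree line if the sampler is correct.
```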