Inverse-chi-squared distribution
The inverse-chi-squared distribution, also known as the inverted chi-squared distribution, is a continuous probability distribution defined on the positive real line that describes the reciprocal of a chi-squared random variable; its scaled form additionally incorporates a scale parameter tied to the degrees of freedom.[1] It is a special case of the inverse-gamma distribution, specifically with shape parameter \alpha = \nu/2 and rate parameter \beta = \nu \sigma^2 / 2, where \nu > 0 represents the degrees of freedom and \sigma^2 > 0 is a scale parameter.[1] The probability density function of the scaled inverse-chi-squared distribution is given by f(x \mid \nu, \sigma^2) = \frac{1}{\Gamma(\nu/2)} \left( \frac{\nu \sigma^2}{2} \right)^{\nu/2} x^{-(\nu/2 + 1)} \exp\left( -\frac{\nu \sigma^2}{2x} \right), \quad x > 0, where \Gamma denotes the gamma function.[1] This form arises naturally from the transformation X = \nu \sigma^2 / Y, where Y follows a chi-squared distribution with \nu degrees of freedom.[2]
For \nu > 2, the mean of the distribution is \mathbb{E}[X] = \frac{\nu \sigma^2}{\nu - 2}, and the mode is \frac{\nu \sigma^2}{\nu + 2}.[1] The variance exists for \nu > 4 and is \mathrm{Var}(X) = \frac{2 \nu^2 \sigma^4}{(\nu - 2)^2 (\nu - 4)}.[1] These moments highlight the distribution's heavy-tailed nature for small \nu, making it suitable for modeling uncertainty in scale parameters.[2]
In Bayesian statistics, the inverse-chi-squared distribution serves as a conjugate prior for the variance parameter \sigma^2 of a normal distribution with known mean, ensuring that the posterior distribution remains in the same family after updating with data from i.i.d. normal observations.[1] This conjugacy facilitates closed-form inference, particularly in hierarchical models such as the normal-inverse-chi-squared distribution, which jointly specifies priors for both the mean and the variance.[3]
Definition and Parameterization
Standard Inverse-chi-squared
The standard inverse-chi-squared distribution arises as the probability distribution of the reciprocal of a chi-squared random variable. Specifically, if X follows a chi-squared distribution with \nu > 0 degrees of freedom, then Y = 1/X follows the standard inverse-chi-squared distribution with parameter \nu.[4] The distribution is parameterized by a single scalar \nu, the degrees of freedom; in its equivalent inverse-gamma form the shape parameter is \alpha = \nu/2 (so \nu = 2\alpha) and the scale is \beta = 1/2. The support is restricted to positive real numbers, y > 0.[5] The distribution was introduced in the context of sampling theory in the mid-20th century, particularly for modeling variance components in linear models with unequal variances.[6] The derivation begins with the probability density function of the chi-squared distribution for X and applies the transformation y = 1/x, multiplying by the absolute value of the Jacobian, |dx/dy| = 1/y^2, to yield the density of Y.[4]
Scaled Inverse-chi-squared
The scaled inverse-chi-squared distribution is the distribution of \tau / X, where X follows a chi-squared distribution with \nu degrees of freedom and \tau > 0 serves as a scale factor.[7] It is defined for positive random variables and provides flexibility in modeling scaled variances compared to the unscaled case.[1] The distribution is parameterized by two positive values: the degrees of freedom \nu > 0 and the scale \tau > 0, with support restricted to y > 0.[1] In practice, \tau frequently incorporates a prior estimate for variance, scaled appropriately to reflect uncertainty in the model.[7] This parameterization is equivalent to an inverse-gamma distribution with shape \nu/2 and scale \tau/2.[1] When \tau = 1, the scaled form coincides with the standard inverse-chi-squared distribution.[1] A common specification in Bayesian inference sets \tau = \nu \sigma_0^2, where \sigma_0^2 denotes a prior scale for the variance.[7]
Mathematical Properties
Probability Density Function
The probability density function (PDF) of the standard inverse-chi-squared distribution, parameterized by the degrees of freedom \nu > 0, is f(y \mid \nu) = \frac{1}{2^{\nu/2} \Gamma(\nu/2)} y^{-(\nu/2 + 1)} \exp\left(-\frac{1}{2y}\right) for y > 0, and f(y \mid \nu) = 0 otherwise. This form arises as a special case of the inverse-gamma distribution with shape parameter \alpha = \nu/2 and scale parameter \beta = 1/2. The scaled inverse-chi-squared distribution extends this by incorporating a positive scale parameter \tau > 0, yielding the PDF f(y \mid \nu, \tau) = \frac{(\tau/2)^{\nu/2}}{\Gamma(\nu/2)} y^{-(\nu/2 + 1)} \exp\left(-\frac{\tau}{2y}\right) for y > 0, and zero otherwise.[8] Here, the standard form corresponds to \tau = 1, while larger \tau shifts the distribution toward larger values of y, increasing the mean and mode proportionally. Both variants are supported only on the positive real line, reflecting their role in modeling positive quantities such as variances in Bayesian inference.
The PDF is unimodal for all \nu > 0, with the mode occurring at y = 1/(\nu + 2) for the standard form. As y \to 0^+, f(y) \to 0, dominated by the exponential decay term despite the polynomial singularity from the power law. Similarly, as y \to \infty, f(y) \to 0 via power-law decay modulated by the slowly varying exponential approaching 1. The overall shape is heavily right-skewed for small \nu (e.g., \nu < 5), featuring a sharp peak near zero and a long tail extending to large y; as \nu increases, the skewness diminishes, and the distribution approaches greater symmetry around its mode.[9] This evolution mirrors that of the parent inverse-gamma family, making the distribution suitable for priors on precision parameters where heavy tails capture uncertainty in low-information scenarios. For \nu \leq 2, the PDF's tail heaviness leads to an infinite mean, as explored further in the moments section.
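A short numerical sketch can make the density concrete. The following Python snippet (illustrative only; the helper name inv_chi2_pdf is hypothetical, and NumPy and SciPy are assumed to be available) evaluates the scaled PDF directly from the formula above and checks it against SciPy's inverse-gamma with shape \nu/2 and scale \tau/2.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import invgamma

def inv_chi2_pdf(y, nu, tau=1.0):
    """Scaled inverse-chi-squared PDF; tau = 1 recovers the standard form."""
    y = np.asarray(y, dtype=float)
    log_pdf = (nu / 2) * np.log(tau / 2) - gammaln(nu / 2) \
              - (nu / 2 + 1) * np.log(y) - tau / (2 * y)
    return np.exp(log_pdf)

nu, tau = 4.0, 2.0
y = np.linspace(0.05, 3.0, 5)
print(inv_chi2_pdf(y, nu, tau))                  # direct evaluation of the formula
print(invgamma.pdf(y, a=nu / 2, scale=tau / 2))  # same values via the inverse-gamma form
print("mode:", tau / (nu + 2))                   # location of the density's peak
```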
Moments and Central Tendency
The inverse-chi-squared distribution, in its standard form with degrees of freedom parameter \nu > 0, has a mean of \mathbb{E}[Y] = \frac{1}{\nu - 2} provided \nu > 2; otherwise, the mean is infinite.[10] For the scaled form, parameterized by an additional scale \tau > 0, the mean is \mathbb{E}[Y] = \frac{\tau}{\nu - 2} under the same condition \nu > 2.[10] The variance for the standard form is \mathrm{Var}(Y) = \frac{2}{(\nu - 2)^2 (\nu - 4)} when \nu > 4; it does not exist otherwise.[10] In the scaled case, the variance is \mathrm{Var}(Y) = \frac{2 \tau^2}{(\nu - 2)^2 (\nu - 4)} for \nu > 4.[10] These expressions arise from the distribution's representation as a special case of the inverse-gamma distribution, where the moments follow from the properties of the gamma function.[5]
The mode, representing the most probable value, is obtained by maximizing the probability density function and equals \frac{1}{\nu + 2} for the standard form and \frac{\tau}{\nu + 2} for the scaled form.[10] This derivation involves setting the derivative of the log-density to zero, yielding the location of the peak in the positively skewed density. The median has no closed-form expression and must be approximated numerically, often via its relation to the inverse-gamma distribution or by solving the cumulative distribution function equation.[5] For the standard form the median decreases as \nu grows, reflecting the distribution's concentration toward smaller values with increasing degrees of freedom, and it always lies between the mode and the mean because of the positive skewness.
Higher-order moments of order k exist only when \nu > 2k. For the standard form, the k-th raw moment is given by \mathbb{E}[Y^k] = 2^{-k} \frac{\Gamma\left(\frac{\nu}{2} - k\right)}{\Gamma\left(\frac{\nu}{2}\right)}, while for the scaled form it is \mathbb{E}[Y^k] = \left(\frac{\tau}{2}\right)^k \frac{\Gamma\left(\frac{\nu}{2} - k\right)}{\Gamma\left(\frac{\nu}{2}\right)}. These follow from the negative moments of the underlying chi-squared distribution, which is gamma-distributed.[11] The distribution exhibits positive skewness, with the mean exceeding both the mode and median; this rightward pull on the mean stems from the heavy right tail, where large values of Y occur with non-negligible probability even when the bulk of the mass concentrates near the mode.[10]
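The closed-form moments can be checked against both the inverse-gamma representation and a Monte Carlo estimate. The sketch below (assuming \nu > 4 so that the variance exists; NumPy and SciPy assumed) compares the standard-form formulas with SciPy's invgamma and with reciprocals of simulated chi-squared draws, and also shows the numerically obtained median.

```python
import numpy as np
from scipy.stats import invgamma

nu = 10.0
mean = 1 / (nu - 2)                      # exists for nu > 2
var = 2 / ((nu - 2) ** 2 * (nu - 4))     # exists for nu > 4
mode = 1 / (nu + 2)

ig_mean, ig_var = invgamma.stats(a=nu / 2, scale=0.5, moments="mv")
print(mean, float(ig_mean))              # identical
print(var, float(ig_var))                # identical

rng = np.random.default_rng(0)
samples = 1 / rng.chisquare(df=nu, size=200_000)
print(samples.mean(), samples.var())     # close to the closed-form values
print("mode:", mode, " median (numerical):", np.median(samples))
```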
Sampling and Generation
The primary method for generating random variates from the inverse-chi-squared distribution involves a direct transformation from the chi-squared distribution, which is exact and computationally efficient. For the standard inverse-chi-squared distribution with \nu degrees of freedom, the algorithm consists of two steps: (1) draw a random variate Z from the chi-squared distribution with \nu degrees of freedom, \chi^2(\nu); (2) compute Y = 1/Z. This transformation yields a sample from the target distribution because the reciprocal of a chi-squared variate follows the inverse-chi-squared by definition.[12] For the scaled inverse-chi-squared distribution with \nu degrees of freedom and scale parameter \tau, the procedure is analogous: after generating Z \sim \chi^2(\nu), set Y = \tau / Z. This adjustment incorporates the scaling factor directly into the transformation and remains exact, with no rejection step required. The method is efficient in practice, since chi-squared variates can be generated directly via gamma samplers or, for integer \nu, as sums of squared standard normals.[12]
The inverse-chi-squared distribution is equivalent to a special case of the inverse-gamma distribution, facilitating alternative sampling approaches. Specifically, the standard form corresponds to an inverse-gamma with shape parameter \alpha = \nu/2 and scale parameter \beta = 1/2, while the scaled form uses \beta = \tau / 2. Random variates can thus be generated using established inverse-gamma samplers, which internally apply similar transformations from the gamma distribution.[1]
In statistical software, these methods are readily implemented. For example, in R, samples from the standard inverse-chi-squared can be obtained as 1 / rchisq(n, \nu), where n is the desired sample size, leveraging the built-in chi-squared generator. In Python's SciPy library, invgamma.rvs(a=\nu/2, scale=1/2, size=n) provides direct sampling for the standard case, with the scale adjusted to \tau / 2 for the scaled variant. These implementations ensure high efficiency for typical applications in Bayesian inference and simulation studies.[13][14]
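The two-step transformation sampler described above can be written in a few lines. The following sketch (a minimal illustration, not a library routine; the function name rinvchisq is hypothetical) draws scaled inverse-chi-squared variates as \tau / Z with Z \sim \chi^2(\nu), and cross-checks the sample mean against the equivalent inverse-gamma sampler.

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(42)

def rinvchisq(n, nu, tau=1.0, rng=rng):
    """Exact transformation sampler: tau divided by a chi-squared(nu) draw."""
    z = rng.chisquare(df=nu, size=n)   # step 1: chi-squared variates
    return tau / z                     # step 2: invert and scale

nu, tau, n = 6.0, 3.0, 100_000
y = rinvchisq(n, nu, tau)

# Equivalent inverse-gamma sampler (shape nu/2, scale tau/2) for comparison.
y_ig = invgamma.rvs(a=nu / 2, scale=tau / 2, size=n, random_state=rng)
print(y.mean(), y_ig.mean(), tau / (nu - 2))   # all three should be close
```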
Relationships to Other Distributions
Connection to Chi-squared Distribution
The inverse-chi-squared distribution with \nu > 0 degrees of freedom arises directly as the reciprocal of a chi-squared random variable. Specifically, if X \sim \chi^2(\nu), then Y = 1/X follows an inverse-chi-squared distribution with \nu degrees of freedom.[10] To derive this transformation, consider the probability density function (PDF) of the chi-squared distribution: f_X(x) = \frac{1}{2^{\nu/2} \Gamma(\nu/2)} x^{\nu/2 - 1} e^{-x/2}, \quad x > 0. Under the change of variables y = 1/x, we have x = 1/y and the Jacobian is |dx/dy| = 1/y^2. Substituting yields the PDF of Y: f_Y(y) = f_X(1/y) \cdot \frac{1}{y^2} = \frac{1}{2^{\nu/2} \Gamma(\nu/2)} \left(\frac{1}{y}\right)^{\nu/2 - 1} e^{-1/(2y)} \cdot \frac{1}{y^2} = \frac{1}{2^{\nu/2} \Gamma(\nu/2)} y^{-\nu/2 - 1} e^{-1/(2y)}, \quad y > 0, which is the PDF of the (unscaled) inverse-chi-squared distribution.[10][15]
A related quantile relationship follows from this transformation: the p-quantile of the inverse-chi-squared distribution with \nu degrees of freedom is the reciprocal of the (1-p)-quantile of the chi-squared distribution with \nu degrees of freedom. This property is used, for instance, to construct credible intervals for variance parameters by inverting chi-squared quantiles.[15] The inverse-chi-squared distribution emerged in the statistical literature of the 1930s and 1940s, particularly for inverting chi-squared-based tests and deriving confidence intervals for normal population variances.
Like the chi-squared distribution, the inverse-chi-squared is supported on (0, \infty) and takes only positive values, reflecting the non-negativity of the squared normals underlying the chi-squared. However, the inversion alters the tail behavior: the chi-squared has an exponentially decaying (light) right tail, while the inverse-chi-squared has a power-law (heavy) right tail, with its density instead vanishing rapidly near zero. The moment existence conditions change accordingly: the mean of the inverse-chi-squared exists only for \nu > 2, in contrast to the chi-squared mean, which exists for all \nu > 0.[10][15]
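The quantile relationship is easy to verify numerically. The sketch below (illustrative; SciPy assumed) computes the 0.975-quantile of the standard inverse-chi-squared both by inverting the 0.025-quantile of the chi-squared and via the inverse-gamma representation.

```python
from scipy.stats import chi2, invgamma

nu, p = 5.0, 0.975
q_from_chi2 = 1.0 / chi2.ppf(1 - p, df=nu)        # reciprocal of the (1-p)-quantile
q_direct = invgamma.ppf(p, a=nu / 2, scale=0.5)   # same law in inverse-gamma form
print(q_from_chi2, q_direct)                      # agree to numerical precision
```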
Relation to Inverse-gamma Distribution
The inverse-chi-squared distribution is a special case of the inverse-gamma distribution, which provides a broader framework for modeling positive random variables with heavy tails. This relationship allows the inverse-chi-squared to be expressed using the more general parameterization of the inverse-gamma family, facilitating computations and extensions in statistical modeling.
The standard inverse-chi-squared distribution Inv-\chi^2(\nu) with \nu > 0 degrees of freedom corresponds to an inverse-gamma distribution in shape-rate parameterization, InvGamma(\alpha = \nu/2, \beta = 1/2). The probability density function (PDF) of the standard inverse-chi-squared is given by f(x \mid \nu) = \frac{1}{2^{\nu/2} \Gamma(\nu/2)} x^{-\nu/2 - 1} \exp\left( -\frac{1}{2x} \right), \quad x > 0, which directly matches the inverse-gamma PDF f(x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{-\alpha - 1} \exp\left( -\frac{\beta}{x} \right), \quad x > 0, upon substituting \alpha = \nu/2 and \beta = 1/2; the normalizing constants, power laws, and exponential terms coincide term by term. The moments also align; for instance, the mean of Inv-\chi^2(\nu) is 1/(\nu - 2) for \nu > 2, matching the inverse-gamma mean \beta / (\alpha - 1) = (1/2) / (\nu/2 - 1) = 1/(\nu - 2).
The scaled inverse-chi-squared distribution Inv-\chi^2(\nu, \tau^2), which incorporates a scale parameter \tau^2 > 0, is similarly equivalent to InvGamma(\alpha = \nu/2, \beta = \nu \tau^2 / 2). Here, the PDF becomes f(x \mid \nu, \tau^2) = \frac{(\nu \tau^2 / 2)^{\nu/2}}{\Gamma(\nu/2)} x^{-\nu/2 - 1} \exp\left( -\frac{\nu \tau^2}{2x} \right), matching the inverse-gamma form with rate \beta = \nu \tau^2 / 2. This mapping preserves the moment structure, such as the mean \nu \tau^2 / (\nu - 2) for \nu > 2, which follows from the inverse-gamma formula.
The inverse-gamma parameterization offers greater flexibility for generalizations, as it allows arbitrary shape and rate values without the constraints implicit in the chi-squared degrees of freedom, enabling broader applications in hierarchical models and robustness to prior specifications. In contrast, the inverse-chi-squared represents a subfamily in which the rate is fixed at \beta = 1/2 in the standard case and tied to the scale via \beta = \nu \tau^2 / 2 in the scaled case, reflecting its origin in the reciprocal of a chi-squared variate. Notation for the inverse-gamma varies across sources, with some adopting a shape-scale form in which the scale parameter is the reciprocal of the rate (leading to \exp(-1/(\theta x)) with \theta > 0); the shape-rate convention used here aligns directly with the above mappings and is common in Bayesian contexts.
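The scaled mapping can likewise be confirmed by direct evaluation. In the sketch below (illustrative; SciPy's scale argument corresponds to the \beta appearing in \exp(-\beta/x)), the density and mean of Inv-\chi^2(\nu, \tau^2) are reproduced by InvGamma(\nu/2, \nu \tau^2 / 2).

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import invgamma

nu, tau2 = 8.0, 1.5
x = np.linspace(0.2, 5.0, 4)

# Scaled inverse-chi-squared log-density, written out from the formula above.
log_pdf = (nu / 2) * np.log(nu * tau2 / 2) - gammaln(nu / 2) \
          - (nu / 2 + 1) * np.log(x) - nu * tau2 / (2 * x)
print(np.exp(log_pdf))
print(invgamma.pdf(x, a=nu / 2, scale=nu * tau2 / 2))        # identical values

print(nu * tau2 / (nu - 2),                                   # closed-form mean, nu > 2
      float(invgamma.mean(a=nu / 2, scale=nu * tau2 / 2)))    # matches
```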
Links to Normal and Student-t Distributions
The inverse-chi-squared distribution establishes a direct connection to the normal distribution through its role in modeling the variance parameter. Specifically, if the variance \sigma^2 follows a scaled inverse-chi-squared distribution with degrees of freedom \nu and scale parameter \tau^2, then \nu \tau^2 / \sigma^2 follows a chi-squared distribution with \nu degrees of freedom, so the precision 1/\sigma^2 has a (scaled) chi-squared, that is, gamma, distribution.[1] This parameterization aligns the distribution's tail behavior with the quadratic form of normal residuals, ensuring compatibility in likelihood-based updates.[16]
A prominent link to the Student-t distribution emerges in Bayesian inference for normal data with unknown mean and variance. When an inverse-chi-squared prior is placed on the variance (or equivalently on the precision), the marginal posterior for the mean, obtained by integrating out the variance, follows a Student-t distribution whose degrees of freedom are the prior degrees of freedom updated by the sample size, and whose location and scale incorporate the sample mean and prior information.[1] This result highlights how uncertainty in the normal variance propagates to heavier tails in the mean's posterior, akin to the Student-t's finite-sample adjustment over the normal.[16]
Sampling properties further tie the inverse-chi-squared to normals via sums of squares. For n independent observations X_i \sim \mathcal{N}(\mu, \sigma^2) with known \mu, the sum \sum_{i=1}^n (X_i - \mu)^2 / \sigma^2 follows a chi-squared distribution with n degrees of freedom, so the quantity n / \sum_{i=1}^n [(X_i - \mu)/\sigma ]^2 follows a scaled inverse-chi-squared distribution with n degrees of freedom and scale parameter 1.[1] More generally, ratios involving squares of independent standard normals can generate variance components whose reciprocals align with inverse-chi-squared forms, particularly in hierarchical models where precision estimates arise from such quadratic ratios.[16]
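The sum-of-squares relationship can be illustrated by simulation. The following sketch (an illustrative check, not drawn from the cited sources) generates normal samples with known mean and verifies that n / \sum_i [(X_i - \mu)/\sigma]^2 has mean close to n/(n-2), as expected for a scaled inverse-chi-squared variate with n degrees of freedom and scale 1.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 0.0, 2.0, 10, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
w = ((x - mu) / sigma) ** 2          # each row sums to a chi-squared(n) draw
stat = n / w.sum(axis=1)             # reciprocal, scaled by n

print(stat.mean(), n / (n - 2))      # Monte Carlo mean vs. theoretical mean
```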
Applications
Bayesian Statistics as Conjugate Prior
In Bayesian statistics, the inverse-chi-squared distribution is employed as a conjugate prior for the variance parameter \sigma^2 in models where the data follow a normal distribution with an unknown mean and variance. Specifically, under a normal-inverse-chi-squared prior p(\mu, \sigma^2) = \mathcal{N}(\mu \mid \mu_0, \sigma^2 / \kappa_0) \times \text{Inv-}\chi^2(\sigma^2 \mid \nu, \tau), assuming independent observations x_1, \dots, x_n \sim \mathcal{N}(\mu, \sigma^2), the marginal posterior for \sigma^2 is also inverse-chi-squared, \sigma^2 \mid \mathbf{x} \sim \text{Inv-}\chi^2(\nu + n, \tau'), where the updated scale parameter is \tau' = (\nu \tau + \text{SS}) / (\nu + n) and SS denotes the sum of squared residuals adjusted for the uncertainty in \mu.[17] This conjugacy arises because the normal likelihood, when marginalized over \mu, interacts multiplicatively with the inverse-chi-squared prior to preserve the distributional family.[17]
The posterior update rules are straightforward: the degrees of freedom increase additively by the sample size, \nu_{\text{post}} = \nu + n, reflecting the accumulation of information, while the scale update incorporates both the prior scale and the data's sum of squares, \tau_{\text{post}} = (\nu \tau + \sum_{i=1}^n (x_i - \hat{\mu})^2 + \text{adjustment for the prior mean}) / (\nu + n), where the adjustment term accounts for the discrepancy between the prior mean and the sample mean.[17] These updates enable exact inference without numerical approximation in simple settings. The conjugate form offers key advantages: it yields a closed-form posterior that facilitates analytical computation of credible intervals and posterior moments, non-informative limits are obtained by letting \nu \to 0, and as the sample size grows relative to \nu the posterior concentrates around the data-driven estimate of the variance.[17] The use of the inverse-chi-squared prior gained prominence in Bayesian texts following the 1970s, especially for hierarchical models where variance components require scalable updating rules.[18] In this parameterization \tau plays the role of a prior scale estimate s_0^2 for \sigma^2, so that \nu \tau acts as a prior pseudo-sum-of-squares, which aligns well with variance interpretations in normal models.[17]
As an illustrative example, consider a simple normal model with known mean \mu = 0 and observations x_1, \dots, x_n \sim \mathcal{N}(0, \sigma^2). With prior \sigma^2 \sim \text{Inv-}\chi^2(\nu, \tau), the posterior simplifies to \sigma^2 \mid \mathbf{x} \sim \text{Inv-}\chi^2\left(\nu + n, \frac{\nu \tau + \sum_{i=1}^n x_i^2}{\nu + n}\right), directly pooling the prior pseudo-sum-of-squares \nu \tau with the observed \sum x_i^2. This setup is common in introductory Bayesian analyses of variance, allowing straightforward posterior sampling or moment calculation.[17]
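The known-mean update at the end of this section can be carried out in a few lines. The sketch below (hypothetical prior values; SciPy assumed) computes the posterior degrees of freedom and scale and then obtains the posterior mean and a 95% credible interval through the inverse-gamma representation.

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(7)
true_sigma2 = 2.5
x = rng.normal(0.0, np.sqrt(true_sigma2), size=50)   # data with known mean 0

nu0, tau0 = 4.0, 1.0                                 # assumed prior hyperparameters
nu_post = nu0 + x.size
tau_post = (nu0 * tau0 + np.sum(x ** 2)) / nu_post   # pooled scale, as in the text

# Inv-chi^2(nu_post, tau_post) with tau_post a scale equals
# InvGamma(shape = nu_post/2, scale = nu_post*tau_post/2).
post = invgamma(a=nu_post / 2, scale=nu_post * tau_post / 2)
print("posterior mean of sigma^2:", post.mean())
print("95% credible interval:", post.ppf([0.025, 0.975]))
```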
Variance Estimation in Regression Models
In Bayesian linear regression, the scaled inverse-chi-squared distribution serves as a conjugate prior for the error variance \sigma^2, parameterized as \sigma^2 \sim \text{scaled-Inv-}\chi^2(\nu, \nu s^2), where \nu denotes the prior degrees of freedom and s^2 is a prior scale estimate reflecting expected variability. This choice arises from the normal likelihood of the regression errors, ensuring the posterior retains the same distributional form for tractable inference.[19] Given data \mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon} with \boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2 I_n), the posterior for \sigma^2 incorporates the residual sum of squares (RSS) from the least-squares fit. The updated degrees of freedom become \nu_{\text{post}} = \nu + n - p, where n is the sample size and p is the number of predictors (including the intercept), accounting for the degrees of freedom lost in estimating \boldsymbol{\beta}. The posterior scale parameter is \tau_{\text{post}} = \nu s^2 + \text{RSS}, yielding \sigma^2 \mid \mathbf{y}, X \sim \text{scaled-Inv-}\chi^2(\nu_{\text{post}}, \tau_{\text{post}}); in this section the second argument is written as a pooled sum of squares, so the posterior density is proportional to (\sigma^2)^{-(\nu_{\text{post}}/2 + 1)} \exp(-\tau_{\text{post}} / (2\sigma^2)). This update weights the prior scale against the data-driven RSS, with larger n or smaller RSS pulling the posterior toward lower variance values.[19]
Credible intervals for \sigma^2 are derived from the quantiles of the posterior scaled inverse-chi-squared distribution, offering asymmetric bounds that reflect uncertainty beyond point estimates like the posterior mode \tau_{\text{post}} / (\nu_{\text{post}} + 2). These intervals are particularly useful for assessing the precision of predictions in regression settings, as they integrate the full posterior rather than relying on asymptotic approximations.[19] The inverse-chi-squared posterior also supports model comparison via Bayes factors for variance components, such as in comparing models with different numbers of predictors or hierarchical structures, by evaluating the marginal likelihood after integrating out \sigma^2. This approach quantifies evidence for simpler variance assumptions against more complex ones, favoring models where the posterior adequately explains the data without excessive parameterization.
As an example, in simple linear regression y_i = \beta_0 + \beta_1 x_i + \epsilon_i with a noninformative prior \sigma^2 \sim \text{scaled-Inv-}\chi^2(1, 1) and n = 30 observations yielding RSS = 50, the posterior is \text{scaled-Inv-}\chi^2(29, 51), from which 95% credible intervals for \sigma^2 can be computed as the 0.025 and 0.975 quantiles, typically spanning values consistent with the data's scatter.[19]
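The numerical example above can be reproduced directly. In the sketch below (illustrative; under this section's convention the posterior corresponds to an inverse-gamma with shape \nu_{\text{post}}/2 and scale \tau_{\text{post}}/2), the posterior mode and a 95% credible interval for \sigma^2 are computed from the posterior scaled-Inv-\chi^2(29, 51).

```python
from scipy.stats import invgamma

nu0, tau0 = 1.0, 1.0          # noninformative-ish prior scaled-Inv-chi^2(1, 1)
n, p, rss = 30, 2, 50.0       # sample size, predictors (incl. intercept), residual SS

nu_post = nu0 + n - p         # 29 degrees of freedom
tau_post = tau0 + rss         # 51

post = invgamma(a=nu_post / 2, scale=tau_post / 2)
print("posterior mode:", tau_post / (nu_post + 2))   # matches the formula in the text
print("95% credible interval for sigma^2:", post.ppf([0.025, 0.975]))
```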
Markov Chain Monte Carlo Methods
The inverse-chi-squared distribution plays a central role in Markov chain Monte Carlo (MCMC) methods, particularly Gibbs sampling, for Bayesian inference involving variance parameters in normal and hierarchical models. As a conjugate prior for the variance \sigma^2 in the normal distribution, it enables the derivation of full conditional posteriors that retain the same distributional form, facilitating direct and efficient sampling without the need for more computationally intensive techniques like Metropolis-Hastings for this component.[17][19]
In the canonical setting of a normal model with unknown mean \mu and variance \sigma^2, Gibbs sampling alternates between drawing from the full conditional posterior of \mu given \sigma^2 and the data, which is normal, and the full conditional of \sigma^2 given \mu and the data, which follows an inverse-chi-squared distribution. Specifically, under a semi-conjugate specification in which \mu \sim \mathcal{N}(\mu_0, \tau_0^2) independently of \sigma^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2), the full conditional for \sigma^2 is \text{Inv-}\chi^2(\sigma^2 \mid \nu_n, \sigma_n^2), where the updated degrees of freedom are \nu_n = \nu_0 + n and the scale \sigma_n^2 = (\nu_0 \sigma_0^2 + \sum_i (y_i - \mu)^2)/\nu_n pools the prior and data sums of squares. (Under the fully conjugate normal-inverse-chi-squared prior \mathcal{N}(\mu \mid \mu_0, \sigma^2 / \kappa_0) \times \text{Inv-}\chi^2(\sigma^2 \mid \nu_0, \sigma_0^2), the joint posterior is available in closed form and Gibbs sampling is not strictly required.) This iterative process generates samples from the joint posterior, converging to the target distribution under standard MCMC conditions.[17][19]
In more complex hierarchical models, such as those with multiple levels of variance components, the full conditional for a variance parameter \sigma^2 given the data and the other parameters remains inverse-chi-squared, with updated degrees of freedom \nu and scale \tau that aggregate contributions from the likelihood (e.g., residual sums of squares) and hyperpriors on related parameters such as means or other variances. For instance, in a two-level hierarchical normal model y_{ij} \sim \mathcal{N}(\theta_j, \sigma^2) and \theta_j \sim \mathcal{N}(\mu, \tau^2), the conditional p(\sigma^2 \mid y, \theta, \mu, \tau^2) is inverse-chi-squared with degrees of freedom increased by the number of observations and scale updated by the pooled residuals. This structure preserves conjugacy across levels, allowing straightforward Gibbs updates.[19]
The direct samplability of these full conditionals enhances MCMC efficiency by avoiding rejection-based methods for variance components, reducing autocorrelation in chains and accelerating convergence, especially in high-dimensional settings. This is particularly advantageous in models where other parameters may require Metropolis-Hastings steps, as the inverse-chi-squared block is sampled exactly.[19][20] Such MCMC strategies are commonly applied in Bayesian analysis of variance (ANOVA) models, where group variances follow inverse-chi-squared priors, and in random-effects models for clustered data, enabling inference on between- and within-group variability through posterior samples of variance ratios or credible intervals.[19] Implementation is streamlined in probabilistic programming languages such as Stan and JAGS: Stan provides the scaled inverse-chi-squared distribution directly, while in JAGS the same prior is typically expressed as a gamma prior on the precision; JAGS exploits the resulting conjugacy through Gibbs-style conditional updates, whereas Stan samples the joint posterior with Hamiltonian Monte Carlo.[21][22]
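A compact Gibbs sampler illustrates these conditional updates. The sketch below (a minimal illustration under the semi-conjugate specification described above, with assumed prior values; it is not code from the cited sources) alternates a normal draw for \mu with a scaled inverse-chi-squared draw for \sigma^2.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(1.0, 2.0, size=40)      # synthetic data
n, ybar = y.size, y.mean()

mu0, tau0_sq = 0.0, 10.0               # assumed N(mu0, tau0_sq) prior on mu
nu0, sigma0_sq = 2.0, 1.0              # assumed Inv-chi^2(nu0, sigma0_sq) prior on sigma^2

def draw_scaled_inv_chi2(nu, scale_sq):
    """Draw sigma^2 ~ scaled-Inv-chi^2(nu, scale_sq) as nu * scale_sq / chi^2_nu."""
    return nu * scale_sq / rng.chisquare(df=nu)

mu, sigma_sq = ybar, y.var()           # initial values
draws = []
for _ in range(5000):
    # mu | sigma^2, y: normal with precision-weighted mean
    prec = 1 / tau0_sq + n / sigma_sq
    mean = (mu0 / tau0_sq + n * ybar / sigma_sq) / prec
    mu = rng.normal(mean, np.sqrt(1 / prec))
    # sigma^2 | mu, y: scaled inverse-chi-squared with nu0 + n degrees of freedom
    nu_n = nu0 + n
    scale_n = (nu0 * sigma0_sq + np.sum((y - mu) ** 2)) / nu_n
    sigma_sq = draw_scaled_inv_chi2(nu_n, scale_n)
    draws.append((mu, sigma_sq))

draws = np.array(draws[1000:])         # discard burn-in
print("posterior means (mu, sigma^2):", draws.mean(axis=0))
```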
Parameter Estimation
Method of Moments
The method of moments estimation for the parameters of the scaled inverse-chi-squared distribution, denoted Inv-χ²(ν, τ), is performed by equating the first two sample moments to the theoretical population moments. The theoretical mean is given by \mu = \frac{\nu \tau}{\nu - 2}, \quad \nu > 2, and the theoretical variance by \sigma^2 = \frac{2 \nu^2 \tau^2}{(\nu - 2)^2 (\nu - 4)}, \quad \nu > 4. [23]
Let m denote the sample mean and v the sample variance from a random sample of size n. Setting m = μ yields τ = m (ν - 2)/ν. Substituting into the variance equation gives v = 2 m^2 / (ν - 4), which rearranges to the explicit solution \hat{\nu} = 4 + \frac{2 m^2}{v}. Then, the estimator for the scale parameter is \hat{\tau} = m \left( \frac{\hat{\nu} - 2}{\hat{\nu}} \right). The moment matching presumes \nu > 4, so that the population variance exists; the resulting estimate automatically satisfies \hat{\nu} > 4 whenever m and v are positive. The resulting estimators are biased in small samples, as the method of moments generally produces biased estimators unless adjusted.[24]
The method of moments provides a straightforward computational approach by solving a simple system of equations, making it useful for quick approximations in preliminary analysis. However, it is typically less efficient than maximum likelihood estimation, yielding estimators with higher variance, particularly for distributions with heavy tails like the inverse-chi-squared.[24] For example, given sample data with mean m and variance v, the degrees-of-freedom estimator can be written as \hat{\nu} = 2 \left( \frac{m^2}{v} + 2 \right) exactly, which is approximately 2 m^2 / v when \frac{m^2}{v} \gg 1.
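The moment-matching estimators have a one-line implementation. The sketch below (illustrative; the function name invchisq_method_of_moments is hypothetical) applies \hat{\nu} = 4 + 2 m^2 / v and \hat{\tau} = m (\hat{\nu} - 2)/\hat{\nu} to simulated data with known parameters.

```python
import numpy as np

def invchisq_method_of_moments(y):
    """Return (nu_hat, tau_hat) by matching the sample mean and variance."""
    m = np.mean(y)
    v = np.var(y, ddof=1)              # unbiased sample variance
    nu_hat = 4 + 2 * m ** 2 / v
    tau_hat = m * (nu_hat - 2) / nu_hat
    return nu_hat, tau_hat

# Quick check: simulate nu*tau / chi^2_nu variates with known parameters.
rng = np.random.default_rng(11)
nu_true, tau_true = 8.0, 2.0
y = nu_true * tau_true / rng.chisquare(df=nu_true, size=50_000)
print(invchisq_method_of_moments(y))   # should land near (8.0, 2.0)
```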
Maximum Likelihood Estimation
The likelihood function for an independent and identically distributed sample y_1, \dots, y_n from the scaled inverse-chi-squared distribution with degrees of freedom \nu > 0 and scale parameter \tau > 0 is given by the product of the individual probability density functions: L(\nu, \tau) = \prod_{i=1}^n f(y_i \mid \nu, \tau), where f(y \mid \nu, \tau) = \frac{ (\nu \tau / 2)^{\nu/2} }{ \Gamma(\nu/2) } y^{-(\nu/2 + 1)} \exp\left( -\frac{\nu \tau}{2 y} \right) for y > 0. The corresponding log-likelihood is \ell(\nu, \tau) = n \left[ \frac{\nu}{2} \log\left( \frac{\nu \tau}{2} \right) - \log \Gamma\left( \frac{\nu}{2} \right) \right] - \left( \frac{\nu}{2} + 1 \right) \sum_{i=1}^n \log y_i - \frac{\nu \tau}{2} \sum_{i=1}^n \frac{1}{y_i}. This expression involves sums of \log y_i and 1/y_i together with a log-gamma term, and it yields no closed-form solution for the maximum likelihood estimator (MLE) of \nu.[25]
To obtain the MLEs \hat{\nu} and \hat{\tau}, numerical optimization is required. Conditional on \nu, the MLE for \tau has a closed form, \hat{\tau} = n / \left( \sum_{i=1}^n 1/y_i \right) (which in fact does not depend on \nu), and this facilitates a profile log-likelihood for \nu alone. The profile likelihood for \nu can then be maximized using methods such as Newton-Raphson iteration, where the updates rely on the digamma function \psi(\nu/2) for the score and the observed information. The EM algorithm provides an alternative iterative approach, particularly useful when the estimation is embedded within broader models involving latent variables. Starting values for the optimization are typically derived from the method of moments estimators to aid convergence.[25]
Under standard regularity conditions, the MLEs \hat{\nu} and \hat{\tau} are consistent and asymptotically efficient as the sample size n \to \infty, converging in probability to the true parameters. Approximate standard errors are computed from the inverse of the negative Hessian matrix of the log-likelihood evaluated at the MLE, yielding asymptotic normality: \sqrt{n} (\hat{\theta} - \theta_0) \to \mathcal{N}(0, I(\theta_0)^{-1}), where \theta = (\nu, \tau) and I(\theta_0) is the Fisher information matrix. Bias in the estimators diminishes with larger n.[25] A key challenge in this optimization arises for small values of \nu, where the likelihood surface can be non-monotonic, potentially leading to multiple local maxima and requiring robust initial values or global optimization strategies to identify the global MLE. Such issues are more pronounced in small samples (n < 20) and underscore the importance of sensitivity checks.[25] Implementations of these procedures are available in statistical software via general-purpose optimizers; for instance, the optim() function in R or scipy.optimize.minimize in Python can maximize the log-likelihood with user-supplied objective functions and derivatives.[25]
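The profile-likelihood approach lends itself to a short implementation. The sketch below (illustrative, not a packaged routine; the function names are hypothetical) profiles out \tau via its closed form and then finds \hat{\nu} with a bounded one-dimensional search.

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

def neg_profile_loglik(nu, y):
    """Negative log-likelihood with tau fixed at its closed-form value n / sum(1/y)."""
    n = y.size
    tau_hat = n / np.sum(1.0 / y)
    return -(n * (nu / 2 * np.log(nu * tau_hat / 2) - gammaln(nu / 2))
             - (nu / 2 + 1) * np.sum(np.log(y))
             - nu * tau_hat / 2 * np.sum(1.0 / y))

def invchisq_mle(y):
    res = minimize_scalar(neg_profile_loglik, bounds=(0.01, 1000.0),
                          args=(y,), method="bounded")
    return res.x, y.size / np.sum(1.0 / y)   # (nu_hat, tau_hat)

rng = np.random.default_rng(5)
nu_true, tau_true = 6.0, 1.5
y = nu_true * tau_true / rng.chisquare(df=nu_true, size=20_000)
print(invchisq_mle(y))   # should be close to (6.0, 1.5)
```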