In statistics, relative likelihood refers to the ratio of the likelihood of a specific parameter value or hypothesis to the maximum possible likelihood within a given model, serving as a measure of evidential support or plausibility for that value relative to the best-supported alternative. This concept, formalized through the likelihood function L(\theta \mid x) = f(x \mid \theta), where f is the probability density or mass function and \theta is the parameter, enables direct comparisons without requiring prior probabilities or frequentist error rates. The relative likelihood R(\theta) = L(\theta \mid x) / L(\hat{\theta} \mid x), with 0 \leq R(\theta) \leq 1, highlights how much less plausible a parameter value is compared to the maximum likelihood estimate \hat{\theta}.[1]
Developed as part of the likelihood paradigm, relative likelihood builds on R.A. Fisher's introduction of the likelihood function in the 1920s and was advanced by A.W.F. Edwards in his 1972 monograph Likelihood, which argued for its use as the foundation of inductive inference over probability-based approaches. Later proponents, including Richard Royall in Statistical Evidence: A Likelihood Paradigm (1997), emphasized its role in quantifying statistical evidence via the Law of Likelihood, which states that data support one hypothesis over another to the extent of their likelihood ratio. Key applications include model selection, where relative likelihoods assess competing models' fit; parameter inference, via likelihood intervals (e.g., sets where R(\theta) \geq 1/8 or 0.15 for approximate 95% confidence regions); and hypothesis testing, avoiding p-values by focusing on direct evidential comparisons. This approach is particularly valuable in fields like ecology, physics, and machine learning for robust uncertainty quantification without Bayesian priors.
Fundamentals
Likelihood Function
The likelihood function, denoted L(\theta \mid x), represents the joint probability density function (or probability mass function in discrete cases) of the observed data x, expressed as a function of the unknown parameter \theta. Unlike a probability distribution over \theta, it is not normalized so that its integral (or sum) over \theta equals 1; instead, it measures the plausibility of different \theta values given the fixed data x. This distinction emphasizes that the likelihood treats the data as given and varies the parameters, reversing the roles in the conditional probability f(x \mid \theta).[2][3]
For a sample of n independent and identically distributed observations x = (x_1, \dots, x_n), the likelihood function takes the product form L(\theta \mid x) = \prod_{i=1}^n f(x_i \mid \theta), where f(\cdot \mid \theta) is the probability density or mass function of each observation under parameter \theta. This formulation arises directly from the joint distribution under independence, facilitating computation and maximization.[2][4]
Key properties of the likelihood function include its invariance under reparameterization: if \phi = g(\theta) for a one-to-one transformation g, then the likelihood in terms of \phi is simply L(\phi \mid x) = L(\theta(\phi) \mid x); no Jacobian factor enters, because the likelihood is a function of the parameter rather than a density over it, so the relative ordering of parameter values is preserved under the transformation. Because absolute likelihood values depend on arbitrary scaling and the data's support, inference typically relies on relative likelihoods rather than absolutes, comparing L(\theta \mid x) across \theta to assess plausibility. The function plays a central role in maximum likelihood estimation (MLE), where the estimator \hat{\theta} maximizes L(\theta \mid x) (or equivalently its logarithm, for convenience), providing a method to select the most data-compatible parameter.[3][5]
The concept was introduced by Ronald A. Fisher in his 1922 paper "On the Mathematical Foundations of Theoretical Statistics," where he developed it as a foundational tool for frequentist inference, shifting focus from inverse probabilities to data-driven parameter assessment.[5]
A simple example is the binomial likelihood for modeling k successes in n independent trials, such as coin flips, with success probability p: L(p \mid k) = \binom{n}{k} p^k (1-p)^{n-k}. Here, the likelihood peaks at p = k/n, illustrating how it quantifies support for different p values based on the observed proportion.[4]
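The binomial example can be checked numerically. The following minimal sketch (assuming Python with NumPy and SciPy; the observed counts k = 7 and n = 10 are illustrative and not taken from the text) evaluates L(p \mid k) on a grid of candidate p values and confirms that the maximum occurs at p = k/n.

```python
import numpy as np
from scipy.stats import binom

# Illustrative data: k successes observed in n Bernoulli trials.
n, k = 10, 7

# Evaluate the binomial likelihood L(p | k) on a grid of candidate p values.
p_grid = np.linspace(0.001, 0.999, 999)
likelihood = binom.pmf(k, n, p_grid)

# The grid maximizer should agree with the closed-form MLE p_hat = k / n.
p_hat_grid = p_grid[np.argmax(likelihood)]
print(f"grid MLE = {p_hat_grid:.3f}, closed-form MLE = {k / n:.3f}")
```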
Relative Likelihood Definition
In statistics, the relative likelihood provides a normalized measure of the plausibility of a parameter value given observed data, by comparing it to the most plausible value within the model. Formally, for a parameter \theta and data x, the relative likelihood is defined as R(\theta \mid x) = \frac{L(\theta \mid x)}{L(\hat{\theta} \mid x)}, where L(\theta \mid x) is the likelihood function and \hat{\theta} is the maximum likelihood estimate (MLE) that maximizes L over the parameter space. This ratio satisfies 0 \leq R(\theta \mid x) \leq 1, with R(\hat{\theta} \mid x) = 1, emphasizing the relative support for \theta without dependence on absolute likelihood scales.[1]
The log-relative likelihood, l(\theta \mid x) = \log R(\theta \mid x) = \log L(\theta \mid x) - \log L(\hat{\theta} \mid x), is often preferred for numerical stability and analysis, as it transforms the multiplicative scale to an additive one. This form facilitates approximations, such as the second-order Taylor expansion around \hat{\theta}, which yields a quadratic form resembling a normal approximation for large samples. Computationally, it simplifies evaluations in optimization and inference procedures.[1]
Relative likelihood values near 1 indicate high plausibility for the parameter; for instance, the region where R(\theta \mid x) \geq 0.15 approximates a 95% support interval, corresponding asymptotically to a confidence region spanning roughly \pm 1.96 standard errors around the MLE for scalar parameters. Under standard regularity conditions, when \theta is the true parameter value the statistic -2 \log R(\theta \mid x) follows an approximate \chi^2 distribution with degrees of freedom equal to the dimension of \theta, supporting likelihood-based tests and intervals.[6]
Unlike a general likelihood ratio, which compares the likelihoods of two distinct hypotheses, L(\theta_1 \mid x) / L(\theta_2 \mid x), the relative likelihood is inherently tied to the model's maximum, offering a unified scale for assessing evidential support within a single framework. This distinction underscores its role in profiling parameter plausibility rather than in direct hypothesis contrast.
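Continuing the illustrative binomial setting (again assuming Python with NumPy/SciPy and the hypothetical counts k = 7 out of n = 10), the sketch below normalizes the likelihood by its value at the MLE to obtain R(p \mid k) and reads off the set where R \geq 0.15 as an approximate 95% support interval.

```python
import numpy as np
from scipy.stats import binom

# Illustrative binomial data (not from the text): k successes in n trials.
n, k = 10, 7
p_hat = k / n  # closed-form MLE

p_grid = np.linspace(0.001, 0.999, 9999)
log_R = binom.logpmf(k, n, p_grid) - binom.logpmf(k, n, p_hat)  # log relative likelihood
R = np.exp(log_R)                                               # 0 <= R <= 1, with R(p_hat) = 1

# Approximate 95% support interval: parameter values with R(p) >= 0.15.
plausible = p_grid[R >= 0.15]
print(f"R >= 0.15 interval: [{plausible.min():.3f}, {plausible.max():.3f}]")
```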
Parameter Values
Relative Likelihood for Parameters
In statistical inference for parameters within a single model, the relative likelihood R(\theta | x) = L(\theta | x) / L(\hat{\theta} | x) serves as a direct measure of the evidential support for a specific parameter value \theta relative to the maximum likelihood estimate (MLE) \hat{\theta}, where R(\hat{\theta} | x) = 1 by definition. This ratio quantifies the degree to which the observed data x are less plausible under \theta than under \hat{\theta}, providing a scale-invariant assessment of parameter plausibility that avoids assumptions of normality or other asymptotic approximations. Unlike standard errors, which depend on the curvature of the log-likelihood at the MLE, relative likelihood offers a global view of the likelihood surface, enabling visualization of uncertainty through plots of R(\theta | x) against \theta. This approach aligns with the likelihood principle, concentrating all inferential information in the likelihood function itself.[7][1]
For multiparameter models, where interest lies in a subset of parameters \theta_j (parameters of interest) amid nuisance parameters \nu, the profile relative likelihood handles the nuisance parameters by maximizing the likelihood over \nu for each fixed \theta_j. Formally, R(\theta_j | x) = \left[ \sup_{\nu} L(\theta_j, \nu | x) \right] / L(\hat{\theta}, \hat{\nu} | x), which concentrates the likelihood function so that inference focuses on \theta_j while absorbing the impact of \nu. This profiling technique preserves the shape of the likelihood for \theta_j and is essential in practice, for example in generalized linear models, where nuisance parameters such as dispersion must be accounted for without distorting the assessment of key effects. Adjustments to the profile likelihood, such as the Cox-Reid correction, refine it further by subtracting half the log-determinant of the observed information matrix for \nu, improving accuracy in small samples.[8]
Likelihood intervals based on relative likelihood provide an approximation to confidence intervals by delineating the range of \theta values deemed sufficiently plausible. The set \{\theta : R(\theta | x) \geq 0.15\} (equivalently, where the log-likelihood drops by at most about 1.92 units from its maximum) roughly corresponds to a 95% likelihood interval, a threshold that asymptotically aligns with the 95% quantile of a \chi^2_1 distribution under Wilks' theorem, though it differs from Wald intervals by not relying on local curvature or normality. These intervals are typically more conservative and data-dependent than frequentist confidence intervals, emphasizing evidential support over long-run coverage properties, and they perform well even in non-normal settings.[7][1][8]
Evaluating relative likelihood often involves numerical computation, such as gridding over plausible \theta values or using optimization algorithms like Newton-Raphson to locate the MLE and the profile maxima. Software implementations in R or Python facilitate this via built-in maximizers, but challenges arise with multimodal likelihood surfaces, where local optima may mislead inference and require global optimization techniques or multiple starting points to ensure reliable profiling. The expectation-maximization (EM) algorithm is useful for models with latent variables, iteratively handling incomplete data to converge to a maximum of the likelihood.[8][1]
A concrete example occurs with n independent observations from a normal distribution N(\mu, \sigma^2) where \sigma^2 is known, yielding sample mean \bar{x}.
Here, the MLE is \hat{\mu} = \bar{x}, and the relative likelihood simplifies to R(\mu | x) = \exp\left[ -\frac{n (\mu - \bar{x})^2}{2 \sigma^2} \right], demonstrating a symmetric parabolic decline from 1 at \mu = \bar{x}, with the rate of drop-off governed by the sample size n and the precision 1/\sigma^2. This form underscores the quadratic nature of the log-likelihood near the MLE, making it straightforward to compute intervals like \bar{x} \pm \sqrt{2 \sigma^2 / n} for R(\mu | x) \geq 1/e.[9][1]
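The closed-form expression is easy to verify numerically. The sketch below (assuming Python with NumPy; the sample of size n = 20 simulated from N(5, 2^2) with known \sigma is purely illustrative) compares the relative likelihood computed from the normal log-likelihood with the quadratic formula and reports the R \geq 1/e interval \bar{x} \pm \sqrt{2\sigma^2/n}.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data for illustration: n observations from N(mu=5, sigma^2=4), sigma known.
sigma, n = 2.0, 20
x = rng.normal(5.0, sigma, size=n)
xbar = x.mean()  # MLE of mu when sigma^2 is known

def log_lik(mu):
    """Normal log-likelihood in mu with sigma^2 known (constants included)."""
    return -0.5 * np.sum((x - mu) ** 2) / sigma**2 - n * np.log(sigma * np.sqrt(2 * np.pi))

mu_grid = np.linspace(xbar - 2, xbar + 2, 401)
R_numeric = np.exp([log_lik(m) - log_lik(xbar) for m in mu_grid])
R_formula = np.exp(-n * (mu_grid - xbar) ** 2 / (2 * sigma**2))
assert np.allclose(R_numeric, R_formula)  # the closed form matches the direct computation

# R >= 1/e interval: xbar +/- sqrt(2 sigma^2 / n).
half_width = np.sqrt(2 * sigma**2 / n)
print(f"1/e likelihood interval: [{xbar - half_width:.3f}, {xbar + half_width:.3f}]")
```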
Likelihood Regions
Likelihood regions offer a geometric perspective on the uncertainty associated with parameter estimates by delineating sets of plausible values based on the relative likelihood function. Formally, a likelihood region is defined as the set \{\theta : R(\theta \mid x) \geq c\}, where R(\theta \mid x) is the relative likelihood and c is a threshold value that establishes a contour of plausibility, such as c = 0.15 or c = 1/8. For a scalar parameter and c \approx 0.15, the region asymptotically approximates a 95% confidence region under large-sample conditions, leveraging the quadratic behavior of the log relative likelihood \log R(\theta \mid x), since -2 \log c \approx 3.84 = \chi^2_1(0.95).
In one dimension, the likelihood region manifests as an interval surrounding the MLE, bounded by the points where the relative likelihood equals c. For multidimensional parameters, the region assumes more intricate shapes, approximately ellipsoidal under the quadratic approximation of \log R(\theta \mid x) and potentially irregular otherwise, and is typically constructed through contour plotting or numerical optimization to trace the level set R(\theta \mid x) = c. These visualizations highlight joint parameter plausibility, revealing correlations and trade-offs not evident in marginal intervals.[10][11]
Unlike certain confidence regions derived from asymptotic normality (e.g., Wald intervals), likelihood regions maintain invariance under reparameterization, ensuring the set of plausible values transforms coherently with any valid change of variables. This property stems directly from the likelihood function's role in the definition. For models from exponential families, such as the normal or Poisson distributions, likelihood regions coincide with the confidence regions obtained by inverting the likelihood ratio test at the corresponding level.
Parameter values within the likelihood region are interpreted as compatible with the observed data at the specified plausibility level c, providing a direct measure of evidential support without reliance on prior distributions or long-run frequencies. These regions facilitate hypothesis testing by assessing whether a particular \theta_0 or submanifold lies inside the contour; inclusion implies the value is not strongly contradicted by the data.
As an illustrative example, consider estimating the rate parameter \lambda of a Poisson distribution from data with sample mean \bar{x}. The likelihood region \{\lambda : R(\lambda \mid x) \geq 0.15\} forms an interval around \bar{x} that asymptotically approximates a 95% confidence interval. For an observed total count of 17 events (so \bar{x} = 17), this region is approximately [10.15, 26.41], broadly consistent with exact and asymptotic 95% confidence intervals for a Poisson mean, demonstrating the practical alignment in low-dimensional cases.[11]
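The Poisson interval quoted above can be reproduced by solving R(\lambda \mid x) = c numerically. The sketch below (assuming Python with NumPy/SciPy) uses the \chi^2-calibrated cutoff c = \exp(-3.84/2) \approx 0.147 and root finding on the log relative likelihood for an observed count of 17, recovering endpoints close to [10.15, 26.41].

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

# Single Poisson observation with total count 17, so the MLE of lambda is 17.
count = 17
lam_hat = float(count)

def log_R(lam):
    """Log relative likelihood of a Poisson rate lambda given the observed count."""
    return count * np.log(lam / lam_hat) - (lam - lam_hat)

# Cutoff calibrated so that -2 log c equals the 95% quantile of chi^2 with 1 df (about 3.84).
log_c = -0.5 * chi2.ppf(0.95, df=1)

# Endpoints of the likelihood region solve log_R(lambda) = log_c on either side of the MLE.
lower = brentq(lambda lam: log_R(lam) - log_c, 1e-6, lam_hat)
upper = brentq(lambda lam: log_R(lam) - log_c, lam_hat, 10 * lam_hat)
print(f"likelihood region: [{lower:.2f}, {upper:.2f}]")  # close to [10.15, 26.41]
```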
Model Comparison
Relative Likelihood for Models
In the context of model comparison, the relative likelihood assesses the plausibility of different statistical models in fitting the same observed data by comparing their maximized likelihood values. For two competing models M_k and M_1, the relative likelihood is defined as R(M_k \mid x) = \frac{L_{\max}(M_k \mid x)}{L_{\max}(M_1 \mid x)}, where L_{\max}(M \mid x) is the maximum likelihood achieved by optimizing the parameters of model M given the data x. This ratio directly measures how much more (or less) likely the data are under M_k compared to M_1 at their respective best-fitting parameter values. A value of R(M_k \mid x) < 1 implies that M_k fits the data worse than M_1, while R(M_k \mid x) > 1 suggests a better fit.
For nested models, where the parameter space of one model (e.g., the reduced model M_1) is a subset of the other (e.g., the full model M_k), the relative likelihood is commonly analyzed via the likelihood ratio statistic \Lambda = 2 \log \left( \frac{L_{\max}(M_k \mid x)}{L_{\max}(M_1 \mid x)} \right). Under the null hypothesis that the reduced model suffices, \Lambda asymptotically follows a chi-squared distribution with degrees of freedom equal to the difference in the number of free parameters between the models. Equivalently, \Lambda = -2 \log \left( L_{\max}(M_1 \mid x) / L_{\max}(M_k \mid x) \right), minus twice the log relative likelihood of the reduced model with respect to the full model, is approximately \chi^2(\Delta df), providing a basis for testing whether the additional complexity in M_k significantly improves the fit. This approach stems from the large-sample properties of maximum likelihood estimation.[12]
In the case of non-nested models, whose parameter spaces are incomparable, the direct relative likelihood ratio can still be computed, but interpreting it requires caution because differences in model dimension may confound comparisons. To address this, Vuong's test extends the framework by evaluating the standardized differences in log-likelihoods across observations, testing the null hypothesis that both models are equally distant from the true data-generating process against alternatives where one is closer. This test uses the relative likelihood concept to derive a z-statistic for model selection.[13]
Raw relative likelihood ratios have notable limitations, as they do not penalize model complexity and thus tend to favor overparameterized models that achieve higher maximized likelihoods by fitting noise in the data. For instance, when comparing a linear regression model (y = \beta_0 + \beta_1 x + \epsilon) to a quadratic extension (y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon) on the same dataset, the relative likelihood R(\text{quadratic} \mid x) = \frac{L_{\max}(\text{quadratic} \mid x)}{L_{\max}(\text{linear} \mid x)} will always be at least 1, since the linear model is nested within the quadratic. A likelihood ratio test can then assess whether this increase is statistically significant, indicating meaningful evidence for the quadratic term.[12]
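A likelihood ratio test of the linear-versus-quadratic example can be sketched as follows (assuming Python with NumPy/SciPy; the simulated dataset and the Gaussian-error maximized log-likelihood formula \log L_{\max} = -\tfrac{n}{2}[\log(2\pi \,\mathrm{RSS}/n) + 1] are illustrative assumptions, not material from the text).

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)

# Simulated data for illustration: a mildly curved relationship with Gaussian noise.
n = 100
x = np.linspace(0, 10, n)
y = 1.0 + 0.5 * x + 0.05 * x**2 + rng.normal(0, 1.0, n)

def max_log_lik(design, y):
    """Maximized Gaussian log-likelihood of a least-squares fit: -(n/2)[log(2*pi*RSS/n) + 1]."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta) ** 2)
    return -0.5 * len(y) * (np.log(2 * np.pi * rss / len(y)) + 1)

X_lin = np.column_stack([np.ones(n), x])          # reduced model: intercept + x
X_quad = np.column_stack([np.ones(n), x, x**2])   # full model: adds the x^2 term

ll_lin, ll_quad = max_log_lik(X_lin, y), max_log_lik(X_quad, y)
lr_stat = 2 * (ll_quad - ll_lin)                  # Lambda = -2 log R(linear | x)
p_value = chi2.sf(lr_stat, df=1)                  # one extra parameter in the full model
print(f"LR statistic = {lr_stat:.2f}, p-value = {p_value:.4f}")
```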
Relation to Selection Criteria
In model selection, relative likelihood serves as the basis for penalized criteria that adjust for model complexity to prevent overfitting while evaluating predictive quality. These criteria extend the raw relative likelihood R(M_k | x) by incorporating penalties proportional to the number of parameters, enabling systematic comparison among competing models.
The Akaike Information Criterion (AIC), derived by Akaike in 1973, is formulated as \text{AIC} = -2 \log L_{\max} + 2p, where L_{\max} is the maximum likelihood of the model and p is the number of estimated parameters. The relative quality of model k compared to the best-fitting model is then \exp[(\text{AIC}_{\min} - \text{AIC}_k)/2], interpreted as the relative likelihood R(M_k | x) adjusted for the parameter count p.[14] This approximation arises from information-theoretic principles, estimating the model's out-of-sample predictive accuracy.
As an alternative, the Bayesian Information Criterion (BIC), introduced by Schwarz in 1978, is given by \text{BIC} = -2 \log L_{\max} + p \log n, with n denoting the sample size. BIC imposes a harsher penalty on complexity for larger n, yielding a relative measure approximating R(M_k | x) \times n^{-(p_k - p_1)/2}, where p_1 is the parameter count of the reference model.[15] This makes BIC particularly suitable for large datasets, favoring simpler models under asymptotic Bayesian assumptions.
Other criteria build directly on relative likelihood concepts. In generalized linear models (GLMs), the deviance is defined as D = -2 \log R, where R is the relative likelihood of the fitted model with respect to a saturated model that fits the data perfectly; lower deviance indicates better fit relative to the saturated likelihood.[16] Cross-validation provides an empirical analog to relative likelihood by partitioning the data and computing average predictive likelihoods across folds, offering a non-parametric assessment of model performance.[17]
Raw relative likelihood suffices for straightforward model comparisons without complexity concerns, but penalized criteria like AIC and BIC are essential for predictive applications to mitigate overfitting by trading off fit against parsimony.
A practical illustration appears in ARIMA time series modeling, where AIC-based relative likelihoods guide selection among orders (p, d, q); for instance, in forecasting datasets, the model with the lowest AIC (such as an ARIMA(1,1,1) over higher-order alternatives) attains the highest relative likelihood, balancing in-sample fit with forecasting reliability.[18]
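As a small sketch of the AIC-based relative likelihood (assuming Python with NumPy; the AIC values and ARIMA orders listed are hypothetical, not from any dataset in the text), the quantity \exp[(\text{AIC}_{\min} - \text{AIC}_k)/2] ranks candidate models on a 0-to-1 scale, and normalizing it across the candidate set gives Akaike weights.

```python
import numpy as np

# Hypothetical AIC values for three candidate ARIMA orders (illustrative numbers only).
models = {"ARIMA(1,1,1)": 512.3, "ARIMA(2,1,2)": 514.0, "ARIMA(0,1,1)": 518.7}

aic = np.array(list(models.values()))
rel_lik = np.exp((aic.min() - aic) / 2)   # relative likelihood of each model given the data
weights = rel_lik / rel_lik.sum()         # Akaike weights sum to 1 across the candidate set

for (name, _), r, w in zip(models.items(), rel_lik, weights):
    print(f"{name}: relative likelihood = {r:.3f}, Akaike weight = {w:.3f}")
```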