Likelihood function
The likelihood function, often simply referred to as the likelihood, is a fundamental concept in statistical inference that measures the plausibility of different parameter values for a statistical model given a fixed set of observed data.[1] Formally, for a sample of independent observations x = (x_1, \dots, x_n) drawn from a probability distribution with density or mass function f(x_i \mid \theta), where \theta denotes the parameters, the likelihood function is defined as L(\theta \mid x) = \prod_{i=1}^n f(x_i \mid \theta), treating the data as fixed and varying \theta to assess relative support from the evidence.[2] Unlike a probability distribution over the data, the likelihood does not integrate or sum to 1 with respect to \theta, and constant factors independent of the parameters can often be omitted without altering inferences.[3]
The likelihood function was introduced by British statistician Ronald A. Fisher as part of his development of modern statistical methods in the early 20th century.[4] Fisher first outlined a numerical procedure akin to maximum likelihood in 1912, but formally presented the concept in his seminal 1922 paper, "On the Mathematical Foundations of Theoretical Statistics," where he defined estimation problems in terms of maximizing the likelihood to obtain efficient and consistent parameter estimates.[5] This work distinguished the likelihood from Bayesian approaches by avoiding prior distributions on parameters, emphasizing instead the data's evidential content about \theta.[4]
In practice, the likelihood function underpins maximum likelihood estimation (MLE), a cornerstone method for parameter estimation across diverse fields including biology, physics, economics, and machine learning.[2] MLE seeks the value \hat{\theta} that maximizes L(\theta \mid x), or equivalently the log-likelihood \ell(\theta \mid x) = \log L(\theta \mid x), by solving the score equation \frac{\partial \ell(\theta \mid x)}{\partial \theta} = 0; under standard regularity conditions, such estimators are consistent, asymptotically normal, and efficient, with their variability approximated by the inverse of the Fisher information I(\theta) = - \mathbb{E}\left[ \frac{\partial^2 \ell(\theta \mid x)}{\partial \theta^2} \right].[2] Beyond estimation, the likelihood enables likelihood ratio tests for comparing models, construction of confidence intervals, and even Bayesian posterior inference when combined with priors.[1] For instance, in analyzing binomial data from n trials with k successes, the likelihood L(p \mid k) \propto p^k (1-p)^{n-k} yields the MLE \hat{p} = k/n, illustrating its simplicity and utility in probabilistic modeling.[2]
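To make the closing binomial example concrete, the short Python sketch below (with hypothetical values n = 10 and k = 7) evaluates the binomial log-likelihood over a grid of p and checks that the grid maximizer agrees with the closed-form MLE \hat{p} = k/n.

```python
import numpy as np
from scipy.stats import binom

n, k = 10, 7  # hypothetical data: 7 successes in 10 trials

# Log-likelihood of p given the observed count (data fixed, parameter varying).
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = binom.logpmf(k, n, p_grid)

p_hat_grid = p_grid[np.argmax(log_lik)]
print(p_hat_grid, k / n)  # grid maximizer vs. closed-form MLE k/n
```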
Definition
Discrete Case
In the discrete case, the likelihood function for a parameter vector \theta given an observed value x of a discrete random variable X is defined as L(\theta \mid x) = p(x \mid \theta), where p(\cdot \mid \theta) denotes the probability mass function of X parameterized by \theta.[3][6] Unlike the probability mass function, which treats the parameters as fixed and evaluates the probability over varying possible data values, the likelihood function regards the observed data as fixed and examines how the probability of that data changes as a function of the varying parameters.[7][8][9] For instance, in a single Bernoulli trial with success probability \theta and observed success x = 1, the likelihood simplifies to L(\theta \mid 1) = \theta; this function increases monotonically with \theta over [0, 1], indicating that larger \theta values render the observed outcome more probable under the model.[10][3]
When multiple independent observations x_1, \dots, x_n are available from the same discrete distribution, the joint likelihood is the product of the individual probabilities: L(\theta \mid x_1, \dots, x_n) = \prod_{i=1}^n p(x_i \mid \theta). This multiplicative form arises because the observations are assumed independent given \theta.[3][11]
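As a minimal sketch of the product form, the following Python snippet (using a hypothetical sequence of Bernoulli observations) evaluates the joint likelihood at a few candidate values of \theta.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])  # hypothetical Bernoulli observations

def bernoulli_likelihood(theta, x):
    # Joint likelihood of independent Bernoulli data: the product of individual pmf values.
    return np.prod(theta ** x * (1 - theta) ** (1 - x))

for theta in (0.3, 0.5, 0.7):
    print(theta, bernoulli_likelihood(theta, x))
```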
Continuous Case
In the continuous case, the likelihood function for a single observation from a continuous random variable X with probability density function f(x \mid \theta) is defined as L(\theta \mid x) = f(x \mid \theta), where the observed value x is fixed and \theta is the varying parameter.[12] This setup treats the density as a function of the parameter given the data, rather than a probability over possible outcomes. Unlike a probability density function, which integrates to 1 over the data space for fixed parameters, the likelihood L(\theta \mid x) does not generally satisfy \int L(\theta \mid x) \, d\theta = 1 when integrated over the parameter space.[7][13] This distinction emphasizes that the likelihood evaluates the relative plausibility of parameters for the observed data, without normalizing as a probability distribution over \theta.
A standard example arises with a single observation x from a normal distribution parameterized by mean \mu and variance \sigma^2. The probability density function is f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), so the likelihood becomes L(\mu, \sigma^2 \mid x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). This form highlights the dependence on \mu and \sigma^2, peaking where the parameters align closely with the observed x.
For n independent observations x_1, \dots, x_n from the same continuous distribution, the likelihood is the product of the individual densities, assuming independence: L(\theta \mid x_1, \dots, x_n) = \prod_{i=1}^n f(x_i \mid \theta). This joint likelihood scales with the sample size, amplifying the influence of the data on parameter inference.[14]
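The following sketch (with a hypothetical five-point sample) evaluates the joint normal log-likelihood on a grid of (\mu, \sigma) and compares the grid maximizer with the closed-form maximum likelihood estimates, namely the sample mean and the root mean squared deviation.

```python
import numpy as np
from scipy.stats import norm

x = np.array([4.8, 5.6, 5.1, 4.3, 5.9])  # hypothetical normal sample

def log_likelihood(mu, sigma, x):
    # Joint log-likelihood: sum of the log densities of the independent observations.
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

mus = np.linspace(3.0, 7.0, 201)
sigmas = np.linspace(0.2, 2.0, 91)
ll = np.array([[log_likelihood(m, s, x) for s in sigmas] for m in mus])
i, j = np.unravel_index(ll.argmax(), ll.shape)
print(mus[i], sigmas[j])        # grid maximizer
print(x.mean(), x.std(ddof=0))  # closed-form MLEs: sample mean and root mean squared deviation
```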
General Formulation
In measure-theoretic probability theory, the likelihood function provides a general framework for statistical inference across arbitrary probability spaces. Consider a parametric family of probability measures \{P_\theta : \theta \in \Theta\} on a measurable sample space (\mathcal{X}, \mathcal{A}), where \Theta is the parameter space, typically a subset of \mathbb{R}^k. For observed data x \in \mathcal{X}, the likelihood function L(\theta \mid x) is defined as the Radon-Nikodym derivative of P_\theta with respect to a dominating \sigma-finite measure \mu on (\mathcal{X}, \mathcal{A}), provided that each P_\theta is absolutely continuous with respect to \mu: L(\theta \mid x) = \frac{dP_\theta}{d\mu}(x). This formulation unifies the discrete and continuous cases, where \mu may be a counting measure or Lebesgue measure, respectively, and extends to more complex spaces under suitable conditions.[15]
The parameter space \Theta specifies the allowable values of \theta, and the likelihood is defined for each \theta \in \Theta, treating the data x as fixed while varying \theta. This setup ensures that L(\theta \mid x) quantifies how well the model parameterized by \theta explains the observed x, without regard to the prior distribution over \Theta.
For asymptotic properties, such as the consistency and normality of maximum likelihood estimators, certain regularity conditions are required. These include: the true parameter lying in the interior of \Theta; the log-likelihood \ell(\theta \mid x) = \log L(\theta \mid x) being thrice continuously differentiable with respect to \theta; the ability to interchange differentiation and integration (e.g., second derivatives passing under the integral); and the Fisher information matrix I(\theta) = -\mathbb{E}\left[ \frac{\partial^2 \ell(\theta \mid x)}{\partial \theta \partial \theta^T} \right] being positive definite for all \theta \in \Theta. These assumptions ensure that the observed information converges uniformly to the expected Fisher information as the sample size increases.[16][17]
Unlike a probability density function over the data, the likelihood function is not normalized as a density with respect to \theta, meaning \int_\Theta L(\theta \mid x) \, d\theta \neq 1 in general. This is because L(\theta \mid x) is proportional to the density of the data under P_\theta but serves as a measure of relative support for different \theta values given fixed x, without integrating to unity over the parameter space; normalization is neither defined nor required for frequentist inference based on likelihood ratios or maxima.[15][7]
Mixed Distributions
In mixed distributions, the observed data consist of both discrete and continuous components, denoted as x = (x_d, x_c), where x_d follows a discrete probability mass function p_d(\cdot \mid \theta) and x_c follows a continuous probability density function f_c(\cdot \mid \theta). The likelihood function is constructed as the product L(\theta \mid x) = p_d(x_d \mid \theta) \cdot f_c(x_c \mid \theta), reflecting the independence between the components under the model parameterization \theta.[18][19]
A key challenge in formulating the likelihood for mixed distributions arises from the absence of a single dominating measure that unifies the discrete and continuous parts, as discrete components lack a density with respect to Lebesgue measure while continuous parts do not admit point masses. To address this, the overall distribution can be expressed using a generalized density that incorporates Dirac delta functions for the discrete component, for example f(x \mid \theta) = \left[ \sum_k p_d(a_k \mid \theta)\, \delta(x_d - a_k) \right] f_c(x_c \mid \theta), where the a_k range over the support of x_d; equivalently, the likelihood can be treated as a Radon-Nikodym derivative with respect to a hybrid measure built from counting and Lebesgue components.[20][18]
An illustrative example is the Poisson process, where the number of events N(t) = n in a fixed interval [0, t] is discrete (following a Poisson distribution with mean \lambda t), and the arrival times S_1 < S_2 < \cdots < S_n are continuous (with exponentially distributed interarrival times of rate \lambda). The likelihood incorporating both is L(\lambda \mid n, s_1, \dots, s_n) = \lambda^n e^{-\lambda t}, which factors as the Poisson probability mass for the count times the conditional density of the ordered arrival times given that count, enabling parameter estimation that accounts for both observed events and timing.[21]
This construction has important implications for statistical inference in models involving mixed data, such as survival analysis with right-censoring, where event times may be continuous but censoring indicators are discrete (observed or censored). The product-form likelihood ensures model compatibility by using the density for uncensored times and the survival function (the integral of the density beyond the censoring time) for censored observations, facilitating maximum likelihood estimation while handling the hybrid nature of the data without biasing toward either component.[22][23]
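A brief sketch of the right-censoring construction mentioned above, assuming hypothetical exponentially distributed event times: each observed event contributes the density and each censored time contributes the survival function, and the resulting maximizer matches the closed-form estimate (number of events divided by total follow-up time).

```python
import numpy as np

# Hypothetical right-censored exponential data: observed times and event indicators
# (delta = 1 if the event was observed, 0 if the time is a censoring time).
times = np.array([2.1, 0.7, 3.5, 1.2, 4.0])
delta = np.array([1,   1,   0,   1,   0])

def censored_loglik(lam, times, delta):
    # Log-likelihood mixing a density term for events and a survival term for censored times.
    log_density = np.log(lam) - lam * times   # log f(t) = log(lambda) - lambda * t
    log_survival = -lam * times               # log S(t) = -lambda * t
    return np.sum(delta * log_density + (1 - delta) * log_survival)

lam_grid = np.linspace(0.01, 2.0, 2000)
lam_hat = lam_grid[np.argmax([censored_loglik(l, times, delta) for l in lam_grid])]
print(lam_hat, delta.sum() / times.sum())  # grid estimate vs. closed-form MLE: events / total time
```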
Key Properties
Likelihood Ratio
The likelihood ratio for two specific parameter values \theta_0 and \theta_1 given observed data x is defined as \Lambda(\theta_0, \theta_1 \mid x) = \frac{L(\theta_0 \mid x)}{L(\theta_1 \mid x)}, where L(\theta \mid x) denotes the likelihood function.[24] This ratio quantifies the relative support provided by the data for one parameter value over the other; specifically, if \Lambda > 1, the data favor \theta_0 as more plausible than \theta_1, while \Lambda < 1 indicates the opposite.[25] The likelihood ratio serves as a foundational tool in statistical inference for comparing competing parameter values or models by directly measuring how much more (or less) likely the data are under one specification compared to another.
In hypothesis testing, the likelihood ratio is extended to the generalized form for comparing a null hypothesis H_0: \theta = \theta_0 against an alternative H_1: \theta \neq \theta_0, where \Lambda = \frac{L(\theta_0 \mid x)}{\sup_{\theta} L(\theta \mid x)}.[25] For instance, consider a binomial model with n trials and observed successes k, testing H_0: p = p_0 versus H_1: p \neq p_0; here, \Lambda = \frac{\binom{n}{k} p_0^k (1 - p_0)^{n-k}}{\sup_p \binom{n}{k} p^k (1 - p)^{n-k}} = \frac{p_0^k (1 - p_0)^{n-k}}{\hat{p}^k (1 - \hat{p})^{n-k}}, with \hat{p} = k/n as the maximum likelihood estimate.[26] Small values of \Lambda (typically below a threshold determined by the desired significance level) lead to rejection of H_0, indicating that the restricted model under H_0 fits the data substantially worse than the unrestricted alternative.[25]
Under the null hypothesis and for large sample sizes, the test statistic -2 \log \Lambda follows an asymptotic \chi^2 distribution with degrees of freedom equal to the difference in the dimensionality of the parameter spaces under the alternative and null hypotheses, as established by Wilks' theorem.[27] This approximation enables the computation of p-values and critical regions for the likelihood ratio test in composite hypothesis settings, facilitating inference even when exact distributions are intractable.[25]
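A minimal sketch of the binomial likelihood ratio test, using hypothetical values n = 50, k = 32, and p_0 = 0.5, and Wilks' chi-square approximation for the p-value.

```python
import numpy as np
from scipy.stats import binom, chi2

n, k, p0 = 50, 32, 0.5  # hypothetical data and null value

p_hat = k / n
# Generalized likelihood ratio: null likelihood over likelihood at the MLE.
log_lambda = binom.logpmf(k, n, p0) - binom.logpmf(k, n, p_hat)
stat = -2 * log_lambda
p_value = chi2.sf(stat, df=1)  # Wilks: -2 log(Lambda) ~ chi-square with 1 df
print(stat, p_value)
```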
Relative Likelihood
The relative likelihood of a parameter value \theta given observed data x is defined as R(\theta \mid x) = \frac{L(\theta \mid x)}{L(\hat{\theta} \mid x)}, where \hat{\theta} = \arg\max_{\theta} L(\theta \mid x) is the maximum likelihood estimate (MLE) and L(\theta \mid x) is the likelihood function. This ratio, which ranges from 0 to 1, quantifies the relative support for \theta compared to the MLE, normalizing the likelihood to its peak value and facilitating comparisons across the parameter space without regard to the absolute scale of the likelihood.[28]
Likelihood regions, defined as the sets \{\theta : R(\theta \mid x) \geq c\} for 0 < c < 1, delineate regions of the parameter space where the relative likelihood exceeds a specified threshold c, offering a direct measure of evidential support from the data. Asymptotically, these regions approximate confidence regions under mild regularity conditions, with the choice of c calibrated to match coverage probabilities; for instance, in one dimension, c \approx 0.15 (corresponding to \exp(-1.92), where 1.92 is half the 95% \chi^2_1 quantile of 3.84) yields an approximate 95% confidence region. This approximation arises from the large-sample distribution of -2 \log R(\theta \mid x) \approx \chi^2_p, where p is the number of parameters of interest. Graphically, contours of constant relative likelihood in the parameter space visualize the shape and extent of evidential support, often forming approximately elliptical boundaries centered at the MLE that reflect the curvature of the likelihood surface.[28] These contours aid in exploring parameter uncertainty and model diagnostics, particularly in multiparameter settings where the relative likelihood highlights trade-offs between parameters.
As an illustrative example, consider estimating the mean \mu of a normal distribution N(\mu, \sigma^2) with known variance \sigma^2 > 0, based on n independent observations x = (x_1, \dots, x_n). The relative likelihood simplifies to R(\mu \mid x) = \exp\left( -\frac{n (\mu - \bar{x})^2}{2 \sigma^2} \right), where \bar{x} = n^{-1} \sum_{i=1}^n x_i is the sample mean, achieving its maximum of 1 at \mu = \bar{x} and declining away from this point (the log relative likelihood drops parabolically), with the rate of decline governed by the sample size n and the precision 1/\sigma^2.[29] For c = 0.15, the corresponding likelihood region is approximately \mu \in [\bar{x} - 1.96 \sigma / \sqrt{n}, \bar{x} + 1.96 \sigma / \sqrt{n}], aligning with the standard 95% confidence interval.
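The sketch below (with a hypothetical sample and known \sigma = 1) traces out the c = 0.15 likelihood region for the normal mean and compares it with the standard 95% confidence interval.

```python
import numpy as np

x = np.array([10.2, 9.5, 11.1, 10.8, 9.9, 10.4])  # hypothetical sample
sigma = 1.0                                        # known standard deviation
n, xbar = len(x), x.mean()

def relative_likelihood(mu):
    # R(mu | x) = L(mu | x) / L(mu_hat | x) for a normal mean with known variance.
    return np.exp(-n * (mu - xbar) ** 2 / (2 * sigma ** 2))

c = 0.15
mu_grid = np.linspace(xbar - 3, xbar + 3, 6001)
region = mu_grid[relative_likelihood(mu_grid) >= c]
print(region.min(), region.max())                                          # likelihood region endpoints
print(xbar - 1.96 * sigma / np.sqrt(n), xbar + 1.96 * sigma / np.sqrt(n))  # approximate 95% CI
```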
Log-Likelihood Function
The log-likelihood function, denoted as \ell(\theta \mid x), is the natural logarithm of the likelihood function L(\theta \mid x), providing a transformed measure of how well a statistical model with parameter \theta explains the observed data x.[30] For a sample of n independent observations, it takes the explicit form \ell(\theta \mid x) = \sum_{i=1}^n \log f(x_i \mid \theta), where f(x_i \mid \theta) is the probability density or mass function of each observation.[30]
A key property of the log-likelihood is that the logarithm is a strictly increasing monotonic transformation, ensuring that the parameter value maximizing \ell(\theta \mid x) is identical to the one maximizing L(\theta \mid x); thus, \arg\max_\theta \ell(\theta \mid x) = \arg\max_\theta L(\theta \mid x).[31] This transformation also converts the product of individual densities in the likelihood into a sum, facilitating easier numerical evaluation, optimization, and differentiation in computational procedures.[32]
Graphically, the log-likelihood function often displays a unimodal shape, rising to a peak at the maximum likelihood estimate (MLE) and descending thereafter, with the peak becoming sharper as the sample size grows and the evidence concentrates around the true parameter.[30] The second derivative captures this curvature: the observed information is given by I(\theta) = -\frac{\partial^2 \ell}{\partial \theta^2} (a matrix in the multiparameter case), evaluated at the MLE, which measures the estimate's precision and forms the basis for asymptotic variance approximations.[33]
For illustration, consider n independent observations from an exponential distribution with rate parameter \lambda > 0, where the density is f(x \mid \lambda) = \lambda e^{-\lambda x} for x \geq 0. The log-likelihood simplifies to \ell(\lambda \mid x) = n \log \lambda - \lambda \sum_{i=1}^n x_i, which attains its maximum at the MLE \hat{\lambda} = n / \sum_{i=1}^n x_i, the reciprocal of the sample mean.[34]
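For the exponential example, the following sketch (with hypothetical data) minimizes the negative log-likelihood numerically and compares the result with the closed-form MLE n / \sum x_i.

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.8, 2.3, 1.1, 0.4, 1.9, 0.6])  # hypothetical exponential sample

def neg_log_likelihood(lam):
    # Negative exponential log-likelihood: -(n log(lambda) - lambda * sum(x)).
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50), method="bounded")
print(res.x, len(x) / x.sum())  # numerical maximizer vs. closed-form MLE n / sum(x)
```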
Nuisance Parameter Elimination
Profile Likelihood
In statistical models where the parameter vector is partitioned as \theta = (\psi, \lambda), with \psi denoting the parameter of interest and \lambda the nuisance parameter(s), the profile log-likelihood for \psi is defined as \ell_p(\psi \mid x) = \max_\lambda \ell(\psi, \lambda \mid x), where \ell(\cdot \mid x) is the log-likelihood function given observed data x.[35] This approach concentrates the full likelihood by optimizing out the nuisance parameters, yielding a reduced likelihood function that depends solely on \psi.[36]
The profile log-likelihood is constructed by, for each fixed value of \psi, computing the conditional maximum likelihood estimator of the nuisance parameters, \hat{\lambda}(\psi) = \arg\max_\lambda \ell(\psi, \lambda \mid x), and substituting it into the original log-likelihood: \ell_p(\psi \mid x) = \ell(\psi, \hat{\lambda}(\psi) \mid x).[36] This maximization step typically involves solving score equations or using numerical optimization methods, assuming the likelihood is sufficiently smooth and the maximum exists.[35]
Under regularity conditions, the profile likelihood possesses desirable asymptotic properties for inference on \psi, including equivalence to the full likelihood in large samples; specifically, the profile maximum likelihood estimator \hat{\psi}_p is asymptotically normal and efficient, matching the properties of the unrestricted maximum likelihood estimator \hat{\psi}.[36] It is particularly useful for constructing confidence intervals for \psi, where the statistic -2 \log \left[ L_p(\psi)/L_p(\hat{\psi}_p) \right] (with L_p the profile likelihood) asymptotically follows a \chi^2 distribution with one degree of freedom under the hypothesis that \psi is the true value, enabling likelihood ratio-based intervals that account for the nuisance parameters.[36]
A representative example arises in simple linear regression, y_i = \beta_0 + \beta_1 x_i + \epsilon_i with \epsilon_i \sim N(0, \sigma^2), where the focus is on the slope \beta_1 = \psi and the intercept \beta_0 and variance \sigma^2 = \lambda are treated as nuisance. For fixed \beta_1, the profile likelihood maximizes the normal log-likelihood over \beta_0 (yielding the least-squares fit conditional on \beta_1) and \sigma^2 (set to the conditional residual mean squared error), resulting in \ell_p(\beta_1 \mid y) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\hat{\sigma}^2(\beta_1)) - \frac{n}{2}, where \hat{\sigma}^2(\beta_1) is the profiled variance estimate; this facilitates direct inference on \beta_1 via the resulting one-dimensional profile curve.[37]
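A sketch of the profiling step for the regression example, using simulated (hypothetical) data: for each fixed slope \beta_1, the intercept and variance are replaced by their conditional maximizers, and the resulting one-dimensional profile is maximized over a grid.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)                         # hypothetical design points
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=x.size)  # hypothetical responses
n = x.size

def profile_loglik(beta1):
    # Profile out the intercept and variance for a fixed slope beta1.
    beta0_hat = y.mean() - beta1 * x.mean()   # conditional least-squares intercept
    resid = y - beta0_hat - beta1 * x
    sigma2_hat = np.mean(resid ** 2)          # conditional MLE of the variance
    return -0.5 * n * (np.log(2 * np.pi) + np.log(sigma2_hat) + 1)

beta1_grid = np.linspace(0.0, 1.0, 1001)
beta1_hat = beta1_grid[np.argmax([profile_loglik(b) for b in beta1_grid])]
print(beta1_hat, np.polyfit(x, y, 1)[0])  # profile maximizer vs. ordinary least-squares slope
```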
Conditional Likelihood
The conditional likelihood is a technique for focusing inference on a parameter of interest, ψ, in the presence of nuisance parameters, λ, by conditioning the full likelihood on a suitable sufficient or ancillary statistic. Formally, if T(x) is a statistic that is sufficient for λ for each fixed ψ, the joint density factors as f(x \mid \psi, \lambda) = f(x \mid T(x); \psi) \, f(T(x) \mid \psi, \lambda), and the conditional likelihood is the first factor, L_c(\psi \mid x) = f(x \mid T(x); \psi) = \frac{L(\psi, \lambda \mid x)}{L(\psi, \lambda \mid T(x))}, which depends only on ψ and facilitates estimation and testing free of the nuisance parameter.[36] This approach leverages the factorization theorem from sufficiency to partition the data's information, isolating the component relevant to ψ while removing the influence of λ through conditioning rather than integration or maximization.[38]
The method is applicable in models where, for each fixed ψ, such a sufficient statistic T(x) for the nuisance parameter λ exists, a condition often satisfied in multiparameter exponential families due to their structured sufficient statistics.[39] In such cases, the conditional distribution forms a new exponential family parameterized solely by ψ, enabling exact inference without approximation. Unlike the profile likelihood, which maximizes the joint likelihood over λ to approximate inference on ψ, the conditional likelihood achieves exact elimination of the nuisance parameter when the conditioning statistic is available, avoiding potential biases from maximization.[36]
Key properties include its ability to reduce bias in small-sample inference compared to profile methods, as it preserves the exact conditional distribution and avoids overestimation of precision from nuisance parameter variability.[38] In exponential families, the conditional likelihood yields asymptotically efficient estimators that are consistent and normally distributed, with relative error to the profile likelihood of order O(n^{-3/2}) under i.i.d. sampling, and it supports accurate tail probability approximations for p-values and confidence intervals.[36] These properties make it particularly valuable for precise hypothesis testing and interval estimation in structured models.
A representative example arises in comparing two independent Poisson counts, X ~ Pois(ψλ) and Y ~ Pois(λ), where ψ is the relative rate of interest and λ is the common intensity nuisance parameter. The total T = X + Y serves as the conditioning statistic; the conditional distribution is X | T = t ~ Binomial(t, ψ / (1 + ψ)), yielding a conditional likelihood L_c(ψ | T) that depends only on ψ and is free of λ, thus eliminating the nuisance parameter and enabling exact tests for ψ = 1.[40]
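For the two-Poisson example, the sketch below (with hypothetical counts) evaluates the conditional log-likelihood, which is binomial in form and free of \lambda; its maximizer equals the ratio of the observed counts.

```python
import numpy as np
from scipy.stats import binom

x_obs, y_obs = 14, 8  # hypothetical Poisson counts: X ~ Pois(psi * lam), Y ~ Pois(lam)
t = x_obs + y_obs     # conditioning statistic

def conditional_loglik(psi):
    # Conditional log-likelihood of psi: X | T = t ~ Binomial(t, psi / (1 + psi)), free of lambda.
    return binom.logpmf(x_obs, t, psi / (1 + psi))

psi_grid = np.linspace(0.05, 10, 4000)
psi_hat = psi_grid[np.argmax(conditional_loglik(psi_grid))]
print(psi_hat, x_obs / y_obs)  # conditional MLE is the ratio of counts x/y
```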
Marginal Likelihood
The marginal likelihood addresses the presence of nuisance parameters λ in a statistical model by integrating the full likelihood function over these parameters, yielding a likelihood that depends only on the parameter of interest ψ. Formally, for observed data x, the marginal likelihood is defined as L_m(\psi \mid x) = \int L(\psi, \lambda \mid x) \, \pi(\lambda \mid \psi) \, d\lambda, where π(λ | ψ) is a weighting density for the nuisance parameters, often chosen as a uniform or noninformative prior to reflect limited prior knowledge.[41] This approach produces a function of ψ alone, analogous to a likelihood in reduced models, and is particularly valuable for inference focused on ψ without conditioning on or maximizing over λ.[41]
While central to Bayesian marginalization (where it forms the evidence or normalizing constant for the posterior on ψ), the marginal likelihood can also be employed in frequentist settings by selecting appropriate weightings, such as improper uniforms, to obtain valid inference procedures with desirable properties like consistency.[41] It effectively incorporates uncertainty from λ into the assessment of ψ, leading to broader confidence regions compared to methods that fix or profile λ. However, exact computation is often challenging due to the high-dimensional integrals involved, especially in complex models.[41]
To overcome these computational hurdles, approximations such as the Laplace method provide asymptotic expansions around the mode of the integrand, yielding accurate estimates for large samples.[42] For model selection purposes, the Bayesian information criterion (BIC) serves as a practical surrogate, approximating the negative log marginal likelihood as -2 log L_m(ψ | x) ≈ -2 log L(\hat{ψ}, \hat{λ} | x) + k log n, where k is the number of parameters and n the sample size, facilitating comparisons across models without full integration.[43]
A classic illustration occurs in the normal model with observations x = (x_1, \dots, x_n) drawn independently from N(ψ, λ), with unknown mean ψ and variance λ > 0. Integrating the likelihood over λ using an inverse-gamma or flat prior weighting results in a marginal likelihood for ψ proportional to the density of a Student's t-distribution with n-1 degrees of freedom, centered at the sample mean and scaled by the sample standard deviation; this form underpins t-based inference for ψ while accounting for variance uncertainty.
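A rough numerical sketch of marginalization for the normal example, using a hypothetical sample and a 1/\lambda weighting for the variance (a conventional noninformative choice, assumed here purely for illustration); the resulting marginal likelihood for \psi peaks near the sample mean.

```python
import numpy as np
from scipy import integrate, stats

x = np.array([4.1, 5.3, 4.7, 6.0, 5.1, 4.5])  # hypothetical normal sample

def marginal_lik(psi):
    # Integrate the normal likelihood over the variance lam with a 1/lam weighting.
    def integrand(lam):
        return np.exp(stats.norm.logpdf(x, loc=psi, scale=np.sqrt(lam)).sum()) / lam
    val, _ = integrate.quad(integrand, 1e-6, 50)
    return val

psi_grid = np.linspace(3, 7, 81)
vals = np.array([marginal_lik(p) for p in psi_grid])
print(psi_grid[vals.argmax()], x.mean())  # marginal likelihood peaks near the sample mean
```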
Partial Likelihood
The partial likelihood is a method developed for inference in semi-parametric models where the full probability density is not fully specified, particularly useful in survival analysis for focusing on the effects of covariates while treating the baseline hazard as a nuisance parameter. Introduced by Cox, it constructs a likelihood based on conditional probabilities of event occurrences given the risk set at each event time, avoiding the need to specify or estimate the entire distribution.[44]
In the Cox proportional hazards model, the partial likelihood for the regression coefficients \beta given the observed data \mathbf{x} (including event times and covariates) is given by L_p(\beta \mid \mathbf{x}) = \prod_{i: \delta_i=1} \frac{\exp(\beta^\top z_i)}{\sum_{j \in R(t_i)} \exp(\beta^\top z_j)}, where the product is over the event times t_i with indicator \delta_i = 1 for observed events, z_i is the covariate vector for the individual experiencing the event, and R(t_i) denotes the risk set of individuals at risk just prior to t_i. This formulation arises from the assumption that the hazard function factors into a baseline hazard \lambda_0(t) and a relative hazard \exp(\beta^\top z), with the partial likelihood capturing only the relative contributions of covariates at each event, thereby eliminating the unspecified baseline hazard without requiring integration over nuisance parameters or explicit conditioning on ancillary statistics.[44][45]
The partial likelihood is applicable in settings where interest lies primarily in the relative hazards influenced by covariates, such as time-to-event analyses with right-censoring, as it provides a robust way to eliminate nuisance parameters like the baseline hazard directly through the conditional structure, distinct from methods involving integration or maximization over nuisances. Key properties include its asymptotic efficiency: under suitable regularity conditions, the maximum partial likelihood estimator \hat{\beta} obtained by solving the score equations from the partial log-likelihood \ell_p(\beta) = \log L_p(\beta \mid \mathbf{x}) is consistent, asymptotically normal, and fully efficient relative to the full likelihood, matching the information bound for \beta as sample size increases.[46]
A canonical example is the Cox proportional hazards model for analyzing time-to-event data, where the partial likelihood concentrates solely on the covariate effects \beta, treating the baseline hazard as unspecified and thus avoiding parametric assumptions about the distribution of survival times. For instance, in a study of patient survival with covariates like age and treatment, the partial likelihood takes a product over the observed failure times of the probability that the failing individual is the one to fail among those at risk, given their relative hazards, enabling estimation of \beta without modeling the underlying time distribution. This approach has become foundational in survival analysis due to its flexibility and theoretical guarantees.[45][44]
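A compact sketch of the Cox partial log-likelihood for a small hypothetical dataset with one covariate; the risk set at each event time consists of all individuals whose observed time is at least as large.

```python
import numpy as np

# Hypothetical right-censored survival data: times, event indicators, and one covariate.
times = np.array([5.0, 8.0, 3.0, 10.0, 6.0])
event = np.array([1,   0,   1,   1,    0])
z     = np.array([0.5, 1.2, -0.3, 0.0, 0.8])

def cox_partial_loglik(beta):
    # Sum over event times of log[ exp(beta*z_i) / sum_{j in risk set} exp(beta*z_j) ].
    ll = 0.0
    for i in np.where(event == 1)[0]:
        risk_set = times >= times[i]  # individuals still at risk just before t_i
        ll += beta * z[i] - np.log(np.sum(np.exp(beta * z[risk_set])))
    return ll

beta_grid = np.linspace(-5, 5, 2001)
beta_hat = beta_grid[np.argmax([cox_partial_loglik(b) for b in beta_grid])]
print(beta_hat)
```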
Advanced Topics
Products of Likelihoods
In statistical inference, when data arise from independent experiments or sources, the joint likelihood function is formed by multiplying the individual likelihood functions. Specifically, for two independent datasets x and y parameterized by \theta, the joint likelihood is L(\theta \mid x, y) = L(\theta \mid x) \cdot L(\theta \mid y), leveraging the property that the joint probability density or mass function factors under independence.[47] This multiplicative principle extends to any number of independent components, providing a foundational mechanism for aggregating probabilistic evidence across separate observations or models.[47]
This approach finds applications in combining evidence from multiple independent studies, such as in meta-analyses where likelihoods from disparate trials are multiplied to yield a unified inference on shared parameters.[48] It also supports modular model building, where separate likelihood components are constructed for distinct aspects of the data (potentially involving different parameters) and then combined to form a comprehensive model, facilitating scalable and interpretable analyses in complex systems.[48]
However, the validity of multiplying likelihoods hinges on the independence of the data sources; violations, such as unaccounted correlations, can distort the joint inference.[48] Additionally, if parameters overlap across components without appropriate modeling, the product may lead to non-identifiability, where multiple parameter values yield equivalent likelihoods, complicating estimation.[49]
A concrete example involves combining a Poisson-distributed count dataset with mean \theta and a Binomial-distributed dataset with success probability \theta, both sharing the parameter \theta \in (0,1). The joint likelihood is the product of the individual Poisson likelihood L(\theta \mid \mathbf{k}) = \prod_{i=1}^{n_1} \frac{\theta^{k_i} e^{-\theta}}{k_i!} and Binomial likelihood L(\theta \mid \mathbf{y}) = \prod_{j=1}^{n_2} \theta^{y_j} (1-\theta)^{m_j - y_j}, where \mathbf{k} are the counts and \mathbf{y} are the successes in m_j trials. Maximizing this product requires solving the score equation \frac{\sum k_i + \sum y_j}{\theta} - n_1 - \sum_{j=1}^{n_2} \frac{m_j - y_j}{1 - \theta} = 0, which can be solved directly (it reduces to a quadratic in \theta) or numerically, and integrates information from both sources for more efficient estimation than using either source alone.[50]
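For the Poisson-plus-binomial example, the sketch below (with hypothetical datasets) forms the joint log-likelihood as the sum of the two component log-likelihoods and maximizes it numerically over the shared \theta.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson, binom

# Hypothetical independent datasets sharing theta in (0, 1):
counts = np.array([0, 1, 0, 2, 1])   # Poisson counts with mean theta
successes = np.array([3, 5, 2])      # Binomial successes
trials = np.array([10, 10, 10])      # corresponding numbers of trials

def neg_joint_loglik(theta):
    # The joint log-likelihood is the sum of the two independent log-likelihoods.
    ll = poisson.logpmf(counts, theta).sum() + binom.logpmf(successes, trials, theta).sum()
    return -ll

res = minimize_scalar(neg_joint_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)  # combined estimate, pooling information from both sources
```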
Likelihood Equations
The likelihood equations arise from the optimization problem central to maximum likelihood estimation (MLE), where the goal is to find the parameter value \hat{\theta} that maximizes the likelihood function L(\theta \mid x) or, equivalently, its logarithm \ell(\theta \mid x). To solve this, the first-order condition requires setting the partial derivative of the log-likelihood with respect to the parameter \theta equal to zero: \frac{\partial \ell(\theta \mid x)}{\partial \theta} = 0. This equation, known as the score equation, defines the critical points of the log-likelihood surface, and under suitable regularity conditions, the solution \hat{\theta} corresponds to the maximum likelihood estimate.[51][52]
For independent and identically distributed (i.i.d.) observations x_1, \dots, x_n from a density f(x_i \mid \theta), the log-likelihood is \ell(\theta \mid x) = \sum_{i=1}^n \log f(x_i \mid \theta), and the score equation simplifies to the first-order condition \sum_{i=1}^n \frac{\partial \log f(x_i \mid \theta)}{\partial \theta} = 0. This summation form leverages the additivity of the log-likelihood for i.i.d. data, making it computationally tractable for many parametric models where the individual score contributions \frac{\partial \log f(x_i \mid \theta)}{\partial \theta} can be derived explicitly.[53][10]
To verify that the solution \hat{\theta} is a maximum rather than a minimum or saddle point, second-order conditions are examined using the Hessian matrix H(\theta) = \frac{\partial^2 \ell(\theta \mid x)}{\partial \theta \partial \theta^\top}, which captures the second partial derivatives. For a unique global maximum in regular models, the Hessian must be negative definite at \hat{\theta}, meaning all its eigenvalues are negative, ensuring the log-likelihood is concave in a neighborhood of the estimate. This condition confirms the stability and uniqueness of the MLE when the observed Fisher information -H(\hat{\theta}) is positive definite.[54][55]
A classic example illustrates a case where the score equation has no solution: for the uniform distribution U(0, \theta) with i.i.d. observations x_1, \dots, x_n, the density is f(x_i \mid \theta) = \frac{1}{\theta} for 0 \leq x_i \leq \theta and 0 otherwise. The likelihood is L(\theta \mid x) = \theta^{-n} for \theta \geq \max_i x_i and 0 otherwise, so the log-likelihood is \ell(\theta \mid x) = -n \log \theta on that domain. Differentiating yields the score \frac{\partial \ell}{\partial \theta} = -\frac{n}{\theta}, which is negative for all \theta > 0 and thus never zero; the maximum instead occurs at the boundary of the support constraint, giving \hat{\theta} = \max_i x_i. The positive second derivative \frac{\partial^2 \ell}{\partial \theta^2} = \frac{n}{\theta^2} reflects the convexity of -n \log \theta; since \ell is strictly decreasing on [\max_i x_i, \infty), the boundary value \hat{\theta} = \max_i x_i is the unique maximizer.[56][6]
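The sketch below (with hypothetical Uniform(0, \theta) data) reflects this boundary behavior: the score -n/\theta never vanishes, and the maximum of the log-likelihood is attained at the sample maximum.

```python
import numpy as np

x = np.array([0.9, 2.4, 1.7, 3.1, 0.5])  # hypothetical sample from Uniform(0, theta)
n = len(x)

def loglik(theta):
    # Uniform(0, theta) log-likelihood: -n*log(theta) on theta >= max(x), -inf otherwise.
    return -n * np.log(theta) if theta >= x.max() else -np.inf

def score(theta):
    # Derivative of the log-likelihood on its support: -n/theta, negative for all theta > 0.
    return -n / theta

theta_hat = x.max()                                # the maximizer sits at the boundary of the support
print(theta_hat, loglik(theta_hat))
print([score(t) for t in (theta_hat, 4.0, 10.0)])  # the score never vanishes, so there is no interior root
```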
Exponential Families
Exponential families constitute a broad class of parametric probability distributions that admit a sufficient statistic of fixed dimension, independent of sample size, as established by the Darmois–Koopman–Pitman theorem.[57] The probability density (or mass) function of a distribution in the exponential family takes the canonical form
f(x \mid \theta) = h(x) \exp\left( \eta(\theta)^\top T(x) - A(\theta) \right),
where \eta(\theta) is the natural (or canonical) parameter, T(x) is the sufficient statistic, h(x) is the base measure, and A(\theta) is the cumulant-generating function serving as a normalization constant.[58] This parameterization facilitates analytical tractability in statistical inference, particularly for likelihood-based methods. For an independent and identically distributed sample x_1, \dots, x_n from an exponential family distribution, the log-likelihood function simplifies to
\ell(\theta) = \sum_{i=1}^n \left[ \eta(\theta)^\top T(x_i) - A(\theta) \right] + \sum_{i=1}^n \log h(x_i).
The second term is constant with respect to \theta, so maximization depends only on the first term. The maximum likelihood estimator \hat{\theta} is found by setting the derivative of \eta(\theta)^\top \bar{T} - A(\theta) to zero, where \bar{T} = n^{-1} \sum_{i=1}^n T(x_i) is the sample average of the sufficient statistic.[58] A key property is that \bar{T} is minimal sufficient, reducing the data to this low-dimensional summary for inference; the MLE satisfies \mathbb{E}_\theta[T(X)] = \nabla_\eta A(\theta) = \bar{T} evaluated at \hat{\theta}, equating observed and expected sufficient statistics under the model.[59] The Gamma distribution exemplifies these properties as a two-parameter exponential family with shape \alpha > 0 and rate \beta > 0. Its density is
f(x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} \exp(-\beta x), \quad x > 0,
which reparameterizes in natural form with \eta_1 = \alpha - 1, \eta_2 = -\beta, T(x) = (\log x, x), and appropriate h(x) and A(\theta).[58] The MLE equations are \hat{\beta} = \hat{\alpha} / \bar{x} and \log \hat{\alpha} - \psi(\hat{\alpha}) = \log \bar{x} - n^{-1} \sum_{i=1}^n \log x_i, where \psi(\cdot) is the digamma function \Gamma' / \Gamma; the shape equation typically requires numerical solution via Newton-Raphson iteration.[60]
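A sketch of the Newton-Raphson solution of the shape equation for the Gamma example, using hypothetical data; digamma and trigamma (polygamma of order 1) are taken from scipy.special.

```python
import numpy as np
from scipy.special import digamma, polygamma

x = np.array([1.2, 0.7, 2.5, 1.9, 0.4, 1.1, 3.0])  # hypothetical Gamma-distributed sample
s = np.log(x.mean()) - np.mean(np.log(x))           # log(xbar) - mean(log x)

# Newton-Raphson for the shape equation: log(alpha) - digamma(alpha) = s.
alpha = 0.5 / s                                      # rough starting value
for _ in range(50):
    f = np.log(alpha) - digamma(alpha) - s
    fprime = 1.0 / alpha - polygamma(1, alpha)       # derivative: 1/alpha - trigamma(alpha)
    alpha -= f / fprime

beta = alpha / x.mean()                              # rate MLE from beta_hat = alpha_hat / xbar
print(alpha, beta)
```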