
Likelihood function

The likelihood function, often simply referred to as the likelihood, is a fundamental concept in statistics that measures the plausibility of different parameter values given a fixed set of observed data. Formally, for a sample of independent observations x = (x_1, \dots, x_n) drawn from a distribution with density or mass function f(x_i \mid \theta), where \theta denotes the parameters, the likelihood function is defined as L(\theta \mid x) = \prod_{i=1}^n f(x_i \mid \theta), treating the data as fixed and varying \theta to assess relative support from the evidence. Unlike a probability distribution over the data, the likelihood does not integrate or sum to 1 with respect to \theta, and constant factors independent of the parameters can often be omitted without altering inferences.

The likelihood function was introduced by the British statistician Ronald A. Fisher as part of his development of modern statistical methods in the early 20th century. Fisher first outlined a numerical procedure akin to maximum likelihood in 1912, but formally presented the concept in his seminal 1922 paper, "On the Mathematical Foundations of Theoretical Statistics," where he framed estimation problems in terms of maximizing the likelihood to obtain efficient and consistent parameter estimates. This work distinguished the likelihood from Bayesian approaches by avoiding prior distributions on parameters, emphasizing instead the data's evidential content about \theta.

In practice, the likelihood function underpins maximum likelihood estimation (MLE), a cornerstone method for parameter estimation across diverse fields including economics, physics, biology, and machine learning. MLE seeks the value \hat{\theta} that maximizes L(\theta \mid x), or equivalently the log-likelihood \ell(\theta \mid x) = \log L(\theta \mid x), by solving the score equation \frac{\partial \ell(\theta \mid x)}{\partial \theta} = 0; under standard regularity conditions, such estimators are consistent, asymptotically normal, and efficient, with their variability approximated by the inverse of the Fisher information I(\theta) = - \mathbb{E}\left[ \frac{\partial^2 \ell(\theta \mid x)}{\partial \theta^2} \right]. Beyond estimation, the likelihood enables likelihood ratio tests for comparing models, construction of confidence intervals, and Bayesian posterior inference when combined with priors. For instance, in analyzing data from n binomial trials with k successes, the likelihood L(p \mid k) \propto p^k (1-p)^{n-k} yields the MLE \hat{p} = k/n, illustrating its simplicity and utility in probabilistic modeling.
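
As a minimal illustration of the binomial example just mentioned, the following Python sketch (with assumed values n = 10 and k = 7, not taken from the source) evaluates the likelihood on a grid and confirms that the numerical maximizer agrees with the analytic MLE k/n.

```python
# Sketch (illustrative only): binomial likelihood and its maximizer.
import numpy as np

n, k = 10, 7  # hypothetical observed data

def likelihood(p):
    """Binomial likelihood up to the constant binom(n, k)."""
    return p**k * (1 - p)**(n - k)

# Evaluate on a grid and locate the maximum numerically.
grid = np.linspace(0.001, 0.999, 999)
p_hat_numeric = grid[np.argmax(likelihood(grid))]
print(p_hat_numeric, k / n)  # both close to the analytic MLE k/n = 0.7
```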

Definition

Discrete Case

In the discrete case, the likelihood function for a parameter vector \theta given an observed value x of a discrete random variable X is defined as L(\theta \mid x) = p(x \mid \theta), where p(\cdot \mid \theta) denotes the probability mass function of X parameterized by \theta. Unlike the probability mass function, which treats the parameters as fixed and evaluates the probability over varying possible data values, the likelihood function regards the observed data as fixed and examines how the probability of that data changes as a function of the varying parameters. For instance, in a single Bernoulli trial with success probability \theta and observed success x = 1, the likelihood simplifies to L(\theta \mid 1) = \theta; this increases monotonically with \theta over [0, 1], indicating that larger \theta values render the observed outcome more probable under the model. When multiple independent observations x_1, \dots, x_n are available from the same distribution, the joint likelihood is the product of the individual probabilities: L(\theta \mid x_1, \dots, x_n) = \prod_{i=1}^n p(x_i \mid \theta). This multiplicative form arises because the observations are assumed independent given \theta.
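
The product form above can be checked directly; the sketch below uses a small made-up Bernoulli sample and compares the likelihood at a few candidate values of \theta, with the sample mean giving the largest value.

```python
# Sketch, assuming made-up Bernoulli observations, of the product-form likelihood.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])  # hypothetical 0/1 observations

def bernoulli_likelihood(theta, x):
    """L(theta | x) = prod theta^{x_i} (1 - theta)^{1 - x_i}."""
    return np.prod(theta**x * (1 - theta)**(1 - x))

for theta in (0.3, 0.5, np.mean(x)):
    print(theta, bernoulli_likelihood(theta, x))
# The sample mean (here 4/6) gives the largest value, as expected for the MLE.
```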

Continuous Case

In the continuous case, the likelihood function for a single observation from a continuous random variable X with probability density function f(x \mid \theta) is defined as L(\theta \mid x) = f(x \mid \theta), where the observed value x is fixed and \theta is the varying argument. This setup treats the likelihood as a function of the parameters given the data, rather than as a probability over possible outcomes. Unlike a probability density function, which integrates to 1 over the data space for fixed parameters, the likelihood L(\theta \mid x) does not generally satisfy \int L(\theta \mid x) \, d\theta = 1 when integrated over the parameter space. This distinction emphasizes that the likelihood evaluates the relative plausibility of parameters for the observed data, without normalizing as a density over \theta.

A standard example arises with a single observation x from a normal distribution parameterized by mean \mu and variance \sigma^2. The probability density function is f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), so the likelihood becomes L(\mu, \sigma^2 \mid x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). This form highlights the dependence on \mu and \sigma^2, peaking where the parameters align closely with the observed x. For n independent observations x_1, \dots, x_n from the same continuous distribution, the likelihood is the product of the individual densities: L(\theta \mid x_1, \dots, x_n) = \prod_{i=1}^n f(x_i \mid \theta). This joint likelihood becomes increasingly concentrated as the sample size grows, amplifying the influence of the data on parameter inference.
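
A short Python sketch, using assumed observations, evaluates this product-of-densities likelihood for a normal model and shows that parameter values near the sample mean and standard deviation receive higher likelihood.

```python
# Sketch, with assumed data, of a normal likelihood viewed as a function of (mu, sigma).
import numpy as np
from scipy.stats import norm

x = np.array([1.2, 0.7, 1.9, 1.1])  # hypothetical observations

def normal_likelihood(mu, sigma, x):
    """Product of N(mu, sigma^2) densities evaluated at the fixed data x."""
    return np.prod(norm.pdf(x, loc=mu, scale=sigma))

# The likelihood peaks near the sample mean and the (biased) sample SD.
print(normal_likelihood(np.mean(x), np.std(x), x))
print(normal_likelihood(0.0, 1.0, x))  # smaller: these parameters fit the data worse
```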

General Formulation

In measure-theoretic probability theory, the likelihood function provides a general framework for statistical inference across arbitrary probability spaces. Consider a parametric family of probability measures \{P_\theta : \theta \in \Theta\} on a measurable sample space (\mathcal{X}, \mathcal{A}), where \Theta is the parameter space, typically a subset of \mathbb{R}^k. For observed data x \in \mathcal{X}, the likelihood function L(\theta \mid x) is defined as the Radon-Nikodym derivative of P_\theta with respect to a dominating \sigma-finite measure \mu on (\mathcal{X}, \mathcal{A}), provided that each P_\theta is absolutely continuous with respect to \mu: L(\theta \mid x) = \frac{dP_\theta}{d\mu}(x). This formulation unifies the discrete and continuous cases, where \mu may be a counting measure or Lebesgue measure, respectively, and extends to more complex spaces under suitable conditions. The parameter space \Theta specifies the allowable values of \theta, and the likelihood is defined for each \theta \in \Theta, treating the data x as fixed while varying \theta. This setup ensures that L(\theta \mid x) quantifies how well the model parameterized by \theta explains the observed x, without reference to any prior distribution over \Theta.

For asymptotic properties, such as the consistency and asymptotic normality of maximum likelihood estimators, certain regularity conditions are required. These include: the true parameter lying in the interior of \Theta; the log-likelihood \ell(\theta \mid x) = \log L(\theta \mid x) being thrice continuously differentiable with respect to \theta; the ability to interchange differentiation and integration (e.g., second derivatives passing under the integral); and the Fisher information matrix I(\theta) = -\mathbb{E}\left[ \frac{\partial^2 \ell(\theta \mid x)}{\partial \theta \partial \theta^T} \right] being positive definite for all \theta \in \Theta. These assumptions ensure that the average log-likelihood converges uniformly to its expectation as the sample size increases.

Unlike a probability distribution over the parameters, the likelihood function is not normalized as a density with respect to \theta, meaning \int_\Theta L(\theta \mid x) \, d\theta \neq 1 in general. This is because L(\theta \mid x) is proportional to the density of the data under P_\theta but serves as a measure of relative support for different \theta values given fixed x, without integrating to 1 over the parameter space; normalization is not required for inference based on likelihood ratios or maxima.

Mixed Distributions

In mixed distributions, the observed data consist of both discrete and continuous components, denoted as x = (x_d, x_c), where x_d follows a probability mass function p_d(\cdot \mid \theta) and x_c follows a continuous density f_c(\cdot \mid \theta). The likelihood function is constructed as the product L(\theta \mid x) = p_d(x_d \mid \theta) \cdot f_c(x_c \mid \theta), reflecting the independence between the components under the model parameterization \theta. A key challenge in formulating the likelihood for mixed distributions arises from the absence of a single dominating measure that unifies the discrete and continuous parts, as discrete components lack a density with respect to Lebesgue measure while continuous parts do not admit point masses. To address this, the overall probability distribution can be expressed using a generalized density that represents the discrete components with Dirac delta functions placed at their support points, allowing the likelihood to be treated as a Radon-Nikodym derivative with respect to a hybrid measure combining counting and Lebesgue components.

An illustrative example is the Poisson process, where the number of events N(t) = n in a fixed interval [0, t] is discrete (following a Poisson distribution with mean \lambda t), and the arrival times S_1 < S_2 < \cdots < S_n are continuous (the interarrival times being exponentially distributed with rate \lambda). The likelihood incorporating both is L(\lambda \mid n, s_1, \dots, s_n) = \lambda^n e^{-\lambda t}, which multiplies the Poisson probability mass for the count with the conditional density of the ordered arrival times, enabling parameter estimation that accounts for both the observed number of events and their timing. This construction has important implications for statistical inference in models involving mixed data, such as survival analysis with right-censoring, where event times may be continuous but censoring indicators are discrete (observed or censored). The product-form likelihood uses the density for uncensored times and the survival function (the integral of the density beyond the censoring time) for censored observations, facilitating maximum likelihood estimation while handling the hybrid nature of the data without biasing toward either component.
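
The censored-data construction described above can be sketched for an assumed exponential survival model, with made-up event times and censoring indicators; events contribute the density and censored observations contribute the survival function.

```python
# Sketch, under an assumed exponential model, of a right-censored likelihood:
# density contributions for observed events, survival-function contributions for
# censored times (values and indicators below are made up).
import numpy as np

times = np.array([2.0, 3.5, 1.2, 4.0])
event = np.array([1, 0, 1, 1])  # 1 = event observed, 0 = right-censored

def censored_loglik(lam, times, event):
    """Exponential log-likelihood: log f for events, log S for censored times."""
    log_f = np.log(lam) - lam * times   # log density
    log_S = -lam * times                # log survival function
    return np.sum(event * log_f + (1 - event) * log_S)

# Closed-form MLE for this model: number of events / total exposure time.
lam_hat = event.sum() / times.sum()
print(lam_hat, censored_loglik(lam_hat, times, event))
```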

Key Properties

Likelihood Ratio

The likelihood ratio for two specific parameter values \theta_0 and \theta_1 given observed data x is defined as \Lambda(\theta_0, \theta_1 \mid x) = \frac{L(\theta_0 \mid x)}{L(\theta_1 \mid x)}, where L(\theta \mid x) denotes the likelihood function. This ratio quantifies the relative support provided by the data for one parameter value over the other; specifically, if \Lambda > 1, the data favor \theta_0 as more plausible than \theta_1, while \Lambda < 1 indicates the opposite. The likelihood ratio serves as a foundational tool in statistical inference for comparing competing parameter values or models by directly measuring how much more (or less) likely the data are under one specification compared to another.

In hypothesis testing, the likelihood ratio is extended to the generalized form for comparing a null hypothesis H_0: \theta = \theta_0 against an alternative H_1: \theta \neq \theta_0, where \Lambda = \frac{L(\theta_0 \mid x)}{\sup_{\theta} L(\theta \mid x)}. For instance, consider a binomial model with n trials and observed successes k, testing H_0: p = p_0 versus H_1: p \neq p_0; here, \Lambda = \frac{\binom{n}{k} p_0^k (1 - p_0)^{n-k}}{\sup_p \binom{n}{k} p^k (1 - p)^{n-k}} = \frac{p_0^k (1 - p_0)^{n-k}}{\hat{p}^k (1 - \hat{p})^{n-k}}, with \hat{p} = k/n as the maximum likelihood estimate. Small values of \Lambda (typically below a threshold determined by the desired significance level) lead to rejection of H_0, indicating that the restricted model under H_0 fits the data substantially worse than the unrestricted alternative. Under the null hypothesis and for large sample sizes, the test statistic -2 \log \Lambda follows an asymptotic \chi^2 distribution with degrees of freedom equal to the difference in the dimensionality of the parameter spaces under the alternative and null hypotheses, as established by Wilks' theorem. This approximation enables the computation of p-values and critical regions for the likelihood ratio test in composite hypothesis settings, facilitating inference even when exact distributions are intractable.
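
The binomial likelihood ratio test above can be carried out numerically; this sketch assumes n = 30 trials, k = 21 successes, and null value p_0 = 0.5 (illustrative values only), and compares -2 \log \Lambda with the \chi^2_1 distribution per Wilks' theorem.

```python
# Sketch of the binomial likelihood ratio test with assumed data.
import numpy as np
from scipy.stats import chi2

n, k, p0 = 30, 21, 0.5
p_hat = k / n

def loglik(p):
    return k * np.log(p) + (n - k) * np.log(1 - p)  # binomial coefficient cancels in the ratio

lr_stat = -2 * (loglik(p0) - loglik(p_hat))      # -2 log Lambda
p_value = chi2.sf(lr_stat, df=1)                 # Wilks: chi-square with 1 df
print(lr_stat, p_value)
```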

Relative Likelihood

The relative likelihood of a parameter value \theta given observed data x is defined as R(\theta \mid x) = \frac{L(\theta \mid x)}{L(\hat{\theta} \mid x)}, where \hat{\theta} = \arg\max_{\theta} L(\theta \mid x) is the maximum likelihood estimate (MLE) and L(\theta \mid x) is the likelihood function. This ratio, which ranges from 0 to 1, quantifies the relative support for \theta compared to the MLE, normalizing the likelihood to its peak value and facilitating comparisons across the parameter space without regard to the absolute scale of the likelihood. Likelihood regions, defined as the sets \{\theta : R(\theta \mid x) \geq c\} for 0 < c < 1, delineate regions of the parameter space where the relative likelihood exceeds a specified threshold c, offering a direct measure of evidential support from the data. Asymptotically, these regions approximate confidence regions under mild regularity conditions, with the choice of c calibrated to match coverage probabilities; for instance, in one dimension, c \approx 0.15 (corresponding to \exp(-1.92), where 1.92 is half the 95% \chi^2_1 quantile of 3.84) yields an approximate 95% confidence region. This approximation arises from the large-sample distribution of -2 \log R(\theta \mid x) \approx \chi^2_p, where p is the number of parameters of interest.

Graphically, contours of constant relative likelihood in the parameter space visualize the shape and extent of evidential support, often forming approximately elliptical boundaries centered at the MLE that reflect the curvature of the likelihood surface. These contours aid in exploring parameter uncertainty and model diagnostics, particularly in multiparameter settings where the relative likelihood highlights trade-offs between parameters. As an illustrative example, consider estimating the mean \mu of a normal distribution N(\mu, \sigma^2) with known variance \sigma^2 > 0, based on n independent observations x = (x_1, \dots, x_n). The relative likelihood simplifies to R(\mu \mid x) = \exp\left( -\frac{n (\mu - \bar{x})^2}{2 \sigma^2} \right), where \bar{x} = n^{-1} \sum_{i=1}^n x_i is the sample mean, achieving its maximum of 1 at \mu = \bar{x} and declining parabolically on the log scale away from this point, with the rate of decline governed by the sample size n and the precision 1/\sigma^2. For c = 0.15, the corresponding likelihood region is approximately \mu \in [\bar{x} - 1.96 \sigma / \sqrt{n}, \bar{x} + 1.96 \sigma / \sqrt{n}], aligning with the standard 95% confidence interval.
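
The normal-mean example admits a direct check; the sketch below, with made-up data and an assumed known \sigma = 1, computes the c = 0.15 likelihood region on a grid and compares it with the standard 95% confidence interval.

```python
# Sketch of the relative likelihood R(mu | x) for a normal mean with known variance.
import numpy as np

x = np.array([4.9, 5.3, 5.1, 4.7, 5.6])   # hypothetical observations
sigma = 1.0                                # assumed known
n, xbar = len(x), x.mean()

def rel_lik(mu):
    return np.exp(-n * (mu - xbar)**2 / (2 * sigma**2))

grid = np.linspace(xbar - 3, xbar + 3, 2001)
region = grid[rel_lik(grid) >= 0.15]
print(region.min(), region.max())                       # c = 0.15 likelihood region
print(xbar - 1.96 * sigma / np.sqrt(n),
      xbar + 1.96 * sigma / np.sqrt(n))                 # closely matches the 95% interval
```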

Log-Likelihood Function

The log-likelihood function, denoted \ell(\theta \mid x), is the natural logarithm of the likelihood L(\theta \mid x), providing a transformed measure of how well a model with parameters \theta explains the observed data x. For a sample of n independent observations, it takes the explicit form \ell(\theta \mid x) = \sum_{i=1}^n \log f(x_i \mid \theta), where f(x_i \mid \theta) is the probability mass or density function of each observation. A key property of the log-likelihood is that the logarithm is a strictly increasing monotonic transformation, ensuring that the value maximizing \ell(\theta \mid x) is identical to the one maximizing L(\theta \mid x); thus, \arg\max_\theta \ell(\theta \mid x) = \arg\max_\theta L(\theta \mid x). The transformation also converts the product of individual densities in the likelihood into a sum, facilitating easier numerical evaluation, optimization, and differentiation in computational procedures.

Graphically, the log-likelihood function often displays a unimodal shape, rising to a peak at the maximum likelihood estimate (MLE) and descending thereafter, with the peak becoming sharper as the sample size grows and the evidence concentrates around the true parameter. The second derivative captures this curvature: the observed information is given by I(\theta) = -\frac{\partial^2 \ell}{\partial \theta^2}, evaluated at the MLE, which measures the estimate's precision and forms the basis for asymptotic variance approximations. For illustration, consider n independent observations from an exponential distribution with rate parameter \lambda > 0, where the density is f(x \mid \lambda) = \lambda e^{-\lambda x} for x \geq 0. The log-likelihood simplifies to \ell(\lambda \mid x) = n \log \lambda - \lambda \sum_{i=1}^n x_i, which attains its maximum at the MLE \hat{\lambda} = n / \sum_{i=1}^n x_i, the reciprocal of the sample mean.
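
The exponential example can be sketched in a few lines of Python with made-up data, evaluating the log-likelihood, its analytic maximizer, and the observed information at the MLE.

```python
# Sketch of the exponential-distribution log-likelihood and its analytic MLE.
import numpy as np

x = np.array([0.8, 1.5, 0.3, 2.2, 1.1])   # hypothetical observations
n, s = len(x), x.sum()

def loglik(lam):
    return n * np.log(lam) - lam * s

lam_hat = n / s                      # reciprocal of the sample mean
# Observed information at the MLE: -d^2 l / d lam^2 = n / lam^2.
obs_info = n / lam_hat**2
print(lam_hat, loglik(lam_hat), obs_info)
```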

Nuisance Parameter Elimination

Profile Likelihood

In statistical models where the parameter vector is partitioned as \theta = (\psi, \lambda), with \psi denoting the parameter of interest and \lambda the nuisance parameter(s), the profile log-likelihood for \psi is defined as \ell_p(\psi \mid x) = \max_\lambda \ell(\psi, \lambda \mid x), where \ell(\cdot \mid x) is the log-likelihood function given observed data x. This approach concentrates the full likelihood by optimizing out the nuisance parameters, yielding a reduced likelihood function that depends solely on \psi. The profile log-likelihood is constructed by, for each fixed value of \psi, computing the conditional maximum likelihood estimator of the nuisance parameters, \hat{\lambda}(\psi) = \arg\max_\lambda \ell(\psi, \lambda \mid x), and substituting it into the original log-likelihood: \ell_p(\psi \mid x) = \ell(\psi, \hat{\lambda}(\psi) \mid x). This maximization step typically involves solving score equations or using numerical optimization methods, assuming the likelihood is sufficiently smooth and the maximum exists.

Under regularity conditions, the profile likelihood possesses desirable asymptotic properties for inference on \psi, including equivalence to the full likelihood in large samples; specifically, the profile maximum likelihood estimator \hat{\psi}_p is asymptotically normal and efficient, matching the properties of the unrestricted maximum likelihood estimator \hat{\psi}. It is particularly useful for constructing confidence intervals for \psi, where the statistic -2 \log \left[ L_p(\psi)/L_p(\hat{\psi}_p) \right] (with L_p the profile likelihood) asymptotically follows a \chi^2 distribution with one degree of freedom under the hypothesis that \psi is the true value, enabling likelihood ratio-based intervals that account for the nuisance parameters.

A representative example arises in simple linear regression, y_i = \beta_0 + \beta_1 x_i + \epsilon_i with \epsilon_i \sim N(0, \sigma^2), where the focus is on the slope \beta_1 = \psi and the intercept \beta_0 and variance \sigma^2 are treated as the nuisance parameters \lambda = (\beta_0, \sigma^2). For fixed \beta_1, the profile likelihood maximizes the normal log-likelihood over \beta_0 (yielding the least-squares fit conditional on \beta_1) and \sigma^2 (set to the conditional mean squared residual), resulting in \ell_p(\beta_1 \mid y) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\hat{\sigma}^2(\beta_1)) - \frac{n}{2}, where \hat{\sigma}^2(\beta_1) is the profiled variance estimate; this facilitates direct inference on \beta_1 via the resulting one-dimensional profile curve.
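
The regression example above admits a closed-form profile: for each fixed slope, the intercept and variance are concentrated out analytically. The sketch below uses simulated data; the true values are assumptions made purely for illustration.

```python
# Sketch of the profile log-likelihood for the slope in simple linear regression,
# with simulated data; beta0 and sigma^2 are profiled out in closed form.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)   # assumed true model
n = len(y)

def profile_loglik(beta1):
    beta0_hat = np.mean(y - beta1 * x)               # conditional least-squares intercept
    resid = y - beta0_hat - beta1 * x
    sigma2_hat = np.mean(resid**2)                   # profiled variance estimate
    return -0.5 * n * (np.log(2 * np.pi) + np.log(sigma2_hat) + 1)

grid = np.linspace(1.5, 2.5, 1001)
beta1_hat = grid[np.argmax([profile_loglik(b) for b in grid])]
print(beta1_hat)    # close to the least-squares slope
```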

Conditional Likelihood

The conditional likelihood is a method for focusing on a parameter of interest, ψ, in the presence of nuisance parameters, λ, by conditioning the full likelihood on a suitable sufficient statistic. Formally, it is obtained from the factorization f(x \mid \psi, \lambda) = f(x \mid T(x), \psi) \, f_T(T(x) \mid \psi, \lambda), where T(x) is a statistic that is sufficient for λ when ψ is held fixed, so that the conditional density of the data given T(x) is free of λ; the conditional likelihood is then L_c(\psi \mid x) = f(x \mid T(x), \psi), a function of ψ alone that supports estimation and testing unaffected by the unknown nuisance parameters. This approach leverages the factorization theorem for sufficiency to partition the data's information, isolating the component relevant to ψ while removing the influence of λ through conditioning rather than integration or maximization.

The method is applicable in models where a complete sufficient statistic T(x) for the nuisance parameter λ exists for each fixed ψ, a condition often satisfied in multiparameter exponential families due to their structured sufficient statistics. In such cases, the conditional distribution forms a new exponential family parameterized solely by ψ, enabling exact inference without approximation. Unlike the profile likelihood, which maximizes the joint likelihood over λ to approximate inference on ψ, the conditional likelihood achieves exact elimination of the nuisance parameter when the conditioning statistic is available, avoiding potential biases from maximization. Key properties include its ability to reduce bias in small-sample inference compared to profile-based methods, as it preserves the conditional information and avoids overstating precision due to nuisance parameter variability. In exponential families, the conditional likelihood yields asymptotically efficient estimators that are consistent and normally distributed, with approximations to the conditional likelihood available whose relative error is of order O(n^{-3/2}) under i.i.d. sampling, and it supports accurate tail probability approximations for p-values and confidence intervals. These properties make it particularly valuable for precise testing and estimation in structured models.

A representative example arises in comparing two independent Poisson counts, X ~ Pois(ψλ) and Y ~ Pois(λ), where ψ is the relative rate of interest and λ is the common intensity nuisance parameter. The total T = X + Y serves as the conditioning statistic; the conditional distribution is X | T = t ~ Binomial(t, ψ / (1 + ψ)), yielding a conditional likelihood L_c(ψ | x) that depends only on ψ and is free of λ, thus eliminating the nuisance parameter and enabling exact tests of ψ = 1.
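
The two-Poisson example can be sketched directly: conditioning on the total reduces the problem to a binomial likelihood in ψ alone. The counts below are hypothetical.

```python
# Sketch of the two-Poisson example: conditioning on the total T = X + Y
# gives a binomial conditional likelihood in psi alone.
import numpy as np
from scipy.stats import binom

x_obs, y_obs = 14, 9               # hypothetical Poisson counts
t = x_obs + y_obs

def cond_loglik(psi):
    """log L_c(psi) = log Binomial(x_obs; t, psi / (1 + psi)), free of lambda."""
    return binom.logpmf(x_obs, t, psi / (1 + psi))

grid = np.linspace(0.1, 5.0, 2000)
psi_hat = grid[np.argmax(cond_loglik(grid))]
print(psi_hat, x_obs / y_obs)      # conditional MLE equals the ratio x/y
```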

Marginal Likelihood

The marginal likelihood addresses the presence of nuisance parameters λ in a statistical model by integrating the full likelihood over these parameters, yielding a function that depends only on the parameter of interest ψ. Formally, for observed data x, the marginal likelihood is defined as L_m(\psi \mid x) = \int L(\psi, \lambda \mid x) \, \pi(\lambda \mid \psi) \, d\lambda, where π(λ | ψ) is a weighting density for the nuisance parameters, often chosen as a prior or noninformative weight to reflect limited prior knowledge. This approach produces a function of ψ alone, analogous to a likelihood in a reduced model, and is particularly valuable for inference focused on ψ without conditioning on a statistic or maximizing over λ. While central to Bayesian marginalization, where it forms the normalizing constant (or evidence) for the posterior on ψ, the marginal likelihood can also be employed in frequentist settings by selecting appropriate weightings, such as improper uniform weights, to obtain valid inferential procedures with desirable frequency properties. It effectively incorporates uncertainty about λ into the assessment of ψ, leading to broader confidence regions compared to methods that fix or profile out λ. However, exact evaluation is often challenging due to the high-dimensional integrals involved, especially in complex models.

To overcome these computational hurdles, approximations such as the Laplace method provide asymptotic expansions around the mode of the integrand, yielding accurate estimates for large samples. For model selection purposes, the Bayesian information criterion (BIC) serves as a practical surrogate, approximating the negative log marginal likelihood as -2 log L_m(ψ | x) ≈ -2 log L(\hat{ψ}, \hat{λ} | x) + k log n, where k is the number of parameters and n the sample size, facilitating comparisons across models without full integration. A classic illustration occurs in the normal linear model where data x ~ N(ψ, λ I_n), with unknown mean ψ and variance λ > 0. Integrating the likelihood over λ using an inverse-gamma or flat prior weighting results in a marginal likelihood for ψ proportional to the density of a Student's t-distribution with n-1 degrees of freedom, centered at the sample mean and scaled by the sample standard deviation; this form underpins t-based inference for ψ while accounting for variance uncertainty.
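
As a rough numerical illustration of marginalization (not a procedure prescribed by the text), the sketch below integrates a normal likelihood over the standard deviation with an assumed 1/σ weight, producing a marginal likelihood for the mean that peaks at the sample mean.

```python
# Sketch, with assumed data and an assumed 1/sigma weight, of a marginal
# likelihood for the normal mean obtained by numerically integrating out the
# scale; its shape tracks a Student-t centered at the sample mean.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

x = np.array([2.1, 1.7, 2.9, 2.4, 1.9, 2.6])   # hypothetical observations

def marginal_lik(psi):
    # Integrate the likelihood over sigma > 0 with weight 1/sigma.
    integrand = lambda sigma: np.prod(norm.pdf(x, psi, sigma)) / sigma
    val, _ = quad(integrand, 1e-3, 20.0)
    return val

grid = np.linspace(1.0, 4.0, 301)
vals = np.array([marginal_lik(p) for p in grid])
print(grid[np.argmax(vals)], x.mean())   # maximized at (approximately) the sample mean
```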

Partial Likelihood

The partial likelihood is a method developed for inference in semi-parametric models where the full probability density is not fully specified, particularly useful in survival analysis to focus on the effects of covariates while treating the baseline hazard as a nuisance. Introduced by David Cox, it constructs a likelihood based on conditional probabilities of event occurrences given the risk set at each event time, avoiding the need to specify or estimate the entire baseline hazard function. In the Cox proportional hazards model, the partial likelihood for the regression coefficients \beta given the observed data \mathbf{x} (including event times and covariates) is given by L_p(\beta \mid \mathbf{x}) = \prod_{i: \delta_i=1} \frac{\exp(\beta^\top z_i)}{\sum_{j \in R(t_i)} \exp(\beta^\top z_j)}, where the product is over the event times t_i with indicator \delta_i = 1 for observed failures, z_i is the covariate vector for the individual experiencing the event, and R(t_i) denotes the set of individuals still at risk just prior to t_i. This formulation arises from the assumption that the hazard function factors into a baseline hazard \lambda_0(t) and a relative hazard \exp(\beta^\top z), with the partial likelihood capturing only the relative contributions of the covariates at each event, thereby eliminating the unspecified baseline hazard without requiring integration over nuisance parameters or explicit conditioning on ancillary statistics.

The partial likelihood is applicable in settings where interest lies primarily in the relative hazards influenced by covariates, such as time-to-event analyses with right-censoring, as it provides a robust way to eliminate infinite-dimensional nuisance components like the baseline hazard directly through the conditional structure, distinct from methods involving integration or maximization over nuisances. Key properties include its asymptotic efficiency: under suitable regularity conditions, the maximum partial likelihood estimator \hat{\beta}, obtained by solving the score equations from the partial log-likelihood \ell_p(\beta) = \log L_p(\beta \mid \mathbf{x}), is consistent, asymptotically normal, and fully efficient relative to the full likelihood, attaining the semiparametric efficiency bound for \beta as the sample size increases.

A canonical example is the Cox proportional hazards model for analyzing time-to-event data, where the partial likelihood concentrates solely on the covariate effects \beta, treating the baseline hazard as unspecified and thus avoiding parametric assumptions about the distribution of survival times. For instance, in a study of patient survival with covariates like age and treatment, the partial likelihood takes a product over observed failure times of the probability that the failing individual is the one to fail among those at risk, given their relative hazards, enabling estimation of \beta without modeling the underlying time distribution. This approach has become foundational in survival analysis due to its flexibility and theoretical guarantees.
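
A minimal sketch of the Cox partial log-likelihood, assuming made-up event times, censoring indicators, and a single covariate (and no tied event times), maximizes the one-dimensional partial likelihood over a grid.

```python
# Sketch of the Cox partial log-likelihood for a single covariate (no ties assumed).
import numpy as np

time  = np.array([5.0, 8.0, 3.0, 10.0, 6.0])   # hypothetical follow-up times
event = np.array([1,   1,   0,   1,    0])     # 1 = failure observed, 0 = censored
z     = np.array([0.5, -1.0, 0.2, 1.5, -0.3])  # covariate values

def partial_loglik(beta):
    ll = 0.0
    for i in np.where(event == 1)[0]:
        risk_set = time >= time[i]               # individuals at risk just before t_i
        ll += beta * z[i] - np.log(np.sum(np.exp(beta * z[risk_set])))
    return ll

grid = np.linspace(-3, 3, 1201)
beta_hat = grid[np.argmax([partial_loglik(b) for b in grid])]
print(beta_hat)   # maximum partial likelihood estimate on the grid
```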

Advanced Topics

Products of Likelihoods

In statistical inference, when data arise from independent experiments or sources, the joint likelihood function is formed by multiplying the individual likelihood functions. Specifically, for two independent datasets x and y governed by a common parameter \theta, the combined likelihood is L(\theta \mid x, y) = L(\theta \mid x) \cdot L(\theta \mid y), leveraging the property that the joint probability density or mass function factors under independence. This multiplicative principle extends to any number of components, providing a foundational rule for aggregating probabilistic evidence across separate observations or models.

This approach finds applications in combining evidence from multiple independent studies, such as in meta-analyses where likelihoods from disparate trials are multiplied to yield a unified inference on shared parameters. It also supports modular model building, where separate likelihood components are constructed for distinct aspects of the data, potentially involving different parameters, and then combined to form a comprehensive model, facilitating scalable and interpretable analyses in complex systems. However, the validity of multiplying likelihoods hinges on the independence of the data sources; violations, such as unaccounted correlations, can distort the joint inference. Additionally, if parameters overlap across components without appropriate modeling, the product may lead to non-identifiability, where multiple parameter values yield equivalent likelihoods, complicating estimation.

A concrete example involves combining a Poisson-distributed sample with mean \theta and a binomial-distributed sample with success probability \theta, both sharing the parameter \theta \in (0,1). The combined likelihood is the product of the individual Poisson likelihood L(\theta \mid \mathbf{k}) = \prod_{i=1}^{n_1} \frac{\theta^{k_i} e^{-\theta}}{k_i!} and binomial likelihood L(\theta \mid \mathbf{y}) = \prod_{j=1}^{n_2} \theta^{y_j} (1-\theta)^{m_j - y_j}, where \mathbf{k} are the observed counts and \mathbf{y} are the numbers of successes in m_j trials (binomial coefficients omitted as constants). Maximizing this product requires solving the score equation \frac{\sum k_i + \sum y_j}{\theta} - n_1 - \sum_{j=1}^{n_2} \frac{m_j - y_j}{1 - \theta} = 0, which generally lacks a closed-form solution but integrates information from both sources for more efficient estimation than using either alone.
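
The Poisson-binomial combination above can be maximized numerically; the sketch below uses hypothetical counts and a bounded scalar optimizer, since the score equation has no closed-form root.

```python
# Sketch of the combined Poisson/binomial likelihood with made-up counts; the
# shared theta is found by maximizing the summed log-likelihoods numerically.
import numpy as np
from scipy.optimize import minimize_scalar

k = np.array([0, 1, 0, 2, 1])          # hypothetical Poisson counts, mean theta
y = np.array([3, 5, 2])                # successes in m trials, probability theta
m = np.array([6, 8, 4])

def neg_joint_loglik(theta):
    pois = np.sum(k * np.log(theta) - theta)                      # constants dropped
    binom = np.sum(y * np.log(theta) + (m - y) * np.log(1 - theta))
    return -(pois + binom)

res = minimize_scalar(neg_joint_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)   # pooled estimate of theta using both sources
```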

Likelihood Equations

The likelihood equations arise from the optimization problem central to maximum likelihood estimation (MLE), where the goal is to find the parameter value \hat{\theta} that maximizes the likelihood function L(\theta \mid x) or, equivalently, its logarithm \ell(\theta \mid x). To solve this, the first-order condition requires setting the derivative of the log-likelihood with respect to the parameter \theta equal to zero: \frac{\partial \ell(\theta \mid x)}{\partial \theta} = 0. This equation, known as the score equation, defines the critical points of the log-likelihood surface, and under suitable regularity conditions, the solution \hat{\theta} corresponds to the maximum likelihood estimate. For independent and identically distributed (i.i.d.) observations x_1, \dots, x_n from a density or mass function f(x_i \mid \theta), the log-likelihood is \ell(\theta \mid x) = \sum_{i=1}^n \log f(x_i \mid \theta), and the score equation simplifies to \sum_{i=1}^n \frac{\partial \log f(x_i \mid \theta)}{\partial \theta} = 0. This form leverages the additivity of the log-likelihood for i.i.d. data, making it computationally tractable for many models where the individual score contributions \frac{\partial \log f(x_i \mid \theta)}{\partial \theta} can be derived explicitly.

To verify that the solution \hat{\theta} is a maximum rather than a minimum or saddle point, second-order conditions are examined using the Hessian matrix H(\theta) = \frac{\partial^2 \ell(\theta \mid x)}{\partial \theta \partial \theta^\top}, which collects the second partial derivatives. For a local maximum in regular models, the Hessian must be negative definite at \hat{\theta}, meaning all its eigenvalues are negative, ensuring the log-likelihood is concave in a neighborhood of the estimate. This condition confirms the stability and uniqueness of the MLE when the observed information -H(\hat{\theta}) is positive definite.

A classic example illustrates the limits of these equations for the uniform distribution U(0, \theta) with i.i.d. observations x_1, \dots, x_n, where the density is f(x_i \mid \theta) = \frac{1}{\theta} for 0 \leq x_i \leq \theta and 0 otherwise. The likelihood is L(\theta \mid x) = \theta^{-n} for \theta \geq \max_i x_i and 0 otherwise, so the log-likelihood is \ell(\theta \mid x) = -n \log \theta in that domain. Differentiating yields the score \frac{\partial \ell}{\partial \theta} = -\frac{n}{\theta}, which is negative for \theta > 0 and thus never zero; instead, the maximum occurs at the boundary of the support constraint, where \hat{\theta} = \max_i x_i maximizes \ell by taking the smallest admissible value of \theta. The second derivative \frac{\partial^2 \ell}{\partial \theta^2} = \frac{n}{\theta^2} > 0 indicates convexity on the admissible domain, so the usual second-order check does not apply; the boundary solution and domain restriction nonetheless ensure that \hat{\theta} = \max_i x_i is the unique maximizer.
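
The uniform example is easy to verify numerically; the sketch below (with simulated data and an assumed true \theta = 4) shows the likelihood maximized at the sample maximum rather than at a stationary point of the score.

```python
# Sketch, with simulated data, of the uniform U(0, theta) likelihood: the score
# never vanishes and the maximum sits on the boundary theta = max(x_i).
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 4.0, size=20)      # assumed true theta = 4.0

def loglik(theta):
    # -n log(theta) on the admissible region, -inf when theta < max(x_i)
    return -len(x) * np.log(theta) if theta >= x.max() else -np.inf

grid = np.linspace(0.5, 8.0, 2001)
theta_hat = grid[np.argmax([loglik(t) for t in grid])]
print(theta_hat, x.max())             # MLE coincides with the sample maximum
```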

Exponential Families

Exponential families constitute a broad class of probability distributions that admit a sufficient statistic of fixed dimension, independent of sample size, as established by the Darmois–Koopman–Pitman theorem. The probability density (or mass) function of a distribution in the family takes the canonical form
f(x \mid \theta) = h(x) \exp\left( \eta(\theta)^\top T(x) - A(\theta) \right),
where \eta(\theta) is the natural (or canonical) parameter, T(x) is the sufficient statistic, h(x) is the base measure, and A(\theta) is the cumulant-generating function serving as a normalization constant. This parameterization facilitates analytical tractability in statistical inference, particularly for likelihood-based methods.
For an independent and identically distributed sample x_1, \dots, x_n from an exponential family distribution, the log-likelihood function simplifies to
\ell(\theta) = \sum_{i=1}^n \left[ \eta(\theta)^\top T(x_i) - A(\theta) \right] + \sum_{i=1}^n \log h(x_i).
The second term is constant with respect to \theta, so maximization depends only on the first term. The maximum likelihood estimator \hat{\theta} is found by setting the derivative of \eta(\theta)^\top \bar{T} - A(\theta) to zero, where \bar{T} = n^{-1} \sum_{i=1}^n T(x_i) is the sample average of the sufficient statistic. A key property is that \bar{T} is minimal sufficient, reducing the data to this low-dimensional summary for inference; the MLE satisfies \mathbb{E}_\theta[T(X)] = \nabla_\eta A(\theta) = \bar{T} evaluated at \hat{\theta}, equating observed and expected sufficient statistics under the model.
The Gamma distribution exemplifies these properties as a two-parameter exponential family with shape \alpha > 0 and rate \beta > 0. Its density is
f(x \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} \exp(-\beta x), \quad x > 0,
which reparameterizes in natural form with \eta_1 = \alpha - 1, \eta_2 = -\beta, T(x) = (\log x, x), and appropriate h(x) and A(\theta). The MLE equations are \hat{\beta} = \hat{\alpha} / \bar{x} and \log \hat{\alpha} - \psi(\hat{\alpha}) = \log \bar{x} - n^{-1} \sum_{i=1}^n \log x_i, where \psi(\cdot) is the digamma function \Gamma' / \Gamma; the shape equation typically requires numerical solution via Newton-Raphson iteration.
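
The Gamma MLE described above requires a numerical root of the digamma equation; this sketch simulates data with assumed parameters and applies Newton-Raphson, using a standard rough starting value \alpha_0 \approx 0.5 / (\log \bar{x} - \overline{\log x}).

```python
# Sketch of the Gamma shape/rate MLE, solving the digamma equation from the
# text by Newton-Raphson (data are simulated purely for illustration).
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=1.0 / 2.0, size=500)   # assumed alpha = 3, beta = 2

c = np.log(x.mean()) - np.mean(np.log(x))             # log xbar - mean(log x)
alpha = 0.5 / c                                       # rough starting value
for _ in range(50):                                   # Newton steps on log a - psi(a) = c
    f = np.log(alpha) - digamma(alpha) - c
    fprime = 1.0 / alpha - polygamma(1, alpha)
    alpha -= f / fprime

beta = alpha / x.mean()                               # rate MLE given the shape
print(alpha, beta)
```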

Interpretations and History

Frequentist Perspective

In the frequentist paradigm, the likelihood function provides the basis for estimating unknown fixed parameters through maximum likelihood estimation (MLE), where the estimator \hat{\theta} is the value that maximizes the likelihood L(\theta \mid x) for observed data x. Introduced by Fisher, this method selects the parameter value most compatible with the data under the assumed model, yielding estimators that are invariant to reparameterization and possess desirable properties such as consistency and asymptotic efficiency. Under standard regularity conditions, such as differentiability of the log-likelihood \ell(\theta) = \log L(\theta \mid x) and finite moments, the MLE \hat{\theta} is asymptotically normally distributed: \sqrt{n} (\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1}), where n is the sample size and I(\theta) is the Fisher information, defined as I(\theta) = \mathbb{E}\left[ -\frac{\partial^2 \ell(\theta)}{\partial \theta^2} \right]. The asymptotic variance I(\theta)^{-1} quantifies the precision of the estimator and attains the Cramér-Rao lower bound, establishing the MLE as asymptotically efficient in large samples. This framework emphasizes the sampling properties of estimators derived from the data-generating process, rather than any prior beliefs about parameters.

Frequentist confidence intervals for parameters leverage the asymptotic distribution of the MLE to construct pivotal quantities, such as \hat{\theta} \pm z_{\alpha/2} \sqrt{I(\hat{\theta})^{-1}/n}, where z_{\alpha/2} is the standard normal quantile. Alternatively, likelihood ratio-based intervals use the statistic 2[\ell(\hat{\theta}) - \ell(\theta)], which asymptotically follows a \chi^2 distribution with degrees of freedom equal to the parameter dimensionality, defining regions where this statistic does not exceed a critical value. These intervals quantify uncertainty in the sampling sense, covering the true parameter with the nominal probability in repeated experiments. For hypothesis testing, the likelihood ratio test compares nested models by evaluating the ratio \Lambda = L(\theta_0 \mid x) / L(\hat{\theta} \mid x) under a null hypothesis restricting \theta to a subset of the parameter space, with the statistic -2 \log \Lambda asymptotically \chi^2-distributed under the null. This procedure tests composite hypotheses and is optimal in the sense of Neyman-Pearson for simple cases, extending via large-sample theory to more general settings.

Despite these strengths, the frequentist view of likelihood has inherent limitations: it assigns no probability to specific parameter values, as parameters are treated as fixed unknowns rather than random variables, focusing instead on the long-run frequency properties of procedures across hypothetical repetitions of the sampling process. This emphasis on repeated-sampling validity precludes direct probabilistic statements about parameters in a single analysis.
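
The Wald and likelihood-ratio intervals discussed above can be compared on a simple binomial example with assumed counts; both are approximate and rely on the asymptotics described in the text.

```python
# Sketch comparing a Wald interval and a likelihood-ratio interval for a
# binomial proportion (assumed n = 40 trials, k = 12 successes).
import numpy as np
from scipy.stats import chi2

n, k = 40, 12
p_hat = k / n

# Wald interval: MLE +/- z * square root of the inverse observed information.
se = np.sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Likelihood-ratio interval: {p : 2[l(p_hat) - l(p)] <= chi2_{1, 0.95}}.
loglik = lambda p: k * np.log(p) + (n - k) * np.log(1 - p)
grid = np.linspace(1e-4, 1 - 1e-4, 10000)
keep = 2 * (loglik(p_hat) - loglik(grid)) <= chi2.ppf(0.95, df=1)
lr = (grid[keep].min(), grid[keep].max())

print(wald, lr)
```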

Bayesian Perspective

In Bayesian inference, the likelihood function serves as the sampling model that updates the prior distribution of the parameters to yield the posterior distribution. Bayes' theorem states that the posterior density π(θ|x) is proportional to the product of the likelihood L(θ|x) and the prior density π(θ): \pi(\theta \mid x) \propto L(\theta \mid x) \pi(\theta). Here, the likelihood quantifies the probability of the observed data x given parameters θ, effectively weighting the prior by the evidential support from the data. The role of the likelihood extends to model assessment, where the marginal likelihood m(x) = ∫ L(θ|x) π(θ) dθ integrates out the parameters to provide the prior predictive distribution of the data, enabling Bayesian model comparison via Bayes factors.

For complex likelihoods, posterior computation typically relies on numerical approximations. Markov chain Monte Carlo (MCMC) methods generate samples from the posterior by simulating a Markov chain with the target distribution as its stationary distribution, allowing inference through empirical averages. Variational inference, in contrast, posits a tractable approximating distribution and optimizes it to minimize the Kullback-Leibler divergence from the true posterior, often yielding faster but biased estimates. When the likelihood belongs to an exponential family, conjugate priors simplify computations by ensuring the posterior remains in the same family as the prior, facilitating closed-form updates.

A representative example is inference for a normal distribution N(μ, σ²) with unknown mean μ and variance σ², where the likelihood is L(\mu, \sigma^2 \mid x) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right). The conjugate prior is the normal-inverse-gamma distribution, parameterized as μ | σ² ~ N(m₀, σ² / κ₀) and σ² ~ IG(α₀, β₀), which yields a posterior of the same form with updated hyperparameters: κₙ = κ₀ + n, mₙ = (κ₀ m₀ + n \bar{x}) / κₙ, αₙ = α₀ + n/2, and βₙ = β₀ + (1/2) ∑(x_i - \bar{x})² + (κ₀ n / (2 κₙ)) (\bar{x} - m₀)². This analytical tractability highlights the likelihood's compatibility with structured priors in Bayesian models.
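
The conjugate normal-inverse-gamma update quoted above is straightforward to compute; the sketch below uses made-up data and assumed prior hyperparameters.

```python
# Sketch of the conjugate normal-inverse-gamma update, with hypothetical data
# and assumed prior hyperparameters.
import numpy as np

x = np.array([3.1, 2.4, 3.8, 2.9, 3.3, 2.7])
n, xbar = len(x), x.mean()

# Assumed prior hyperparameters.
m0, kappa0, alpha0, beta0 = 0.0, 1.0, 2.0, 2.0

kappa_n = kappa0 + n
m_n     = (kappa0 * m0 + n * xbar) / kappa_n
alpha_n = alpha0 + n / 2
beta_n  = (beta0 + 0.5 * np.sum((x - xbar)**2)
           + kappa0 * n * (xbar - m0)**2 / (2 * kappa_n))

print(m_n, kappa_n, alpha_n, beta_n)   # posterior hyperparameters
```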

Historical Development

The concept of likelihood has roots in the early development of probability theory, particularly in the work of Pierre-Simon Laplace, who in the late 18th and early 19th centuries employed inverse-probability methods to infer causes from observed effects, treating the probability of parameters given data as proportional to the probability of data given parameters. This approach, however, conflated probability and likelihood, and was later criticized for its reliance on uniform priors and subjective elements. The modern formulation of the likelihood function emerged with Ronald A. Fisher in his 1922 paper "On the Mathematical Foundations of Theoretical Statistics," where he explicitly distinguished likelihood from probability, defining it as a measure of the support that observed data provide for different parameter values without invoking prior distributions. Fisher introduced the method of maximum likelihood estimation as a principled way to select parameter values that maximize this support, laying the groundwork for frequentist inference.

Subsequent developments in the 1930s built on Fisher's ideas, with Jerzy Neyman and Egon S. Pearson incorporating likelihood ratios into hypothesis testing in their 1933 paper "On the Problem of the Most Efficient Tests of Statistical Hypotheses," which formalized the use of likelihood ratios to derive optimal tests under fixed error rates. Around the same time, Maurice S. Bartlett advanced the theory in his 1936 paper "Statistical Information and Properties of Sufficiency," exploring how sufficient statistics preserve the information in the likelihood function for inference. A key milestone came in 1972 with A. W. F. Edwards' book Likelihood, which articulated a distinct "likelihoodist" approach, emphasizing relative likelihoods as the basis for scientific inference independent of frequentist or Bayesian paradigms. Edwards' work synthesized these historical developments and advocated for likelihood as a unified inferential tool.

Post-1980s computational advances revolutionized the handling of likelihood functions in complex models, particularly through the emergence of Markov chain Monte Carlo (MCMC) methods, which enabled efficient evaluation and exploration of likelihoods in high-dimensional spaces previously intractable. Seminal contributions, such as Gelfand and Smith's 1990 work on sampling-based approaches to calculating posterior densities, extended these techniques to likelihood-based computation in both frequentist and Bayesian frameworks. This shift marked a transition from Laplace's inverse probability to a modern synthesis in which the likelihood serves as a core element of both frequentist estimation and Bayesian updating.

Other Interpretations

The likelihoodist interpretation, advanced by Edwards, posits that the likelihood function provides a direct measure of evidential support for hypotheses through likelihood ratios, eschewing both probabilistic interpretations of the likelihood and prior distributions. In this view, the relative likelihood of parameter values given the data serves as the primary evidential tool, enabling inference without reliance on long-run frequencies or subjective priors. Edwards further advocates for "support intervals," constructed from regions where the relative likelihood exceeds a fixed threshold, such as 1/8 or 1/16, treating the resulting sets as direct measures of evidential strength for ranges of parameter values.

Another prominent interpretation leverages the likelihood for model selection via information criteria, exemplified by the Akaike information criterion (AIC). Introduced by Akaike, AIC is formulated as \text{AIC} = -2 \ell(\hat{\theta}) + 2p, where \ell(\hat{\theta}) is the maximized log-likelihood and p is the number of parameters, imposing a penalty for model complexity to approximate the expected Kullback-Leibler divergence between the true distribution and the fitted model. This approach uses the likelihood to balance goodness-of-fit against complexity, facilitating comparisons among competing models without invoking full frequentist error rates or Bayesian posteriors.

Birnbaum explored deriving confidence-type intervals directly from the likelihood function, arguing that the evidential meaning of an experiment is fully captured by the likelihood function, independent of ancillary sampling details. This attempt aimed to unify inference under the likelihood principle but faced subsequent critiques for inconsistencies with established frequentist procedures and logical gaps in its foundational arguments. More recently, direct-likelihood methods have emerged in machine learning, such as Deep Direct Likelihood Knockoffs, which optimize likelihood-based objectives to generate knockoff samples for variable selection with false discovery rate control, bypassing simulation-heavy approaches. These interpretations share a focus on the likelihood as a standalone evidential quantity, promoting direct comparisons between parameter values while sidestepping the probabilistic machinery of Bayesian or frequentist paradigms.
