
Expectation–maximization algorithm

The Expectation–maximization (EM) algorithm is an iterative computational method for obtaining maximum likelihood estimates of parameters in statistical models, especially those involving incomplete data or latent (unobserved) variables, where the likelihood function is difficult to maximize directly. Developed by Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin, it alternates between two steps: the expectation (E) step, which calculates the expected value of the complete-data log-likelihood conditioned on the observed data and current parameter estimates, and the maximization (M) step, which updates the parameters by maximizing this expected log-likelihood. The process repeats until the parameters stabilize, ensuring a monotonic increase in the observed-data likelihood and convergence to a stationary point under standard regularity conditions. Introduced in the 1977 paper "Maximum Likelihood from Incomplete Data via the EM Algorithm," the method generalizes earlier techniques for handling missing information and has since become a fundamental tool in statistics and machine learning due to its simplicity, broad applicability, and theoretical guarantees.

In practice, the EM algorithm addresses scenarios where latent variables represent hidden structures, such as cluster assignments in data or unobserved states in sequences, by treating the missing parts as random variables distributed according to the current model. The algorithm's versatility has led to widespread adoption across fields, including parameter estimation for Gaussian mixture models (GMMs) in clustering and density estimation tasks, where it iteratively refines component means, covariances, and mixing proportions from unlabeled data. It is also central to training hidden Markov models (HMMs), enabling maximum likelihood fitting for applications such as speech recognition and genomic sequence analysis by estimating transition probabilities and emission parameters via the Baum-Welch algorithm, a specific EM implementation. Beyond these, EM facilitates missing-data imputation in regression and factor analysis, variational inference approximations in Bayesian settings, and extensions such as the generalized EM (GEM) algorithm for problems in which the M-step cannot be solved in closed form.

Overview and Motivation

Introduction

The expectation–maximization (EM) algorithm is an iterative procedure for computing maximum likelihood estimates in statistical models where data is incomplete or involves unobserved latent variables. It operates by alternating between an expectation (E) step, which calculates the expected complete-data log-likelihood based on current parameter estimates, and a maximization (M) step, which updates the parameters by maximizing this expected log-likelihood. This approach addresses scenarios where direct maximization of the observed-data likelihood is computationally intractable because of the summation or integration over latent variables or missing observations. By augmenting the observed data with expectations over the missing parts, EM transforms the problem into a series of simpler optimizations on the complete-data likelihood. First formalized in the 1977 paper by Dempster, Laird, and Rubin, the EM algorithm provided a general framework for tackling complex estimation problems that previously required ad hoc, model-specific methods. A fundamental property is that each iteration yields a non-decreasing value of the observed-data likelihood, ensuring monotonic improvement toward a stationary point of the likelihood.

Problem Setup

The Expectation–maximization (EM) algorithm is designed to perform maximum likelihood estimation (MLE) in statistical models where the data is incomplete, meaning some parts of the observations are not directly accessible or are latent variables. In this framework, the full dataset consists of observed variables \mathbf{x} = \{x_1, \dots, x_n\} and unobserved latent variables \mathbf{z} = \{z_1, \dots, z_n\}, with the model parameterized by \theta. The complete-data likelihood, which would be available if all latent variables were observed, is given by the joint distribution p(\mathbf{x}, \mathbf{z} \mid \theta). This complete-data log-likelihood, \log p(\mathbf{x}, \mathbf{z} \mid \theta), is typically straightforward to maximize with respect to \theta using standard optimization techniques. In practice, only the observed data \mathbf{x} is available, leading to the observed-data likelihood p(\mathbf{x} \mid \theta), which is the marginal likelihood obtained by integrating or summing the joint distribution over all possible values of the latent variables: p(\mathbf{x} \mid \theta) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z} \mid \theta). The goal of MLE in this setting is to find the parameter value \hat{\theta} that maximizes the observed-data log-likelihood \ell(\theta) = \log p(\mathbf{x} \mid \theta). This formulation arises naturally in models with hidden structures, such as mixture models where \mathbf{z} indicates component assignments or latent factor models where \mathbf{z} represents underlying factors. Direct maximization of \ell(\theta) poses significant challenges due to the intractability of the summation (or integration) over \mathbf{z}, especially when the latent space is high-dimensional or the number of possible configurations is exponentially large. Computing p(\mathbf{x} \mid \theta) explicitly often requires approximations or numerical methods that are computationally expensive, and the resulting likelihood surface may be multimodal, causing gradient-based or other direct optimization approaches to converge to suboptimal local maxima rather than the global optimum. These issues are particularly pronounced in models with complex dependencies between \mathbf{x} and \mathbf{z}, motivating iterative procedures like EM to indirectly optimize the observed likelihood.
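To make the marginalization over latent variables concrete, the following minimal Python sketch evaluates \ell(\theta) for a univariate two-component Gaussian mixture, where each latent z_i is the unobserved component label of a data point. The data values and parameter settings are illustrative placeholders, not part of any standard benchmark.

import numpy as np
from scipy.stats import norm

x = np.array([1.0, 1.5, 4.0, 4.5, 5.0])   # observed data (placeholder values)

def observed_log_likelihood(x, pi, mu, sigma):
    """ell(theta) = sum_i log sum_k pi_k N(x_i | mu_k, sigma_k^2)."""
    # p(x_i, z_i = k | theta) = pi_k * N(x_i | mu_k, sigma_k^2)
    joint = pi[None, :] * norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    # marginalize the latent label z_i by summing over k
    return np.sum(np.log(joint.sum(axis=1)))

theta = dict(pi=np.array([0.5, 0.5]),
             mu=np.array([1.25, 4.5]),
             sigma=np.array([1.0, 1.0]))
print(observed_log_likelihood(x, **theta))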

Historical Development

Origins and Key Milestones

The roots of the expectation-maximization (EM) algorithm trace back to the mid-20th century, when statisticians began developing iterative methods to handle missing data in maximum likelihood estimation. In 1958, H. O. Hartley proposed an explicit iterative procedure for computing maximum likelihood estimates from incomplete observations in a multivariate normal distribution, marking one of the earliest formulations resembling the EM framework. This approach involved imputing missing values and updating parameters iteratively, laying groundwork for handling incomplete datasets.

During the 1960s, further advancements built on these ideas, particularly in multivariate analysis and mixed models. E. M. L. Beale explored iterative techniques for estimating parameters with missing values in multivariate data, emphasizing practical computational methods for such problems. Concurrently, C. R. Rao advanced the underlying estimation theory through his work on score tests and related principles, as detailed in his 1965 book Linear Statistical Inference and Its Applications. In 1967, J. N. K. Rao collaborated with Hartley on maximum likelihood estimation for mixed analysis of variance models, employing an iterative procedure equivalent to EM. In the late 1960s and early 1970s, L. E. Baum and colleagues developed the Baum-Welch algorithm for parameter estimation in hidden Markov models, a special case of the iterative maximization approach later formalized as EM. These efforts highlighted the potential of iterative maximization for complex statistical models but lacked a unified general framework.

The modern EM algorithm was formalized in 1977 by A. P. Dempster, N. M. Laird, and D. B. Rubin, who presented it as a broadly applicable method for maximum likelihood estimation from incomplete data, complete with a proof of convergence under certain conditions. This seminal paper unified prior scattered approaches, including those of Hartley, Beale, Rao, and Baum, and demonstrated the algorithm's generality across various incomplete data scenarios, establishing EM as a cornerstone of modern statistical methodology.

In the 1980s, the EM algorithm saw widespread adoption in applied statistics and pattern recognition, driven by its utility in tasks such as density estimation and clustering. A key application emerged in fitting Gaussian mixture models, where EM iteratively estimates component parameters from observed data, as detailed in influential reviews like Redner and Walker's 1984 analysis of mixture densities. By the 1990s, practical implementations proliferated in statistical software, including routines for Gaussian mixtures in mainstream statistical packages and early versions of MATLAB's Statistics Toolbox, facilitating its integration into fields such as machine learning and bioinformatics. This era marked EM's transition from theoretical statistics to a standard tool in computational modeling.

Major Contributors

The Expectation–maximization (EM) algorithm was formally introduced by Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin in their seminal 1977 paper, which provided a general framework for maximum likelihood estimation in the presence of missing data or latent variables. Dempster, a statistician at Harvard University, contributed the core iterative procedure inspired by earlier maximization techniques, while Laird, then a young researcher, focused on the algorithmic implementation for incomplete data problems. Rubin, who collaborated closely with Dempster at Harvard, emphasized the probabilistic interpretation and convergence properties, drawing from his work on multiple imputation methods. Their collaboration generalized ad hoc methods used in specific models, such as mixture models, into a unified algorithm that has since been applied across statistics and machine learning.

A key precursor to the EM algorithm was the work of Leslie E. Baum and colleagues in 1970, who developed the forward-backward algorithm for estimating parameters in hidden Markov models (HMMs). Baum, a mathematician at the Institute for Defense Analyses, along with Ted Petrie, George Soules, and Norman Weiss, formulated this iterative maximization technique for probabilistic functions of Markov chains, which directly embodies the E-step and M-step structure later formalized by Dempster et al. Their approach addressed parameter estimation in sequential and time-series analysis, laying the groundwork for EM's application in sequential data models.

In the 1980s and 1990s, Geoffrey E. Hinton and collaborators extended EM to machine learning applications, particularly in neural networks and unsupervised learning. Hinton, a professor at the University of Toronto, popularized EM for training models like mixtures of factor analyzers, integrating it with probabilistic interpretations of neural architectures to handle latent variables in high-dimensional data. Working with Zoubin Ghahramani, Hinton's 1997 technical report detailed an EM procedure for mixtures of factor analyzers, which performs clustering and local dimensionality reduction simultaneously, influencing later developments in generative models. Radford M. Neal advanced EM in the 1990s through variational and incremental variants, enhancing its scalability for complex probabilistic models. Neal, a researcher at the University of Toronto, co-authored a 1998 chapter with Hinton that justified sparse and incremental EM updates, providing theoretical support for approximations that maintain convergence guarantees in large-scale graphical models. His contributions bridged EM with Bayesian computation, enabling efficient inference in infinite mixture models and foreshadowing modern variational methods.

Core Formulation

Notation and Assumptions

The formulation of the expectation-maximization (EM) algorithm relies on a standard set of mathematical notations for incomplete-data problems in parametric statistical models. The observed data, which are the available measurements or samples, are denoted by \mathbf{X} = \{x_1, \dots, x_n\}, where each x_i is a vector in some space. The complete data consist of both the observed data and latent (or missing) variables \mathbf{Z} = \{z_1, \dots, z_n\}, forming the pair (\mathbf{X}, \mathbf{Z}). The unknown model parameters are represented by the vector \theta \in \Theta, where \Theta is the parameter space. The complete-data log-likelihood is defined as \log L_c(\theta) = \log p(\mathbf{X}, \mathbf{Z} \mid \theta), while the observed-data log-likelihood is the marginal \log L(\theta) = \log p(\mathbf{X} \mid \theta) = \log \int p(\mathbf{X}, \mathbf{Z} \mid \theta) \, d\mathbf{Z}. Several key assumptions underpin the applicability and theoretical guarantees of the EM algorithm. The observations in \mathbf{X} are assumed to be independent and identically distributed (i.i.d.) from the parametric model p(\mathbf{X} \mid \theta). The data are incomplete due to a missing-data mechanism, often assumed to be missing at random (MAR), meaning the probability of missingness depends only on the observed data and not on the unobserved values themselves; this ensures the observed likelihood correctly represents the full model. Alternatively, the model may belong to the exponential family, where the density takes the form p(\mathbf{X}, \mathbf{Z} \mid \theta) = h(\mathbf{X}, \mathbf{Z}) \exp(\eta(\theta) \cdot T(\mathbf{X}, \mathbf{Z}) - A(\theta)), facilitating closed-form computations in the maximization step via sufficient statistics. Additionally, it is assumed that the observed log-likelihood \log L(\theta) possesses a unique global maximum, though the algorithm may converge to local maxima depending on initialization. Central to the EM procedure is the Q-function, which provides a surrogate for optimizing the intractable observed log-likelihood. Defined as the conditional expectation of the complete-data log-likelihood given the observed data and the current parameter estimate, it is expressed as Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}} \left[ \log L_c(\theta) \right] = \int \log L_c(\theta) \, p(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}) \, d\mathbf{Z}, where \theta^{(t)} denotes the parameter estimate at iteration t. Maximizing Q(\theta \mid \theta^{(t)}) with respect to \theta yields a non-decreasing sequence of likelihood values.
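Under the exponential-family form stated above, the Q-function separates in a way that explains why the M-step frequently reduces to matching expected sufficient statistics; the following display is a routine consequence of that assumption. Writing \bar{T}^{(t)} = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}}[T(\mathbf{X}, \mathbf{Z})], one has Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}}\left[ \log h(\mathbf{X}, \mathbf{Z}) \right] + \eta(\theta) \cdot \bar{T}^{(t)} - A(\theta), where the first term does not depend on \theta. The E-step therefore only needs the expected sufficient statistics \bar{T}^{(t)}, and the M-step solves the stationarity condition \nabla_\theta \left\{ \eta(\theta) \cdot \bar{T}^{(t)} - A(\theta) \right\} = 0, which is the complete-data maximum likelihood equation with T(\mathbf{X}, \mathbf{Z}) replaced by its conditional expectation.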

E-Step and M-Step Procedure

The EM algorithm proceeds iteratively, beginning with the selection of an initial parameter estimate \theta^{(0)}, which can be obtained via methods such as moment matching or random initialization to ensure a reasonable starting point within the parameter space. At each iteration t, the E-step involves computing the expected complete-data log-likelihood under the current parameters, yielding the Q-function defined as Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}} \left[ \log p(X, Z \mid \theta) \right]. These expectations are typically calculated using the posterior distribution p(Z \mid X, \theta^{(t)}), often analytically for conjugate models or approximately otherwise. The subsequent M-step updates the parameters by maximizing the Q-function: \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}). This maximization frequently admits closed-form expressions, such as weighted maximum likelihood for mixture models, though numerical optimization may be required in more complex cases. Iterations continue until convergence, assessed by criteria like the change in the observed-data log-likelihood satisfying |\ell(\theta^{(t+1)}) - \ell(\theta^{(t)})| < \epsilon for a small \epsilon > 0, or by reaching a predefined maximum number of iterations to prevent excessive computation. The general procedure can be outlined in pseudocode as follows:
Initialize θ⁰; set t ← 0

while not converged do
    // E-step: Compute Q function
    Q(θ | θᵗ) ← E_{Z|X,θᵗ} [log p(X, Z | θ)]
    
    // M-step: Maximize Q
    θᵗ⁺¹ ← argmax_θ Q(θ | θᵗ)
    
    // Check convergence
    if |ℓ(θᵗ⁺¹) - ℓ(θᵗ)| < ε then
        break
    end if
    t ← t + 1
end while

return θᵗ⁺¹
This outline applies to the broad class of incomplete-data problems amenable to EM, with model-specific adaptations for the expectations and maximizations.
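A minimal, model-agnostic sketch of this loop in Python is shown below; the e_step, m_step, and log_likelihood callables are placeholders that a specific model (for example, a mixture or an HMM) would have to supply, so this is a template rather than a complete implementation.

from typing import Any, Callable

def em(theta: Any,
       e_step: Callable[[Any], Any],        # theta -> expected statistics (e.g., responsibilities)
       m_step: Callable[[Any], Any],        # expected statistics -> updated theta
       log_likelihood: Callable[[Any], float],
       tol: float = 1e-6, max_iter: int = 500) -> Any:
    """Generic EM driver: alternate E- and M-steps until the observed-data
    log-likelihood stops improving by more than tol."""
    ll_old = log_likelihood(theta)
    for _ in range(max_iter):
        stats = e_step(theta)            # E-step: expectations under p(Z | X, theta)
        theta = m_step(stats)            # M-step: maximize Q(. | theta)
        ll_new = log_likelihood(theta)
        if abs(ll_new - ll_old) < tol:   # convergence check on the log-likelihood
            break
        ll_old = ll_new
    return theta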

Parameter Interpretations

In the expectation-maximization (EM) algorithm, the parameter update \theta^{(t+1)} represents a local maximizer of an approximation to the observed-data log-likelihood \log L(\theta). Specifically, during the M-step, \theta^{(t+1)} is obtained by maximizing the auxiliary function Q(\theta \mid \theta^{(t)}), which provides a tractable surrogate for the intractable \log L(\theta) by incorporating expectations over the latent variables given the current estimate \theta^{(t)}. This maximization ensures that the observed log-likelihood is non-decreasing at each iteration, as Q(\theta^{(t+1)} \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)}), thereby approximating the ascent toward a local maximum of \log L(\theta). The posterior distribution computed in the E-step, p(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)}), serves as soft assignments for the latent variables \mathbf{Z}, assigning probabilistic weights to possible configurations rather than hard classifications. These posteriors reflect the current model's uncertainty about the hidden states or labels underlying the observed data \mathbf{X}, enabling the algorithm to handle incomplete data by averaging over plausible completions weighted by their likelihood under \theta^{(t)}. This soft assignment is crucial for models with discrete or continuous latents, such as mixtures where it corresponds to responsibilities of component memberships. The Q-function, defined as Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)})} [\log p(\mathbf{X}, \mathbf{Z} \mid \theta)], serves as the key surrogate in the iterative maximization process. As the algorithm iterates, successive maximizations of Q lead to a monotonic increase in the observed log-likelihood \log L(\theta), with convergence to a stationary point under suitable conditions. Within the standard decomposition \log L(\theta) = Q(\theta \mid \theta^{(t)}) + H(\theta \mid \theta^{(t)}), where H(\theta \mid \theta^{(t)}) = -\mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \theta^{(t)})} [\log p(\mathbf{Z} \mid \mathbf{X}, \theta)] is the cross-entropy of the posterior p(\mathbf{Z} \mid \mathbf{X}, \theta) under the current posterior, this cross-entropy term relates the surrogate to the true likelihood. The decomposition highlights how EM balances expected model fit and uncertainty about the latent variables when approximating maximum likelihood estimation.

Theoretical Foundations

Monotonicity and Convergence

The expectation–maximization (EM) algorithm possesses a key monotonicity property: at each iteration, the observed-data log-likelihood satisfies \ell(\theta^{(t+1)}) \geq \ell(\theta^{(t)}), where \theta^{(t)} denotes the parameter estimate at iteration t, with equality if and only if \theta^{(t)} is a stationary point of the likelihood function. This non-decreasing behavior of the likelihood ensures that the algorithm progressively improves the fit to the observed data without backtracking, providing a reliable ascent direction toward better parameter estimates. Under standard regularity conditions—such as the differentiability of the log-likelihood function and the compactness of the parameter space—the EM sequence converges to a local maximum of the likelihood. Specifically, all limit points of the parameter sequence are stationary points, and if the auxiliary function Q(\theta' \mid \theta) satisfies certain continuity and strict increase conditions away from local maxima, the algorithm converges to a local maximizer rather than a mere stationary point. The typical convergence rate is linear, meaning the error decreases geometrically with a factor less than 1 per iteration, though this rate can approach 1 (effectively sublinear) in challenging scenarios such as highly overlapping mixture components. In practice, the EM algorithm's performance is sensitive to the initial parameter values, which can cause it to converge to suboptimal local maxima depending on the starting point. Furthermore, convergence may become particularly slow near the boundaries of the parameter space, where the likelihood landscape is flat or the E-step approximations are less informative, often necessitating careful initialization strategies or monitoring for stagnation.

Proof of Local Maximization

The proof of local maximization in the expectation-maximization (EM) algorithm demonstrates that each iteration does not decrease the observed-data log-likelihood \ell(\theta) = \log L(\theta), where L(\theta) is the likelihood function, and that fixed points correspond to stationary points of \ell(\theta). This establishes that the algorithm converges to a local maximum (or stationary point) under mild regularity conditions, though global convergence is rare due to the typical non-convexity of \ell(\theta). Consider the decomposition of the observed-data log-likelihood using the conditional distribution of the latent variables Z given the observed data X and current parameters \theta^{(t)}: \ell(\theta) = \mathbb{E}_{Z \sim p(Z \mid X, \theta^{(t)})} \left[ \log p(X, Z \mid \theta) \right] - \mathbb{E}_{Z \sim p(Z \mid X, \theta^{(t)})} \left[ \log p(Z \mid X, \theta) \right]. The first term is the expectation-step auxiliary function Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}} [\log p(X, Z \mid \theta)]. Adding and subtracting \mathbb{E}_{Z \mid X, \theta^{(t)}}[\log p(Z \mid X, \theta^{(t)})] rewrites the second term as the conditional entropy H(\theta^{(t)}) = -\mathbb{E}_{Z \mid X, \theta^{(t)}} [\log p(Z \mid X, \theta^{(t)})] plus the Kullback-Leibler divergence \mathrm{KL}(p(Z \mid X, \theta^{(t)}) \parallel p(Z \mid X, \theta)), which is nonnegative and equals zero if and only if p(Z \mid X, \theta) = p(Z \mid X, \theta^{(t)}) for all Z. Thus, \ell(\theta) = Q(\theta \mid \theta^{(t)}) + H(\theta^{(t)}) + \mathrm{KL}(p(Z \mid X, \theta^{(t)}) \parallel p(Z \mid X, \theta)) \geq Q(\theta \mid \theta^{(t)}) + H(\theta^{(t)}), with equality holding at \theta = \theta^{(t)}. This lower bound follows directly from the nonnegativity of the KL divergence. An alternative derivation applies Jensen's inequality to the concave logarithm function. Since \log p(X \mid \theta) = \log \sum_Z p(X, Z \mid \theta), rewrite it as \log p(X \mid \theta) = \log \sum_Z p(Z \mid X, \theta^{(t)}) \cdot \frac{p(X, Z \mid \theta)}{p(Z \mid X, \theta^{(t)})}. By Jensen's inequality, \log p(X \mid \theta) \geq \sum_Z p(Z \mid X, \theta^{(t)}) \log \left( \frac{p(X, Z \mid \theta)}{p(Z \mid X, \theta^{(t)})} \right) = Q(\theta \mid \theta^{(t)}) + H(\theta^{(t)}), where H(\theta^{(t)}) is the same conditional entropy as above, independent of \theta. Again, equality holds at \theta = \theta^{(t)}. In the maximization step, \theta^{(t+1)} is chosen to maximize Q(\theta \mid \theta^{(t)}), so Q(\theta^{(t+1)} \mid \theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)}). Substituting into the bound gives \ell(\theta^{(t+1)}) \geq Q(\theta^{(t+1)} \mid \theta^{(t)}) + H(\theta^{(t)}) \geq Q(\theta^{(t)} \mid \theta^{(t)}) + H(\theta^{(t)}) = \ell(\theta^{(t)}), proving that the log-likelihood is non-decreasing: \ell(\theta^{(t+1)}) \geq \ell(\theta^{(t)}). Under standard regularity conditions (e.g., differentiability of the densities and compactness of the parameter space), the sequence \{\ell(\theta^{(t)})\} is bounded above and thus converges, implying that \theta^{(t)} has limit points. For the stationarity condition, suppose the algorithm reaches a fixed point where \theta^{(t+1)} = \theta^{(t)} = \theta^*. Then \theta^* maximizes Q(\theta \mid \theta^*), so \nabla_\theta Q(\theta \mid \theta^*) \big|_{\theta = \theta^*} = 0. From the decomposition, differentiating \ell(\theta) with \theta^{(t)} = \theta^* held fixed yields \nabla_\theta \ell(\theta) = \nabla_\theta Q(\theta \mid \theta^*) + \nabla_\theta \mathrm{KL}(p(Z \mid X, \theta^*) \parallel p(Z \mid X, \theta)).
At \theta = \theta^*, the gradient of the KL term vanishes because the divergence attains its minimum of zero when the two distributions coincide, so \nabla_\theta \ell(\theta^*) = \nabla_\theta Q(\theta \mid \theta^*) \big|_{\theta = \theta^*} = 0. Thus, fixed points are stationary points of \ell(\theta). Global convergence to the maximum likelihood estimate is generally not guaranteed because \ell(\theta) is typically non-convex in \theta, leading to multiple local maxima, saddle points, or flat regions depending on the model (e.g., mixtures of Gaussians). The algorithm converges to one of these local stationary points, with the outcome sensitive to initialization.
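The inequality \ell(\theta) \geq Q(\theta \mid \theta^{(t)}) + H(\theta^{(t)}), with equality at \theta = \theta^{(t)}, can be checked numerically on a toy model. The sketch below does so for a two-component univariate Gaussian mixture in which only one component mean is free; the data and the current estimate mu_t are arbitrary illustrative choices.

import numpy as np
from scipy.stats import norm

x = np.array([-0.5, 0.3, 2.0, 2.5, 3.1])   # illustrative data
pi = np.array([0.5, 0.5])

def comp_pdf(x, mu):
    """Component densities: component 0 has free mean mu, component 1 is fixed at 0."""
    return np.column_stack([norm.pdf(x, loc=mu, scale=1.0),
                            norm.pdf(x, loc=0.0, scale=1.0)])

def loglik(mu):
    return np.log(comp_pdf(x, mu) @ pi).sum()

def lower_bound(mu, mu_t):
    """Q(mu | mu_t) + H(mu_t): the Jensen / ELBO lower bound on loglik(mu)."""
    post = pi * comp_pdf(x, mu_t)
    post /= post.sum(axis=1, keepdims=True)          # gamma_ik = p(z_i = k | x_i, mu_t)
    q = (post * np.log(pi * comp_pdf(x, mu))).sum()  # Q term
    h = -(post * np.log(post)).sum()                 # entropy of the fixed posterior
    return q + h

mu_t = 1.0
for mu in [0.0, 0.5, 1.0, 1.5, 2.5]:
    assert loglik(mu) >= lower_bound(mu, mu_t) - 1e-9   # bound holds; tight at mu = mu_t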

Maximization-Maximization Perspective

The maximization-maximization (MM) perspective reframes the expectation-maximization (EM) algorithm as an iterative procedure that alternately maximizes two variables: the model parameters \theta and an auxiliary distribution q(\mathbf{z}) over the latent variables \mathbf{z}. Specifically, EM seeks to maximize the objective function \mathcal{L}(q, \theta) = \int q(\mathbf{z}) \log \frac{p(\mathbf{x}, \mathbf{z} \mid \theta)}{q(\mathbf{z})} \, d\mathbf{z}, which provides a lower bound on the observed-data log-likelihood \log p(\mathbf{x} \mid \theta). This formulation establishes equivalence to the standard EM algorithm: the M-step maximizes \mathcal{L}(q, \theta) over \theta for a fixed q(\mathbf{z}), yielding the maximum-likelihood estimate under the complete-data likelihood averaged by q; conversely, the E-step maximizes over q(\mathbf{z}) for fixed \theta, setting q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}, \theta) to tighten the bound. The MM view highlights EM's connection to coordinate ascent optimization, where each step improves the joint objective monotonically, and positions EM as a special case of broader variational optimization frameworks that relax the auxiliary distribution. Despite these insights, the approach remains susceptible to local optima, inheriting the same convergence limitations as the classical EM procedure.

Extensions and Variants

Incremental and Online EM

The incremental expectation-maximization (IEM) algorithm addresses the computational challenges of the standard batch EM procedure for large datasets by updating model parameters after processing each individual data point, using stochastic approximations to approximate the expectation step. This approach recalculates the posterior distribution for only one unobserved variable per data point, enabling efficient single-pass processing without requiring multiple full dataset iterations. Introduced by Neal and Hinton in 1998, IEM justifies its incremental updates through a variational bound on the log-likelihood, ensuring that each partial update does not decrease this bound under mild conditions. Building on this, the online EM algorithm further adapts EM for streaming or sequentially arriving data, performing parameter updates in a continual manner as new observations become available. Developed by Sato and Ishii in 2000 for models like the normalized Gaussian network, it incorporates a discount factor (or forgetting factor) to exponentially downweight the influence of past data, allowing the algorithm to adapt to non-stationary data streams where the underlying distribution may evolve over time. When the discount factor is scheduled appropriately, such as decreasing gradually, the online EM approaches the behavior of batch EM in the limit, but its primary advantage lies in enabling real-time learning without storing the entire dataset. This makes it particularly efficient for big data applications, such as online clustering, where computational resources and memory are limited. Regarding convergence, both incremental and online EM variants exhibit almost sure convergence to a local maximum of the likelihood under stochastic approximation conditions analogous to those in stochastic gradient descent, provided the step sizes decrease appropriately, namely summing to infinity while their squares sum to a finite value. These properties were rigorously established in subsequent analyses, confirming that the algorithms remain reliable for parameter estimation despite their stochastic nature. Emerging in the late 1990s, these adaptations significantly enhanced EM's scalability for massive and dynamic datasets.
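A minimal sketch of the online idea for a univariate two-component Gaussian mixture is shown below. It follows the general recipe of blending the expected sufficient statistics of each new point into running averages with a decaying step size; the specific schedule, data, and initialization are illustrative assumptions rather than the exact update of any particular published variant.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
stream = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(4.0, 1.0, 500)])
rng.shuffle(stream)

K = 2
pi = np.full(K, 1.0 / K)
mu = np.array([-1.0, 1.0])              # illustrative initialization
var = np.ones(K)
# running sufficient statistics per component: [weight, weighted x, weighted x^2],
# initialized to be consistent with the starting parameters
s = np.column_stack([pi, pi * mu, pi * (var + mu ** 2)])

for t, x in enumerate(stream, start=1):
    rho = (t + 10) ** -0.6              # decaying step size: sum rho = inf, sum rho^2 < inf
    # E-step for the single point: responsibilities under current parameters
    r = pi * norm.pdf(x, loc=mu, scale=np.sqrt(var))
    r /= r.sum()
    # blend the point's expected statistics into the running averages
    new_s = np.column_stack([r, r * x, r * x ** 2])
    s = (1 - rho) * s + rho * new_s
    # M-step: map running statistics to parameters
    pi = s[:, 0] / s[:, 0].sum()
    mu = s[:, 1] / s[:, 0]
    var = np.maximum(s[:, 2] / s[:, 0] - mu ** 2, 1e-6)

print(pi, mu, var)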

α-EM and ECM Algorithms

The α-EM algorithm is a parameterized extension of the standard EM procedure that constructs a surrogate function for likelihood maximization using α-logarithmic information measures derived from α-divergences. This approach generalizes the standard Q-function by replacing the natural logarithm with the α-logarithm (the inverse function of the α-exponential), where the parameter α allows for flexible tuning to accelerate convergence while maintaining the monotonic increase in the observed likelihood. Introduced by Yasuo Matsuyama in 2003, α-EM enables faster iterations by selecting an appropriate α to balance stability and speed, often outperforming the traditional EM in terms of computational efficiency for high-dimensional parameter spaces. The expectation-conditional maximization (ECM) algorithm, proposed by Xiao-Li Meng and Donald B. Rubin in 1993, modifies the standard EM algorithm by replacing the single maximization step over all parameters with a sequence of conditional maximization steps over subsets or partitions of the parameter space. In ECM, the E-step remains unchanged, computing the expected complete-data log-likelihood, but the M-step is decomposed into simpler conditional maximizations, each optimizing a subset of parameters while holding others fixed at their current values, which simplifies implementation for models where joint maximization is intractable. This conditional approach is particularly advantageous in multivariate settings, such as estimating parameters in multivariate normal models with missing data, where it reduces the complexity of each update without requiring closed-form solutions for the full parameter vector. Both α-EM and ECM preserve the key properties of the standard EM algorithm, including monotonic non-decrease of the observed-data likelihood at each iteration and convergence to a stationary point under standard regularity conditions, ensuring reliable optimization in batch settings. In practice, these variants often exhibit faster convergence rates than vanilla EM for complex models, with α-EM providing acceleration through its tunable surrogate construction and ECM offering computational savings via decomposed updates, making them suitable for large-scale statistical inference tasks like mixture model fitting.

The Expectation–maximization (EM) algorithm can also be viewed as a limiting case of variational inference, in which the variational distribution over the latent variables is taken to be the exact conditional posterior under the current parameter estimates. In this interpretation, the E-step corresponds to setting the variational posterior to the exact conditional distribution given the observed data and parameters, making the approximation tight for that iteration. Variational inference approximates the intractable posterior distribution p(\mathbf{z} \mid \mathbf{x}, \theta) using a variational distribution q(\mathbf{z}) from a tractable family, by maximizing the evidence lower bound (ELBO): \mathcal{L}(q, \theta) = \mathbb{E}_{q(\mathbf{z})} \left[ \log p(\mathbf{x}, \mathbf{z} \mid \theta) \right] - \mathbb{E}_{q(\mathbf{z})} \left[ \log q(\mathbf{z}) \right]. This objective is optimized alternately with respect to q and \theta, providing a lower bound on the marginal log-likelihood \log p(\mathbf{x} \mid \theta). In EM, the E-step computes expectations under q(\mathbf{z}) = p(\mathbf{z} \mid \mathbf{x}, \theta^t), which aligns with the first term of the ELBO, while the second term (the entropy of q) remains constant during the M-step maximization over \theta.
A primary difference between EM and general variational inference is the choice of q: EM fixes q to the model-specific posterior, which is exact but potentially computationally expensive or intractable for complex models, whereas variational methods allow a flexible, parameterized q (e.g., mean-field factorizations assuming independence among variables) to enable scalable approximations. This flexibility in variational inference often trades off tightness of the bound for efficiency in high-dimensional settings. In 2000, Hagai Attias introduced a variational Bayesian framework for graphical models that explicitly unifies EM with variational Bayes, extending the algorithm to incorporate priors on parameters and approximate full posteriors over both latent variables and parameters via iterative bound maximization. This formulation positions EM as the limiting case of variational Bayes when parameter uncertainty is ignored, highlighting its role as a deterministic, point-estimate approximation to Bayesian posterior inference.

Interpretations and Visualizations

Geometric Viewpoint

The log-likelihood function \log L(\theta) for incomplete-data models is typically a non-concave surface in the parameter space, complicating direct maximization due to multiple local optima and irregular contours. The expectation-maximization (EM) algorithm addresses this by iteratively constructing and optimizing surrogate functions that provide lower bounds to \log L(\theta), specifically through the Q-function defined as Q(\theta, \theta^{(t)}) = \mathbb{E}_{Z \sim p(Z|X, \theta^{(t)})} [\log p(X, Z | \theta)], where Z represents latent variables. By Jensen's inequality applied to the complete-data log-likelihood, Q(\theta, \theta^{(t)}) + H(\theta^{(t)}) forms a lower bound to \log L(\theta), with equality at \theta = \theta^{(t)}, so the bound touches the log-likelihood surface at the current estimate like a tangent minorizing surface. In the E-step, the posterior distribution p(Z|X, \theta^{(t)}) is computed to define Q(\theta, \theta^{(t)}), establishing a surrogate surface that matches \log L(\theta) at \theta^{(t)} and lies entirely below it elsewhere. The M-step then maximizes Q(\theta, \theta^{(t)}) to obtain \theta^{(t+1)}, shifting to a new point at which the actual log-likelihood value has increased, as \log L(\theta^{(t+1)}) \geq Q(\theta^{(t+1)}, \theta^{(t)}) + H(\theta^{(t)}) \geq Q(\theta^{(t)}, \theta^{(t)}) + H(\theta^{(t)}) = \log L(\theta^{(t)}). This process avoids backtracking by design, as each maximization of the tangent-like surrogate guarantees monotonic ascent in the observed-data likelihood. Geometrically, this iterative procedure can be visualized in low dimensions, such as 1D, where \log L(\theta) is a curved line and successive surrogates appear as tangent curves lying below it, or in 2D plots showing the parameter trajectory climbing the likelihood surface via successively re-tightened lower bounds without descending. These visualizations highlight EM's role as a bound-tightening ascent method, distinct from gradient-based optimizers that follow the surface directly. Xu and Jordan's 1996 analysis further elucidates this geometry in the context of Gaussian mixtures, linking the surrogate approximations to convergence behavior under overlapping components.
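The one-dimensional picture described above can be reproduced with a short script. The sketch below plots the observed-data log-likelihood of a two-component mixture with one free mean against the EM lower bound constructed at a particular current estimate; the data, the fixed second component, and the value of mu_t are all illustrative.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.array([-0.2, 0.4, 1.8, 2.6, 3.0])      # illustrative data
pi = np.array([0.5, 0.5])

def densities(x, mu):
    # component 0 has the free mean mu, component 1 is fixed at 0
    return np.column_stack([norm.pdf(x, mu, 1.0), norm.pdf(x, 0.0, 1.0)])

def loglik(mu):
    return np.log(densities(x, mu) @ pi).sum()

def em_bound(mu, mu_t):
    # Q(mu | mu_t) + H(mu_t): touches loglik at mu = mu_t, lies below it elsewhere
    post = pi * densities(x, mu_t)
    post /= post.sum(axis=1, keepdims=True)
    return (post * np.log(pi * densities(x, mu))).sum() - (post * np.log(post)).sum()

grid = np.linspace(-1.0, 5.0, 300)
mu_t = 0.5                                     # current estimate where the bound is tight
plt.plot(grid, [loglik(m) for m in grid], label="observed log-likelihood")
plt.plot(grid, [em_bound(m, mu_t) for m in grid], "--", label="EM lower bound at mu_t")
plt.axvline(mu_t, color="gray", lw=0.5)
plt.xlabel("mu"); plt.ylabel("value"); plt.legend(); plt.show()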

Information-Theoretic Insights

The expectation-maximization (EM) algorithm can be interpreted as iteratively minimizing the Kullback-Leibler (KL) divergence between an approximate posterior distribution over latent variables and the true posterior under the current model parameters. In each E-step, the distribution q^{(t)}(z \mid x) is set to the exact conditional posterior p(z \mid x; \theta^{(t-1)}), achieving zero KL divergence for that iteration and tightening the bound on the log-likelihood. Over successive iterations, this process refines the model's parameters \theta, so that the sequence of exact posteriors tracks an increasingly well-fitting model, thereby improving the overall approximation to the data-generating distribution. A physics-inspired perspective views the EM auxiliary function as related to negative free energy, where the lower bound on the log-likelihood decomposes into the expected log joint probability plus the entropy of the approximate posterior: \mathcal{L}(q, \theta) = \mathbb{E}_{q(z \mid x)}[\log p(x, z \mid \theta)] + H(q(z \mid x)). This expression corresponds to the negative of the variational free energy F(q, \theta) = -\mathcal{L}(q, \theta), familiar from statistical mechanics, with the EM steps performing coordinate minimization of this free energy. Such interpretations highlight EM's role in balancing energy (model fit) and entropy (uncertainty in the latents) in probabilistic modeling. More generally, EM emerges as a form of coordinate ascent on the evidence lower bound (ELBO), \mathcal{L}(q, \theta) = \log p(x \mid \theta) - D_{\text{KL}}(q(z \mid x) \| p(z \mid x; \theta)), where the E-step optimizes over q (fixing \theta) and the M-step over \theta (fixing q). This variational framing unifies EM with broader optimization techniques, ensuring monotonic increase in the ELBO and, consequently, the marginal likelihood, as long as the conditional posteriors are tractable. These information-theoretic lenses extend to broader implications, such as connections to rate-distortion theory in data compression, where EM for mixture models minimizes a Bregman divergence analogous to expected distortion under a rate constraint. In this view, the latent assignments act as a compressed representation of the data, with the KL term quantifying information loss, enabling EM to optimize trade-offs between fidelity and representational efficiency in clustering tasks.

Practical Applications

Mixture Models and Clustering

The expectation-maximization (EM) algorithm plays a central role in parameter estimation for finite mixture models, which represent the data-generating process as a convex combination of underlying component distributions. In a finite mixture model, the probability density function for an observation \mathbf{x} is given by p(\mathbf{x} | \theta) = \sum_{k=1}^K \pi_k f(\mathbf{x} | \mu_k, \Sigma_k), where K is the number of components, \pi_k are the mixing proportions satisfying \sum_{k=1}^K \pi_k = 1 and \pi_k \geq 0, and f(\mathbf{x} | \mu_k, \Sigma_k) denotes the density of the k-th component, often parameterized by a mean vector \mu_k and covariance matrix \Sigma_k for multivariate cases. This formulation allows EM to iteratively maximize the observed-data log-likelihood by treating component assignments as latent variables, making it particularly suited for unsupervised learning scenarios where data arise from heterogeneous subpopulations. Within the EM framework for mixtures, the E-step computes responsibilities z_{ik}, which represent the posterior probability that observation \mathbf{x}_i belongs to component k, effectively enabling soft clustering. These responsibilities are derived as z_{ik} = \frac{\pi_k f(\mathbf{x}_i | \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j f(\mathbf{x}_i | \mu_j, \Sigma_j)}, quantifying the degree of membership in each cluster rather than hard assignments. In the subsequent M-step, parameters \pi_k, \mu_k, and \Sigma_k are updated to maximize the expected complete-data log-likelihood weighted by these responsibilities, leading to closed-form solutions for many common component distributions such as Gaussians. This soft assignment mechanism distinguishes EM-based clustering from hard partitioning methods and facilitates handling of probabilistic overlaps in the data. Compared to k-means clustering, which assumes spherical clusters and assigns points to the nearest centroid based on Euclidean distance, EM for mixture models offers key advantages by accommodating overlapping clusters through probabilistic assignments and explicitly estimating component covariances. This enables the capture of elliptical or oriented cluster shapes, improving performance on datasets with non-uniform variance or correlated features, as demonstrated in applications to image segmentation and bioinformatics where k-means often underperforms due to its restrictive assumptions. For instance, in gene expression analysis, EM's ability to model heteroscedasticity has led to more accurate identification of co-regulated gene clusters. EM-based mixture modeling is widely applied in density estimation, where it approximates complex multimodal distributions by fitting mixtures to empirical data, providing a flexible parametric alternative to kernel density estimation while avoiding issues like bandwidth selection. In anomaly detection, the fitted model scores new observations based on their likelihood under the mixture, flagging outliers as those with low probability mass across all components; this approach has proven effective in fraud detection systems, where mixtures capture normal transaction patterns and isolate deviations. For detailed illustrations, such as Gaussian mixture estimation, refer to the dedicated example section.
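In practice, EM fitting of Gaussian mixtures is available in standard libraries; the sketch below uses scikit-learn's GaussianMixture class, which is fitted by EM, to obtain soft responsibilities for synthetic two-cluster data. All settings and the synthetic data are chosen purely for illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
               rng.normal([4, 4], 1.5, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)                          # runs EM internally

resp = gmm.predict_proba(X)         # E-step responsibilities gamma_ik for the fitted model
print(gmm.weights_)                 # mixing proportions pi_k
print(gmm.means_)                   # component means mu_k
print(gmm.covariances_.shape)       # full covariance matrices, one per component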

Hidden Markov Models

The Expectation-Maximization (EM) algorithm is applied to Hidden Markov Models (HMMs) via the Baum-Welch algorithm, which iteratively estimates the model parameters to maximize the likelihood of observed sequential data. This approach addresses the challenge of inferring hidden state sequences in Markov chains where direct observations are unavailable. An HMM consists of a finite set of hidden states, typically denoted as S = \{1, 2, \dots, N\}, with parameters defined by the state transition probability matrix A = \{a_{ij}\}, where a_{ij} = P(q_{t+1} = j \mid q_t = i) for states i, j \in S and time t; the emission probability distribution B = \{b_j(o)\}, where b_j(o) = P(O_t = o \mid q_t = j) for observation o at time t; and the initial state probability vector \pi = \{\pi_i\}, where \pi_i = P(q_1 = i). These components collectively model the probabilistic generation of an observation sequence O = O_1, O_2, \dots, O_T from underlying hidden state sequences Q = q_1, q_2, \dots, q_T. In the E-step of the Baum-Welch algorithm, the forward-backward algorithm computes the expected hidden state occupancies and transitions given the current parameter estimates \lambda = (A, B, \pi) and observations O. The posterior probability of being in state i at time t, denoted \gamma_t(i) = P(q_t = i \mid O, \lambda), is obtained by combining forward probabilities \alpha_t(i) (likelihoods from the start to time t) and backward probabilities \beta_t(i) (likelihoods from time t to the end), as \gamma_t(i) = \frac{\alpha_t(i) \beta_t(i)}{P(O \mid \lambda)}. Similarly, the joint posterior over consecutive states, \xi_t(i,j) = P(q_t = i, q_{t+1} = j \mid O, \lambda), is computed as \xi_t(i,j) = \frac{\alpha_t(i) a_{ij} b_j(O_{t+1}) \beta_{t+1}(j)}{P(O \mid \lambda)}. The M-step updates the parameters by maximizing the expected complete-data log-likelihood, which reduces to re-estimating the parameters from normalized expected counts produced by the E-step. Specifically, the initial probabilities are revised as \hat{\pi}_i = \gamma_1(i); the transition probabilities as \hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, and for discrete emissions, the emission probabilities as \hat{b}_j(k) = \frac{\sum_{t=1, O_t = v_k}^T \gamma_t(j)}{\sum_{t=1}^T \gamma_t(j)}, where v_k are the possible observation symbols. These updates are guaranteed not to decrease the observed-data likelihood, converging to a local maximum under standard conditions. The Baum-Welch algorithm has been instrumental in speech recognition, enabling the modeling of acoustic signals as sequences of phonetic states to improve transcription accuracy in systems like those developed in the 1980s. In bioinformatics, it supports sequence alignment tasks by estimating parameters for profile HMMs that capture conserved motifs in protein or DNA families, facilitating tasks such as homology detection and multiple sequence alignment.
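A compact sketch of one Baum-Welch iteration for a small discrete-output HMM is given below. It implements the unscaled forward-backward recursions and the re-estimation formulas above; the parameters and observation sequence are arbitrary illustrative values, and numerical scaling (needed for long sequences) is omitted.

import numpy as np

# Illustrative 2-state HMM with 2 observation symbols
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])          # transition probabilities a_ij
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # emission probabilities b_j(k)
pi = np.array([0.6, 0.4])           # initial state probabilities
obs = np.array([0, 1, 1, 0, 1])     # observation sequence (symbol indices)

def baum_welch_step(A, B, pi, obs):
    T, N = len(obs), A.shape[0]
    # E-step: forward and backward passes
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    evidence = alpha[-1].sum()                       # P(O | lambda)

    gamma = alpha * beta / evidence                  # gamma_t(i)
    xi = np.zeros((T - 1, N, N))                     # xi_t(i, j)
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]][None, :]
                 * beta[t + 1][None, :]) / evidence

    # M-step: re-estimate parameters from expected counts
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.vstack([
        gamma[obs == k].sum(axis=0) / gamma.sum(axis=0) for k in range(B.shape[1])
    ]).T
    return A_new, B_new, pi_new

A, B, pi = baum_welch_step(A, B, pi, obs)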

Other Statistical Domains

The expectation-maximization (EM) algorithm has been extended to factor analysis models in the presence of missing data, enabling maximum likelihood estimation by treating unobserved factors and missing observations as latent variables. In this framework, the E-step imputes the missing values and factors based on current parameter estimates, while the M-step updates the factor loadings and variances to maximize the expected complete-data log-likelihood. This approach, building on the foundational EM methodology, allows for robust inference in high-dimensional datasets where incomplete observations are common, such as in psychometrics or econometrics. Similarly, EM facilitates principal component analysis (PCA) with missing data by formulating PCA probabilistically, where principal components serve as latent variables and missing entries are imputed iteratively. The algorithm alternates between estimating the latent components conditional on observed data and updating the component loadings and noise variances, providing a computationally efficient alternative to ad-hoc imputation methods like mean substitution. This EM-based PCA has demonstrated superior recovery of underlying structure in noisy, incomplete datasets compared to traditional SVD approaches that discard missing values. In genetic linkage analysis, the Lander-Green algorithm computes the likelihood of pedigree data under a hidden Markov model framework, where unobserved inheritance patterns are handled via dynamic programming similar to the forward-backward procedure in EM. This computation is often used within EM algorithms for estimating recombination rates and other parameters in multipoint linkage mapping, enabling detection of disease loci in complex pedigrees. This method has been foundational for constructing genetic maps and remains computationally viable for pedigrees up to moderate complexity, with extensions handling larger structures via approximations. In image processing, EM supports segmentation using Markov random fields (MRFs), modeling pixel labels as hidden variables with spatial dependencies to enforce smoothness in region boundaries. The algorithm estimates mixture parameters for pixel intensities in the E-step via posterior probabilities over label configurations, and updates field potentials and class priors in the M-step to maximize the observed likelihood, often approximated with mean-field or iterated conditional modes for tractability. This HMRF-EM approach yields coherent segmentations in textured images, outperforming independent pixel models by incorporating contextual priors, and has been applied to medical imaging for tumor delineation. Post-2010 advancements have incorporated EM into single-cell RNA sequencing for cell type deconvolution, inferring proportions and expression profiles from bulk mixtures using scRNA-seq references as latent signatures. Methods like the Unified RNA-Sequencing Model (URSM) treat cell-type-specific abundances as hidden variables, applying EM to jointly estimate dropout effects, batch corrections, and mixture weights while maximizing the hierarchical likelihood. This enables accurate deconvolution in heterogeneous tissues, such as tumors, revealing immune cell infiltration patterns that inform immunotherapy responses, with validation on datasets showing improved proportion estimation over non-probabilistic baselines. Recent variants, like EMixed, extend this to multi-omics integration for finer resolution in developmental biology studies. 
More recently, as of 2025, EM has been integrated into deep learning frameworks for tasks like handling missing data in transformer models and privacy-preserving parameter estimation in federated learning settings.

Illustrative Examples

Gaussian Mixture Estimation

The Gaussian mixture model assumes that the observed data points \mathbf{x}_i for i = 1, \dots, N are independently drawn from a distribution that is a linear combination of K Gaussian densities, expressed as p(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x} | \mu_k, \Sigma_k), where the mixing coefficients \pi_k > 0 satisfy \sum_{k=1}^K \pi_k = 1, \mu_k is the mean vector of the k-th component, and \Sigma_k is the corresponding covariance matrix. This model is particularly useful for capturing multimodal data distributions in density estimation tasks. The EM algorithm estimates the parameters \Theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K by treating the component labels as latent variables and iteratively maximizing a lower bound on the observed-data log-likelihood. The algorithm begins with an initialization of the parameters, often obtained by running k-means to set initial means \mu_k^{(0)} and responsibilities, with \pi_k^{(0)} = 1/K and \Sigma_k^{(0)} set to the sample covariance or a diagonal matrix to ensure positive definiteness. Iterations then alternate between the E-step and M-step until the relative change in parameters or log-likelihood falls below a threshold \epsilon, such as 10^{-6}. Convergence is monitored by tracking the increase in log-likelihood, which is guaranteed to be non-decreasing under the EM framework. In the E-step at iteration t, the responsibilities \gamma_{ik}^{(t)}, which represent the expected posterior probability that data point \mathbf{x}_i was generated by component k given the current parameters, are computed as \gamma_{ik}^{(t)} = \frac{\pi_k^{(t)} \mathcal{N}(\mathbf{x}_i | \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^K \pi_j^{(t)} \mathcal{N}(\mathbf{x}_i | \mu_j^{(t)}, \Sigma_j^{(t)})}. These soft assignments allow the algorithm to handle overlapping components without hard partitioning the data. In the M-step, the parameters are updated to maximize the expected complete-data log-likelihood using the responsibilities from the E-step as weights: \pi_k^{(t+1)} = \frac{1}{N} \sum_{i=1}^N \gamma_{ik}^{(t)}, \mu_k^{(t+1)} = \frac{\sum_{i=1}^N \gamma_{ik}^{(t)} \mathbf{x}_i}{\sum_{i=1}^N \gamma_{ik}^{(t)}}, \Sigma_k^{(t+1)} = \frac{\sum_{i=1}^N \gamma_{ik}^{(t)} (\mathbf{x}_i - \mu_k^{(t+1)})(\mathbf{x}_i - \mu_k^{(t+1)})^T}{\sum_{i=1}^N \gamma_{ik}^{(t)}}. The update for \pi_k normalizes the total responsibility for each component, the mean update is a responsibility-weighted average of the data points, and the covariance update is a weighted sample covariance centered at the new mean. These closed-form expressions make the M-step computationally efficient, with overall cost O(N K d^2) per iteration for d-dimensional data. To illustrate the process, consider a concrete 1D example with K=2 components and N=5 data points drawn from overlapping Gaussians: x = {1.0, 1.5, 4.0, 4.5, 5.0}. Assume k-means initialization yields \pi_1^{(0)} = \pi_2^{(0)} = 0.5, \mu_1^{(0)} = 1.25, \mu_2^{(0)} = 4.5, but with initial \sigma_1^{(0)} = \sigma_2^{(0)} = 1.0 (variance 1.0) to allow some overlap for demonstration. The Gaussian density is \mathcal{N}(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). In the first E-step (t=0), the responsibilities are calculated for each x_i. For x_1=1.0, the density under component 1 is far higher, yielding \gamma_{11}^{(0)} \approx 1.0 and \gamma_{12}^{(0)} \approx 0. For x_2=1.5: \gamma_{21}^{(0)} \approx 0.99, \gamma_{22}^{(0)} \approx 0.01; for x_3=4.0: \gamma_{31}^{(0)} \approx 0.025, \gamma_{32}^{(0)} \approx 0.975; for x_4=4.5 and x_5=5.0 the responsibilities are similarly close to 0 and 1. The summed responsibilities are N_1 \approx 2.0, N_2 \approx 3.0.
In the first M-step, the updates are \pi_1^{(1)} \approx 0.40, \pi_2^{(1)} \approx 0.60; \mu_1^{(1)} \approx 1.29; \mu_2^{(1)} \approx 4.49; \sigma_1^{2(1)} \approx 0.19; \sigma_2^{2(1)} \approx 0.21. The log-likelihood increases monotonically, and subsequent iterations refine the parameters further. After a few iterations, the estimates stabilize near the empirical statistics of the two clusters (\mu_1 \approx 1.25, \mu_2 \approx 4.5, \pi_1 = 0.4), with the component variances shrinking toward the within-cluster sample variances. This example demonstrates how EM refines initial estimates through weighted averaging, gradually improving the fit to the multimodal data, with soft assignments capturing overlap when components are not well-separated. The approach generalizes beyond Gaussians by replacing the component densities with other parametric or kernel-based densities in place of \mathcal{N}, retaining the EM structure for parameter updates. This extension allows flexible density estimation for arbitrary shapes, though it increases computational cost due to kernel evaluations.
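The following sketch implements this one-dimensional two-component example end to end, so the iteration-by-iteration figures quoted above can be reproduced up to rounding; the starting values mirror those assumed in the text.

import numpy as np
from scipy.stats import norm

x = np.array([1.0, 1.5, 4.0, 4.5, 5.0])
pi = np.array([0.5, 0.5])
mu = np.array([1.25, 4.5])
var = np.array([1.0, 1.0])

for it in range(20):
    # E-step: responsibilities gamma_ik
    dens = norm.pdf(x[:, None], loc=mu[None, :], scale=np.sqrt(var)[None, :])
    gamma = pi * dens
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: weighted updates of mixing weights, means, and variances
    Nk = gamma.sum(axis=0)
    pi = Nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk

    loglik = np.log((pi * norm.pdf(x[:, None], mu, np.sqrt(var))).sum(axis=1)).sum()
    print(f"iter {it + 1}: pi={pi.round(2)}, mu={mu.round(2)}, var={var.round(2)}, ll={loglik:.3f}")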

Censored Data Regression

The Expectation–maximization (EM) algorithm facilitates maximum likelihood estimation in censored data regression by treating the censored observations as latent variables in a missing-data framework. Consider the setup where the latent response follows y_i = \mathbf{x}_i^T \beta + \epsilon_i with \epsilon_i \sim N(0, \sigma^2), but the observed response is censored, such as y_i^* = \max(y_i, c) for left-censoring at c (commonly c = 0). This censoring arises in applications like the Tobit model in econometrics, where negative values are unobserved or recorded as a lower bound, leading to a likelihood that mixes densities for uncensored points and cumulative probabilities for censored ones, which lacks a closed-form maximizer. The EM approach reformulates the problem as an incomplete-data problem, iteratively imputing the latent values to maximize an expected complete-data likelihood under normality. In the E-step, given current parameters \theta^t = (\beta^t, (\sigma^t)^2), the algorithm computes the conditional expectations of the latent y_i for censored observations based on the truncated normal distribution. For a left-censored point where y_i^* = c (implying y_i \leq c), the imputation is the mean of the left-truncated normal: E[y_i \mid y_i \leq c, \mathbf{x}_i, \theta^t] = \mathbf{x}_i^T \beta^t - \sigma^t \cdot \frac{\phi(z_i^t)}{\Phi(z_i^t)}, where z_i^t = (c - \mathbf{x}_i^T \beta^t)/\sigma^t, \phi is the standard normal pdf, and \Phi is the standard normal cdf. For uncensored points, E[y_i \mid \cdot] = y_i^*. The conditional variance \Var(y_i \mid \cdot, \theta^t) for censored points is (\sigma^t)^2 \left[ 1 - z_i^t \frac{\phi(z_i^t)}{\Phi(z_i^t)} - \left( \frac{\phi(z_i^t)}{\Phi(z_i^t)} \right)^2 \right] and zero for uncensored points. These imputations and variances form the sufficient statistics for the complete-data model. The second moment follows as E[y_i^2 \mid \cdot] = [E[y_i \mid \cdot]]^2 + \Var(y_i \mid \cdot, \theta^t). The M-step maximizes the expected complete-data log-likelihood Q(\theta \mid \theta^t) = -\frac{n}{2} \log(2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum_i E[(y_i - \mathbf{x}_i^T \beta)^2 \mid \mathbf{y}^*, \mathbf{X}, \theta^t], yielding closed-form updates. The regression coefficient is updated using ordinary least squares on the imputed data: \beta^{t+1} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \tilde{\mathbf{y}}^t, with \tilde{y}_i^t = E[y_i \mid \cdot]. The variance is then (\sigma^{t+1})^2 = \frac{1}{n} \sum_i \left[ (\tilde{y}_i^t - \mathbf{x}_i^T \beta^{t+1})^2 + \Var(y_i \mid \cdot, \theta^t) \right]. These updates for the Tobit model ensure a monotonic increase in the observed likelihood. This application to the Tobit model offers advantages over ad-hoc methods, such as substituting the censoring threshold for censored values or regressing only on uncensored data, by properly incorporating the truncated likelihood contributions and avoiding bias in estimates and predictions. The iterative procedure converges reliably under mild conditions, providing consistent maximum likelihood estimates even with substantial censoring, as demonstrated in econometric analyses of limited dependent variables.
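A sketch of these E- and M-step formulas for a left-censored (Tobit-type) linear model is given below; the simulated data, censoring point, starting values, and convergence tolerances are illustrative assumptions rather than a recommended configuration.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
beta_true = np.array([0.5, 1.0])
y_latent = X @ beta_true + rng.normal(scale=1.0, size=n)
c = 0.0
y_obs = np.maximum(y_latent, c)                          # left-censoring at c
censored = y_obs == c                                    # censoring indicator

beta = np.zeros(2)
sigma2 = 1.0
for _ in range(200):
    # E-step: impute censored y with the mean of the normal truncated to (-inf, c]
    sigma = np.sqrt(sigma2)
    z = (c - X @ beta) / sigma
    lam = norm.pdf(z) / norm.cdf(z)                      # inverse-Mills-type ratio
    y_tilde = np.where(censored, X @ beta - sigma * lam, y_obs)
    v = np.where(censored, sigma2 * (1 - z * lam - lam ** 2), 0.0)

    # M-step: ordinary least squares on the imputed responses, then variance update
    beta_new = np.linalg.solve(X.T @ X, X.T @ y_tilde)
    resid = y_tilde - X @ beta_new
    sigma2_new = np.mean(resid ** 2 + v)

    converged = np.max(np.abs(beta_new - beta)) < 1e-8 and abs(sigma2_new - sigma2) < 1e-8
    beta, sigma2 = beta_new, sigma2_new
    if converged:
        break

print(beta, sigma2)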

Alternative Approaches

Gradient-Based Optimization

Gradient-based optimization offers an alternative to the EM algorithm for maximizing the observed-data log-likelihood in statistical models, particularly those involving latent variables. In direct gradient ascent, the parameters \theta are updated iteratively by following the gradient of the log-likelihood \ell(\theta) = \log L(\theta): \theta^{(t+1)} = \theta^{(t)} + \eta \nabla_\theta \ell(\theta^{(t)}), where \eta > 0 is the step size (learning rate). This approach computes the gradient directly with respect to the observed-data likelihood, often requiring marginalization over latent variables, which can be approximated via sampling or other methods. A key advantage of gradient-based methods over EM is their generality; they apply to a broad class of models without the need for a structured complete-data likelihood, including those not amenable to EM's E- and M-steps. Stochastic gradient descent (SGD), a variant using mini-batch approximations, enables efficient handling of large-scale datasets, making it suitable for modern applications where full likelihood evaluations are prohibitive. Despite these strengths, gradient methods lack the monotonicity guarantee of EM, where each iteration does not decrease the likelihood; instead, they may oscillate or diverge if the step size is poorly chosen, leading to overshooting local optima. Computing exact gradients can also be demanding, especially in models with intractable integrals, and often necessitates automatic differentiation tools. In contrast to EM's reliance on potentially costly E-steps, gradient approaches may falter in non-convex landscapes without additional safeguards like line search or adaptive step sizes. Gradient-based optimization is preferable in concave likelihood problems, where convergence to the global maximum is assured, or when EM's E-step computations outweigh gradient evaluations, such as in very high-dimensional latent spaces. For instance, in scenarios where the complete-data structure offers no simplification but the observed-data gradient is tractable, direct ascent avoids EM's iterative bounding. EM's properties, including monotonicity, provide reliability in latent variable settings, but gradients offer flexibility for broader optimization challenges.
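For comparison with EM, the sketch below performs direct gradient ascent on the observed-data log-likelihood of a univariate two-component Gaussian mixture with known variances and weights, using the standard identity that the gradient with respect to each mean is a responsibility-weighted sum of residuals; the data, step size, and initialization are illustrative choices.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(4.0, 1.0, 300)])
pi = np.array([0.5, 0.5])
sigma = np.array([1.0, 1.0])       # known and fixed for simplicity
mu = np.array([-1.0, 1.0])         # illustrative starting point
eta = 0.002                        # step size (learning rate)

def loglik(mu):
    dens = norm.pdf(x[:, None], mu[None, :], sigma[None, :])
    return np.log(dens @ pi).sum()

for step in range(300):
    dens = pi * norm.pdf(x[:, None], mu[None, :], sigma[None, :])
    gamma = dens / dens.sum(axis=1, keepdims=True)             # responsibilities appear in the gradient
    grad = (gamma * (x[:, None] - mu[None, :]) / sigma[None, :] ** 2).sum(axis=0)
    mu = mu + eta * grad                                        # gradient ascent update

print(mu, loglik(mu))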

Bayesian Sampling Methods

Bayesian sampling methods, particularly Markov chain Monte Carlo (MCMC) techniques such as Gibbs sampling, provide an alternative framework to the expectation-maximization (EM) algorithm for inference in latent variable models by enabling full posterior estimation rather than point maximum likelihood estimates. Unlike EM, which iteratively maximizes the observed data likelihood through expectation and maximization steps, Bayesian sampling incorporates prior distributions on parameters and latent variables to sample from the joint posterior distribution p(\theta, z | y) \propto p(y | \theta, z) p(z | \theta) p(\theta), where \theta denotes model parameters, z the latent variables, and y the observed data. This approach addresses limitations of EM, such as sensitivity to initialization and convergence to local optima, by exploring the entire parameter space and quantifying uncertainty through posterior samples. A foundational MCMC method for these models is Gibbs sampling, which iteratively draws from conditional posteriors of parameters and latents given the current state of the others, using data augmentation to make otherwise intractable integrals tractable. In finite mixture models, for instance, Diebolt and Robert (1994) demonstrated how Gibbs sampling approximates the posterior for component parameters and allocations, treating mixture weights and means as random variables drawn from Dirichlet and normal priors, respectively; this yields Bayes estimates that quantify clustering uncertainty, although the label-switching behavior of the symmetric mixture posterior must be handled, for example by identifiability constraints or relabeling. Similarly, Escobar and West (1995) extended this to Dirichlet process mixtures for nonparametric density estimation, using Gibbs sampling to infer an unknown number of components by integrating over a prior on the concentration parameter, offering greater flexibility than fixed-component EM fits. For hidden Markov models (HMMs), MCMC alternatives to EM enable Bayesian parameter estimation and model selection by sampling transition probabilities, emission distributions, and state sequences from their joint posterior. Rydén (2008) compared EM and MCMC implementations, finding that while EM is faster for simple parameter estimation, MCMC excels in scenarios with model uncertainty, such as determining the number of states, by providing marginal posteriors that penalize overparameterization through priors on the transition matrices; computational costs for MCMC were higher but manageable with efficient samplers, achieving comparable accuracy with better uncertainty assessment. These methods also handle extensions like time-varying parameters or hierarchical structures more naturally than standard EM. Key advantages of Bayesian sampling over EM include the incorporation of expert knowledge through priors to regularize estimates in low-data regimes and the provision of credible intervals for parameters, which EM lacks as it yields only point estimates. However, MCMC requires careful diagnostics for convergence, such as trace plots and Gelman-Rubin statistics, and can be computationally demanding for high-dimensional models, though accelerations such as blocked Gibbs sampling mitigate this. Overall, these sampling approaches have become widely adopted in statistical software such as Stan and JAGS for robust Bayesian inference in domains where EM's frequentist assumptions fall short.
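As a concrete contrast with EM's point estimates, the sketch below runs a simplified Gibbs sampler for a two-component univariate Gaussian mixture with known, equal variances and fixed equal weights, alternately sampling component allocations and component means under conjugate normal priors; all data, priors, and chain settings are illustrative assumptions.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(4.0, 1.0, 150)])
n, K = len(x), 2
sigma2 = 1.0                      # known component variance
pi = np.array([0.5, 0.5])         # fixed equal mixing weights
m0, tau2 = 0.0, 100.0             # vague conjugate prior: mu_k ~ N(m0, tau2)

mu = np.array([-1.0, 1.0])        # arbitrary starting values
samples = []
for it in range(2000):
    # 1) sample allocations z_i | mu, x  (categorical with posterior probabilities)
    probs = pi * norm.pdf(x[:, None], mu[None, :], np.sqrt(sigma2))
    probs /= probs.sum(axis=1, keepdims=True)
    z = (rng.random(n)[:, None] > np.cumsum(probs, axis=1)).sum(axis=1)

    # 2) sample means mu_k | z, x  (conjugate normal posterior)
    for k in range(K):
        xk = x[z == k]
        prec = len(xk) / sigma2 + 1.0 / tau2
        mean = (xk.sum() / sigma2 + m0 / tau2) / prec
        mu[k] = rng.normal(mean, np.sqrt(1.0 / prec))

    if it >= 500:                  # discard burn-in
        samples.append(mu.copy())

samples = np.array(samples)
print(samples.mean(axis=0), samples.std(axis=0))   # posterior means and spread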

References

  1. [1]
    Maximum Likelihood from Incomplete Data Via the EM Algorithm
A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality.
  2. [2]
    [PDF] The EM Algorithm
    The EM algorithm has many applications throughout statistics. It is often used for example, in machine learning and data mining applications, and in Bayesian.
  3. [3]
    [PDF] A Gentle Tutorial of the EM Algorithm and its Application to ...
Abstract. We describe the maximum-likelihood parameter estimation problem and how the Expectation-Maximization (EM) algorithm can be used for its solution ...
  4. [4]
    [PDF] Maximum Likelihood from Incomplete Data via the EM Algorithm
Apr 6, 2007 · By using a representation similar to that used by Dempster, Laird and Rubin in the genetics example of Section 1, Haberman showed how ...
  5. [5]
    (PDF) The Expectation Maximization Algorithm - ResearchGate
This note represents my attempt at explaining the EM algorithm (Hartley, 1958; Dempster et al., 1977; McLachlan and Krishnan, 1997).
  6. [6]
    Missing Values in Multivariate Analysis - jstor
    One natural approach to this problem is to omit all incomplete observations. This is unsatisfactory if many variables are known for an incomplete ...
  7. [7]
    Maximum-Likelihood Estimation for the Mixed Analysis of Variance ...
Maximum-likelihood estimation for the mixed analysis of variance model. By H. O. Hartley and J. N. K. Rao, Texas A and M University.
  8. [8]
    Maximum Likelihood from Incomplete Data via the EM Algorithm - jstor
    A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality.
  9. [9]
    Mixture Densities, Maximum Likelihood and the EM Algorithm
    We discuss the formulation and theoretical and practical properties of the EM algorithm for mixture densities, focussing in particular on mixtures of densities ...
  10. [10]
    [PDF] Mixture Models and EM - Columbia CS
We first of all use the Gaussian mixture distribution to motivate the EM algorithm in a fairly informal way, and then we give a more careful treatment based on ...
  11. [11]
    [PDF] The EM Algorithm for Mixtures of Factor Analyzers - Computer Science
    The EM Algorithm for Mixtures of Factor Analyzers. Zoubin Ghahramani. Geoffrey E. Hinton. Department of Computer Science. University of Toronto. 6 King's ...
  12. [12]
    Maximum Likelihood from Incomplete Data Via the EM Algorithm
Maximum Likelihood from Incomplete Data Via the EM Algorithm. A. P. Dempster, ...
  13. [13]
    On the Convergence Properties of the EM Algorithm - Project Euclid
    Two convergence aspects of the EM algorithm are studied: (i) does the EM algorithm find a local maximum or a stationary value of the (incomplete-data) ...
  14. [14]
    Statistical convergence of the EM algorithm on Gaussian mixture ...
We study the convergence behavior of the Expectation Maximization (EM) algorithm ... Further, we show that the convergence rate is linear and characterize the ...
  15. [15]
    [PDF] Statistical guarantees for the EM algorithm: From population to ...
The EM algorithm is a widely used tool in maximum-likelihood estimation in incomplete data problems. Existing theoretical work has focused on ...
  16. [16]
    [PDF] a view of the em algorithm that justifies incremental, sparse, and ...
Abstract. The EM algorithm performs maximum likelihood estimation for data in which some variables are unobserved. We present a function that ...
  17. [17]
    A View of the Em Algorithm that Justifies Incremental, Sparse, and ...
The EM algorithm performs maximum likelihood estimation for data in which some variables are unobserved. We present a function that resembles negative free ...
  18. [18]
    On-line EM Algorithm for the Normalized Gaussian Network
Feb 1, 2000 · We show that the on-line EM algorithm is equivalent to the batch EM algorithm if a specific scheduling of the discount factor is employed. In ...
  19. [19]
    [PDF] Online EM Algorithm for Latent Data Models - arXiv
Mar 1, 2017 · A specific instance of the proposed online EM algorithm has been derived by Sato and Ishii (2000) for maximum likelihood estimation in the ...
  20. [20]
  21. [21]
    [PDF] A Variational Bayesian Framework for Graphical Models - NIPS papers
    Abstract. This paper presents a novel practical framework for Bayesian model averaging and model selection in probabilistic graphical models.
  22. [22]
    [PDF] Pattern Recognition and Machine Learning - Microsoft
A companion volume (Bishop and Nabney, 2008) will deal with practical aspects of pattern recognition and machine learning, and will be accompanied by Matlab ...
  23. [23]
    [PDF] The Expectation-Maximization and Alternating Minimization Algorithms
    Sep 11, 2002 · The Expectation-Maximization (EM) algorithm is a hill-climbing approach to finding a local maximum of a likelihood function [7, 8].
  24. [24]
    [PDF] An Information Theoretic Analysis of Maximum Likelihood Mixture ...
In rate distortion theory, the loss in "information" is quantified by the expected Bregman distortion between Z and Ẑ. Intuitively, if the expected ...
  25. [25]
    [2201.02447] Bregman divergence based em algorithm and its ...
    Jan 7, 2022 · However, in the rate distortion theory, it is needed to minimize the mutual information under the constant constraint for the cost function. Our ...
  26. [26]
    [PDF] A tutorial on hidden Markov models and selected applications in ...
In this paper we attempt to carefully and methodically review the theoretical aspects of this type of statistical modeling and show how they have been applied ...
  27. [27]
    [PDF] Hidden Markov Models in Computational Biology
Hidden Markov Models in Computational Biology. Applications to Protein Modeling. Anders Krogh, Michael Brown, I. Saira Mian, Kimmen Sjölander and David ...
  28. [28]
    [PDF] Maximum likelihood estimation of factor models on data sets with ...
    In this paper we propose a methodology to estimate a dynamic factor model on data sets with an arbitrary pattern of missing data. We modify the Expectation ...
  29. [29]
    [PDF] EM Algorithms for PCA and SPCA
    I present an expectation-maximization (EM) algorithm for principal component ... [3] Zoubin Ghahramani and Geoffrey Hinton. The EM algorithm for ...
  30. [30]
    Principal Component Analysis with Noisy and/or Missing Data - arXiv
    Aug 20, 2012 · This paper presents a method for PCA on noisy datasets with missing values, using noise-weighted Expectation Maximization (EM) PCA. Missing ...
  31. [31]
    Counting algorithms for linkage: correction to Morton and Collins
Here, we show that these assertions are both incorrect: the Lander-Green algorithm is an EM algorithm, while the Morton-Collins algorithm is not. We note ...
  32. [32]
    Using an instrumental variable to test for unmeasured confounding
    We develop a test of whether there is unmeasured confounding when an instrumental variable (IV) is available.
  33. [33]
    HMRF-EM-image: Implementation of the Hidden Markov Random ...
Jul 15, 2012 · Abstract: In this project, we study the hidden Markov random field (HMRF) model and its expectation-maximization (EM) algorithm.
  34. [34]
    On the convergence of EM-like algorithms for image segmentation ...
    In this paper, we review three EM-like variants for Markov random field segmentation and compare their convergence properties both at the theoretical and ...
  35. [35]
    A UNIFIED STATISTICAL FRAMEWORK FOR SINGLE CELL AND ...
    In this paper, we propose to jointly analyze single cell and bulk RNA-seq data using the Unified RNA-Sequencing Model (URSM), which simultaneously corrects for ...
  36. [36]
    EMixed: Probabilistic Multi-Omics Cellular Deconvolution of Bulk ...
    Oct 17, 2025 · Cellular deconvolution is a key approach to deciphering the complex cellular makeup of tissues by inferring the composition of cell types ...
  37. [37]
    Estimation of Relationships for Limited Dependent Variables - jstor
Estimation of relationships for limited dependent variables. By James Tobin.
  38. [38]
    Tobit models: A survey - ScienceDirect.com
Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B 39 (1977), pp. 1-38 (with discussion).
  39. [39]
    [PDF] Optimization with EM and Expectation-Conjugate-Gradient
    Our goal in this paper is to contrast the EM algorithm with a direct gradient-based optimization approach. As a concrete alternative, we present an Expectation- ...
  40. [40]
  41. [41]
    [PDF] Comparison of the EM Algorithm and Alternatives - Uni Ulm
    Abstract. The Expectation-Maximization (EM) algorithm is widely used also in industry for parameter estimation within a Maximum Likelihood (ML).
  42. [42]
    [PDF] Statistical guarantees for the EM algorithm - arXiv
    Aug 9, 2014 · In such an algorithm, instead of performing an exact maximization, we simply choose a parameter value that does not decrease the likelihood.
  43. [43]
    EM versus Markov chain Monte Carlo for Estimation of Hidden ...
Inference in HMMs is traditionally often carried out using the EM algorithm, but examples of Bayesian estimation, in general implemented through Markov chain ...
  44. [44]
    Estimation of Finite Mixture Distributions Through Bayesian Sampling
    We present approximation methods which evaluate the posterior distribution and Bayes estimators by Gibbs sampling, relying on the missing data structure of the ...
  45. [45]
    Bayesian Density Estimation and Inference Using Mixtures
    This paper describes Bayesian inference for density estimation using mixtures of Dirichlet processes, including normal mixtures, and addresses local vs global ...