A hidden Markov model (HMM) is a statistical model that represents systems evolving over time as a Markov process with unobservable (hidden) states, where the only available data consist of sequences of observable outputs probabilistically dependent on those states.[1] These models capture the joint probability distribution over sequences of hidden states and observations through three core components: a finite set of hidden states with transition probabilities forming a Markov chain, emission probabilities defining the likelihood of each observation given a state, and an initial probability distribution over starting states.[2] Formally, an HMM is denoted by the tuple \lambda = (A, B, \pi), where A is the state transition matrix, B is the observation emission matrix, and \pi is the initial state vector, enabling compact representation of complex sequential dependencies.[3]

The foundations of HMMs were established in the late 1960s by Leonard E. Baum and colleagues at the Institute for Defense Analyses in Princeton, New Jersey, with key early works including Baum and Petrie's 1966 paper on statistical inference for probabilistic functions of finite-state Markov chains and Baum and Eagon's 1967 contribution on maximum likelihood estimation for Markov chains with hidden states.[1] Central algorithms for HMMs include the Viterbi algorithm for finding the most likely sequence of hidden states (a dynamic-programming method for maximum a posteriori decoding, introduced by Andrew Viterbi in 1967 and popularized by Forney in 1973), the forward-backward algorithm for computing posterior probabilities of states, and the Baum-Welch algorithm for parameter estimation via expectation-maximization (developed by Baum et al. in the 1970s).[4] These methods address the three core problems of HMMs: evaluating observation likelihoods, decoding hidden state sequences, and learning model parameters from data.[5]

HMMs gained prominence in the 1980s and 1990s through applications in speech recognition, where Lawrence Rabiner's 1989 tutorial highlighted their use in modeling acoustic signals as emissions from phonetic state sequences, enabling systems like those at Bell Labs to achieve state-of-the-art performance.[4] In bioinformatics, HMMs model biological sequences such as DNA or proteins, with Sean Eddy's 1998 profile HMM framework facilitating gene prediction and multiple sequence alignment in tools like HMMER.[6] Additional domains include natural language processing for part-of-speech tagging, finance for modeling regime-switching in time series, and computer vision for gesture recognition, underscoring their versatility in handling noisy, sequential data with latent structure.[1]
Introduction
Definition
A hidden Markov model (HMM) is a statistical model that represents a Markov chain whose states are unobserved (hidden), while the observable outputs, known as emissions, are generated conditionally from the current hidden state according to an emission distribution. This framework extends the standard Markov chain by incorporating a layer of indirection through hidden states, allowing the model to capture sequential data where direct state observations are unavailable but indirect evidence is provided via emissions.

The model relies on two key assumptions: the first-order Markov property, which states that the probability of transitioning to the next state depends solely on the current state and not on prior states; and the conditional independence of observations, meaning each observation depends only on the current hidden state and is independent of previous observations and states given that state. Formally, an HMM is defined by a finite set of hidden states S = \{1, 2, \dots, N\}, an observation sequence O = (o_1, o_2, \dots, o_T) where each o_t is drawn from a discrete alphabet of M symbols, a transition probability matrix A = (a_{ij})_{i,j=1}^N with a_{ij} = P(s_t = j \mid s_{t-1} = i), an emission probability distribution B = (b_j(k))_{j=1,\,k=1}^{N,\,M} with b_j(k) = P(o_t = v_k \mid s_t = j) for observation symbols v_k, and an initial state probability distribution \pi = (\pi_i)_{i=1}^N with \pi_i = P(s_1 = i). The complete model is denoted by the parameter set \lambda = (A, B, \pi).

Given a state sequence S = (s_1, s_2, \dots, s_T), the probability of the observation sequence under the model is

P(O \mid S, \lambda) = \prod_{t=1}^T b_{s_t}(o_t).

This emission probability reflects the conditional independence assumption, as each term depends only on the observation at time t and the state s_t.
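To make the notation concrete, the following minimal Python sketch (assuming NumPy; the two-state parameter values are purely illustrative and not taken from any source) stores \lambda = (A, B, \pi) as arrays and evaluates P(O \mid S, \lambda) for one fixed state path:

```python
import numpy as np

# Illustrative two-state model lambda = (A, B, pi); all values are made up.
A = np.array([[0.7, 0.3],   # a_ij = P(s_t = j | s_{t-1} = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],   # b_j(k) = P(o_t = v_k | s_t = j)
              [0.2, 0.8]])
pi = np.array([0.6, 0.4])   # pi_i = P(s_1 = i)

def emission_prob(B, states, obs):
    """P(O | S, lambda) = prod_t b_{s_t}(o_t), per the formula above."""
    return np.prod([B[s, o] for s, o in zip(states, obs)])

# Probability of emitting symbols (v_1, v_2, v_1) along the state path (1, 2, 1),
# written here with 0-based indices.
print(emission_prob(B, states=[0, 1, 0], obs=[0, 1, 0]))  # 0.9 * 0.8 * 0.9 = 0.648
```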
Terminology
In the context of hidden Markov models (HMMs), the hidden states refer to the underlying, unobserved sequence of states that evolve according to a Markov process, while the observations are the visible outputs generated from these states, which are directly measurable but do not reveal the state directly.[7] The transition probabilities, denoted as a_{ij} = P(s_{t+1} = j \mid s_t = i), represent the probability of moving from state i at time t to state j at time t+1, forming the Markov chain structure among the hidden states.[7] Similarly, the emission probabilities, given by b_i(o_t) = P(o_t \mid s_t = i), describe the probability of observing symbol o_t when the system is in state i at time t.[7] The initial state probabilities, expressed as \pi_i = P(s_1 = i), specify the probability distribution over the possible starting states at time t=1.[7]

HMMs are commonly formulated for discrete observation spaces, where emissions are finite symbols drawn from a discrete alphabet, but extensions to continuous observations use probability density functions, often modeled as mixtures of Gaussians for each state to capture the distribution of continuous outputs.[7] The complete set of model parameters is typically denoted by \lambda = (A, B, \pi), where A is the transition probability matrix, B the emission probability matrix (or densities for continuous cases), and \pi the initial distribution.[7] In algorithmic contexts, the forward variables \alpha_t(i) quantify the joint probability of the partial observation sequence up to time t and being in state i at time t, while the Viterbi variables \delta_t(i) capture the maximum probability of the partial observation sequence and state path ending in state i at time t.[7]

A common point of confusion arises in distinguishing HMMs from fully visible Markov models, where the state sequence is directly observed alongside the outputs, allowing straightforward maximum likelihood estimation without inference over hidden variables; in contrast, HMMs require specialized algorithms to infer the latent states from observations alone.[7]
Examples
Urn and ball drawing
The urn and ball drawing example provides an intuitive illustration of a hidden Markov model (HMM), where the hidden states are represented by distinct urns, and the observations correspond to the colors of balls drawn from those urns. In this setup, there are multiple urns, each containing a fixed proportion of differently colored balls, such as red and white. The underlying process begins with the selection of an initial urn according to an initial state probability distribution. At each discrete time step, a ball is drawn randomly from the current urn (revealing its color as the observation), replaced, and then the next urn is chosen based on transition probabilities that depend only on the current urn. The sequence of urn choices forms a Markov chain, but since the urns themselves remain hidden, only the sequence of ball colors is directly observed, exemplifying the "hidden" aspect of the model.[7]

A representative instance of this model employs two urns to demonstrate the key components. Urn A (state 1) is filled with 90% red balls and 10% white balls, while Urn B (state 2) contains 30% red balls and 70% white balls; these proportions define the emission probabilities, where the probability of drawing a red ball from Urn A is b_A(\text{red}) = 0.9 and b_A(\text{white}) = 0.1, and similarly b_B(\text{red}) = 0.3, b_B(\text{white}) = 0.7. The transition probabilities specify the likelihood of moving between urns, such as an 80% chance of remaining in the current urn and a 20% chance of switching to the other (i.e., a_{AA} = 0.8, a_{AB} = 0.2, a_{BB} = 0.8, a_{BA} = 0.2). The initial state probabilities are often set equally, with \pi_A = 0.5 and \pi_B = 0.5. These parameters capture the stochastic nature of both state transitions and observation emissions in an HMM.[7]

Given an observed sequence of ball colors, such as red-white-red, one can consider possible underlying sequences of urn selections that could have generated it, including paths like A-B-A or B-A-B. For the path A-B-A, the joint probability of this hidden state sequence and the observations is calculated as the product of the initial probability, emission probabilities, and transition probabilities in generative order:

P(\text{A-B-A, red-white-red}) = \pi_A \, b_A(\text{red}) \, a_{AB} \, b_B(\text{white}) \, a_{BA} \, b_A(\text{red}).

Substituting the parameters yields 0.5 \cdot 0.9 \cdot 0.2 \cdot 0.7 \cdot 0.2 \cdot 0.9 = 0.01134. This factorization highlights the Markov property in the states and the conditional independence of observations given the states: each ball color depends only on the urn from which it is drawn, independent of prior colors or urns.[7]
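The arithmetic above can be verified mechanically. A small Python sketch (assuming NumPy; state indices 0 and 1 stand for Urns A and B, symbol indices 0 and 1 for red and white):

```python
import numpy as np

# Urn example parameters: states 0 = Urn A, 1 = Urn B; symbols 0 = red, 1 = white.
A = np.array([[0.8, 0.2],
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],
              [0.3, 0.7]])
pi = np.array([0.5, 0.5])

def joint_prob(path, obs):
    """pi_{s_1} b_{s_1}(o_1) * prod_{t>1} a_{s_{t-1} s_t} b_{s_t}(o_t)."""
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    return p

# P(A-B-A, red-white-red) = 0.5 * 0.9 * 0.2 * 0.7 * 0.2 * 0.9
print(joint_prob(path=[0, 1, 0], obs=[0, 1, 0]))  # 0.01134
```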
Weather and activity prediction
A classic pedagogical example of a hidden Markov model involves modeling daily weather as hidden states that influence observable human activities, illustrating the model's ability to capture temporal dependencies in sequential data. The hidden states represent weather conditions: "sunny," "cloudy," or "rainy," which evolve over days according to a Markov process. The observations are the activities performed by an individual, such as walking, shopping, or driving, which depend probabilistically on the current weather but are not direct measurements of it. This setup demonstrates how unobserved factors (weather) can explain patterns in visible behaviors (activities) across time.[8]

The transition probabilities model the persistence or change in weather, reflecting realistic meteorological patterns where conditions tend to continue but can shift. From sunny, the probabilities are 0.5 to stay sunny, 0.3 to cloudy, and 0.2 to rainy; from cloudy, 0.4 to sunny, 0.3 to stay cloudy, and 0.3 to rainy; from rainy, 0.1 to sunny, 0.5 to cloudy, and 0.4 to stay rainy. These values capture the temporal dynamics in this model. The emission probabilities link states to observations, showing how activities are more likely under certain weather: walking is highly probable on sunny days (0.88) but less likely on cloudy or rainy days (0.10 each), while shopping (0.10 sunny, 0.60 cloudy/rainy) and driving (0.02 sunny, 0.30 cloudy/rainy) are more common on non-sunny days, as people seek indoor alternatives or use vehicles to avoid discomfort. Initial state probabilities are set as 0.7 for sunny, 0.2 for cloudy, and 0.1 for rainy, providing a starting distribution for the sequence.[8]

Consider a three-day observation sequence: walk, shop, drive. This sequence suggests underlying weather changes, as walking aligns with sunny conditions, while shopping and driving indicate cloudy or rainy. To compute the joint probability of observations and a specific hidden state chain, say sunny on day 1, cloudy on day 2, rainy on day 3, the model factors it as the product of the initial probability, emissions, and transitions:

\pi_{\text{sunny}} \cdot b_{\text{sunny}}(\text{walk}) \cdot a_{\text{sunny},\text{cloudy}} \cdot b_{\text{cloudy}}(\text{shop}) \cdot a_{\text{cloudy},\text{rainy}} \cdot b_{\text{rainy}}(\text{drive}).

Plugging in values yields 0.7 \times 0.88 \times 0.3 \times 0.6 \times 0.3 \times 0.3 \approx 0.00998, representing one possible explanation for the observed activities. Summing over all possible state chains gives the total probability of the observation sequence, highlighting multiple hidden paths that could generate it.[8]

This example reveals how a hidden Markov model infers latent temporal structures: the correlation between activities, such as a walk followed by shopping and driving, is explained by the unobserved weather transitions, without needing direct weather data. By modeling emissions (activity given weather) and transitions (weather evolution), the HMM captures dependencies that would otherwise appear unexplained in the observation sequence alone.
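Because the sequence is only three days long, the total observation probability can be computed by brute-force enumeration over all 3^3 = 27 hidden paths. The Python sketch below simply transcribes the parameters given above; the forward algorithm described later performs the same marginalization efficiently:

```python
from itertools import product

states = ["sunny", "cloudy", "rainy"]
obs_seq = ["walk", "shop", "drive"]

pi = {"sunny": 0.7, "cloudy": 0.2, "rainy": 0.1}
A = {"sunny":  {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2},
     "cloudy": {"sunny": 0.4, "cloudy": 0.3, "rainy": 0.3},
     "rainy":  {"sunny": 0.1, "cloudy": 0.5, "rainy": 0.4}}
B = {"sunny":  {"walk": 0.88, "shop": 0.10, "drive": 0.02},
     "cloudy": {"walk": 0.10, "shop": 0.60, "drive": 0.30},
     "rainy":  {"walk": 0.10, "shop": 0.60, "drive": 0.30}}

def joint(path, obs):
    """Joint probability of one hidden path and the observations."""
    p = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    return p

# One explanation: sunny -> cloudy -> rainy (~0.00998, as computed above).
print(joint(("sunny", "cloudy", "rainy"), obs_seq))

# Total P(O | lambda): sum over all 27 hidden paths.
print(sum(joint(p, obs_seq) for p in product(states, repeat=3)))
```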
Model components
States and observations
In a hidden Markov model (HMM), the hidden states form a finite, unobservable set of discrete categories or conditions that represent the underlying system dynamics at each time step. These states evolve sequentially according to the Markov property, where the state at any given time depends exclusively on the immediately preceding state, thereby constituting an underlying Markov chain that governs the generation of observable data without being directly measurable.

The observations comprise the directly measurable sequence of data points produced by the hidden process, where each observation arises stochastically from the corresponding hidden state at that time. Observations may be discrete, drawn from a finite alphabet of symbols, or continuous, consisting of real-valued measurements typically modeled with probability density functions such as multivariate Gaussians to accommodate variability within each state.

Central to the HMM framework is the conditional independence of observations given the hidden states: each observation depends solely on the current state and is independent of all other states and observations in the sequence, establishing no direct linkage between states and observations beyond this generative mechanism. This structure enables the model to infer latent processes from surface-level data patterns. For instance, in a weather model, hidden states might denote unobservable atmospheric conditions, while observations capture visible behaviors like outdoor activities.

HMMs incorporate the stationarity assumption for the underlying Markov chain, meaning the rules governing state evolution remain constant over time, which supports modeling of extended sequences without temporal drift. Ergodicity is also typically presumed, requiring the state space to be fully connected and aperiodic, ensuring that sufficiently long observation sequences will sample all possible states in proportion to their equilibrium frequencies.
Transition and emission probabilities
The transition probabilities in a hidden Markov model (HMM) capture the dynamics of the underlying Markov chain, specifying the probability of moving from one hidden state to another while enforcing the Markov property that the future state depends only on the current state. These probabilities are organized into a transition matrix A = [a_{ij}]_{i,j=1}^N, where N is the number of hidden states and a_{ij} = P(S_{t+1} = j \mid S_t = i).[9] The matrix A is row-stochastic, meaning that for each row i, the entries sum to unity: \sum_{j=1}^N a_{ij} = 1.[9] This structure ensures that the transitions form a valid probability distribution over the next states, with all a_{ij} \geq 0.[10]

Emission probabilities link the hidden states to the observable sequence, defining the likelihood of generating a particular observation from a given state. For discrete observation alphabets with M symbols, these are represented by the emission matrix B = [b_j(k)]_{j=1,\,k=1}^{N,\,M}, where b_j(k) = P(O_t = v_k \mid S_t = j) and v_k denotes the k-th observation symbol.[9] Like the transition matrix, B is row-stochastic: \sum_{k=1}^M b_j(k) = 1 for each state j, with all entries non-negative.[9] For continuous observations, emission probabilities are instead modeled using probability density functions f_j(o_t), commonly multivariate Gaussian distributions parameterized by state-specific means and covariance matrices, ensuring \int f_j(o_t) \, do_t = 1.[1]

The initial state probabilities provide the starting distribution over hidden states, given by the vector \pi = [\pi_i]_{i=1}^N, where \pi_i = P(S_1 = i) and \sum_{i=1}^N \pi_i = 1, with all \pi_i \geq 0.[9] Together with A and B, these form the model parameters \lambda = (A, B, \pi). The joint probability of an observation sequence O = o_1, \dots, o_T and a corresponding state sequence S = s_1, \dots, s_T under the model is then

P(O, S \mid \lambda) = \pi_{s_1} \, b_{s_1}(o_1) \prod_{t=2}^T a_{s_{t-1} s_t} \, b_{s_t}(o_t),

which factorizes according to the HMM's generative process.[9] All components, that is, the transition, emission, and initial probabilities, must satisfy non-negativity (\geq 0) and normalization constraints to qualify as proper probability distributions.[10]
Inference
Observed sequence likelihood
In hidden Markov models (HMMs), the observed sequence likelihood, denoted P(O \mid \lambda), represents the probability of a given sequence of observations O = o_1, o_2, \dots, o_T under the model parameters \lambda, which include the initial state probabilities \pi, transition probabilities A, and emission probabilities B. This likelihood is essential for model evaluation and inference tasks, as it quantifies how well the model explains the observed data. Formally, it is obtained by marginalizing over all possible hidden state sequences S = s_1, s_2, \dots, s_T:

P(O \mid \lambda) = \sum_{S} P(O, S \mid \lambda),

where the joint probability P(O, S \mid \lambda) = \pi_{s_1} b_{s_1}(o_1) \prod_{t=2}^T a_{s_{t-1} s_t} b_{s_t}(o_t). This summation accounts for the exponential number of possible state paths, rendering direct computation intractable for long sequences, with a brute-force complexity of O(N^T), where N is the number of states and T is the observation length.[9]

To address this, the forward algorithm employs dynamic programming to compute the likelihood efficiently in O(N^2 T) time. Introduced as part of the foundational inference framework for HMMs, it defines the forward variables \alpha_t(i) as the probability of the partial observation sequence up to time t and the model being in state i at time t:

\alpha_t(i) = P(o_1, o_2, \dots, o_t, s_t = i \mid \lambda).

The algorithm proceeds in three steps. First, initialization sets the values at t=1:

\alpha_1(i) = \pi_i b_i(o_1), \quad 1 \leq i \leq N,

reflecting the joint probability of starting in state i and emitting the first observation. Second, the recursion propagates these values forward:

\alpha_{t+1}(j) = \left[ \sum_{i=1}^N \alpha_t(i) a_{ij} \right] b_j(o_{t+1}), \quad 1 \leq t < T, \quad 1 \leq j \leq N,

summing the probabilities of reaching state j at time t+1 from all previous states i, weighted by the transition and emission probabilities. Finally, termination yields the total likelihood by summing over the final states:

P(O \mid \lambda) = \sum_{i=1}^N \alpha_T(i).

This approach avoids enumerating all paths by incrementally building the probabilities, making HMM inference scalable for applications like speech recognition and sequence modeling. The forward algorithm's efficiency and accuracy have made it a cornerstone of HMM theory since its development.[9]
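A compact NumPy version of the three steps might look as follows; the two-state parameters are illustrative, and a production implementation would scale the \alpha values or work in log space to prevent underflow on long sequences:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: returns the alpha table (T x N) and P(O | lambda)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # initialization
    for t in range(1, T):                         # recursion
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                 # termination

# Illustrative two-state model.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
alpha, likelihood = forward(A, B, pi, obs=[0, 1, 0])
print(likelihood)  # P(O | lambda) in O(N^2 T) rather than O(N^T) operations
```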
Latent state probabilities
In hidden Markov models, the latent state probabilities refer to the posterior distribution over the hidden states given an observed sequence O = o_1, o_2, \dots, o_T, denoted P(S \mid O, \lambda), where S = s_1, s_2, \dots, s_T is the state sequence and \lambda represents the model parameters including transition probabilities A = \{a_{ij}\} and emission probabilities B = \{b_j(o)\}. By Bayes' theorem, this joint posterior is P(S \mid O, \lambda) = \frac{P(O \mid S, \lambda) P(S \mid \lambda)}{P(O \mid \lambda)}, with P(O \mid S, \lambda) factoring as the product of emission probabilities and P(S \mid \lambda) as the product of initial and transition probabilities; however, direct computation is intractable due to the exponential number of possible state sequences, so marginal posteriors P(s_t \mid O, \lambda) are computed efficiently via dynamic programming decompositions.

Filtering provides the posterior probability of the current state given observations up to the present time, P(s_t \mid o_1, \dots, o_t, \lambda), which is useful for online prediction or real-time inference. It is obtained by normalizing the forward variables: P(s_t = i \mid o_1, \dots, o_t, \lambda) = \frac{\alpha_t(i)}{\sum_{j=1}^N \alpha_t(j)}, where \alpha_t(i) = P(o_1, \dots, o_t, s_t = i \mid \lambda) are computed by the forward algorithm and N is the number of states.

Smoothing extends this to compute the posterior for any state s_t given the entire observation sequence, P(s_t \mid o_1, \dots, o_T, \lambda), enabling more accurate offline inference by incorporating future observations. The smoothed probability is \gamma_t(i) = P(s_t = i \mid O, \lambda) = \frac{\alpha_t(i) \beta_t(i)}{P(O \mid \lambda)}, where \beta_t(i) are the backward variables and P(O \mid \lambda) = \sum_{i=1}^N \alpha_T(i).

The backward algorithm computes the backward variables \beta_t(i) = P(o_{t+1}, \dots, o_T \mid s_t = i, \lambda) recursively, initialized with \beta_T(i) = 1 for all states i, and for t = T-1, \dots, 1,

\beta_t(i) = \sum_{j=1}^N a_{ij} b_j(o_{t+1}) \beta_{t+1}(j).

This recursion leverages the Markov property to propagate likelihoods backward through the state trellis, with time complexity O(T N^2).
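The backward recursion and the resulting smoothed posteriors can be sketched in a few lines of NumPy (illustrative two-state parameters; unscaled probabilities, so suitable only for short sequences). Filtered probabilities follow from the same \alpha table by normalizing each of its rows.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Returns smoothed posteriors gamma_t(i) = alpha_t(i) beta_t(i) / P(O | lambda)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta = np.ones((T, N))                        # beta_T(i) = 1
    for t in range(T - 2, -1, -1):                # backward recursion
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha * beta / alpha[-1].sum()

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(forward_backward(A, B, pi, obs=[0, 1, 0]))  # each row sums to 1
```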
Most likely state sequence
In hidden Markov models (HMMs), the task of finding the most likely sequence of hidden states S = s_1, s_2, \dots, s_T given an observed sequence O = o_1, o_2, \dots, o_T and model parameters \lambda = (A, B, \pi) involves computing \arg\max_S P(S \mid O, \lambda).[9] This is equivalent to \arg\max_S P(O, S \mid \lambda), since P(O \mid \lambda) is constant for a fixed observation sequence, approximating the posterior by selecting the single path that maximizes the joint probability rather than summing over all possible paths.[9] This approach, known as the Viterbi approximation, is particularly useful when the number of states is large, making exhaustive enumeration intractable.[11]

The Viterbi algorithm employs dynamic programming to efficiently solve this optimization problem in O(T N^2) time, where N is the number of states and T is the sequence length.[9] It defines \delta_t(i) as the probability of the most likely path ending in state i at time t, having produced the partial observation sequence o_1, \dots, o_t. The initialization is given by

\delta_1(i) = \pi_i b_i(o_1), \quad 1 \leq i \leq N,

where \pi_i is the initial state probability and b_i(o_1) is the emission probability of observation o_1 from state i.[9] The recursion proceeds as

\delta_{t+1}(j) = \max_{1 \leq i \leq N} \left[ \delta_t(i) a_{ij} \right] b_j(o_{t+1}), \quad 1 \leq t \leq T-1, \quad 1 \leq j \leq N,

with a_{ij} denoting the transition probability from state i to j, and auxiliary pointers \psi_t(j) = \arg\max_i \left[ \delta_t(i) a_{ij} \right] stored to enable backtracking.[9] At time T, the maximum probability is P^* = \max_{1 \leq i \leq N} \delta_T(i), and the final state is s_T^* = \arg\max_i \delta_T(i). The optimal state sequence is then recovered by backtracking: s_t^* = \psi_{t+1}(s_{t+1}^*) for t = T-1, \dots, 1.[9]

This algorithm originates from Viterbi's work on decoding convolutional codes, where it was shown to be asymptotically optimal for certain rates, and was later adapted to HMMs for sequence decoding tasks.[11] In practice, to avoid underflow in numerical computations, the probabilities are often computed in log space, replacing products with sums of logarithms.[9]

A prominent application is in bioinformatics for sequence alignment using profile HMMs, where the Viterbi algorithm identifies the optimal path aligning a query sequence to a model representing a protein family, especially when the exact marginal likelihood summation over all alignments is computationally prohibitive.[12] For instance, in tools like HMMER, Viterbi decoding produces the highest-scoring alignment by maximizing the joint probability of the state path and emissions, facilitating remote homolog detection.[12]
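A minimal log-space implementation of the recursion and backtracking, under the same illustrative two-state parameters used earlier:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi decoding in log space to avoid numerical underflow."""
    N, T = A.shape[0], len(obs)
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)             # backpointers psi_t(j)
    delta[0] = logpi + logB[:, obs[0]]            # initialization
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA     # scores[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[-1].argmax())]              # s_T* = argmax_i delta_T(i)
    for t in range(T - 1, 0, -1):                 # backtracking via stored pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()            # best state sequence and log P*

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, obs=[0, 1, 0]))
```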
Statistical estimation
Statistical estimation in hidden Markov models (HMMs) involves evaluating the uncertainty and significance of inferences about latent states and model fit, distinct from parameter learning. This includes deriving measures of variability for posterior state probabilities and testing hypotheses about model structures using likelihood-based statistics. Such methods are essential for assessing the reliability of decoded state sequences or probabilities in applications like sequence analysis, where observations may be noisy or limited.[7]

The posterior distribution of the latent state s_t at time t given the observation sequence O = \{o_1, \dots, o_T\}, denoted P(s_t \mid O), is computed using the forward-backward algorithm. For discrete states labeled by integers i = 1, \dots, N, the posterior variance \operatorname{Var}(s_t \mid O) quantifies uncertainty as \sum_{i=1}^N \gamma_t(i) i^2 - \left( \sum_{i=1}^N \gamma_t(i) i \right)^2, where \gamma_t(i) = P(s_t = i \mid O). This variance is low when the posterior concentrates on a single state (high confidence) and increases with spread across states, providing a time-specific measure of inference reliability. In practice, for non-ordinal states, alternative uncertainty metrics like the entropy H(s_t \mid O) = -\sum_i \gamma_t(i) \log \gamma_t(i) are used, but variance applies directly when states have a natural ordering, such as in regime-switching models.[13]

Likelihood ratio tests enable model comparison in HMMs by assessing whether a more complex model significantly improves fit over a nested simpler one. The test statistic is 2 \log (L_1 / L_2), where L_1 and L_2 are the maximized likelihoods under the full and reduced models, respectively, and asymptotically follows a \chi^2 distribution with degrees of freedom equal to the difference in parameter counts under regularity conditions. This is particularly useful for testing constraints like equal transition probabilities across states or the number of hidden states, aiding selection among candidate HMMs without overfitting. For non-standard cases, such as parameters on boundaries, adjusted distributions may apply to maintain validity.[14]

Bootstrap methods address the variability of HMM inferences by resampling the observation sequence to generate empirical distributions of statistics like state posteriors or likelihoods. In a parametric bootstrap, one fits the HMM to data to obtain parameters \hat{\theta}, simulates B replicate sequences from the fitted model, refits or recomputes inferences on each, and uses the sample distribution (e.g., percentiles) for confidence intervals on quantities like \gamma_t(i) or decoded paths. This non-asymptotic approach is robust to small sample sizes or irregular likelihoods where analytical methods fail, though computationally intensive; efficient variants use importance sampling to reduce the number of replications needed. For instance, bootstrap intervals for posterior probabilities help quantify uncertainty in state assignments for short sequences.[15]

Under ergodicity and identifiability assumptions, HMM maximum likelihood estimators exhibit asymptotic normality for large observation length T: \sqrt{T} (\hat{\theta} - \theta_0) \xrightarrow{d} \mathcal{N}(0, \mathcal{I}(\theta_0)^{-1}), where \mathcal{I}(\theta_0) is the Fisher information matrix, enabling Wald tests and confidence intervals for parameters. This convergence supports p-value computations for hypotheses on state-specific means or transitions, such as testing whether a transition probability exceeds a threshold.
For latent state inferences, similar asymptotics hold for smoothed posteriors, with variability scaling as 1/T, though misspecification requires robust variants. These results underpin standard errors in software implementations for long sequences.[16]
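As a small illustration of these uncertainty measures, the following sketch (assuming NumPy, and a smoothed posterior row \gamma_t already produced by the forward-backward algorithm) computes the entropy and the ordinal-label variance for a single time step:

```python
import numpy as np

def posterior_uncertainty(gamma_t):
    """Entropy H(s_t | O) and variance Var(s_t | O) for states labeled 1..N."""
    gamma_t = np.asarray(gamma_t, dtype=float)
    nz = gamma_t[gamma_t > 0]
    entropy = -np.sum(nz * np.log(nz))
    labels = np.arange(1, len(gamma_t) + 1)        # assumes an ordinal state labeling
    mean = np.sum(gamma_t * labels)
    variance = np.sum(gamma_t * labels ** 2) - mean ** 2
    return entropy, variance

print(posterior_uncertainty([0.95, 0.05]))  # concentrated posterior: low uncertainty
print(posterior_uncertainty([0.50, 0.50]))  # uniform posterior: maximal uncertainty
```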
Learning
Parameter estimation methods
Parameter estimation in hidden Markov models (HMMs) aims to determine the model parameters \lambda = (A, B, \pi) from observed sequences O = o_1, o_2, \dots, o_T, where the latent states are unobserved, making it an unsupervised learning problem. The objective is to maximize the log-likelihood \log P(O \mid \lambda), which corresponds to finding the parameters that best explain the observed data under the HMM assumptions. The seminal method for this is the Baum-Welch algorithm, introduced as an iterative maximization technique for probabilistic functions of Markov chains. This algorithm is a specific instance of the expectation-maximization (EM) framework applied to HMMs, where the expectation step computes expected sufficient statistics using posterior probabilities, and the maximization step updates the parameters to increase the likelihood.[7]

The Baum-Welch algorithm leverages the forward-backward inference procedure to perform the expectation step efficiently. It computes two key quantities: the posterior probability of being in state i at time t, denoted \gamma_t(i) = P(s_t = i \mid O, \lambda), and the joint posterior probability of transitioning from state i at time t to state j at time t+1, denoted \xi_t(i,j) = P(s_t = i, s_{t+1} = j \mid O, \lambda). These are derived from the forward probabilities \alpha_t(i) = P(o_1, \dots, o_t, s_t = i \mid \lambda) and backward probabilities \beta_t(i) = P(o_{t+1}, \dots, o_T \mid s_t = i, \lambda), specifically \gamma_t(i) = \frac{\alpha_t(i) \beta_t(i)}{P(O \mid \lambda)} and \xi_t(i,j) = \frac{\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}{P(O \mid \lambda)}.[7]

In the maximization step, the parameters are re-estimated using these expected counts to form maximum-likelihood estimates under the complete-data likelihood. The transition probabilities are updated as

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},

which represents the expected number of transitions from i to j divided by the expected number of times in state i. For discrete observation emissions, the probabilities are updated as

\hat{b}_j(k) = \frac{\sum_{t=1}^T \gamma_t(j) \cdot \mathbf{1}(o_t = k)}{\sum_{t=1}^T \gamma_t(j)},

where \mathbf{1}(o_t = k) is 1 if the observation at t matches symbol k and 0 otherwise; the initial state distribution is \hat{\pi}_i = \gamma_1(i). These closed-form updates ensure that the likelihood does not decrease after each iteration.[7]

The algorithm proceeds iteratively until the change in log-likelihood falls below a threshold, at which point it converges to a local maximum of P(O \mid \lambda). Since the likelihood function is non-convex and multimodal, convergence depends on the initial parameter values, often leading to suboptimal solutions; thus, multiple runs with random initializations are recommended, selecting the model with the highest final likelihood.[7] This approach has been foundational for HMM training, particularly in applications requiring robust unsupervised estimation.[7]
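One EM iteration of these updates can be transcribed directly from the formulas. The NumPy sketch below handles a single discrete observation sequence with unscaled forward-backward passes, so it is only suitable for short sequences; practical implementations scale the recursions and add convergence checks:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch (EM) iteration on a single discrete observation sequence."""
    N, T = A.shape[0], len(obs)
    # E-step: forward-backward passes.
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood                               # gamma_t(i)
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood    # xi_t(i,j)
    # M-step: closed-form re-estimates.
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, gamma[0], likelihood

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
obs = [0, 0, 1, 0, 1, 1, 0, 0]
for _ in range(20):                                # iterate toward a local maximum
    A, B, pi, likelihood = baum_welch_step(A, B, pi, obs)
print(np.log(likelihood))                          # log-likelihood is non-decreasing
```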
Supervised vs. unsupervised learning
In supervised learning for hidden Markov models (HMMs), both the latent state sequences and the corresponding observation sequences are available during training, allowing for direct maximum likelihood estimation of the model parameters through simple counting. The transition probabilities are computed as the frequency of observed transitions between states, given by

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \mathbb{I}(s_t = i, s_{t+1} = j)}{\sum_{t=1}^{T-1} \mathbb{I}(s_t = i)},

where s_t denotes the state at time t and \mathbb{I} is the indicator function; similarly, emission probabilities are estimated as the relative frequency of each observation given the state.[17][18] This approach is computationally efficient and yields precise estimates when state labels are accurate, as seen in applications like part-of-speech tagging with manually annotated corpora.[1]

In contrast, unsupervised learning applies when only observation sequences are available, as the states remain hidden, necessitating iterative inference to estimate parameters; the Baum-Welch algorithm, an expectation-maximization procedure, is commonly used to iteratively refine estimates by computing expected state occupancies and transitions via forward-backward probabilities. This method is more challenging due to the indirect nature of state inference, often converging to local maxima rather than the global optimum, and requiring substantially more computation.[1] Hybrid approaches, such as semi-supervised learning, incorporate a mix of fully labeled, partially labeled, and unlabeled sequences to leverage available supervision while scaling to larger datasets, improving parameter robustness in domains like biological sequence analysis.[19]

The key trade-offs between these paradigms lie in data requirements, efficiency, and performance: supervised learning excels in accuracy and speed when labeled state data is plentiful but is limited by the cost and scarcity of such annotations, whereas unsupervised learning enables model training on abundant unlabeled observations, promoting scalability at the expense of potential suboptimal solutions and longer training times.[20]

Model evaluation in both settings typically employs cross-validation on held-out labeled data to assess generalization or computes the likelihood of unseen observation sequences under the trained model, with supervised approaches benefiting from direct accuracy metrics like state prediction error and unsupervised ones relying more on perplexity or predictive log-likelihood to gauge fit.[1]
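The counting estimators for the supervised case fit in a few lines; the sketch below (assuming NumPy; the tiny labeled sequence is made up, and in practice zero counts are smoothed, e.g. with add-one counts) implements them:

```python
import numpy as np

def supervised_mle(states, obs, N, M):
    """Count-based maximum likelihood estimates from one fully labeled sequence."""
    A = np.zeros((N, N))
    B = np.zeros((N, M))
    pi = np.zeros(N)
    pi[states[0]] = 1.0                     # with many sequences, count start states
    for s_prev, s_next in zip(states[:-1], states[1:]):
        A[s_prev, s_next] += 1              # transition counts
    for s, o in zip(states, obs):
        B[s, o] += 1                        # emission counts
    A /= A.sum(axis=1, keepdims=True)       # normalize rows (assumes every state
    B /= B.sum(axis=1, keepdims=True)       # occurs; otherwise apply smoothing)
    return A, B, pi

A, B, pi = supervised_mle(states=[0, 0, 1, 1, 0], obs=[0, 1, 1, 0, 0], N=2, M=2)
print(A)
print(B)
```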
Applications
Speech and signal processing
Hidden Markov models (HMMs) have played a pivotal role in speech recognition since the 1980s, modeling the temporal dynamics of spoken utterances where hidden states correspond to phonemes or subword units, and observations consist of acoustic feature vectors derived from the audio signal. A common choice for these observations is Mel-frequency cepstral coefficients (MFCCs), which capture the short-term power spectrum of speech in a perceptually motivated way, mimicking human auditory processing. This representation proved effective in early experiments comparing parametric features for monosyllabic word recognition in continuous speech.[9][21]

To address coarticulation effects, in which the pronunciation of a phoneme is influenced by neighboring sounds, triphone models extend basic phoneme-based HMMs by conditioning acoustic models on the left and right phonetic contexts, such as modeling the transition from one phoneme to another across a central phoneme. This context-dependent approach significantly improves recognition accuracy by reducing the variability in emission probabilities for similar phonetic environments, as demonstrated in large-vocabulary systems. For handling continuous-valued observations in real speech, hybrid HMM-Gaussian mixture model (HMM-GMM) frameworks model emission densities as weighted sums of multivariate Gaussians, allowing flexible approximation of the complex distributions of acoustic features like MFCCs. These mixtures enable robust parameter estimation via expectation-maximization, adapting to speaker and environmental variations.[9]

Early implementations, such as the Sphinx system developed at Carnegie Mellon University, leveraged discrete HMMs with triphone modeling to achieve speaker-independent recognition of continuous speech with vocabularies exceeding 1,000 words, attaining word error rates around 20-30% on benchmark tasks in the late 1980s. HMM-based architectures, including HMM-GMM hybrids, became foundational in automatic speech recognition (ASR) pipelines, powering commercial systems and influencing standards in telephony and dictation software through the 1990s and 2000s.[22][9]

Beyond speech, HMMs find applications in broader signal processing tasks, particularly for anomaly detection in time-series data where normal signal patterns are modeled as standard state sequences, and deviations indicate faults or irregularities. In fault diagnosis for mechanical systems, such as monitoring vibration signals from machinery, HMMs distinguish healthy operation from defective states by analyzing transitions in feature-extracted signals, enabling early detection with high sensitivity in industrial settings. This approach has been applied to tasks like wind turbine state detection, achieving high performance (AUC > 0.99) in controlled experiments by leveraging Viterbi decoding to infer anomalous paths.[23]
Bioinformatics and sequence analysis
Hidden Markov models (HMMs) have become a cornerstone in bioinformatics for modeling biological sequences, where the hidden states often represent structural or functional elements, and observations correspond to nucleotide or amino acid symbols. In sequence analysis, HMMs excel at capturing dependencies in discrete symbolic data, such as DNA or protein sequences, enabling predictions of latent features like gene boundaries or structural motifs. This application leverages the probabilistic framework of HMMs to handle the inherent variability and noise in biological data, outperforming simpler statistical models in accuracy for tasks involving sequential patterns.[24]

In gene finding, HMMs model genomic DNA sequences by defining states that correspond to genomic regions such as exons, introns, and intergenic spacers, with emissions representing the observed DNA bases (A, C, G, T). A seminal implementation is GENSCAN, which uses a generalized HMM to predict complete gene structures, including splice sites and exon-intron boundaries, by computing the most likely state path through the sequence. GENSCAN achieves high sensitivity and specificity on human and vertebrate genomes, correctly identifying about 79% of exons at the nucleotide level in benchmark tests. The Viterbi algorithm is typically employed to decode the most probable gene structure.[25]

For protein secondary structure prediction, HMMs treat amino acid residues as emissions and define states for common structural elements like alpha-helices, beta-sheets, and coils (or random coils). An early and influential approach uses a standard HMM to estimate transition probabilities between states based on known protein structures and emission probabilities derived from amino acid propensities in each state. This method, applied to single sequences, yields prediction accuracies around 60-65% for three-state models (helix, sheet, coil), improving upon rule-based heuristics by accounting for sequential correlations. Such models facilitate de novo predictions and refinement of homology-based inferences.[26]

Profile HMMs extend standard HMMs for detecting conserved motifs in protein families, incorporating match, insert, and delete states to represent alignments without gaps in the core model. These models are built from multiple sequence alignments, capturing position-specific conservation and variability for sensitive database searches. A foundational work introduced profile HMMs for protein modeling, demonstrating superior performance in aligning and searching distantly related sequences compared to position weight matrices. Profile HMMs are particularly effective for motif detection in large databases, with applications in identifying functional domains.

The HMMER software suite implements profile HMMs for practical biosequence analysis, enabling rapid searching of protein databases for homologs and generating alignments. Developed as an open-source tool, HMMER uses optimized algorithms like the Forward-Backward procedure for scoring and has become integral to resources like Pfam for curating protein families. In benchmarks, HMMER3 achieves speeds comparable to BLAST while maintaining higher sensitivity for remote homologs, processing large datasets efficiently on standard hardware.
History
Origins and early developments
The foundations of hidden Markov models (HMMs) trace back to the early 20th century with the development of Markov chains by the Russian mathematician Andrey Markov, who introduced the concept in 1906 to model sequences of dependent events, such as letters in texts, demonstrating that the law of large numbers holds without full independence.[27] This framework provided the stochastic-process basis for later extensions, where observable outcomes depend on underlying states.

The "hidden" aspect emerged in the 1960s amid efforts to model noisy communication channels and pattern recognition problems, particularly in speech processing, at the Institute for Defense Analyses (IDA). A seminal contribution came from Leonard E. Baum and Ted Petrie, who in 1966 formalized statistical inference methods for probabilistic functions of finite-state Markov chains, laying the theoretical groundwork for estimating parameters in systems where states are not directly observable.[28] Their work, building on earlier IDA research, addressed challenges in decoding hidden sequences from observed emissions, initially motivated by automatic speech recognition and error-correcting codes in noisy environments.[9]

In the early 1970s, these ideas gained practical traction in speech recognition through extensions by James K. Baker at Carnegie Mellon University and Fred Jelinek at IBM, who adapted HMMs for continuous speech modeling, enabling probabilistic decoding of phonetic sequences despite acoustic variability.[9] Baker's implementations, influenced by IDA tutorials from Jack Ferguson in the late 1960s, led to the DRAGON recognition system, while Jelinek's IBM group developed large-vocabulary dictation tools using stochastic training on Markov sources.[29] These advancements marked the transition from theoretical formulations to applied tools for sequential data analysis.
Key contributors and milestones
In the late 1970s and 1980s, the application of hidden Markov models (HMMs) to speech recognition gained momentum through the efforts of key researchers at IBM, led by Frederick Jelinek, whose group pioneered the shift from rule-based systems to statistical approaches, including HMMs, achieving significant improvements in continuous speech recognition accuracy.[30] This transition was supported by substantial funding from the Defense Advanced Research Projects Agency (DARPA), which sponsored initiatives like the Strategic Computing Program in the 1980s, fostering the development and evaluation of HMM-based systems for large-vocabulary speech tasks.[31]

A pivotal milestone came in 1989 with Lawrence Rabiner's influential tutorial, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," which synthesized the theoretical foundations and practical implementations of HMMs, drawing on earlier work by Jack Ferguson, and made the framework accessible to a broad audience of engineers and researchers. During this period, the Baum-Welch algorithm, originally formulated in the early 1970s, saw widespread implementation in speech processing software, enabling efficient parameter estimation for HMMs trained on acoustic data and marking a practical advancement in the field's scalability.[32]

In the 1990s, HMMs extended beyond speech into bioinformatics, with Anders Krogh and colleagues introducing profile HMMs in 1994 for modeling protein families and sequence alignment, which improved sensitivity in database searches and multiple sequence analysis compared to earlier probabilistic methods.[33] This adoption highlighted HMMs' versatility in handling sequential data with hidden states.

By the 2000s, HMMs integrated with emerging machine learning techniques, such as Gaussian mixture models and neural network hybrids, enhancing applications in areas like natural language processing, though the core HMM framework, with its Markov assumptions and Viterbi-style inference, remained fundamentally unchanged.[1]
Extensions
Continuous and general state spaces
In hidden Markov models (HMMs) with continuous state spaces, the hidden states are elements of a continuous domain, such as \mathbb{R}^d, rather than a finite discrete set. The transition probabilities are replaced by transition densities f(x_t \mid x_{t-1}), which describe the probability distribution of the next state given the previous one, often modeled using parametric forms like multivariate Gaussians for tractability. Similarly, emission densities g(y_t \mid x_t) govern the observation process, allowing the model to handle real-valued observations common in applications like signal processing.

Inference algorithms for continuous-state HMMs adapt the classical forward-backward procedure by replacing discrete summations with integrals over the state space. For example, the forward variable \alpha_t(x_t) at time t is computed as

\alpha_t(x_t) = g(y_t \mid x_t) \int \alpha_{t-1}(x_{t-1}) f(x_t \mid x_{t-1}) \, dx_{t-1},

which is analytically tractable under linear Gaussian assumptions where states evolve via x_t = A x_{t-1} + w_t and observations via y_t = C x_t + v_t, with w_t and v_t Gaussian noises. This special case reduces to the Kalman filter for filtering and the Kalman smoother for full smoothing, unifying discrete HMM methods with linear dynamical systems. In nonlinear or non-Gaussian settings, these integrals are approximated using methods like particle filters, which maintain a weighted set of state samples to represent the posterior distribution recursively.

Extensions to general state spaces encompass countable or uncountably infinite states, addressing scenarios where the number of states is unknown or potentially unlimited. A prominent approach is the infinite HMM, which employs a Dirichlet process prior on the transition distributions to allow an infinite number of states while sharing atoms across transitions via a hierarchical structure.[34] Specifically, the transition distribution out of each state i is drawn from a Dirichlet process \text{DP}(\alpha_0 G_0), where G_0 is a base measure (e.g., on emission parameters) and \alpha_0 controls concentration, enabling posterior inference to infer a data-driven number of states through stick-breaking representations or Chinese restaurant process sampling.[34] Algorithms for these models approximate marginal likelihoods and posteriors using variational methods or MCMC, with particle filters providing sequential approximations for online settings.
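For the nonlinear, non-Gaussian settings mentioned above, a bootstrap particle filter approximates the filtering integral by propagating and reweighting samples. The following sketch uses a toy scalar model with made-up coefficients (x_t = 0.5 x_{t-1} + w_t, y_t = x_t + v_t), purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_particle_filter(obs, n_particles=1000):
    """Propagate-weight-resample cycle for a toy scalar state-space model:
       transition density f: x_t = 0.5 x_{t-1} + w_t, w_t ~ N(0, 1)
       emission density  g: y_t = x_t + v_t,         v_t ~ N(0, 0.5)."""
    particles = rng.normal(0.0, 1.0, n_particles)        # samples from the prior
    means = []
    for y in obs:
        particles = 0.5 * particles + rng.normal(0.0, 1.0, n_particles)  # propagate via f
        log_w = -0.5 * (y - particles) ** 2 / 0.5        # log g(y | x), up to a constant
        w = np.exp(log_w - log_w.max())
        w /= w.sum()                                     # normalized importance weights
        means.append(np.sum(w * particles))              # filtered mean E[x_t | y_{1:t}]
        idx = rng.choice(n_particles, size=n_particles, p=w)  # multinomial resampling
        particles = particles[idx]
    return np.array(means)

print(bootstrap_particle_filter(obs=[0.2, -0.1, 0.5, 1.0]))
```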
Bayesian and discriminative variants
Bayesian variants of hidden Markov models treat the model parameters as random variables, incorporating prior distributions to facilitate full probabilistic inference over parameters and latent states, thereby providing measures of uncertainty in addition to point estimates. The transition matrix A and initial state distribution \pi typically receive Dirichlet priors, which are conjugate to the multinomial distributions governing state transitions and initial states, ensuring that the priors reflect the simplex constraints on probabilities. For emission parameters B, priors are selected to match the emission distributions, such as Dirichlet for discrete categorical emissions or normal-inverse-gamma for continuous Gaussian emissions. This setup allows for posterior distributions that integrate out parameters, avoiding overfitting in data-scarce regimes.[35]

Inference in Bayesian HMMs is computationally challenging due to the intractability of the full posterior, often addressed through Markov chain Monte Carlo (MCMC) methods like Gibbs sampling, which iteratively samples latent states from their conditional distributions given observations and parameters, then updates parameters from their conditionals given states. Alternatively, variational Bayesian inference approximates the true posterior with a simpler factorized distribution, optimizing a lower bound on the marginal likelihood to enable scalable computation, particularly useful for large datasets or real-time applications. These approaches have been foundational in extending HMMs to nonparametric settings, such as hierarchical Dirichlet process priors for unknown state spaces.[36]

Discriminative variants, in contrast, directly model the conditional distribution P(S \mid O) of latent states S given observations O, bypassing the joint modeling of P(O, S) in generative HMMs to improve accuracy in supervised tasks with labeled data. A key example is the conditional random field (CRF), which posits an exponential family form for the conditional, incorporating arbitrary features of observations and states without assuming conditional independence between observations given states, thus mitigating issues like label bias in maximum entropy Markov models. Parameter estimation in CRFs uses maximum likelihood on the conditional, often via gradient-based optimization, leading to superior performance over HMMs in sequence labeling.

The core difference lies in the objective: generative models like standard HMMs optimize the joint likelihood, which can underperform if the observation model is poorly specified, whereas discriminative approaches prioritize the posterior over states, yielding tighter decision boundaries for classification. Bayesian HMMs are particularly advantageous in small-data settings, where priors encode domain knowledge and quantify parameter uncertainty, as seen in applications like sparse time-series analysis in finance. Discriminative models such as CRFs shine in natural language processing, outperforming HMMs in part-of-speech tagging by 1-2% in accuracy on benchmark datasets like the Penn Treebank, due to their ability to leverage rich observation features.[37]
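Within a Gibbs sampler for a Bayesian HMM, Dirichlet-multinomial conjugacy makes the parameter step simple: given the currently sampled state path, each row of A and B is redrawn from a Dirichlet posterior whose concentration is the prior plus the observed counts. A sketch of that single step (assuming NumPy; the symmetric concentrations alpha_A and alpha_B are illustrative choices); a full Gibbs sweep would alternate this with resampling the state path given the new parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def resample_parameters(states, obs, N, M, alpha_A=1.0, alpha_B=1.0):
    """One conditional draw of (A, B) given a sampled state path."""
    trans_counts = np.zeros((N, N))
    emit_counts = np.zeros((N, M))
    for s_prev, s_next in zip(states[:-1], states[1:]):
        trans_counts[s_prev, s_next] += 1
    for s, o in zip(states, obs):
        emit_counts[s, o] += 1
    # Each posterior row is Dirichlet(prior concentration + counts).
    A = np.vstack([rng.dirichlet(alpha_A + trans_counts[i]) for i in range(N)])
    B = np.vstack([rng.dirichlet(alpha_B + emit_counts[j]) for j in range(N)])
    return A, B

A, B = resample_parameters(states=[0, 1, 1, 0], obs=[1, 0, 0, 1], N=2, M=2)
print(A)
print(B)
```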
Advanced topics
Measure-theoretic foundations
The measure-theoretic foundations of hidden Markov models (HMMs) extend the classical discrete-state framework to general measurable spaces, enabling the treatment of continuous or abstract state and observation spaces. The state space is denoted by (S, \Sigma_S), where S is the set of possible hidden states and \Sigma_S is a \sigma-algebra on S, rendering it a measurable space. Similarly, the observation space is (O, \Sigma_O), with O as the set of observable outcomes and \Sigma_O its \sigma-algebra. These structures allow the definition of probability measures over paths in the product space, capturing the underlying stochastic dynamics without assuming finite or countable discreteness.

The dynamics of an HMM are specified via stochastic kernels that govern transitions and emissions. The state transition kernel K: S \times \Sigma_S \to [0,1] is such that, for each fixed x \in S, K(x, \cdot) is a probability measure on (S, \Sigma_S), and for each A \in \Sigma_S, the map x \mapsto K(x, A) is \Sigma_S-measurable. The emission kernel G: S \times \Sigma_O \to [0,1] similarly defines the conditional distribution of observations given states, satisfying analogous measurability and probability conditions. Given an initial probability measure \mu on (S, \Sigma_S), the joint law of the HMM process (X_t, Y_t)_{t \geq 1} is constructed on a suitable probability space (\Omega, \mathcal{F}, P), where \{X_t\} forms a Markov chain with transitions induced by K and law starting from \mu, and Y_t is conditionally independent across time given \{X_t\} with law G(X_t, \cdot). This setup ensures the overall measure P is well-defined and mutually absolutely continuous with respect to product measures under standard regularity assumptions.[5]

Filtered processes in HMMs arise from conditioning the hidden state distribution on the observed filtration. Define the observation filtration \mathcal{F}_t = \sigma(Y_1, \dots, Y_t), a sub-\sigma-algebra of \mathcal{F}. The filtering distribution at time t is the regular conditional probability \pi_t(A) = P(X_t \in A \mid \mathcal{F}_t) for A \in \Sigma_S, which exists as a probability measure on (S, \Sigma_S) by the disintegration theorem, given the measurability of the kernels. Expectations under \pi_t yield filtered estimates, such as \pi_t(f) = \int_S f(x) \pi_t(dx) = E[f(X_t) \mid \mathcal{F}_t] for bounded measurable f: S \to \mathbb{R}. The evolution of \pi_t is governed by recursive updates involving the prediction step via K and the correction via G, formulated in terms of conditional expectations within the expanding filtration.[5]

Under stationarity and ergodicity assumptions, the long-run behavior of HMMs is characterized by ergodic theorems. Assume the Markov chain \{X_t\} is positive Harris recurrent and aperiodic with respect to some reference measure, implying the existence of a unique invariant probability measure \bar{\mu} on (S, \Sigma_S). By Birkhoff's ergodic theorem applied to the stationary ergodic process \{X_t\}, the time average \frac{1}{n} \sum_{t=1}^n f(X_t) \to \int_S f \, d\bar{\mu} almost surely for integrable f, establishing the long-run state distribution. For the filtered process, if the emission kernel G is sufficiently informative (e.g., ensuring absolute continuity and identifiability), the filter \pi_t converges in total variation to a stationary filtering distribution \bar{\pi}, reflecting the ergodic occupation measure conditioned on infinite observations.
These results underpin asymptotic consistency in parameter estimation and prediction.

The abstract likelihood in HMMs is expressed using Radon-Nikodym derivatives to handle general measures. For a parameterized model with parameter \lambda governing K^\lambda, G^\lambda, and initial \mu^\lambda, the likelihood is the Radon-Nikodym derivative \frac{dP^\lambda}{dP^0}(Y_{1:n}), where P^0 is a dominating reference measure (often the law of independent observations under a baseline kernel). Under absolute continuity P^\lambda \ll P^0, this derivative admits an explicit form as an expectation over the hidden paths:

\frac{dP^\lambda}{dP^0}(Y_{1:n}) = E^0 \left[ \prod_{t=1}^n \frac{dG^\lambda(X_t, \cdot)}{dG^0(X_t, \cdot)}(Y_t) \cdot \frac{d\mu^\lambda}{d\mu^0}(X_1) \prod_{t=1}^{n-1} \frac{dK^\lambda(X_t, \cdot)}{dK^0(X_t, \cdot)}(X_{t+1}) \;\middle|\; Y_{1:n} \right],

where the densities are with respect to dominating measures on the respective spaces. This formulation facilitates maximum likelihood inference via filtering recursions, ensuring well-posedness in non-discrete settings.
Input-output and hybrid models
Input-output hidden Markov models (IOHMMs) extend the standard hidden Markov model framework by incorporating exogenous inputs that influence the state transitions and emission probabilities, enabling the modeling of dynamic systems where observations depend on both hidden states and external controls or covariates. In an IOHMM, the transition probabilities are conditioned on inputs x_t, such that P(s_{t+1} \mid s_t, x_t), and emissions are given by P(o_t \mid s_t, x_t), allowing the model to capture how inputs like actions or environmental factors affect the system's evolution. This architecture was introduced to handle supervised sequence processing tasks, such as predicting outputs from input sequences, and supports learning via an expectation-maximization algorithm adapted for the input dependencies.[38]

Hybrid models combine HMMs with other architectures to address limitations in emission modeling or temporal dependencies. For instance, hybrid deep neural network-HMM (DNN-HMM) systems replace traditional Gaussian mixture emissions with DNNs that predict state posteriors from acoustic features, improving representation of complex observation distributions in tasks like speech recognition. In these hybrids, the DNN serves as a nonlinear classifier for frame-level states, while the HMM layer enforces temporal consistency through transitions, achieving significant error rate reductions on benchmarks like Switchboard compared to Gaussian HMMs. Recurrent HMM variants integrate recurrent neural networks (RNNs) to introduce longer-range memory, where hidden states evolve via RNN dynamics before feeding into HMM emissions, enhancing modeling of non-Markovian sequences in areas like activity recognition.

Autoregressive HMMs further extend the framework by making emissions dependent on previous observations within each state, formalized as o_t \sim \mathcal{N}(A_{s_t} o_{t-1} + b_{s_t}, \Sigma_{s_t}), where A_{s_t} and b_{s_t} are state-specific autoregressive parameters. This allows states to influence future observations through serial correlations, capturing dynamics like speech signals where current samples depend on prior ones, and inference proceeds via forward-backward algorithms with autoregressive likelihoods. Such models are particularly suited for time series with within-state dependencies.[39]

These variants find applications in control systems and reinforcement learning, where inputs represent actions that alter state transitions in partially observable environments, akin to partially observable Markov decision processes (POMDPs). In reinforcement learning, IOHMMs model belief states over hidden dynamics, enabling policy optimization under uncertainty, as demonstrated in algorithms that learn value functions for POMDPs using HMM filtering.
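The autoregressive emission density translates directly into per-state log-likelihoods that can replace the usual b_j(o_t) terms in forward-backward recursions. A sketch (assuming NumPy and SciPy; all parameter values are invented for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

def ar_emission_loglik(obs, A_s, b_s, Sigma_s):
    """Log-likelihoods log p(o_t | s_t = j, o_{t-1}) for an autoregressive HMM,
       with o_t | s_t = j ~ N(A_j o_{t-1} + b_j, Sigma_j).
       Returns a (T-1, N) array usable in place of emission probabilities."""
    T, N = len(obs), len(A_s)
    loglik = np.zeros((T - 1, N))
    for t in range(1, T):
        for j in range(N):
            mean = A_s[j] @ obs[t - 1] + b_s[j]
            loglik[t - 1, j] = multivariate_normal.logpdf(obs[t], mean, Sigma_s[j])
    return loglik

# One-dimensional, two-state example with invented parameters.
obs = np.array([[0.1], [0.3], [0.2], [0.5]])
A_s = [np.array([[0.9]]), np.array([[-0.5]])]
b_s = [np.array([0.0]), np.array([0.1])]
Sigma_s = [np.array([[0.1]]), np.array([[0.2]])]
print(ar_emission_loglik(obs, A_s, b_s, Sigma_s))
```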