HMM
A hidden Markov model (HMM) is a statistical model that describes a system as a Markov process with unobserved (hidden) states, where the observable outputs are generated probabilistically from those states, allowing inference about the hidden state sequence from the observations.[1] The model is defined by a set of hidden states, a transition probability matrix governing state changes, an emission probability distribution for observations from each state, and an initial state probability vector.[1] HMMs assume the Markov property, meaning the next state depends only on the current state, not prior history, and observations are conditionally independent given the state.[2]

Developed in the late 1960s by Leonard E. Baum and colleagues at the Institute for Defense Analyses, HMMs built on earlier Markov chain theory to address probabilistic inference in sequences with latent variables.[1] Key algorithms for HMMs include the Viterbi algorithm for finding the most likely state sequence (decoding), the forward algorithm for computing the probability of an observation sequence, and the Baum-Welch algorithm (an expectation-maximization method) for parameter estimation from data.[1] These tools enable unsupervised learning of model parameters, making HMMs versatile for sequential data analysis.[2]

HMMs gained prominence in the 1970s and 1980s through applications in speech recognition, where they model acoustic signals as emissions from hidden phonetic states, powering systems like those at Bell Labs.[3] In bioinformatics, HMMs are used for gene prediction by modeling nucleotide sequences with hidden states representing exons, introns, or intergenic regions, as in tools like GENSCAN.[4] Other notable applications include part-of-speech tagging in natural language processing, gesture recognition in computer vision, and anomaly detection in time series data.[1] Despite their foundational role, HMMs have been extended or supplemented by models like conditional random fields for handling non-independent observations.[3]

Background Concepts
Markov Chains
A Markov chain is a stochastic process satisfying the Markov property, which posits that the probability distribution for the next state depends only on the current state and not on the sequence of prior states. Formally, for a sequence of random variables X_1, X_2, \dots, X_t, \dots, the property is expressed as P(X_t = j \mid X_{t-1} = i_{t-1}, \dots, X_1 = i_1) = P(X_t = j \mid X_{t-1} = i_{t-1}) for all t \geq 2 and states i_1, \dots, i_{t-1}, j. The concept was introduced by Andrey Markov in 1906 and famously applied in his 1913 analysis of letter sequences in Pushkin's Eugene Onegin.[5][6]

In the context of discrete-time Markov chains (DTMCs), the state space S is typically finite or countably infinite, denoted as S = \{1, 2, \dots, N\} for a finite case with N states. The dynamics are governed by a transition probability matrix A = (a_{ij}), where a_{ij} = P(X_t = j \mid X_{t-1} = i) represents the probability of transitioning from state i to state j. Each entry satisfies a_{ij} \geq 0, and the rows sum to 1: \sum_{j=1}^N a_{ij} = 1 for all i, ensuring the matrix is stochastic. The process begins with an initial state distribution \pi = (\pi_1, \pi_2, \dots, \pi_N), where \pi_i = P(X_1 = i) and \sum_{i=1}^N \pi_i = 1. The joint probability of a specific state sequence x_1, x_2, \dots, x_T is then given by P(X_1 = x_1, \dots, X_T = x_T) = \pi_{x_1} \prod_{t=2}^T a_{x_{t-1} x_t}. This formulation captures the chain's memoryless progression from one state to the next.[6][7]

A key long-term property of Markov chains is the stationary distribution, a probability vector \pi that remains invariant under transitions, satisfying the balance equation \pi = \pi A along with the normalization \sum_i \pi_i = 1. For ergodic chains—those that are irreducible (any state reachable from any other) and aperiodic (no cyclic structure forcing periodic returns)—this stationary distribution is unique and represents the limiting distribution as t \to \infty, independent of the initial state: \lim_{t \to \infty} P(X_t = i) = \pi_i for all i. Ergodicity ensures the chain explores the state space thoroughly over time, averaging to the stationary behavior.[6]

A simple illustrative example is a two-state weather model with states sunny (1) and rainy (2). The transition matrix might be A = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}, indicating a 90% chance of sunny following sunny and an 80% chance of rainy following rainy. Assuming an initial distribution \pi = (0.5, 0.5), the probability of a sequence like sunny-rainy-sunny is 0.5 \times 0.1 \times 0.2 = 0.01. Solving \pi = \pi A yields the stationary distribution \pi \approx (0.667, 0.333), suggesting long-run proportions of about two-thirds sunny days and one-third rainy. This model highlights how transition probabilities dictate steady-state tendencies without external influences.[6]
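The weather example is easy to verify directly. The following minimal sketch, assuming numpy is available and indexing sunny as 0 and rainy as 1, computes the probability of the sunny-rainy-sunny sequence and recovers the stationary distribution as the left eigenvector of A for eigenvalue 1.

```python
# Two-state weather Markov chain: sequence probability and stationary distribution.
import numpy as np

A = np.array([[0.9, 0.1],    # transition matrix: rows = current state (sunny, rainy)
              [0.2, 0.8]])   # columns = next state
pi = np.array([0.5, 0.5])    # initial distribution

# Probability of sunny -> rainy -> sunny (0 = sunny, 1 = rainy)
seq = [0, 1, 0]
p = pi[seq[0]]
for prev, nxt in zip(seq, seq[1:]):
    p *= A[prev, nxt]
print(p)  # 0.5 * 0.1 * 0.2 = 0.010

# Stationary distribution: solve pi = pi A with sum(pi) = 1,
# via the left eigenvector of A associated with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(A.T)
stat = np.real(eigvecs[:, np.argmax(np.isclose(eigvals, 1.0))])
stat = stat / stat.sum()
print(stat)  # approximately [0.667, 0.333]
```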
Observation Models
In probabilistic models involving sequential data, an observation model defines the conditional probability distribution that generates observable outputs from underlying states, providing a mechanism to link latent processes to measurable events.[3] This framework is foundational in Markovian systems, where states evolve over time and observations act as measurable proxies for those states; in the fully visible case they identify the states directly, while more generally they are noisy functions of them.[3]

For discrete observations, the emission probability is specified as b_j(k) = P(O_t = k \mid X_t = j), where O_t represents the observation at time t, X_t is the state at time t, j indexes one of N possible states, and k indexes one of M possible observation symbols.[3] These probabilities form an N \times M emission matrix B, with each row summing to 1 to ensure a valid probability distribution over observations for a given state.[3] This discrete setup is common in applications like symbol sequences, where states map deterministically or stochastically to a finite set of outputs.[3]

In the case of continuous observations, the emission probabilities are often modeled using multivariate Gaussian distributions, with b_j(O_t) = \mathcal{N}(O_t; \mu_j, \Sigma_j), where \mu_j is the mean vector and \Sigma_j is the covariance matrix associated with state j.[3] This parametric form assumes that observations are drawn independently from a normal distribution centered around the state's characteristic parameters, accommodating real-valued data such as sensor measurements.[3] More complex variants, like Gaussian mixture models, extend this by weighting multiple Gaussians per state, but the single Gaussian case establishes the core continuous emission concept.[3]

A representative example arises in a visible Markov chain modeling weather patterns, where states correspond to conditions like sunny, rainy, or cloudy. In the noiseless case the observations are direct state indicators, but the model extends naturally to noisy measurements, such as temperature readings emitted from each state with added Gaussian noise.[3] For instance, a sunny state might emit higher average temperatures around 25°C with low variance, while a rainy state emits cooler readings around 15°C with similar variability, allowing the model to generate realistic sequences of temperature data from known state transitions.[3]

Key properties of observation models include the conditional independence of emissions given the current state, expressed as P(O_t \mid X_t, X_{t-1}, \dots, O_{t-1}) = P(O_t \mid X_t), ensuring that past states and observations do not influence the current emission beyond the present state.[3] This Markovian assumption simplifies computation and aligns with the sequential nature of the underlying state process.[3] The joint probability distribution over states and observations in such a visible model is given by: P(X_1, \dots, X_T, O_1, \dots, O_T) = \pi_{X_1} b_{X_1}(O_1) \prod_{t=2}^T a_{X_{t-1} X_t} b_{X_t}(O_t), where \pi_{X_1} is the initial state probability, a_{ij} = P(X_t = j \mid X_{t-1} = i) are the state transition probabilities, and the product captures the chained dependencies.[3] This formulation integrates the observation model with the state dynamics, enabling the generation of complete observation sequences from initial conditions.[3]
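The noisy-temperature example above can be sketched as follows. The emission means (25°C for sunny, 15°C for rainy) follow the text; the standard deviation sigma = 2.0 and the particular state sequence are illustrative assumptions, not values from the source.

```python
# Gaussian observation model: each weather state emits a noisy temperature reading.
import numpy as np

rng = np.random.default_rng(0)
means = {"sunny": 25.0, "rainy": 15.0}
sigma = 2.0  # assumed emission noise (not specified in the text)

def emission_density(obs, state):
    """b_state(obs) for a univariate Gaussian emission centered on the state's mean."""
    mu = means[state]
    return np.exp(-0.5 * ((obs - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Generate temperature observations from a known (visible) state sequence.
states = ["sunny", "sunny", "rainy", "sunny"]
temps = [rng.normal(means[s], sigma) for s in states]
print(temps)

# A 24 C reading is far more plausible under the sunny state than the rainy one.
print(emission_density(24.0, "sunny"), emission_density(24.0, "rainy"))
```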
Model Formalism
State Space and Transitions
In a Hidden Markov Model (HMM), the hidden state space consists of a finite set of latent states S = \{1, 2, \dots, N\}, where N is the number of states and these states represent underlying processes that generate the observed data but are not directly measurable.[1] The states evolve over discrete time steps according to the first-order Markov property, which assumes that the probability of the next state depends solely on the current state: P(X_t = j \mid X_{t-1} = i, X_{t-2}, \dots, X_1) = P(X_t = j \mid X_{t-1} = i) = a_{ij}.[8] This property simplifies the modeling of sequential dependencies by ignoring longer historical context.[9]

The transition probabilities \{a_{ij}\} for 1 \leq i, j \leq N form the state transition matrix A, a stochastic matrix where each row sums to 1 (\sum_{j=1}^N a_{ij} = 1 for all i), ensuring valid probability distributions over possible next states.[1] HMMs typically assume homogeneous transitions, meaning the matrix A is time-invariant across the sequence, although extensions to time-varying transitions exist in more advanced variants.[8] The states are assumed to be finite and discrete, facilitating tractable computation, though continuous-state generalizations have been developed for specific applications.[9] The process begins with an initial state distribution \pi = \{\pi_i\}, where \pi_i = P(X_1 = i) for 1 \leq i \leq N and \sum_{i=1}^N \pi_i = 1, capturing the starting probabilities of the hidden chain.[1] Together, these components define the hidden state dynamics of the HMM, denoted compactly as the model \lambda = (A, B, \pi), where B specifies the observation probabilities linking states to data (detailed separately).[9]

Graphically, the state space and transitions are often depicted as a directed graph, with nodes representing the N states and directed edges labeled with the corresponding a_{ij} weights, illustrating the probabilistic flow between states.[1] For instance, in part-of-speech tagging, a simple three-state HMM might use states for noun (N), verb (V), and adjective (A), with the transition matrix A encoding typical linguistic sequences—such as a high probability of noun-to-verb transitions in subject-verb structures and lower probabilities for less common shifts like verb-to-noun.[10] A representative transition matrix could appear as follows, reflecting such patterns:

| From \ To | N | V | A |
|---|---|---|---|
| N | 0.70 | 0.20 | 0.10 |
| V | 0.10 | 0.80 | 0.10 |
| A | 0.30 | 0.20 | 0.50 |
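The hidden-state dynamics defined by such a matrix can be simulated directly. The sketch below uses the part-of-speech transition matrix from the table; the uniform initial distribution and the sequence length of ten are illustrative assumptions, not values given in the text.

```python
# Sampling a hidden state sequence from the part-of-speech transition matrix.
import numpy as np

states = ["N", "V", "A"]
A = np.array([[0.70, 0.20, 0.10],   # from N
              [0.10, 0.80, 0.10],   # from V
              [0.30, 0.20, 0.50]])  # from A
pi = np.array([1/3, 1/3, 1/3])      # assumed uniform initial distribution

assert np.allclose(A.sum(axis=1), 1.0)  # each row is a valid probability distribution

rng = np.random.default_rng(42)
x = rng.choice(3, p=pi)
sequence = [states[x]]
for _ in range(9):
    x = rng.choice(3, p=A[x])       # the next state depends only on the current one
    sequence.append(states[x])
print(sequence)
```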
Hidden States and Observations
In Hidden Markov models, the observation sequence is generated by the underlying hidden states via a probabilistic emission process. The sequence of observations is denoted as O = O_1, O_2, \dots, O_T, where each O_t at time step t = 1, 2, \dots, T is produced conditionally on the hidden state X_t.[3] Observations in HMMs can be either discrete symbols from a finite alphabet V = \{ v_1, v_2, \dots, v_M \}, with M possible symbols, or continuous vectors in a multidimensional space.[3]

The emission probabilities capture this process and are represented by the matrix B = \{ b_j(k) \}, where b_j(k) = P(O_t = v_k \mid X_t = j) denotes the probability of observing symbol v_k when the hidden state is j, for j = 1, 2, \dots, N states and k = 1, 2, \dots, M.[3] This formulation assumes conditional independence of observations given the current hidden state: P(O_t \mid X_t, O_{1:t-1}, X_{1:t}) = P(O_t \mid X_t).[3] The full HMM is defined by the parameter set \lambda = (A, B, \pi), combining the emission probabilities B with the state transition matrix A and initial state distribution \pi.[3] For continuous observations, emission densities are commonly modeled using multivariate Gaussian distributions, where b_j(\mathbf{O}_t) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_j|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{O}_t - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{O}_t - \boldsymbol{\mu}_j) \right), with \boldsymbol{\mu}_j as the mean vector, \boldsymbol{\Sigma}_j as the covariance matrix for state j, and d as the observation dimension.[3]

An illustrative example is the biased coin toss HMM, where hidden states correspond to different coins with distinct biases, and observations are the outcomes of tosses (heads or tails). In this setup, one state might represent a fair coin with b_j(H) = b_j(T) = 0.5, while another represents a biased coin with b_j(H) = 0.6 and b_j(T) = 0.4; the sequence of observed heads and tails thus reflects the underlying, unseen sequence of coin selections and their biases.[3] The joint probability of a hidden state sequence X = X_1, X_2, \dots, X_T and the observation sequence O under model \lambda is given by P(X, O \mid \lambda) = \pi_{X_1} b_{X_1}(O_1) \prod_{t=2}^T a_{X_{t-1} X_t} b_{X_t}(O_t), which factors the initial probability, emissions, and transitions multiplicatively.[3]
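The joint-probability factorization can be evaluated mechanically for the coin-toss example. In the sketch below the emission probabilities follow the text (a fair coin and a coin biased toward heads); the transition matrix, initial distribution, and the particular state/observation sequences are illustrative assumptions.

```python
# Joint probability P(X, O | lambda) for a two-coin HMM.
import numpy as np

# state 0 = fair coin, state 1 = biased coin
pi = np.array([0.5, 0.5])                  # assumed initial distribution
A = np.array([[0.8, 0.2],                  # assumed transition matrix
              [0.2, 0.8]])
B = np.array([[0.5, 0.5],                  # emissions: columns are (H, T)
              [0.6, 0.4]])

def joint_probability(states, obs):
    """P(X, O | lambda) = pi_{x1} b_{x1}(o1) * prod_t a_{x_{t-1} x_t} b_{x_t}(o_t)."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

H, T = 0, 1
# e.g. hidden sequence fair, fair, biased emitting H, H, T
print(joint_probability([0, 0, 1], [H, H, T]))
```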
Computational Methods
Likelihood Computation (Forward Algorithm)
In hidden Markov models (HMMs), computing the likelihood of an observation sequence O = O_1, O_2, \dots, O_T given the model parameters \lambda = (A, B, \pi) is a fundamental evaluation task, as it quantifies how well the model explains the data. This likelihood is defined as P(O \mid \lambda) = \sum_X P(O, X \mid \lambda), where the sum is over all possible hidden state sequences X = X_1, X_2, \dots, X_T of length T, and P(O, X \mid \lambda) = \pi_{X_1} b_{X_1}(O_1) \prod_{t=2}^T a_{X_{t-1} X_t} b_{X_t}(O_t). Direct summation over all N^T possible state paths is computationally intractable for moderate T and number of states N > 1.[3]

The forward algorithm addresses this by using dynamic programming to compute the likelihood efficiently. It defines the forward variable \alpha_t(j) = P(O_1, O_2, \dots, O_t, X_t = j \mid \lambda) as the probability of the partial observation sequence up to time t and being in state j at time t, given \lambda. Initialization sets \alpha_1(j) = \pi_j b_j(O_1) for j = 1, \dots, N. The recursion proceeds as \alpha_{t+1}(j) = \left[ \sum_{i=1}^N \alpha_t(i) a_{ij} \right] b_j(O_{t+1}) for t = 1, \dots, T-1 and j = 1, \dots, N. Termination yields the likelihood as P(O \mid \lambda) = \sum_{j=1}^N \alpha_T(j). This approach has time complexity O(N^2 T), a polynomial improvement over the exponential cost of naive enumeration, making it feasible for practical sequence lengths.[3] The forward variables also enable filtering posterior computations, such as the probability of being in state j at time t given the observations up to t, via \gamma_t(j) = \alpha_t(j) / \sum_{k=1}^N \alpha_t(k). Full smoothing posteriors, conditioned on the entire observation sequence, require the forward-backward procedure incorporating backward variables.

To illustrate, consider a simple HMM for a coin-tossing experiment with two hidden states: a mildly biased coin (state 1, P(H) = 0.6) and a heavily biased coin (state 2, P(H) = 0.9), initial probabilities \pi_1 = 0.6, \pi_2 = 0.4, and transition matrix A = \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}.[3] For observation sequence O = H, T, H (T=3), the emission probabilities are b_1(H) = 0.6, b_1(T) = 0.4, b_2(H) = 0.9, b_2(T) = 0.1. The forward variables are computed as follows:

- At t=1 (O_1 = H): \alpha_1(1) = 0.6 \times 0.6 = 0.36, \alpha_1(2) = 0.4 \times 0.9 = 0.36.
- At t=2 (O_2 = T): \alpha_2(1) = [0.36 \times 0.7 + 0.36 \times 0.3] \times 0.4 = 0.144, \alpha_2(2) = [0.36 \times 0.3 + 0.36 \times 0.7] \times 0.1 = 0.036.
- At t=3 (O_3 = H): \alpha_3(1) = [0.144 \times 0.7 + 0.036 \times 0.3] \times 0.6 \approx 0.06696, \alpha_3(2) = [0.144 \times 0.3 + 0.036 \times 0.7] \times 0.9 \approx 0.06156.
- Termination: P(O \mid \lambda) = \alpha_3(1) + \alpha_3(2) \approx 0.1285.
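This worked example is straightforward to reproduce. The sketch below, assuming numpy and indexing states and symbols from 0, prints the same alpha vectors at each step and the final likelihood of about 0.1285.

```python
# Forward algorithm for the two-coin example above.
import numpy as np

pi = np.array([0.6, 0.4])                 # initial state probabilities
A = np.array([[0.7, 0.3],
              [0.3, 0.7]])                # transition matrix
B = np.array([[0.6, 0.4],                 # emissions: rows = state, cols = (H, T)
              [0.9, 0.1]])
H, T = 0, 1
obs = [H, T, H]

alpha = pi * B[:, obs[0]]                 # initialization: alpha_1(j) = pi_j b_j(O_1)
print(alpha)                              # [0.36, 0.36]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]         # recursion: sum over previous states, then emit
    print(alpha)                          # [0.144, 0.036], then [0.06696, 0.06156]
print("P(O | lambda) =", alpha.sum())     # termination: ~0.12852
```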
State Sequence Decoding (Viterbi Algorithm)
The state sequence decoding problem in hidden Markov models seeks the most likely sequence of hidden states X = x_1, x_2, \dots, x_T given an observation sequence O = o_1, o_2, \dots, o_T and model parameters \lambda = (A, B, \pi). This corresponds to \arg\max_X P(X \mid O, \lambda), which is equivalent to \arg\max_X P(O, X \mid \lambda) via maximum a posteriori (MAP) estimation, as P(O \mid \lambda) is constant for a fixed model.[11] The Viterbi algorithm solves this using dynamic programming to find the maximum-probability path through the state trellis, avoiding exhaustive enumeration of all N^T possible sequences. Originally developed by Andrew Viterbi for decoding convolutional codes in 1967, it was adapted for HMMs in the context of speech recognition and sequence modeling.[12][11]

The algorithm defines \delta_t(j) as the probability of the most likely path ending in state j at time t, accounting for the first t observations: \delta_t(j) = \max_{x_1, \dots, x_{t-1}} P(x_1, \dots, x_{t-1}, x_t = j, o_1, \dots, o_t \mid \lambda). Initialization sets \delta_1(j) = \pi_j b_j(o_1) for each state j = 1, \dots, N.[11] The recursion proceeds for t = 1 to T-1: \delta_{t+1}(j) = \max_i \left[ \delta_t(i) a_{ij} \right] b_j(o_{t+1}), with backpointers \psi_{t+1}(j) = \arg\max_i \left[ \delta_t(i) a_{ij} \right] to track the optimal preceding state.[11] Termination computes the highest probability P^* = \max_j \delta_T(j) and the final state x_T^* = \arg\max_j \delta_T(j); the full sequence is then backtracked as x_t^* = \psi_{t+1}(x_{t+1}^*) for t = T-1 down to 1.[11] The computational complexity is O(N^2 T), linear in the observation length T but quadratic in the number of states N, making it efficient for moderate-sized state spaces.[11]

For example, consider an HMM with two hidden states (rainy, sunny) modeling daily weather, where observations indicate rainy or sunny conditions (e.g., via direct reports). Given the observation sequence rainy, sunny, rainy over three days and assuming transition probabilities a_{rr} = 0.7, a_{rs} = 0.3, a_{sr} = 0.4, a_{ss} = 0.6, emission probabilities b_r(\text{rainy}) = 0.9, b_r(\text{sunny}) = 0.1, b_s(\text{rainy}) = 0.2, b_s(\text{sunny}) = 0.8, and uniform initials \pi_r = \pi_s = 0.5, the Viterbi algorithm computes \delta values and \psi backpointers iteratively. At t=1 (rainy), \delta_1(r) = 0.45, \delta_1(s) = 0.1, and \psi_1 is undefined (there is no predecessor). At t=2 (sunny), \delta_2(r) = 0.0315, \delta_2(s) = 0.108, \psi_2(r) = r, \psi_2(s) = r. At t=3 (rainy), \delta_3(r) \approx 0.0389, \delta_3(s) \approx 0.0130, \psi_3(r) = s, \psi_3(s) = s. Termination selects x_3^* = r (P^* \approx 0.0389), backtracks to x_2^* = s, x_1^* = r, recovering the sequence rainy-sunny-rainy as most likely.[11]

The Viterbi algorithm assumes a clear peak in the posterior distribution and excels with high-signal data where emissions strongly distinguish states; in low-signal or highly ambiguous cases, it selects one path but ignores the multiplicity of near-optimal alternatives.[11]
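The weather decoding example can be checked with a short script. The sketch below, assuming numpy and ordering states as (rainy, sunny), should recover the path rainy-sunny-rainy with P* of about 0.0389.

```python
# Viterbi decoding for the rainy/sunny example above.
import numpy as np

states = ["rainy", "sunny"]
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3],                 # rows/cols ordered (rainy, sunny)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],                 # emissions: cols are (rainy, sunny) reports
              [0.2, 0.8]])
obs = [0, 1, 0]                           # observed: rainy, sunny, rainy

delta = pi * B[:, obs[0]]                 # delta_1(j) = pi_j b_j(o_1)
psi = []                                  # backpointers, one array per time step t >= 2
for o in obs[1:]:
    scores = delta[:, None] * A           # scores[i, j] = delta_t(i) * a_ij
    psi.append(scores.argmax(axis=0))     # best predecessor i for each state j
    delta = scores.max(axis=0) * B[:, o]  # delta_{t+1}(j)

best_last = int(delta.argmax())           # final state x_T*
path = [best_last]
for back in reversed(psi):                # backtrack through the stored pointers
    path.append(int(back[path[-1]]))
path.reverse()
print([states[i] for i in path], "P* =", delta.max())  # ['rainy', 'sunny', 'rainy'], ~0.0389
```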
Parameter Learning (Baum-Welch Algorithm)
The parameter learning problem in hidden Markov models (HMMs) seeks to estimate the model parameters \lambda = (A, B, \pi)—comprising the state transition matrix A, the observation emission probabilities B, and the initial state distribution \pi—so as to maximize the likelihood P(O \mid \lambda) of an observed sequence O = O_1, O_2, \dots, O_T in the absence of state labels.[3] This unsupervised estimation is challenging due to the latent nature of the states, but it enables HMMs to be trained directly from data.[3]

The Baum-Welch algorithm addresses this by implementing the expectation-maximization (EM) framework, originally developed for probabilistic functions of Markov chains. In each iteration, the E-step computes expected counts over hidden state occupancies and transitions using the forward-backward procedure, while the M-step re-estimates the parameters to maximize the expected complete-data log-likelihood given those expectations.[3] The forward variables \alpha_t(j) from the forward algorithm provide P(O_1, \dots, O_t, X_t = j \mid \lambda), which are combined with backward variables in the E-step.[3] The backward variables are defined as \beta_t(j) = P(O_{t+1}, \dots, O_T \mid X_t = j, \lambda), initialized with \beta_T(j) = 1 for all states j, and computed recursively backward via \beta_t(i) = \sum_j a_{ij} b_j(O_{t+1}) \beta_{t+1}(j) for t = T-1, \dots, 1.[3] These enable the computation of posterior probabilities: the gamma \gamma_t(j) = P(X_t = j \mid O, \lambda) = \frac{\alpha_t(j) \beta_t(j)}{P(O \mid \lambda)}, representing the expected occupancy of state j at time t, and the xi \xi_t(i,j) = P(X_t = i, X_{t+1} = j \mid O, \lambda) = \frac{\alpha_t(i) a_{ij} b_j(O_{t+1}) \beta_{t+1}(j)}{P(O \mid \lambda)}, capturing expected transitions from i to j at t.[3]

In the M-step, parameters are updated using these expectations:

- Initial distribution: \pi_i = \gamma_1(i),
- Transitions: a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},
- Emissions (for discrete observations v_k): b_j(k) = \frac{\sum_{t: O_t = v_k} \gamma_t(j)}{\sum_{t=1}^T \gamma_t(j)}.[3]
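These E- and M-steps are iterated until the likelihood stops improving; each iteration is guaranteed not to decrease P(O \mid \lambda), though the procedure may converge to a local maximum. The sketch below, assuming numpy and using an assumed toy observation sequence and starting parameters chosen purely for illustration, performs one full Baum-Welch iteration for a two-state discrete HMM following the formulas above (unscaled forward/backward variables, which is adequate only for short sequences).

```python
# One Baum-Welch (EM) iteration for a discrete two-state HMM.
import numpy as np

obs = np.array([0, 1, 0, 0, 1])            # assumed toy sequence over symbols {0, 1}
N, M, T = 2, 2, len(obs)
pi = np.array([0.6, 0.4])                  # assumed starting parameters
A = np.array([[0.7, 0.3],
              [0.3, 0.7]])
B = np.array([[0.6, 0.4],
              [0.2, 0.8]])

# E-step: forward and backward variables
alpha = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
beta = np.zeros((T, N))
beta[-1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])   # beta_t(i) = sum_j a_ij b_j(O_{t+1}) beta_{t+1}(j)
likelihood = alpha[-1].sum()

gamma = alpha * beta / likelihood          # gamma_t(j) = P(X_t = j | O, lambda)
xi = np.zeros((T - 1, N, N))               # xi_t(i, j) = P(X_t = i, X_{t+1} = j | O, lambda)
for t in range(T - 1):
    xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / likelihood

# M-step: re-estimate pi, A, B from the expected counts
pi_new = gamma[0]
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
B_new = np.zeros((N, M))
for k in range(M):
    B_new[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
print(pi_new, A_new, B_new, sep="\n")
```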