
HMM

A hidden Markov model (HMM) is a statistical model that describes a system as a Markov process with unobserved (hidden) states, where the observable outputs are generated probabilistically from those states, allowing inference about the state sequence from the observations. The model is defined by a set of states, a transition probability matrix governing state changes, an emission probability distribution for observations from each state, and an initial state probability vector. HMMs assume the Markov property, meaning the next state depends only on the current state, not prior history, and observations are conditionally independent given the state. Developed in the late 1960s by Leonard E. Baum and colleagues at the Institute for Defense Analyses, HMMs built on earlier Markov chain theory to address probabilistic inference in sequences with latent variables. Key algorithms for HMMs include the Viterbi algorithm for finding the most likely state sequence (decoding), the forward algorithm for computing the probability of an observation sequence, and the Baum-Welch algorithm (an expectation-maximization method) for parameter estimation from data. These tools enable estimation of model parameters directly from data, making HMMs versatile for sequential data analysis. HMMs gained prominence in the 1970s and 1980s through applications in speech recognition, where they model acoustic signals as emissions from hidden phonetic states, powering systems such as CMU's Sphinx recognizer. In bioinformatics, HMMs are used for gene prediction by modeling nucleotide sequences with hidden states representing exons, introns, or intergenic regions, as in tools like GENSCAN. Other notable applications include part-of-speech tagging in natural language processing, gesture recognition in human-computer interaction, and anomaly detection in time series data. Despite their foundational role, HMMs have been extended or supplemented by models like conditional random fields for handling non-independent observations.

Background Concepts

Markov Chains

A Markov chain is a stochastic process satisfying the Markov property, which posits that the probability distribution for the next state depends only on the current state and not on the sequence of prior states. Formally, for a sequence of random variables X_1, X_2, \dots, X_t, \dots, the property is expressed as P(X_t = j \mid X_{t-1} = i_{t-1}, \dots, X_1 = i_1) = P(X_t = j \mid X_{t-1} = i_{t-1}) for all t \geq 2 and states i_1, \dots, i_{t-1}, j. The concept was introduced by Andrey Markov in his 1906 work on dependent quantities and later illustrated by his analysis of letter dependencies in Pushkin's Eugene Onegin.

In the context of discrete-time Markov chains (DTMCs), the state space S is typically finite or countably infinite, denoted as S = \{1, 2, \dots, N\} for a finite case with N states. The dynamics are governed by a transition probability matrix A = (a_{ij}), where a_{ij} = P(X_t = j \mid X_{t-1} = i) represents the probability of transitioning from state i to state j. Each entry satisfies a_{ij} \geq 0, and the rows sum to 1: \sum_{j=1}^N a_{ij} = 1 for all i, ensuring the matrix is row-stochastic. The process begins with an initial state distribution \pi = (\pi_1, \pi_2, \dots, \pi_N), where \pi_i = P(X_1 = i) and \sum_{i=1}^N \pi_i = 1. The joint probability of a specific state sequence x_1, x_2, \dots, x_T is then given by P(X_1 = x_1, \dots, X_T = x_T) = \pi_{x_1} \prod_{t=2}^T a_{x_{t-1} x_t}. This formulation captures the chain's memoryless progression from one state to the next.

A key long-term property of Markov chains is the stationary distribution, a distribution \pi that remains invariant under transitions, satisfying the balance equation \pi = \pi A along with the normalization constraint \sum_i \pi_i = 1. For ergodic chains, namely those that are irreducible (any state reachable from any other) and aperiodic (no cyclic structure forcing periodic returns), this stationary distribution is unique and represents the limiting distribution as t \to \infty, independent of the initial state: \lim_{t \to \infty} P(X_t = i) = \pi_i for all i. Ergodicity ensures the chain explores the state space thoroughly over time, so that time averages converge to the stationary behavior.

A simple illustrative example is a two-state weather model with states sunny (1) and rainy (2). The transition matrix might be A = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}, indicating a 90% chance of sunny following sunny and an 80% chance of rainy following rainy. Assuming an initial distribution \pi = (0.5, 0.5), the probability of a sequence like sunny-rainy-sunny is 0.5 \times 0.1 \times 0.2 = 0.01. Solving \pi = \pi A yields the stationary distribution \pi \approx (0.667, 0.333), suggesting long-run proportions of about two-thirds sunny days and one-third rainy. This model highlights how transition probabilities dictate steady-state tendencies without external influences.
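A minimal Python sketch (assuming NumPy) reproduces both calculations for the weather chain above: the probability of the sequence sunny-rainy-sunny, and the stationary distribution obtained as the left eigenvector of A for eigenvalue 1.

```python
import numpy as np

# Two-state weather chain from the example above: state 0 = sunny, 1 = rainy.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])

# Probability of the sequence sunny -> rainy -> sunny.
seq = [0, 1, 0]
p = pi[seq[0]]
for prev, nxt in zip(seq[:-1], seq[1:]):
    p *= A[prev, nxt]
print(p)  # 0.5 * 0.1 * 0.2 = 0.01

# Stationary distribution: the left eigenvector of A for eigenvalue 1,
# normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(A.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary /= stationary.sum()
print(stationary)  # approximately [0.667, 0.333]
```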

Observation Models

In probabilistic models involving sequential data, an observation model defines the probability distribution that generates observable outputs from underlying states, providing a mechanism to link latent processes to measurable events. This framework is foundational in Markovian systems, where states evolve over time and observations serve as proxies for the states when the states themselves are not directly visible.

For discrete observations, the emission probability is specified as b_j(k) = P(O_t = k \mid X_t = j), where O_t represents the observation at time t, X_t is the state at time t, j indexes one of N possible states, and k indexes one of M possible observation symbols. These probabilities form an N \times M emission matrix B, with each row summing to 1 to ensure a valid probability distribution over observations for a given state. This discrete setup is common in applications like symbol sequences, where states deterministically or stochastically map to a finite set of outputs.

In the case of continuous observations, the emission probabilities are often modeled using multivariate Gaussian distributions, with b_j(O_t) = \mathcal{N}(O_t; \mu_j, \Sigma_j), where \mu_j is the mean vector and \Sigma_j is the covariance matrix associated with state j. This parametric form assumes that observations are drawn independently from a normal distribution centered around the state's characteristic parameters, accommodating real-valued data such as sensor measurements. More complex variants, like Gaussian mixture models, extend this by weighting multiple Gaussians per state, but the single Gaussian case establishes the core continuous emission concept.

A representative example arises in a Markov chain modeling weather patterns, where states correspond to conditions like sunny, rainy, or cloudy; observations are direct indicators of the state in the noiseless case, but the model extends to noisy measurements, such as temperature readings emitted from each state with added Gaussian noise. For instance, a sunny state might emit average temperatures around 25°C with low variance, while a rainy state emits cooler readings around 15°C with similar variability, allowing the model to generate realistic sequences of observations from known state transitions.

Key properties of observation models include the conditional independence of emissions given the current state, expressed as P(O_t \mid X_t, X_{t-1}, \dots, O_{t-1}) = P(O_t \mid X_t), ensuring that past states and observations do not influence the current emission beyond the present state. This Markovian assumption simplifies computation and aligns with the sequential nature of the underlying state process. The joint probability distribution over states and observations in such a model is given by P(X_1, \dots, X_T, O_1, \dots, O_T) = \pi_{X_1} b_{X_1}(O_1) \prod_{t=2}^T a_{X_{t-1} X_t} b_{X_t}(O_t), where \pi_{X_1} is the initial state probability, a_{ij} = P(X_t = j \mid X_{t-1} = i) are the state transition probabilities, and the product captures the chained dependencies. This formulation integrates the observation model with the state dynamics, enabling the generation of complete observation sequences from initial conditions.
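The continuous-emission case can be illustrated with a short sketch that samples a weather-state path and the Gaussian temperature readings it emits. The transition matrix is reused from the Markov-chain example, the 25°C and 15°C means follow the text, and the shared 2°C standard deviation is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# State 0 = sunny, 1 = rainy; chain parameters reuse the weather example.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])

# Gaussian emission parameters per state: means of 25 C (sunny) and 15 C (rainy)
# from the text; the shared 2 C standard deviation is an illustrative assumption.
means = np.array([25.0, 15.0])
stds = np.array([2.0, 2.0])

def sample_sequence(T):
    """Sample a state path and the temperatures emitted along it."""
    states = np.empty(T, dtype=int)
    states[0] = rng.choice(2, p=pi)
    for t in range(1, T):
        states[t] = rng.choice(2, p=A[states[t - 1]])
    temps = rng.normal(means[states], stds[states])
    return states, temps

states, temps = sample_sequence(10)
print(states)
print(np.round(temps, 1))
```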

Model Formalism

State Space and Transitions

In a hidden Markov model (HMM), the hidden state space consists of a finite set of latent states S = \{1, 2, \dots, N\}, where N is the number of states; these states represent underlying processes that generate the observed data but are not directly measurable. The states evolve over discrete time steps according to the first-order Markov property, which assumes that the probability of the next state depends solely on the current state: P(X_t = j \mid X_{t-1} = i, X_{t-2}, \dots, X_1) = P(X_t = j \mid X_{t-1} = i) = a_{ij}. This property simplifies the modeling of sequential dependencies by ignoring longer historical context.

The transition probabilities \{a_{ij}\} for 1 \leq i, j \leq N form the state transition matrix A, a stochastic matrix where each row sums to 1 (\sum_{j=1}^N a_{ij} = 1 for all i), ensuring valid probability distributions over possible next states. HMMs typically assume homogeneous transitions, meaning the matrix A is time-invariant across the sequence, although extensions to time-varying transitions exist in more advanced variants. The states are assumed to be finite and discrete, facilitating tractable computation, though continuous-state generalizations have been developed for specific applications. The process begins with an initial state distribution \pi = \{\pi_i\}, where \pi_i = P(X_1 = i) for 1 \leq i \leq N and \sum_{i=1}^N \pi_i = 1, capturing the starting probabilities of the hidden chain. Together, these components define the hidden state dynamics of the HMM, denoted compactly as the model \lambda = (A, B, \pi), where B specifies the observation probabilities linking states to data (detailed separately).

Graphically, the state space and transitions are often depicted as a directed graph, with nodes representing the N states and directed edges labeled with the corresponding a_{ij} weights, illustrating the probabilistic flow between states. For instance, in part-of-speech tagging, a simple three-state HMM might use states for nouns (N), verbs (V), and adjectives (A), with the transition matrix A encoding typical linguistic sequences, such as a high probability of noun-to-verb transitions in subject-verb structures and lower probabilities for less common shifts like verb-to-noun. A representative transition matrix could appear as follows:
From \ To    N      V      A
N            0.70   0.20   0.10
V            0.10   0.80   0.10
A            0.30   0.20   0.50
This structure prioritizes self-transitions and common syntactic progressions, as estimated from tagged corpora in HMM-based taggers.
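A brief sketch (assuming NumPy) shows how such a matrix is represented and validated in practice; the uniform initial tag distribution and the example tag path are illustrative assumptions rather than corpus estimates.

```python
import numpy as np

# Transition matrix from the table above; rows are the current tag,
# columns the next tag, in the order N, V, A.
tags = ["N", "V", "A"]
A = np.array([[0.70, 0.20, 0.10],
              [0.10, 0.80, 0.10],
              [0.30, 0.20, 0.50]])

# Every row must sum to 1 for A to be a valid stochastic matrix.
assert np.allclose(A.sum(axis=1), 1.0)

# Probability of the tag path N -> V -> A -> A under a uniform initial
# distribution (an illustrative assumption).
pi = np.full(3, 1.0 / 3.0)
path = [0, 1, 2, 2]
p = pi[path[0]] * np.prod([A[i, j] for i, j in zip(path[:-1], path[1:])])
print(f"P(N V A A) = {p:.4f}")  # (1/3) * 0.20 * 0.10 * 0.50 = 0.0033
```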

Hidden States and Observations

In hidden Markov models, the observation sequence is generated by the underlying hidden states via a probabilistic emission process. The sequence of observations is denoted as O = O_1, O_2, \dots, O_T, where each O_t at time step t = 1, 2, \dots, T is produced conditionally on the hidden state X_t. Observations in HMMs can be either discrete symbols from a finite alphabet V = \{ v_1, v_2, \dots, v_M \}, with M possible symbols, or continuous vectors in a multidimensional space.

The emission probabilities capture this process and are represented by the matrix B = \{ b_j(k) \}, where b_j(k) = P(O_t = v_k | X_t = j) denotes the probability of observing symbol v_k when the hidden state is j, for j = 1, 2, \dots, N states and k = 1, 2, \dots, M. This formulation assumes conditional independence of observations given the current hidden state: P(O_t | X_t, O_{1:t-1}, X_{1:t}) = P(O_t | X_t). The full HMM is defined by the parameter set \lambda = (A, B, \pi), combining the emission probabilities B with the transition matrix A and initial state distribution \pi. For continuous observations, emission densities are commonly modeled using multivariate Gaussian distributions, where b_j(\mathbf{O}_t) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_j|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{O}_t - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{O}_t - \boldsymbol{\mu}_j) \right), with \boldsymbol{\mu}_j as the mean vector, \boldsymbol{\Sigma}_j as the covariance matrix for state j, and d as the observation dimension.

An illustrative example is the biased coin toss HMM, where hidden states correspond to different coins with distinct biases, and observations are the outcomes of tosses (heads or tails). In this setup, one state might represent a fair coin with b_j(H) = b_j(T) = 0.5, while another represents a biased coin with b_j(H) = 0.6 and b_j(T) = 0.4; the sequence of observed heads and tails thus reflects the underlying, unseen sequence of coin selections and their biases. The joint probability of a hidden state sequence X = X_1, X_2, \dots, X_T and the observation sequence O under model \lambda is given by P(X, O | \lambda) = \pi_{X_1} b_{X_1}(O_1) \prod_{t=2}^T a_{X_{t-1} X_t} b_{X_t}(O_t), which factors the initial probability, emissions, and transitions multiplicatively.
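The joint factorization can be made concrete with a short sketch for the biased-coin example. The emission values follow the text, while the transition matrix and initial distribution below are illustrative assumptions not specified above.

```python
import numpy as np

# Biased-coin HMM from the example: state 0 is the fair coin, state 1 the
# biased coin. Emission rows give P(H), P(T) per state (values from the text);
# the transition matrix and initial distribution are illustrative assumptions.
B = np.array([[0.5, 0.5],
              [0.6, 0.4]])
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])

def joint_prob(states, obs):
    """P(X, O | lambda) = pi_{x1} b_{x1}(o1) * prod_t a_{x_{t-1} x_t} b_{x_t}(o_t)."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

# Observation indices: 0 = heads, 1 = tails.
print(joint_prob(states=[0, 0, 1], obs=[0, 1, 0]))
```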

Computational Methods

Likelihood Computation (Forward Algorithm)

In hidden Markov models (HMMs), computing the likelihood of an observation sequence O = O_1, O_2, \dots, O_T given the model parameters \lambda = (A, B, \pi) is a fundamental evaluation task, as it quantifies how well the model explains the data. This likelihood is defined as P(O \mid \lambda) = \sum_X P(O, X \mid \lambda), where the sum is over all possible hidden state sequences X = X_1, X_2, \dots, X_T of length T, and P(O, X \mid \lambda) = \pi_{X_1} b_{X_1}(O_1) \prod_{t=2}^T a_{X_{t-1} X_t} b_{X_t}(O_t). Direct summation over all N^T possible state paths is computationally intractable for moderate T and number of states N > 1.

The forward algorithm addresses this by using dynamic programming to compute the likelihood efficiently. It defines the forward variable \alpha_t(j) = P(O_1, O_2, \dots, O_t, X_t = j \mid \lambda) as the probability of the partial observation sequence up to time t and being in state j at time t, given \lambda. Initialization sets \alpha_1(j) = \pi_j b_j(O_1) for j = 1, \dots, N. The recursion proceeds as \alpha_{t+1}(j) = \left[ \sum_{i=1}^N \alpha_t(i) a_{ij} \right] b_j(O_{t+1}) for t = 1, \dots, T-1 and j = 1, \dots, N. Termination yields the likelihood as P(O \mid \lambda) = \sum_{j=1}^N \alpha_T(j). This approach has time complexity O(N^2 T), a dramatic improvement over the exponential cost of naive enumeration, making it feasible for practical sequence lengths. The forward variables also enable filtering computations, such as the probability of being in state j at time t given the observations up to t, via \gamma_t(j) = \alpha_t(j) / \sum_{k=1}^N \alpha_t(k). Full posteriors, given the entire observation sequence, require the forward-backward procedure incorporating backward variables.

To illustrate, consider a simple HMM for a coin-tossing experiment with two hidden states: a mildly biased coin (state 1, P(H) = 0.6) and a heavily biased coin (state 2, P(H) = 0.9), initial probabilities \pi_1 = 0.6, \pi_2 = 0.4, and transition matrix A = \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}. For observation sequence O = H, T, H (T=3), the emission probabilities are b_1(H) = 0.6, b_1(T) = 0.4, b_2(H) = 0.9, b_2(T) = 0.1. The forward variables are computed as follows:
  • At t=1 (O_1 = H): \alpha_1(1) = 0.6 \times 0.6 = 0.36, \alpha_1(2) = 0.4 \times 0.9 = 0.36.
  • At t=2 (O_2 = T): \alpha_2(1) = [0.36 \times 0.7 + 0.36 \times 0.3] \times 0.4 = 0.144, \alpha_2(2) = [0.36 \times 0.3 + 0.36 \times 0.7] \times 0.1 = 0.036.
  • At t=3 (O_3 = H): \alpha_3(1) = [0.144 \times 0.7 + 0.036 \times 0.3] \times 0.6 \approx 0.06696, \alpha_3(2) = [0.144 \times 0.3 + 0.036 \times 0.7] \times 0.9 \approx 0.06156.
The likelihood is P(O \mid \lambda) = 0.06696 + 0.06156 \approx 0.12852. This step-by-step calculation demonstrates how the algorithm marginalizes over paths while avoiding redundant computations.
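The recursion translates directly into a few lines of code. The following sketch (assuming NumPy) implements the forward algorithm and reproduces the worked coin-tossing example above.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns alpha (T x N) and the likelihood P(O | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # recursion
    return alpha, alpha[-1].sum()                     # termination

# Coin-tossing HMM from the worked example; observation 0 = H, 1 = T.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.3, 0.7]])
B = np.array([[0.6, 0.4],
              [0.9, 0.1]])

alpha, likelihood = forward(pi, A, B, obs=[0, 1, 0])
print(alpha)       # rows match the alpha values computed above
print(likelihood)  # approximately 0.12852
```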

State Sequence Decoding (Viterbi Algorithm)

The state sequence decoding problem in hidden Markov models seeks the most likely sequence of hidden states X = x_1, x_2, \dots, x_T given an observation sequence O = o_1, o_2, \dots, o_T and model parameters \lambda = (A, B, \pi). This corresponds to \arg\max_X P(X \mid O, \lambda), which is equivalent to \arg\max_X P(O, X \mid \lambda) via maximum a posteriori (MAP) estimation, as P(O \mid \lambda) is constant for a fixed model. The Viterbi algorithm solves this using dynamic programming to find the maximum-probability path through the state trellis, avoiding exhaustive enumeration of all N^T possible sequences. Originally developed by Andrew Viterbi for decoding convolutional codes in 1967, it was adapted for HMMs in the context of speech recognition and sequence modeling.

The algorithm defines \delta_t(j) as the probability of the most likely path ending in state j at time t, accounting for the first t observations: \delta_t(j) = \max_{x_1, \dots, x_{t-1}} P(x_1, \dots, x_{t-1}, x_t = j, o_1, \dots, o_t \mid \lambda). Initialization sets \delta_1(j) = \pi_j b_j(o_1) for each state j = 1, \dots, N. The recursion proceeds for t = 1 to T-1: \delta_{t+1}(j) = \max_i \left[ \delta_t(i) a_{ij} \right] b_j(o_{t+1}), with backpointers \psi_{t+1}(j) = \arg\max_i \left[ \delta_t(i) a_{ij} \right] to track the optimal preceding state. Termination computes the highest probability P^* = \max_j \delta_T(j) and final state x_T^* = \arg\max_j \delta_T(j); the full sequence is then backtracked as x_t^* = \psi_{t+1}(x_{t+1}^*) for t = T-1 down to 1. The computational complexity is O(N^2 T), linear in the observation length T but quadratic in the number of states N, making it efficient for moderate-sized state spaces.

For example, consider an HMM with two hidden states (rainy, sunny) modeling daily weather, where observations are noisy indicators of rainy or sunny conditions. Given the observation sequence rainy, sunny, rainy over three days and assuming transition probabilities a_{rr} = 0.7, a_{rs} = 0.3, a_{sr} = 0.4, a_{ss} = 0.6, emission probabilities b_r(\text{rainy}) = 0.9, b_r(\text{sunny}) = 0.1, b_s(\text{rainy}) = 0.2, b_s(\text{sunny}) = 0.8, and uniform initial probabilities \pi_r = \pi_s = 0.5, the Viterbi algorithm computes \delta values and \psi backpointers iteratively. At t=1 (rainy), \delta_1(r) = 0.45, \delta_1(s) = 0.1, and \psi_1 is undefined. At t=2 (sunny), \delta_2(r) = 0.0315, \delta_2(s) = 0.108, \psi_2(r) = r, \psi_2(s) = r. At t=3 (rainy), \delta_3(r) \approx 0.0389, \delta_3(s) \approx 0.0130, \psi_3(r) = s, \psi_3(s) = s. Termination selects x_3^* = r (P^* \approx 0.0389), and backtracking yields x_2^* = s, x_1^* = r, recovering the sequence rainy-sunny-rainy as most likely. The Viterbi algorithm assumes a clear peak in the posterior distribution and excels with high-signal data where emissions strongly distinguish states; in low-signal or highly ambiguous cases, it selects one path but ignores the multiplicity of near-optimal alternatives.
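The following sketch (assuming NumPy) implements the recursion, backpointers, and backtracking, and reproduces the weather example above with states indexed 0 = rainy, 1 = sunny.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi decoding: returns the most likely state path and its probability."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)              # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                   # backtrack through the backpointers
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()

# Weather HMM from the example: state 0 = rainy, 1 = sunny; obs 0 = rainy, 1 = sunny.
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

path, p_star = viterbi(pi, A, B, obs=[0, 1, 0])
print(path, p_star)  # [0, 1, 0] (rainy-sunny-rainy), approximately 0.0389
```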

Parameter Learning (Baum-Welch Algorithm)

The parameter learning problem in hidden Markov models (HMMs) seeks to estimate the model parameters \lambda = (A, B, \pi), comprising the state transition matrix A, the observation emission probabilities B, and the initial state distribution \pi, so as to maximize the likelihood P(O \mid \lambda) of an observed sequence O = O_1, O_2, \dots, O_T in the absence of state labels. This unsupervised estimation is challenging due to the latent nature of the states, but it enables HMMs to be trained directly from data.

The Baum-Welch algorithm addresses this by implementing the expectation-maximization (EM) framework, originally developed for probabilistic functions of Markov chains. In each iteration, the E-step computes expected counts over hidden state occupancies and transitions using the forward-backward procedure, while the M-step re-estimates the parameters to maximize the expected complete-data log-likelihood given those expectations. The forward variables \alpha_t(j) from the forward algorithm provide P(O_1, \dots, O_t, X_t = j \mid \lambda), which are combined with backward variables in the E-step. The backward variables are defined as \beta_t(j) = P(O_{t+1}, \dots, O_T \mid X_t = j, \lambda), initialized with \beta_T(j) = 1 for all states j, and computed recursively backward via \beta_t(i) = \sum_j a_{ij} b_j(O_{t+1}) \beta_{t+1}(j) for t = T-1, \dots, 1. These enable the computation of posterior probabilities: the gamma variable \gamma_t(j) = P(X_t = j \mid O, \lambda) = \frac{\alpha_t(j) \beta_t(j)}{P(O \mid \lambda)}, representing the expected occupancy of state j at time t, and the xi variable \xi_t(i,j) = P(X_t = i, X_{t+1} = j \mid O, \lambda) = \frac{\alpha_t(i) a_{ij} b_j(O_{t+1}) \beta_{t+1}(j)}{P(O \mid \lambda)}, capturing expected transitions from i to j at time t. In the M-step, parameters are updated using these expectations:
  • Initial distribution: \pi_i = \gamma_1(i),
  • Transitions: a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},
  • Emissions (for discrete observations v_k): b_j(k) = \frac{\sum_{t: O_t = v_k} \gamma_t(j)}{\sum_{t=1}^T \gamma_t(j)}.
These closed-form updates guarantee a non-decreasing log-likelihood at each iteration. The algorithm converges to a local maximum of the log-likelihood, though the optimum reached depends on initialization; random starts are common, with multiple runs recommended to select the best result based on final likelihood. As an illustrative example, consider sequences from coin tosses where hidden states correspond to different coins (fair or loaded), and observations are heads (H) or tails (T). Starting with initial parameters (e.g., near-uniform transitions and emissions), Baum-Welch iterations refine the estimates, such as increasing the heads probability for the loaded state, yielding progressively higher sequence likelihoods, often converging in 10–20 steps for short sequences like H T T T.
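A compact sketch of the procedure for a single discrete observation sequence is shown below; it omits numerical scaling, so it is only intended for short sequences, and the random initialization and iteration count are illustrative choices rather than prescriptions.

```python
import numpy as np

def baum_welch(obs, N, M, n_iter=20, seed=0):
    """Single-sequence Baum-Welch for a discrete HMM; a minimal sketch without
    scaling, suitable only for short sequences."""
    rng = np.random.default_rng(seed)
    # Random row-stochastic initialization of the parameters.
    pi = rng.dirichlet(np.ones(N))
    A = rng.dirichlet(np.ones(N), size=N)
    B = rng.dirichlet(np.ones(M), size=N)
    T = len(obs)
    for _ in range(n_iter):
        # E-step: forward and backward variables.
        alpha = np.zeros((T, N))
        beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[-1].sum()
        gamma = alpha * beta / likelihood                          # gamma_t(j)
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood  # xi_t(i, j)
        # M-step: closed-form re-estimation.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return pi, A, B

# Coin-toss observations (0 = H, 1 = T), e.g. the short sequence H T T T.
obs = np.array([0, 1, 1, 1])
pi, A, B = baum_welch(obs, N=2, M=2)
print(np.round(B, 3))  # learned per-state emission probabilities
```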

Practical Applications

Speech and Pattern Recognition

Hidden Markov models (HMMs) have been extensively applied in speech recognition, where they model the temporal dynamics of acoustic signals by representing phonemes or subword units as hidden states. Observations are typically acoustic features such as Mel-frequency cepstral coefficients (MFCCs), modeled using Gaussian mixture distributions to capture the variability in speech production. In early systems like the CMU Sphinx recognizer, HMMs were integrated with language models to decode continuous speech, enabling speaker-independent recognition of large vocabularies.

In part-of-speech (POS) tagging, HMMs treat grammatical tags (e.g., noun, verb) as hidden states, with words as observations and transition probabilities reflecting syntactic dependencies. The Viterbi algorithm is commonly used to find the most likely tag sequence for a given sentence, achieving high accuracy on standard corpora by leveraging statistical patterns in text. This approach dominated stochastic tagging methods in the 1990s, providing a robust framework for natural language processing tasks.

HMMs also facilitate handwriting recognition and gesture recognition by modeling sequential trajectories as state paths, with emissions derived from feature vectors like stroke coordinates or velocity profiles. For handwriting recognition, states represent character segments, allowing recognition of cursive scripts through dynamic programming alignment. In gesture recognition, multidimensional HMMs capture spatial and temporal variations in hand movements, enabling real-time interfaces for human-computer interaction.

From the 1980s through the early 2010s, HMMs dominated automatic speech recognition (ASR) systems, forming the backbone of commercial and research implementations due to their effectiveness in handling sequential data. More recently, HMMs have been hybridized with deep neural networks (DNNs), where DNNs replace Gaussian emissions for better acoustic modeling, leading to substantial reductions in error rates while retaining the HMM structure for decoding. A representative example is isolated word recognition, in which HMM states model subword units like phonemes, with parameters estimated using the Baum-Welch algorithm on labeled audio datasets featuring MFCC extractions. Such systems have demonstrated word error rate (WER) improvements of up to 30% through refinements like triphone modeling, establishing HMMs as a benchmark for sequential pattern recognition tasks.

HMMs are also used in anomaly detection for time series data, where normal behavior is modeled as a sequence of hidden states, and deviations in observations indicate anomalies. This application is common in network intrusion detection and financial fraud monitoring, using the forward algorithm to compute likelihoods and flag low-probability sequences.
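A minimal sketch of likelihood-based anomaly flagging is shown below. The "normal" model parameters, the three-symbol alphabet, and the threshold are illustrative assumptions; in practice the threshold would be chosen from held-out normal data.

```python
import numpy as np

def avg_log_likelihood(pi, A, B, obs):
    """Average per-observation log-likelihood under a discrete HMM (scaled forward pass)."""
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for t in range(1, len(obs)):
        log_p += np.log(alpha.sum())
        alpha = (alpha / alpha.sum()) @ A * B[:, obs[t]]  # scaled to avoid underflow
    log_p += np.log(alpha.sum())
    return log_p / len(obs)

# Model of "normal" behavior (illustrative parameters) and an assumed threshold.
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05],
              [0.05, 0.95]])
B = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1]])
threshold = -1.5

for seq in ([0, 0, 0, 1, 1, 1], [2, 2, 2, 2, 2, 2]):
    score = avg_log_likelihood(pi, A, B, np.array(seq))
    print(seq, round(score, 3), "anomaly" if score < threshold else "normal")
```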

Bioinformatics and Genetics

Hidden Markov models (HMMs) play a pivotal role in bioinformatics by modeling the probabilistic structure of biological sequences, where hidden states represent underlying features like genomic regions or structural elements, and observations correspond to nucleotides or amino acids. This framework excels in tasks involving sequential dependencies, such as gene annotation and evolutionary analysis, by leveraging algorithms like the forward algorithm for likelihood computation. In genetics, HMMs facilitate the interpretation of complex data from DNA and protein sequences, enabling predictions that inform functional genomics and evolutionary biology.

In gene finding, HMMs distinguish coding from non-coding regions by defining states for exons, introns, and intergenic spacers, with emission probabilities capturing the compositional biases of each. The GENSCAN tool exemplifies this approach, using a generalized HMM to predict complete eukaryotic gene structures, including splice sites and exon boundaries, by integrating length distributions and sequence motifs into state transitions. This method has achieved high accuracy in identifying genes in human and other eukaryotic genomes, outperforming earlier rule-based predictors.

For multiple sequence alignment, profile HMMs extend standard models to represent conserved motifs and variable regions, with states denoting match, insert, or delete columns and transitions penalizing gaps. The HMMER software implements these profiles to align related protein or DNA sequences, particularly for detecting distant homologs where pairwise methods falter; the Viterbi algorithm identifies the optimal alignment path. Profile HMMs have revolutionized homology searches by providing position-specific scoring superior to simple profiles.

Protein secondary structure prediction employs HMMs with states for alpha-helices, beta-sheets, and random coils, where emissions are based on amino acid propensities and physicochemical properties. Observations from single sequences or multiple alignments inform transitions that model segment lengths and turns, yielding predictions with accuracies around 65% in mid-1990s implementations. This state-based modeling captures local structural propensities more effectively than window-based methods.

Discrete HMMs in phylogenetic modeling treat evolutionary rate variation across sequence sites as hidden states, with emissions reflecting substitution probabilities under site-specific rates. This allows reconstruction of ancestral sequences and phylogenies by accounting for heterogeneity without assuming uniform evolution, as demonstrated in models that infer rate categories from aligned molecular data. Such approaches improve likelihood estimates for tree inference compared to constant-rate models.

An illustrative example is CpG island detection, where states represent CpG-rich (island) and non-island regions; emissions favor high CG dinucleotide probabilities in islands versus CG-suppressed composition elsewhere, with the forward algorithm calculating the probability of a sequence segment originating from an island. This binary-state HMM identifies promoter-associated regions critical for gene regulation.

The impact of HMMs in bioinformatics includes enabling sensitive search tools like HMMER as alternatives to BLAST, accelerating large-scale annotation with greater remote homology detection. Recent advancements integrate HMMs with deep learning, as in the Tiberius predictor, which combines neural networks for feature extraction with HMM decoding to boost eukaryotic gene finding accuracy beyond traditional methods.
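A simplified sketch of the CpG-island idea uses single-nucleotide emissions rather than the dinucleotide model described above; all parameter values are illustrative assumptions, and the forward-backward recursions yield the per-position posterior probability of being in the island state.

```python
import numpy as np

# Simplified two-state CpG HMM: state 0 = island (CG-rich), state 1 = background.
# Single-nucleotide emissions stand in for the dinucleotide model described
# above; all parameter values are illustrative assumptions.
nuc = {"A": 0, "C": 1, "G": 2, "T": 3}
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
B = np.array([[0.15, 0.35, 0.35, 0.15],   # island favors C and G
              [0.30, 0.20, 0.20, 0.30]])  # background favors A and T

def island_posterior(seq):
    """Forward-backward posterior P(state = island | sequence) at each position."""
    obs = np.array([nuc[c] for c in seq])
    T = len(obs)
    alpha = np.zeros((T, 2))
    beta = np.zeros((T, 2))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma[:, 0] / gamma.sum(axis=1)

print(np.round(island_posterior("ATATCGCGCGATAT"), 2))
```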

Developments and Variants

Historical Context

The foundations of hidden Markov models (HMMs) trace back to the early 20th century with the development of Markov chains by Andrey Markov, who in 1906 introduced the concept of processes where future states depend only on the current state, laying the groundwork for modeling sequential dependencies. In the late 1960s, Leonard E. Baum and colleagues at the Institute for Defense Analyses (IDA) formalized HMM theory through a series of papers. This work included adapting the Viterbi algorithm (originally developed by Andrew Viterbi in 1967 for decoding convolutional error-correcting codes in communication systems) to find optimal state sequences in hidden processes, and culminated in the 1970 Baum-Welch algorithm, developed with Lloyd R. Welch, which provided a method for maximum likelihood parameter estimation in HMMs using an expectation-maximization approach. The 1970s marked key milestones in practical application, particularly in speech recognition through DARPA-funded projects like the Speech Understanding Research (SUR) program (1971–1976), during which HMM-based systems at institutions such as Carnegie Mellon University handled continuous speech with vocabularies up to 1,000 words. By the late 1980s, HMMs gained traction in bioinformatics as DNA sequencing technologies proliferated, with early applications in sequence modeling emerging around 1989 (e.g., Churchill's work on heterogeneous DNA sequences), followed by gene finding and protein family modeling in the early to mid-1990s, exemplified by models like EcoParse for analyzing bacterial genes and evolutionary patterns. Lawrence R. Rabiner played a pivotal role in popularizing HMMs through his influential 1989 tutorial in the Proceedings of the IEEE, which synthesized the theory and applications, particularly in speech, and became a standard reference for researchers. Implementations advanced through researchers at AT&T Bell Laboratories, including Stephen Levinson's work on statistical speech modeling in the 1980s. While pure HMMs dominated sequence modeling until the early 2010s, their standalone use declined with the rise of deep learning techniques, though they remain foundational in hybrid systems for tasks like acoustic modeling.

Extensions and Advanced Models

Hidden Markov models (HMMs) have been extended in various ways to address limitations of the basic discrete-state framework, such as handling continuous or infinite state spaces, incorporating hierarchical structures, and accommodating covariates or non-exponential state durations. These generalizations enable modeling of more complex dependencies in sequential data while maintaining the core probabilistic foundations of HMMs.

One key extension involves continuous state spaces, where the hidden states evolve according to continuous stochastic processes or are modeled using Gaussian processes to capture smooth, infinite-dimensional dynamics. For instance, switching state-space models allow the hidden process to switch between regimes governed by stochastic differential equations, providing flexibility for applications like financial time series or neural dynamics where abrupt changes occur within continuous trajectories. Similarly, HMMs augmented with Gaussian processes for state transitions enable nonparametric modeling of latent functions, improving inference in scenarios with limited discrete assumptions.

Hierarchical hidden Markov models (HHMMs) introduce nested structures, where super-states contain sub-HMMs, facilitating multi-scale representations of sequences with shared substructures. This recursive formulation, analyzed in seminal work, allows efficient inference via extensions of the forward-backward algorithm, making it suitable for modeling layered processes like speech prosody or handwriting. For example, in gesture recognition, HHMMs can represent complex motions where top-level states denote overall actions (e.g., a waving gesture) and sub-states capture phases (e.g., raise and lower), enabling robust decoding of temporal hierarchies. HHMMs have proven effective in video analysis by capturing both global and local patterns without exponential complexity in state space.

Profile HMMs extend the standard model to handle biological sequences with insertions and deletions by incorporating explicit insert and delete states alongside match states, forming a linear architecture that aligns variable-length inputs to a consensus profile. This variant, widely used in tools like HMMER for identifying distant protein homologs, improves sensitivity over basic HMMs by modeling position-specific conservation and gaps explicitly, as demonstrated in early applications to multiple sequence alignments. Profile HMMs thus address alignment ambiguities in evolutionary data without altering the core emission and transition framework.

Bayesian HMMs incorporate prior distributions on transition matrices A and emission parameters B, typically using Dirichlet priors for their conjugacy with multinomial likelihoods, and employ Markov chain Monte Carlo (MCMC) methods like Gibbs sampling for posterior inference to quantify uncertainty in parameters and state sequences. This approach contrasts with maximum likelihood estimation by enabling regularization and handling small datasets, with applications in robust sequence modeling where parameter variability is critical. Bayesian formulations also support nonparametric extensions, such as infinite HMMs via Dirichlet processes, for discovering unknown state counts.

Discriminative variants shift from generative parameter learning (e.g., Baum-Welch) to criteria like maximum mutual information (MMI) training, which optimizes the model to maximize the mutual information between correct state sequences and observations, directly improving discrimination between competing hypotheses. Introduced for speech recognition, MMI adjusts transitions and emissions to minimize recognition errors, often outperforming expectation-maximization by focusing on decision boundaries rather than data likelihood. These methods are particularly effective in supervised settings with labeled sequences.

Input-output HMMs (IOHMMs) generalize the model by conditioning transitions or emissions on exogenous covariates, allowing external inputs to influence the latent dynamics and making the framework suitable for controlled or contextual sequences. In this architecture, covariates enter via logistic or softmax functions on transition probabilities, enabling applications like time-series forecasting with external predictors. IOHMMs thus bridge HMMs with regression-like models for covariate-dependent regimes.

Variable-duration HMMs, or hidden semi-Markov models (HSMMs), relax the geometric holding-time assumption by assigning explicit duration distributions to states, often using gamma or non-parametric forms to model realistic sojourn times in applications like speech, where segment lengths vary. Inference extends the forward algorithm to account for durations, improving accuracy over standard HMMs in tasks with explicit timing dependencies. These models maintain tractability while capturing non-Markovian aspects of state persistence.

Recent developments post-2020 have integrated HMMs with transformer architectures for semi-supervised learning, leveraging self-attention mechanisms to enhance latent state inference in low-label regimes. For example, hidden Markov transformers combine recurrent hidden dynamics with transformer decoders for tasks like simultaneous machine translation, where partial observations guide predictions, achieving better performance by fusing sequential modeling with global context. Such hybrids exploit transformers' parallelization for scalable semi-supervised training on sequential data.
