HMM
A hidden Markov model (HMM) is a statistical model that describes a system as a Markov process with unobserved (hidden) states, where the observable outputs are generated probabilistically from those states, allowing inference about the hidden state sequence from the observations.[1] The model is defined by a set of hidden states, a transition probability matrix governing state changes, an emission probability distribution for observations from each state, and an initial state probability vector.[1] HMMs assume the Markov property, meaning the next state depends only on the current state, not prior history, and observations are conditionally independent given the state.[2]

Developed in the late 1960s by Leonard E. Baum and colleagues at the Institute for Defense Analyses, HMMs built on earlier Markov chain theory to address probabilistic inference in sequences with latent variables.[1] Key algorithms for HMMs include the Viterbi algorithm for finding the most likely state sequence (decoding), the forward algorithm for computing the probability of an observation sequence, and the Baum-Welch algorithm (an expectation-maximization method) for parameter estimation from data.[1] These tools enable unsupervised learning of model parameters, making HMMs versatile for sequential data analysis.[2]

HMMs gained prominence in the 1970s and 1980s through applications in speech recognition, where they model acoustic signals as emissions from hidden phonetic states, powering systems like those at Bell Labs.[3] In bioinformatics, HMMs are used for gene prediction by modeling nucleotide sequences with hidden states representing exons, introns, or intergenic regions, as in tools like GENSCAN.[4] Other notable applications include part-of-speech tagging in natural language processing, gesture recognition in computer vision, and anomaly detection in time series data.[1] Despite their foundational role, HMMs have been extended or supplemented by models like conditional random fields for handling non-independent observations.[3]

Background Concepts
Markov Chains
A Markov chain is a stochastic process satisfying the Markov property, which posits that the probability distribution for the next state depends only on the current state and not on the sequence of prior states. Formally, for a sequence of random variables X_1, X_2, \dots, X_t, \dots, the property is expressed as P(X_t = j \mid X_{t-1} = i_{t-1}, \dots, X_1 = i_1) = P(X_t = j \mid X_{t-1} = i_{t-1}) for all t \geq 2 and states i_1, \dots, i_{t-1}, j. The concept was introduced by Andrey Markov in 1906 and famously applied in his 1913 analysis of letter sequences in Pushkin's Eugene Onegin.[5][6]

In the context of discrete-time Markov chains (DTMCs), the state space S is typically finite or countably infinite, denoted as S = \{1, 2, \dots, N\} for a finite case with N states. The dynamics are governed by a transition probability matrix A = (a_{ij}), where a_{ij} = P(X_t = j \mid X_{t-1} = i) represents the probability of transitioning from state i to state j. Each entry satisfies a_{ij} \geq 0, and the rows sum to 1: \sum_{j=1}^N a_{ij} = 1 for all i, ensuring the matrix is stochastic. The process begins with an initial state distribution \pi = (\pi_1, \pi_2, \dots, \pi_N), where \pi_i = P(X_1 = i) and \sum_{i=1}^N \pi_i = 1. The joint probability of a specific state sequence x_1, x_2, \dots, x_T is then given by P(X_1 = x_1, \dots, X_T = x_T) = \pi_{x_1} \prod_{t=2}^T a_{x_{t-1} x_t}. This formulation captures the chain's memoryless progression from one state to the next.[6][7]

A key long-term property of Markov chains is the stationary distribution, a probability vector \pi that remains invariant under transitions, satisfying the balance equation \pi = \pi A along with the normalization \sum_i \pi_i = 1. For ergodic chains—those that are irreducible (any state reachable from any other) and aperiodic (no cyclic structure forcing periodic returns)—this stationary distribution is unique and represents the limiting distribution as t \to \infty, independent of the initial state: \lim_{t \to \infty} P(X_t = i) = \pi_i for all i. Ergodicity ensures the chain explores the state space thoroughly over time, averaging to the stationary behavior.[6]

A simple illustrative example is a two-state weather model with states sunny (1) and rainy (2). The transition matrix might be A = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}, indicating a 90% chance of sunny following sunny and an 80% chance of rainy following rainy. Assuming an initial distribution \pi = (0.5, 0.5), the probability of a sequence like sunny-rainy-sunny is 0.5 \times 0.1 \times 0.2 = 0.01. Solving \pi = \pi A yields the stationary distribution \pi \approx (0.667, 0.333), suggesting long-run proportions of about two-thirds sunny days and one-third rainy. This model highlights how transition probabilities dictate steady-state tendencies without external influences.[6]
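The weather example is easy to verify directly. The following minimal sketch, assuming numpy is available and indexing sunny as 0 and rainy as 1, computes the probability of the sunny-rainy-sunny sequence and recovers the stationary distribution as the left eigenvector of A for eigenvalue 1.

```python
# Two-state weather Markov chain: sequence probability and stationary distribution.
import numpy as np

A = np.array([[0.9, 0.1],    # transition matrix: rows = current state (sunny, rainy)
              [0.2, 0.8]])   # columns = next state
pi = np.array([0.5, 0.5])    # initial distribution

# Probability of sunny -> rainy -> sunny (0 = sunny, 1 = rainy)
seq = [0, 1, 0]
p = pi[seq[0]]
for prev, nxt in zip(seq, seq[1:]):
    p *= A[prev, nxt]
print(p)  # 0.5 * 0.1 * 0.2 = 0.010

# Stationary distribution: solve pi = pi A with sum(pi) = 1,
# via the left eigenvector of A associated with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(A.T)
stat = np.real(eigvecs[:, np.argmax(np.isclose(eigvals, 1.0))])
stat = stat / stat.sum()
print(stat)  # approximately [0.667, 0.333]
```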
Observation Models
In probabilistic models involving sequential data, an observation model defines the conditional probability distribution that generates observable outputs from underlying states, providing a mechanism to link latent processes to measurable events.[3] This framework is foundational in Markovian systems, where states evolve over time and observations act as measurable proxies for those states; in the fully visible case they identify the states directly, while more generally they are noisy functions of them.[3]

For discrete observations, the emission probability is specified as b_j(k) = P(O_t = k \mid X_t = j), where O_t represents the observation at time t, X_t is the state at time t, j indexes one of N possible states, and k indexes one of M possible observation symbols.[3] These probabilities form an N \times M emission matrix B, with each row summing to 1 to ensure a valid probability distribution over observations for a given state.[3] This discrete setup is common in applications like symbol sequences, where states map deterministically or stochastically to a finite set of outputs.[3]

In the case of continuous observations, the emission probabilities are often modeled using multivariate Gaussian distributions, with b_j(O_t) = \mathcal{N}(O_t; \mu_j, \Sigma_j), where \mu_j is the mean vector and \Sigma_j is the covariance matrix associated with state j.[3] This parametric form assumes that observations are drawn independently from a normal distribution centered around the state's characteristic parameters, accommodating real-valued data such as sensor measurements.[3] More complex variants, like Gaussian mixture models, extend this by weighting multiple Gaussians per state, but the single Gaussian case establishes the core continuous emission concept.[3]

A representative example arises in a visible Markov chain modeling weather patterns, where states correspond to conditions like sunny, rainy, or cloudy. In the noiseless case the observations are direct state indicators, but the model extends naturally to noisy measurements, such as temperature readings emitted from each state with added Gaussian noise.[3] For instance, a sunny state might emit higher average temperatures around 25°C with low variance, while a rainy state emits cooler readings around 15°C with similar variability, allowing the model to generate realistic sequences of temperature data from known state transitions.[3]

Key properties of observation models include the conditional independence of emissions given the current state, expressed as P(O_t \mid X_t, X_{t-1}, \dots, O_{t-1}) = P(O_t \mid X_t), ensuring that past states and observations do not influence the current emission beyond the present state.[3] This Markovian assumption simplifies computation and aligns with the sequential nature of the underlying state process.[3] The joint probability distribution over states and observations in such a visible model is given by: P(X_1, \dots, X_T, O_1, \dots, O_T) = \pi_{X_1} b_{X_1}(O_1) \prod_{t=2}^T a_{X_{t-1} X_t} b_{X_t}(O_t), where \pi_{X_1} is the initial state probability, a_{ij} = P(X_t = j \mid X_{t-1} = i) are the state transition probabilities, and the product captures the chained dependencies.[3] This formulation integrates the observation model with the state dynamics, enabling the generation of complete observation sequences from initial conditions.[3]
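The noisy-temperature example above can be sketched as follows. The emission means (25°C for sunny, 15°C for rainy) follow the text; the standard deviation sigma = 2.0 and the particular state sequence are illustrative assumptions, not values from the source.

```python
# Gaussian observation model: each weather state emits a noisy temperature reading.
import numpy as np

rng = np.random.default_rng(0)
means = {"sunny": 25.0, "rainy": 15.0}
sigma = 2.0  # assumed emission noise (not specified in the text)

def emission_density(obs, state):
    """b_state(obs) for a univariate Gaussian emission centered on the state's mean."""
    mu = means[state]
    return np.exp(-0.5 * ((obs - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Generate temperature observations from a known (visible) state sequence.
states = ["sunny", "sunny", "rainy", "sunny"]
temps = [rng.normal(means[s], sigma) for s in states]
print(temps)

# A 24 C reading is far more plausible under the sunny state than the rainy one.
print(emission_density(24.0, "sunny"), emission_density(24.0, "rainy"))
```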
Model Formalism
State Space and Transitions
In a Hidden Markov Model (HMM), the hidden state space consists of a finite set of latent states S = \{1, 2, \dots, N\}, where N is the number of states and these states represent underlying processes that generate the observed data but are not directly measurable.[1] The states evolve over discrete time steps according to the first-order Markov property, which assumes that the probability of the next state depends solely on the current state: P(X_t = j \mid X_{t-1} = i, X_{t-2}, \dots, X_1) = P(X_t = j \mid X_{t-1} = i) = a_{ij}.[8] This property simplifies the modeling of sequential dependencies by ignoring longer historical context.[9]

The transition probabilities \{a_{ij}\} for 1 \leq i, j \leq N form the state transition matrix A, a stochastic matrix where each row sums to 1 (\sum_{j=1}^N a_{ij} = 1 for all i), ensuring valid probability distributions over possible next states.[1] HMMs typically assume homogeneous transitions, meaning the matrix A is time-invariant across the sequence, although extensions to time-varying transitions exist in more advanced variants.[8] The states are assumed to be finite and discrete, facilitating tractable computation, though continuous-state generalizations have been developed for specific applications.[9] The process begins with an initial state distribution \pi = \{\pi_i\}, where \pi_i = P(X_1 = i) for 1 \leq i \leq N and \sum_{i=1}^N \pi_i = 1, capturing the starting probabilities of the hidden chain.[1] Together, these components define the hidden state dynamics of the HMM, denoted compactly as the model \lambda = (A, B, \pi), where B specifies the observation probabilities linking states to data (detailed separately).[9]

Graphically, the state space and transitions are often depicted as a directed graph, with nodes representing the N states and directed edges labeled with the corresponding a_{ij} weights, illustrating the probabilistic flow between states.[1] For instance, in part-of-speech tagging, a simple three-state HMM might use states for noun (N), verb (V), and adjective (A), with the transition matrix A encoding typical linguistic sequences—such as a high probability of noun-to-verb transitions in subject-verb structures and lower probabilities for less common shifts like verb-to-noun.[10] A representative transition matrix could appear as follows, reflecting such patterns:

| From \ To | N | V | A |
|---|---|---|---|
| N | 0.70 | 0.20 | 0.10 |
| V | 0.10 | 0.80 | 0.10 |
| A | 0.30 | 0.20 | 0.50 |
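The hidden-state dynamics defined by such a matrix can be simulated directly. The sketch below uses the part-of-speech transition matrix from the table; the uniform initial distribution and the sequence length of ten are illustrative assumptions, not values given in the text.

```python
# Sampling a hidden state sequence from the part-of-speech transition matrix.
import numpy as np

states = ["N", "V", "A"]
A = np.array([[0.70, 0.20, 0.10],   # from N
              [0.10, 0.80, 0.10],   # from V
              [0.30, 0.20, 0.50]])  # from A
pi = np.array([1/3, 1/3, 1/3])      # assumed uniform initial distribution

assert np.allclose(A.sum(axis=1), 1.0)  # each row is a valid probability distribution

rng = np.random.default_rng(42)
x = rng.choice(3, p=pi)
sequence = [states[x]]
for _ in range(9):
    x = rng.choice(3, p=A[x])       # the next state depends only on the current one
    sequence.append(states[x])
print(sequence)
```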
Hidden States and Observations
In Hidden Markov models, the observation sequence is generated by the underlying hidden states via a probabilistic emission process. The sequence of observations is denoted as O = O_1, O_2, \dots, O_T, where each O_t at time step t = 1, 2, \dots, T is produced conditionally on the hidden state X_t.[3] Observations in HMMs can be either discrete symbols from a finite alphabet V = \{ v_1, v_2, \dots, v_M \}, with M possible symbols, or continuous vectors in a multidimensional space.[3]

The emission probabilities capture this process and are represented by the matrix B = \{ b_j(k) \}, where b_j(k) = P(O_t = v_k \mid X_t = j) denotes the probability of observing symbol v_k when the hidden state is j, for j = 1, 2, \dots, N states and k = 1, 2, \dots, M.[3] This formulation assumes conditional independence of observations given the current hidden state: P(O_t \mid X_t, O_{1:t-1}, X_{1:t}) = P(O_t \mid X_t).[3] The full HMM is defined by the parameter set \lambda = (A, B, \pi), combining the emission probabilities B with the state transition matrix A and initial state distribution \pi.[3] For continuous observations, emission densities are commonly modeled using multivariate Gaussian distributions, where b_j(\mathbf{O}_t) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_j|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{O}_t - \boldsymbol{\mu}_j)^T \boldsymbol{\Sigma}_j^{-1} (\mathbf{O}_t - \boldsymbol{\mu}_j) \right), with \boldsymbol{\mu}_j as the mean vector, \boldsymbol{\Sigma}_j as the covariance matrix for state j, and d as the observation dimension.[3]

An illustrative example is the biased coin toss HMM, where hidden states correspond to different coins with distinct biases, and observations are the outcomes of tosses (heads or tails). In this setup, one state might represent a fair coin with b_j(H) = b_j(T) = 0.5, while another represents a biased coin with b_j(H) = 0.6 and b_j(T) = 0.4; the sequence of observed heads and tails thus reflects the underlying, unseen sequence of coin selections and their biases.[3] The joint probability of a hidden state sequence X = X_1, X_2, \dots, X_T and the observation sequence O under model \lambda is given by P(X, O \mid \lambda) = \pi_{X_1} b_{X_1}(O_1) \prod_{t=2}^T a_{X_{t-1} X_t} b_{X_t}(O_t), which factors the initial probability, emissions, and transitions multiplicatively.[3]
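The joint-probability factorization can be evaluated mechanically for the coin-toss example. In the sketch below the emission probabilities follow the text (a fair coin and a coin biased toward heads); the transition matrix, initial distribution, and the particular state/observation sequences are illustrative assumptions.

```python
# Joint probability P(X, O | lambda) for a two-coin HMM.
import numpy as np

# state 0 = fair coin, state 1 = biased coin
pi = np.array([0.5, 0.5])                  # assumed initial distribution
A = np.array([[0.8, 0.2],                  # assumed transition matrix
              [0.2, 0.8]])
B = np.array([[0.5, 0.5],                  # emissions: columns are (H, T)
              [0.6, 0.4]])

def joint_probability(states, obs):
    """P(X, O | lambda) = pi_{x1} b_{x1}(o1) * prod_t a_{x_{t-1} x_t} b_{x_t}(o_t)."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

H, T = 0, 1
# e.g. hidden sequence fair, fair, biased emitting H, H, T
print(joint_probability([0, 0, 1], [H, H, T]))
```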
Computational Methods
Likelihood Computation (Forward Algorithm)
In hidden Markov models (HMMs), computing the likelihood of an observation sequence O = O_1, O_2, \dots, O_T given the model parameters \lambda = (A, B, \pi) is a fundamental evaluation task, as it quantifies how well the model explains the data. This likelihood is defined as P(O \mid \lambda) = \sum_X P(O, X \mid \lambda), where the sum is over all possible hidden state sequences X = X_1, X_2, \dots, X_T of length T, and P(O, X \mid \lambda) = \pi_{X_1} b_{X_1}(O_1) \prod_{t=2}^T a_{X_{t-1} X_t} b_{X_t}(O_t). Direct summation over all N^T possible state paths is computationally intractable for moderate T and number of states N > 1.[3]

The forward algorithm addresses this by using dynamic programming to compute the likelihood efficiently. It defines the forward variable \alpha_t(j) = P(O_1, O_2, \dots, O_t, X_t = j \mid \lambda) as the probability of the partial observation sequence up to time t and being in state j at time t, given \lambda. Initialization sets \alpha_1(j) = \pi_j b_j(O_1) for j = 1, \dots, N. The recursion proceeds as \alpha_{t+1}(j) = \left[ \sum_{i=1}^N \alpha_t(i) a_{ij} \right] b_j(O_{t+1}) for t = 1, \dots, T-1 and j = 1, \dots, N. Termination yields the likelihood as P(O \mid \lambda) = \sum_{j=1}^N \alpha_T(j). This approach has time complexity O(N^2 T), a polynomial improvement over the exponential cost of naive enumeration, making it feasible for practical sequence lengths.[3] The forward variables also enable filtering posterior computations, such as the probability of being in state j at time t given the observations up to t, via \gamma_t(j) = \alpha_t(j) / \sum_{k=1}^N \alpha_t(k). Full smoothing posteriors, conditioned on the entire observation sequence, require the forward-backward procedure incorporating backward variables.

To illustrate, consider a simple HMM for a coin-tossing experiment with two hidden states: a mildly biased coin (state 1, P(H) = 0.6) and a heavily biased coin (state 2, P(H) = 0.9), initial probabilities \pi_1 = 0.6, \pi_2 = 0.4, and transition matrix A = \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}.[3] For observation sequence O = H, T, H (T=3), the emission probabilities are b_1(H) = 0.6, b_1(T) = 0.4, b_2(H) = 0.9, b_2(T) = 0.1. The forward variables are computed as follows:

- At t=1 (O_1 = H): \alpha_1(1) = 0.6 \times 0.6 = 0.36, \alpha_1(2) = 0.4 \times 0.9 = 0.36.
- At t=2 (O_2 = T): \alpha_2(1) = [0.36 \times 0.7 + 0.36 \times 0.3] \times 0.4 = 0.144, \alpha_2(2) = [0.36 \times 0.3 + 0.36 \times 0.7] \times 0.1 = 0.036.
- At t=3 (O_3 = H): \alpha_3(1) = [0.144 \times 0.7 + 0.036 \times 0.3] \times 0.6 \approx 0.06696, \alpha_3(2) = [0.144 \times 0.3 + 0.036 \times 0.7] \times 0.9 \approx 0.06156.
- Termination: P(O \mid \lambda) = \alpha_3(1) + \alpha_3(2) \approx 0.1285.
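This worked example is straightforward to reproduce. The sketch below, assuming numpy and indexing states and symbols from 0, prints the same alpha vectors at each step and the final likelihood of about 0.1285.

```python
# Forward algorithm for the two-coin example above.
import numpy as np

pi = np.array([0.6, 0.4])                 # initial state probabilities
A = np.array([[0.7, 0.3],
              [0.3, 0.7]])                # transition matrix
B = np.array([[0.6, 0.4],                 # emissions: rows = state, cols = (H, T)
              [0.9, 0.1]])
H, T = 0, 1
obs = [H, T, H]

alpha = pi * B[:, obs[0]]                 # initialization: alpha_1(j) = pi_j b_j(O_1)
print(alpha)                              # [0.36, 0.36]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]         # recursion: sum over previous states, then emit
    print(alpha)                          # [0.144, 0.036], then [0.06696, 0.06156]
print("P(O | lambda) =", alpha.sum())     # termination: ~0.12852
```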
State Sequence Decoding (Viterbi Algorithm)
The state sequence decoding problem in hidden Markov models seeks the most likely sequence of hidden states X = x_1, x_2, \dots, x_T given an observation sequence O = o_1, o_2, \dots, o_T and model parameters \lambda = (A, B, \pi). This corresponds to \arg\max_X P(X \mid O, \lambda), which is equivalent to \arg\max_X P(O, X \mid \lambda) via maximum a posteriori (MAP) estimation, as P(O \mid \lambda) is constant for a fixed model.[11] The Viterbi algorithm solves this using dynamic programming to find the maximum-probability path through the state trellis, avoiding exhaustive enumeration of all N^T possible sequences. Originally developed by Andrew Viterbi for decoding convolutional codes in 1967, it was adapted for HMMs in the context of speech recognition and sequence modeling.[12][11]

The algorithm defines \delta_t(j) as the probability of the most likely path ending in state j at time t, accounting for the first t observations: \delta_t(j) = \max_{x_1, \dots, x_{t-1}} P(x_1, \dots, x_{t-1}, x_t = j, o_1, \dots, o_t \mid \lambda). Initialization sets \delta_1(j) = \pi_j b_j(o_1) for each state j = 1, \dots, N.[11] The recursion proceeds for t = 1 to T-1: \delta_{t+1}(j) = \max_i \left[ \delta_t(i) a_{ij} \right] b_j(o_{t+1}), with backpointers \psi_{t+1}(j) = \arg\max_i \left[ \delta_t(i) a_{ij} \right] to track the optimal preceding state.[11] Termination computes the highest probability P^* = \max_j \delta_T(j) and the final state x_T^* = \arg\max_j \delta_T(j); the full sequence is then backtracked as x_t^* = \psi_{t+1}(x_{t+1}^*) for t = T-1 down to 1.[11] The computational complexity is O(N^2 T), linear in the observation length T but quadratic in the number of states N, making it efficient for moderate-sized state spaces.[11]

For example, consider an HMM with two hidden states (rainy, sunny) modeling daily weather, where observations indicate rainy or sunny conditions (e.g., via direct reports). Given the observation sequence rainy, sunny, rainy over three days and assuming transition probabilities a_{rr} = 0.7, a_{rs} = 0.3, a_{sr} = 0.4, a_{ss} = 0.6, emission probabilities b_r(\text{rainy}) = 0.9, b_r(\text{sunny}) = 0.1, b_s(\text{rainy}) = 0.2, b_s(\text{sunny}) = 0.8, and uniform initials \pi_r = \pi_s = 0.5, the Viterbi algorithm computes \delta values and \psi backpointers iteratively. At t=1 (rainy), \delta_1(r) = 0.45, \delta_1(s) = 0.1, and \psi_1 is undefined (there is no predecessor). At t=2 (sunny), \delta_2(r) = 0.0315, \delta_2(s) = 0.108, \psi_2(r) = r, \psi_2(s) = r. At t=3 (rainy), \delta_3(r) \approx 0.0389, \delta_3(s) \approx 0.0130, \psi_3(r) = s, \psi_3(s) = s. Termination selects x_3^* = r (P^* \approx 0.0389), backtracks to x_2^* = s, x_1^* = r, recovering the sequence rainy-sunny-rainy as most likely.[11]

The Viterbi algorithm assumes a clear peak in the posterior distribution and excels with high-signal data where emissions strongly distinguish states; in low-signal or highly ambiguous cases, it selects one path but ignores the multiplicity of near-optimal alternatives.[11]
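The weather decoding example can be checked with a short script. The sketch below, assuming numpy and ordering states as (rainy, sunny), should recover the path rainy-sunny-rainy with P* of about 0.0389.

```python
# Viterbi decoding for the rainy/sunny example above.
import numpy as np

states = ["rainy", "sunny"]
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3],                 # rows/cols ordered (rainy, sunny)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],                 # emissions: cols are (rainy, sunny) reports
              [0.2, 0.8]])
obs = [0, 1, 0]                           # observed: rainy, sunny, rainy

delta = pi * B[:, obs[0]]                 # delta_1(j) = pi_j b_j(o_1)
psi = []                                  # backpointers, one array per time step t >= 2
for o in obs[1:]:
    scores = delta[:, None] * A           # scores[i, j] = delta_t(i) * a_ij
    psi.append(scores.argmax(axis=0))     # best predecessor i for each state j
    delta = scores.max(axis=0) * B[:, o]  # delta_{t+1}(j)

best_last = int(delta.argmax())           # final state x_T*
path = [best_last]
for back in reversed(psi):                # backtrack through the stored pointers
    path.append(int(back[path[-1]]))
path.reverse()
print([states[i] for i in path], "P* =", delta.max())  # ['rainy', 'sunny', 'rainy'], ~0.0389
```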
Parameter Learning (Baum-Welch Algorithm)
The parameter learning problem in hidden Markov models (HMMs) seeks to estimate the model parameters \lambda = (A, B, \pi)—comprising the state transition matrix A, the observation emission probabilities B, and the initial state distribution \pi—so as to maximize the likelihood P(O \mid \lambda) of an observed sequence O = O_1, O_2, \dots, O_T in the absence of state labels.[3] This unsupervised estimation is challenging due to the latent nature of the states, but it enables HMMs to be trained directly from data.[3]

The Baum-Welch algorithm addresses this by implementing the expectation-maximization (EM) framework, originally developed for probabilistic functions of Markov chains. In each iteration, the E-step computes expected counts over hidden state occupancies and transitions using the forward-backward procedure, while the M-step re-estimates the parameters to maximize the expected complete-data log-likelihood given those expectations.[3] The forward variables \alpha_t(j) from the forward algorithm provide P(O_1, \dots, O_t, X_t = j \mid \lambda), which are combined with backward variables in the E-step.[3] The backward variables are defined as \beta_t(j) = P(O_{t+1}, \dots, O_T \mid X_t = j, \lambda), initialized with \beta_T(j) = 1 for all states j, and computed recursively backward via \beta_t(i) = \sum_j a_{ij} b_j(O_{t+1}) \beta_{t+1}(j) for t = T-1, \dots, 1.[3] These enable the computation of posterior probabilities: the gamma \gamma_t(j) = P(X_t = j \mid O, \lambda) = \frac{\alpha_t(j) \beta_t(j)}{P(O \mid \lambda)}, representing the expected occupancy of state j at time t, and the xi \xi_t(i,j) = P(X_t = i, X_{t+1} = j \mid O, \lambda) = \frac{\alpha_t(i) a_{ij} b_j(O_{t+1}) \beta_{t+1}(j)}{P(O \mid \lambda)}, capturing expected transitions from i to j at t.[3]

In the M-step, parameters are updated using these expectations:

- Initial distribution: \pi_i = \gamma_1(i),
- Transitions: a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)},
- Emissions (for discrete observations v_k): b_j(k) = \frac{\sum_{t: O_t = v_k} \gamma_t(j)}{\sum_{t=1}^T \gamma_t(j)}.[3]
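These E- and M-steps are iterated until the likelihood stops improving; each iteration is guaranteed not to decrease P(O \mid \lambda), though the procedure may converge to a local maximum. The sketch below, assuming numpy and using an assumed toy observation sequence and starting parameters chosen purely for illustration, performs one full Baum-Welch iteration for a two-state discrete HMM following the formulas above (unscaled forward/backward variables, which is adequate only for short sequences).

```python
# One Baum-Welch (EM) iteration for a discrete two-state HMM.
import numpy as np

obs = np.array([0, 1, 0, 0, 1])            # assumed toy sequence over symbols {0, 1}
N, M, T = 2, 2, len(obs)
pi = np.array([0.6, 0.4])                  # assumed starting parameters
A = np.array([[0.7, 0.3],
              [0.3, 0.7]])
B = np.array([[0.6, 0.4],
              [0.2, 0.8]])

# E-step: forward and backward variables
alpha = np.zeros((T, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
beta = np.zeros((T, N))
beta[-1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])   # beta_t(i) = sum_j a_ij b_j(O_{t+1}) beta_{t+1}(j)
likelihood = alpha[-1].sum()

gamma = alpha * beta / likelihood          # gamma_t(j) = P(X_t = j | O, lambda)
xi = np.zeros((T - 1, N, N))               # xi_t(i, j) = P(X_t = i, X_{t+1} = j | O, lambda)
for t in range(T - 1):
    xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / likelihood

# M-step: re-estimate pi, A, B from the expected counts
pi_new = gamma[0]
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
B_new = np.zeros((N, M))
for k in range(M):
    B_new[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
print(pi_new, A_new, B_new, sep="\n")
```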