Markov model
A Markov model is a stochastic model that describes a system transitioning between a finite or countable set of states, where the probability distribution for the next state depends solely on the current state and not on the prior history of the system, a property known as the Markov property or memorylessness.[1] This framework, foundational to probability theory, enables the modeling of random processes in discrete or continuous time, often represented through transition probabilities organized in matrices for discrete cases.[1]

The concept originated with Russian mathematician Andrey Andreyevich Markov (1856–1922), who developed it as an extension of the law of large numbers to dependent random variables, first publishing on the topic in 1906 while analyzing sequences of dependent events, such as letter transitions in Alexander Pushkin's novel Eugene Onegin.[2] Markov's work built on earlier probability research by Pafnuty Chebyshev and addressed limitations in classical independence assumptions, proving results like the central limit theorem for such chains under general conditions.[2] Subsequent advancements, including continuous-time processes by Andrey Kolmogorov in 1931, expanded the theory to broader stochastic processes.[3]

Key components of a Markov model include states (possible conditions of the system), transition probabilities (likelihoods of moving between states), and initial state distributions.[4] In discrete-time models, known as Markov chains, these are often visualized as state diagrams or matrices, with properties like irreducibility (reachability between all states) and aperiodicity determining long-term behavior, such as convergence to a stationary distribution.[1] Extensions like hidden Markov models (HMMs) account for unobserved states, inferring them from observable outputs via algorithms such as the Viterbi or forward-backward methods, which have proven computationally efficient for sequential data analysis.[5]

Markov models find applications across diverse fields due to their ability to capture sequential dependencies. In finance, they model asset price movements and predict market states for risk assessment.[6] Public health uses them to simulate disease progression and evaluate intervention strategies, such as in epidemiology for tracking infection spread.[6] In computer science and machine learning, HMMs power speech recognition, natural language processing, and bioinformatics tasks like gene sequence prediction.[5] Engineering applications include reliability analysis of fault-tolerant systems, where states represent operational modes and transitions model failure rates.[4] Other notable uses span queueing theory, genetics,[5] and operations research, such as optimizing bike-sharing or delivery networks.[6]
Fundamentals
Definition and Markov Property
A Markov model is a stochastic model that describes a sequence of possible events, in which the probability of each event depends only on the state attained in the previous event.[7] This framework captures systems evolving over time where future behavior is determined solely by the current configuration, making it a fundamental tool in probability theory for modeling dependencies in sequences.[8]

The core distinguishing feature of a Markov model is the Markov property, also known as the memoryless property, which formalizes this dependence structure. Mathematically, for a stochastic process \{X_n\}_{n \geq 0}, the property states that P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \dots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i) for all n \geq 0 and states i, j, i_{n-1}, \dots, i_0.[9] This implies that the process has no memory beyond the immediate past, allowing predictions of future states to rely exclusively on the present one.[10]

In the discrete-state, discrete-time case, transition probabilities are compactly represented by a transition matrix P, where each entry p_{ij} = P(X_{n+1} = j \mid X_n = i) gives the probability of moving from state i to state j.[11] A key concept is the stationary distribution \pi, a probability vector satisfying the balance equation \pi P = \pi, \quad \sum_{i} \pi_i = 1, which describes a long-run equilibrium where the state distribution remains unchanged under transitions.[12]

For illustration, consider a basic weather model with states "sunny" and "rainy": if today is sunny, the probability of sunny tomorrow is 0.9 and rainy 0.1; if today is rainy, the probability of sunny tomorrow is 0.5 and rainy 0.5. Here, tomorrow's weather probability depends only on today's state, exemplifying the Markov property without regard to prior days.[13]

Markov models are classified based on the nature of states and time: discrete-state discrete-time models use countable states and integer time steps; discrete-state continuous-time models maintain countable states but allow transitions at arbitrary real times; continuous-state variants extend to uncountable state spaces, such as positions in diffusion processes.[14] These variants share the Markov property but differ in their analytical tools and applications.[15]
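The weather example can be worked through numerically. The sketch below (Python with NumPy; the library choice and variable names are illustrative assumptions, not part of the source) solves the balance equation \pi P = \pi for the two-state chain and checks that repeatedly applying P to any initial distribution converges to the same limit.

```python
import numpy as np

# Two-state weather chain from the example above: state 0 = sunny, 1 = rainy.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Solve pi P = pi with sum(pi) = 1, i.e. pi (P - I) = 0 plus normalization,
# written as a small least-squares system.
A = np.vstack([(P - np.eye(2)).T, np.ones((1, 2))])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)          # approximately [0.8333, 0.1667]

# The same limit appears by pushing any initial distribution through P repeatedly,
# illustrating convergence to the stationary distribution.
mu = np.array([0.0, 1.0])   # start from "rainy"
for _ in range(50):
    mu = mu @ P
print(mu)          # also approximately [0.8333, 0.1667]
```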
Historical Development
The origins of Markov models can be traced to the work of Russian mathematician Andrey Andreyevich Markov (1856–1922), who sought to extend classical probability theory beyond the assumption of statistical independence in sequences of events. In his 1906 paper, titled Rasprostranenie zakona bol’shikh chisel na velichiny, zavisiaschchie drug ot druga (Extension of the law of large numbers to quantities depending on each other), Markov introduced the concept of dependent random variables by analyzing the sequence of approximately 20,000 letters in Alexander Pushkin's verse novel Eugene Onegin. He categorized letters as vowels or consonants and calculated transition probabilities, such as the probability of a vowel following a consonant (approximately 0.663), to demonstrate that successive elements in a sequence could exhibit limited dependence while still satisfying the law of large numbers. This work explicitly rejected the rigid independence assumption dominant in the probability school of Pafnuty Chebyshev, Markov's mentor, and laid the groundwork for modeling sequences where the future state depends only on the current one.[16]

Markov further developed these ideas in his 1913 memoir, Primer statisticheskogo issledovaniia nad tekstom "Evgeniia Onegina" po povodu odnoy zadachi iz teorii veroiatnostei (An example of a statistical investigation of the text of Eugene Onegin in connection with a problem in probability theory), which extended the chain model to general sequences of dependent phenomena beyond binary states. This publication, part of a symposium Markov organized to commemorate the 200th anniversary of Bernoulli's law of large numbers, formalized higher-order dependencies and applied the approach to broader statistical regularities. In the 1930s, Soviet mathematician Andrey Kolmogorov advanced the framework significantly, establishing the general theory of Markov processes in works such as his 1931 paper Analiticheskie metody v teorii veroiatnostei (Analytical methods in probability theory) and a 1936 paper on stochastic processes with independent increments, which provided rigorous foundations for both discrete and continuous cases. Early applications of these models emerged in genetics during the same decade, where J.B.S. Haldane employed stochastic chain-like processes to model gene frequency fluctuations and mutation-selection equilibria in populations.[16][17]

Post-World War II developments accelerated the field's maturation, particularly with the introduction of continuous-time Markov chains by William Feller in his influential 1950 textbook An Introduction to Probability Theory and Its Applications, Volume I. Feller's treatment integrated chains into broader probability applications, emphasizing birth-death processes and limit theorems essential for queueing and reliability analysis. By the 1950s, the term "Markov chains" had become standard—first coined by Sergei Bernstein in 1926 but widely adopted during this period—fueling growth in operations research, computing, and simulation methods as computational tools emerged.[16]
Markov Chains
Discrete-Time Markov Chains
A discrete-time Markov chain (DTMC) is a stochastic process \{X_n : n = 0, 1, 2, \dots \} with a discrete state space S, which may be finite or countably infinite, where the Markov property holds: the probability distribution of X_{n+1} depends only on X_n and not on prior states.[18] Formally, for states i, j \in S and any history X_0 = i_0, \dots, X_n = i_n, the transition probabilities satisfy P(X_{n+1} = j \mid X_0 = i_0, \dots, X_n = i_n) = P(X_{n+1} = j \mid X_n = i_n) = p_{i_n j}, where p_{ij} \geq 0 and \sum_{j \in S} p_{ij} = 1 for all i \in S.[19] These transition probabilities form the rows of the transition matrix P = (p_{ij}), a right-stochastic matrix.[20]

Key concepts in DTMCs include the classification of states based on return probabilities and periodicity. A state i \in S is absorbing if p_{ii} = 1, meaning the chain remains there indefinitely once entered.[18] States are transient if the probability of eventual return to i starting from i is less than 1, and recurrent if this return probability is 1; recurrent states are further divided into null recurrent (infinite expected return time) and positive recurrent (finite expected return time).[21] In an irreducible chain (where every state is reachable from every other), all states share the same classification: either all transient or all recurrent.[18] The period of a state i is the greatest common divisor of its possible return times, and i is aperiodic if this gcd is 1; a chain is ergodic if it is irreducible, positive recurrent, and aperiodic.[22]

The evolution of DTMCs is analyzed using the Chapman-Kolmogorov equations, which describe multi-step transitions. For m, n \geq 1, the (n+m)-step transition probabilities satisfy p_{ij}^{(n+m)} = \sum_{k \in S} p_{ik}^{(n)} p_{kj}^{(m)}, where p_{ij}^{(n)} is the (i,j)-entry of P^n.[23] In matrix form, this yields P^{n+m} = P^n P^m, enabling computation of long-term behavior via matrix powers.[23]

A stationary distribution \pi = (\pi_i)_{i \in S} for a DTMC satisfies \pi P = \pi with \sum_{i \in S} \pi_i = 1 and \pi_i \geq 0, representing a probability vector invariant under transitions.[24] For finite-state irreducible chains, this equation is equivalent to solving the linear system \pi (P - I) = 0 subject to normalization, where I is the identity matrix; uniqueness holds for ergodic chains.[18] In positive recurrent irreducible chains, even if infinite, a unique stationary distribution exists with \pi_i > 0 for all i.[24]

The gambler's ruin problem illustrates these concepts: consider two players with initial fortunes i and N-i (total N), where the first player wins each $1 bet with probability p and loses it with probability 1-p; the state is the first player's fortune, with absorbing states at 0 (ruin) and N (win).[25] The interior states \{1, \dots, N-1\} communicate with one another but are transient, and the absorption probabilities can be computed by first-step analysis, yielding ruin probability \frac{ \left( \frac{1-p}{p} \right)^i - \left( \frac{1-p}{p} \right)^N }{ 1 - \left( \frac{1-p}{p} \right)^N } when p \neq 1/2.[25]

PageRank, used in Google's original search engine, models web page importance via a DTMC with pages as states, where transitions follow hyperlink structure (random surfer model), adjusted for damping to ensure aperiodicity and irreducibility.[26] The stationary distribution gives page ranks, computed by solving \pi P = \pi iteratively, prioritizing pages with high in-link quality.[26]

For irreducible aperiodic finite-state DTMCs, the distribution converges to the unique stationary \pi: starting from any initial distribution \mu_0, \|\mu_n - \pi\|_{TV} \to 0 as n \to \infty, where \|\cdot\|_{TV} is total variation distance. The mixing time \tau(\epsilon) is the minimal n such that \|\mu_n - \pi\|_{TV} \leq \epsilon for all initial \mu_0, bounded by \tau(\epsilon) \leq \frac{\log(1/(\epsilon \min_i \pi_i))}{1 - \lambda_2} for 0 < \epsilon < 1, with \lambda_2 the second-largest eigenvalue of P; this quantifies convergence speed, crucial for applications like sampling.[27] In contrast, continuous-time Markov chains involve exponential holding times rather than fixed steps.[28]
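As a numerical check of the gambler's ruin analysis above, the following sketch (Python with NumPy; the parameter values are illustrative assumptions) builds the absorbing transition matrix, solves the first-step equations for the ruin probabilities, and compares the result with the closed-form expression.

```python
import numpy as np

# Gambler's ruin with total capital N and win probability p per $1 bet.
N, p, i0 = 10, 0.45, 4          # illustrative values
q = 1.0 - p

# (N+1)x(N+1) transition matrix with absorbing states 0 and N.
P = np.zeros((N + 1, N + 1))
P[0, 0] = P[N, N] = 1.0
for i in range(1, N):
    P[i, i + 1] = p             # win $1
    P[i, i - 1] = q             # lose $1

# Ruin probabilities h_i = P(absorb at 0 | start at i) satisfy
# h_i = p*h_{i+1} + q*h_{i-1} with boundary values h_0 = 1, h_N = 0.
A = np.eye(N + 1) - P
A[0], A[N] = 0.0, 0.0           # overwrite boundary rows
A[0, 0] = A[N, N] = 1.0
b = np.zeros(N + 1)
b[0] = 1.0
h = np.linalg.solve(A, b)

# Closed form from the text, valid for p != 1/2.
r = q / p
closed = (r**i0 - r**N) / (1 - r**N)
print(h[i0], closed)            # the two values agree
```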
Continuous-Time Markov Chains
A continuous-time Markov chain is a stochastic process \{X(t): t \geq 0\} with a countable state space S that satisfies the Markov property: for all s, t \geq 0 and i, j \in S, the conditional probability P(X(s+t) = j \mid X(s) = i, \{X(u): 0 \leq u < s\}) = P(X(s+t) = j \mid X(s) = i) = p_{ij}(t), where p_{ij}(t) denotes the transition probability function.[29] The process remains in each state i for an exponentially distributed holding time with rate \lambda_i = -q_{ii} > 0, after which it jumps to another state j \neq i with probability q_{ij}/\lambda_i.[30]

The dynamics of a continuous-time Markov chain are governed by the infinitesimal generator matrix Q = (q_{ij}), where q_{ij} \geq 0 for i \neq j represents the transition rate from state i to j, and the diagonal entries satisfy q_{ii} = -\sum_{j \neq i} q_{ij} so that each row sums to zero.[29] This matrix Q is derived as the derivative of the transition probability matrix at time zero, Q = P'(0).[30] The transition probabilities evolve according to the Kolmogorov backward equations, \frac{d}{dt} p_{ij}(t) = \sum_{k \in S} q_{ik} p_{kj}(t), \quad t > 0, with initial condition p_{ij}(0) = \delta_{ij}, or equivalently the forward equations, \frac{d}{dt} p_{ij}(t) = \sum_{k \in S} p_{ik}(t) q_{kj}, \quad t > 0.[29] These systems of ordinary differential equations describe the forward flow of probability distributions and the backward propagation from initial states.[30]

For an irreducible continuous-time Markov chain, a unique stationary distribution \pi = (\pi_i)_{i \in S} exists if the chain is positive recurrent, satisfying the balance equations \pi Q = 0 with \sum_{i \in S} \pi_i = 1, and \lim_{t \to \infty} p_{ij}(t) = \pi_j for all i, j \in S.[30] In contrast to discrete-time Markov chains, which evolve via matrix powers, continuous-time chains involve exponential holding times and rate-based jumps, leading to solutions often expressed via matrix exponentials P(t) = e^{tQ}.[29]

Birth-death processes form an important class of continuous-time Markov chains on the non-negative integers, where transitions occur only to neighboring states: from state n to n+1 at birth rate \lambda_n and to n-1 at death rate \mu_n (with \mu_0 = 0). A canonical example is the M/M/1 queue in queueing theory, modeling customer arrivals as a Poisson process with rate \lambda (births) and exponential service times with rate \mu (deaths), where the state represents the queue length; the stationary distribution is geometric with parameter \rho = \lambda / \mu < 1, yielding mean queue length \rho / (1 - \rho).[31] Another application arises in radioactive decay chains, such as the thorium series, where states correspond to unstable isotopes transitioning to daughter nuclides at decay rates \lambda_i, forming a linear chain with no births (\lambda_n = 0) and deaths \mu_n > 0, for which the Kolmogorov equations yield exponential decay probabilities.[32]

Key properties include the potential for explosion in chains with infinite state spaces, where the explosion time \zeta = \inf\{t > 0: N(t) = \infty\} (with N(t) the number of jumps by time t) is finite with positive probability, as in pure birth processes whose rates grow fast enough that \sum_n 1/\lambda_n < \infty (superlinear rates such as \lambda_n = \beta n^2); the linear-rate Yule process (\lambda_n = n\beta) does not explode, and regular chains require P(\zeta = \infty) = 1.[29] For simulation and transient analysis, uniformization embeds the continuous-time chain into a discrete-time Markov chain by adding fictitious self-transitions at a uniform rate \alpha \geq \max_i (-q_{ii}), yielding transition matrix M = I + Q/\alpha and Poisson-distributed jump times, enabling efficient Monte Carlo methods.[33]
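The uniformization construction can be illustrated with a small numerical sketch. The code below (Python with NumPy and SciPy; the truncation level and rates are illustrative assumptions) builds the generator of a truncated M/M/1 queue and computes the same transient distribution both from the matrix exponential P(t) = e^{tQ} and from the uniformized chain M = I + Q/\alpha.

```python
import numpy as np
from scipy.linalg import expm
from scipy.stats import poisson

# Birth-death generator for an M/M/1 queue truncated at K states
# (the truncation level K is an illustrative choice, not part of the model).
lam, mu, K = 1.0, 2.0, 20
Q = np.zeros((K + 1, K + 1))
for n in range(K + 1):
    if n < K:
        Q[n, n + 1] = lam            # birth: arrival
    if n > 0:
        Q[n, n - 1] = mu             # death: service completion
    Q[n, n] = -Q[n].sum()            # generator rows sum to zero

t = 3.0
p0 = np.zeros(K + 1)
p0[0] = 1.0                          # start from an empty queue

# (1) transient distribution via the matrix exponential P(t) = e^{tQ}
p_expm = p0 @ expm(t * Q)

# (2) the same distribution via uniformization: M = I + Q/alpha with
#     alpha >= max_i(-q_ii), weighting powers of M by Poisson probabilities
alpha = (-np.diag(Q)).max()
M = np.eye(K + 1) + Q / alpha
p_unif = np.zeros(K + 1)
term = p0.copy()
for n in range(200):                 # truncate the Poisson sum
    p_unif += poisson.pmf(n, alpha * t) * term
    term = term @ M

print(np.abs(p_expm - p_unif).max())  # agreement up to truncation error
```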
Sequential and Inferred Models
Hidden Markov Models
A hidden Markov model (HMM) is a probabilistic model that extends the Markov chain framework by assuming an underlying stochastic process with unobserved (hidden) states that evolve according to the Markov property, while the observable outputs, or emissions, are generated from these states via a conditional probability distribution. Unlike a standard Markov chain where states are directly observed, in an HMM the sequence of hidden states Q = q_1, q_2, \dots, q_T forms a Markov chain, and the observation sequence O = o_1, o_2, \dots, o_T is produced stochastically from each state, capturing dependencies in sequential data where direct state information is unavailable. This model was first formalized by Baum and Petrie in their work on statistical inference for probabilistic functions of Markov processes.[34][35] The HMM is defined by three key components: the state transition probability matrix A = \{a_{ij}\}, where a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i) specifies the probability of transitioning from state S_i to S_j; the emission probability distribution B = \{b_j(k)\}, where b_j(k) = P(o_t = v_k \mid q_t = S_j) gives the probability of observing symbol v_k in state S_j (often represented as an emission matrix for discrete observations); and the initial state probability distribution \pi = \{\pi_i\}, where \pi_i = P(q_1 = S_i). The complete model is denoted by the parameter set \lambda = (A, B, \pi). These components enable the HMM to represent systems with latent dynamics, such as sequences where the true state sequence is inferred indirectly from observations.[35] To work with HMMs, three fundamental computational problems are addressed. The evaluation problem computes the likelihood of an observation sequence given the model, P(O \mid \lambda), using the forward-backward algorithm, which efficiently calculates this probability in O(N^2 T) time via dynamic programming, where N is the number of states and T is the observation length. The decoding problem finds the most likely hidden state sequence Q^* = \arg\max_Q P(Q \mid O, \lambda), solved by the Viterbi algorithm, a dynamic programming approach that maximizes the joint probability P(Q, O \mid \lambda) by maintaining the highest probability path ending in each state at each time step: \begin{align*} \delta_1(i) &= \pi_i b_i(o_1), \\ \delta_{t+1}(j) &= \max_i [\delta_t(i) a_{ij}] b_j(o_{t+1}), \\ \psi_{t+1}(j) &= \arg\max_i [\delta_t(i) a_{ij}], \end{align*} with backtracking to recover the state sequence, achieving the same time complexity as the forward algorithm. These methods, detailed in foundational treatments of HMMs, allow for inference over hidden states.[35] Parameter estimation in HMMs is performed via the Baum-Welch algorithm, an expectation-maximization (EM) procedure that iteratively reestimates \lambda to maximize P(O \mid \lambda) when states are unobserved. Starting from an initial guess, the E-step computes expected state occupancies and transitions using forward-backward variables, while the M-step updates A, B, and \pi as normalized frequencies, such as \hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, where \xi_t(i,j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda) and \gamma_t(i) = P(q_t = S_i \mid O, \lambda); this guarantees non-decreasing likelihood and converges to a local maximum. The algorithm was developed by Baum, Petrie, Soules, and Weiss as a maximization technique for Markov chain probabilities.[35] HMMs have been widely applied in sequence modeling tasks. 
In speech recognition, hidden states represent phonetic units like phonemes, with emissions modeling acoustic features such as spectral vectors, enabling systems to decode spoken words from audio observations; early implementations achieved recognition accuracies exceeding 95% on isolated digits using left-to-right HMM topologies. In natural language processing, HMMs support part-of-speech (POS) tagging, where states correspond to grammatical categories (e.g., noun, verb) and emissions to words, allowing inference of tag sequences from sentences; robust taggers based on this approach report tagging accuracies around 96% on standard corpora like the Penn Treebank.[35][36]

Model performance is evaluated using metrics tailored to the application. Perplexity measures predictive uncertainty for sequence models, computed as 2^{-\frac{1}{T} \log_2 P(O \mid \lambda)}, with lower values indicating better fit for tasks like language modeling. In communication or error-prone sequences, bit error rate quantifies decoding accuracy as the fraction of incorrectly inferred bits or symbols, often below 1% in optimized HMM-based systems. These metrics provide quantitative assessment of inference quality without ground-truth states.[35]
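The Viterbi recursion given above can be sketched directly. The code below (Python with NumPy; the two-state model parameters and observation sequence are toy values chosen for illustration, not taken from the sources) decodes the most likely hidden state sequence for a short observation sequence.

```python
import numpy as np

# Minimal Viterbi decoder for a discrete-emission HMM, lambda = (A, B, pi).
A  = np.array([[0.7, 0.3],       # a_ij = P(q_{t+1}=j | q_t=i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],  # b_j(k) = P(o_t=k | q_t=j)
               [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # initial state distribution
obs = [0, 1, 2, 2, 1]            # observed symbol indices (toy data)

N, T = A.shape[0], len(obs)
delta = np.zeros((T, N))         # highest path probability ending in each state
psi   = np.zeros((T, N), dtype=int)

delta[0] = pi * B[:, obs[0]]
for t in range(1, T):
    for j in range(N):
        scores = delta[t - 1] * A[:, j]
        psi[t, j]   = np.argmax(scores)               # best predecessor
        delta[t, j] = scores[psi[t, j]] * B[j, obs[t]]

# Backtrack to recover the most likely hidden state sequence.
path = [int(np.argmax(delta[T - 1]))]
for t in range(T - 1, 0, -1):
    path.append(int(psi[t, path[-1]]))
path.reverse()
print(path)
```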
Markov Decision Processes
A Markov decision process (MDP) is a mathematical framework for modeling sequential decision-making under uncertainty, where an agent interacts with an environment by selecting actions that influence probabilistic state transitions and associated rewards. Formally, an MDP is defined as a tuple (S, A, P, R, \gamma), where S is the state space, A is the action space, P(s' \mid s, a) denotes the transition probabilities to next states s' given current state s and action a, R(s, a) represents the reward function, and \gamma \in [0, 1) is the discount factor that weights future rewards. This structure builds on the Markov property, ensuring that future states depend only on the current state and action, not prior history. The theory originated in the 1950s with foundational work by Lloyd Shapley on stochastic games and Richard Bellman on dynamic programming approaches to controlled Markov chains.[37][38][39]

In an MDP, a policy \pi specifies the agent's behavior, mapping states to actions; it can be deterministic (\pi(s) = a) or stochastic (\pi(a \mid s)). The goal is to find an optimal policy \pi^* that maximizes the expected discounted cumulative reward, often evaluated through value functions. The state-value function V^\pi(s) estimates the expected return starting from state s under policy \pi, while the action-value function Q^\pi(s, a) does so after taking action a in s. These functions satisfy the Bellman equations, which decompose the value into immediate reward plus discounted expected future value:
V^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s') \right]
For the optimal value function V^*(s), the Bellman optimality equation is:
V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s') \right]
This equation implies that the optimal value operator is a contraction mapping in the space of bounded functions, guaranteeing a unique fixed point solvable via iterative methods. A similar form holds for Q^*(s, a).[39][40]

Solution methods for MDPs include value iteration and policy iteration. Value iteration, introduced by Bellman, initializes a value function and iteratively applies the Bellman optimality operator until convergence to V^*, from which \pi^* is derived by greedy selection. Policy iteration, developed by Ronald Howard, alternates between policy evaluation (solving the Bellman equation for a fixed \pi to compute V^\pi) and policy improvement (updating \pi to greedily select actions maximizing the Bellman update). For finite MDPs, policy iteration converges in finitely many steps, while value iteration converges geometrically and yields an optimal greedy policy after finitely many iterations. For infinite-horizon discounted MDPs, these methods yield \gamma-discounted optimal policies; finite-horizon variants truncate at a horizon T with time-dependent values, while average-reward criteria (undiscounted, long-run average) use modified Bellman equations focusing on differential rewards.[38][41][40]

MDPs find application in problems like inventory control, where states represent stock levels, actions are ordering quantities, transitions model demand uncertainty, and rewards capture holding and shortage costs—solvable via policy iteration to minimize long-term costs. In robot navigation, states are positions, actions are movements, transitions account for noisy sensors or dynamics, and rewards guide goal-reaching; value iteration optimizes paths in simulated environments.
These examples illustrate MDPs' utility in balancing exploration of uncertain outcomes with exploitation of rewarding actions.[40]
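Value iteration can be sketched in a few lines. The example below (Python with NumPy; the three-state, two-action transition and reward numbers are invented for illustration) repeatedly applies the Bellman optimality backup until the value function converges and then extracts a greedy policy.

```python
import numpy as np

# Value iteration on a tiny finite MDP; only the Bellman backup follows the text,
# the specific numbers are illustrative.
gamma = 0.9
# P[a, s, s2] = P(s2 | s, a); each row over s2 sums to 1.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]],   # action 0
    [[0.0, 0.9, 0.1], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],   # action 1
])
R = np.array([[0.0, 0.0],      # R[s, a]
              [0.0, 1.0],
              [5.0, 0.5]])

V = np.zeros(3)
for _ in range(500):
    # Q(s,a) = R(s,a) + gamma * sum_{s2} P(s2|s,a) V(s2)
    Q = R + gamma * np.einsum('asj,j->sa', P, V)
    V_new = Q.max(axis=1)                      # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:       # contraction => convergence
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)                      # greedy policy w.r.t. V*
print(V, policy)
```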
Decision and Observation Models
Partially Observable Markov Decision Processes
A partially observable Markov decision process (POMDP) extends the Markov decision process framework to environments where the agent's state observations are imperfect or noisy. Formally, a POMDP is defined by the tuple (S, A, T, R, \Omega, O, \gamma), where S is the state space, A is the action space, T: S \times A \times S \to [0,1] is the transition probability function, R: S \times A \to \mathbb{R} is the reward function, \Omega is the observation space, O: S \times A \times \Omega \to [0,1] is the observation probability function (typically O(s', a, o) for the next state s' \in S), and \gamma \in [0,1) is the discount factor.[42] This structure models sequential decision-making under uncertainty in state knowledge, where the agent receives an observation o \in \Omega after each action rather than the true state.[42]

Since the true state s \in S is hidden, the agent maintains a belief state b(s), which represents the posterior probability distribution P(s \mid a_t, o_{1:t}, a_{1:t-1}) over states given the action-observation history. This belief serves as a sufficient statistic for decision-making, transforming the POMDP into an equivalent fully observable belief MDP where beliefs are the states.[43] Belief updates can be computed using techniques analogous to those in hidden Markov models, such as the forward algorithm, to incorporate new observations and actions.[44]

Planning in POMDPs involves solving the belief MDP via value iteration, where the value function V(b) is optimized over the continuous belief space. For finite-horizon problems, the optimal value function is piecewise linear and convex in the belief state, allowing representation as a finite set of linear segments (hyperplanes) that can be backed up exactly during iteration. This property, established in early dynamic programming approaches, enables efficient computation despite the high dimensionality, though exact solutions scale poorly with problem size.[43]

Key solution approaches include point-based value iteration (PBVI), which approximates the value function by performing backups only at a discrete set of strategically selected belief points, yielding anytime policies with bounded error.[45] Another method is AND-OR tree search, which builds an observation tree from the current belief and evaluates action-observation branches to select policies online, often combined with Monte Carlo sampling for scalability in large spaces.[46] Recent advances as of 2025 include integration with deep learning for long-horizon planning, such as BetaZero, which uses learned approximations for belief-state planning in complex environments.[47]

POMDPs are applied in robot localization, where a mobile robot uses sensor observations (e.g., laser scans) to maintain a belief over possible positions and plan paths under odometry uncertainty.[48] In medical treatment, they model sequential therapy decisions, such as adjusting dosages for chronic conditions like hypertension, balancing efficacy against side effects given noisy patient responses.[49] The problem of finding an optimal policy for a finite-horizon POMDP is PSPACE-complete, reflecting its inherent computational intractability even for moderate-sized models.[50] Approximations like QMDP address this by assuming immediate full observability after actions, computing a hybrid value function from underlying MDP Q-values to derive fast, near-optimal policies.
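The belief update that turns a POMDP into a belief MDP can be sketched compactly. The code below (Python with NumPy; the tiny transition and observation tables are illustrative placeholders) predicts the next-state distribution under an action and then reweights it by the observation likelihood, renormalizing to obtain the new belief.

```python
import numpy as np

# Belief update sketch: b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) b(s).
def belief_update(b, a, o, T, O):
    """b: belief over states; T[a, s, s2] transition probs; O[a, s2, obs] observation probs."""
    b_pred = b @ T[a]                 # predict: push the belief through the transition model
    b_new = b_pred * O[a, :, o]       # correct: weight by the observation likelihood
    return b_new / b_new.sum()        # renormalize to a probability vector

# Tiny 2-state, 1-action, 2-observation example (illustrative numbers only).
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])          # T[a, s, s2]
O = np.array([[[0.75, 0.25],
               [0.30, 0.70]]])        # O[a, s2, obs]
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, T=T, O=O))
```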
Hierarchical Markov Models
Hierarchical Markov models extend traditional Markov models by incorporating multi-level structures that capture nested dependencies in sequential data or decision processes. In these models, higher-level abstract states govern transitions and emissions at lower levels, allowing for the representation of complex hierarchies where, for example, a super-state in a reinforcement learning (RL) setting initiates and oversees a subordinate Markov chain until a termination condition is met.[51] This hierarchical organization enables modeling of phenomena with varying timescales or compositional structures, such as skills composed of primitive actions in RL or layered linguistic features in sequences.[52]

Key types include hierarchical hidden Markov models (HHMMs) and hierarchical Markov decision processes (HMDPs). In HHMMs, the hierarchy consists of super-states, each of which emits sequences from a lower-level hidden Markov model (HMM) rather than individual observations; transitions occur between super-states only upon completion of the subordinate chain, forming a tree-like structure that recursively models multi-scale sequences.[53] HMDPs, often used in RL, build on this by integrating decision-making, where higher-level policies select options—temporally extended actions—that execute lower-level Markov policies, potentially over partially observable environments like POMDPs at the base level.[51]

Inference in hierarchical models, particularly HHMMs, typically involves extensions of standard HMM algorithms adapted for the tree structure. Parameter learning employs variational methods, such as expectation-maximization (EM) variants that propagate messages bottom-up and top-down across levels, or sampling techniques like Markov chain Monte Carlo (MCMC) and Gibbs sampling to handle the latent hierarchy and estimate transition probabilities, emission distributions, and durations.[53][54] These approaches achieve efficient inference, often in linear time relative to sequence length, by exploiting the hierarchical decomposition.[55]

A primary benefit of hierarchical Markov models is temporal abstraction, which allows reuse of learned sub-sequences across contexts and reduces the state space explosion inherent in flat models by factoring complex behaviors into modular components.[56] This abstraction mitigates the curse of dimensionality in long-horizon tasks, enabling faster learning and planning by operating at multiple timescales simultaneously.[57] Recent developments as of 2025 include Hierarchical Semi-Markov Models that incorporate duration-aware dynamics for better modeling of variable-length sequences in applications like activity recognition.[58]

In reinforcement learning, the options framework exemplifies hierarchical modeling for skill acquisition, where an agent learns temporally abstract actions (options) that terminate based on learned policies, facilitating hierarchical control in environments like navigation or game-playing.[51] For speech processing, HHMMs have been applied to model prosody at multiple levels, such as phonetic, syllabic, and intonational tiers, capturing dependencies in spontaneous Mandarin speech by hierarchically predicting pitch contours and rhythm from linguistic contexts.[59][60] Extensions to non-parametric settings include infinite hierarchical hidden Markov models (IHHMMs), which use priors like the hierarchical Dirichlet process to allow an unbounded number of states and levels, automatically inferring the hierarchy's depth and structure from data without
fixed dimensionality assumptions.[61] This enables flexible modeling of diverse sequence lengths and complexities in applications like genomics or natural language.[62]
Graphical and Spatial Models
Markov Random Fields
A Markov random field (MRF), also known as an undirected graphical model, consists of an undirected graph G = (V, E), where V is the set of vertices and E the set of edges, together with a set of random variables \{X_v : v \in V\} defined on a common probability space. The key characterizing feature is the local Markov property, which states that each variable X_v is conditionally independent of all other variables given its neighbors:
P(X_v \mid X_{V \setminus \{v\}}) = P(X_v \mid X_{\mathcal{N}(v)}),
where \mathcal{N}(v) = \{ u \in V : (u, v) \in E \} denotes the neighbors of v. This property encodes multivariate dependencies through the graph structure, making MRFs suitable for modeling spatial or relational data without assuming a temporal order.[63] The Hammersley-Clifford theorem establishes the equivalence between MRFs and Gibbs random fields for strictly positive probability distributions. Specifically, a distribution satisfies the local Markov property with respect to G if and only if it admits a Gibbs factorization:
P(X) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \phi_c(X_c),
where \mathcal{C} is the set of cliques in G, each \phi_c is a non-negative potential function over the clique c, and Z is the normalizing partition function. This factorization allows parameterization via clique potentials, facilitating the representation of complex interactions. The theorem, originally from unpublished lecture notes, has been foundational in linking Markov properties to exponential family forms in multivariate analysis.[64]

Common parameterizations of MRFs include the Ising model for binary variables, where potentials encourage alignment between neighboring states, such as \phi_{uv}(x_u, x_v) = \exp(\beta x_u x_v) for x_u, x_v \in \{-1, +1\} and interaction strength \beta > 0. This model, originating from statistical physics to describe ferromagnetism, serves as the prototypical MRF and has been widely adopted for discrete spatial modeling. The Potts model generalizes the Ising framework to K-state variables, with potentials \phi_{uv}(x_u, x_v) = \exp(\beta \mathbb{I}(x_u = x_v)), where \mathbb{I} is the indicator function, allowing for multi-label dependencies while preserving the pairwise structure.[65]

Inference in MRFs, such as computing marginals or the maximum a posteriori configuration, is generally NP-hard due to the partition function Z. Approximate methods include Gibbs sampling, a Markov chain Monte Carlo technique that iteratively samples from full conditional distributions to generate configurations from the joint, as introduced for Bayesian image restoration. Belief propagation, exact on tree-structured graphs, approximates inference on loopy graphs via iterative message passing between nodes, converging to fixed points that yield marginal approximations; the loopy variant, while not guaranteed to converge, performs well empirically in practice.[66][67]

MRFs find application in image denoising, where spatial potentials model pixel correlations to smooth noise while preserving edges, as demonstrated in early Bayesian restoration frameworks using Ising-like models on lattice graphs. In social network analysis, MRFs, often termed Markov random graphs, capture dependencies in tie formations, such as homophily, by defining potentials over edge cliques to infer network structures from observed data. Unlike Bayesian networks, which use directed acyclic graphs to encode asymmetric conditional dependencies via factorization of the joint into conditional probabilities, MRFs employ undirected graphs to represent symmetric, non-causal associations without implying ordering.[66][68]
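Gibbs sampling for the Ising parameterization described above can be sketched as follows (Python with NumPy; the lattice size, interaction strength \beta, and periodic boundary conditions are illustrative assumptions). Each step resamples one spin from its full conditional distribution, which by the local Markov property depends only on the four neighboring spins.

```python
import numpy as np

# Gibbs sampler for an Ising-model MRF on an n x n lattice with spins in {-1, +1}
# and pairwise potentials exp(beta * x_u * x_v) on neighboring sites.
rng = np.random.default_rng(0)
n, beta, sweeps = 32, 0.6, 200       # illustrative choices
X = rng.choice([-1, 1], size=(n, n))

for _ in range(sweeps):
    for i in range(n):
        for j in range(n):
            # Sum of the four neighboring spins (periodic/toroidal boundary).
            s = (X[(i - 1) % n, j] + X[(i + 1) % n, j]
                 + X[i, (j - 1) % n] + X[i, (j + 1) % n])
            # Full conditional from the clique potentials:
            # P(X_ij = +1 | neighbors) = 1 / (1 + exp(-2 * beta * s)).
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * s))
            X[i, j] = 1 if rng.random() < p_plus else -1

print(X.mean())   # average magnetization of the sampled configuration
```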