Graphical model

A graphical model, also known as a probabilistic graphical model, is a statistical model that represents the joint probability distribution over a set of random variables using a graph, where nodes correspond to the variables and edges encode conditional dependencies or independencies among them.^[1] This framework combines principles from probability theory and graph theory to compactly capture complex multivariate distributions, enabling efficient representation, inference, and learning in high-dimensional settings.^[2] Graphical models are broadly classified into two primary types: directed models and undirected models. Directed graphical models, exemplified by Bayesian networks, employ directed acyclic graphs (DAGs) to factorize the joint distribution as a product of conditional probabilities, where each edge indicates a direct causal or influential relationship from parent to child nodes.^[1] In contrast, undirected graphical models, such as Markov random fields (or Markov networks), use undirected graphs to express the joint distribution via potential functions over cliques of variables, emphasizing mutual dependencies without implying directionality.^[2] A related representation, the factor graph, provides a bipartite graph structure that explicitly separates variables from the factors defining interactions, facilitating unified algorithms for both directed and undirected cases.^[1] These models have roots in early 20th-century work, such as Andrey Markov's chains (1906) and Ernst Ising's model (1925), but gained prominence in the late 1980s through Judea Pearl's formalization of Bayesian networks for probabilistic reasoning in artificial intelligence.^[3] Today, graphical models underpin applications in diverse fields, including machine learning for pattern recognition, bioinformatics for genetic analysis, computer vision for image segmentation, and natural language processing for sequence modeling, due to their ability to handle uncertainty and scale to large datasets via algorithms like belief propagation and variational inference.^[1]^[2]

Introduction

Definition and Motivation

Graphical models, also known as probabilistic graphical models, are a class of probabilistic models that use a graph-based structure to represent the conditional dependencies among a set of random variables. In these models, nodes correspond to random variables, and edges between nodes encode the direct dependencies (or conditional independencies) between them, providing a visual and compact encoding of multivariate probability distributions.^[2]^[4] The primary motivation for graphical models arises from the challenges of modeling high-dimensional joint probability distributions, where the number of parameters grows exponentially with the number of variables under full independence assumptions. By exploiting conditional independence structures inherent in many real-world systems, graphical models allow for a modular factorization of the joint distribution into lower-dimensional components, such as local potentials or conditional probabilities, which significantly reduces the representational complexity. This approach not only facilitates intuitive visualization of dependencies but also bridges key areas of statistics, artificial intelligence, and computer science by unifying principles from probability theory and graph theory to enable scalable reasoning under uncertainty.^[2]^[4]^[5] Key benefits include the provision of visual intuition for complex relationships, support for efficient computation through decomposition (e.g., the joint probability as a product of factors over cliques or conditional distributions), and the ability to handle large-scale problems by leveraging graphical separation criteria to identify conditional independencies. These properties make graphical models particularly valuable for applications requiring tractable inference and learning in domains with intricate interdependencies, such as natural language processing or bioinformatics.^[2]^[4] A basic example illustrates this: consider two random variables, X (e.g., weather condition) and Y (e.g., attendance at an event), where Y depends on X. In a graphical model, this is represented by a single edge connecting the nodes for X and Y, indicating that the joint distribution P(X, Y) factors as P(X) \cdot P(Y \mid X) in a directed graph or via a potential function in an undirected graph, capturing their dependence without assuming full parameterization.^[2]

Historical Development

The roots of graphical models trace back to early 20th-century efforts in statistics and physics to represent complex dependencies visually and probabilistically. In the 1920s, geneticist Sewall Wright developed path analysis as a method to diagram causal relationships among variables in biological systems, using directed paths to quantify correlations and influences in multivariate data.^[6] This approach laid foundational ideas for directed graphical structures by decomposing observed correlations into direct and indirect effects. Paralleling this, concepts of Markov fields emerged in statistical physics during the 1950s, building on earlier lattice models like the Ising model (1925) to describe interacting particle systems under local dependencies, with measure-theoretic foundations formalized by mathematicians such as Kolmogorov and Doob.^[7] These early influences from path diagrams and physical Markov processes provided the conceptual groundwork for later probabilistic graphical representations. The 1980s marked a pivotal era for graphical models, driven by advances in artificial intelligence and computer vision. Judea Pearl introduced Bayesian networks in the mid-1980s, formalizing directed acyclic graphs to encode conditional independencies and enable efficient probabilistic inference in expert systems.^[8] Concurrently, Markov random fields gained prominence in computer vision through works like Geman and Geman's 1984 application of Gibbs distributions and stochastic relaxation for Bayesian image restoration, adapting physical models to handle spatial dependencies in image processing tasks.^[9] A key unification came in 1988 with Lauritzen and Spiegelhalter's framework, which integrated directed and undirected models via moralization and demonstrated local computation algorithms for belief propagation across graphical structures.^[10] Influential figures included Pearl, whose work extended to causal inference; Hammersley and Clifford, who in 1971 established the theorem equating positive Markov fields to Gibbs measures; and communities at conferences like the Uncertainty in Artificial Intelligence (UAI), inaugural in 1985, which fostered collaborative advancements in probabilistic reasoning.^[11]^[12] Post-2000, graphical models integrated deeply with machine learning, evolving from classical probabilistic tools to hybrid systems enhanced by neural architectures. Seminal texts like Koller and Friedman's 2009 treatise solidified their role in scalable inference for large datasets, bridging statistics, AI, and applications in genomics and natural language processing. By the 2010s, deep graphical models emerged, combining variational inference with neural networks for tasks like generative modeling, as seen in deep Boltzmann machines. In recent years, developments as of 2025 have focused on neural enhancements, such as Neural Graphical Models, which embed expressive neural parameterizations into traditional structures for improved representation of complex dependencies while preserving interpretability.^[13]^[14] Ongoing advancements include hybrid approaches integrating probabilistic graphical models with graph neural networks and deep generative models, highlighted at events like the 12th International Conference on Probabilistic Graphical Models (PGM 2024).^[15]

Mathematical Foundations

Graphical Structures

Graphical models employ graph-theoretic structures to encode relationships among random variables, providing a visual and formal representation of dependencies without delving into probabilistic semantics. A graph G = (V, E) consists of a set of vertices V, each corresponding to a random variable (which may be discrete or continuous), and a set of edges E that indicate the presence of dependencies between these variables. This notation draws from standard graph theory, adapted to model multivariate distributions.^[16] In directed graphical models, the structure is a directed acyclic graph (DAG), where edges are oriented arrows representing conditional influences or causal directions between variables. Acyclicity, the absence of directed cycles in the graph, is a fundamental property that permits a topological ordering of vertices, ensuring that influences propagate in a well-defined manner without feedback loops. Directed edges from a parent vertex to a child vertex signify that the child's value depends on the parent's, facilitating the modeling of asymmetric relationships. Undirected graphical models, in contrast, use undirected graphs to capture symmetric dependencies, where edges are unoriented lines connecting vertices to denote mutual associations without implying directionality. Key structural elements include cliques, which are fully connected subgraphs of vertices, and separators, which are subsets of vertices that, when removed, disconnect the graph into independent components.^[16] These properties enable the identification of modular structures within the dependency network. To bridge directed and undirected representations, the moralization process transforms a DAG into an undirected graph by first adding undirected edges between all pairs of parents sharing a common child (moralizing the graph) and then removing edge directions. This procedure preserves certain independence properties while simplifying the structure for analysis in undirected frameworks.

Probabilistic Interpretations

Graphical models provide a framework for encoding probability distributions over multiple random variables through graph structures, where the edges represent dependencies and the absence of edges implies conditional independencies. Central to this encoding is the concept of conditional independence, which states that two sets of random variables X and Z are conditionally independent given a third set Y, denoted X \perp Z \mid Y, if and only if the conditional probability satisfies P(X \mid Y, Z) = P(X \mid Y).^[17] This property allows the joint distribution to be factored into simpler components, reducing computational complexity while preserving the essential probabilistic relationships. In graphical models, conditional independencies are read off the graph structure via specific separation criteria. For undirected graphical models, also known as Markov random fields, the separation criterion holds: two sets of nodes X and Z are conditionally independent given Y if Y separates X and Z in the graph, meaning every path between nodes in X and Z is blocked by nodes in Y.^[17] For directed graphical models, or Bayesian networks, the d-separation criterion applies: X and Z are conditionally independent given Y if every directed path from X to Z is d-separated by Y, where a path is blocked if it contains a chain A \to B \to C or fork A \leftarrow B \to C with B in Y, or a collider A \to B \leftarrow C with neither B nor its descendants in Y.^[17] These criteria ensure that the graph faithfully represents the independence structure of the underlying distribution. The factorization theorems link the graph structure directly to the joint probability distribution. In directed graphical models, the joint distribution P(X_1, \dots, X_n) over nodes X_1, \dots, X_n factorizes according to the directed acyclic graph (DAG) as

P(X_1, \dots, X_n) = \prod_{i=1}^n P(X_i \mid \mathrm{Pa}(X_i)),

where \mathrm{Pa}(X_i) denotes the parents of X_i in the DAG; this follows from the chain rule of probability and the conditional independencies implied by the graph.^[17] In undirected graphical models, the joint distribution P(X) factorizes over the maximal cliques C of the graph as

P(X) \propto \prod_{c \in \mathcal{C}} \phi_c(X_c),

where \mathcal{C} is the set of cliques, X_c are the variables in clique c, and \phi_c are positive potential functions; this factorization captures the local interactions defined by the edges.^[18] The Hammersley-Clifford theorem establishes the equivalence between undirected graphical models and Gibbs distributions for distributions with positive densities. Specifically, a probability distribution P(X) over a finite set of variables satisfies the global Markov properties with respect to an undirected graph if and only if it can be expressed as a Gibbs distribution

P(X) = \frac{1}{Z} \exp\left( \sum_{c \in \mathcal{C}} \psi_c(X_c) \right),

where Z is the normalizing constant, \mathcal{C} are the cliques, and \psi_c are potential functions; this holds under the assumption of positive potentials, linking the graphical separation to exponential family forms.^[18] The Markov properties in graphical models derive from local to global independencies through iterative application of the separation criteria. Local Markov properties state that each variable is independent of its non-descendants (in directed graphs) or non-adjacent variables (in undirected graphs) given its Markov blanket. The global Markov property then follows: for disjoint sets A, B, S, A \perp B \mid S if S separates A and B in the graph. This derivation proceeds by showing that the local independencies imply pairwise separations, which extend to set separations via the transitivity of conditional independence and the graph's blocking rules, ensuring all implied independencies are captured without additional assumptions.^[17]

Types of Graphical Models

Undirected Graphical Models

Undirected graphical models, also known as Markov random fields or Markov networks, provide a framework for representing multivariate probability distributions using an undirected graph, where nodes represent random variables and edges denote conditional dependencies between them. Unlike directed models, these graphs impose no inherent directionality or acyclicity, making them suitable for capturing symmetric associations such as correlations without implying causation. The joint probability distribution P(X) over the variables X = (X_1, \dots, X_n) factors into a product of potential functions \phi_c defined over the maximal cliques c of the graph, normalized by a partition function Z:
P(X) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \phi_c(X_c),
where \mathcal{C} is the set of maximal cliques and Z = \sum_X \prod_{c \in \mathcal{C}} \phi_c(X_c). This clique-based factorization, as established by the Hammersley-Clifford theorem for strictly positive distributions, ensures that the model encodes the global Markov properties of conditional independence via graph separation.^[19]^[20] A key property of undirected graphical models is the local Markov property, where the Markov blanket of a node consists solely of its immediate neighbors in the graph; conditioning on these neighbors renders the node conditionally independent of all other variables. This neighborhood-based separation simplifies the representation of local interactions and aligns with the pairwise or higher-order potentials used in the factorization. In applications, these models are particularly effective in spatial statistics for modeling lattice-structured data with regional dependencies, such as in geostatistics or environmental modeling, and in image processing for tasks like segmentation and denoising, where pixel interactions are naturally symmetric and local.^[21]^[22]^[23] A canonical example is the Ising model, originally developed in statistical physics to describe ferromagnetic interactions among spins on a lattice. In this model, binary variables s_i \in \{-1, +1\} interact pairwise via coupling strengths J_{ij}, with the energy function given by
H(\mathbf{s}) = -\sum_{i < j} J_{ij} s_i s_j,
and the joint distribution proportional to \exp(-H(\mathbf{s}) / T), where T is temperature; absent edges correspond to zero couplings, enforcing conditional independence. This pairwise structure exemplifies how undirected models handle undirected associations, contrasting with directed models' requirement for acyclic graphs to represent conditional probabilities via parent-child relations.^[21]^[24]

Directed Graphical Models

Directed graphical models, commonly referred to as Bayesian networks, represent multivariate probability distributions using directed acyclic graphs (DAGs), where each node denotes a random variable and each directed edge signifies a direct conditional dependency between variables.^[17] This structure encodes the joint distribution as a product of local conditional probabilities: each variable's distribution is conditioned solely on its immediate predecessors, or parents, in the graph.^[17] Formally, for variables X_1, \dots, X_n, the joint probability factors as

P(X_1, \dots, X_n) = \prod_{i=1}^n P(X_i \mid \mathrm{Pa}(X_i)),

where \mathrm{Pa}(X_i) denotes the set of parents of X_i.^[17] This factorization exploits the graph's acyclicity to capture hierarchical or causal relationships efficiently, avoiding the exponential complexity of specifying full joint distributions directly.^[17] A defining feature of Bayesian networks is the local Markov property, which states that each variable is conditionally independent of its non-descendants given its parents.^[17] This property implies a form of modularity: the influence of distant variables flows only through immediate dependencies, enabling compact representations of complex systems.^[17] Additionally, the directed edges lend themselves to causal interpretations, where arrows indicate potential causal directions.^[25] Interventions—such as setting a variable to a specific value—can be modeled using the do-operator, which truncates incoming edges to the intervened node, distinguishing actions from mere observations.^[25] The do-calculus, a set of three inference rules, allows identification of causal effects from observational data by manipulating probabilities under graphical criteria like back-door and front-door adjustments.^[25] Conditional independencies implied by the model are determined via d-separation, a graphical test for blocking paths between variables.^[17] A classic example is the Asia chest clinic network, which models diagnostic reasoning for lung-related conditions.^[26] Nodes represent binary variables such as "visit to Asia," "smoking," "tuberculosis," "lung cancer," "bronchitis," "either tuberculosis or cancer," "chest X-ray," "dyspnea," and "positive X-ray."^[26] Directed edges connect causes to effects, for instance, "visit to Asia" to "tuberculosis," "smoking" to both "lung cancer" and "bronchitis," and symptoms like "dyspnea" depending on underlying diseases.^[26] This network facilitates probabilistic queries, such as updating disease probabilities given observed symptoms, demonstrating how directed structures support evidential and causal reasoning in medical diagnostics.^[26] To handle continuous variables, Bayesian networks extend to linear Gaussian models, where each conditional distribution is a Gaussian with mean linear in the parents' values and fixed variance.^[27] Specifically, for a continuous child X_i with continuous parents \mathrm{Pa}(X_i),

X_i \mid \mathrm{Pa}(X_i) \sim \mathcal{N}\left( \mu_i + \sum_{j \in \mathrm{Pa}(X_i)} \lambda_{ij} (X_j - \mu_j), \tau_i^2 \right),

with parameters \mu_i, \lambda_{ij}, \tau_i^2 estimated from data.^[27] This assumption preserves conjugacy, allowing exact inference via multivariate Gaussian completions, and is widely used in econometric and genetic modeling where variables like stock prices or gene expressions follow linear dependencies.^[27]

Advanced and Hybrid Variants

Cyclic directed models extend traditional directed acyclic graphs by incorporating feedback loops to represent temporal dependencies and recurrent processes, particularly in dynamic systems like time series data. These models, known as dynamic Bayesian networks (DBNs), allow cycles across time slices while maintaining acyclicity within each slice, enabling the modeling of evolving states such as Markov chains or hidden Markov models. Introduced as a framework for reasoning about persistence and causation, DBNs unroll the network over time steps to approximate inference, transforming the cyclic structure into a tree-like Bayesian network for exact or approximate algorithms like the forward-backward pass. This unfolding approach facilitates efficient computation for sequential data, such as speech recognition or financial forecasting, by treating future states as dependent on past observations without explicit loops in the inference graph.^[28] Hybrid models combine elements of directed and undirected graphical structures to handle more complex independence relations, including those induced by latent variables. Chain graphs, which integrate directed edges for causal influences and undirected edges for conditional associations within Markov blankets, provide a unified representation for semi-Markovian processes where full acyclicity is not assumed. These models support causal interpretations by associating distributions with equilibrium states of dynamic systems, allowing for mixed dependencies in observational data from fields like epidemiology. Ancestral graphs further extend this by incorporating bidirected edges to denote latent confounders, enabling representation of equivalence classes of causal diagrams under marginalization and conditioning, thus aiding identification in the presence of unobserved variables.^[29]^[30] Other variants generalize the representational power of graphical models for specific factorizations and data structures. Factor graphs represent joint distributions as products of local factors over variables, using a bipartite structure that separates nodes for variables and factors to simplify message-passing algorithms like sum-product for inference. This formalism unifies directed and undirected models, facilitating applications in coding theory and signal processing where arbitrary factorizations are needed. Relational graphical models, such as probabilistic relational models, extend these to relational domains with entities and links, modeling uncertainty over populations using schemas with slots for attributes and relations, akin to object-oriented Bayesian networks. These allow scalable learning of dependencies in databases or social networks by parameterizing distributions over types rather than fixed variables.^[31]^[32] Recent developments post-2020 have explored hybrid variants integrating transformer architectures with graphical models to achieve scalable inference and learning in large-scale AI systems. These hybrids leverage transformer's attention mechanisms to approximate complex probabilistic dependencies in graph-structured data, enhancing efficiency for dynamic and relational settings beyond traditional exact methods. For instance, viewing transformers as graph-to-graph models enables probabilistic extensions for sequence and graph prediction tasks, combining the factorization benefits of graphical models with transformer's global context capture.^[33] More recent examples include MSGformer (as of 2025), a hybrid architecture combining multi-scale graph neural networks with transformer encoders for enhanced modeling in electronics and signal processing tasks.^[34]

Inference in Graphical Models

Exact Inference Algorithms

Exact inference in graphical models refers to algorithms that compute precise marginal or conditional probabilities by exploiting the conditional independence structure encoded in the graph, ensuring exact results under the model's factorization properties. These methods are particularly effective for models with sparse or tree-like structures, where the graph's topology allows for efficient elimination of variables without excessive computational cost. The core idea leverages the joint probability distribution's decomposition into local factors, enabling dynamic programming techniques to avoid enumerating all possible configurations. One fundamental approach is variable elimination, a general exact inference algorithm that systematically sums out variables to compute marginals or conditionals. In this method, variables are eliminated one by one according to an ordering, with each elimination step multiplying relevant factors and then marginalizing over the variable to produce a new factor over the remaining variables. The resulting intermediate factors capture the interactions among non-eliminated variables, and the process continues until the query variables are isolated. This algorithm, formalized as bucket elimination, unifies various reasoning tasks and operates directly on the moralized graph for directed models or the original graph for undirected ones. For computing marginals, the sum-product variant of variable elimination passes messages that represent summed products of factors, generalizing belief propagation to arbitrary graphs via factor graphs. In directed graphical models, such as Bayesian networks, the marginal probability of a variable X_i is computed exactly as

P(X_i) = \sum_{X \setminus X_i} \prod_{j} P(X_j \mid \mathrm{Pa}(X_j)),

where the sum is over all variables except X_i, and the product is over all conditional probabilities given parents, exploiting the directed acyclic structure for factorization. This equation directly follows from the joint distribution's chain rule decomposition and can be evaluated efficiently using variable elimination if the graph has low induced width. For tree-structured undirected graphical models, belief propagation provides an exact message-passing algorithm where nodes exchange messages along the tree edges to compute marginals. Introduced by Pearl, it involves forward and backward passes: each message from a node to its neighbor summarizes the beliefs from the subtree, computed as a product of incoming messages and local potentials, followed by summation over the sending node's variable. This yields exact marginals at leaves and propagates them throughout the tree in linear time relative to the number of edges. To extend exact inference to general loopy undirected models, the junction tree algorithm first triangulates the graph to form a chordal graph, then constructs a junction tree where nodes are cliques (maximal complete subgraphs) and edges represent separators. Messages are then passed between cliques using a sum-product rule analogous to belief propagation, ensuring consistency across the tree. Developed by Lauritzen and Spiegelhalter, this method guarantees exact results by maintaining potentials on cliques that encode the full joint distribution. Despite their precision, exact inference algorithms suffer from high computational complexity in dense graphs, as the size of intermediate factors grows exponentially with the treewidth—the minimum width over all possible elimination orderings, defined as one less than the size of the largest clique in a triangulated graph. Exact methods are thus tractable only for models with bounded treewidth, such as trees (treewidth 1) or near-trees, beyond which approximations become necessary for scalability.

Approximate Inference Methods

Approximate inference methods are essential for graphical models where exact inference becomes computationally intractable due to high treewidth or large model size, providing scalable approximations to posterior distributions or marginals. These techniques trade off precision for efficiency, enabling applications in complex domains like computer vision and natural language processing. Unlike exact methods such as variable elimination, which are limited to low-complexity graphs, approximate approaches handle general structures, including loopy undirected models and deep directed networks.^[35] Sampling-based methods, particularly Markov Chain Monte Carlo (MCMC), generate samples from the target posterior distribution to estimate expectations or marginals. Gibbs sampling, a foundational MCMC algorithm, iteratively samples each variable conditioned on the current values of all others, ensuring the chain converges to the stationary distribution under mild conditions. Introduced in the context of image restoration using Gibbs distributions, this method is particularly suited for undirected graphical models, where conditional independencies simplify sampling steps. For dynamic graphical models, such as those in time-series analysis, sequential Monte Carlo (SMC) extends sampling by propagating particles through a sequence of distributions, incorporating importance sampling and resampling to maintain approximation quality. SMC frameworks tailored to graphical models allow flexible ordering of variables for inference, improving efficiency over standard MCMC in high-dimensional settings. Variational inference approximates the intractable posterior with a simpler distribution from a chosen family, minimizing the Kullback-Leibler (KL) divergence to the true posterior, which is equivalent to maximizing a lower bound on the log-evidence known as the evidence lower bound (ELBO). In the mean-field approximation, the variational family assumes full independence among variables, leading to factorized distributions that facilitate coordinate ascent updates; this approach is computationally efficient but often underestimates posterior correlations. Structured variational methods enhance accuracy by incorporating dependencies, such as through cluster graphs or exponential family forms, allowing better fits at the cost of increased complexity. These techniques, rooted in optimization over exponential families, provide deterministic approximations suitable for large-scale models.^[36]^[35] Loopy belief propagation (LBP) extends the exact sum-product algorithm to graphs with cycles by performing iterative message passing between nodes, computing approximate marginals as fixed points of the update equations. Despite the absence of convergence guarantees in general loopy graphs, LBP often yields accurate results in practice, particularly for weakly connected cycles, and minimizes the Bethe free energy approximation to the posterior. This method is widely used in applications like error-correcting codes and stereo vision, where message computations leverage local structure for scalability.^[37] Recent advances in the 2020s integrate neural networks for amortized inference, training models to approximate posteriors across related graphical structures for rapid, one-pass computation. For instance, normalizing flows conditioned on graphical model structures enable invertible transformations for density estimation, amortizing the cost of inference in continuous models.^[38] Graph neural networks further facilitate message passing in amortized settings, generalizing beyond fixed architectures to handle variable graph sizes and improving upon traditional LBP in loopy scenarios.^[39] More recent developments include prior-data fitted networks (PFNs), which use transformer-based models for fast amortized Bayesian inference on graph-structured data, achieving strong performance in low-data regimes as of 2025.^[40] These hybrid approaches leverage deep learning's representational power to achieve state-of-the-art approximations in probabilistic programming and generative modeling tasks.

Learning Graphical Models

Parameter Learning

Parameter learning in graphical models involves estimating the numerical values of the model's parameters, such as conditional probability tables (CPTs) in directed models or potential functions in undirected models, given a fixed graph structure and observed data. This process aims to fit the model to the data while respecting the independencies encoded by the graph. For discrete variables, parameters are typically probabilities that sum to one; for continuous variables, they may include means and covariances in Gaussian models.^[41] Maximum likelihood estimation (MLE) is a fundamental approach for parameter learning, seeking parameters that maximize the likelihood of the observed data under the model. With complete data—where all variables are observed in every sample—MLE often admits closed-form solutions due to the graphical structure's decomposition of the joint likelihood. In Bayesian networks, for a node X_i with discrete states and parents \mathrm{Pa}(X_i), the MLE for the CPT entry is the normalized count of configurations in the data:

P(X_i = x \mid \mathrm{Pa}(X_i) = \mathrm{pa}) = \frac{N(x, \mathrm{pa})}{N(\mathrm{pa})},

where N(x, \mathrm{pa}) is the number of samples with X_i = x and parents \mathrm{Pa}(X_i) = \mathrm{pa}, and N(\mathrm{pa}) is the count for the parent configuration alone.^[41]^[42] In undirected graphical models, such as Markov random fields, MLE requires optimizing the log-likelihood, which includes a intractable partition function, often addressed via approximations like pseudolikelihood for tractability.^[41] When data is incomplete, with missing or hidden variables, direct MLE becomes infeasible, and the expectation-maximization (EM) algorithm provides an iterative solution. Introduced by Dempster, Laird, and Rubin, EM alternates between an E-step, which computes the expected values of sufficient statistics (or log-likelihood) over the missing data given the observed data and current parameters—typically using inference methods like belief propagation—and an M-step, which maximizes this expected complete-data likelihood to update parameters.^[43]^[41] In graphical models, the E-step leverages the model's structure for efficient expectation computation, while the M-step reuses closed-form updates from the complete-data case, such as count-based CPT revisions in Bayesian networks:

\hat{\theta}_{i,rst}^{(k+1)} = \frac{\sum_n P(Y_i=r, Y_j=s, Y_k=t \mid x^{(n)}, \theta^{(k)})}{\sum_n P(Y_j=s, Y_k=t \mid x^{(n)}, \theta^{(k)})},

where Y denotes latent variables and superscript k indexes iterations; convergence is to a local maximum of the observed-data likelihood.^[41] For undirected models, EM may incorporate iterative scaling to adjust clique potentials in the M-step.^[41] Bayesian approaches to parameter learning incorporate prior distributions over parameters to yield posterior estimates, enhancing robustness against overfitting, especially with sparse data. Dirichlet distributions serve as conjugate priors for multinomial parameters in discrete graphical models, allowing closed-form posterior updates that add pseudocounts to data counts, effectively smoothing estimates and integrating prior knowledge.^[44]^[41] For example, in Bayesian networks, a Dirichlet prior with hyperparameters \alpha yields the posterior mean P(X_i = x \mid \mathrm{Pa}(X_i) = \mathrm{pa}) = \frac{N(x, \mathrm{pa}) + \alpha_x}{N(\mathrm{pa}) + \sum \alpha}, promoting stability in small-sample regimes.^[42] In undirected models, similar priors can regularize potential parameters, though full posterior computation often requires approximations like variational inference.^[41]

Structure Learning

Structure learning in graphical models involves inferring the topology of the graph—specifically the presence, absence, and direction of edges—from observational data, which is a key step in model selection before parameter estimation. This process addresses the challenge of discovering dependencies among variables while avoiding overfitting, often under assumptions such as the absence of latent confounders for directed acyclic graphs (DAGs). Algorithms for structure learning generally fall into two main categories: score-based and constraint-based methods, each balancing computational feasibility with statistical rigor.^[45] Score-based methods evaluate candidate graph structures using a scoring function that measures goodness-of-fit to the data penalized by model complexity, selecting the structure that maximizes the score. Common scoring functions include the Bayesian Information Criterion (BIC), which approximates the Bayes factor for model comparison by penalizing the number of parameters with a term proportional to the logarithm of the sample size, and the Akaike Information Criterion (AIC), which uses a fixed penalty based on twice the number of parameters to estimate relative prediction error.^[46] To search the vast space of possible structures efficiently, these methods often employ greedy hill-climbing algorithms, which start from an initial graph (e.g., empty or complete) and iteratively add, delete, or reverse edges to improve the score until a local optimum is reached.^[47] This approach, while heuristic, has been shown to identify optimal structures under certain decomposable scores like BIC.^[47] Constraint-based methods, in contrast, infer structure by testing for conditional independencies between variables, leveraging the graphical criterion of d-separation to prune edges and orient them. The Peter-Clark (PC) algorithm is a seminal constraint-based approach that begins with a complete undirected graph and progressively removes edges based on statistical tests of conditional independence, such as the chi-squared (\chi^2) test for discrete variables, conditioning on increasingly larger sets of variables to identify direct dependencies.^[48] These tests assume that the data-generating process satisfies the Markov condition, where independencies are encoded by graph separation.^[48] In the context of causal discovery, structure learning extends to identifying DAGs that represent causal relationships, with the PC algorithm suitable for settings without latent confounders, producing a Markov equivalence class of DAGs. For scenarios involving latent variables or selection bias, the Fast Causal Inference (FCI) algorithm extends PC by incorporating additional rules to handle unmeasured confounders, outputting a partial ancestral graph (PAG) that marks possible, definite, and uncertain causal directions.^[48]^[49] Both algorithms rely on the faithfulness assumption, which posits that all conditional independencies in the data are faithfully represented by the graph's d-separations, ensuring that the true structure can be recovered asymptotically from sufficient data.^[48] Despite their effectiveness, structure learning methods face significant challenges, including high computational complexity, as finding the optimal graph structure is NP-hard even for discrete Bayesian networks under standard scores.^[50] The faithfulness assumption, while enabling identifiability, can fail in finite samples or under model misspecification, leading to spurious edges or missed causal paths, and requires careful choice of independence tests to control for multiple testing errors.^[48]^[45]

Applications

In Machine Learning and Statistics

Graphical models play a central role in machine learning for modeling sequential data through directed structures like Hidden Markov Models (HMMs), which represent hidden states and observable emissions as a chain of nodes to capture temporal dependencies in processes such as speech recognition and natural language processing. HMMs enable probabilistic reasoning over sequences by leveraging the conditional independence encoded in the graph, allowing efficient inference via algorithms like the Viterbi and forward-backward methods to estimate state paths and likelihoods.^[51] In statistics, HMMs facilitate uncertainty quantification in time-series analysis, where the graphical structure decomposes joint probabilities into tractable factors for parameter estimation from observed data.^[52] In machine learning, Gaussian processes (GPs) can be aggregated using Gaussian graphical models (GGMs) in ensemble methods for distributed GPs, enabling scalable nonparametric regression and classification.^[53] This connection allows GPs to model complex dependencies in continuous data domains, such as spatial interpolation or reinforcement learning, by treating kernel-induced correlations as dependencies in graphical structures.^[54] For causal inference in statistics, directed acyclic graphs (DAGs) underpin do-calculus, a set of graphical rules introduced by Pearl for identifying interventional effects from observational data without experiments, by manipulating the graph to account for confounding and enabling counterfactual queries.^[55] Ensemble methods in statistics further integrate multiple graphical models, such as combining Gaussian graphical models (GGMs) to form consensus networks that aggregate partial correlations for robust structure recovery in high-dimensional settings.^[53] Applications in anomaly detection leverage undirected graphical models like GGMs to identify outliers in network data by estimating sparse precision matrices that highlight deviations from normal conditional dependencies, as seen in multi-channel time-series monitoring where graphical Bayesian changepoint detection flags irregularities.^[56] In recommender systems, collaborative filtering employs graphical models to represent user-item interactions as bipartite graphs, with latent factor models formulated as probabilistic graphical structures that infer preferences through neighborhood-based inference, improving prediction accuracy over matrix factorization alone.^[57] Modern integrations with deep learning approximate graphical model posteriors using variational autoencoders (VAEs), which cast latent variable inference as amortized variational methods on directed graphs, as demonstrated in post-2020 works on graph VAEs for link prediction in molecular networks.^[58] These hybrids enhance scalability for large-scale probabilistic reasoning, with VAEs providing tractable lower bounds on marginal likelihoods derived from the graphical factorization.

In Scientific and Engineering Domains

In bioinformatics, graphical models such as Bayesian networks have been instrumental in reconstructing gene regulatory networks by representing directed probabilistic dependencies between genes based on expression data.^[59] These models enable the inference of regulatory interactions, such as transcription factor bindings, by incorporating prior knowledge of biological pathways and handling noisy high-throughput data from microarrays or RNA sequencing.^[60] For protein structure prediction, probabilistic graphical models, including hidden Markov models and conditional random fields, model the sequential and spatial constraints of amino acid chains to predict secondary and tertiary structures, improving accuracy over traditional physics-based simulations.^[61] In engineering applications, dynamic Bayesian networks facilitate fault diagnosis in electrical circuits by modeling temporal evolutions of system states and sensor observations, allowing localization of transient faults through probabilistic propagation.^[62] For reliability analysis in complex networks, such as power grids or communication systems, probabilistic graphical models quantify failure dependencies and propagation risks, enabling predictive maintenance by estimating conditional probabilities of component failures under stress conditions.^[63] In physics and computer vision, Markov random fields serve as undirected graphical models for image denoising, enforcing local smoothness priors to restore noisy pixel intensities while preserving edges, as pioneered in early Bayesian restoration techniques.^[64] Spatial graphical models, often Gaussian Markov random fields, are applied in epidemiology to capture contagion patterns across geographic regions, modeling disease spread as conditional dependencies in lattice-structured graphs derived from incidence data.^[65] Recent advancements in the 2020s have used Bayesian networks for structure learning in climate model evaluation, aiding in uncertainty quantification for projections of extreme events.^[66] In robotics, graph-based probabilistic methods enhance sensor fusion by integrating multimodal data from LiDAR, cameras, and IMUs, improving localization and obstacle detection in dynamic environments through efficient inference of fused states.^[67]