Bayesian network
A Bayesian network, also known as a belief network or Bayes net, is a probabilistic graphical model that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG).[1] In this structure, nodes correspond to random variables (which may be discrete or continuous) and directed edges signify direct probabilistic influences or conditional dependencies between variables, while the absence of edges encodes conditional independencies among non-adjacent nodes.[2] The full joint probability distribution over the variables is compactly encoded as the product of each variable's conditional probability given its parent variables in the graph, enabling efficient representation and manipulation of complex probability distributions that would otherwise require exponential storage.[1]

The concept of Bayesian networks was formalized and popularized by Judea Pearl in the 1980s, building on earlier work in probabilistic reasoning and graphical models dating back to the 1920s with Sewall Wright's path analysis in genetics.[3] Pearl's seminal contributions, including his 1988 book Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, introduced belief propagation algorithms for exact inference in these networks, allowing probabilities to be updated in light of new evidence.[3] This framework leverages Bayes' theorem to perform reasoning under uncertainty, making it a cornerstone of artificial intelligence and machine learning for handling incomplete or noisy data.[4]

Bayesian networks support two primary tasks: inference, which computes posterior probabilities or marginals given observed evidence, using exact algorithms such as variable elimination or junction tree methods, or approximate methods such as Markov chain Monte Carlo when exact inference is intractable; and learning, which estimates the network structure and parameters from data using approaches such as score-based search or constraint-based discovery.[1] These capabilities make Bayesian networks particularly suited to domains requiring causal modeling and decision-making under uncertainty.[5]

Notable applications span medicine, where they aid diagnostic systems by integrating symptoms, test results, and disease probabilities, for example in the Pathfinder system for pathology diagnosis;[6] biology, for inferring gene regulatory networks from expression data; and risk assessment, such as environmental modeling or financial forecasting. In engineering and operations research, they facilitate fault diagnosis in complex systems and reliability analysis.[7] Recent advances integrate Bayesian networks with deep learning for scalable inference on large datasets, enhancing their utility in modern AI applications such as natural language processing and autonomous systems.[8]

Introduction
Basic Definition
A Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies through a directed acyclic graph (DAG), where each node corresponds to a random variable and each directed edge indicates a direct probabilistic influence from one variable to another.[9] The primary purpose of a Bayesian network is to compactly encode the joint probability distribution over the variables by specifying a factorization into local conditional probability distributions, one for each variable given its direct predecessors in the graph, thereby enabling efficient representation and reasoning under uncertainty.[9] Bayesian networks support updating beliefs about unobserved variables through the incorporation of evidence on observed ones, leveraging Bayes' theorem to revise probabilities coherently across the model. The term was coined by Judea Pearl in 1985 to highlight its foundations in Bayesian conditioning and its applicability to evidential and causal reasoning in artificial intelligence and statistics.[10]

Illustrative Example
A common illustrative example of a Bayesian network is a model for diagnosing lung cancer based on patient symptoms and risk factors. The network consists of four binary variables: Smoking (indicating whether the patient smokes), Cancer (indicating presence of lung cancer), Cough (indicating the presence of a cough), and XRay (indicating a positive X-ray result). Each variable can take values yes or no. The directed edges capture causal relationships: an arrow from Smoking to Cancer reflects that smoking increases the risk of cancer, while arrows from Cancer to Cough and from Cancer to XRay reflect that cancer causes these symptoms and test outcomes. This structure forms a directed acyclic graph (DAG), depicted textually as:

Smoking → Cancer → Cough
                 ↘ XRay

The DAG ensures no cycles and encodes the probabilistic dependencies among the variables.

To construct this network, begin by defining the nodes for the relevant variables and their states. Next, draw directed edges based on domain knowledge of causal or associative influences, such as medical evidence linking smoking to cancer and cancer to observable symptoms. Finally, specify a conditional probability table (CPT) for each node, quantifying the probabilities given its parents. For root nodes like Smoking with no parents, the CPT is simply the prior probability distribution. For other nodes, the CPT lists probabilities conditioned on all combinations of parent states. Since the variables are binary, most CPTs are 2x1 or 2x2 tables. Example CPTs, drawn from standard medical diagnosis illustrations, are as follows (probabilities sum to 1 for each conditioning case):

CPT for Smoking (prior):
| Smoking | Probability |
|---|---|
| yes | 0.3 |
| no | 0.7 |

CPT for Cancer given Smoking:

| Cancer \ Smoking | Smoking = yes | Smoking = no |
|---|---|---|
| yes | 0.10 | 0.01 |
| no | 0.90 | 0.99 |

CPT for Cough given Cancer:

| Cough \ Cancer | Cancer = yes | Cancer = no |
|---|---|---|
| yes | 0.80 | 0.30 |
| no | 0.20 | 0.70 |

CPT for XRay given Cancer:

| XRay \ Cancer | Cancer = yes | Cancer = no |
|---|---|---|
| yes | 0.90 | 0.20 |
| no | 0.10 | 0.80 |
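To make these numbers concrete, the following sketch in plain Python (no libraries; the dictionary and function names are illustrative, and the probabilities are copied from the tables above) evaluates the joint distribution as the product of CPT entries and computes a posterior by enumeration:

```python
# CPTs from the tables above; tuple keys list the child's value first.
p_smoking = {"yes": 0.3, "no": 0.7}
p_cancer_given_smoking = {("yes", "yes"): 0.10, ("no", "yes"): 0.90,   # (cancer, smoking)
                          ("yes", "no"): 0.01, ("no", "no"): 0.99}
p_cough_given_cancer = {("yes", "yes"): 0.80, ("no", "yes"): 0.20,     # (cough, cancer)
                        ("yes", "no"): 0.30, ("no", "no"): 0.70}
p_xray_given_cancer = {("yes", "yes"): 0.90, ("no", "yes"): 0.10,      # (xray, cancer)
                       ("yes", "no"): 0.20, ("no", "no"): 0.80}

def joint(s, c, co, x):
    """P(Smoking=s, Cancer=c, Cough=co, XRay=x) via the network factorization."""
    return (p_smoking[s]
            * p_cancer_given_smoking[(c, s)]
            * p_cough_given_cancer[(co, c)]
            * p_xray_given_cancer[(x, c)])

# Posterior P(Cancer = yes | Cough = yes, XRay = yes), marginalizing out Smoking.
numerator = sum(joint(s, "yes", "yes", "yes") for s in ("yes", "no"))
evidence = sum(joint(s, c, "yes", "yes") for s in ("yes", "no") for c in ("yes", "no"))
print(numerator / evidence)  # about 0.32
```

With these CPTs, observing both a cough and a positive X-ray raises the probability of cancer from its prior marginal P(Cancer = yes) = 0.3 × 0.10 + 0.7 × 0.01 = 0.037 to roughly 0.32.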
Graphical and Structural Foundations
Directed Acyclic Graph Representation
A Bayesian network is represented by a directed acyclic graph (DAG), in which each node corresponds to a random variable that may be either discrete or continuous. Directed edges in the graph connect pairs of nodes, indicating a direct conditional dependency: an edge from node X to node Y signifies that Y directly depends on X, conditional on the other direct influencers of Y. This structure compactly encodes the qualitative relationships among the variables, capturing how the state of one variable influences others without specifying the full joint distribution.[4]

The acyclicity of the graph is a fundamental requirement, ensuring that no directed cycles exist that could lead to inconsistencies in dependency resolution during probability propagation. Without cycles, the DAG admits a topological ordering of its nodes, a linear arrangement in which every directed edge points from an earlier node to a later one, facilitating ordered processing of dependencies from "root" causes to effects. This property follows from graph theory: a directed graph is acyclic if and only if it possesses at least one topological ordering, which can be constructed via algorithms such as depth-first search that detect and prevent cycles.

Within the DAG, key structural relationships define the dependency hierarchy. The parents of a node are the nodes with edges directed into it, representing its immediate conditioning variables. Conversely, the children of a node are the nodes with edges directed out of it, denoting variables directly affected by the node. The ancestors of a node are all nodes from which it can be reached along directed paths, forming the extended set of influencers through transitive dependencies. These relations highlight the hierarchical nature of the model, in which influences propagate unidirectionally along paths.

Qualitatively, the directed edges symbolize direct influences or associative links between variables, often lending themselves to interpretations of mechanistic or conditional effects in practical domains.[4] However, the graph structure encodes dependencies without inherently establishing causation, which requires additional assumptions or data.
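The sketch below (plain Python; the function names and the depth-first formulation are illustrative choices, not a canonical implementation) computes these structural relations and a topological ordering for the lung cancer DAG from the earlier example:

```python
def structural_relations(edges):
    """Parents, children, and ancestors for a DAG given as (parent, child) pairs."""
    parents, children, nodes = {}, {}, set()
    for a, b in edges:
        children.setdefault(a, set()).add(b)
        parents.setdefault(b, set()).add(a)
        nodes.update((a, b))

    def ancestors(node):
        # All nodes from which a directed path leads into `node`.
        found, stack = set(), list(parents.get(node, ()))
        while stack:
            p = stack.pop()
            if p not in found:
                found.add(p)
                stack.extend(parents.get(p, ()))
        return found

    return parents, children, {n: ancestors(n) for n in nodes}

def topological_order(edges):
    """Depth-first topological ordering; raises ValueError if a directed cycle exists."""
    children, nodes = {}, set()
    for a, b in edges:
        children.setdefault(a, []).append(b)
        nodes.update((a, b))
    order, state = [], {}  # state: 1 = on the current DFS path, 2 = finished

    def visit(n):
        if state.get(n) == 1:
            raise ValueError("graph has a directed cycle")
        if state.get(n) != 2:
            state[n] = 1
            for ch in children.get(n, []):
                visit(ch)
            state[n] = 2
            order.append(n)

    for n in nodes:
        visit(n)
    return order[::-1]  # reverse postorder puts every parent before its children

# Lung cancer example DAG: Smoking -> Cancer -> {Cough, XRay}.
dag = [("Smoking", "Cancer"), ("Cancer", "Cough"), ("Cancer", "XRay")]
print(topological_order(dag))                 # e.g. ['Smoking', 'Cancer', 'XRay', 'Cough']
parents, children, ancestors = structural_relations(dag)
print(parents["Cancer"], ancestors["XRay"])   # {'Smoking'} and {'Cancer', 'Smoking'} (set order may vary)
```

The visit function raises an error when it returns to a node already on the current search path, which is exactly the cycle detection mentioned above.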
Conditional Independencies and Markov Properties

In Bayesian networks, the directed acyclic graph (DAG) structure encodes conditional independencies among the random variables, enabling compact representation of joint probability distributions by capturing only the essential probabilistic dependencies. These independencies arise from the topological properties of the graph, where the presence or absence of directed edges signifies direct influences, and the overall arrangement implies broader separation conditions that translate into probabilistic independence statements. This encoding is formalized through Markov properties, which bridge the graphical representation and the underlying probability distribution.

The local Markov property provides a foundational building block for these independencies, stating that each variable in the network is conditionally independent of its non-descendants given its parents.[11] Formally, for a variable X_i with parents \mathrm{Pa}(X_i), X_i \perp \mathrm{ND}(X_i) \mid \mathrm{Pa}(X_i), where \mathrm{ND}(X_i) denotes the set of non-descendants of X_i. This property ensures that once the immediate causes (parents) are known, the variable does not depend on other preceding variables in the graph, reflecting the causal or influential flow along directed paths. It implies conditional independencies, such as a variable being screened off from its non-descendants by its parents, and extends to more general cases through the global Markov property.

Building on the local property, the global Markov property extends these independencies to arbitrary subsets of variables, asserting that if a set S separates two disjoint sets A and B in the graph (meaning no active path connects A and B given S), then A \perp B \mid S holds in the distribution.[12] This separation is determined graphically via criteria such as d-separation, which identifies blocked paths in the DAG. The global property thus captures the full spectrum of conditional independencies implied by the network structure, allowing inference about distant variables conditioned on intervening ones.

A Bayesian network's DAG serves as an I-map (independence map) for the joint distribution, meaning the graph faithfully represents a subset of the true conditional independencies in the probability space, though it may not capture all of them (i.e., it is not necessarily a perfect map).[12] This mapping ensures that every graphical independence corresponds to a genuine probabilistic one, providing a minimal yet sufficient structure for reasoning under uncertainty, as established in foundational work on plausible inference.

Formal Definitions and Properties
Factorization of Joint Probability Distribution
In a Bayesian network, the joint probability distribution over a set of random variables X_1, X_2, \dots, X_n represented by a directed acyclic graph (DAG) factors into a product of conditional probabilities, each depending only on the variable's direct predecessors, or parents, in the graph.[13] This factorization is given by

P(X_1, \dots, X_n) = \prod_{i=1}^n P(X_i \mid \mathrm{Pa}(X_i)),

where \mathrm{Pa}(X_i) denotes the set of parents of X_i in the DAG.[13]

This representation derives from the chain rule of probability, which expands the joint distribution as P(X_1, \dots, X_n) = \prod_{i=1}^n P(X_i \mid X_1, \dots, X_{i-1}); assuming the variables are indexed in a topological order (parents before children), the conditioning set for each X_i can be restricted to its parents \mathrm{Pa}(X_i) because of the conditional independencies encoded in the graph structure.[13] The local Markov property justifies this restriction by implying that each variable is conditionally independent of its non-descendants given its parents.[13]

For discrete variables, each conditional probability P(X_i \mid \mathrm{Pa}(X_i)) is specified via a conditional probability table (CPT), which enumerates the probabilities for all possible values of X_i given every combination of parent values.[9] For continuous variables, these conditionals are instead represented by probability density functions that describe the distribution of X_i given its parents.[14]

The resulting factorization provides a minimal I-map of the joint distribution, meaning the graph's independencies form a subset of those in the true distribution, with no redundant edges; removing any edge would assert a conditional independence that need not hold in the distribution.[15] This minimal structure ensures an efficient yet complete encoding of the probabilistic dependencies implied by the network.[15]
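For instance, for the lung cancer network in the illustrative example above, the chain rule over the ordering (Smoking, Cancer, Cough, XRay) gives

P(\mathrm{Smoking}, \mathrm{Cancer}, \mathrm{Cough}, \mathrm{XRay}) = P(\mathrm{Smoking}) \, P(\mathrm{Cancer} \mid \mathrm{Smoking}) \, P(\mathrm{Cough} \mid \mathrm{Smoking}, \mathrm{Cancer}) \, P(\mathrm{XRay} \mid \mathrm{Smoking}, \mathrm{Cancer}, \mathrm{Cough}),

and the independencies encoded by the DAG reduce each conditioning set to the node's parents:

P(\mathrm{Smoking}, \mathrm{Cancer}, \mathrm{Cough}, \mathrm{XRay}) = P(\mathrm{Smoking}) \, P(\mathrm{Cancer} \mid \mathrm{Smoking}) \, P(\mathrm{Cough} \mid \mathrm{Cancer}) \, P(\mathrm{XRay} \mid \mathrm{Cancer}).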
Local Markov Property

The local Markov property is a fundamental independence assumption in Bayesian networks that captures the directed dependencies encoded by the directed acyclic graph (DAG). Formally, given a Bayesian network consisting of random variables X = \{X_v : v \in V\} and a DAG G = (V, A), the property states that for every node v \in V, X_v is conditionally independent of its non-descendants given its parents:

X_v \perp\!\!\!\perp \left( V \setminus (\{v\} \cup \mathrm{De}(v)) \right) \mid \mathrm{Pa}(v),
where \mathrm{De}(v) denotes the descendants of v (excluding v), and \mathrm{Pa}(v) are the direct parents of v.[16] This means that once the immediate causes (parents) of X_v are known, X_v provides no additional information about variables that do not descend from it.[17] The proof of the local Markov property follows from the joint probability factorization of the Bayesian network. The joint distribution factorizes as
P(X) = \prod_{v \in V} P(X_v \mid \mathrm{Pa}(v)).
Let \mathrm{ND} denote the non-descendants of v, so \mathrm{ND} = V \setminus (\{v\} \cup \mathrm{De}(v)), and consider the set S = \mathrm{ND} \cup \{v\}. The marginal distribution P(S) is obtained by summing (or integrating) out the descendants \mathrm{De}(v); processing the descendants in reverse topological order, each factor P(X_u \mid \mathrm{Pa}(u)) with u \in \mathrm{De}(v) sums to 1, leaving P(S) = \prod_{u \in S} P(X_u \mid \mathrm{Pa}(u)). For u \in \mathrm{ND}, \mathrm{Pa}(u) \subseteq \mathrm{ND}, because v is not an ancestor of any node in \mathrm{ND} (otherwise that node would be a descendant of v). Moreover, none of the conditional distributions for nodes in \mathrm{ND} depend on X_v, as there are no edges from v into \mathrm{ND}. Thus,

P(S) = \left[ \prod_{u \in \mathrm{ND}} P(X_u \mid \mathrm{Pa}(u)) \right] \times P(X_v \mid \mathrm{Pa}(v)),

where the first product does not involve X_v. Summing out X_v removes the final factor, so P(\mathrm{ND}) equals that first product, and dividing P(S) by P(\mathrm{ND}) yields P(X_v \mid \mathrm{ND}) = P(X_v \mid \mathrm{Pa}(v)), establishing the independence.[16][17]

This property has key implications for the minimality of the network structure. It ensures that the DAG is a minimal I-map (independence map) for the distribution, meaning every edge represents a direct dependence: removing an edge would assert a conditional independence that the distribution need not satisfy, while a redundant edge (for example, from a non-parent non-descendant directly to v) would introduce a dependence not warranted by the parents alone. Consequently, the graph avoids superfluous arcs, promoting a parsimonious representation of the probabilistic dependencies.[16]

In comparison to undirected graphical models like Markov random fields, the local Markov property in Bayesian networks emphasizes directional causality through parent conditioning, whereas in undirected models a variable is conditionally independent of non-adjacent variables given its immediate neighbors, without distinguishing directionality or ancestry. This directed formulation allows Bayesian networks to model asymmetric influences and temporal or causal orders more explicitly than their undirected counterparts.[18]
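As a small numerical illustration (a minimal sketch in plain Python, reusing the CPT numbers from the lung cancer example; the function names are illustrative), the local Markov property can be checked directly: Cough has parent Cancer and non-descendants Smoking and XRay, so P(Cough | Cancer, Smoking, XRay) should not depend on Smoking or XRay.

```python
# CPT entries from the lung cancer example above; tuple keys list the child's value first.
P_S  = {"yes": 0.3, "no": 0.7}
P_C  = {("yes", "yes"): 0.10, ("no", "yes"): 0.90, ("yes", "no"): 0.01, ("no", "no"): 0.99}  # (cancer, smoking)
P_Co = {("yes", "yes"): 0.80, ("no", "yes"): 0.20, ("yes", "no"): 0.30, ("no", "no"): 0.70}  # (cough, cancer)
P_X  = {("yes", "yes"): 0.90, ("no", "yes"): 0.10, ("yes", "no"): 0.20, ("no", "no"): 0.80}  # (xray, cancer)

def joint(s, c, co, x):
    return P_S[s] * P_C[(c, s)] * P_Co[(co, c)] * P_X[(x, c)]

def p_cough_given_rest(co, c, s, x):
    """P(Cough = co | Cancer = c, Smoking = s, XRay = x), computed from the joint."""
    total = sum(joint(s, c, v, x) for v in ("yes", "no"))
    return joint(s, c, co, x) / total

# For each value of the parent Cancer, the conditional is identical for every
# combination of the non-descendants Smoking and XRay, matching P(Cough | Cancer).
for c in ("yes", "no"):
    values = {round(p_cough_given_rest("yes", c, s, x), 9)
              for s in ("yes", "no") for x in ("yes", "no")}
    print(c, values)  # one-element sets: {0.8} for "yes", {0.3} for "no"
```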
d-Separation Criterion
The d-separation criterion, introduced by Judea Pearl, is a graphical test for determining conditional independencies in a Bayesian network represented as a directed acyclic graph (DAG). It specifies conditions under which two disjoint sets of nodes, X and Y, are conditionally independent given a third disjoint set Z, denoted as X \perp Y \mid Z. This criterion relies on analyzing undirected paths in the DAG to identify whether evidence in Z blocks all information flow between X and Y.

Formally, sets X and Y are d-separated by Z, written X \perp_d Y \mid Z, if every undirected path from a node in X to a node in Y is blocked by Z. A path is blocked by Z if, for at least one triple of consecutive nodes i - k - j along the path, one of the following holds:[19]

- Serial (head-to-tail) connection: i \to k \to j, and k \in Z.
- Diverging (tail-to-tail) connection: i \leftarrow k \to j, and k \in Z.
- Converging (head-to-head) connection: i \to k \leftarrow j, and neither k nor any descendant of k is in Z.
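The following compact Python sketch implements this path-based test by brute-force enumeration of undirected paths (suitable only for small graphs; practical systems typically use more efficient procedures such as the "Bayes ball" or moralization-based algorithms, and all names here are illustrative):

```python
def descendants(node, children):
    """All nodes reachable from `node` via directed edges (excluding node itself)."""
    seen, stack = set(), [node]
    while stack:
        for ch in children.get(stack.pop(), []):
            if ch not in seen:
                seen.add(ch)
                stack.append(ch)
    return seen

def is_d_separated(x, y, z, edges):
    """True if nodes x and y are d-separated by the set z in the DAG given as
    (parent, child) pairs. Enumerates all undirected simple paths, so it is
    intended only for small illustrative graphs."""
    children, neighbors = {}, {}
    for a, b in edges:
        children.setdefault(a, []).append(b)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    def blocked(path):
        # A path is blocked if at least one consecutive triple i - k - j blocks it.
        for i, k, j in zip(path, path[1:], path[2:]):
            if k in children.get(i, []) and k in children.get(j, []):
                # Converging (head-to-head): blocked unless k or a descendant of k is observed.
                if k not in z and not (descendants(k, children) & z):
                    return True
            elif k in z:
                # Serial or diverging connection: blocked when k is observed.
                return True
        return False

    def undirected_paths(current, visited):
        if current == y:
            yield visited
            return
        for nxt in neighbors.get(current, ()):
            if nxt not in visited:
                yield from undirected_paths(nxt, visited + [nxt])

    return all(blocked(p) for p in undirected_paths(x, [x]))

# Lung cancer example: Smoking -> Cancer, Cancer -> Cough, Cancer -> XRay.
dag = [("Smoking", "Cancer"), ("Cancer", "Cough"), ("Cancer", "XRay")]
print(is_d_separated("Cough", "XRay", {"Cancer"}, dag))  # True: common cause observed
print(is_d_separated("Smoking", "Cough", set(), dag))    # False: open serial path
```

Applied to the lung cancer DAG, conditioning on Cancer blocks the diverging path between Cough and XRay, while Smoking and Cough remain connected through the unblocked serial path when nothing is observed.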