Variable elimination
Variable elimination is an exact inference algorithm employed in probabilistic graphical models, including Bayesian networks and Markov random fields, to compute marginal probabilities, conditional distributions, or optimal assignments by systematically eliminating variables from the joint distribution through summation or maximization operations in a specified elimination order.[1] Introduced as part of the bucket elimination framework, it generalizes dynamic programming principles to unify various probabilistic inference tasks, such as belief updating, finding the most probable explanation (MPE), maximum a posteriori (MAP) estimation, and maximum expected utility (MEU) computation.[2]

The algorithm operates by representing the joint probability as a product of factors—typically conditional probability tables or potentials—and iteratively processing non-query variables: for each variable to eliminate, it multiplies all relevant factors involving that variable, then marginalizes (sums out) the variable to produce a new factor over its neighboring variables, which replaces the multiplied factors in subsequent processing steps.[3] This process exploits the conditional independencies encoded in the graphical model's structure to avoid exhaustive enumeration of the full joint distribution, reducing computational complexity from exponential in the number of variables to exponential in the induced width (or treewidth) of the moralized graph under the chosen elimination ordering.[2]

Good orderings, often found via heuristics like minimum degree or minimum fill-in, are crucial for minimizing the size of intermediate factors and thus controlling time and space requirements, which can otherwise grow prohibitively large for dense or loopy graphs.[4] While highly efficient for tree-structured or low-treewidth models, variable elimination's exact nature limits its scalability to very large or complex networks, prompting extensions like conditioning on evidence to prune irrelevant subnetworks or approximations such as loopy belief propagation for intractable cases.[2]
Introduction
Overview and Definition
Variable elimination (VE) is an exact inference algorithm employed in probabilistic graphical models to compute marginal probabilities or maximum a posteriori (MAP) assignments by successively eliminating variables from a joint probability distribution.[5] This method systematically reduces the complexity of the joint distribution by focusing on subsets of variables, enabling efficient computation of queries such as the probability of evidence or the most likely configuration of hidden variables.
At a high level, VE represents the joint distribution as a product of factors—compact representations of partial distributions—and iteratively multiplies the factors involving a chosen variable before summing out (marginalizing over) that variable, thereby eliminating it from consideration.[5] The process continues until only the variables of interest remain, yielding the desired marginal or assignment. The choice of elimination order influences computational cost, as suboptimal orders can lead to larger intermediate factors.
VE provides exact results, distinguishing it from approximate methods, and offers significant efficiency gains over brute-force enumeration, particularly for models with tree-structured graphs or low treewidth, where complexity is bounded by the induced width of the graph.[5] This makes it suitable for networks where the graph's structure limits the size of intermediate computations.
The algorithm unifies inference across different graphical model types, including directed Bayesian networks and undirected Markov random fields, by treating both through a common factor-based framework.[5] Within probabilistic graphical models, VE serves as a general elimination-based alternative to message-passing algorithms like belief propagation, applicable to both acyclic and loopy structures without requiring explicit message routing.
Historical Development
Variable elimination emerged in the mid-1990s as a key algorithm for exact probabilistic inference in graphical models, building on earlier advancements in constraint satisfaction problems and Bayesian network reasoning. Researchers Rina Dechter and Judea Pearl contributed foundational ideas through their work on tree clustering and decomposition methods for constraint networks in the late 1980s, which laid the groundwork for systematic variable elimination in probabilistic settings. Independently, Dechter formalized the approach in her seminal 1996 paper introducing bucket elimination as a unifying framework for various inference tasks, including belief updating and most probable explanation finding in Bayesian networks. Concurrently, Nevin Lianwen Zhang and David Poole described a similar variable elimination procedure in their 1996 work on exploiting causal independence, emphasizing its efficiency for marginalizing variables in directed graphical models.[6]
The algorithm's initial applications focused on Bayesian networks, heavily influenced by Judea Pearl's 1988 development of belief propagation, which provided an exact inference method for tree-structured graphs and highlighted the need for generalizable elimination techniques in loopy networks. Variable elimination applies to undirected graphical models such as Markov random fields, where it facilitates marginal inference by summing out variables while managing clique sizes through appropriate elimination orders. Further advancements in the 2000s addressed hybrid models combining discrete and continuous variables; for instance, methods integrating mixture of truncated exponentials and variable elimination enabled exact inference in hybrid Bayesian networks with both types of variables.
Key milestones include its integration into early probabilistic software tools in the 1990s and 2000s, paving the way for broader adoption in inference engines. In the 2010s, variable elimination influenced modern open-source libraries such as libDAI (implementing exact inference methods like junction tree) and pgmpy, a Python package supporting variable elimination for Bayesian network inference.[7][8] These implementations underscored its versatility across directed and undirected models. The algorithm's impact lies in bridging exact and approximate inference paradigms, as evidenced by its prominent role in Kevin Murphy's 2012 textbook, which popularized variable elimination as a foundational technique in machine learning curricula and applications.
Background Concepts
Probabilistic Graphical Models
Probabilistic graphical models (PGMs) provide a compact and intuitive representation of multivariate probability distributions by combining graph theory with probability theory, where nodes represent random variables and edges encode conditional dependencies or independencies among them.[9] This framework exploits the conditional independence structure inherent in many real-world systems to factorize complex joint distributions into simpler local components, enabling efficient storage, manipulation, and inference over high-dimensional spaces.[9] PGMs encompass both directed and undirected graph structures, facilitating applications in fields such as machine learning, statistics, and artificial intelligence.[9]
Bayesian networks, also known as belief networks, are directed acyclic graphs (DAGs) in which each node corresponds to a random variable, and directed edges represent conditional dependencies between variables.[10] The joint probability distribution over the variables X = \{X_1, \dots, X_n\} factorizes according to the graph structure as
P(X) = \prod_{i=1}^n P(X_i \mid \mathrm{Parents}(X_i)),
where \mathrm{Parents}(X_i) denotes the set of variables with direct edges pointing to X_i.[10] This factorization leverages the Markov condition, which states that each variable is conditionally independent of its non-descendants given its parents, allowing the model to capture asymmetric causal or influential relationships efficiently.[10]
Markov random fields (MRFs), or undirected graphical models, employ undirected graphs where nodes represent random variables and edges indicate mutual dependencies without directional asymmetry.[11] The joint distribution factorizes over the maximal cliques of the graph via the Hammersley-Clifford theorem, which equates positive MRFs with Gibbs distributions: under the positivity condition (all probabilities positive), a distribution is Markovian if and only if it admits a factorization
P(X) = \frac{1}{Z} \prod_C \psi_C(X_C),
where C ranges over maximal cliques, \psi_C are potential functions, and Z is the normalization constant.[11] The global Markov property in MRFs asserts conditional independence between variables separated by a conditioning set in the graph, focusing on symmetric interactions within cliques.[11]
Key differences between Bayesian networks and MRFs lie in their graph orientations and independence semantics: BNs use directed edges to encode directional asymmetries and ancestral relationships, while MRFs rely on undirected edges and graph separations to model undirected conditional independencies.[9] Both structures support variable elimination for exact inference tasks, such as computing marginal probabilities P(Y \mid e) or maximum a posteriori (MAP) estimates \arg\max P(X \mid e), where Y \subseteq X are query variables and e denotes observed evidence.[9] Standard notation in PGMs includes the set of variables X = \{X_1, \dots, X_n\}, with discrete or continuous domains, and inference queries conditioned on evidence e.[9]
The computational complexity of inference in PGMs is governed by the treewidth of the graph, defined as the minimum, over all elimination orderings, of one less than the size of the largest clique induced during elimination.[12] Low treewidth implies that the graph can be decomposed into a tree-like structure with small separators, enabling polynomial-time inference algorithms like variable elimination; for instance, graphs with treewidth bounded by k yield complexity exponential in k but linear in the number of nodes.[12] This measure quantifies the "tree-likeness" of the graph and is crucial for assessing the tractability of exact methods in dense dependency structures.[12]
Factors in Graphical Models
In probabilistic graphical models, a factor \phi is defined as a function mapping assignments to a subset of variables, known as its scope, to non-negative real numbers, typically representing unnormalized probabilities or potentials that capture local interactions among the variables.[13] These factors serve as the core data structures for inference algorithms like variable elimination, allowing the joint distribution to be decomposed into compact local components rather than storing the full exponential-sized joint explicitly.[14]
For discrete variables, factors are commonly represented as multidimensional arrays or tables, where each entry corresponds to the function value for a specific assignment of values within the scope; for instance, a factor \phi_{A,B}(a,b) over variables A and B might be tabulated with rows for values of A and columns for B.[13] In contrast, for continuous variables, factors take functional forms, such as Gaussian densities, to represent densities over continuous scopes efficiently.[13]
Factors in graphical models are derived directly from the model's structure: in Bayesian networks, each conditional probability distribution P(X_i \mid \mathrm{Pa}(X_i)) constitutes a factor over X_i and its parents \mathrm{Pa}(X_i); in Markov random fields, factors correspond to potential functions defined over maximal cliques in the undirected graph.[13] A key property is their multiplicativity, whereby the joint distribution (or unnormalized measure) is the product of all factors, enabling the full model to be reconstructed from these local pieces.[13] Additionally, summation over a variable in a factor's scope corresponds to marginalization or integration, effectively eliminating that variable from consideration.[14]
This factor-based representation is particularly advantageous in variable elimination, as it permits dynamic combination and reduction of factors during inference, avoiding the need to materialize the full joint distribution, which grows exponentially with the number of variables n.[14] By operating on these smaller scopes, the approach exploits conditional independencies inherent in the graphical model to maintain computational tractability.[13]
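As a concrete illustration of this representation, the sketch below stores a discrete factor as a pair of a variable tuple and a NumPy array with one axis per scope variable; the variable names and numbers are invented for illustration.

```python
import numpy as np

# A discrete factor as (vars, table): `vars` names the scope, and `table` has
# one axis per variable, indexed by that variable's domain values (here 0/1).
# Hypothetical CPT-style factor over binary A and B, values invented.
phi_B_given_A = (
    ("A", "B"),
    np.array([[0.9, 0.1],    # row A=0: value at (A=0,B=0), (A=0,B=1)
              [0.2, 0.8]])   # row A=1: value at (A=1,B=0), (A=1,B=1)
)

vars_, table = phi_B_given_A
print(vars_, table.shape)   # ('A', 'B') (2, 2)
print(table[1, 0])          # factor value at the assignment A=1, B=0 -> 0.2
```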
Core Operations
Factor Multiplication
In probabilistic graphical models, factor multiplication is a fundamental operation used to combine information from multiple factors during inference algorithms such as variable elimination. Given two factors \phi_1 with scope S_1 (the set of variables it depends on) and \phi_2 with scope S_2, their product \phi = \phi_1 \times \phi_2 has scope S = S_1 \cup S_2. For any assignment s to the variables in S, the value is computed as \phi(s) = \phi_1(s_{S_1}) \times \phi_2(s_{S_2}), where s_{S_1} and s_{S_2} denote the projections of s onto the respective scopes; if a variable is absent from a scope, its value is irrelevant to that factor's evaluation. This operation effectively creates a new factor that encodes the joint compatibility over the union of variables, preserving the probabilistic semantics of the original factors.[15]
To perform the multiplication algorithmically, the representations of \phi_1 and \phi_2 (typically tables or arrays indexed by assignments to their scopes) are aligned over any shared variables in S_1 \cap S_2. For each possible joint assignment to S_1 \cup S_2, the corresponding values from \phi_1 and \phi_2 are retrieved and multiplied pointwise to populate the new factor \phi. The time complexity of this step is O(|\dom(S_1 \cup S_2)|), where \dom(V) is the size of the Cartesian product of the domains of variables in V, assuming discrete finite domains; space requirements are similarly bounded by the output factor's size. If the domains of shared variables differ between \phi_1 and \phi_2 (e.g., due to prior restrictions or evidence instantiation), the factors must be extended by assigning dummy values—such as 0 for incompatible assignments in unnormalized potentials or 1 for neutral extension in multiplicative contexts—or restricted to the intersection of compatible assignments to ensure well-defined products.[15][16]
A simple example illustrates the process: consider factors \phi_A(a) over variable A with domain \{a_1, a_2\} and \phi_B(b) over B with domain \{b_1, b_2\}, where scopes are disjoint. Their product is \phi_{A,B}(a,b) = \phi_A(a) \times \phi_B(b), yielding a table with four entries, each the product of the independent marginal values. For overlapping scopes, such as \phi_{A,B}(a,b) and \phi_{B,C}(b,c) with B shared and domains \{b_1, b_2\} for both, the product \phi_{A,B,C}(a,b,c) has eight entries (assuming binary domains for A and C), where for each matching b, the values \phi_{A,B}(a,b) \times \phi_{B,C}(b,c) are computed, effectively integrating information across the chain. In variable elimination, this multiplication is crucial for building intermediate factors that represent partial joint distributions over the neighbors of a variable to be eliminated, enabling efficient propagation of dependencies without expanding to the full joint over all variables.[15]
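A minimal sketch of this pointwise product, assuming the simple (vars, table) factor representation over NumPy arrays used above; broadcasting aligns the shared variable, and the factor contents are illustrative.

```python
import numpy as np

def multiply(f1, f2):
    """Pointwise product of two discrete factors given as (vars, table)."""
    v1, t1 = f1
    v2, t2 = f2
    out_vars = list(v1) + [v for v in v2 if v not in v1]   # scope union
    def align(vars_, table):
        # Reorder and pad each table with singleton axes so it matches out_vars.
        shape = [table.shape[vars_.index(v)] if v in vars_ else 1 for v in out_vars]
        perm = [vars_.index(v) for v in out_vars if v in vars_]
        return np.transpose(table, perm).reshape(shape)
    return tuple(out_vars), align(list(v1), t1) * align(list(v2), t2)

# Overlapping scopes: phi(A,B) * phi(B,C) -> phi(A,B,C) with 8 entries.
phi_AB = (("A", "B"), np.array([[0.9, 0.1], [0.1, 0.9]]))
phi_BC = (("B", "C"), np.array([[0.8, 0.2], [0.3, 0.7]]))
vars_, table = multiply(phi_AB, phi_BC)
print(vars_, table.shape)   # ('A', 'B', 'C') (2, 2, 2)
print(table[0, 1, 0])       # phi_AB(A=0,B=1) * phi_BC(B=1,C=0) = 0.1 * 0.3 = 0.03
```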
Numerical issues arise in practice due to the potential for underflow (when multiplying many probabilities less than 1) or overflow (with unnormalized potentials exceeding machine precision). To mitigate this, factor multiplication is often implemented in log-space: values are stored and manipulated as logarithms, converting the product to a sum \log \phi(s) = \log \phi_1(s_{S_1}) + \log \phi_2(s_{S_2}), which avoids extreme exponents while preserving relative magnitudes; normalization can follow using log-sum-exp tricks for stability. This approach is particularly beneficial in high-dimensional models where intermediate factors involve numerous multiplications.[17]
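A brief sketch of the log-space variant, assuming SciPy's logsumexp for stable normalization; the tables are invented.

```python
import numpy as np
from scipy.special import logsumexp

# In log-space, factor multiplication becomes addition of log-tables.
log_phi_AB = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))   # log phi(A,B)
log_phi_A  = np.log(np.array([0.5, 0.5]))                 # log phi(A)

# log of phi(A,B) * phi(A): broadcast the shared variable A across rows.
log_prod = log_phi_AB + log_phi_A[:, None]

# Normalize with the log-sum-exp trick instead of exponentiating raw products.
log_Z = logsumexp(log_prod)
normalized = np.exp(log_prod - log_Z)
print(normalized.sum())   # 1.0
```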
Variable Summation
In variable elimination for probabilistic graphical models, the variable summation operation, also known as marginalization, removes a variable from a factor by summing over its entire domain, thereby producing a new factor with a reduced scope. This step is fundamental to reducing the joint distribution to marginals of interest. For a factor \phi defined over a scope S that includes variable X, the resulting factor \phi' after summing out X has scope S \setminus \{X\} and is given by
\phi'(s') = \sum_{x \in \dom(X)} \phi(s', x),
where s' denotes an assignment to the variables in S \setminus \{X\}, and \dom(X) is the domain of X.[14]
Algorithmically, computing this summation involves iterating over all possible assignments to S \setminus \{X\} and, for each such assignment, accumulating the values of \phi across all x \in \dom(X). The time required is proportional to the size of the original factor, which is O(|\dom(S \setminus \{X\})| \cdot |\dom(X)|), as each entry in the factor table must be considered. This operation often follows factor multiplication, which combines all factors involving X into a single factor before marginalization. In the context of discrete variables, as is common in many Bayesian networks and Markov random fields, the summation directly aggregates probabilities or potentials. For instance, given a factor \phi_{A,B}(a,b) over variables A and B, summing out B yields \phi_A(a) = \sum_{b \in \dom(B)} \phi_{A,B}(a,b), a unary factor over A alone.[14]
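Under the same (vars, table) convention sketched earlier, summing out a variable is a NumPy sum along the matching axis; the example marginalizes B out of a small two-variable factor (numbers illustrative).

```python
import numpy as np

def sum_out(factor, var):
    """Marginalize `var` out of a discrete factor given as (vars, table)."""
    vars_, table = factor
    new_vars = tuple(v for v in vars_ if v != var)
    return new_vars, table.sum(axis=vars_.index(var))

phi_AB = (("A", "B"), np.array([[0.72, 0.03],
                                [0.04, 0.18]]))
print(sum_out(phi_AB, "B"))   # (('A',), array([0.75, 0.22]))
```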
In graphical models with continuous variables, the discrete summation is replaced by integration to marginalize over the continuous domain: \phi'(s') = \int_{-\infty}^{\infty} \phi(s', x) \, dx, assuming a one-dimensional continuous variable X; multivariate cases extend this analogously. This adaptation enables exact inference in hybrid discrete-continuous models, though it may require closed-form solutions or numerical methods for tractability. Each elimination step in the variable elimination algorithm relies on this summation to progressively remove variables, generating intermediate factors that capture the marginals over the surviving variables until only the query variables remain.[18]
When evidence is incorporated, factors are first restricted by setting observed variables to their evidence values, which prunes the domains and adjusts the summation accordingly—for example, if a variable Y in the factor is observed as y = e, the factor is multiplied by a Dirac delta or indicator function at e, effectively fixing Y before any summation over other variables. This ensures that the marginals condition on the evidence without altering the core summation mechanics.[14]
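Evidence restriction can be sketched as indexing the observed variable's axis and dropping it from the scope; the helper below and its inputs are illustrative, with observed values given as domain indices.

```python
import numpy as np

def restrict(factor, var, value_index):
    """Fix `var` to an observed value (a domain index) and drop it from the scope."""
    vars_, table = factor
    if var not in vars_:
        return factor                      # evidence variable outside the scope
    axis = vars_.index(var)
    new_vars = tuple(v for v in vars_ if v != var)
    return new_vars, np.take(table, value_index, axis=axis)

phi_AB = (("A", "B"), np.array([[0.9, 0.1], [0.1, 0.9]]))
print(restrict(phi_AB, "B", 0))   # (('A',), array([0.9, 0.1]))
```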
The Algorithm
Step-by-Step Procedure
The variable elimination algorithm computes marginal probabilities or maximum a posteriori (MAP) estimates in probabilistic graphical models by systematically eliminating variables from a set of factors representing the joint distribution. The input consists of an initial set of factors \{\phi_1, \dots, \phi_k\} derived from the graphical model (such as conditional probability tables in Bayesian networks or potential functions in Markov random fields), the query variables Y, and evidence variables e with observed values. An elimination order for the non-query variables is also required, specifying the sequence in which variables are removed; here this order is assumed to be given, with strategies for choosing it discussed under variable ordering strategies below.[15]
The first step processes the evidence by restricting each factor \phi_i to the observed values in e. This involves evaluating \phi_i at the fixed values for the observed variables in its scope and then removing those variables from the scope, yielding a new set of restricted factors. This step ensures that only unobserved variables remain in subsequent computations, effectively conditioning the model on the evidence.[15][19]
Next, the algorithm iterates over the non-query variables in the chosen elimination order. For each variable X_i to eliminate, it collects all factors in the current set whose scopes include X_i. These factors are multiplied together to form an intermediate factor \psi = \prod \phi_j, where the product is over the relevant factors; the scope of \psi includes X_i and all other variables appearing in the multiplied factors. Then, X_i is summed out by computing \phi' (Z) = \sum_{x_i} \psi (x_i, Z), where Z denotes the remaining variables in the scope of \psi. This new factor \phi' is added back to the set of factors, replacing the originals used in the multiplication, while factors not involving X_i remain unchanged. The process repeats for the next variable until all non-query variables are eliminated. For MAP queries, the summation over X_i is replaced by an argmax operation to track the maximizing assignment.[15][19]
After eliminating all non-query variables, the remaining factors have scopes contained within the query variables Y. These factors are multiplied together to obtain the unnormalized marginal \tilde{P}(Y \mid e) = \prod \phi_m, where the product is over the final factors. For probability queries, normalization is applied by dividing by \sum_Y \tilde{P}(Y \mid e) to yield P(Y \mid e). The output is this final factor over Y for marginal inference or the tracked assignment for MAP.[15]
The procedure can be outlined in pseudocode as follows:
```
function VariableElimination(factors φ, query Y, evidence e, order σ):
    // Step 1: Restrict to evidence
    for each φ_i in φ:
        φ_i ← Restrict(φ_i, e)              // Evaluate at e and remove observed vars
    // Steps 2-3: Eliminate variables in order σ (non-query vars)
    remaining_vars ← all vars except Y
    while remaining_vars ≠ ∅:
        X_i ← next var in σ from remaining_vars
        relevant_φ ← {φ_j in φ | X_i in scope(φ_j)}
        ψ ← Product(relevant_φ)             // Multiply: ψ(x_i, Z) = ∏ φ_j
        φ' ← SumOut(ψ, X_i)                 // φ'(Z) = ∑_{x_i} ψ(x_i, Z)
        φ ← (φ \ relevant_φ) ∪ {φ'}         // Update factor set
        remaining_vars ← remaining_vars \ {X_i}
    // Step 4: Compute final marginal
    final_φ ← Product(φ)                    // All remaining factors over Y
    P(Y|e) ← Normalize(final_φ)             // Divide by sum over Y
    return P(Y|e)
```
This pseudocode integrates the core operations of factor multiplication and variable summation as building blocks for efficient inference.[15][19]
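One way the pseudocode might be realized is the self-contained Python sketch below, which represents factors as (vars, table) pairs over NumPy arrays; the helper names, the toy two-node network, and its numbers are illustrative assumptions rather than any library's API.

```python
import numpy as np

# A factor is (vars, table): `vars` is a tuple of names, `table` a NumPy array
# with one axis per variable, in that order, indexed by domain values 0, 1, ...

def multiply(f1, f2):
    v1, t1 = f1
    v2, t2 = f2
    out = list(v1) + [v for v in v2 if v not in v1]
    def align(vars_, table):
        shape = [table.shape[vars_.index(v)] if v in vars_ else 1 for v in out]
        perm = [vars_.index(v) for v in out if v in vars_]
        return np.transpose(table, perm).reshape(shape)
    return tuple(out), align(list(v1), t1) * align(list(v2), t2)

def sum_out(factor, var):
    vars_, table = factor
    new_vars = tuple(v for v in vars_ if v != var)
    return new_vars, table.sum(axis=vars_.index(var))

def restrict(factor, evidence):
    vars_, table = factor
    for var, val in evidence.items():
        if var in vars_:
            table = np.take(table, val, axis=vars_.index(var))
            vars_ = tuple(v for v in vars_ if v != var)
    return vars_, table

def variable_elimination(factors, query, evidence, order):
    """Compute P(query | evidence); `order` lists all non-query, non-evidence vars."""
    assert not (set(order) & (set(query) | set(evidence)))
    factors = [restrict(f, evidence) for f in factors]        # Step 1: evidence
    for var in order:                                         # Steps 2-3: eliminate
        relevant = [f for f in factors if var in f[0]]
        if not relevant:
            continue
        factors = [f for f in factors if var not in f[0]]
        psi = relevant[0]
        for f in relevant[1:]:
            psi = multiply(psi, f)
        factors.append(sum_out(psi, var))
    result = factors[0]                                       # Step 4: final marginal
    for f in factors[1:]:
        result = multiply(result, f)
    vars_, table = result
    return vars_, table / table.sum()                         # normalize over query

# Toy network A -> B, query P(B | A = 1); numbers are invented.
phi_A  = (("A",), np.array([0.6, 0.4]))
phi_BA = (("A", "B"), np.array([[0.9, 0.1], [0.2, 0.8]]))
print(variable_elimination([phi_A, phi_BA], query=("B",),
                           evidence={"A": 1}, order=[]))
# (('B',), array([0.2, 0.8]))
```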
Illustrative Example
To illustrate the variable elimination algorithm, consider a simple Bayesian network with three binary variables: A (root node), B (dependent on A), and C (dependent on both A and B). The network structure forms a V-shape converging on C, textually depicted as A → B, A → C, B → C.
The query is the marginal probability P(C = true). The conditional probability tables (CPTs) define the initial factors: φ₁(A) = P(A), φ₂(A, B) = P(B | A), and φ₃(A, B, C) = P(C | A, B). All variables take values in {true, false}.[15]
The factor φ₁(A) is given by:

| A | φ₁(A) |
|---|---|
| true | 0.5 |
| false | 0.5 |
The factor φ₂(A, B) is:
| A | B | φ₂(A, B) |
|---|---|---|
| true | true | 0.9 |
| true | false | 0.1 |
| false | true | 0.1 |
| false | false | 0.9 |
The factor φ₃(A, B, C) for C = true (with the complementary probabilities for C = false, so that the two values sum to 1 for each (A, B) instantiation) is:
| A | B | C | φ₃(A, B, C) |
|---|---|---|---|
| true | true | true | 0.8 |
| true | false | true | 0.3 |
| false | true | true | 0.4 |
| false | false | true | 0.2 |
Following the variable elimination procedure with elimination order B then A, first multiply φ₂(A, B) and φ₃(A, B, C = true) to obtain the intermediate factor ψ(A, B, C = true) = φ₂(A, B) × φ₃(A, B, C = true). Then sum out B to produce φ(A, C = true) = ∑_B ψ(A, B, C = true).[15]
The summation over B yields:
For A = true, C = true:
φ(A = true, C = true) = [φ₂(true, true) × φ₃(true, true, true)] + [φ₂(true, false) × φ₃(true, false, true)]
= (0.9 × 0.8) + (0.1 × 0.3) = 0.72 + 0.03 = 0.75
For A = false, C = true:
φ(A = false, C = true) = (0.1 × 0.4) + (0.9 × 0.2) = 0.04 + 0.18 = 0.22
Thus, the factor φ(A, C = true) is:
| A | C | φ(A, C) |
|---|---|---|
| true | true | 0.75 |
| false | true | 0.22 |
Next, multiply this with φ₁(A) to get ψ(A, C = true) = φ₁(A) × φ(A, C = true):
| A | C | ψ(A, C) |
|---|---|---|
| true | true | 0.5 × 0.75 = 0.375 |
| false | true | 0.5 × 0.22 = 0.11 |
Finally, sum out A: φ(C = true) = ∑_A ψ(A, C = true) = 0.375 + 0.11 = 0.485. This is the unnormalized marginal probability, but since the factors derive from normalized CPTs, it equals the true P(C = true) = 0.485. A similar process for C = false yields P(C = false) = 0.515, confirming normalization.[15]
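The arithmetic can be checked mechanically; the short script below transcribes the tables above into NumPy arrays (mapping true to index 0 and false to index 1, an arbitrary convention) and repeats the elimination order B then A.

```python
import numpy as np

phi1 = np.array([0.5, 0.5])                    # phi1(A) = P(A), [true, false]
phi2 = np.array([[0.9, 0.1], [0.1, 0.9]])      # phi2(A,B) = P(B|A), rows A, cols B
phi3 = np.array([[0.8, 0.3], [0.4, 0.2]])      # phi3(A,B,C=true), rows A, cols B

psi = phi2 * phi3                              # multiply phi2 and phi3 (shared A, B)
phi_AC = psi.sum(axis=1)                       # sum out B -> [0.75, 0.22]
p_c_true = (phi1 * phi_AC).sum()               # multiply by phi1(A), then sum out A
print(phi_AC, p_c_true)                        # [0.75 0.22] 0.485
```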
Applications in Inference
Exact Inference in Bayesian Networks
Variable elimination (VE) applied to Bayesian networks (BNs) begins by representing the joint probability distribution as a product of conditional probability tables (CPTs) associated with each node, where each CPT serves as an initial factor encoding the conditional dependencies.[5] To perform elimination, the directed acyclic graph (DAG) of the BN is first moralized by adding undirected edges between all pairs of parents of each common child and removing edge directions, transforming it into an undirected moral graph that captures the relevant conditional independencies for inference.[5] The elimination order is then selected on this moral graph to systematically remove variables, ensuring that the process respects the network's structure while avoiding cycles in the intermediate computations.[5]
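Moralization itself is a small graph transformation; a minimal sketch under an assumed {node: parents} input format (the three-node DAG is a made-up example) connects co-parents and drops edge directions.

```python
from itertools import combinations

def moralize(parents):
    """Undirected moral graph of a DAG given as {node: list of parents}."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:
            edges.add(frozenset((p, child)))      # keep each edge, undirected
        for p, q in combinations(pa, 2):
            edges.add(frozenset((p, q)))          # "marry" co-parents
    return edges

# Hypothetical v-structure A -> C <- B: moralization adds the edge A - B.
print(moralize({"A": [], "B": [], "C": ["A", "B"]}))
```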
For marginal inference queries such as computing the posterior probability P(Y|e), where Y is the query variable and e represents observed evidence on other nodes, VE processes the factors by summing out all non-query and non-evidence variables in the chosen order, ultimately yielding a factor over Y that is normalized to obtain the marginal.[5] When evidence is present on multiple nodes, the initial factors corresponding to those evidence variables are restricted upfront by setting them to their observed values, reducing the domain size and simplifying subsequent multiplications and summations.[5] This adaptation ensures efficient incorporation of evidence without altering the core elimination steps.
To address maximum a posteriori (MAP) queries, which seek the most probable assignment to a subset of variables given evidence, VE is modified into a max-product variant by replacing summation operations during elimination with maximization over the eliminated variables, propagating the highest probability assignments through the factors.[5] This extension maintains the same structural framework but shifts the objective from probabilistic marginalization to optimization, making it suitable for finding optimal configurations in diagnostic or decision-making tasks within BNs.[5]
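In code, the only change for the max-product variant is the elimination operator; below is a sketch of a max_out helper under the (vars, table) representation used earlier, returning the per-assignment argmax so the maximizing configuration can be recovered by back-tracing (values illustrative).

```python
import numpy as np

def max_out(factor, var):
    """Eliminate `var` by maximization instead of summation (max-product VE)."""
    vars_, table = factor
    axis = vars_.index(var)
    new_vars = tuple(v for v in vars_ if v != var)
    # Record the maximizing value of `var` for each remaining assignment.
    return (new_vars, table.max(axis=axis)), table.argmax(axis=axis)

phi_AB = (("A", "B"), np.array([[0.72, 0.03], [0.04, 0.18]]))
reduced, best_B = max_out(phi_AB, "B")
print(reduced)   # (('A',), array([0.72, 0.18]))
print(best_B)    # maximizing B for each A: [0 1]
```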
A specialized variant known as bucket elimination organizes the factors into "buckets": each factor is placed in the bucket of its argument that appears latest in the variable ordering, and the buckets are processed sequentially from the last variable in the ordering to the first, eliminating each bucket's variable in turn. This facilitates efficient storage and reuse of intermediate results, particularly in directed structures like BNs, and leverages the ordering to minimize factor sizes during processing, enhancing practicality for networks with moderate connectivity.
Although BNs are acyclic, the moralization process can introduce cycles, which VE handles by implicitly triangulating the graph during elimination—adding fill-in edges to form cliques—and the overall time and space complexity is exponential in the induced width of the moral graph under the chosen ordering, a measure of the largest clique size formed during the process.[5] The treewidth of the moral graph, defined as the minimum induced width over all possible orderings, provides a graph-theoretic bound on this complexity, indicating that exact VE is feasible only for BNs with small or structured treewidth, such as those approximating tree-like dependencies.[5]
Exact Inference in Markov Random Fields
In Markov random fields (MRFs), variable elimination enables exact inference by factorizing the joint distribution over an undirected graph into potentials defined on maximal cliques C. The joint probability is given by P(X) = \frac{1}{Z} \prod_{C} \psi_C(X_C), where \psi_C are non-negative potential functions over the variables in clique C, and Z = \sum_X \prod_{C} \psi_C(X_C) is the normalizing partition function.[16]
The algorithm proceeds similarly to its application in Bayesian networks but operates directly on the undirected structure, without needing moralization to connect parents. An elimination order is chosen for the variables, and for each variable X_i to eliminate, all potentials involving X_i are multiplied together, then X_i is marginalized out (typically by summation) to produce a new potential over the remaining neighboring variables. This multiplication and elimination step induces fill-in edges between neighbors of X_i, effectively completing the graph to a chordal form where every cycle of length greater than three has a chord; the resulting factors represent the updated factorization over the triangulated graph.[2][20]
To compute a query such as the marginal P(Y \mid e), evidence variables in e are fixed by multiplying in delta potentials (indicators that enforce the observed values), and all non-query variables are eliminated in sequence; the final factor over Y is then normalized by the partition function Z, which is computed via a separate full-elimination pass over all variables. Unlike in Bayesian networks, where conditional probability tables are normalized, MRF potentials \psi_C are often unnormalized, and the lack of directed edges means independence is encoded via separation in the undirected graph rather than d-separation.[16][20]
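As a small worked instance, the sketch below computes the partition function Z of a three-variable chain MRF X1 - X2 - X3 by eliminating variables left to right; the pairwise potentials are invented, and a brute-force sum is included as a check.

```python
import numpy as np
from itertools import product

# Unnormalized pairwise potentials psi12(X1, X2) and psi23(X2, X3), invented.
psi12 = np.array([[2.0, 1.0], [1.0, 2.0]])
psi23 = np.array([[3.0, 1.0], [1.0, 3.0]])

m1 = psi12.sum(axis=0)                    # eliminate X1: m1(X2) = sum_x1 psi12
m2 = (m1[:, None] * psi23).sum(axis=0)    # eliminate X2: m2(X3) = sum_x2 m1 * psi23
Z = m2.sum()                              # eliminate X3: Z

brute = sum(psi12[x1, x2] * psi23[x2, x3]
            for x1, x2, x3 in product(range(2), repeat=3))
print(Z, brute)                           # 24.0 24.0
```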
A representative example arises in image denoising with a grid-structured MRF, where each pixel is a variable X_{i,j}, data potentials \psi_d(X_{i,j}, Y_{i,j}) encode compatibility with the noisy observation Y_{i,j}, and pairwise smoothness potentials \psi_s(X_{i,j}, X_{i,j+1}) and \psi_s(X_{i,j}, X_{i+1,j}) encourage neighboring pixels to take similar values. Applying variable elimination row-by-row—first eliminating all variables in row 1, then row 2, and so on—multiplies horizontal and vertical potentials to form intermediate factors over entire rows, yielding the posterior marginals for the denoised image; however, this creates factors of size exponential in the grid width due to induced cliques.[16]
Exact inference via variable elimination is feasible in MRFs that are tree-structured (treewidth 1, allowing linear-time computation equivalent to belief propagation) or have bounded low treewidth, where the complexity is polynomial in the number of variables but exponential in the treewidth. For loopy MRFs with cycles and high treewidth, such as dense grids, exact VE becomes intractable, leading to approximations like loopy belief propagation or truncated elimination schemes, though the focus here remains on exact methods for low-treewidth cases.[16][2]
Complexity and Optimization
Computational Complexity
The computational complexity of variable elimination (VE) is fundamentally tied to the structure of the graphical model and the chosen elimination order. During each elimination step for a variable X_i, the algorithm multiplies relevant factors and sums over the domain of X_i, incurring a cost of O(d^{w+1}), where d is the maximum domain size of the variables and w is the induced width of the graph with respect to the elimination order (the maximum size of the scope of intermediate factors minus one). Since there are n variables to eliminate, the total time complexity is O(n \cdot d^{w+1}).[21]
The space complexity mirrors this, as the algorithm must store intermediate factors, requiring up to O(d^{w+1}) space for the largest ones generated during the process. The induced width w is closely related to the treewidth tw of the graph, which is the minimum induced width over all possible elimination orders; thus, VE's cost can be bounded by O(n \cdot d^{tw+1}). For graphs with low treewidth, such as trees where tw=1, VE achieves O(n \cdot d^{2}) time complexity, which is linear in n for fixed d, making it efficient for sparse structures.[21]
In the worst case, for dense graphs where tw \approx n-1, the complexity becomes exponential in the number of variables, scaling as O(n \cdot d^{n}), or O(n \cdot 2^{n}) for binary domains (d=2). Several factors influence this complexity: larger domain sizes d amplify the exponent, while the number of initial factors affects multiplication costs; incorporating evidence can reduce factor scopes, potentially lowering w and thus mitigating the exponential growth. Compared to naive enumeration, which requires O(d^n) time by evaluating all joint assignments, VE is more efficient whenever w \ll n, but for high-treewidth graphs where exact inference is intractable it is typically set aside in favor of approximate methods such as loopy belief propagation.[21]
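A back-of-the-envelope comparison makes the scaling concrete; the numbers below (n = 30 binary variables and a few assumed induced widths) are arbitrary illustrations of the bounds, not measurements.

```python
n, d = 30, 2
for w in (2, 5, 15):
    ve_cost = n * d ** (w + 1)            # O(n * d^(w+1)) table entries for width w
    print(f"w={w:2d}: VE ~ {ve_cost:,} vs enumeration ~ {d ** n:,}")
# w= 2: VE ~ 240 vs enumeration ~ 1,073,741,824
# w= 5: VE ~ 1,920 vs enumeration ~ 1,073,741,824
# w=15: VE ~ 1,966,080 vs enumeration ~ 1,073,741,824
```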
Variable Ordering Strategies
The efficiency of variable elimination (VE) hinges critically on the choice of elimination order, as it directly influences the size of intermediate factors generated during summation and multiplication steps. A poor ordering can lead to exponentially large factors by inducing high-degree cliques in the moralized graph, whereas a well-chosen order minimizes the induced width—the maximum size of these cliques minus one—keeping computational costs manageable. For instance, in a chain-structured graph with variables connected sequentially (A-B-C-D), eliminating endpoint variables first (e.g., A then D) maintains factors of size at most two, resulting in linear time complexity; conversely, eliminating a central variable like B first creates a factor over A, B, and C, inflating the width unnecessarily.[21]
Static heuristics compute an ordering in advance based on the graph structure to approximate the minimum induced width. The minimum-degree heuristic selects the variable with the fewest neighbors at each step, aiming to delay the growth of clique sizes. The min-fill heuristic extends this by choosing the variable whose elimination introduces the fewest new edges (fill-ins) to the graph, thereby controlling the density of the induced graph more effectively; empirical studies show it often yields orderings with smaller widths than minimum-degree, particularly in sparse networks. The goal of these heuristics is to find an ordering close to the minimum induced width, though determining the exact minimum is NP-complete.[21][22]
Dynamic ordering strategies adapt the selection during the VE process by evaluating current factor sizes or scopes, such as prioritizing the variable whose involved factors have the smallest product of domain sizes, to minimize immediate computational overhead. While less common in pure VE due to the fixed-order nature of the algorithm, dynamic approaches appear in hybrid variants like mini-bucket elimination, where they refine buckets on-the-fly for better bounding. Algorithms for generating good orderings include greedy implementations of min-fill and minimum-degree, which run in near-linear time, and more advanced methods like simulated annealing or depth-first search with approximations for larger graphs, as exhaustive search for the optimal order is infeasible.[23]
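A greedy min-fill ordering fits in a few lines over an adjacency-set representation; the graph below is a made-up example and ties are broken arbitrarily.

```python
from itertools import combinations

def min_fill_order(adj):
    """Greedy min-fill elimination order for a graph given as {node: set(neighbors)}."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    order = []
    while adj:
        def fill(v):
            # Number of missing edges among v's current neighbors.
            return sum(1 for a, b in combinations(adj[v], 2) if b not in adj[a])
        v = min(adj, key=fill)
        for a, b in combinations(adj[v], 2):          # add the fill-in edges
            adj[a].add(b)
            adj[b].add(a)
        for n in adj[v]:                              # remove v from the graph
            adj[n].discard(v)
        del adj[v]
        order.append(v)
    return order

# Small made-up graph: a 4-cycle A-B-C-D-A with chord B-D.
graph = {"A": {"B", "D"}, "B": {"A", "C", "D"},
         "C": {"B", "D"}, "D": {"A", "B", "C"}}
print(min_fill_order(graph))   # e.g. ['A', 'B', 'C', 'D'], adding no fill-in edges
```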
Tools such as SamIam, developed by the University of California, Los Angeles, integrate these heuristics into VE implementations for Bayesian networks, allowing users to select min-fill or other static orderings to optimize inference on moralized graphs. Limitations persist, as even strong heuristics may not achieve optimal widths on dense graphs, often necessitating combination with graph triangulation techniques to bound fill-ins upfront; in practice, multiple heuristics are tested to balance preprocessing time against inference speed.[24]