
Causal model

A causal model is a mathematical and conceptual framework used to represent the causal relationships among variables in a system, enabling the distinction between correlation and causation by specifying how changes in one variable influence others through mechanisms rather than mere statistical associations. In particular, a structural causal model (SCM), as formalized by Judea Pearl, consists of a set of endogenous variables (outcomes determined within the model), exogenous variables (external influences), structural equations defining each endogenous variable as a function of its direct causes and noise terms, and a probability distribution over the exogenous variables. This structure allows for predictions under interventions and counterfactual scenarios, which are central to causal inference. Causal models originated in the early 20th century with path analysis, developed by Sewall Wright in 1921, and were later extended into structural equation modeling in econometrics and the social sciences to analyze direct and indirect effects among observed variables. Pearl's SCM framework, introduced in the late 1980s and detailed in his 2000 book Causality, advanced this by incorporating graphical representations like directed acyclic graphs (DAGs) to encode conditional independencies and causal pathways, providing a rigorous basis for do-calculus operations that compute interventional effects from observational data under certain assumptions.

Unlike purely probabilistic models, which capture associations at Layer 1 (seeing) of Pearl's "ladder of causation", SCMs support Layer 2 (doing, via interventions like P(Y|do(X))) and Layer 3 (imagining, counterfactuals about what would have happened had X been different). These models are foundational in fields such as statistics, epidemiology, economics, and machine learning, where they facilitate tasks like estimating treatment effects, policy evaluation, and model interpretability without requiring randomized experiments. For instance, in the social sciences, causal models help dissect complex phenomena, such as the impact of socioeconomic factors on health outcomes, by diagramming hypothesized relationships and testing them against data. Recent developments, including extensions to cyclic and latent-variable models as well as integrations with machine learning techniques, address real-world complexities like feedback loops, unobserved confounders, and dynamic systems, enhancing applicability across industries.

Fundamentals

Definition

A causal model is a formal representation that encodes assumptions about the mechanisms generating observed data, enabling inferences about how changes in one variable affect others through interventions rather than mere associations. In particular, a structural causal model (SCM) is defined as a triple \langle \mathbf{U}, \mathbf{V}, \mathbf{F} \rangle, where \mathbf{U} is a set of exogenous variables representing background factors, \mathbf{V} is a set of endogenous variables denoting quantities determined within the system, and \mathbf{F} is a set of structural functions such that each v_i = f_i(\mathbf{pa}_i, u_i), with \mathbf{pa}_i as the direct causes (parents) of v_i and u_i as the corresponding exogenous noise term. This framework unifies probabilistic, manipulative, and counterfactual approaches to causation, distinguishing it from purely associative models by incorporating modifiable mechanisms that remain stable under hypothetical alterations.

The primary purposes of causal models include answering "what if" questions about potential outcomes, predicting the effects of actions or policies (such as through the do-operator for interventions), and identifying underlying causal structures from observational data when combined with appropriate assumptions. For instance, these models facilitate reasoning at different levels of causation, from associations to interventions and counterfactuals, as formalized in frameworks like Pearl's ladder of causation. By encoding causal knowledge explicitly, they support decision-making in fields such as epidemiology, economics, and artificial intelligence, where distinguishing true causal effects from spurious correlations is essential. At its core, a causal model consists of variables connected by relationships that imply directionality, often visualized as directed acyclic graphs (DAGs) where nodes represent variables and edges denote causal influences, though graphical details are elaborated elsewhere. Key assumptions include the absence of unobserved confounders (ensuring all common causes are accounted for), acyclicity to prevent feedback loops, and independence of the exogenous variables, which together ensure the model's identifiability and predictive power under interventions.

A simple example is a structural equation model where an outcome Y depends on a treatment X and unobserved noise U, expressed as Y = f(X, U), with U capturing individual-specific factors; intervening on X (e.g., setting X = x) yields the post-intervention distribution P(Y \mid do(X = x)) by replacing the equation for X while holding f and U fixed. This setup allows estimation of causal effects like \beta in the linear case Y = \beta X + U, provided the assumptions hold.
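The linear example can be made concrete with a short simulation; the following Python sketch assumes an illustrative coefficient, noise distribution, and sample size, none of which are part of the formal definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 2.0            # assumed sample size and causal coefficient

# Exogenous background factors U
u_x = rng.normal(size=n)          # noise driving the treatment X
u_y = rng.normal(size=n)          # noise driving the outcome Y

# Structural equations: X := U_x,  Y := beta * X + U_y
x = u_x
y = beta * x + u_y

# Intervention do(X = 1): replace the equation for X, keeping f and U fixed
x_do = np.ones(n)
y_do = beta * x_do + u_y

print("E[Y | do(X=1)] ≈", y_do.mean())                      # ≈ beta
print("beta recovered by regression ≈", np.polyfit(x, y, 1)[0])
```

Because X here has no causes other than its own noise term, regression recovers \beta; when X shares causes with Y, the observational slope and the interventional effect diverge, which is the situation later sections address.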

History

The philosophical foundations of causal modeling trace back to ancient Greece, where Aristotle articulated a theory of causation comprising four distinct types of causes: the material cause (the substance from which something is made), the formal cause (its form or essence), the efficient cause (the agent that brings it about), and the final cause (its purpose or end goal). This framework provided an early systematic approach to understanding why events occur, influencing subsequent Western thought on causality. In the 18th century, David Hume critiqued traditional notions of causation, arguing that it arises not from any inherent necessary connection between events but from the psychological habit of associating ideas through repeated observations of constant conjunction—observing one event invariably followed by another without perceiving any underlying mechanism.

In the 20th century, causal modeling advanced through statistical innovations in quantitative fields. Geneticist Sewall Wright introduced path analysis in 1921 as a method to decompose correlations into direct and indirect causal effects using systems of linear equations and diagrams, initially applied to quantify hereditary and agricultural relationships. This technique laid groundwork for graphical representations of causality. Concurrently, in econometrics, Trygve Haavelmo's 1943 work revolutionized the field by integrating probability theory into causal models, emphasizing that economic relationships are inherently stochastic and that structural equations must account for probabilistic distributions to enable estimation and hypothesis testing.

Key modern developments further formalized causal inference. Philosopher Patrick Suppes proposed a probabilistic theory of causality in 1970, defining prima facie causes as events with positive probability of preceding effects and genuine causes as those not spurious due to common causes, providing a rigorous framework for stochastic dependencies. Computer scientist Judea Pearl advanced this in the 1980s and 1990s by developing structural causal models (SCMs), which represent causal relationships via directed acyclic graphs and functional equations, and the do-calculus, a set of rules for computing interventional effects from observational data without experimental intervention. Complementing these, the potential outcomes framework, developed by Jerzy Neyman in 1923 and Donald Rubin in the 1970s, provides a basis for defining and estimating causal effects. Statistician Bradley Efron developed resampling techniques like the bootstrap, which enhance methods for estimating causal effects in observational data and allow inference on counterfactual scenarios under unconfoundedness assumptions.

Post-2020 expansions have integrated causal modeling with machine learning, particularly for automated causal discovery from data. Jonas Peters, Dominik Janzing, and Bernhard Schölkopf's 2017 book Elements of Causal Inference provided foundational algorithms for learning causal structures using techniques like additive noise models and invariant prediction, with subsequent works extending to high-dimensional data. Additionally, emphasis on fairness has grown, building on Matt Kusner et al.'s 2017 introduction of counterfactual fairness—which requires predictions to remain unchanged under interventions on protected attributes—with recent works such as a 2024 analysis clarifying its distinction from demographic parity in algorithmic decision-making.
Milestones include Pearl's seminal Causality: Models, Reasoning, and Inference (2000; second edition 2009), which unified probabilistic, interventional, and counterfactual approaches to causation, and his 2018 book The Book of Why, co-authored with Dana Mackenzie, which popularized these ideas for broader scientific and AI applications.

Causality Concepts

Causality versus Correlation

In causal modeling, correlation describes a statistical association indicating that two variables tend to co-occur or change together, often quantified by measures like Pearson's correlation coefficient r, which ranges from -1 to +1 and assesses the strength and direction of linear relationships between continuous variables. However, this co-occurrence does not establish causation, as it fails to demonstrate that changes in one variable directly produce changes in the other; true causality demands evidence of underlying mechanisms, such as biological processes, or empirical validation through interventions that isolate the effect. Mistaking correlation for causation can lead to flawed decisions in fields like public health and economics, where assuming directionality without verification perpetuates errors.

Several common pitfalls exacerbate the confusion between correlation and causation. Spurious correlations arise when unrelated variables appear linked due to coincidence or external influences, as in the well-known example of ice cream sales and shark attacks, both of which rise during summer months because of warmer weather and increased beach activity rather than any direct causal connection between them. Confounding introduces bias when a third, unmeasured variable influences both observed variables, creating an illusory association; for instance, socioeconomic status might confound links between education level and health outcomes. Reverse causation occurs when the presumed effect actually drives the cause, such as assuming that low serotonin causes depression when depression might instead lower serotonin levels. These issues highlight why observational data alone cannot reliably infer causality without additional scrutiny. In time-series data, Granger causality offers a statistical approach to test whether one variable's past values improve predictions of another's future values, suggesting a potential directional influence. Yet this method does not confirm true causation, as it can detect predictive patterns driven by common causes, omitted variables, or non-causal dependencies rather than genuine mechanistic effects.

A classic real-world illustration is the observed correlation between smoking and lung cancer: early epidemiological studies showed a strong association, with smokers exhibiting up to 20 times higher risk than non-smokers, but causation was only established through convergent evidence from prospective cohort studies tracking disease incidence, animal experiments demonstrating tumor induction by carcinogens, and the ethical infeasibility of randomized controlled trials, which would require assigning participants to smoke. Philosophically, the distinction is underscored by Hans Reichenbach's common cause principle, which asserts that if two events are statistically dependent and not directly causally connected, they must share a common prior cause that renders them conditionally independent when accounted for. This principle, formulated in the mid-20th century, provides a foundational rationale for seeking hidden confounders in correlated phenomena and remains influential in causal inference frameworks.
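A small simulation with made-up numbers illustrates how a shared driver can generate a strong but non-causal correlation, mirroring the ice cream/shark attack example; "temperature" plays the role of the hidden common cause.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical common cause: daily temperature drives both quantities
temperature = rng.normal(25, 5, size=n)
ice_cream_sales = 10 * temperature + rng.normal(0, 20, size=n)
shark_attacks = 0.3 * temperature + rng.normal(0, 2, size=n)

# Strong Pearson correlation despite no causal link between the two series
r = np.corrcoef(ice_cream_sales, shark_attacks)[0, 1]
print(f"corr(ice cream, sharks) = {r:.2f}")

# Adjusting for the common cause removes the association:
# correlate the residuals after regressing each series on temperature
res_ice = ice_cream_sales - np.polyval(np.polyfit(temperature, ice_cream_sales, 1), temperature)
res_shark = shark_attacks - np.polyval(np.polyfit(temperature, shark_attacks, 1), temperature)
print(f"corr given temperature ≈ {np.corrcoef(res_ice, res_shark)[0, 1]:.2f}")   # ≈ 0
```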

Types of Causal Relationships

In causal models, relationships between causes and effects can be categorized based on their logical structure, providing a framework for understanding how factors contribute to outcomes. A necessary cause is defined as a factor that must be present for the effect to occur; without it, the effect cannot happen. Formally, if A is a necessary cause of B, then the absence of A implies the absence of B (¬A → ¬B). For example, oxygen serves as a necessary cause for combustion, as fire cannot occur in its absence. A sufficient cause, in contrast, is a factor or set of factors that, when present, inevitably produces the effect. Formally, if A is a sufficient cause of B, then the presence of A implies the occurrence of B (A → B). An example is a lit match applied to a mixture of flammable material and oxygen, which guarantees ignition under those conditions. In practice, sufficient causes often involve minimal sets of conditions that together ensure the outcome, distinguishing them from necessary causes, which alone do not guarantee the effect.

Many real-world causal relationships involve contributory causes, which are neither strictly necessary nor sufficient on their own but play essential roles within broader mechanisms. These are captured by the concept of INUS conditions: an insufficient but non-redundant part of an unnecessary but sufficient condition. For instance, a short circuit might be an INUS condition for a building fire if it is insufficient alone (requiring additional factors like flammable materials) but non-redundant within a sufficient complex (such as wiring faults plus ignition sources), and the overall complex is unnecessary because alternative paths to fire exist. This framework highlights how individual factors contribute without being indispensable or exhaustive. The notion of component causes extends this by modeling sufficient causes as composites of multiple elements, as in Rothman's sufficient-component cause model, often visualized as "causal pies." Each pie represents a complete sufficient cause, composed of component causes that together complete the mechanism leading to the effect; a single pie's completion triggers the outcome, while multiple pies illustrate alternative pathways. A component cause appearing in every pie is necessary, whereas others are contributory. For example, in disease etiology, genetic susceptibility might be a component in several pies for a given disease, combining with environmental exposures to form distinct sufficient causes. This model emphasizes multifactorial causation, where interactions among components determine the effect.

These classifications primarily reflect deterministic views of causation, where causes reliably produce effects under specified conditions. In contrast, probabilistic causation posits that causes raise the probability of effects without guaranteeing them, accommodating stochastic processes in fields like epidemiology and quantum physics. For instance, smoking increases the probability of lung cancer but does not deterministically cause it in every case, differing from the absolute implications in necessary or sufficient frameworks. This distinction underscores the need to specify whether a causal model assumes deterministic mechanisms or probabilistic influences.
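Mackie's INUS analysis can be sketched with a toy boolean model; the causal structure below (short circuit, flammable material, arson) is assumed purely for illustration.

```python
# Toy deterministic model: a fire occurs if a short circuit meets flammable
# material, or if there is arson (an alternative sufficient cause).
def fire(short_circuit: bool, flammable_material: bool, arson: bool) -> bool:
    return (short_circuit and flammable_material) or arson

# Insufficient: the short circuit alone does not produce a fire
assert fire(True, False, False) is False
# Non-redundant part of a sufficient complex: with flammable material present, it does
assert fire(True, True, False) is True
# The complex is unnecessary: arson is an alternative path to the same effect
assert fire(False, False, True) is True
```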

Levels of Causal Analysis

Association

In the ladder of causation proposed by Judea Pearl, the association level represents the foundational rung, focusing on the analysis of observational data to identify patterns and predict outcomes without invoking causal mechanisms. This level addresses queries of the form "what is?" by examining joint and conditional distributions, such as P(Y \mid X), which quantifies the likelihood of an outcome Y given an observed condition X. At this stage, inferences are derived solely from passive observations, enabling statistical summaries of data regularities but stopping short of causal explanations.

Methods at the association level include computing correlations to measure linear relationships between variables, fitting regression models to estimate predictive dependencies, and testing for conditional independencies to map the dependence structure of the data. For instance, the conditional probability P(Y \mid X) is calculated using the basic definition P(Y \mid X) = \frac{P(X, Y)}{P(X)}, where P(X, Y) is the joint probability derived from empirical frequencies in a dataset. Bayes' rule further supports probabilistic updates at this level, allowing revision of beliefs about Y based on new evidence X: P(Y \mid X) = \frac{P(X \mid Y) P(Y)}{P(X)}. These techniques rely on historical or survey data to summarize associations, such as in epidemiological studies tracking disease prevalence alongside risk factors. A representative example is estimating P(\text{rain} \mid \text{clouds}) from meteorological records, where cloudy skies are observed to correlate with higher rain probabilities due to shared atmospheric patterns in the data. This association informs short-term forecasts but does not establish that clouds cause rain, as it merely reflects co-occurrence in observations.

Despite its utility for prediction, the association level has inherent limitations, as it cannot disentangle confounding variables that spuriously link X and Y, nor can it predict effects from deliberate interventions on X. For example, an observed association between ice cream sales and drowning incidents might stem from a confounder like summer heat, rather than any direct link, highlighting how this level fails to isolate true causal pathways. Transitioning to higher levels, such as intervention, requires explicit causal modeling to overcome these observational constraints.
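The associational quantities above can be computed directly from a joint table; the probabilities below are assumed purely for illustration.

```python
import numpy as np

# Assumed joint distribution P(clouds, rain) from hypothetical weather records
#                  rain=0  rain=1
joint = np.array([[0.55,   0.05],    # clouds = 0
                  [0.15,   0.25]])   # clouds = 1

p_clouds = joint.sum(axis=1)                       # marginal P(clouds)
p_rain_given_clouds = joint[1, 1] / p_clouds[1]    # P(rain=1 | clouds=1) = P(clouds, rain) / P(clouds)
print(f"P(rain | clouds) = {p_rain_given_clouds:.3f}")     # 0.625

# Bayes' rule: P(clouds | rain) = P(rain | clouds) P(clouds) / P(rain)
p_rain = joint.sum(axis=0)[1]
p_clouds_given_rain = p_rain_given_clouds * p_clouds[1] / p_rain
print(f"P(clouds | rain) = {p_clouds_given_rain:.3f}")     # ≈ 0.833
```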

Intervention

In the causal hierarchy developed by Judea Pearl, known as the ladder of causation, the intervention level represents the second rung, addressing questions about the consequences of hypothetical actions, such as "What if we perform action X?" This level shifts from mere observational associations to understanding effects under manipulation, enabling predictions about how systems respond to external changes. Interventions are mathematically formalized using the do-operator, denoted as \operatorname{do}(X = x), which specifies an exogenous setting of variable X to value x. In causal graphical models, this operation severs all incoming arrows to X, isolating it from its usual causes and preventing feedback or confounding influences during the manipulation. This truncation reflects the essence of an ideal intervention, where the action directly alters X without being affected by other variables in the system.

The gold standard for estimating interventional effects in practice is the randomized controlled trial (RCT), which approximates the do-operator by randomly assigning treatments to units, thereby ensuring that the intervention is independent of any unobserved factors. RCTs minimize confounding bias and allow for unbiased estimation of causal effects at the population level, as the randomization process mimics the severance of incoming influences to the treatment variable. For instance, the impact of a policy change like mandating a treatment ( \operatorname{do}(\text{treatment}=1) ) on an outcome such as recovery rates can be assessed by comparing post-intervention outcomes in randomly assigned treated and control groups. In scenarios where RCTs are impractical due to ethical, logistical, or cost constraints, quasi-experimental designs provide approximations to true experiments. Methods like difference-in-differences exploit temporal and group variations—such as pre- and post-policy changes across affected and unaffected units—to estimate causal effects, assuming parallel trends in the absence of treatment. These approaches, while not as robust as RCTs, can credibly identify interventional distributions when randomization is unavailable.
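The contrast between observation and intervention can be illustrated with a simulation under an assumed data-generating process in which unobserved health status confounds treatment choice; the true interventional effect is 1.0 by construction.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Unobserved health status drives both treatment uptake and recovery
health = rng.normal(size=n)

# Observational regime: healthier patients are more likely to take the treatment
treated_obs = (health + rng.normal(size=n) > 0).astype(float)
recovery_obs = 1.0 * treated_obs + 2.0 * health + rng.normal(size=n)
naive = recovery_obs[treated_obs == 1].mean() - recovery_obs[treated_obs == 0].mean()

# RCT regime: randomization emulates do(treatment), severing health -> treatment
treated_rct = rng.integers(0, 2, size=n).astype(float)
recovery_rct = 1.0 * treated_rct + 2.0 * health + rng.normal(size=n)
rct = recovery_rct[treated_rct == 1].mean() - recovery_rct[treated_rct == 0].mean()

print(f"observational difference ≈ {naive:.2f}  (biased upward by confounding)")
print(f"randomized difference    ≈ {rct:.2f}  (close to the true effect of 1.0)")
```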

Counterfactuals

In Judea Pearl's ladder of causation, counterfactuals represent the highest level of causal reasoning, enabling queries about subjunctive conditionals such as "Was it X that caused Y?" by contemplating unobserved alternative realities consistent with the observed data. This level transcends mere associations and interventions, allowing retrospective analysis of what would have happened under different circumstances, often framed as "what if" scenarios that attribute causation to specific events. A primary challenge in counterfactual reasoning lies in dealing with unobserved worlds, which necessitates assumptions about the underlying causal model to ensure that hypothetical alterations remain consistent with the factual evidence. For instance, consider a patient who received no treatment and subsequently died; a counterfactual query might ask what the outcome would have been had the treatment been administered, invoking a "twin world" in which the patient's background factors remain identical but the treatment variable is altered to explore the hypothetical path. Counterfactuals play a crucial role in policy evaluation, particularly through natural experiments, where they facilitate inferences about untestable claims by constructing plausible alternatives to observed outcomes in contexts like environmental or public health interventions. In structural causal models, counterfactuals are interpreted as outcomes derived from interventions applied to "mutilated" graphs—modified versions of the original model where certain equations are replaced to reflect the hypothetical change, while preserving the exogenous noise terms from the actual world. This approach, closely related to the potential outcomes framework, provides a mathematical basis for such reasoning without delving into probabilistic distributions of interventions.

Representing Causal Models

Causal Diagrams

Causal diagrams, commonly represented as directed acyclic graphs (DAGs), provide a visual framework for encoding causal assumptions in causal inference. In a DAG, each node corresponds to a variable—such as observed factors, treatments, or outcomes—while directed arrows signify direct causal influences between them, indicating the direction of causation from cause to effect. These graphs formalize qualitative knowledge about causal structures, enabling researchers to distinguish causal paths from spurious associations. Standard conventions in causal diagrams include the use of directed edges to denote causation, ensuring the graph remains acyclic to avoid implying impossible self-reinforcing loops in static models. Observed variables are typically depicted as filled nodes, while unobserved variables, such as latent confounders, are included as empty or labeled nodes to highlight their role in the structure. This labeling helps in assessing identifiability and potential biases without relying on probabilistic details.

Interpreting causal diagrams involves tracing paths to understand effect transmission: directed paths from a treatment to an outcome represent causal influences, while undirected or back-door paths may indicate confounding that must be blocked for valid inference. For instance, in a simple DAG modeling the relationship between smoking, tar deposits, and lung cancer, arrows connect smoking to tar and tar to cancer, illustrating a mediated causal pathway; adding age as a confounder with arrows to both smoking and cancer reveals a back-door path that could bias naive associations. Blocking such paths, often by conditioning on age, allows identification of the effect of smoking on cancer. Tools like DAGitty facilitate the creation and analysis of these diagrams through a web-based interface, supporting tasks such as path identification and adjustment set computation. Similarly, the R package ggdag offers capabilities for constructing and visualizing DAGs in statistical workflows. While traditional causal diagrams assume acyclicity for clear temporal ordering, post-2020 literature has extended these to cyclic graphs to accommodate feedback loops in dynamic systems, such as economic models where variables mutually reinforce each other over time.
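The smoking–tar–cancer example can be encoded and inspected programmatically; the sketch below uses the networkx library (an assumed tooling choice, distinct from the DAGitty and ggdag tools mentioned above) to list the directed causal path and the backdoor path through age.

```python
import networkx as nx

# DAG from the example: smoking -> tar -> cancer, with age as a confounder
g = nx.DiGraph([("smoking", "tar"), ("tar", "cancer"),
                ("age", "smoking"), ("age", "cancer")])
assert nx.is_directed_acyclic_graph(g)

# Directed (causal) paths from treatment to outcome
print(list(nx.all_simple_paths(g, "smoking", "cancer")))
# [['smoking', 'tar', 'cancer']]

# Backdoor paths: paths in the skeleton whose first edge points INTO the treatment
skeleton = g.to_undirected()
backdoor = [p for p in nx.all_simple_paths(skeleton, "smoking", "cancer")
            if g.has_edge(p[1], "smoking")]
print(backdoor)
# [['smoking', 'age', 'cancer']]  -> blocked by conditioning on age
```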

Model Elements

Causal models, particularly those represented as structural causal models (SCMs), consist of variables partitioned into endogenous and exogenous types. Endogenous variables are those whose values are determined by other variables within the model, representing outcomes influenced by causal mechanisms. Exogenous variables, in contrast, are external factors not explained by the model, serving as sources of variation or noise that drive the system. Within these, specific roles emerge: mediators are endogenous variables that lie on causal paths between a treatment and an outcome, transmitting effects serially (e.g., a drug influencing recovery through an intermediate physiological change). Confounders are variables that cause both a treatment and an outcome, creating spurious associations if unadjusted.

Junction patterns in causal diagrams form the basic structures for understanding dependencies. A chain pattern (A → B → C) represents serial mediation, where A affects C indirectly through B; conditioning on B blocks the path, inducing independence between A and C. A fork pattern (A → B, A → C) indicates a common cause A influencing both B and C, leading to conditional independence between B and C given A. A collider pattern (A → C ← B) occurs when two variables A and B both cause a third C; here, A and B are independent unconditionally, but conditioning on C opens a non-causal path, inducing spurious association (collider bias).

Instrumental variables (IVs) are special exogenous or endogenous variables that affect the treatment but influence the outcome solely through the treatment, satisfying exclusion and relevance assumptions. For example, random assignment via a lottery serves as an IV for estimating treatment effects, as it affects participation without direct impact on outcomes. In epidemiology, Mendelian randomization leverages genetic variants as IVs, exploiting their random assortment at conception to infer causal effects of modifiable exposures on health outcomes, assuming the variants are independent of confounders. Backdoor paths in causal models are non-directed paths from treatment to outcome that initiate with an arrow into the treatment, potentially carrying confounding influences. Identification via the backdoor criterion requires conditioning on a set of variables that blocks all such paths without opening collider paths or including descendants of the treatment. A classic example of collider bias is Berkson's paradox, where hospitalization (the collider) induces a spurious negative association between unrelated diseases such as diabetes and gallstones among hospitalized patients, as each condition independently increases the chance of admission.
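Collider bias in the Berkson example can be reproduced with a simulation under assumed disease prevalences and admission probabilities; the two diseases are independent by construction, yet become negatively associated among admitted patients.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

# Two independent diseases (assumed prevalences)
diabetes = rng.random(n) < 0.10
gallstones = rng.random(n) < 0.10

# Collider: either disease independently raises the chance of hospital admission
p_admit = 0.05 + 0.40 * diabetes + 0.40 * gallstones
admitted = rng.random(n) < p_admit

def corr(a, b):
    return np.corrcoef(a.astype(float), b.astype(float))[0, 1]

print(f"corr in full population:      {corr(diabetes, gallstones):+.3f}")                       # ≈ 0
print(f"corr among admitted patients: {corr(diabetes[admitted], gallstones[admitted]):+.3f}")   # negative
```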

Handling Associations

Independence Conditions

In causal models represented as directed acyclic graphs (DAGs), the causal Markov condition posits that every variable is probabilistically independent of its non-descendants given its parents in the graph. This condition formalizes the idea that the causal structure encodes local dependencies, allowing the joint distribution over all variables to be factored as the product of each variable's conditional distribution given its parents: P(V) = \prod_{i} P(V_i \mid \mathrm{Pa}(V_i)), where V denotes the set of all variables and \mathrm{Pa}(V_i) are the parents of V_i. The d-separation criterion provides an algorithmic method to determine the conditional independencies implied by the DAG structure. A path between two variables X and Y is said to be d-separated (blocked) by a set of variables Z if at least one of the following conditions holds along the path:
  • The path contains a chain A \to B \to C or a fork A \leftarrow B \to C, and the middle node B is in Z.
  • The path contains a collider A \to B \leftarrow C, and neither B nor any of its descendants is in Z.
If all paths between X and Y are blocked by Z, then X is conditionally independent of Y given Z, denoted X \perp Y \mid Z. This criterion enables efficient computation of independencies without enumerating all possible conditioning sets. For example, consider a DAG with a fork structure where A \to B, A \to C, and B \to D. Here, B and C are conditionally independent given A (B \perp C \mid A), as conditioning on the common cause A blocks the only connecting path. Without conditioning on A, B and C may appear dependent due to the shared influence from A. The faithfulness assumption complements d-separation by asserting that the DAG accurately reflects all conditional independencies present in the data-generating process, without additional independencies arising from parameter cancellations or other non-structural reasons. Under faithfulness, every independence readable via d-separation corresponds to an actual probabilistic independence, and vice versa, ensuring the graph is a faithful representation of the causal dependencies. These independence conditions have practical applications in causal structure learning and model checking, such as testing the fit of a proposed causal model to observational data through conditional independence tests. For instance, empirical validation involves checking whether the observed data satisfy the independencies predicted by d-separation, often using statistical tests such as partial-correlation or chi-squared tests. This aids in model refinement and validation in fields like epidemiology and the social sciences.
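The fork example (B ⟂ C | A) can be checked numerically; the sketch below assumes a simple linear-Gaussian parameterization of the DAG A → B, A → C, B → D with a binary A.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Assumed parameterization of the fork A -> B, A -> C (plus B -> D)
a = rng.integers(0, 2, size=n)          # binary common cause
b = a + rng.normal(size=n)
c = 2 * a + rng.normal(size=n)
d = b + rng.normal(size=n)

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

print(f"corr(B, C) unconditionally: {corr(b, c):+.2f}")            # dependent through A
for value in (0, 1):                                               # condition on A by stratifying
    mask = a == value
    print(f"corr(B, C | A={value}):       {corr(b[mask], c[mask]):+.2f}")   # ≈ 0, so B ⟂ C | A
```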

Confounders and Adjustments

In causal inference, a confounder is defined as a variable that is associated with both the treatment (exposure) and the outcome, thereby inducing a spurious association between them and biasing causal effect estimates if not properly adjusted for. This bias arises because confounders create non-causal paths, known as backdoor paths, from treatment to outcome in causal diagrams. One approach to addressing unobserved confounders involves the deconfounder, a method that approximates the effect of a latent confounder using multiple observed variables, such as negative control outcomes, through probabilistic factor models. This technique is particularly useful in multiple-cause settings where traditional adjustment is infeasible, though it relies on strong assumptions like the proxies sufficiently capturing the latent structure and has been critiqued for practical limitations in estimation consistency.

The backdoor adjustment provides a standard method for identifying causal effects from observational data by conditioning on a set of variables Z that blocks all backdoor paths between treatment X and outcome Y, satisfying the backdoor criterion: no node in Z is a descendant of X, and Z blocks every path from X to Y with an arrow into X. Under this criterion, the interventional distribution is given by the adjustment formula:

P(Y \mid do(X = x)) = \sum_{z} P(Y \mid X = x, Z = z) P(Z = z)

This can be estimated using stratification, regression, or matching on Z. For example, in a drug trial evaluating a new medication's effect on recovery rates, age may confound the relationship if older patients are less likely to receive the drug but also have poorer recovery prospects; adjusting for age via the backdoor criterion closes this path and yields unbiased estimates.

When backdoor paths cannot be fully blocked due to unmeasured confounders, the frontdoor adjustment offers an alternative if there exists a mediator set M (intermediate variables) such that X affects M, M fully mediates the effect of X on Y, and all backdoor paths from M to Y are blocked by conditioning on X. The frontdoor formula is:

P(Y \mid do(X = x)) = \sum_{m} P(M = m \mid X = x) \sum_{x'} P(Y \mid X = x', M = m) P(X = x')

where the first factor uses P(M \mid do(X)) = P(M \mid X), which holds when there is no confounding of the X \to M relationship, and the inner sum adjusts for X to block the backdoor path from M to Y. A classic illustration is the effect of smoking (X) on lung cancer (Y), confounded by an unobserved genotype (U); tar deposits (M) serve as a frontdoor mediator, as smoking determines tar levels without confounding, and tar affects cancer with no unblocked backdoor path once smoking is held fixed, allowing identification despite the unmeasured U.

Even with adjustment strategies, unmeasured confounding remains a concern, as no set Z may fully capture all biases. Sensitivity analyses quantify the robustness of estimates to potential unmeasured confounders; for instance, the E-value measures the minimum strength of association that an unmeasured confounder would need with both treatment and outcome to fully explain away an observed effect, providing a threshold for credibility. Introduced in 2017, the E-value has been updated to handle bounds for both point estimates and confidence intervals, aiding interpretation in diverse observational settings.
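The backdoor adjustment formula can be applied by simple stratification; the sketch below uses a synthetic discrete dataset in which age confounds drug use and recovery, with a true interventional effect of +0.2 built into the simulation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 400_000

# Assumed data-generating process: age (Z) confounds drug (X) and recovery (Y)
old = rng.random(n) < 0.5
drug = rng.random(n) < np.where(old, 0.3, 0.7)               # older patients get the drug less often
recovery = rng.random(n) < (0.3 + 0.2 * drug - 0.2 * old)    # true causal effect of the drug: +0.2
df = pd.DataFrame({"Z": old, "X": drug, "Y": recovery})

def p_y_do_x(data, x):
    """Backdoor adjustment: P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) P(Z=z)."""
    total = 0.0
    for _, stratum in data.groupby("Z"):
        p_z = len(stratum) / len(data)
        p_y = stratum.loc[stratum["X"] == x, "Y"].mean()
        total += p_y * p_z
    return total

naive = df.loc[df["X"], "Y"].mean() - df.loc[~df["X"], "Y"].mean()
adjusted = p_y_do_x(df, True) - p_y_do_x(df, False)
print(f"naive difference:  {naive:+.3f}   (biased by age)")
print(f"backdoor adjusted: {adjusted:+.3f}   (≈ +0.200)")
```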

Interventional Analysis

Interventional Queries

Interventional queries in causal models seek to answer questions about the effects of hypothetical or actual interventions on a system, focusing on what would happen if specific variables were forcibly set to certain values. These queries are formalized using the do-operator, introduced by Judea Pearl, which denotes an intervention that severs the usual dependencies of a variable and sets it exogenously. The core object of interest is the interventional distribution P(Y | do(X = x)), which represents the distribution of outcome Y after intervening to set treatment X to value x. This distribution captures the post-intervention behavior of the system, distinct from observational probabilities P(Y | X = x), as it accounts for the causal mechanisms rather than mere associations.

Common interventional queries include measures of causal effects, such as the average treatment effect (ATE), defined as \mathbb{E}[Y | do(X=1)] - \mathbb{E}[Y | do(X=0)], which quantifies the expected change in Y when X is intervened from a control (0) to a treated (1) state across the population. Another key query is the causal risk ratio, given by P(Y=1 | do(X=1)) / P(Y=1 | do(X=0)), which assesses the relative probability of a binary outcome under intervention, often used in epidemiology to evaluate preventive measures. These queries address practical problems like policy evaluation; for instance, estimating the effect of mandating college education on income might involve computing \mathbb{E}[\text{Income} | do(\text{Education}=\text{college})] using observational data on education, confounders like ability, and outcomes, assuming identifiability conditions hold.

A central challenge in interventional queries is the identification problem: determining whether P(Y | do(X = x)) can be expressed solely in terms of observable data distributions, without requiring new experiments. Identification is possible under assumptions like the back-door criterion, which ensures confounders are adequately controlled, allowing reduction to observational queries via adjustment formulas. Randomized controlled trials (RCTs) provide an ideal setting for direct estimation of interventional distributions, as randomization mimics the do-operator by eliminating confounding, yielding unbiased estimates of effects like the ATE. Interventional effects are often non-transportable across populations, meaning an effect identified in one population may not apply directly to another due to differences in underlying distributions or selection mechanisms. For example, a treatment effect estimated in a trial on one demographic might not generalize to a broader population without additional adjustments for heterogeneity. This limitation underscores the need for careful assessment of transportability when applying interventional queries beyond the original context.

Do-Calculus

The do-calculus, introduced by Judea Pearl, provides a formal set of rules for computing interventional distributions from observational data in causal models represented by directed acyclic graphs (DAGs). It operationalizes the do-operator, denoted as P(Y | do(X)), which replaces the observational probability P(Y | X) with the interventional distribution obtained by setting X to a specific value through external manipulation, effectively severing incoming edges to X in the DAG (a process known as graph mutilation). This calculus enables the identification of causal effects P(Y | do(X)) without requiring parametric assumptions, provided certain graphical independence conditions hold, thus bridging observational statistics and interventional queries. The do-calculus consists of three inference rules that manipulate expressions involving do-operators and conditional probabilities based on d-separation criteria in modified graphs. Rule 1 (Insertion/deletion of observations): If Y \perp Z \mid X, W in the graph G_{\overline{X}} (obtained by deleting all arrows pointing to nodes in X), then
P(y \mid do(x), z, w) = P(y \mid do(x), w).
This rule allows omitting an observed variable Z from the conditioning set if it is d-separated from the outcome Y given the intervention on X and the other conditions W, as assessed in the mutilated graph.
Rule 2 (Action/observation exchange): If Y \perp Z \mid X, W in the graph G_{\overline{X} \underline{Z}} (obtained by deleting arrows into X and out of Z), then
P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w).
This rule permits replacing an intervention on Z (do(z)) with mere observation of Z (conditioning on z) when Z has no unblocked paths to Y after accounting for the intervention on X and conditions W.
Rule 3 (Insertion/deletion of actions): If Y \perp Z \mid X, W in the graph G_{\overline{X}, \overline{Z(W)}} (where Z(W) denotes the set of Z-nodes that are not ancestors of any W-node in G_{\overline{X}}), then
P(y \mid do(x), do(z), w) = P(y \mid do(x), w).
This rule justifies ignoring an intervention on Z if Z does not affect Y through paths that bypass the conditions W, after mutilating for X and isolating Z's effects. These rules are complete, meaning any identifiable causal effect can be derived by their repeated application, without needing additional graphical criteria.
Extensions of do-calculus have been developed to handle counterfactual reasoning and transportability of causal effects across populations or environments. For instance, it supports deriving counterfactual distributions P(Y_{do(x)} \mid evidence) by combining interventional and observational components, and enables transportability maps that transfer effects from a source study to a target population when selection diagrams indicate graphical compatibility. As an example, the backdoor criterion for effect identification—adjusting for a set Z that blocks all backdoor paths from X to Y—can be derived using do-calculus rules. Starting from P(y \mid do(x)) = \sum_z P(y \mid do(x), z) P(z \mid do(x)), Rule 2 exchanges do(x) for conditioning on x in the first factor (because Z blocks all backdoor paths), and Rule 3 deletes do(x) from the second factor (because intervening on X does not affect Z), yielding the adjustment formula \sum_z P(y \mid x, z) P(z). Software implementations facilitate practical application of do-calculus; for example, the open-source library DoWhy, developed at Microsoft Research, automates model specification, effect identification via do-calculus and graphical criteria, estimation, and refutation testing, with ongoing updates through 2025.
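The workflow of specifying a model, identifying an effect, estimating it, and refuting it can be run end to end in DoWhy; the sketch below follows the library's documented CausalModel interface on synthetic data (method names may differ slightly across versions, and the true effect of 2.0 is an assumption of the simulation).

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel   # requires the dowhy package

rng = np.random.default_rng(6)
n = 10_000
z = rng.normal(size=n)                          # observed confounder
x = (z + rng.normal(size=n) > 0).astype(int)    # treatment influenced by z
y = 2.0 * x + 3.0 * z + rng.normal(size=n)      # outcome; true effect of x is 2.0
df = pd.DataFrame({"X": x, "Y": y, "Z": z})

model = CausalModel(data=df, treatment="X", outcome="Y", common_causes=["Z"])
identified = model.identify_effect()                        # graphical identification (backdoor here)
estimate = model.estimate_effect(identified,
                                 method_name="backdoor.linear_regression")
print("estimated effect ≈", estimate.value)                 # ≈ 2.0
refutation = model.refute_estimate(identified, estimate,
                                   method_name="random_common_cause")
print(refutation)                                           # estimate should be stable
```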

Counterfactual Analysis

Potential Outcomes

The potential outcomes framework formalizes causal effects by considering counterfactual outcomes that would occur under different treatment assignments for each unit in a population. For a binary treatment T \in \{0, 1\}, each unit i has two potential outcomes: Y_i(1), the value of the outcome Y_i if unit i receives treatment (T_i = 1), and Y_i(0), the value if it receives control (T_i = 0). This notation originates from Neyman's early work on randomized experiments and was generalized by Rubin to broader settings, including observational studies. The individual causal effect for unit i is defined as the difference Y_i(1) - Y_i(0), which compares the unit's outcome under treatment to what it would have been under control. However, this effect is inherently unobservable for any single unit, as only one potential outcome can be realized depending on the actual treatment received—this is known as the fundamental problem of causal inference. As a result, causal effects must typically be estimated at the population level rather than for individuals.

To ensure the potential outcomes are well-defined and identifiable from observed data, key assumptions are required, including the stable unit treatment value assumption (SUTVA). SUTVA consists of two parts: (1) no interference between units, meaning the potential outcome for one unit does not depend on the treatments assigned to others, and (2) no hidden versions of treatment, meaning there are no hidden variations in treatment implementation that could affect outcomes. Under randomization in a randomized controlled trial (RCT), these assumptions, combined with the independence of treatment assignment from the potential outcomes, imply that the distributions of potential outcomes are equated across treatment and control groups, allowing identification of the average treatment effect (ATE). In an RCT, the ATE is defined as \mathbb{E}[Y(1) - Y(0)] and is identified as the difference in observed means: \mathbb{E}[Y \mid T=1] - \mathbb{E}[Y \mid T=0]. This identification holds because randomization ensures that \mathbb{E}[Y(1) \mid T=1] = \mathbb{E}[Y(1) \mid T=0] = \mathbb{E}[Y(1)] and similarly for Y(0), eliminating selection bias.

The Neyman-Rubin model, building on Neyman's superpopulation framework, further enables estimation of the ATE's sampling variance and supports testing of sharp null hypotheses, such as the null that the individual effect is zero for all units (Y_i(1) = Y_i(0) for all i), via randomization-based inference. Under this sharp null, all potential outcomes are known (equaling the observed outcomes), allowing exact permutation tests of the null. The potential outcomes framework relates to structural causal models by representing outcomes as deterministic functions of treatments and exogenous variables, without explicit equations; thus, it serves as a special case of the more general structural approach, which incorporates modifiable mechanisms via functional equations. This connection allows potential outcomes to be derived as evaluations of structural functions under specific interventions.
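The sharp-null logic lends itself to an exact randomization (permutation) test; the outcome and assignment values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical RCT data: observed outcomes and binary treatment assignment
y = np.array([5.1, 4.3, 6.2, 5.8, 4.9, 6.5, 5.0, 4.4, 6.8, 5.6])
t = np.array([1,   0,   1,   1,   0,   1,   0,   0,   1,   0])

observed_diff = y[t == 1].mean() - y[t == 0].mean()

# Under the sharp null Y_i(1) = Y_i(0), every potential outcome equals the observed y,
# so the statistic's distribution under re-randomization is fully known.
reps = 100_000
null_stats = np.empty(reps)
for i in range(reps):
    t_perm = rng.permutation(t)
    null_stats[i] = y[t_perm == 1].mean() - y[t_perm == 0].mean()

p_value = np.mean(np.abs(null_stats) >= abs(observed_diff))
print(f"difference in means = {observed_diff:.2f}, randomization p-value ≈ {p_value:.3f}")
```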

Counterfactual Inference

Counterfactual inference involves computing what would have happened under hypothetical scenarios that differ from observed reality, building on the potential outcomes framework defined earlier. In structural causal models, Judea Pearl outlines a three-step process for such inference: abduction, action, and prediction. Abduction infers the values of the exogenous variables (U) from the observed evidence (e), updating the prior distribution P(u) to the posterior P(u|e). Action then modifies the model by intervening on the variables of interest in a "twin world" counterfactual scenario, equivalent to applying the do-operator to set those variables to alternative values. Finally, prediction simulates the outcomes forward from the modified model using the abducted exogenous variables to derive the counterfactual distribution.

A classic example illustrates this process in a smoking-lung cancer model. Consider an individual observed to have smoked (S=1) and developed cancer (C=1), with tar deposits (T=1) as an intermediate. Abduction infers the exogenous factors U (e.g., a genotype predisposing to smoking and cancer) from the evidence {S=1, T=1, C=1}, yielding P(u | S=1, T=1, C=1). Action intervenes by setting do(S=0) in the twin world, replacing the equation for smoking while preserving U. Prediction then computes the counterfactual probability of cancer under this modified model, P(C_{S=0} \mid S=1, T=1, C=1), estimating the probability of cancer had this individual not smoked, which can reveal whether smoking caused their cancer.

To estimate counterfactual quantities like average treatment effects from observational data, several methods adjust for confounding. Matching pairs treated and control units based on observed covariates or propensity scores to approximate randomized assignment, enabling unbiased counterfactual mean estimation under unconfoundedness. Inverse probability weighting (IPW) reweights observations by the inverse of the treatment probability (propensity score) to create a pseudo-population where treatment assignment is independent of confounders, yielding consistent estimates of counterfactual means. G-estimation solves estimating equations that directly target counterfactual parameters while modeling the treatment-confounder relationship, providing robustness to outcome model misspecification. These approaches rely on key assumptions: consistency, where the observed outcome equals the potential outcome under the received treatment, ensuring factuals align with counterfactuals; and positivity, requiring a non-zero probability of each treatment level across all covariate values to avoid extrapolation beyond the observed data.

Recent advances integrate machine learning for flexible counterfactual prediction in high-dimensional settings. Double machine learning (Double ML) combines ML-based nuisance parameter estimation (e.g., propensity scores and outcome regressions) with orthogonalized score functions to deliver root-n consistent inference on counterfactual parameters, even with complex confounders; subsequent works extend this to heterogeneous effects and instrumental variables. As of 2025, further advancements integrate structural causal models with generative AI and large language models for counterfactual forecasting and real-time decision support in dynamic systems. Beyond statistics, counterfactual inference informs legal and ethical domains, such as attributing causal responsibility in liability cases by assessing whether harm would have occurred absent the defendant's action, resolving "but-for" causation issues in blame assignment.
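The abduction–action–prediction steps are especially transparent in an invertible linear SCM; the coefficients and the single observed unit below are assumed for illustration.

```python
# Assumed linear SCM:  X := U_x,   Y := beta * X + U_y
beta = 2.0

# Factual observation for one unit
x_obs, y_obs = 1.0, 2.5

# Step 1 (abduction): recover the exogenous terms consistent with the evidence
u_x = x_obs
u_y = y_obs - beta * x_obs          # = 0.5

# Step 2 (action): intervene do(X = 0) in the "twin world", keeping U fixed
x_cf = 0.0

# Step 3 (prediction): propagate the modified model forward with the abducted noise
y_cf = beta * x_cf + u_y
print(f"Counterfactual Y had X been 0: {y_cf:.2f}")   # 0.5, versus the observed 2.5
```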

Mediation Analysis

Mediation analysis in causal models seeks to decompose the total causal effect of a treatment X on an outcome Y into direct and indirect components through an intermediate variable, the mediator M. The total effect represents the overall change in Y when X varies from one level (e.g., x') to another (e.g., x), while the direct effect captures the influence of X on Y not passing through M, and the indirect effect quantifies the path X \to M \to Y. This decomposition provides insights into the mechanisms underlying causal relationships, enabling researchers to understand how much of the effect operates through specific pathways.

In the potential outcomes framework, the natural direct effect (NDE) measures the direct impact of X while allowing M to respond naturally to the reference level of X (i.e., x'); it is defined as E[Y_{x, M_{x'}} - Y_{x', M_{x'}}], where Y_{x, m} denotes the potential outcome of Y under intervention on X = x and M = m. The natural indirect effect (NIE) isolates the indirect pathway by holding the treatment level for the outcome fixed at x; it is given by E[Y_{x, M_x} - Y_{x, M_{x'}}]. The total effect then equals the sum of NDE and NIE: TE = NDE + NIE. These natural effects contrast with controlled direct effects, which fix M to a specific value m regardless of X, yielding E[Y_{x, m} - Y_{x', m}]; the natural direct effect, which instead fixes M at its value under the reference level x', is also termed the pure direct effect.

Identification of these effects from observational data requires assumptions like sequential ignorability, which posits that, conditional on covariates, the treatment X is independent of the potential outcomes for both M and Y, and M is independent of the potential outcomes for Y. Under these conditions, the NDE and NIE can be expressed via the mediation formula: for the NIE, \sum_m E[Y \mid x, m] [P(m \mid x) - P(m \mid x')], and similarly for the NDE. When confounders affect both X and Y but not the X \to M \to Y path, the front-door criterion identifies the effect transmitted through M even without full ignorability for the total effect.

A representative example involves job training programs (X), where participation affects wages (Y) partly through acquired skills (M). The indirect effect via skills might explain how training improves employability and earnings, while any direct effect could stem from networking or credentials unrelated to skill enhancement; empirical studies decompose these to evaluate program efficacy. Challenges in mediation analysis include interactions between direct and indirect paths, which can lead to effect heterogeneity and complicate additivity (e.g., TE \neq NDE + NIE under certain nonlinearities), as well as handling multiple mediators, where parallel or serial pathways require extensions like generalized mediation formulas to avoid over- or under-attribution of effects.
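In a linear SCM without exposure–mediator interaction, the natural effects reduce to products and sums of path coefficients; the simulation below (with assumed coefficients) evaluates the potential-outcome definitions directly and recovers NDE = c and NIE = a·b.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500_000

# Assumed linear SCM: M := a*X + e_m,  Y := c*X + b*M + e_y  (no interaction)
a, b, c = 0.8, 1.5, 0.5
e_m = rng.normal(size=n)
e_y = rng.normal(size=n)

def m_pot(x):            # potential mediator M_x
    return a * x + e_m

def y_pot(x, m):         # potential outcome Y_{x, m}
    return c * x + b * m + e_y

x1, x0 = 1.0, 0.0
nde = np.mean(y_pot(x1, m_pot(x0)) - y_pot(x0, m_pot(x0)))   # natural direct effect
nie = np.mean(y_pot(x1, m_pot(x1)) - y_pot(x1, m_pot(x0)))   # natural indirect effect
te  = np.mean(y_pot(x1, m_pot(x1)) - y_pot(x0, m_pot(x0)))   # total effect

print(f"NDE ≈ {nde:.2f}  (analytic value c = {c})")
print(f"NIE ≈ {nie:.2f}  (analytic value a*b = {a * b})")
print(f"TE  ≈ {te:.2f}  (= NDE + NIE)")
```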

Advanced Topics

Transportability

Transportability refers to the process of transferring causal effects estimated in one population or environment (the source) to another, distinct population or environment (the target), where data availability may differ between experimental and observational studies across domains. Unlike generalizability, which concerns extrapolating inferences from a sample to the larger population within the same domain, transportability addresses systematic differences between domains, such as varying covariate distributions or selection mechanisms, often requiring adjustments to ensure valid inference.

Graphical criteria for transportability are formalized using selection diagrams, which augment the causal graph with selection variables (S-nodes, typically depicted as squares) to indicate discrepancies between source and target environments. These S-nodes point to variables affected by domain-specific differences, such as sampling biases or environmental factors. A key condition for identifiability is the absence of S-nodes on certain paths, enabling criteria like S-admissibility, which extends the backdoor criterion to account for selection by ensuring that adjustments block paths involving S-nodes.

A representative example involves transporting the efficacy of a drug from a randomized controlled trial (RCT) conducted in one country (the source population) to the general population in another country (the target population), where differences arise due to covariates such as age distribution. If age (Z) is the primary differing factor, marked by an S-node pointing to Z, the causal effect in the target can be recovered by stratifying on Z in the source RCT data and reweighting by the distribution of Z in the target's observational data. Methods for achieving transportability include stratification, where effects are estimated conditionally on adjustment variables and then marginalized over the target distribution, and reweighting techniques to correct for selection biases, such as using odds ratios to adjust biased samples. A foundational formula for transporting an interventional effect is:

P'(y \mid do(x)) = \sum_z P(y \mid do(x), z) \, P'(z)

where P(y \mid do(x), z) is estimable from source experimental data stratified by Z, and P'(z) is the target marginal, assuming Z satisfies the graphical criteria (the prime denotes the target-population distribution).

Recent developments have integrated transportability with machine learning, enabling robust model transfer across environments by leveraging causal graphs to identify invariant mechanisms, as extended in frameworks combining transportability theory with neural networks for visual recognition tasks. These advances, building on Bareinboim and Pearl's 2013 work, include extensions to counterfactual transportability for handling heterogeneous data sources in 2022. As of 2025, further progress incorporates federated approaches to address heterogeneity across distributed sites without data sharing, and forecasting methods for future interventions.
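The transport formula amounts to reweighting stratum-specific interventional effects by the target's covariate distribution; the numbers below (stratum-specific recovery rates and age distributions) are assumed for illustration.

```python
import numpy as np

# Source RCT: stratum-specific interventional recovery rates P(Y=1 | do(X=x), Z=z)
#                        Z = young  Z = old
p_y_do = {1: np.array([0.70, 0.40]),    # treated
          0: np.array([0.50, 0.30])}    # untreated

p_z_source = np.array([0.8, 0.2])   # age distribution in the trial's country
p_z_target = np.array([0.4, 0.6])   # age distribution in the target country

def effect(p_z):
    """sum_z [P(y | do(1), z) - P(y | do(0), z)] * P'(z)."""
    return float((p_y_do[1] - p_y_do[0]) @ p_z)

print(f"effect in source population:  {effect(p_z_source):.3f}")   # 0.180
print(f"effect transported to target: {effect(p_z_target):.3f}")   # 0.140
```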

Bayesian Causal Networks

Bayesian causal networks, also referred to as causal Bayesian networks, represent a probabilistic extension of causal diagrams, modeling causal relationships through directed acyclic graphs (DAGs) where nodes denote random variables and directed edges indicate causal influences from parents to children. Each node is associated with a conditional probability distribution (CPD) that quantifies the probabilistic dependence of the variable on its direct causes (parents). This framework combines graphical structure with probability theory to encode both causal mechanisms and uncertainty in the joint distribution over variables. The joint distribution factorized by the network structure is expressed as the product of the local CPDs:

P(X_1, \dots, X_n) = \prod_{i=1}^n P(X_i \mid \mathrm{Pa}(X_i))

where \mathrm{Pa}(X_i) represents the set of parent nodes of X_i, assuming the absence of cycles in the DAG. Under the causal Markov condition, this factorization captures all conditional independencies implied by the causal structure: each variable is conditionally independent of its non-descendants given its parents, enabling arrows to be interpreted as direct causal effects when the model satisfies causal sufficiency (no unmeasured common causes). Structure learning in these networks proceeds via score-based methods, which optimize a scoring function (e.g., the Bayesian information criterion) over possible DAGs to balance fit and complexity, or constraint-based methods, which infer the skeleton and edge orientations from conditional independence tests on data.

Inference in Bayesian causal networks supports both observational and interventional queries. For observational probabilities, exact inference employs variable elimination, which systematically sums out irrelevant variables by constructing intermediate factors to compute marginals or conditionals efficiently, though it can suffer from exponential complexity in densely connected networks. Approximate methods like Markov chain Monte Carlo (MCMC) sampling generate samples from the posterior for large networks. Interventional do-queries, which estimate effects of hypothetical actions do(X=x), are handled by applying do-calculus rules to mutilate the graph—removing incoming edges to intervened nodes—followed by standard probabilistic inference on the modified network.

A representative example is a diagnostic network for respiratory disease, in which the node "Tuberculosis" has a low-base-rate CPD conditioned on a parent "Visit to Asia" acting as a risk factor, and symptoms like "Dyspnea" and "X-ray Abnormality" appear as descendants conditioned on the disease and other factors such as smoking. Observing symptoms like dyspnea allows inference of the posterior probability of tuberculosis via Bayesian updating. This setup, inspired by the classic Asia network, illustrates how evidence propagates from effects to causes for diagnostic reasoning.

Bayesian causal networks offer key advantages in causal modeling by explicitly handling parameter uncertainty through Bayesian updates on CPDs and incorporating domain priors to regularize learning from limited data. Recent applications extend to causal fairness in machine learning, where networks model spurious causal paths from sensitive attributes (e.g., race or gender) to outcomes, enabling interventions to block unfair influences while preserving legitimate causal effects; for example, Chiappa (2019) demonstrates how such graphs quantify dataset unfairness and guide fair model design in scenarios with multiple bias sources. As of 2025, advancements include large language model (LLM)-assisted structure construction for data-free and data-driven scenarios, and scalable methods for incorporating interventional data in high dimensions.
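A miniature version of such a diagnostic network, with assumed probabilities and only three nodes (Visit → Tuberculosis → Dyspnea), shows both the Markov factorization and the difference between observing and intervening on a symptom.

```python
# Toy diagnostic network with assumed CPDs, loosely inspired by the Asia example
p_visit = {1: 0.01, 0: 0.99}                                        # P(V)
p_tb_given_visit = {1: {1: 0.05, 0: 0.95}, 0: {1: 0.01, 0: 0.99}}   # P(T | V)
p_dysp_given_tb = {1: {1: 0.70, 0: 0.30}, 0: {1: 0.10, 0: 0.90}}    # P(D | T)

def joint(v, t, d):
    """Markov factorization: P(V, T, D) = P(V) P(T | V) P(D | T)."""
    return p_visit[v] * p_tb_given_visit[v][t] * p_dysp_given_tb[t][d]

# Diagnostic (observational) query: P(T=1 | D=1), summing out V
num = sum(joint(v, 1, 1) for v in (0, 1))
den = sum(joint(v, t, 1) for v in (0, 1) for t in (0, 1))
print(f"P(tuberculosis | dyspnea) = {num / den:.4f}")

# Interventional query do(D=1): mutilate the graph by cutting T -> D,
# so forcing the symptom carries no diagnostic information about the disease
p_tb_do = sum(p_visit[v] * p_tb_given_visit[v][1] for v in (0, 1))
print(f"P(tuberculosis | do(dyspnea=1)) = {p_tb_do:.4f}")   # equals the prior P(T=1)
```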

Causal Discovery

Causal discovery involves inferring causal structures, often represented as directed acyclic graphs (DAGs), from observational or interventional data. This process assumes faithfulness, meaning that the graph encodes all conditional independencies present in the data distribution, allowing observed independencies to reveal separations in the graph. Methods in causal discovery aim to recover the DAG or its Markov equivalence class, which consists of graphs implying the same set of conditional independencies.

Constraint-based approaches, such as the PC algorithm developed by Spirtes et al., begin by constructing an undirected skeleton through conditional independence tests and then orient edges using specific patterns. Skeleton recovery relies on partial correlations to test for conditional independence; for variables X, Y, and a single conditioning variable Z, the partial correlation is computed as

\rho_{XY \cdot Z} = \frac{\rho_{XY} - \rho_{XZ} \rho_{YZ}}{\sqrt{(1 - \rho_{XZ}^2)(1 - \rho_{YZ}^2)}}

where \rho denotes Pearson correlation coefficients, and edges are removed if \rho_{XY \cdot Z} is statistically indistinguishable from zero at a chosen significance level. Orientation proceeds by identifying v-structures, where two variables are independent unconditionally but become dependent given a third, indicating converging arrows into the third variable. In contrast, score-based methods evaluate candidate DAGs using a scoring function that balances fit to data and model complexity, such as the Bayesian information criterion (BIC), defined as \mathrm{BIC} = \log L - \frac{k}{2} \log n, where L is the likelihood, k the number of parameters, and n the sample size. The Greedy Equivalence Search (GES) algorithm by Chickering employs a greedy hill-climbing strategy over equivalence classes to maximize the score, starting from an empty graph and adding or deleting edges iteratively.

Key challenges in causal discovery include identifying causal directions within Markov equivalence classes, where multiple DAGs are indistinguishable from observational data alone, requiring additional assumptions or interventions to resolve. Multiple testing across numerous conditional independence tests inflates false positives, particularly in high dimensions, while latent variables can induce spurious associations that confound structure recovery. For example, in time-series data, Granger causality assesses whether past values of one variable predict another beyond its own history, aiding discovery of potential causal directions. Interventions, implemented via do-experiments that exogenously set variable values, enable edge orientation by observing changes that break observational symmetries.

Recent advances leverage machine learning and continuous optimization to address nonlinearities and scalability; the NOTEARS method by Zheng et al. formulates DAG learning as a continuous optimization problem, minimizing a score subject to an acyclicity constraint enforced by a trace-exponential function. Post-2023 variants extend this to high-dimensional settings with data limitations, incorporating deep architectures for nonlinear causal discovery while maintaining interpretability. These approaches also integrate fairness constraints to avoid discriminatory structures in learned graphs. As of 2025, notable progress includes LLM-based methods for causal discovery, enhancing reasoning in complex environments, and applications grounded in real-world domains such as healthcare and economics to improve practical utility.
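The skeleton step of the PC algorithm reduces, in the linear-Gaussian case, to partial-correlation tests; the sketch below implements the formula above with a Fisher z-test and applies it to an assumed chain X → Z → Y, where the X–Y edge should be removed given Z.

```python
import numpy as np
from scipy import stats

def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation of X and Y given a single conditioning variable Z."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def ci_test(x, y, z, alpha=0.05):
    """Fisher z-test of X independent of Y given Z, as used in PC skeleton recovery."""
    r = partial_corr(np.corrcoef(x, y)[0, 1],
                     np.corrcoef(x, z)[0, 1],
                     np.corrcoef(y, z)[0, 1])
    n = len(x)
    z_stat = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - 1 - 3)   # |Z| = 1
    p = 2 * (1 - stats.norm.cdf(abs(z_stat)))
    return r, p > alpha          # True -> treat as independent -> remove the X - Y edge

# Assumed chain X -> Z -> Y
rng = np.random.default_rng(9)
n = 5_000
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)
y = 0.8 * z + rng.normal(size=n)

r, independent = ci_test(x, y, z)
print(f"partial corr(X, Y | Z) = {r:.3f}, remove the X-Y edge: {independent}")
```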