
Propensity score matching

Propensity score matching (PSM) is a statistical technique employed in observational studies to estimate causal effects of treatments or interventions by reducing selection bias and confounding through the creation of comparable groups of treated and untreated subjects. The core concept revolves around the propensity score, defined as the conditional probability of receiving treatment given a vector of observed baseline covariates, which allows for balancing these covariates across groups to approximate the conditions of a randomized controlled trial. This method is particularly valuable in fields such as epidemiology, economics, and social sciences, where randomization is often infeasible, enabling researchers to draw more reliable inferences about treatment effects from non-experimental data. Developed by Paul R. Rosenbaum and Donald B. Rubin, PSM was first formally introduced in their seminal 1983 paper, which established the theoretical foundation for using propensity scores to adjust for confounding in observational data. Prior to this, methods for handling confounding relied on direct covariate adjustment or stratification, but Rosenbaum and Rubin demonstrated that the propensity score alone could suffice for balancing multiple covariates, simplifying analysis while preserving the potential outcomes framework for causal inference. Since its inception, PSM has gained widespread adoption, especially following advancements in computational tools and software implementations, making it accessible for large-scale datasets. In practice, PSM begins with estimating the propensity score, typically via logistic regression, in which treatment status is regressed on the observed covariates to predict the probability of treatment.
Once estimated, matching is performed by pairing treated units with one or more untreated units having the closest propensity scores, often using techniques such as nearest-neighbor matching with or without replacement, caliper restrictions to limit score differences, or optimal matching algorithms to minimize overall imbalance. Alternative applications of the propensity score include stratification into subclasses (e.g., quintiles) for within-group comparisons, inverse probability of treatment weighting (IPTW) to create a pseudo-population balanced on covariates, or direct inclusion as a covariate in outcome regression models. After matching or adjustment, covariate balance is assessed using standardized mean differences or graphical tools like love plots to verify that the distribution of baseline characteristics is similar between groups. PSM's primary advantages lie in its ability to separate the design phase (balancing covariates) from the analysis phase (estimating effects), thereby reducing bias from measured confounders and facilitating the estimation of average treatment effects on the treated (ATT) or in the full population (ATE). It has been extensively applied in medical research to evaluate drug efficacy, in economics to assess policy impacts, and in public health to study social determinants of health, often yielding results comparable to randomized trials when assumptions hold. However, PSM assumes that all relevant confounders are observed and correctly modeled (no unmeasured confounding), and it can lead to loss of data efficiency if many units remain unmatched, potentially reducing statistical power. Sensitivity analyses are recommended to evaluate robustness to potential hidden biases.

Background and Motivation

Observational Data and Causal Inference Challenges

Observational data arise from studies in which treatments or exposures are not randomly assigned to participants but instead occur as they naturally would in real-world settings, such as through patient choices, policy implementations, or environmental factors. This contrasts sharply with randomized controlled trials (RCTs), where random assignment ensures that treated and control groups are comparable on both observed and unobserved characteristics, thereby minimizing selection bias and enabling unbiased causal estimates. In observational studies, however, treated and control groups often differ systematically due to non-random selection into treatment, leading to incomparable groups and biased estimates of causal effects if not properly addressed. A core challenge in causal inference from observational data is the fundamental problem that counterfactual outcomes—what would have happened to a treated unit had it not received treatment, or vice versa—cannot be directly observed for the same individual or unit. This missing data issue, first formalized in the potential outcomes framework, underscores the impossibility of simultaneously observing both potential outcomes under treatment and no treatment, making direct causal comparisons inherently unfeasible without additional assumptions or methods. To tackle these issues in non-randomized settings, propensity score methods were pioneered by Rosenbaum and Rubin in 1983, providing a framework for estimating causal effects by balancing observed covariates between groups. For instance, in studies using electronic patient records, these methods help estimate the effects of interventions like drug therapies versus standard care, where ethical constraints prevent randomization, by creating comparable cohorts from historical data. Confounding variables, which influence both treatment assignment and outcomes, exacerbate these biases but can be mitigated through appropriate adjustment techniques.

Role of Confounding in Bias

Confounding refers to a situation in which a third variable, known as a confounder, is associated with both the treatment assignment and the outcome, resulting in spurious associations that distort the estimated causal effect. This occurs because the confounder creates a non-causal pathway linking treatment and outcome, leading to biased estimates if not addressed. In observational data, confounding bias is one of several key sources of distortion, alongside selection bias and collider bias. Selection bias arises when the study sample is not representative of the target population due to systematic differences in how participants are included, potentially exaggerating or masking true effects. Collider bias, a subtype often related to selection, emerges when conditioning on a common effect of both treatment and outcome induces a spurious association between them. Directed acyclic graphs (DAGs) provide a visual framework for illustrating confounding through causal structures. In a DAG, nodes represent variables, and directed edges indicate causal directions; confounding manifests as "backdoor paths"—non-causal routes from treatment to outcome that begin with an arrow pointing into the treatment, opened by shared common causes. For instance, if a confounder C causes both treatment A and outcome Y, the path A \leftarrow C \rightarrow Y represents a backdoor path that must be blocked to eliminate confounding. The consequences of unadjusted confounding include overestimation or underestimation of the average treatment effect (ATE), potentially reversing the direction of the apparent effect. Consider a hypothetical study examining the association between coffee consumption and cancer risk, where smoking acts as an unmeasured confounder (smokers are more likely to drink coffee and have higher cancer risk). In the full cohort of 20,000 participants:
Group                 Cancer Cases   No Cancer    Total
Coffee drinkers            105          11,395   11,500
Non-coffee drinkers         45           8,455    8,500
This yields a crude relative risk (RR) of 1.72, suggesting coffee increases cancer risk. However, stratifying by smoking reveals no association (RR = 1.0 in both smokers and non-smokers), demonstrating how confounding inflated the estimate. To mitigate confounding bias, adjustment methods such as stratification, regression modeling, or matching are essential, as they block backdoor paths by conditioning on the confounder, enabling unbiased estimation of causal effects. Propensity scores offer a particularly efficient approach for multivariable adjustment by summarizing confounder information into a single balancing score.
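The crude and stratum-specific relative risks can be reproduced with a short calculation. The marginal counts come from the table above; the smoking strata below are one hypothetical split consistent with those margins, assumed purely for illustration.

```python
# Crude vs. stratified relative risk for the coffee/cancer example.
# Marginal counts are from the table; the smoking strata are a hypothetical
# split consistent with those margins (assumed for illustration only).

def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """RR = risk among exposed / risk among unexposed."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)

# Marginal table: coffee drinkers vs. non-drinkers
crude_rr = relative_risk(105, 11500, 45, 8500)

# Hypothetical strata: smokers drink more coffee and carry a 2% cancer
# risk, versus roughly 0.21% for non-smokers.
smoker_rr = relative_risk(90, 4500, 30, 1500)        # smokers: 2% vs 2%
nonsmoker_rr = relative_risk(15, 7000, 15, 7000)     # non-smokers: equal risks

print(f"crude RR = {crude_rr:.2f}")           # ≈ 1.72
print(f"smoker RR = {smoker_rr:.1f}")         # 1.0
print(f"non-smoker RR = {nonsmoker_rr:.1f}")  # 1.0
```

The stratum-specific risks are identical within each smoking group, so the apparent association vanishes once smoking is conditioned on, matching the narrative above.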

Introduction to Matching Methods

Matching methods represent a non-parametric approach to causal inference in observational studies, where treated and control units are paired based on observed covariates to approximate the conditions of a randomized experiment and thereby reduce bias. By selecting pairs or groups with similar covariate distributions, matching aims to balance the composition of treated and untreated groups, allowing the observed differences in outcomes to more closely reflect the causal effect of the treatment rather than differences in baseline characteristics. This method mimics randomization by ensuring that, within matched sets, treatment assignment is effectively independent of the covariates used for matching. Compared to parametric methods such as regression adjustment, matching offers several advantages, including reduced reliance on strong modeling assumptions about the relationship between covariates and outcomes, which can lead to bias if misspecified. Matching facilitates direct visualization and assessment of covariate balance through simple diagnostics, such as side-by-side histograms or love plots, enabling researchers to verify the quality of the adjustment before estimating effects. Additionally, it mitigates issues of extrapolation by discarding units without suitable matches, focusing analysis on the region of common support where causal estimates are most credible. Various types of matching have been developed to handle the challenges of covariate imbalance. Exact matching requires identical values on all covariates, which is feasible only in low-dimensional settings but becomes impractical with continuous or numerous variables, often resulting in few or no matches. Covariate matching, such as using Mahalanobis distance, pairs units based on multivariate similarity in raw covariates and performs well with a limited number of discrete or continuous predictors, typically fewer than eight. Propensity score-based matching, which summarizes covariates into a single scalar—the probability of treatment—scales better in higher dimensions and is particularly useful for balancing multiple confounders.
Matching methods have a long history in non-experimental research, with early applications aimed at controlling for extraneous variables in observational designs. Theoretical foundations were subsequently advanced through work in statistics and economics, emphasizing bias reduction in observational data. For instance, in studies of job training programs, researchers might match participants (treated) to non-participants (controls) on observed characteristics such as age and education to estimate program effects on earnings, ensuring that paired individuals share similar demographic profiles and thus isolating the training's impact. Propensity scores can further refine such matching by improving balance across a broader set of covariates.

Core Concepts and Assumptions

Strongly Ignorable Treatment Assignment

Strong ignorability of treatment assignment is a foundational assumption in using observational data, ensuring that observed covariates capture all relevant factors. Formally, treatment assignment T is strongly ignorable given covariates X if the potential outcomes (Y(0), Y(1)) are independent of T conditional on X, denoted as (Y(0), Y(1)) \perp T \mid X, and the positivity condition holds: 0 < P(T=1 \mid X) < 1. This assumption, introduced by Rosenbaum and Rubin, underpins the validity of methods like propensity score matching by allowing unbiased estimation of causal effects. The assumption comprises two key components: conditional independence, which posits no unmeasured confounding such that treatment assignment depends only on observed covariates, and overlap (positivity), which ensures every unit has a non-zero probability of receiving either treatment across the covariate space. Without conditional independence, unobserved factors could systematically influence both treatment and outcomes, leading to biased estimates. The overlap condition prevents reliance on extrapolation beyond the data's support, where propensity scores near 0 or 1 indicate regions of poor comparability between treated and control groups. Under strong ignorability, the average treatment effect (ATE) can be identified as the expected value of the conditional mean difference: \text{ATE} = E\left[ E[Y \mid T=1, X] - E[Y \mid T=0, X] \right]. This identification formula enables the use of observed data to approximate counterfactuals, facilitating causal inference without randomization. Violations of strong ignorability introduce bias; unmeasured confounding results in systematic differences between treated and control groups that cannot be adjusted for, while poor overlap leads to unstable estimates due to extrapolation in covariate regions with sparse data.
Testing the assumption is challenging, as conditional independence cannot be directly verified from observed data alone, but overlap can be assessed by examining the distribution of estimated propensity scores across treatment groups, such as through histograms or density plots to identify areas of limited common support.
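A minimal sketch of such an overlap check, assuming illustrative propensity scores rather than output from a fitted model: bin the scores by treatment group and flag bins where either group is empty, which is the tabular analogue of inspecting overlaid histograms.

```python
# Overlap (positivity) diagnostic: count units per propensity-score bin for
# each treatment group and flag bins with no representation from one group.
# The score lists below are illustrative placeholders, not model output.

def overlap_table(scores_treated, scores_control, n_bins=5):
    """Return (lo, hi, n_treated, n_control) counts for each score bin."""
    counts = []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        def in_bin(s):
            return lo <= s < hi or (b == n_bins - 1 and s == hi)
        counts.append((lo, hi,
                       sum(1 for s in scores_treated if in_bin(s)),
                       sum(1 for s in scores_control if in_bin(s))))
    return counts

treated = [0.35, 0.52, 0.61, 0.74, 0.88, 0.91]
control = [0.05, 0.12, 0.18, 0.33, 0.47, 0.55]

for lo, hi, n_t, n_c in overlap_table(treated, control):
    flag = "  <-- limited common support" if n_t == 0 or n_c == 0 else ""
    print(f"[{lo:.1f}, {hi:.1f}): treated={n_t}, control={n_c}{flag}")
```

With these scores, the lowest bin contains only controls and the two highest bins contain only treated units, suggesting that analysis should be restricted to (or trimmed toward) the middle of the score distribution.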

Balancing Score

In causal inference from observational data, a balancing score is defined as a function b(\mathbf{X}) of the observed covariates \mathbf{X} such that the treatment assignment T is independent of the covariates conditional on the balancing score: T \perp \mathbf{X} \mid b(\mathbf{X}). This conditional independence ensures that, within levels of b(\mathbf{X}), the distribution of \mathbf{X} is the same for treated and untreated units, facilitating fair comparisons without requiring direct adjustment for every covariate. The primary role of a balancing score in causal inference is to serve as a dimension-reduction tool, allowing researchers to condition on b(\mathbf{X}) rather than the full set of covariates \mathbf{X} when estimating treatment effects through methods like matching or stratification. This approach mitigates the curse of dimensionality, which arises in high-dimensional settings where matching directly on numerous covariates becomes computationally infeasible and prone to poor overlap between treated and control groups. By summarizing the relevant information in \mathbf{X} into a lower-dimensional scalar or vector, b(\mathbf{X}) preserves the key balancing properties needed for unbiased estimation under the assumption of strongly ignorable treatment assignment. A key property of balancing scores is that any such function suffices for unbiased causal effect estimation provided that strong ignorability holds, meaning that treatment assignment is independent of potential outcomes given \mathbf{X}, and positivity is satisfied. Balancing scores can range from the identity function b(\mathbf{X}) = \mathbf{X} in low-dimensional cases, where direct covariate adjustment is practical, to more compressed forms like sufficient statistics in parametric models or the propensity score in nonparametric settings.
The propensity score, defined as the conditional probability of treatment given \mathbf{X}, represents a specific and particularly useful balancing score due to its coarseness and ease of estimation. Despite these advantages, balancing scores must be estimated accurately from data; misspecification of the function b(\mathbf{X}), such as through an incorrectly parameterized model, can fail to achieve the required conditional independence, leading to residual confounding and biased treatment effect estimates. This sensitivity underscores the importance of model validation techniques, like covariate balance checks, to ensure the estimated balancing score performs as intended.

Propensity Score Definition

The propensity score is formally defined as the conditional probability of receiving treatment given a set of observed covariates, denoted as e(\mathbf{X}) = P(T=1 \mid \mathbf{X}), where T is the binary treatment indicator (with T=1 denoting treatment receipt) and \mathbf{X} represents the vector of baseline covariates. This definition, introduced by Rosenbaum and Rubin (1983), captures the "propensity" toward treatment assignment based solely on observable characteristics, enabling a dimension reduction from the full covariate space to a single scalar value. Intuitively, the propensity score summarizes all relevant information from the covariates \mathbf{X} into one dimension, facilitating the matching of treated and control units that have similar probabilities of treatment, thereby approximating the covariate balance achieved in randomized experiments. As a probability, it inherently lies within the interval [0, 1], and under the standard positivity assumption (where 0 < e(\mathbf{X}) < 1 for all observed \mathbf{X}), it ensures overlap between treated and control groups, preventing extrapolation beyond the data. If the covariates \mathbf{X} are continuous, the propensity score is also continuous, preserving the distributional properties needed for effective balancing. A common way to conceptualize its form is through logistic regression, where the propensity score emerges as e(\mathbf{X}) = \frac{1}{1 + \exp(-\boldsymbol{\beta}^\top \mathbf{X})}, but it is essential to recognize that the score itself is this probability output, independent of the specific modeling approach used to derive it. A frequent misconception is that the propensity score represents the probability of a particular outcome under treatment; in reality, it pertains exclusively to the likelihood of treatment assignment, not the potential response or effect on the outcome variable.
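The logistic form above can be evaluated directly. The coefficients and covariates in this sketch (age in decades and a smoker indicator, with assumed beta values) are hypothetical placeholders chosen only to show that the score is a probability of treatment, not of any outcome.

```python
import math

# Evaluating e(X) = 1 / (1 + exp(-(b0 + beta . X))) for one unit.
# Coefficients and covariate values are illustrative assumptions.

def propensity_score(x, beta, intercept=0.0):
    """Logistic-form propensity score for covariate vector x."""
    linear = intercept + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-linear))

beta = [0.4, 1.2]   # assumed coefficients for [age_in_decades, smoker]
e = propensity_score([5.0, 1.0], beta, intercept=-3.0)
print(f"e(X) = {e:.3f}")   # ≈ 0.550, strictly inside (0, 1)
```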

Theoretical Foundations

Main Theorems on Propensity Scores

The foundational theorems on propensity scores, established by Rosenbaum and Rubin (1983), provide the theoretical justification for using the propensity score to balance covariates and identify causal effects under the assumption of strong ignorability. Theorem 1 (Propensity Score as a Balancing Score): Under strong ignorability of treatment assignment given the covariates X (i.e., (Y(1), Y(0)) \perp T \mid X and 0 < P(T=1 \mid X) < 1), the propensity score e(X) = P(T=1 \mid X) is a balancing score, satisfying T \perp X \mid e(X). This means that, conditional on e(X), the distribution of covariates X is independent of treatment assignment T. A proof sketch relies on the definition of conditional probability and Bayes' theorem. Consider the conditional density of X given T = t and e(X) = e, denoted f(x \mid t, e). For t = 1, f(x \mid 1, e) = \frac{P(T=1 \mid X=x) f(x)}{P(e(X)=e \mid T=1)} = \frac{e \cdot f(x)}{\int_{e(x')=e} e \cdot f(x') \, dx'}, where the integral is over the set \{x' : e(x') = e\}. Since e is constant within this set, this simplifies to f(x \mid 1, e) = f(x \mid e). Similarly, for t = 0, f(x \mid 0, e) = \frac{P(T=0 \mid X=x) f(x)}{P(e(X)=e \mid T=0)} = \frac{(1-e) \cdot f(x)}{\int_{e(x')=e} (1-e) \cdot f(x') \, dx'} = f(x \mid e). Thus, f(x \mid 1, e) = f(x \mid 0, e), confirming T \perp X \mid e(X). Theorem 2 (Unbiased Conditional Expectation): Under strong ignorability given X, it also holds given the propensity score, so E[Y \mid T, X] = E[Y \mid T, e(X)]. This follows because strong ignorability implies (Y(1), Y(0)) \perp T \mid e(X), making the conditional expectation of the outcome Y depend only on T and e(X) rather than the full set of covariates X. Theorem 3 (Identification of the Average Treatment Effect): Under strong ignorability given e(X), the average treatment effect (ATE) \tau = E[Y(1) - Y(0)] is identified as \tau = E\left[ E[Y \mid T=1, e(X)] - E[Y \mid T=0, e(X)] \right].
This uses the law of iterated expectations: since E[Y \mid T=t, e(X)] = E[Y(t) \mid e(X)] for t=0,1, integrating over the distribution of e(X) yields the population-level ATE. An extension identifies the average treatment effect on the treated (ATT), E[Y(1) - Y(0) \mid T=1], as E\left[ E[Y \mid T=1, e(X)] - E[Y \mid T=0, e(X)] \mid T=1 \right], which conditions on the distribution of e(X) among the treated units, effectively weighting by the treated sample.
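The identification formula can be checked numerically by stratifying on the propensity score and averaging the within-stratum mean differences. The toy dataset below, in which e(X) takes only two values and treated outcomes exceed control outcomes by exactly 2 within each stratum, is assumed purely for illustration, so the stratified estimate should recover an ATE of 2.

```python
# Numeric check of ATE = E[ E[Y|T=1, e(X)] - E[Y|T=0, e(X)] ] on a toy
# dataset with two propensity-score strata (all numbers illustrative).

# Observations as (e(X) stratum, T, Y); within each stratum the treated
# mean exceeds the control mean by exactly 2.
data = [
    (0.3, 1, 7), (0.3, 1, 9), (0.3, 0, 5), (0.3, 0, 7), (0.3, 0, 6),
    (0.7, 1, 12), (0.7, 1, 14), (0.7, 1, 13), (0.7, 0, 11),
]

def ate_by_stratification(data):
    """Average within-stratum mean differences, weighted by stratum size."""
    strata = sorted({e for e, _, _ in data})
    total_n = len(data)
    ate = 0.0
    for s in strata:
        y1 = [y for e, t, y in data if e == s and t == 1]
        y0 = [y for e, t, y in data if e == s and t == 0]
        n_s = sum(1 for e, _, _ in data if e == s)
        ate += (n_s / total_n) * (sum(y1) / len(y1) - sum(y0) / len(y0))
    return ate

print(f"stratified ATE estimate = {ate_by_stratification(data):.2f}")  # 2.00
```

Weighting the strata by the treated units' stratum shares instead would give the ATT analogue described above.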

Relationship to Sufficiency

In the context of propensity score methods, the propensity score e(\mathbf{X}) serves as a sufficient statistic for the treatment assignment indicator T when the conditional distribution of the covariates \mathbf{X} given T and e(\mathbf{X}) equals the conditional distribution of \mathbf{X} given T alone. This property arises from the balancing score framework introduced by Rosenbaum and Rubin, where e(\mathbf{X}) = P(T=1 \mid \mathbf{X}) ensures that treatment assignment is independent of the covariates conditional on the score. Under the assumption of strongly ignorable treatment assignment, the propensity score captures all relevant information in the covariates \mathbf{X} pertaining to the treatment T, functioning analogously to a sufficient statistic in the treatment assignment model. Specifically, it summarizes the covariate information such that further conditioning on the full set of \mathbf{X} provides no additional insight into the distribution of T. This sufficiency enables model-free causal inference within strata defined by values of e(\mathbf{X}), as the score alone balances the covariate distributions across treatment groups, obviating the need to specify a full parametric model for all covariates when the propensity score is accurately estimated. Consequently, analyses can proceed by stratifying, matching, or weighting on e(\mathbf{X}) to approximate randomized assignment without relying on the multidimensional covariate structure. For instance, in parametric models such as linear regression for the outcome, the sufficiency of the propensity score simplifies the likelihood function by dimension reduction: adjusting for the score yields the same estimate of the average causal effect as adjusting for the entire covariate vector, without loss of asymptotic efficiency when the true score is used.
However, this sufficiency applies exclusively to the treatment assignment model and does not extend to the outcome model, where unmodeled relationships between covariates and the potential outcomes may still require direct adjustment to avoid bias.

Properties of Propensity Scores

The propensity score e(\mathbf{X}) possesses a unique property among balancing scores: it is the coarsest such score, meaning that any other balancing score b(\mathbf{X}) satisfies e(\mathbf{X}) = f(b(\mathbf{X})) for some measurable function f, implying that b(\mathbf{X}) refines the partition induced by e(\mathbf{X}). This coarseness ensures that conditioning on e(\mathbf{X}) creates the largest possible strata where treatment assignment is independent of the covariates, while any monotonic transformation of e(\mathbf{X}) remains a balancing score, preserving the conditional independence property. Matching on the propensity score reduces the variance of causal effect estimators compared to direct adjustment on the full set of covariates \mathbf{X}, particularly in large samples where high-dimensional covariate adjustment can lead to increased variability due to estimation error. This variance reduction arises because the one-dimensional propensity score summarizes the covariate information relevant for balancing, avoiding the inefficiency of matching in high-dimensional spaces while maintaining balance. Under correct specification of the propensity score model and the strong ignorability assumption, estimators based on matching or stratification using e(\mathbf{X}), such as the average treatment effect, are consistent as the sample size grows, converging to the true causal effect. This asymptotic consistency holds provided the positivity condition is satisfied, ensuring non-zero probabilities of treatment assignment across covariate values. For treatments beyond binary outcomes, the propensity score extends to the generalized propensity score for continuous or multilevel treatments, defined as the conditional density of treatment given covariates, which balances covariates within levels of the treatment dose. This generalization, proposed by Imbens and colleagues, allows estimation of dose-response functions while preserving the balancing properties of the original propensity score framework.
A key property for practical application is the overlap in the distribution of e(\mathbf{X}) between treated and control groups, which diagnoses the positivity assumption by examining the density of propensity scores across groups to identify regions of common support where both treatment probabilities are bounded away from 0 and 1. Insufficient overlap indicates potential extrapolation beyond supported covariate distributions, violating the assumptions underlying propensity score methods.

Estimation and Implementation Procedures

Estimating Propensity Scores

Estimating the propensity score, denoted as e(\mathbf{X}), the conditional probability of receiving treatment given observed covariates \mathbf{X}, is a critical step in propensity score matching. The most widely adopted parametric approach is logistic regression, which models the log-odds of treatment as a linear function of the covariates: \log\left( \frac{e(\mathbf{X})}{1 - e(\mathbf{X})} \right) = \beta_0 + \boldsymbol{\beta}' \mathbf{X}, where \beta_0 is the intercept and \boldsymbol{\beta} are the coefficients estimated via maximum likelihood. This method assumes a linear relationship in the logit scale and is computationally efficient, making it suitable for datasets with moderate numbers of covariates. However, it requires careful specification to include all relevant confounders that influence both treatment assignment and the outcome, as failure to do so can introduce bias. For scenarios where the logistic model may be too restrictive, such as when relationships between covariates and treatment are nonlinear or interactive, non-parametric and semi-parametric alternatives offer greater flexibility. Kernel regression and spline-based methods, like generalized additive models, allow for smooth, non-linear fits without assuming a specific functional form. Machine learning techniques, including random forests and boosted regression trees, further enhance robustness by capturing complex interactions and reducing sensitivity to model misspecification; for instance, random forests aggregate multiple decision trees to estimate e(\mathbf{X}) as the proportion of trees assigning treatment. These approaches are particularly useful in high-dimensional settings but demand larger sample sizes to avoid instability. Ensemble methods, such as the Super Learner, combine multiple algorithms via cross-validation to further improve estimation robustness. Model selection plays a key role in preventing overfitting, which can distort balance and lead to poor matching quality.
Techniques such as cross-validation, which partitions data to evaluate predictive performance, or information criteria like the Akaike Information Criterion (AIC), help balance model complexity against fit by penalizing excessive parameters. Additionally, incorporating interaction terms or higher-order polynomials may be necessary if preliminary diagnostics suggest non-additive effects among covariates, though this increases the risk of multicollinearity. An essential validation step involves examining the overlap in the distributions of estimated propensity scores between treated and untreated groups to ensure adequate common support, where matching is feasible. This can be visualized through histograms or density plots; substantial regions of non-overlap indicate extrapolation risks and may require trimming or restricting the sample. Poor overlap often signals model inadequacy or inherent data limitations, underscoring the need for iterative refinement of the estimation process.
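A minimal sketch of the estimation step: the logistic model is fit here by plain gradient ascent on the Bernoulli log-likelihood rather than with a statistics library, and both the simulated data and learning settings (learning rate, iteration count, true coefficients) are illustrative assumptions.

```python
import math
import random

# Propensity score estimation by logistic regression, fit with gradient
# ascent on the log-likelihood. For real analyses a statistics package
# would be used; everything here is a self-contained illustration.

def fit_logistic(X, t, lr=0.1, iters=2000):
    """Return (intercept, coefficients) maximizing the log-likelihood."""
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(iters):
        g0, g = 0.0, [0.0] * p
        for xi, ti in zip(X, t):
            e = 1.0 / (1.0 + math.exp(-(b0 + sum(bj * xj for bj, xj in zip(b, xi)))))
            r = ti - e                 # score contribution (residual)
            g0 += r
            for j in range(p):
                g[j] += r * xi[j]
        b0 += lr * g0 / n              # averaged-gradient ascent step
        for j in range(p):
            b[j] += lr * g[j] / n
    return b0, b

# Simulated data: treatment probability rises with a single covariate x,
# with true intercept 0.5 and slope 1.0 (assumed for this demo).
random.seed(0)
X = [[random.gauss(0, 1)] for _ in range(500)]
t = [1 if random.random() < 1 / (1 + math.exp(-(0.5 + 1.0 * x[0]))) else 0 for x in X]

b0, b = fit_logistic(X, t)
scores = [1 / (1 + math.exp(-(b0 + b[0] * x[0]))) for x in X]
print(f"fitted intercept ~ {b0:.2f}, slope ~ {b[0]:.2f}")
print(f"score range: [{min(scores):.2f}, {max(scores):.2f}]")
```

The fitted scores all lie strictly inside (0, 1); plotting their distributions by treatment group (e.g., as overlaid histograms) would then provide the common-support check described above.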

Matching Algorithms

Matching algorithms in propensity score matching pair treated and untreated units based on their estimated propensity scores to approximate a randomized experiment and reduce confounding bias. These methods leverage the propensity score, defined as the conditional probability of treatment given observed covariates, to form balanced comparison groups. Common approaches include nearest neighbor matching, optimal matching, stratification, full matching, and extensions like inverse probability weighting, each varying in how they utilize the propensity scores to create matches or weights. Nearest neighbor matching pairs each treated unit with the untreated unit that has the closest propensity score, often using a greedy algorithm that sequentially selects the best available match without replacement. To prevent poor matches, a caliper restriction is commonly applied, such as limiting matches to within 0.2 standard deviations of the logit of the propensity score, which discards treated units without suitable controls and improves covariate balance. This method is computationally simple and widely used, though it may leave some units unmatched and depends on the order of processing. Optimal matching seeks to minimize the overall distance across all matched pairs, treating the problem as a minimum weight bipartite matching in a network flow framework rather than greedy selection. This approach, which can be applied to one-to-one or one-to-many pairings, generally yields better global balance than nearest neighbor methods by optimizing the total sum of absolute propensity score differences. Seminal implementations use integer linear programming or auction algorithms for efficiency in large datasets. Stratification divides the sample into strata based on propensity score quintiles or equal-sized intervals, typically five to ten groups, then computes treatment effects as weighted averages within each stratum where the propensity score balances covariates by design. 
This method ensures all units are used without explicit pairing, assuming balance within strata, and is robust to propensity score model misspecification as long as strata are fine enough. The overall effect is a simple average across strata, providing stable estimates with minimal computational demands. Full matching creates a set of matched clusters where each cluster contains at least one treated and one untreated unit, allowing variable ratios of treated to untreated units to maximize the number of matched subjects while optimizing balance on the propensity score. Unlike pairwise methods, it permits flexible groupings and uses all available data more efficiently, often implemented via network optimization to minimize within-cluster distance. This approach enhances precision in heterogeneous populations by avoiding the discard of observations common in stricter matching. As an extension of matching principles, inverse probability weighting (IPW) assigns weights to units based on their propensity scores to balance the covariate distribution without explicit pairing, effectively creating a pseudo-population where treatment assignment is independent of covariates. For treated units, the weight is \frac{1}{e(X)}, and for untreated units, it is \frac{1}{1 - e(X)}, where e(X) is the estimated propensity score; the weighted average treatment effect is then estimated using these weights in a regression or direct difference. IPW is particularly useful when overlap is poor for matching but requires stabilization or truncation to handle extreme weights. Post-matching balance diagnostics, such as standardized mean differences, should verify covariate comparability across methods.
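Greedy 1:1 nearest-neighbor matching with a caliper, as described above, can be sketched in a few lines; the propensity scores and the caliper value are illustrative placeholders.

```python
# Greedy 1:1 nearest-neighbor matching on the propensity score with a
# caliper, without replacement. Scores and caliper are illustrative.

def nearest_neighbor_match(ps_treated, ps_control, caliper=0.1):
    """Return (treated_idx, control_idx) pairs whose score gap <= caliper."""
    available = set(range(len(ps_control)))
    pairs = []
    for i, pt in enumerate(ps_treated):
        if not available:
            break
        # Sorted iteration makes tie-breaking deterministic.
        j = min(sorted(available), key=lambda k: abs(ps_control[k] - pt))
        if abs(ps_control[j] - pt) <= caliper:
            pairs.append((i, j))
            available.remove(j)        # without replacement
    return pairs

ps_t = [0.61, 0.35, 0.80]
ps_c = [0.30, 0.58, 0.95, 0.40]

print(nearest_neighbor_match(ps_t, ps_c))
# -> [(0, 1), (1, 0)]; treated unit 2 has no control within the caliper
```

The unmatched treated unit illustrates the trade-off noted above: the caliper discards poor matches at the cost of sample size.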

Post-Matching Analysis

After completing the matching process, the primary goal in post-matching analysis is to estimate the causal effect of the treatment on the outcome. For propensity score matching, the average treatment effect on the treated (ATT) is commonly estimated using the simple difference in means between the outcomes of treated units and their matched controls. This approach leverages the balanced matched sample to approximate the counterfactual outcomes for the treated group, as originally proposed in the foundational work on propensity score matching. Similarly, for the average treatment effect (ATE) across the full population, the difference in means can be computed if the matching covers both treated and control units adequately, though ATT estimation is more robust in typical one-to-one or nearest-neighbor setups. Variance estimation must account for the paired or clustered structure introduced by matching, as well as the variability inherent in the matching process itself. In nearest-neighbor matching with replacement, the Abadie–Imbens variance estimator provides a consistent approach by incorporating weights that reflect how often control units are reused as matches, thus adjusting for the additional uncertainty from the matching step. The estimator accounts for the variance in treated outcomes, the weighted variance in matched control outcomes, and the variability due to the matching procedure itself. For paired matching without replacement, simpler methods like the paired t-test can be applied to the differences within pairs, providing straightforward inference for continuous outcomes. To further refine estimates and address any residual covariate imbalance, post-matching adjustments such as regression on the matched sample are often employed. This involves fitting an outcome regression model (e.g., linear for continuous outcomes) within the matched data, including key covariates to control for lingering differences and improve precision without reintroducing bias from the full sample.
Such double adjustment—combining matching with regression—has been shown to reduce variance while maintaining consistency under standard assumptions. Unmatched units, particularly those without suitable controls within a predefined caliper, are typically handled by trimming (excluding based on overlap) or outright discarding to preserve balance and avoid extrapolation. This ensures the analysis focuses on the common support region where identification is credible, though it may reduce sample size and effective power. As an illustrative example, consider a simulated dataset with binary treatment assignment based on observed covariates and a continuous outcome. After 1:1 nearest-neighbor matching on estimated propensity scores and trimming unmatched units, the ATT can be computed as the mean difference in outcomes within the matched pairs, with inference via a paired t-test, demonstrating the method's application in assessing policy impacts.
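A minimal sketch of this matched-pairs analysis, with hypothetical outcome values and the paired t statistic computed by hand in NumPy (a real analysis might use scipy.stats.ttest_rel):

```python
import numpy as np

# Hypothetical matched pairs after 1:1 nearest-neighbor matching:
# y_t are treated outcomes, y_c those of their matched controls.
y_t = np.array([10.2, 11.5, 9.8, 12.0, 10.9, 11.1])
y_c = np.array([9.0, 10.1, 9.5, 10.8, 10.0, 10.2])

d = y_t - y_c                        # within-pair differences
att = d.mean()                       # ATT estimate: mean paired difference

# Paired t statistic: mean difference over its standard error.
n = len(d)
se = d.std(ddof=1) / np.sqrt(n)
t_stat = att / se
```

Because the pairs share a propensity score, the differencing removes the matched-on covariate contribution, and inference reduces to a one-sample test on the differences.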

Evaluation and Diagnostics

Assessing Balance After Matching

Assessing balance after matching is essential to confirm that propensity score matching has effectively balanced the distribution of covariates between treated and control groups, thereby minimizing confounding bias in subsequent effect estimates. A key metric for individual covariates is the standardized mean difference (SMD), which measures the magnitude of imbalance in terms of standard deviations. For continuous covariates, the SMD is given by \text{SMD} = \frac{\bar{X}_T - \bar{X}_C}{\sqrt{\frac{\sigma_T^2 + \sigma_C^2}{2}}} where \bar{X}_T and \bar{X}_C denote the sample means in the treated and control groups, respectively, and \sigma_T^2 and \sigma_C^2 are the corresponding variances. An SMD value below 0.1 across covariates is widely regarded as indicating sufficient balance, while values exceeding this threshold suggest residual confounding. Variance ratios provide complementary insight into balance by comparing the spread of covariates between groups post-matching; ratios approaching 1 signify similar variability, with values between 0.5 and 2.0 typically considered acceptable to avoid distortions in variance-dependent analyses. Overall balance can be summarized using Rubin's B statistic, which captures the maximum absolute standardized difference in covariate means (ideally less than 0.5 standard deviations), and the R statistic, which assesses the ratio of within-group variances (ideally close to 1, avoiding extremes like 0.5 or 2). These measures help evaluate whether the matching design approximates randomization across the full set of covariates. To demonstrate the effectiveness of matching, balance is often compared pre- and post-matching through tables of SMDs for all covariates, highlighting reductions that bring most values below 0.1 and confirming improved similarity between groups. 
Should notable imbalance remain, iteration is recommended by refining the propensity score estimation—such as incorporating nonlinear terms, interactions, or machine learning approaches—or by implementing calipers during matching to exclude distant pairs and enhance quality. Graphical tests can supplement these quantitative assessments by providing visual confirmation of covariate overlap.
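The SMD formula above translates directly into code. A small sketch on simulated covariates; the variable names, distributions, and sample sizes are illustrative only:

```python
import numpy as np

def smd(x_t, x_c):
    """Standardized mean difference using the pooled-variance denominator."""
    pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2)
    return (x_t.mean() - x_c.mean()) / pooled_sd

rng = np.random.default_rng(0)
age_treated = rng.normal(55, 10, 500)   # hypothetical pre-matching imbalance
age_control = rng.normal(50, 10, 500)   # true SMD here is about 0.5

smd_before = smd(age_treated, age_control)
balanced = abs(smd_before) < 0.1        # the conventional 0.1 threshold
```

In a full diagnostic, this function would be applied to every covariate before and after matching, and the pairs of values tabulated or plotted to document the reduction in imbalance.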

Graphical Tests for Confounding

Graphical tests for confounding in propensity score matching provide visual diagnostics to evaluate whether the method has adequately balanced covariates between treatment and control groups, thereby reducing bias due to observed confounders. These plots complement quantitative balance measures, such as standardized mean differences (SMDs), by offering intuitive assessments of distributional similarities and overlaps. By inspecting these visualizations, researchers can identify residual imbalances or poor common support, which might indicate incomplete control for confounding. Love plots display the SMDs for multiple covariates before and after matching, plotted against a horizontal line at an acceptable threshold (often 0.1 in absolute value). Horizontal lines or points below the threshold post-matching indicate successful balance, while those exceeding it highlight problematic covariates. This graphical summary, popularized in observational studies, allows for rapid comparison across many variables and is particularly useful for detecting whether propensity score matching has diminished pretreatment differences. Propensity score histograms or density plots illustrate the distribution of estimated propensity scores e(X) separately for treated and control groups, overlaid to assess the degree of overlap known as common support. Adequate overlap is essential for valid matching, as regions with minimal density in one group relative to the other suggest extrapolation beyond observed data, potentially exacerbating confounding. These plots help identify units that should be trimmed or excluded to ensure inferences rely on comparable populations. Quantile-quantile (Q-Q) plots compare the empirical quantiles of covariate distributions between treated and control groups after matching, with points aligning along the 45-degree line indicating similar distributions and thus balance. 
Deviations from this line reveal differences in tails or central tendencies, signaling residual confounding for specific covariates. Q-Q plots are especially informative for continuous variables, providing a more nuanced view than summary statistics alone. Directed acyclic graphs (DAGs) visually represent causal relationships among variables, highlighting backdoor paths—noncausal paths connecting treatment and outcome through common causes—whose presence violates the backdoor criterion unless they are blocked. In propensity score contexts, DAGs guide covariate selection by identifying variables that block all backdoor paths without opening new biases, ensuring the propensity score conditions on a sufficient adjustment set. These graphs facilitate qualitative assessment of whether the modeling assumptions address the underlying confounding structure. Box plots of outcomes stratified by treatment status within propensity score quintiles or strata examine whether mean or median outcomes are parallel across levels of e(X), indicating no residual confounding or effect modification. Similar box positions and shapes within each stratum suggest balanced groups with consistent treatment effects; divergences may point to unadjusted confounders influencing outcomes heterogeneously. This approach verifies the stratified design's effectiveness in approximating randomization within homogeneous subgroups.
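The overlap that a density plot displays can also be checked numerically. A sketch using hypothetical Beta-distributed scores, computing the share of treated units inside the control group's score range plus a decile-level Q-Q comparison:

```python
import numpy as np

# Hypothetical estimated propensity scores: treated units skew toward 1,
# controls toward 0, as is typical when covariates predict treatment.
rng = np.random.default_rng(1)
ps_treated = rng.beta(4, 2, 300)
ps_control = rng.beta(2, 4, 300)

# Common-support check: fraction of treated units whose score lies within
# the observed range of control scores; units outside are candidates for
# trimming before matching.
lo, hi = ps_control.min(), ps_control.max()
on_support = ((ps_treated >= lo) & (ps_treated <= hi)).mean()

# Decile-level Q-Q comparison: large gaps between matched quantiles flag
# distributional differences that a single summary statistic can miss.
deciles = np.linspace(0.1, 0.9, 9)
qq_gap = np.abs(np.quantile(ps_treated, deciles)
                - np.quantile(ps_control, deciles))
```

Plotting `ps_treated` and `ps_control` as overlaid histograms, and the quantile pairs against the 45-degree line, reproduces the standard visual diagnostics described above.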

Sensitivity Analysis

Sensitivity analysis in propensity score matching evaluates the robustness of estimated treatment effects to potential violations of the unconfoundedness assumption due to unmeasured confounders. These techniques quantify how much hidden bias could alter conclusions, providing bounds on the treatment effect or significance levels under varying degrees of unmeasured confounding. Such analyses are essential because propensity score methods rely on all relevant confounders being observed, and unmeasured factors can introduce bias even after matching. One prominent approach is the Rosenbaum bounds, which assess sensitivity to hidden bias in matched observational studies. The parameter Γ represents the maximum odds ratio by which an unmeasured confounder can differentially affect the odds of treatment assignment between treated and control units with the same observed covariates. Specifically, Γ ≥ 1 bounds the ratio of the odds of treatment for any two units, such that if Γ = 1, there is no hidden bias. As Γ increases, the analysis shows how large the unmeasured confounding must be to overturn the findings. This method applies directly to propensity score matched samples by treating the matched sets as fixed and varying the assignment probabilities within the bias model. Under the Rosenbaum framework, upper and lower bounds on the treatment effect are derived by optimizing the estimator over all possible treatment assignments consistent with the bias parameter Γ. For instance, using the Hodges-Lehmann point estimate of the average treatment effect in matched pairs, the bounds are the minimum and maximum values attainable under the constrained randomization probabilities induced by Γ. These limits widen as Γ increases, indicating greater sensitivity to unmeasured confounding. 
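A hedged sketch of how such bounds are computed in the simplest case, the sign test on matched pairs: under sensitivity parameter Γ, the count of pairs favoring the treated unit is stochastically bounded by a Binomial(n, Γ/(1+Γ)) variable, which yields an upper-bound p-value. The pair counts below are hypothetical:

```python
from math import comb

def upper_p_value(n_pairs, n_positive, gamma):
    """Upper bound on the one-sided sign-test p-value under hidden bias.
    With sensitivity parameter gamma, the probability that the treated
    unit of a pair has the larger outcome is at most gamma/(1+gamma)."""
    p_plus = gamma / (1.0 + gamma)
    return sum(comb(n_pairs, k) * p_plus**k * (1.0 - p_plus)**(n_pairs - k)
               for k in range(n_positive, n_pairs + 1))

# Hypothetical study: 38 of 50 matched pairs favor the treated unit.
p_no_bias = upper_p_value(50, 38, 1.0)   # ordinary sign test (gamma = 1)
p_gamma2 = upper_p_value(50, 38, 2.0)    # allow a confounder with OR = 2
```

Increasing `gamma` until the upper-bound p-value crosses 0.05 reveals how strong an unmeasured confounder would have to be to overturn the finding, which is exactly the quantity a Rosenbaum sensitivity table reports.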
The p-value bounds for hypothesis tests (e.g., Wilcoxon signed-rank test) similarly range from a lower bound (most extreme evidence against the null) to an upper bound (least extreme), computed via exact or approximate methods like the hypergeometric distribution for binary outcomes. If the upper p-value bound exceeds a significance threshold (e.g., 0.05) at a modest Γ, the result is deemed sensitive. The VanderWeele-Vansteelandt method extends sensitivity analysis to scenarios with continuous unmeasured confounders, providing flexible bias formulas for general outcomes and treatments. This approach derives bounding factors for the observed association (e.g., risk ratio, odds ratio, or mean difference) by parameterizing the strength of the unmeasured confounder U with the treatment A and outcome Y, often through regression coefficients or correlation parameters. For a continuous U, the bias correction formula adjusts the log-linear model estimate as: \log(\hat{\beta}_{AY}) - B = \log(\beta_{AY}), where B is the bias term depending on the partial R-squared of U with A given observed confounders C (R²_{U|A,C}) and with Y given A and C (R²_{U|Y,A,C}), or equivalent parameters for non-log-linear cases. This allows computation of the confounder strength needed to reduce the effect to the null, accommodating continuous distributions without discretizing U. The method is particularly useful in propensity score contexts for non-binary outcomes where unmeasured factors may vary continuously, such as socioeconomic indices. Another widely used tool is the E-value, which quantifies the minimum strength of unmeasured confounding required to nullify an observed association, measured on the risk ratio scale. For an observed risk ratio RR > 1, the E-value is the smallest value of RR_{UA} (association between unmeasured confounder U and treatment A) such that, when combined with an equal or stronger RR_{UY|A} (association between U and Y given A), the apparent effect disappears. 
The formula is: E\text{-value} = RR + \sqrt{RR(RR - 1)}, with a companion E-value for the confidence interval lower bound to assess robustness to chance. For the inverse (RR < 1), it uses 1/RR in the formula. An E-value near 1 indicates high sensitivity, while values above 2 or 3 suggest the finding withstands plausible confounding levels, as typical measured confounders have associations below this threshold. This metric is straightforward for propensity score studies, applying to marginal or conditional effects post-matching. In practice, researchers commonly report Rosenbaum sensitivity at Γ = 2 to 3, representing odds ratios of unmeasured confounding that double or triple the treatment assignment probability—levels considered substantial yet plausible in many fields. If results remain significant (e.g., upper p-value bound < 0.05) at these values, the claims are strengthened; otherwise, caution is warranted. Integrating multiple methods, such as Rosenbaum bounds with E-values, provides a comprehensive robustness check without assuming specific confounder distributions.
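The E-value formula is simple enough to compute directly. A sketch, with the function name our own choice:

```python
import math

def e_value(rr):
    """Minimum confounder strength (on the risk-ratio scale) needed to
    explain away an observed risk ratio rr; for protective effects
    (rr < 1) the reciprocal is used, per the VanderWeele-Ding formula."""
    rr = 1.0 / rr if rr < 1.0 else rr
    return rr + math.sqrt(rr * (rr - 1.0))

ev = e_value(2.0)   # both the U-treatment and U-outcome associations
                    # would need to exceed this value to nullify RR = 2
```

Applying the same function to the confidence-interval limit closer to the null gives the companion E-value that accounts for sampling uncertainty.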

Advantages and Limitations

Key Advantages

Propensity score matching (PSM) offers significant dimension reduction by summarizing a large number of observed covariates into a single scalar propensity score, which is the conditional probability of treatment assignment given those covariates, thereby facilitating the balancing of high-dimensional variables without requiring explicit modeling of all interactions. This approach simplifies the process of creating comparable groups, as matching or stratification can be performed on this one-dimensional score rather than on the full covariate vector. A key strength of PSM lies in its transparency, enabling researchers to directly inspect matched pairs or strata and assess covariate balance through straightforward diagnostics, such as standardized mean differences, which provide clear evidence of whether confounding has been adequately addressed. This visibility contrasts with more opaque methods like multivariable regression, where balance is harder to verify post-adjustment. Furthermore, PSM enhances interpretability by emulating the structure of randomized controlled trials (RCTs), where matched groups approximate randomization on observed covariates, allowing causal effects to be estimated using simple comparisons of outcomes within these balanced samples. PSM is inherently flexible and non-parametric, relying primarily on correct specification of the propensity score model rather than the outcome model, which reduces sensitivity to misspecification of the relationship between covariates and the outcome. This separation of design (balancing via propensity scores) from analysis (outcome estimation) permits the use of various matching algorithms, such as nearest-neighbor or caliper matching, without assuming a particular functional form for the outcome. Empirical evidence from simulations supports these advantages, demonstrating that PSM often yields reduced bias and improved covariate balance compared to regression adjustment, particularly when the outcome model is misspecified.

Common Disadvantages and Biases

Propensity score matching (PSM) relies on the strong ignorability assumption, which posits that treatment assignment is independent of potential outcomes conditional on observed covariates, implying no unmeasured confounding. Violations of this assumption occur when relevant confounders are omitted from the propensity score model, leading to residual bias in causal estimates that cannot be addressed by matching alone. For instance, in observational studies where the model misses key variables influencing both treatment and outcome, PSM fails to emulate randomization effectively, potentially exaggerating or underestimating effects. Another critical limitation arises from violations of the positivity assumption, which requires sufficient overlap in the distributions of propensity scores between treated and untreated groups. When overlap is poor—such as in cases of extreme covariate values or rare treatment assignments—matching involves extrapolating beyond the common support region, introducing bias and unreliable estimates for units in the tails of the propensity score distribution. This issue is particularly pronounced in heterogeneous populations where certain subgroups have propensity scores near 0 or 1, resulting in sparse matches and heightened variance. Sensitivity analysis can help identify such violations, but it does not eliminate the underlying bias. PSM often incurs efficiency losses due to the discarding of unmatched units, which reduces the effective sample size and statistical power, especially in datasets with limited overlap or imbalanced treatment arms. This pruning can eliminate up to 50% or more of observations in some applications, leading to wider confidence intervals and reduced precision in effect estimates compared to methods that retain all data, such as inverse probability weighting. Additionally, the curse of dimensionality persists even after dimension reduction via propensity scores; in high-dimensional settings with many covariates, the tails of the score distribution become sparse, mimicking rare events and amplifying variance or bias from inexact matches.
Bias in PSM is further exacerbated by propensity score model misspecification, where incorrect functional forms or omitted interactions propagate errors into the matching process, often more severely than in direct covariate adjustment. Unlike doubly robust estimators, which remain consistent if either the propensity score or outcome model is correctly specified, PSM lacks this protection and can produce inconsistent estimates even with moderate misspecification. This vulnerability underscores the method's sensitivity to modeling assumptions, particularly in complex observational data where full specification is challenging. Recent debates, including a 2025 review, have intensified criticisms, arguing that PSM can paradoxically increase imbalance and bias through excessive pruning and model dependence, suggesting caution in its application.

Practical Applications and Extensions

Real-World Examples

In healthcare, propensity score matching has been widely applied to estimate the causal effects of interventions using observational data from electronic health records, particularly during the COVID-19 pandemic. For instance, a study analyzing 136,532 individuals used 1:1 propensity score matching based on demographic, clinical, and geographic features to compare outcomes between those receiving mRNA vaccines (BNT162b2 or mRNA-1273) and unvaccinated controls, adjusting for demographics, clinical history, and healthcare factors; this approach demonstrated vaccine effectiveness of approximately 87-89% against hospitalization starting 7 days after the second dose, with improved covariate balance post-matching compared to unadjusted analyses. Similarly, in critically ill patients from ICUs, propensity score matching on 8 covariates balanced treated (vaccinated) and control groups, revealing an adjusted estimate of 0.83 (95% CI 0.77–0.91) for ICU mortality associated with vaccination, corresponding to an approximately 17% reduction in mortality risk. In economics, propensity score matching has been instrumental in evaluating labor market interventions, notably through reanalyses of the Lalonde (1986) dataset from the National Supported Work Demonstration, which compared job training effects on employment using experimental and observational data. Dehejia and Wahba (2002) applied propensity score matching to select comparable non-experimental controls from survey comparison groups (the CPS and PSID), achieving better balance on covariates like age, education, and earnings than naive comparisons; their analysis estimated training effects on earnings of approximately $1,794 annually, closer to the experimental benchmark of $1,861, demonstrating PSM's ability to mitigate bias from unobserved confounders in non-randomized settings. This reanalysis underscored PSM's value in policy evaluation, where unadjusted observational estimates often overstated or understated true impacts by 50-100%.
In the social sciences, propensity score matching aids in assessing impacts on socioeconomic outcomes, such as the effects of public works or welfare programs on poverty. For example, a study on South Africa's public works programmes (PWPs) used propensity score matching to evaluate targeting among approximately 600 survey households across two programs, matching participants to non-participants on observables like household size and other demographic characteristics; post-matching balance improved, revealing that PWPs reached the poorest quintile more effectively than unadjusted samples suggested (e.g., 35-57% of participants in the bottom quintiles depending on program), though leakage to non-poor households persisted at around 40-65%. Such applications illustrate PSM's role in isolating program effects amid factors like regional disparities, yielding poverty-alleviation estimates that adjust for selection into programs. An illustrative step-by-step application of propensity score matching appears in observational studies of antihypertensive drug efficacy. First, estimate the propensity score e(X) via logistic regression on covariates (e.g., age, baseline blood pressure, comorbidities) to predict treatment probability for exposed (drug A) versus unexposed (drug B) patients. Second, match each exposed unit to the nearest unexposed unit within a caliper (e.g., 0.2 standard deviations of the logit of the propensity score), using greedy or optimal algorithms to form pairs. Third, compute the average treatment effect on the treated (ATT) as the mean difference in outcomes (e.g., systolic blood pressure reduction at 6 months). Finally, assess balance via Love plots or t-tests to confirm comparability. Overall, these examples demonstrate propensity score matching's capacity to produce credible causal estimates in observational research across fields.
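The four-step workflow above can be sketched end to end on simulated data. Everything here is illustrative: the propensity model coefficients are taken as known rather than fitted, and for brevity the caliper is applied to the raw score rather than its logit:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
age = rng.normal(50, 10, n)                         # single covariate

# Step 1: propensity scores from an assumed-known logistic model
# (a real analysis would estimate these, e.g., via logistic regression).
e = 1.0 / (1.0 + np.exp(-(-5.0 + 0.1 * age)))
treat = rng.binomial(1, e)
y = 2.0 * treat + 0.05 * age + rng.normal(0, 1, n)  # true effect is 2.0

# Step 2: greedy 1:1 nearest-neighbor matching within a caliper.
caliper = 0.2 * e.std()
controls = list(np.where(treat == 0)[0])
pairs = []
for i in np.where(treat == 1)[0]:
    if not controls:
        break
    j = min(controls, key=lambda c: abs(e[c] - e[i]))
    if abs(e[j] - e[i]) <= caliper:
        pairs.append((i, j))
        controls.remove(j)          # matching without replacement

# Step 3: ATT as the mean within-pair outcome difference.
att = float(np.mean([y[i] - y[j] for i, j in pairs]))

# Step 4: crude balance check on the covariate within matched pairs.
age_gap = float(np.mean([abs(age[i] - age[j]) for i, j in pairs]))
```

Matching on the score balances age between the paired units, so the within-pair differences recover an estimate near the true effect without modeling the outcome.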

Software Implementations

Propensity score matching (PSM) is implemented in various statistical software environments, enabling researchers to estimate propensity scores, perform matching, and assess balance across different platforms. In R, the MatchIt package provides a user-friendly interface for nonparametric preprocessing, including nearest-neighbor matching, optimal matching, and full matching based on estimated propensity scores, facilitating the creation of balanced samples. The twang package complements this by focusing on propensity score estimation using generalized boosted modeling (GBM), which iteratively improves score accuracy to minimize covariate imbalance, and includes diagnostic tools for balance assessment. For balance assessment, the cobalt package offers standardized functions to generate tables and plots comparing covariate distributions before and after matching, supporting integration with MatchIt and other methods to ensure consistent metrics like standardized mean differences. In Stata, the teffects psmatch command handles the full PSM workflow, from propensity score estimation via logit or probit models to nearest-neighbor matching, while accounting for the estimation uncertainty in standard errors for average treatment effect (ATE) and average treatment effect on the treated (ATET) estimates. This built-in command simplifies analysis by combining matching with outcome modeling, making it suitable for observational data in econometric applications. Python implementations include the causalinference library, which supports propensity score estimation and matching techniques such as nearest-neighbor matching with bias adjustment, allowing custom covariate adjustments for causal effect estimation. For more tailored approaches, libraries such as scikit-learn or statsmodels can be used to fit logistic models for propensity scores, followed by manual matching, though this requires additional coding. The psmpy package provides a streamlined interface for PSM, including score calculation, caliper-restricted matching, and balance checks, designed for ease of use in observational studies.
In SAS, PROC PSMATCH performs propensity score analysis, supporting both estimation (via logistic or generalized linear models) and matching methods like greedy or optimal pairing, with options for Mahalanobis distance integration to enhance covariate balance. It outputs matched datasets and diagnostics, aiding in the preparation of data for subsequent effect estimation. Best practices for PSM implementation emphasize reproducibility and transparency. Setting random seeds in stochastic algorithms, such as GBM in twang or matching in MatchIt, ensures consistent results across runs; for example, R's set.seed() function should be invoked prior to estimation. Reporting balance tables, often generated via cobalt in R or post-estimation commands in Stata and SAS, is essential to document pre- and post-matching covariate distributions, using metrics like standardized differences to verify balance below 0.1 as a common threshold. Inverse probability weighting (IPW) is an alternative approach to propensity score matching that estimates causal effects by reweighting the sample to balance covariates between treatment groups, using weights equal to the inverse of the estimated probability of the treatment actually received: 1/e(X) for treated units and 1/(1 - e(X)) for controls, where e(X) is the probability of treatment given covariates X. This method constructs a pseudo-population in which treatment assignment is independent of observed confounders, allowing for unbiased estimation of average treatment effects under the assumption of no unmeasured confounding. IPW is particularly advantageous in settings with time-varying treatments, such as marginal structural models, where it accounts for time-dependent confounding affected by prior treatment. When combined with outcome regression, IPW forms a doubly robust estimator that remains consistent if either the propensity score model or the outcome model is correctly specified, providing protection against model misspecification in one component.
Regression adjustment offers a parametric method for controlling confounding by including covariates directly in a regression model for the outcome, typically estimating the treatment effect as the coefficient on the treatment indicator after adjusting for X. This approach assumes a correctly specified functional form for the relationship between covariates, treatment, and outcome, making it sensitive to model misspecification, especially with high-dimensional or nonlinear confounders. Unlike matching-based methods, regression adjustment does not require explicit balancing of covariates but relies on the model's ability to capture conditional independence between treatment and outcome given X. It is computationally simple and can incorporate interactions or polynomials for flexibility, though it performs best when the outcome model is well-specified and sample sizes are large. Instrumental variables (IV) estimation addresses causal inference in the presence of unmeasured confounding or endogeneity, where treatment assignment is not ignorable given observed covariates, by exploiting an instrument—a variable that affects treatment but is independent of the outcome conditional on treatment and confounders. Under the assumptions of relevance (the instrument correlates with treatment), exclusion (the instrument affects the outcome only through treatment), and monotonicity (no defiers who take the opposite treatment based on the instrument), IV identifies the local average treatment effect (LATE) for compliers, those whose treatment status changes with the instrument. This method does not require the no unmeasured confounding assumption central to propensity score methods but instead trades off external validity for internal validity in settings with valid instruments, such as natural experiments. 
Difference-in-differences (DiD) is a quasi-experimental technique suited for longitudinal observational studies, estimating causal effects by comparing changes in outcomes over time between a treated group and an untreated control group, assuming parallel trends in the absence of treatment. The effect is identified as the post-treatment difference in outcomes between groups minus the pre-treatment difference, effectively differencing out time-invariant unobserved confounders. DiD relaxes the strict ignorability assumption of cross-sectional methods like propensity score matching by leveraging temporal variation, making it ideal for policy evaluations with staggered adoption or natural experiments, though it requires no anticipation of treatment and stable trends. Recent extensions handle heterogeneous effects and multiple periods, but the core method remains sensitive to violations of parallel trends. Extensions of propensity score matching address limitations in covariate balance or model flexibility. Entropy balancing directly constructs weights that exactly match specified moments of the covariate distribution between groups by minimizing the divergence of the weights from uniform base weights subject to balance constraints, avoiding the need to estimate propensity scores separately and reducing sensitivity to model misspecification. This method ensures exact balance on the specified moments without trimming, improving efficiency over traditional weighting, particularly with continuous covariates. Covariate balancing propensity scores (CBPS) jointly estimate the propensity score and weights by optimizing a criterion that minimizes covariate imbalance while fitting a logistic model for treatment probability, leading to better balance and lower bias than standard propensity scores in simulations and empirical applications. CBPS can be integrated with matching or weighting and extends to multilevel treatments.
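The two-by-two DiD identification described above is simple arithmetic on group-period means; a sketch with hypothetical values:

```python
# Hypothetical average outcomes before and after a policy change.
y_treat_pre, y_treat_post = 10.0, 14.0   # treated group means
y_ctrl_pre, y_ctrl_post = 9.0, 11.0      # control group means

# DiD: change in the treated group minus change in the control group;
# time-invariant group differences cancel out of the estimate.
did = (y_treat_post - y_treat_pre) - (y_ctrl_post - y_ctrl_pre)
```

Here the treated group improves by 4 and the control group by 2, so the estimated treatment effect is 2, even though the groups differ in levels throughout.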

References

  1. [1]
  2. [2]
  3. [3]
    Causal inference and effect estimation using observational data
    RCTs strive to achieve exchangeability by randomly assigning the exposure, while observational studies often rely on achieving conditional exchangeability (or ...Key Concepts And Frameworks · Defining Causal Effects · Identifying Causal Effects
  4. [4]
    Randomized Clinical Trials and Observational Studies
    Well-done RCTs are superior to OS because they eliminate selection bias. However, there are many lower quality RCTs that suffer from deficits in external ...
  5. [5]
    Statistics and Causal Inference - jstor
    The usefulness of either the scientific or the statistical solution to the Fundamental Problem of Causal Inference depends on the truth of different sets of ...
  6. [6]
    The central role of the propensity score in observational studies for ...
    The central role of the propensity score in observational studies for causal effects. PAUL R. ROSENBAUM, ... RUBIN. DONALD B. RUBIN. University of Chicago.
  7. [7]
    Propensity Score Matching: A Conceptual Review for Radiology ...
    For instance, in the example study, the statistical comparison between liver CT and liver MRI groups after propensity score matching with the exclusion of 2 ...
  8. [8]
    Control of Confounding and Reporting of Results in Causal ...
    Aug 20, 2018 · A confounder has long been defined as any third variable that is associated with the exposure of interest, is a cause of the outcome of interest ...
  9. [9]
    On the definition of a confounder - PMC - PubMed Central
    The causal inference literature has provided a clear formal definition of confounding expressed in terms of counterfactual independence.
  10. [10]
    Collider Bias | Research, Methods, Statistics - JAMA Network
    Mar 14, 2022 · Bias is often broadly categorized into 3 groups: confounding, information (or measurement) bias, and selection bias. Selection bias is a general ...
  11. [11]
    Collider Bias in Observational Studies - NIH
    The findings of observational studies can be distorted by a number of factors. So-called confounders are well known, but distortion by collider bias (CB) ...
  12. [12]
    Use of Directed Acyclic Graphs - NCBI - NIH
    Using DAG theory, confounding bias can be characterized as an unblocked “backdoor” path from the treatment to the outcome. The next section presents terminology ...Estimating Causal Effects · DAG Terminology · Using DAGs To Select...
  13. [13]
    How to control confounding effects by statistical analysis - PMC - NIH
    There are various ways to exclude or control confounding variables including Randomization, Restriction and Matching.
  14. [14]
    [PDF] Matching Methods for Causal Inference: A Review and a Look Forward
    This paper provides a structure for thinking about matching methods and guidance on their use, coalescing the existing research (both old and new) and providing.
  15. [15]
    Matching to Remove Bias in Observational Studies - jstor
    (See. Cochran [1969] for a discussion of covariance analysis in observational studies.) ... Rubin [1970] includes extensions to the case of many matching.
  16. [16]
    An Introduction to Propensity Score Methods for Reducing the ...
    Propensity score matching entails forming matched sets of treated and untreated subjects who share a similar value of the propensity score (Rosenbaum & Rubin, ...
  17. [17]
    [PDF] The central role of the propensity score in observational studies for ...
    The propensity score is the conditional probability of assignment to a particular treatment given a vector of observed covariates.
  18. [18]
    Alternatives to Randomized Trials for Estimating Treatment Effects
    May 2, 2020 · ... propensity score would look like this. Note that now the probability we are trying to predict is not the probability of the outcome, P(Y), it ...Confounding By Indication · Falsification Tests For... · Propensity Scores<|control11|><|separator|>
  19. [19]
    [PDF] Sufficient covariates and linear propensity analysis
    Rosenbaum, P. R. and Rubin, D. B. (1983). The cen- tral rôle of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55.
  20. [20]
    Using Propensity Scores for Causal Inference: Pitfalls and Tips - PMC
    This paper aims to offer an accessible overview of causal inference using the PS methods and address some common pitfalls and provide tips for applied users.
  21. [21]
    On Kernel Machine Learning for Propensity Score Estimation under ...
    In this paper, focusing on PS estimation, we propose a flexible non-parametric propensity score method. That is, we relax the linear and additive ...
  22. [22]
    Propensity Score Estimation for Causal Effects
    Propensity score estimation with boosted regression for evaluating causal effects in observational studies ... Daniel F. McCaffrey, Greg Ridgeway, Andrew R. Morral ...
  23. [23]
    Propensity score estimation: machine learning and classification ...
    We identified four techniques as alternatives to logistic regression: neural networks, support vector machines, decision trees (CART), and meta-classifiers.
  24. [24]
  25. [25]
    The Central Role of the Propensity Score in Observational Studies ...
    By Paul R. Rosenbaum (Departments of Statistics and Human Oncology, University of Wisconsin, Madison, Wisconsin, U.S.A.) and Donald B ...
  26. [26]
    Reducing Bias in Observational Studies Using Subclassification on ...
    Subclassification on an estimated propensity score is illustrated, using observational data on treatments for coronary artery disease.
  27. [27]
    Efficient Estimation of Average Treatment Effects Using the ... - jstor
    We show that weighting by the inverse of a nonparametric estimate of the propensity score, rather than the true propensity score, leads to an efficient estimate ...
  28. [28]
  29. [29]
    Balance diagnostics after propensity score matching - PMC
    The special article aims to outline the methods used for assessing balance in covariates after PSM. Standardized mean difference (SMD) is the most commonly used ...
  30. [30]
    Balance diagnostics for comparing the distribution of baseline ...
    In Section 3, we describe methods for comparing the mean of continuous variables or the prevalence of dichotomous baseline covariates between treated and ...
  31. [31]
    Using Propensity Scores to Help Design Observational Studies
    Abstract. Propensity score methodology can be used to help design observational studies in a way analogous to the way randomized experiments are designed: ...
  32. [32]
    Some Practical Guidance for the Implementation of Propensity Score ...
    ... conditional distribution of X given b(X) is independent of assignment into treatment. One possible balancing score is the propensity score, i.e. the ...
  33. [33]
    Directed acyclic graphs for clinical research: a tutorial - PMC
    It is important to note that satisfying the backdoor criterion in DAGs refers to no unmeasured confounding assumption or conditional exchangeability mentioned ...
  34. [34]
    Using propensity score methods to analyse individual patient-level ...
    We discuss three ways in which propensity score can be used to control for confounding in the estimation of average ... box-plots by treatment group and quintile ...
  35. [35]
    mhbounds - Sensitivity Analysis for Average Treatment Effects
    Instead, Rosenbaum bounds provide evidence on the degree to which any significance results hinge on this untestable assumption. Clearly, if the results turn out ...
  36. [36]
    Statistical primer: propensity score matching and its alternatives
    Although multivariable regression models adjust for confounders by modelling the relationship between covariates and outcome, the PS methods ...
  37. [37]
    Violations of the Positivity Assumption in the Causal Analysis of ...
    Subjects with propensity scores near 0 and 1 are more likely to violate the positivity assumption.
  38. [38]
    Propensity score methods in observational research: brief review ...
    This approach aims to make the distribution of confounding variables similar between treated and untreated groups while also closely mimicking the overall study ...
  39. [39]
    Why Propensity Scores Should Not Be Used for Matching - Gary King
    Fortunately, since other commonly used matching methods reduce imbalance, model dependence, and bias more effectively than PSM, and do not typically suffer from ...
  40. [40]
    Model misspecification and robustness in causal inference - PubMed
    In this paper, we compare the robustness properties of a matching estimator with a doubly robust estimator.
  41. [41]
    FDA-authorized mRNA COVID-19 vaccines are effective per real ...
    Using a combination of exact matching and 1-to-1 propensity score matching, we were able to match 68,266 of these vaccinated individuals (nBNT ...
  42. [42]
    Estimation of the effect of vaccination in critically ill COVID-19 ...
    We designed a study to estimate the effect of vaccination on ICU mortality in critically ill COVID-19 patients by using propensity score matching.
  43. [43]
    Propensity score matching methods for non-experimental causal ...
    The data we use, obtained from Lalonde (1986), are from the National Supported Work Demonstration, a labor market experiment in which participants were ...
  44. [44]
    Does matching overcome LaLonde's critique of nonexperimental ...
    Our analysis demonstrates that while propensity score matching is a potentially useful econometric tool, it does not represent a general solution to the ...
  45. [45]
    Using Propensity Score Matching Techniques to Assess the Poverty ...
    This paper explores the socio-economic identity of PWP participants in two programmes in South Africa, in order to establish the incidence of PWP participation ...
  46. [46]
    Can a propensity score matching method be applied to assessing ...
    In this study, we demonstrated the feasibility of using the propensity score matching (PSM) method to improve decision making by matching ...
  47. [47]
    MatchIt: Getting Started
    A matching analysis involves four primary steps: 1) planning, 2) matching, 3) assessing the quality of matches, and 4) estimating the treatment effect and its ...
  48. [48]
    A guide to the twang package - CRAN - R Project
    The twang package aims to (i) compute from the data estimates of the propensity scores which yield accurate causal effect estimates, (ii) check ...
  49. [49]
    Covariate Balance Tables and Plots: A Guide to the cobalt Package
    By using cobalt to assess balance across packages, users can be sure they are using a single, equivalent balance metric across methods, and the ...
  50. [50]
    psmpy - PyPI
    This package offers a user friendly propensity score matching protocol created for a Python environment.
  51. [51]
    Doubly Robust Estimation in Missing Data and Causal Inference ...
    Summary. The goal of this article is to construct doubly robust (DR) estimators in ignorable missing data and causal inference models.
  52. [52]
    Identification of Causal Effects Using Instrumental Variables
    We outline a framework for causal inference in settings where assignment to a binary treatment is ignorable, but compliance with the assignment is not perfect.
  53. [53]
    Using the Longitudinal Structure of Earnings to Estimate the Effect of ...
    Orley Ashenfelter and David Card, "Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs," NBER Working Paper ...
  54. [54]
    Entropy Balancing for Causal Effects: A Multivariate Reweighting ...
    This paper proposes entropy balancing, a data preprocessing method to achieve covariate balance in observational studies with binary ...