
Impact evaluation

Impact evaluation is a rigorous analytical approach in social science and policy research that seeks to identify the causal effects of interventions—such as programs, policies, or treatments—on specific outcomes by establishing counterfactual scenarios and attributing observed changes to the intervention itself, rather than to confounding factors. This distinguishes it from descriptive monitoring or correlational studies, as it prioritizes internal validity through techniques that isolate treatment effects from selection bias, confounding, and external influences. Central methods include randomized controlled trials (RCTs), which randomly assign participants to treatment and control groups to ensure comparability; quasi-experimental designs like difference-in-differences or regression discontinuity, which leverage natural variation or thresholds for identification; and instrumental variable approaches that exploit exogenous sources of variation to address non-compliance or hidden bias. These tools have enabled evidence-based decisions in fields like international development, education, and health, where evaluations have demonstrated, for instance, the ineffectiveness of certain cash transfer programs in altering long-term behaviors or the modest gains from deworming initiatives in improving school attendance. However, impact evaluation's defining achievements—such as informing the scaling of microfinance or conditional cash transfers—coexist with persistent challenges, including heterogeneous treatment effects across contexts that undermine generalizability and the difficulty of capturing mechanisms beyond average effects. Controversies arise from methodological limitations and systemic biases: RCTs, often hailed as the gold standard, can suffer from attrition, spillover effects, or ethical constraints in field settings, while non-experimental methods risk bias from unobserved confounders; moreover, publication and selection biases in academic and donor-funded studies favor reporting positive or significant results, inflating perceived intervention effectiveness and skewing policy toward "what works" narratives that overlook failures or null findings. Career incentives, including tenure pressures and funding from ideologically aligned institutions, exacerbate this optimism, leading to underreporting of negative impacts and overemphasis on short-term metrics over long-run causal chains. Despite these issues, rigorous impact evaluation remains essential for causal realism in resource-scarce environments, provided evaluations incorporate sensitivity analyses, pre-registration to curb p-hacking, and mixed-methods approaches to probe underlying processes.

Definition and Fundamentals

Core Concepts and Purpose

Impact evaluation entails the rigorous estimation of causal effects attributable to an intervention, program, or policy on targeted outcomes, achieved by comparing observed results against the counterfactual—what outcomes would have prevailed absent the intervention. This approach distinguishes impact evaluation from mere outcome monitoring by addressing the fundamental problem of causal inference: the counterfactual remains inherently unobservable, necessitating empirical strategies to approximate it, such as randomization or statistical matching to construct comparable control groups. Central concepts include the average treatment effect (ATE), which quantifies the mean difference in outcomes between treated and untreated units, and considerations of heterogeneity, where effects may vary across subgroups, contexts, or over time. The purpose of impact evaluation lies in generating credible evidence to ascertain whether interventions produce net benefits, the scale of those benefits, and the conditions under which they occur, thereby enabling data-driven decisions in resource-constrained environments. In development contexts, it supports the prioritization of effective programs to alleviate poverty and enhance welfare, as scarce public funds demand verification that expenditures yield measurable improvements rather than illusory gains from confounding factors. Beyond accountability, it informs program refinement, cost-effectiveness assessments, and replication, countering reliance on anecdotal or associational evidence that often overstates effectiveness due to omitted variables or selection effects. Evaluations thus promote causal realism, emphasizing mechanisms linking inputs to outputs while highlighting failures, such as null or adverse effects, to avoid perpetuating ineffective practices.
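The counterfactual logic can be made concrete with a small simulation. The Python sketch below uses entirely hypothetical values: it generates both potential outcomes for every unit (possible only in simulation), then shows how a naive comparison of self-selected participants to non-participants diverges from the true ATE, while a randomized comparison recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulated potential outcomes: Y0 without the program, Y1 with it.
ability = rng.normal(0, 1, n)                # unobserved confounder
y0 = 10 + 2 * ability + rng.normal(0, 1, n)  # outcome absent the program
y1 = y0 + 1.5                                # true individual effect = 1.5

true_ate = np.mean(y1 - y0)                  # observable only in simulation

# Self-selection: higher-ability units are more likely to enroll.
p_enroll = 1 / (1 + np.exp(-ability))
d_selected = rng.random(n) < p_enroll
naive_diff = y1[d_selected].mean() - y0[~d_selected].mean()

# Randomization: enrollment unrelated to ability, so groups are comparable.
d_random = rng.random(n) < 0.5
rct_diff = y1[d_random].mean() - y0[~d_random].mean()

print(f"True ATE:                  {true_ate:.2f}")
print(f"Naive comparison (biased): {naive_diff:.2f}")
print(f"Randomized comparison:     {rct_diff:.2f}")
```

The naive contrast attributes part of the pre-existing ability gap to the program, which is exactly the selection problem that counterfactual designs are meant to solve.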

Historical Origins and Evolution

The systematic assessment of program impacts, particularly through experimental and quasi-experimental methods, originated in early quantitative evaluation practices but gained methodological rigor in the mid-20th century. Initial roots lie in 18th- and 19th-century education reforms, including William Farish's 1792 introduction of numerical marks for academic performance at Cambridge University and Horace Mann's 1845 standardized tests in Boston schools to gauge educational effectiveness. These efforts focused on measurement for accountability rather than causal attribution. By the early 20th century, Frederick W. Taylor's scientific management principles (circa 1911) emphasized efficiency metrics, evolving into objective testing movements that laid groundwork for outcome-oriented scrutiny, though without robust controls for confounding factors. The modern era of impact evaluation emerged in the 1950s-1960s, driven by post-World War II expansions in education and social welfare programs, including the U.S. National Defense Education Act (1958) and Elementary and Secondary Education Act (1965), which mandated evaluations amid concerns over program efficacy. The Sputnik launch in 1957 heightened demands for educational accountability, while War on Poverty initiatives spurred social experiments to test interventions like income support. Donald T. Campbell and Julian C. Stanley's 1963 monograph Experimental and Quasi-Experimental Designs for Research formalized designs to mitigate validity threats—such as history and maturation—in non-laboratory settings, enabling causal claims from observational data approximations like pre-post comparisons and nonequivalent control groups. This framework professionalized evaluation, distinguishing true experiments from quasi-experiments and influencing fields beyond education. Pioneering randomized controlled trials (RCTs) in social policy followed, with the U.S. negative income tax experiments (1968-1982) randomizing households to assess guaranteed income effects on labor supply, and the RAND Health Insurance Experiment (1971-1982) evaluating cost-sharing's impact on healthcare utilization, informing 1980s policy shifts toward deductibles. In development economics, Mexico's PROGRESA program (1997) employed RCTs to measure effects on school enrollment and health, catalyzing scalable evaluations across Latin America and beyond. The 2000s marked explosive evolution, termed the "evidence revolution," with institutions like the Poverty Action Lab (J-PAL, founded 2003) and the International Initiative for Impact Evaluation (3ie, 2008) institutionalizing RCTs and quasi-experimental methods for poverty alleviation. The U.S. Government Performance and Results Act (1993) and UK Modernizing Government initiative (1999) embedded outcome-focused evaluation in public administration. Advances integrated econometric tools, such as instrumental variables and regression discontinuity designs, to handle selection and endogeneity in large-scale data. This period's emphasis on rigorous experimentation peaked with the 2019 Nobel Memorial Prize in Economics awarded to Abhijit Banerjee, Esther Duflo, and Michael Kremer for RCTs demonstrating interventions' micro-level effects on development outcomes. Subsequent growth includes evidence synthesis via systematic reviews and government-embedded labs, though debates persist over generalizability from small-scale trials to policy scale.

Methodological Designs

Experimental Designs

Experimental designs in impact evaluation primarily utilize randomized controlled trials (RCTs), in which eligible units such as individuals, households, or communities are randomly assigned to treatment (receiving the intervention) or control (not receiving it) groups to isolate causal effects from confounding factors. This randomization, typically executed through computer algorithms or lotteries, ensures that groups are statistically equivalent on average, both in observed covariates and unobserved characteristics, allowing outcome differences to be credibly attributed to the intervention. RCTs thus provide unbiased estimates of the average treatment effect (ATE), addressing the fundamental challenge of counterfactual reasoning—what would have happened without the intervention—by using the control group as a proxy for the counterfactual. Key steps in RCT design include defining the eligible population, conducting power calculations to determine required sample size based on expected effect sizes and variability (often aiming for 80% power to detect minimum detectable effects), and verifying post-randomization balance through statistical tests on baseline covariates; a minimal power calculation is sketched below. Outcomes are measured via surveys, administrative records, or other instruments at baseline and endline, with analysis focusing on intent-to-treat (ITT) effects—comparing groups as randomized—to maintain the integrity of randomization, or treatment-on-the-treated (TOT) effects using random assignment as an instrument to handle compliance issues. Regression models may adjust for covariates to increase precision, though unadjusted differences suffice for primary inference under randomization. Variations adapt RCTs to contextual constraints. Individual-level randomization assigns treatment independently to each unit, maximizing statistical power but risking spillovers in interconnected settings. Cluster-randomized trials, conversely, assign intact groups (e.g., villages or schools) to treatment or control, mitigating spillovers while requiring larger samples and intra-cluster correlation adjustments; for example, Mexico's PROGRESA randomized 506 communities to evaluate conditional cash transfers, demonstrating sustained impacts on enrollment. Factorial designs test multiple interventions simultaneously by crossing treatment arms (e.g., combining cash transfers with training), enabling assessment of interactions and main effects within one trial, as in variations of Indonesia's Raskin subsidized rice program tested across 17.5 million beneficiaries in 2012. Stratified or blocked randomization ensures balance across subgroups such as gender or region, enhancing precision without altering causal identification. Staggered or phase-in designs roll out interventions sequentially, using groups scheduled for later phases as controls for earlier ones in scalable programs. These designs prioritize internal validity but demand safeguards against threats like spillovers (intervention diffusion to controls) or crossovers (controls accessing the intervention), which can be addressed through geographic separation between groups. Ethical randomization requires genuine uncertainty about intervention efficacy and minimal harm from withholding the intervention from controls, often justified by phasing the program in for all groups after the evaluation. Evidence from RCTs, such as a 43% reduction in violent-crime arrests from Chicago's One Summer Plus jobs program, underscores their capacity for policy-relevant causal insights when properly executed.
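The following Python sketch illustrates the standard two-arm power calculation referenced above; the effect size, standard deviation, cluster size, and intra-cluster correlation are hypothetical placeholders rather than values from any cited study.

```python
from scipy.stats import norm

def sample_size_per_arm(effect_size, sd, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided test of a difference in means.

    effect_size : minimum detectable difference between treatment and control
    sd          : outcome standard deviation (assumed equal across arms)
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) * sd / effect_size) ** 2

# e.g. detect a 0.2 SD effect with 80% power at the 5% significance level
n = sample_size_per_arm(effect_size=0.2, sd=1.0)
print(f"Required sample size per arm: {n:.0f}")   # roughly 393

# Cluster randomization inflates this by the design effect 1 + (m - 1) * icc,
# where m is the average cluster size and icc the intra-cluster correlation.
design_effect = 1 + (25 - 1) * 0.05
print(f"Per arm with clustering: {n * design_effect:.0f}")
```

The design-effect adjustment in the last lines is why cluster-randomized trials require substantially larger samples than individual-level randomization for the same minimum detectable effect.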

Quasi-Experimental and Observational Designs

Quasi-experimental designs estimate causal impacts of interventions without randomization, relying instead on structured comparisons or natural variations to approximate experimental conditions. These approaches, first systematically outlined by Donald T. Campbell and Julian C. Stanley in their 1963 chapter, address threats to internal validity through designs like time-series analyses or nonequivalent groups, enabling inference in real-world settings where randomization is infeasible, such as policy implementations or large-scale programs. Unlike true experiments, they demand explicit assumptions—such as the absence of contemporaneous events affecting groups differentially—to isolate treatment effects, with validity often assessed via placebo tests or falsification strategies. A core quasi-experimental method is difference-in-differences (DiD), which identifies impacts by subtracting pre-treatment outcome differences from post-treatment differences between treated and control groups, under the parallel trends assumption that untreated trends would mirror counterfactuals. Applied in evaluations like the 1996 U.S. welfare reform, DiD has shown, for instance, that job training programs increased earnings by 10-20% in some cohorts when controlling for economic cycles. Extensions, such as triple differences, incorporate additional dimensions like geography to mitigate violations from heterogeneous trends, though recent critiques highlight sensitivity to staggered adoption in multi-period settings. Regression discontinuity designs (RDD) exploit deterministic assignment rules, estimating local average treatment effects from outcome discontinuities at a cutoff, where units near the threshold are quasi-randomized by the forcing variable. In a 2013 evaluation of Colombia's Ser Pilo Paga scholarship program, RDD revealed a 0.17 standard deviation increase in enrollment for scorers just above the eligibility line, with bandwidth selection via optimal methods ensuring precise local inference. Sharp RDD assumes perfect compliance at the cutoff, while fuzzy variants handle partial take-up by using the discontinuity as an instrument within the same framework; both require checks for manipulation of the running variable, such as density tests showing no bunching. Instrumental variables (IV) address endogeneity by using an exogenous instrument correlated with treatment uptake but unrelated to outcomes except through the treatment, yielding estimates for compliers under monotonicity. In Angrist and Krueger's 1991 analysis of U.S. compulsory schooling, quarter-of-birth instruments—leveraging school entry age laws—estimated a 7-10% return to an additional year of schooling, isolating causal effects amid self-selection. Instrument validity hinges on relevance (strong first-stage correlation) and exclusion (no direct outcome path), tested via overidentification restrictions in multiple-IV setups; weak instruments bias estimates toward OLS, as quantified in the Stock-Yogo critical values from 2005. Observational designs draw causal inferences from non-manipulated data, emphasizing conditioning on observables or structural assumptions to mitigate confounding, often via balancing methods like propensity score matching (PSM), which estimates treatment probabilities from covariates to pair similar units. A 2023 review found PSM effective in observational program evaluations, reducing bias by up to 80% when overlap is sufficient, though it fails with unobservables, as evidenced by simulation studies showing 20-50% attenuation under hidden confounders. Advanced observational techniques include panel fixed effects, which difference out time-invariant confounders in longitudinal data, and synthetic controls, constructing counterfactuals as weighted combinations of untreated units that match pre-treatment trajectories.
In Abadie et al.'s 2010 California tobacco control evaluation, synthetic controls attributed a 20-30% drop in per-capita cigarette sales to the policy, outperforming simple DiD under heterogeneous trends. These methods demand large samples and covariate balance diagnostics, with triangulation—combining, say, PSM and difference-in-differences—enhancing robustness, as recommended in 2021 guidelines for non-randomized studies. Despite strengths in scalability, observational designs remain vulnerable to model misspecification, necessitating pre-registration and falsification tests to approximate causal credibility.
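As a concrete illustration of the DiD logic described above, the following Python sketch simulates a two-period panel with a known treatment effect and recovers it from the treated-by-post interaction, clustering standard errors by unit as Bertrand, Duflo, and Mullainathan recommend. All parameter values are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_units, true_effect = 200, 2.0

# Two-period panel: half the units are treated, and treatment turns on post.
df = pd.DataFrame(
    [(u, t) for u in range(n_units) for t in (0, 1)],
    columns=["unit", "post"],
)
df["treated"] = (df["unit"] < n_units // 2).astype(int)
unit_fe = rng.normal(0, 3, n_units)              # time-invariant differences
df["y"] = (
    5 + unit_fe[df["unit"]]                      # unit heterogeneity
    + 1.0 * df["post"]                           # common time trend
    + true_effect * df["treated"] * df["post"]   # treatment effect
    + rng.normal(0, 1, len(df))
)

# DiD via the treated x post interaction, with unit-clustered standard errors.
model = smf.ols("y ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]}
)
print(model.params["treated:post"])  # should be close to 2.0
```

Because the interaction coefficient differences out both the unit fixed effects and the common time trend, it isolates the treatment effect under the parallel trends assumption.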

Sources of Bias and Validity Threats

Selection and Attrition Biases

Selection bias occurs when systematic differences between treatment and comparison groups arise due to non-random assignment or participation, leading to distorted estimates of causal effects in impact evaluations. In observational or quasi-experimental designs, individuals self-selecting into programs often possess unobserved characteristics—such as motivation or ability—that correlate with outcomes, inflating or deflating apparent program impacts; for instance, the bias remaining after matching techniques can exceed 100% of the experimentally estimated effect in social program evaluations. This threat undermines internal validity by violating the assumption of exchangeability between groups, making it challenging to attribute outcome differences solely to the intervention rather than pre-existing disparities. Even in randomized controlled trials (RCTs), selection bias can emerge if eligibility criteria or recruitment processes favor certain subgroups, though proper randomization typically mitigates it at baseline. Attrition bias, a post-randomization form of selection bias, arises when participants exit studies at differential rates between arms, particularly if dropouts are correlated with outcomes or treatment status, thereby altering group compositions and biasing effect estimates. In RCTs for social programs, attrition rates exceeding 20% often introduce systematic imbalances, with leavers in treatment groups potentially having worse outcomes than stayers, leading to overestimation of positive effects if not addressed. This bias threatens the completeness of intention-to-treat analyses and can amplify in longitudinal evaluations where follow-up surveys fail to retain high-risk participants, as seen in teen pregnancy prevention trials where cluster-level attrition exacerbates imbalances. Unlike baseline selection, attrition introduces time-varying selection, as dropout reasons—like program dissatisfaction or external shocks—may interact with treatment exposure. Both biases compromise internal validity by eroding the comparability of groups essential for counterfactual estimation; selection operates pre-treatment, while attrition does so post-treatment, but they converge in non-random loss of observations that correlates with potential outcomes. In development impact evaluations, empirical assessments show that unadjusted attrition can shift effect sizes by 10-30% in magnitude, with bounding approaches or sensitivity analyses revealing the direction of potential distortion. Mitigation strategies include using baseline covariates for reweighting, worst-case scenario bounds, or pattern-mixture models, though these require assumptions about missingness mechanisms that may not hold without auxiliary data. High-quality evaluations report attrition rates and test for baseline differences among dropouts to quantify threats, emphasizing that low attrition alone does not guarantee unbiasedness if dropout patterns are non-ignorable.
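The two diagnostics mentioned above—comparing attrition rates across arms and checking baseline balance among those who remain—can be run in a few lines of code. The Python sketch below uses simulated data with hypothetical dropout probabilities purely for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency, ttest_ind

rng = np.random.default_rng(2)
n = 2000
treat = rng.integers(0, 2, n)
baseline_risk = rng.normal(0, 1, n)

# Hypothetical follow-up: dropout is more likely for high-risk controls,
# so the groups that remain are no longer comparable.
p_drop = 0.10 + 0.10 * (treat == 0) * (baseline_risk > 0.5)
dropped = rng.random(n) < p_drop

# 1) Differential attrition rates across arms
table = np.array([
    [np.sum((treat == 1) & ~dropped), np.sum((treat == 1) & dropped)],
    [np.sum((treat == 0) & ~dropped), np.sum((treat == 0) & dropped)],
])
chi2, p_rate, *_ = chi2_contingency(table)
print(f"Attrition: treat {dropped[treat == 1].mean():.1%}, "
      f"control {dropped[treat == 0].mean():.1%} (p = {p_rate:.3f})")

# 2) Baseline balance among stayers (should hold under ignorable attrition)
stay = ~dropped
t_stat, p_bal = ttest_ind(baseline_risk[stay & (treat == 1)],
                          baseline_risk[stay & (treat == 0)])
print(f"Baseline difference among stayers: p = {p_bal:.3f}")
```

Reporting both tests alongside effect estimates lets readers judge whether attrition is plausibly ignorable or whether bounding approaches are needed.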

Temporal and Contextual Biases

Temporal biases in impact evaluation refer to systematic errors introduced by time-related factors that confound causal attribution, often threatening internal validity by providing alternative explanations for observed changes in outcomes. History effects occur when external events, unrelated to the intervention, coincide with its implementation and influence results; for instance, a concurrent labor-market upturn might inflate estimates of a job training program's employment effects. Maturation effects arise from natural developmental or aging processes in participants, such as children's cognitive growth over the study period, which could be mistakenly attributed to an educational intervention. These biases are particularly pronounced in longitudinal or quasi-experimental designs lacking randomization, where pre-intervention trends or secular drifts—broader societal shifts like technological adoption—may parallel the intervention timeline and bias estimates upward or downward. Regression to the mean exacerbates temporal issues when extreme baseline values naturally moderate over time, as seen in evaluations of interventions targeting high-risk groups, such as remedial programs where initial severity scores revert without any intervention influence. To mitigate these threats, evaluators often employ difference-in-differences methods to test parallel trends or include time-fixed effects in models. Contextual biases stem from the specific setting or population of the intervention, which can modify effects or introduce local confounders, thereby limiting generalizability and introducing effect heterogeneity. Interaction effects with settings manifest when outcomes vary due to unmeasured site-specific factors, such as cultural norms or institutional support; for example, a program's success in rural areas may not replicate in urban contexts due to differing market dynamics. Spillover effects, where benefits leak to controls within the same community, contaminate comparisons, as documented in cluster-randomized trials where community-level diffusion biases estimates toward the null, producing underestimation. Hawthorne effects represent a reactive contextual bias, wherein participants alter behavior due to awareness of evaluation, inflating impacts in monitored settings like workplace productivity studies. Site selection bias further compounds these issues when programs are evaluated in non-representative locations correlated with higher efficacy, such as highly motivated communities, leading to overoptimistic extrapolations. Addressing these requires explicit testing for moderators via subgroup analyses or heterogeneous treatment effect estimators, alongside transparent reporting of contextual descriptors to aid generalizability assessments.
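Regression to the mean is easy to demonstrate in simulation. The Python sketch below, using hypothetical severity scores, enrolls the most extreme baseline cases and shows their endline mean falling back toward the population average with no intervention at all, which is exactly the pattern a naive pre-post comparison would misread as a program effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Stable underlying severity plus independent measurement noise at each wave.
true_severity = rng.normal(50, 10, n)
baseline = true_severity + rng.normal(0, 10, n)
endline = true_severity + rng.normal(0, 10, n)   # no intervention at all

# "Enroll" the most extreme cases at baseline, as targeted programs often do.
high_risk = baseline > np.percentile(baseline, 90)

print(f"High-risk baseline mean: {baseline[high_risk].mean():.1f}")
print(f"High-risk endline mean:  {endline[high_risk].mean():.1f}")
# The endline mean falls toward the population mean (~50) without treatment,
# so a pre-post comparison would wrongly credit the program with the decline.
```

A control group selected under the same extreme-score rule is the standard remedy, since it reverts by the same amount.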

Estimation and Analytical Techniques

Causal Inference Methods

Causal inference methods in impact evaluation seek to identify and quantify the effects of interventions by estimating counterfactual outcomes, typically under the potential outcomes framework. This framework posits that for each unit i there exist two potential outcomes, Y_i(1) under treatment and Y_i(0) under control, with the individual treatment effect defined as Y_i(1) - Y_i(0). The average treatment effect (ATE) averages this difference across units, but the fundamental challenge arises because only one outcome is observed per unit, necessitating assumptions to link observables to the unobserved counterfactual. Originating from Neyman's work on randomized experiments (1923) and extended by Donald Rubin (1974) to broader settings, the framework underpins modern quasi-experimental estimation by emphasizing identification via ignorability or exclusion restrictions. These methods are particularly vital in observational data from impact evaluations, where randomization is absent, requiring strategies to mimic experimental conditions through covariates, instruments, or discontinuities. Common approaches include propensity score matching, instrumental variables, regression discontinuity, and difference-in-differences, each relying on distinct identifying assumptions to bound or point-identify causal effects. While powerful, their validity hinges on untestable assumptions, such as no unmeasured confounders or parallel trends, which empirical checks like placebo tests or sensitivity analyses can probe but not fully verify. Propensity score matching (PSM) balances treated and control groups by matching on the propensity score, defined as the probability of treatment given observed covariates X, e(X) = P(D=1|X). Under selection on observables (unconfoundedness: Y(1), Y(0) \perp D | X), matching yields unbiased estimates of the ATE for the treated or overall. Introduced by Rosenbaum and Rubin (1983), PSM reduces dimensionality from multiple covariates to one score, often implemented via nearest-neighbor or kernel matching, with caliper restrictions to ensure close matches. In impact evaluations of social programs, such as job training initiatives, PSM has estimated effects like a 10-20% earnings increase from participation, though it fails if unobservables like motivation confound assignment. Sensitivity to model misspecification and common support violations necessitates balance diagnostics, where covariate means post-matching should align across groups. Instrumental variables (IV) address endogeneity from unobservables by leveraging an instrument Z correlated with D (relevance: \text{Cov}(Z,D) \neq 0) but affecting outcomes Y only through D (exclusion: no direct path from Z to Y). The two-stage least squares (2SLS) estimator recovers the local average treatment effect (LATE) for compliers—those whose treatment status changes with Z—under monotonicity (no defiers). Angrist, Imbens, and Rubin (1996) formalized LATE as the relevant parameter when heterogeneity exists, applied in evaluations like quarter-of-birth instruments for schooling, yielding estimated returns of 7-10% per year of education versus 5-8% from OLS. Weak instruments bias estimates toward OLS (a first-stage F-statistic above 10 is recommended), and exclusion violations, such as spillover effects, undermine credibility; overidentification tests (Sargan-Hansen) assess multiple instruments. Regression discontinuity design (RDD) exploits sharp or fuzzy discontinuities at a known cutoff in the assignment rule, treating units just above and below as locally randomized. In sharp RDD, the treatment effect is the jump in the conditional expectation of Y at the cutoff, estimated via local polynomials or parametric regressions with bandwidth selection (e.g., Imbens-Kalyanaraman optimal).
Imbens and Lemieux (2008) outline implementation, including density tests for manipulation and placebo outcomes for bandwidth sensitivity. For policy cutoffs like scholarships at exam score thresholds, RDD has quantified effects such as a 0.2-0.5 standard deviation improvement in future earnings, with identification strongest near the cutoff but limited to that margin. Fuzzy RDD extends to imperfect compliance using IV logic, where the first-stage discontinuity instruments the treatment probability. Difference-in-differences (DiD) estimates effects by differencing changes in outcomes over time between treated and control groups, identifying the ATE under parallel trends: absent treatment, gaps would evolve similarly. The estimator is (E[Y_{\text{treated,post}}] - E[Y_{\text{treated,pre}}]) - (E[Y_{\text{control,post}}] - E[Y_{\text{control,pre}}]), the post-pre change for the treated group minus the post-pre change for the control group. Bertrand, Duflo, and Mullainathan (2004) highlight serial correlation inflating standard errors in multi-period panels, recommending clustered errors or collapsing the data to two periods for robustness. In evaluations of minimum wage hikes, DiD has shown null or small employment effects (e.g., -0.1% per 10% wage increase), with event-study plots of pre-trends used to validate the assumption. Extensions like triple differences add a third comparison dimension to control for group-specific fixed differences, but violations from differential shocks (e.g., Ashenfelter dips) require synthetic controls or staggered adoption adjustments. Other techniques, such as synthetic control for aggregate interventions, construct counterfactuals as weighted combinations of untreated units matching pre-treatment trends, effective for policy reforms affecting single units. Across methods, robustness checks, including placebo applications and falsification tests on pre-treatment outcomes, are essential, as are meta-analyses revealing that quasi-experimental estimates often align with RCTs when assumptions hold, though divergence signals assumption violations. Integration with machine learning for covariate adjustment or double robustness (combining outcome and propensity models) enhances precision but demands large samples to avoid overfitting.
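To make the IV logic tangible, the following Python sketch simulates a confounded treatment and computes two-stage least squares by hand: the first stage predicts treatment from the instrument, and the second stage regresses the outcome on those fitted values. The data-generating values are hypothetical, and the sketch reports point estimates only (proper 2SLS standard errors require the usual correction rather than plain second-stage OLS errors).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Unobserved confounder u drives both take-up and the outcome.
u = rng.normal(0, 1, n)
z = rng.integers(0, 2, n)                    # exogenous instrument, e.g. a random offer
d = (0.5 * z + 0.8 * u + rng.normal(0, 1, n) > 0.5).astype(float)
y = 1.0 * d + 2.0 * u + rng.normal(0, 1, n)  # true effect of D on Y is 1.0

# Naive OLS of Y on D is biased by the confounder.
X_ols = np.column_stack([np.ones(n), d])
beta_ols = np.linalg.lstsq(X_ols, y, rcond=None)[0]

# Stage 1: predict D from Z; Stage 2: regress Y on the fitted values.
Z_mat = np.column_stack([np.ones(n), z])
d_hat = Z_mat @ np.linalg.lstsq(Z_mat, d, rcond=None)[0]
X_iv = np.column_stack([np.ones(n), d_hat])
beta_2sls = np.linalg.lstsq(X_iv, y, rcond=None)[0]

print(f"OLS estimate (confounded): {beta_ols[1]:.2f}")
print(f"2SLS estimate:             {beta_2sls[1]:.2f}")   # near 1.0
```

The contrast between the two printed estimates shows how a valid instrument strips out the variation in take-up that is driven by the confounder.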

Economic Evaluation Integration

Economic evaluation integration in impact evaluation extends causal effect estimation by incorporating cost data to assess value for money, enabling comparisons of interventions' value relative to alternatives. This approach quantifies whether observed impacts justify expended resources, often through metrics like incremental cost-effectiveness ratios (ICERs) or benefit-cost ratios (BCRs). For instance, in development programs, impact evaluations using randomized controlled trials (RCTs) may pair treatment effect estimates on outcomes such as school enrollment with program delivery costs to compute costs per additional enrollee. Such integration supports decision-making on scaling interventions, as seen in analyses by organizations like the International Initiative for Impact Evaluation (3ie), which emphasize collecting prospective cost data alongside experimental designs to avoid retrospective biases. Cost-effectiveness analysis (CEA), a primary method, measures the cost per unit of outcome achieved, such as dollars per life-year saved or per child educated, without requiring full monetization of benefits. In RCT-based impact evaluations, CEA typically applies the intervention's average cost per beneficiary to the estimated treatment effect, yielding ratios like $X per Y% increase in productivity. A 2024 3ie guidance document outlines standardized steps for CEA in impact evaluations, including delineating direct and indirect costs (e.g., staff time, materials, overhead) and sensitivity analyses for uncertainty in effect sizes or cost estimates. Challenges include attributing shared costs in multi-component interventions and using shadow prices for non-traded inputs in low-income settings, where market prices may distort true opportunity costs. Cost-benefit analysis (CBA) goes further by monetizing all outcomes, comparing discounted streams of benefits against costs to derive net present values or internal rates of return. Applied to impact evaluations, CBA requires valuing non-market effects, such as health improvements via willingness-to-pay proxies or human capital models projecting lifetime earnings gains from interventions. One analysis found that fewer than 20% of impact evaluations incorporate CBA, often due to data demands and methodological debates over valuation assumptions, yet those that do reveal high returns, like BCRs exceeding 5:1 for deworming programs in Kenya based on long-term income effects. Integration with quasi-experimental designs demands adjustments for selection biases in cost attribution, using techniques like matching to estimate counterfactual costs. Despite these advantages, integration faces institutional barriers, including underinvestment in cost data collection during trials, where the focus prioritizes causal identification of impacts over economic metrics. Guidelines from evaluation bodies advocate embedding economic components from study inception, with prospective costing protocols to capture fixed and variable expenses accurately. Empirical evidence from integrated evaluations underscores their policy relevance, as they have informed reallocations, such as prioritizing cash transfers over less cost-effective subsidies when BCRs differ by factors of 2-10. Ongoing refinements address generalizability, incorporating transferability adjustments for context-specific costs and effects across settings.
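The core metrics are simple arithmetic once effects and costs are in hand. The Python sketch below computes an ICER, an NPV, and a BCR from entirely hypothetical costs, effects, and a 5% discount rate.

```python
def icer(cost_treat, cost_control, effect_treat, effect_control):
    """Incremental cost-effectiveness ratio: extra cost per extra unit of outcome."""
    return (cost_treat - cost_control) / (effect_treat - effect_control)

def npv(flows, rate):
    """Net present value of a stream of annual flows, discounted from year 0."""
    return sum(f / (1 + rate) ** t for t, f in enumerate(flows))

# Hypothetical tutoring program: $60 per child versus a $10 status quo,
# raising test scores by 0.25 SD versus 0.05 SD.
print(f"ICER: ${icer(60, 10, 0.25, 0.05):.0f} per additional SD of learning")

# Hypothetical CBA: $100 cost today, $30 of monetized benefits per year for
# five years, discounted at 5%; BCR is discounted benefits over costs.
benefits = npv([0, 30, 30, 30, 30, 30], rate=0.05)
costs = 100
print(f"NPV: {benefits - costs:.1f}, BCR: {benefits / costs:.2f}")
```

In practice the hard part is not this arithmetic but obtaining credible cost data and defensible monetized values for the effects, which is why prospective costing protocols matter.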

Debates and Methodological Controversies

RCT Gold Standard vs. Alternative Approaches

Randomized controlled trials (RCTs) are widely regarded as the gold standard in impact evaluation for establishing causal effects because randomization balances groups on both observed and unobserved confounders, minimizing selection bias and enabling unbiased estimates of average treatment effects under ideal conditions. This approach has been particularly influential in development economics, where organizations such as J-PAL have scaled RCTs to evaluate interventions like education and health programs, yielding precise estimates such as a 0.14 standard deviation increase in earnings from a childhood intervention in long-term follow-ups reported as of 2019. However, proponents acknowledge that RCTs assume stable mechanisms and no spillover effects, which may not hold in complex social settings. Despite their strengths in internal validity, RCTs face significant limitations that challenge their unqualified status as the gold standard. Ethical constraints prevent randomization in many contexts, such as evaluating universal programs like national policy reforms, while high costs—often exceeding $1 million per trial in development settings—and long timelines limit applicability. External validity is another concern, as RCT participants and settings are often unrepresentative; for instance, trials in controlled environments may overestimate effects in diverse real-world applications, with meta-analyses showing effect sizes in RCTs decaying by up to 50% when scaled up. Critics like Angus Deaton argue that RCTs provide narrow, context-specific knowledge without illuminating underlying mechanisms or generalizability, potentially misleading if treated as universally superior evidence, as evidenced by discrepancies between RCT findings and broader econometric data in poverty alleviation studies. Alternative approaches, particularly quasi-experimental designs, offer robust causal identification when RCTs are infeasible by exploiting natural or policy-induced variation. Methods like regression discontinuity designs (RDD) assign treatment based on a cutoff score, approximating randomization near the threshold; for example, an RDD evaluation of Colombia's scholarship program in 2012 estimated a 4.8 percentage point increase in enrollment, comparable to RCT benchmarks. Difference-in-differences (DiD) compares changes over time between treated and untreated groups assuming parallel trends, as in Card and Krueger's 1994 minimum wage study, which found no employment loss in the fast-food sector after New Jersey's 1992 hike. Instrumental variables (IV) use exogenous shocks for identification, addressing endogeneity in observational data. These methods rely on partially testable assumptions—such as no manipulation of the running variable in RDD or parallel trends in DiD—allowing empirical validation, and often provide stronger external validity by leveraging large-scale administrative data rather than small, artificial samples. The debate pits RCT advocates, including Abhijit Banerjee and Esther Duflo—who emphasize randomization's avoidance of model dependence against alternatives' reliance on untestable assumptions—against skeptics like Deaton and Nancy Cartwright, who contend that no method guarantees valid inference without theory and contextual knowledge, as RCTs can suffer from attrition (up to 20-30% in social trials) or Hawthorne effects. Empirical comparisons reveal mixed results: a 2022 analysis of labor interventions found quasi-experimental estimates aligning with RCTs 70-80% of the time when assumptions hold, but diverging in heterogeneous contexts, underscoring that alternatives can match RCT precision while better capturing policy-relevant variation.
In impact evaluation, over-reliance on RCTs, often promoted by institutions with vested interests in experimental methods, risks sidelining credible quasi-experimental evidence from natural experiments, as seen in macroeconomic policy assessments where observational designs have informed reforms such as conditional cash transfers in Latin America.
Approach | Key Strength | Key Limitation | Example Application
RCTs | High internal validity via randomization | Poor scalability, ethical barriers, limited generalizability | Microfinance impacts in India (2000s trials showing modest effects)
Quasi-experimental (e.g., DiD, RDD) | Leverages real-world variation for broader applicability | Depends on assumptions like parallel trends, testable but not always verifiable | Minimum wage effects (DiD in Card and Krueger's 1994 U.S. study)
Ultimately, causal inference demands selecting methods based on context rather than methodological hierarchy, integrating RCTs where possible with quasi-experimental and mechanistic analyses for robustness, as singular elevation of any approach ignores the pluralistic nature of evidence in complex systems.

Empirical vs. Theory-Driven Approaches

In impact evaluation, the empirical approach prioritizes observable data and statistical estimation to determine program effects, often employing randomized controlled trials (RCTs) or quasi-experimental designs to isolate causal impacts on outcomes while treating interventions as "black boxes" that link inputs directly to results without explicit modeling of internal processes. This approach, rooted in the positivist paradigm's emphasis on objective measurement and replicability, seeks to establish whether an intervention produces net benefits through rigorous hypothesis testing and control for confounding variables, as seen in evaluations by organizations like the Poverty Action Lab (J-PAL), which reported over 1,000 RCTs by 2023 demonstrating average treatment effects in areas such as health and education. Such methods excel in providing high internal validity, with meta-analyses showing RCTs yielding effect sizes that are more precise and less biased than non-experimental alternatives, though they may overlook heterogeneous effects across contexts. Theory-driven evaluation, by contrast, integrates explicit program theories—such as theories of change or realist causal models—to unpack how interventions generate outcomes via intermediate links, resources, and contextual factors, rather than relying solely on outcome measurement. Originating in the 1980s as a response to black-box limitations, this approach, advanced by evaluators like Huey Chen, posits that understanding "what works for whom, in what circumstances, and why" requires mapping assumed causal pathways and testing them empirically or qualitatively, as applied in assessments by the International Institute for Environment and Development (IIED). For instance, a 2014 study on knowledge translation initiatives used realist evaluation to identify context-mechanism-outcome configurations, revealing why certain programs succeeded in specific settings despite similar average effects. Proponents argue it enhances external validity and scalability by addressing generalizability gaps in purely empirical designs, with Treasury Board of Canada guidelines from 2021 recommending its use to examine causal chains beyond net impacts. The tension between these paradigms reflects broader methodological debates in social science, where the empirical approach is lauded for its causal rigor—evidenced by post-positivist refinements acknowledging researcher influence but still prioritizing quantifiable evidence over metaphysical assumptions—yet critiqued for a black-box orientation that ignores implementation fidelity and adaptive behaviors. Theory-driven approaches counter this by fostering deeper causal understanding through mechanism testing, but they risk bias if theories embed unverified ideological assumptions, as noted in critiques of their subjective theory construction potentially amplifying biases in settings where qualitative methods predominate. Empirical evaluations have demonstrated superior replicability in policy contexts, with a 2020 review finding that black-box RCT findings influenced 15% more legislative changes than theory-only assessments, though hybrid models combining both—such as realist RCTs—emerge as pragmatic syntheses balancing evidentiary strength with explanatory depth. In practice, over-reliance on positivist metrics in high-stakes funding decisions, like those from USAID since 2010, has prompted calls for theory integration to mitigate scale-up failures in empirically validated pilots, underscoring that while empirical methods ground truth claims in data, theory-driven elements are essential for causal interpretation without supplanting evidential primacy.

Ethical, Practical, and Ideological Critiques

Ethical critiques of impact evaluation, particularly randomized controlled trials (RCTs), center on the moral implications of randomization, which deliberately withholds interventions from control groups to establish counterfactuals. This practice raises concerns about equity and beneficence, as it may deny potentially life-improving treatments to participants in need, especially when genuine equipoise is absent, violating principles like those in the Declaration of Helsinki. In development contexts, where populations often face economic or health vulnerabilities, RCTs can exacerbate inequalities by favoring treatment groups, prompting debates over whether such designs are justifiable without assured post-trial access for controls. Critics like Angus Deaton argue that conducting RCTs when interventions are suspected to work undermines ethical standards, as it prioritizes experimental purity over participant welfare, potentially amounting to exploitation in low-resource settings. Practical challenges include the high financial and temporal costs of RCTs, which often require large samples, extended follow-ups, and sophisticated data collection, rendering them infeasible for small-scale or urgent programs in resource-constrained environments. Attrition, non-compliance, and contextual dependencies further compromise reliability, as real-world implementation deviates from idealized protocols, leading to underpowered studies unable to detect modest effects. External validity remains a persistent issue; findings from specific, controlled settings—such as deworming programs in rural Kenya—frequently fail to replicate or scale in diverse populations or policy environments, limiting their utility for broad decision-making. Ideological critiques portray RCT-centric impact evaluation as emblematic of empiricist reductionism, which elevates narrow, ahistorical data over theoretical models, contextual nuances, and structural analysis, fostering a "randomista" orthodoxy that dismisses non-experimental evidence. This approach is accused of technocratic overreach, depoliticizing policymaking by framing decisions as purely evidence-driven while sidelining value judgments, power dynamics, and ethical trade-offs inherent to public policy. In development economics, such methods have been labeled neo-colonial, imposing Western scientific paradigms on global South contexts and prioritizing measurable outcomes over holistic, theory-guided interventions that address systemic causes like institutional failures. Proponents of alternatives, including structural economists, contend that RCTs' aversion to prior assumptions hinders causal understanding in complex systems, where questions demand mechanistic reasoning beyond average effects.

Applications and Empirical Evidence

Development and Social Programs

Impact evaluations, predominantly through randomized controlled trials (RCTs), have been extensively applied to development and social programs in low- and middle-income countries, yielding causal evidence on interventions targeting poverty alleviation, health, education, and nutrition. Organizations such as the Abdul Latif Jameel Poverty Action Lab (J-PAL) and the World Bank have conducted or funded numerous RCTs to assess program effectiveness, revealing heterogeneous outcomes where some interventions demonstrate robust benefits while others show modest or null effects. These evaluations emphasize scalable, low-cost programs like deworming and cash transfers, but also highlight challenges such as generalizability beyond pilot settings and long-term sustainability. Conditional cash transfer (CCT) programs, which link payments to behaviors like school attendance and health checkups, provide some of the strongest evidence of positive impacts. Mexico's Progresa (later Oportunidades), launched in 1997, was evaluated using RCTs on over 24,000 households, showing increases in secondary school enrollment of approximately 20% for girls and improvements in health outcomes, including a 10-18% rise in preventive health visit rates and reduced malnutrition. Long-term follow-ups indicated sustained effects, such as higher educational attainment and reduced poverty into adulthood, though benefits were more pronounced for targeted poor households. Unconditional cash transfers (UCTs), without behavioral requirements, have been analyzed in a Bayesian meta-analysis of 115 studies across 72 programs, estimating average effects including a 0.08 standard deviation increase in household consumption and reduced poverty, with stronger impacts in acute-need contexts but limited evidence of transformative poverty escape. In health-focused social programs, deworming initiatives stand out for cost-effectiveness, with RCTs in Kenya demonstrating that school-based treatment reduced worm infections and increased school attendance by 25%, alongside long-run earnings gains of up to 20% for treated children tracked into adulthood. A 2022 meta-analysis of multiple studies confirmed modest nutritional benefits, such as a 0.3 kg average weight gain in children per treatment round, though effects on cognition and height were inconsistent or negligible. Reanalyses of flagship studies have debated effect sizes, attributing some discrepancies to externalities like community-wide treatment spillovers, underscoring the need for careful interpretation in scaling. Microfinance programs, aimed at fostering entrepreneurship among the poor, contrast with these successes, as RCTs across six countries found limited causal impacts on household income or consumption, with meta-analyses of seven evaluations reporting negligible effects for non-entrepreneurial households and only modest business expansion among borrowers. These null or small effects challenge earlier observational claims of broad transformative potential, revealing instead that access to credit often supports consumption smoothing rather than sustained growth, particularly in saturated markets. Overall, empirical evidence from these applications supports selective investment in high-evidence interventions like CCTs and deworming, which yield positive returns at costs under $100 per beneficiary annually, but cautions against over-reliance on programs like microfinance without addressing selection into borrowing. Integration with non-experimental methods, such as panel regressions on observational data, has complemented RCTs for broader policy contexts where randomization is infeasible.

Policy and Institutional Interventions

Impact evaluations of policy and institutional interventions employ experimental and quasi-experimental methods, such as randomized controlled trials (RCTs) and difference-in-differences (DiD) designs, to measure the effects of reforms on outcomes like governance, service delivery, and institutional quality. These assessments often reveal mixed results, with successes dependent on contextual factors including political incentives and implementation capacity, while many donor-supported initiatives fail to deliver sustained improvements. For instance, between 1998 and 2008, donor-backed "good governance" reforms in 145 countries resulted in a decline in government effectiveness for 50% of recipients, as measured by the Worldwide Governance Indicators, highlighting challenges in achieving causal improvements through institutional changes. Decentralization policies, which devolve authority to local levels, have been evaluated for their impacts on accountability and public goods provision. A randomized evaluation in India during the early 2000s assigned village council leadership to women under quotas, finding that female policymakers increased investments in public drinking water and roads—goods disproportionately benefiting women—by 10-15 percentage points compared to male-led villages, demonstrating causal effects on pro-poor outcomes via improved representation. In Bolivia, the 1994 Popular Participation Law, which decentralized 20% of national revenue to municipalities, led to shifts in spending toward education and basic services in poorer areas, with per capita infrastructure investments rising by up to 25% in responsive localities, though overall impacts varied by local capacity. Streamlining administrative institutions, such as one-stop service (OSS) reforms, aims to reduce bureaucratic hurdles for business registration and permits. In Indonesia, the 2018 OSS institutional overhaul, consolidating licensing across 369 districts, was assessed using a staggered DiD model on 2014-2018 data, revealing a short-term negative impact on per-capita GDP growth, with a coefficient of -0.011 (p<0.1), attributed to transitional disruptions like capacity gaps and risk-averse implementation. Policing institutional reforms, including procedural justice training protocols, have shown more consistent causal benefits in RCTs; a multicity U.S. trial in 2015-2016 found procedural justice training increased officer compliance with constitutional standards by 10-20%, reducing citizen complaints without elevating crime rates. Similarly, a 2024 RCT of use-of-force training in a large police department reported a statistically significant reduction in force incidents post-intervention. Broader evidence from anti-corruption reforms indicates limited success in curbing administrative corruption, with systematic reviews finding that while transparency gains reduce opportunities for graft, sustained declines require complementary enforcement, as isolated institutional tweaks often yield null or perverse effects due to entrenched incentives. These findings underscore the importance of rigorous, context-specific evaluations to distinguish effective interventions from those undermined by implementation failures or political short-termism.

Organizations, Initiatives, and Reviews

Key Promoters and Evidence Producers

The Abdul Latif Jameel Poverty Action Lab (J-PAL), established in 2003 at the Massachusetts Institute of Technology, serves as a central hub for promoting randomized controlled trials (RCTs) in impact evaluation, particularly in poverty alleviation and development economics. J-PAL-affiliated researchers have conducted or overseen more than 1,100 randomized evaluations worldwide, generating empirical evidence on interventions such as deworming programs, remedial education, and conditional cash transfers, which have informed scalable policies in over 80 countries. Its founders, including Nobel laureates Abhijit Banerjee and Esther Duflo, emphasize RCTs for establishing causal impacts, training policymakers and researchers through courses and partnerships to prioritize evidence over intuition in program design. Innovations for Poverty Action (IPA), founded in 2002 by economist Dean Karlan, functions as a research network that executes field experiments to test poverty interventions, producing evidence on topics like microfinance efficacy, agricultural innovations, and behavioral nudges. IPA has completed hundreds of RCTs across more than 50 countries, collaborating with governments and NGOs to scale proven programs, such as improving teacher attendance or reducing fraud in cash transfers, while addressing organizational challenges in embedding rigorous evaluation into operations. It complements J-PAL by focusing on implementation science, providing tools for theory-driven evaluations and partnering on joint initiatives to build capacity for evidence generation in low-resource settings. The International Initiative for Impact Evaluation (3ie), launched in 2008 as a grant-making NGO, funds and synthesizes high-quality impact studies to support evidence-informed policies in low- and middle-income countries, emphasizing transparency through systematic reviews and repositories of over 4,000 evaluations. 3ie has disbursed grants for more than 300 primary studies and produced evidence maps on sectors like health, education, and climate adaptation, promoting mixed-methods approaches alongside RCTs to enhance generalizability and uptake by decision-makers. It quality-assures outputs via rigorous protocols, countering publication bias by incentivizing registration and reporting of null results. Other notable producers include the World Bank's Strategic Impact Evaluation Fund (SIEF), active since 2008, which has supported over 100 studies measuring program effects in areas like health, education, and service delivery, influencing Bank-wide lending decisions with data from RCTs across multiple regions. The International Food Policy Research Institute (IFPRI) has conducted causal evaluations since the late 1990s, including landmark RCTs on Mexico's PROGRESA program, generating evidence on nutrition-sensitive agriculture and social safety nets adopted in multiple nations. These entities collectively advance a culture of empirical testing, though their RCT-centric focus has drawn scrutiny for potential overemphasis on narrow, context-specific findings at the expense of broader causal mechanisms.

Skeptics, Critics, and Reform Advocates

Nobel laureate Angus Deaton has critiqued the application of randomized controlled trials (RCTs) in impact evaluation, arguing that they are often misinterpreted as providing unassailable evidence for policy without addressing external validity or causal mechanisms. Deaton and co-author Nancy Cartwright contend that RCTs require minimal theoretical assumptions, which aids persuasion in skeptical contexts but hinders deeper understanding by sidelining prior knowledge and generalizability beyond specific trial conditions. They emphasize that RCTs cannot stand alone as "gold standard" proofs, as replication across varied settings is rare, and results may fail to predict outcomes in scaled implementations due to contextual differences. Lant Pritchett has similarly challenged the RCT paradigm in development impact evaluation, highlighting paradoxes in scaling where small-scale trials yield effects that diminish or reverse at larger scales due to implementation challenges and institutional constraints. Pritchett argues that RCTs disproportionately focus on marginal, short-term interventions like private goods rather than public goods or systemic reforms, diverting attention from transformative questions about economic growth and state capability. He critiques the methodology for underemphasizing mechanisms of change and scalability, noting that even positive trial findings often encounter "fade-out" when rolled out nationally, as seen in education interventions where contract teacher effects did not persist when scaled broadly. Ethical concerns form another core critique, particularly in development contexts where control groups receive no intervention, potentially withholding beneficial treatments from vulnerable populations. Deaton points to cases like cash transfers or health programs where randomization equates to denying aid, raising moral hazards absent equipoise—true uncertainty about effectiveness—that is harder to establish for social policies than medical ones. Critics like Ravi argue this practice influences research agendas toward low-stakes questions, amplifying disproportionate sway over policy while exposing participants to harms without adequate safeguards. Reform advocates urge integrating RCTs with theory-driven approaches, qualitative insights, and quasi-experimental methods to enhance external validity and policy relevance. Deaton advocates situating RCTs within cumulative scientific programs that incorporate mechanistic understanding and historical data, rather than treating them as isolated trials. Pritchett calls for frameworks prioritizing state capability and growth-oriented reforms, arguing that methodological pluralism better addresses development barriers than RCT orthodoxy. Such reforms aim to mitigate biases toward feasible but narrow studies, fostering evaluations that inform ambitious interventions despite academia's institutional incentives favoring RCT production.

Recent Developments and Challenges

Technological and Methodological Innovations

Advancements in machine learning have enhanced causal estimation in impact evaluation by addressing high-dimensional data and model misspecification. Double machine learning (Double ML) employs supervised algorithms to flexibly estimate nuisance parameters, such as propensity scores and conditional expectations, within semi-parametric estimators for average treatment effects under unconfoundedness assumptions, thereby improving precision and bias reduction compared to parametric alternatives. Targeted learning integrates ensemble methods like the Super Learner into targeted maximum likelihood estimation, allowing for data-adaptive modeling while targeting causal parameters, as demonstrated in policy effect estimations where traditional methods falter with complex covariates. These approaches, formalized in frameworks from 2019 onward, enable evaluators to incorporate vast covariate sets without the misspecification risks inherent in rigid parametric models. Synthetic control methods have seen refinements for broader applicability in non-experimental settings. Generalized synthetic control approaches, which extend the original method by incorporating interactive fixed effects, have shown superior performance over standard difference-in-differences and synthetic controls in simulations involving staggered adoption or heterogeneous treatments, particularly for evaluations with controlled donor pools. Recent extensions, such as using multiple outcomes to construct synthetic counterfactuals, mitigate interpolation biases in single-unit interventions, as applied in re-evaluations of policy shocks where pre-treatment fit is optimized across dimensions like economic and social indicators. These innovations, building on Abadie's framework, facilitate causal claims in contexts lacking randomized variation, such as regional reforms, with applications documented as early as 2015 in health interventions. Technological innovations leverage remote sensing and digital data for scalable outcome measurement and real-time assessment. Satellite imagery has enabled proxy-based evaluations of environmental and agricultural programs by capturing changes in land cover or crop yields without reliance on household surveys; for example, analyses have used it to assess productivity impacts from development interventions. Imagery data, including nighttime lights and high-resolution sensors, supports quasi-experimental designs for hard-to-measure outcomes like local economic activity, with evaluations highlighting its advantages in coverage and timeliness since the early 2020s. Administrative records and call detail records (CDRs) provide granular, longitudinal data for difference-in-differences setups, as mapped in systematic reviews linking big data to development outcomes, though causal applications remain limited by privacy and access concerns. Digital tools have transformed data collection for impact evaluation, enabling real-time monitoring and reducing logistical costs. Mobile-based surveys and GPS-enabled applications facilitate continuous tracking in RCTs and quasi-experiments, as seen in India's sanitation programs where app-based reporting monitored toilet construction and usage daily, allowing adaptive interventions. Integration of these tools with administrative data enhances precision in attributing effects, such as in agricultural RCTs measuring plot-level yields via phone-based reporting. A 2023 3ie systematic map indicates growing use of such technologies in impact studies, particularly for measurement validation, but underscores gaps in rigorous causal application due to data quality and privacy issues. These methods, accelerated by post-2020 expansions, support faster feedback loops in policy cycles compared to traditional endline surveys.
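A minimal sketch of the partialling-out variant of Double ML is shown below in Python, using scikit-learn with simulated data and hypothetical parameter values: both the treatment and the outcome are predicted from the confounders with cross-fitting, and the causal coefficient is recovered from a regression of the outcome residuals on the treatment residuals.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(5)
n, p, true_effect = 4000, 10, 1.0

# High-dimensional confounders X affect both treatment D and outcome Y.
X = rng.normal(0, 1, (n, p))
g = np.sin(X[:, 0]) + X[:, 1] ** 2            # nonlinear nuisance function
d = g + rng.normal(0, 1, n)
y = true_effect * d + 2 * g + rng.normal(0, 1, n)

# Cross-fitted residual-on-residual regression (partialling-out Double ML):
# predict D and Y from X out-of-fold, then regress the residuals.
ml = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
d_res = d - cross_val_predict(ml, X, d, cv=5)
y_res = y - cross_val_predict(ml, X, y, cv=5)
theta = np.sum(d_res * y_res) / np.sum(d_res ** 2)
print(f"Double ML estimate of the treatment effect: {theta:.2f}")  # near 1.0
```

Cross-fitting is what keeps the flexible machine-learning predictions from contaminating the final causal coefficient; a naive plug-in without sample splitting would generally be biased.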

Barriers to Policy Influence and Scalability

Impact evaluations frequently encounter resistance in translating findings into policy due to political and institutional dynamics that prioritize ideology or expediency over causal evidence. In one analysis of 73 randomized controlled trials conducted across 30 U.S. cities with a national behavioral insights team, positive results prompted adoption in only 27% of cases, often due to bureaucratic inertia, competing priorities, and skepticism about generalizability beyond pilot settings. Similarly, policymakers may disregard evaluations conflicting with entrenched interests, as evidenced by persistent underuse of rigorous data in some policy domains, where ideological commitments to unproven approaches prevail despite contrary empirical results. Dissemination challenges further impede influence, including untimely evaluation outputs and poor alignment between researchers' focus on average treatment effects and policymakers' need for context-specific, actionable insights. Academic and donor-driven evaluations, while methodologically sound, often fail to engage decision-makers early, leading to findings that are technically credible but politically inert; for example, systematic reviews identify the lack of timely, relevant evidence as the most cited barrier, compounded by institutional silos that fragment evidence uptake. This disconnect is exacerbated in polarized environments, where evidence is selectively interpreted to fit partisan narratives rather than assessed on causal merits. Scalability of proven interventions presents distinct hurdles, as pilot successes under controlled conditions rarely persist at larger scopes due to emergent complexities like spillovers, heterogeneous effects, and general equilibrium shifts not captured in randomized designs. Cost structures, for instance, inflate dramatically upon expansion—small-scale programs may yield high returns in trials funded by external grants, but rollout demands sustained public budgets amid diminishing marginal benefits and administrative frictions, as seen in attempts to scale micro-interventions in low-income settings where logistical and budgetary constraints erode effectiveness. Critiques highlight that many impact evaluations target incremental "islets" of improvement, such as targeted subsidies or nudges, which prove inadequate for systemic change requiring institutional overhauls beyond experimental scope. Lant Pritchett argues this micro-focus yields evidence with limited predictive power for scaled policy, as real-world adoption introduces adaptive changes that alter causal pathways; empirical tracking reveals that few RCT-backed programs achieve broad rollout, with adoption rates remaining low due to unaddressed factors like political economy constraints or weak state capacity. In development contexts, barriers such as these have constrained the scaling of even modestly successful trials, underscoring the gap between localized causal identification and feasible policy transformation.

References

  1. [1]
    Impact evaluation - Better Evaluation
    An impact evaluation must establish the cause of the observed changes. Identifying the cause is known as 'causal attribution' or 'causal inference'.
  2. [2]
    [PDF] Impact Evaluation in Practice
The basic impact evaluation question essentially constitutes a causal inference problem. Assessing the impact of a program on a series of outcomes is ...
  3. [3]
    [PDF] Impact Evaluation, Causal Inference, and Randomized Evaluation
    Oct 21, 2024 · M&E is focused on the program (process, output). • Impact evaluation is focused on cause and effect, i.e. attribution, on outcomes. How much did ...
  4. [4]
    [PDF] Causal Inference and Impact Evaluation - HAL
    Jun 12, 2020 · By definition, an instrumental variable must have a very significant impact on access to the program being evaluated – in this case, the ...
  5. [5]
    [PDF] Causal Inference and Experimental Impact Evaluation
    What is impact evaluation (IE)?. • IE question: What is the impact (or causal effect) of a program on outcome of interest?
  6. [6]
    Heterogeneous Treatment Effects in Impact Evaluation - Eva Vivalt
May 4, 2015 · I do this using a large, unique dataset of impact evaluation results. These data were gathered by a nonprofit research organization ...
  7. [7]
    Common Problems with Formal Evaluations: Selection Bias and ...
This page discusses the nature and extent of two common problems we see with formal evaluations: selection bias and publication bias.
  8. [8]
    Failures in impact evaluation | Research Evaluation - Oxford Academic
Jul 28, 2025 · Researching and evaluating failures: In practice, Andrews (2018) argues evaluations are often biased, focusing on reporting outputs and outcomes ...
  9. [9]
    Ten Reasons Not to Measure Impact—and What to Do Instead
An impact evaluation should help determine why something works, not merely whether it works. Impact evaluations should not be undertaken if they will provide no ...
  10. [10]
    [PDF] Introduction to Impact Evaluation - The World Bank
The objective of impact evaluation is to estimate the causal effect or impact of a program on outcomes of interest. Estimate the causal effect (impact) of ...
  11. [11]
    [PDF] impact evaluation - | Independent Evaluation Group - World Bank
    First, it puts forward the definition of impact evaluation as a. 'counterfactual analysis of the impact of an intervention on final welfare outcomes.' Second ...
  12. [12]
    [PDF] Impact Evaluation in Practice - World Bank Documents & Reports
    Its main goal is to expand the evidence base on what works to improve health, education, and social protection outcomes, thereby informing development policy.
  13. [13]
    [PDF] Impact Evaluation - Climate Investment Funds (CIF)
    Impact Evaluation (IE) as defined here is an evaluation that quantitatively analyzes causal links between programs or interventions and a set of outcomes.
  14. [14]
    Handbook on Impact Evaluation : Quantitative Methods and Practices
Evaluating impact is particularly critical in developing countries where resources are scarce and every dollar spent should aim to maximize its impact on poverty ...
  15. [15]
    [PDF] Principles for Impact Evaluation - 3ie
    Policy-relevant impact evaluations offer clear policy messages based on a deep understanding of context and implementation. 3. Social and economic development ...
  16. [16]
    [PDF] The Historical Development of Program Evaluation - OpenSIUC
    Program evaluation's historical development is difficult to describe, but includes seven time periods, starting with the first formal use in 1792.
  17. [17]
    HISTORY OF EVALUATION - Sage Publishing
    While evaluation as a profession is new, evaluation activity began long ago, perhaps as early as Adam and Eve. As defined in Chapter 1, evaluation is a ...
  18. [18]
    [PDF] EXPERIMENTAL AND QUASI-EXPERIMENTAL DESIGNS FOR ...
    DONALD T. CAMPBELL AND JULIAN C. STANLEY decrease the respondent's sensitivity or responsiveness to the experimental variable and thus make the results ...
  19. [19]
    [PDF] A Look Back at Two Decades of Progress in the Impact Evaluation ...
    It is the largest health policy study in US history and paved the way for increased cost sharing for medical care in the 1980s and 1990s. 1990–2000. The results ...
  20. [20]
    The history of randomized control trials: scurvy, poets and beer
    Apr 18, 2018 · In 1884, we get the first randomization in the social sciences. The (among other things) psychology researcher Charles Pierce was trying to ...
  21. [21]
    3ie: Home
3ie has been generating rigorous evide... International Initiative for Impact Evaluation (3ie)
  22. [22]
    Randomized Control Trials | Dime Wiki
Apr 13, 2021 · A randomized controlled trial (RCT) is a method of impact evaluation in which all eligible units in a sample are randomly assigned to treatment and control ...
  23. [23]
    Introduction to randomized evaluations - Poverty Action Lab
    Randomized evaluations (RCTs) randomly assign participants to treatment and comparison groups to measure the causal impact of an intervention.
  24. [24]
    Randomized controlled trials – a matter of design - PMC
    Randomized controlled trials (RCTs) are the hallmark of evidence-based medicine and form the basis for translating research data into clinical practice.
  25. [25]
    Campbell DT, Stanley JC (1963) - The James Lind Library
    Campbell DT, Stanley JC (1963). Experimental and quasi-experimental designs for research. Chicago: Rand McNally & Company.
  26. [26]
    Quasi-experimental design and methods | Better Evaluation
    Jan 7, 2014 · Quasi-experimental design tests causal hypotheses, like experimental designs, but lacks random assignment, using self or administrator ...
  27. [27]
    Difference-in-difference - Better Evaluation
    Difference-in-difference involves comparing the before-and-after difference for the group receiving the intervention (where they have not been randomly ...
  28. [28]
    Difference-in-Differences | Dime Wiki - World Bank
    Aug 7, 2023 · Difference-in-differences takes the before-after difference in treatment group's outcomes. This is the first difference.
  29. [29]
    Advances in Difference-in-differences Methods for Policy Evaluation ...
    Difference-in-differences (DiD) is a powerful, quasi-experimental research design widely used in longitudinal policy evaluations with health outcomes.
  30. [30]
    Regression discontinuity - Better Evaluation
    RDD is a quasi-experimental evaluation option that measures the impact of an intervention, or treatment, by applying a treatment assignment mechanism.
  31. [31]
    [PDF] Using Regression Discontinuity Design for Program Evaluation
    Regression discontinuity design (RDD) is a popular quasi-experimental design used to evaluate program effects. It differs from the randomized control trial (RCT) ...
  32. [32]
    Instrumental Variables | Urban Institute
    Instrumental variables methods are the backbone of causal inference because they can solve a wide variety of very thorny inference problems.
  33. [33]
    Quasi-Experimental Designs for Causal Inference - PMC
    The strongest quasi-experimental designs for causal inference are regression discontinuity designs, instrumental variable designs, matching and propensity score ...
  34. [34]
    Causal inference and observational data
    Oct 11, 2023 · Observational studies using causal inference frameworks can provide a feasible alternative to randomized controlled trials.
  35. [35]
    Causal inference with observational data: A tutorial on propensity ...
    Propensity score analysis provides a useful way to making causal claims under the assumption of no unobserved confounders.
  36. [36]
    Causal inference and effect estimation using observational data
    We provide a clear, structured overview of key concepts and terms, intended as a starting point for readers unfamiliar with the causal inference literature.
  37. [37]
    Causal inference with observational data: the need for triangulation ...
    The goal of much observational research is to identify risk factors that have a causal effect on health and social outcomes.
  38. [38]
    Observational Studies: Methods to Improve Causal Inferences - PMC
    Mar 23, 2023 · This paper focuses on understanding causal inferences and methods to improve them for observational studies.
  39. [39]
    Sources of selection bias in evaluating social programs - PNAS
    The selection bias remaining after matching is a substantial percentage—often over 100%—of the experimentally estimated impact of program participation.
  40. [40]
    [PDF] Selection Bias - The University of North Carolina at Chapel Hill
    Selection bias is a distortion in a measure of association due to a sample selection that does not accurately reflect the target population.
  41. [41]
    Biases in randomized trials: a conversation between trialists and ...
    Biases in randomized trials: a conversation between trialists and epidemiologists · Selection bias · Performance bias · Detection bias · Attrition bias · Reporting ...
  42. [42]
    [PDF] assessing attrition bias
    Attrition bias occurs when not all participants' outcomes are measured, and different rates of attrition between groups can bias the estimated intervention  ...
  43. [43]
    [PDF] Addressing Attrition Bias in Randomized Controlled Trials
    Attrition bias occurs when people leaving a study have characteristics correlated with group status or outcomes, creating systematic differences and biased ...
  44. [44]
    [PDF] Sample Attrition in Teen Pregnancy Prevention Impact Evaluations
    In this brief, we discuss how attrition affects individual- and cluster-level RCTs, how it is assessed, and strategies to limit it. We pay particular attention ...
  45. [45]
    Attrition bias | Catalog of Bias
    Attrition bias is the unequal loss of participants from study groups, where systematic differences between those who leave and those who stay can bias results.
  46. [46]
    Assessing the impact of attrition in randomized controlled trials
    The aim of this study was to investigate the impact of attrition on baseline imbalance within individual trials and across multiple trials.
  47. [47]
    Assessing the impact of attrition in randomized controlled trials
    The aim of this study was to investigate the impact of attrition on baseline imbalance within individual trials and across multiple trials.
  48. [48]
    Reporting attrition in randomised controlled trials - PMC - NIH
    Such attrition prevents a full intention to treat analysis being carried out and can introduce bias., Attrition can also occur when participants have missing ...
  49. [49]
    A Graphical Catalog of Threats to Validity - PubMed Central - NIH
    Apr 2, 2020 · We define the Campbell tradition's named threats to validity. For each threat, we provide the epidemiologic analog, a corresponding DAG, and one ...
  50. [50]
    Threats to validity - Program Evaluation - Andrew Heiss
    Oct 28, 2020 · ... threats-validity ... One helpful way to assess an evaluation's internal validity is to systematically go through each possible threat and evaluate ...
  51. [51]
    Internal Validity in Impact Evaluation: Overview, Importance, and ...
    Nov 22, 2022 · History: History is a threat to the internal validity of an experiment. History is any event besides the independent variable that happened ...
  52. [52]
    [PDF] SITE SELECTION BIAS IN PROGRAM EVALUATION
    Feb 13, 2015 · “Site selection bias” can occur when the probability that a program is adopted or evaluated is correlated with its impacts.Missing: contextual | Show results with:contextual<|control11|><|separator|>
  53. [53]
    Causal Inference Using Potential Outcomes - Taylor & Francis Online
    Causal effects are defined as comparisons of potential outcomes under different treatments on a common set of units.
  54. [54]
    Introduction to the Potential Outcomes Framework
Jan 18, 2021 · The Potential Outcomes Framework (aka the Neyman-Rubin Causal Model) is arguably the most widely used framework for causal inference in the ...
  55. [55]
    The central role of the propensity score in observational studies for ...
    The propensity score is the conditional probability of assignment to a particular treatment given a vector of observed covariates.
  56. [56]
    [PDF] Instrumental Variables in Action: Sometimes You Get What You Need
    Angrist and Evans (1998) solve this omitted-variables problem using two instrumental variables, both of which lend themselves to Wald-type estimation strategies ...
  57. [57]
    Regression discontinuity designs: A guide to practice - ScienceDirect
    The sharp regression discontinuity design. It is useful to distinguish between two general settings, the sharp and the fuzzy regression discontinuity (SRD ...
  58. [58]
    [PDF] Regression Discontinuity Designs: A Guide to Practice
    This paper was prepared as an introduction to a special issue of the Journal of Econometrics on regression discontinuity designs.
  59. [59]
    [PDF] HOW MUCH SHOULD WE TRUST DIFFERENCES-IN ...
    HOW MUCH SHOULD WE TRUST. DIFFERENCES-IN-DIFFERENCES ESTIMATES? ∗. Marianne Bertrand. Esther Duflo. Sendhil Mullainathan. This Version: June 2003. Abstract.
  60. [60]
    [PDF] NBER WORKING PAPER SERIES HOW MUCH SHOULD WE ...
    Difference in differences estimation, which deals with small effective sample size, and complicated error distribution, seems a particularly fertile ground ...
  61. [61]
    Causal Inference Methods for Combining Randomized Trials and ...
    Oct 7, 2025 · In this paper, we review the growing literature on methods for causal inference on combined RCTs and observational studies, striving for the ...
  62. [62]
    New 3ie handbook for measuring cost-effectiveness in impact ...
    Jun 4, 2024 · The handbook provides a comprehensive 'how-to' guide for implementing cost-effectiveness analysis (CEA) in impact evaluation.
  63. [63]
    Sounds good… but what will it cost? Making the case for rigorous ...
    Dec 11, 2019 · The standards of rigor for integrating CEA/CBA analysis into academic impact evaluation studies in development economics are not well-defined.
  64. [64]
    [PDF] Integrating Value for Money and Impact Evaluations
    An impact evaluation was classified as having a cost-benefit analysis (CBA) if it included a comparison of estimates of Costs and Benefits (with data of costs ...
  65. [65]
    Why don't economists do cost analysis in their impact evaluations?
    May 10, 2016 · Cost-benefit (CB) analysis examines the rate of return of an intervention: For example, what is the present value of lifetime benefits of a ...
  66. [66]
    Integrating Value for Money and Impact Evaluations - eScholarship
    This mixed methods study investigates why fewer than one in five impact evaluations integrates a value-for-money analysis of the development intervention ...
  67. [67]
    Randomised controlled trial | Better Evaluation
    Nov 12, 2021 · An impact evaluation approach that compares results between a randomly assigned control group and experimental group or groups to produce an ...
  68. [68]
    [PDF] Instruments of development: Randomization in the tropics, and the ...
RCTs are seen as generating gold standard evidence that is superior to econometric evidence, and that is immune to the methodological criticisms ...
  69. [69]
    [PDF] Alternatives to Traditional Randomized Controlled Trials
    Randomized controlled trials (RCTs) have long been considered the “gold standard” for evaluating program impacts. Randomization minimizes selection-related.
  70. [70]
    Rethinking the pros and cons of randomized controlled trials ... - NIH
    Jan 18, 2024 · Randomized controlled trials (RCTs) have traditionally been considered the gold standard for medical evidence. However, in light of emerging ...
  71. [71]
    Methods for Evaluating Causality in Observational Studies - NIH
    In clinical medical research, causality is demonstrated by randomized controlled trials (RCTs). Often, however, an RCT cannot be conducted for ethical reasons, ...
  72. [72]
    Chapter 26 Quasi-Experimental Methods | A Guide on Data Analysis
    Quasi-experimental methods offer valuable tools for causal inference when RCTs are not feasible. However, these designs come with important limitations that ...
  73. [73]
    How to Use Quasi-Experimental Methods in Cardiovascular Research
Feb 16, 2024 · In research, randomized controlled trials (RCTs) provide the strongest causal inference for treatment and effect. Increasingly, quasi- ...
  74. [74]
    [PDF] Some Comments on Deaton (2009) and Heckman and Urzua (2009)
For support for his position that “Randomization is not a gold standard” (Deaton, p. 4), Deaton quotes Nancy Cartwright (2007) as claiming that “there is no ...
  75. [75]
    Understanding and misunderstanding randomized controlled trials
    According to Chalmers (2001) and Bothwell and Podolsky (2016), the development of randomization in medicine originated with Bradford-Hill, who used ...
  76. [76]
    A comparison of four quasi-experimental methods: an analysis of the ...
Nov 3, 2022 · The aim of this study is to compare some of the commonly used non-experimental methods in estimating intervention effects, and to highlight their relative ...
  77. [77]
    [PDF] Should the Randomistas (Continue to) Rule?
    While RCTs have an important place in the toolkit for impact evaluation, an unconditional preference for RCTs as the “gold standard” is questionable on three ...
  78. [78]
    The Abdul Latif Jameel Poverty Action Lab
... J-PAL conducts randomized impact evaluations to answer critical questions in the fight against poverty. Overview. The Abdul Latif Jameel Poverty Action Lab (J- ...
  79. [79]
    Are randomised controlled trials positivist? Reviewing the social ...
    We conclude that the most appropriate paradigm for RCTs of social interventions is realism not positivism.
  80. [80]
    [PDF] Theory-based impact evaluation
    Theory-based evaluation does not estimate the net effect of an intervention, but it can help us identify controls and confounding factors that can inform the ...
  81. [81]
    Using realist evaluation to open the black box of knowledge translation
    Sep 5, 2014 · Theory-based or theory-driven approaches provide an alternative to black box evaluation that examine not only outcome, but also the possible ...
  82. [82]
    Theory-Based Approaches to Evaluation: Concepts and Practices
    Mar 22, 2021 · Approaches include theory-based evaluation (Weiss, 1995, 2000), theory-driven evaluation ... Theory-based evaluation and varieties of complexity.
  83. [83]
    Postpositivist Paradigm and Program Evaluation
    Oct 8, 2025 · The main differences between positivism and postpositivism are the level of certainty and their contrasting positions on metaphysics.
  84. [84]
  85. [85]
    Issues in the theory-driven perspective - ScienceDirect
    There is currently a strong movement in program evaluation to move from black box evaluations, concerned primarily with the relationship between the inputs ...
  86. [86]
    Understanding and misunderstanding randomized controlled trials
    RCTs can play a role in building scientific knowledge and useful predictions but they can only do so as part of a cumulative program.
  87. [87]
    The ethics of a control group in randomized impact evaluations
    Jul 6, 2011 · One concern is with equity. Systematically favoring the treatment subjects with an intervention can be seen as unfair (although presumably we ...
  88. [88]
    [PDF] Deaton Cartwright RCTs with ABSTRACT August 25
    Aug 25, 2025 · Understanding and misunderstanding randomized controlled trials. Angus Deaton and Nancy Cartwright. Princeton University.
  89. [89]
    [PDF] An Introduction to Impact Evaluations with Randomized Designs1
Randomized experiments are increasingly popular ways to evaluate the impacts of development interventions. They provide hope that we can overcome important ...
  90. [90]
    The Problem With Evidence-Based Policies by Ricardo Hausmann
    Feb 25, 2016 · Ricardo Hausmann shows why randomized control trials are the wrong way to test interventions in many areas.
  91. [91]
    Reconsidering evidence-based policy: Key issues and challenges
Key issues include the relevance of evidence, interaction between research and policy, and the view of EBP as "technocratic" with a preference for quantitative ...
  92. [92]
    [PDF] Instruments, Randomization, and Learning about Development
    RCTs are seen as generating gold standard evidence that is superior to econometric evidence and that is immune to the methodological criticisms that are.
  93. [93]
    Microcredit: Impacts and promising innovations - Poverty Action Lab
    May 1, 2023 · A meta-analysis of seven randomized evaluations similarly found that the impact of microcredit was negligible for households with no business ...
  94. [94]
    Publication: Evaluation of Development Programs
    In this context RCTs are less suitable even for the simplest interventions. The TPE can be estimated by applying regression techniques to observational data ...
  95. [95]
    [PDF] Using RCTs to Estimate Long-Run Impacts in Development ...
    This review article surveys what we have learned about the determinants of long-run living standards from this growing body of RCTs in development economics, ...
  96. [96]
    Conditional Cash Transfers: The Case of Progresa/Oportunidades
    This article reviews the literature on the development, evaluation, and findings of Progresa/Oportunidades, summarizing what is known about program effects.
  97. [97]
    The Impact of PROGRESA on Health in Mexico - Poverty Action Lab
    PROGRESA involves a cash transfer that is conditional on the recipient household engaging in a set of behaviors designed to improve health and nutrition.
  98. [98]
    The impact of Mexico's conditional cash transfer programme ... - NIH
    The Oportunidades conditional cash transfer programme improved birthweight outcomes. This finding is relevant to countries implementing conditional cash ...
  99. [99]
    Unconditional Cash Transfers: A Bayesian Meta-Analysis of ...
    Aug 1, 2024 · We use Bayesian meta-analysis methods to estimate the impact of unconditional cash transfers (UCTs). Aggregating evidence from 115 studies of 72 UCT programs ...
  100. [100]
    The impact of mass deworming programmes on schooling and ... - NIH
    The study did not find any evidence of effect on nutritional status, cognitive tests or school grades achieved, but these are not reported in the abstracts.
  101. [101]
    Deworm the World | Evidence Action
    A 2022 meta-analysis found that deworming leads to an average weight gain of 0.3kg in children (that's the equivalent to moving a three-year-old from the 25th ...
  102. [102]
    Reanalysis of health and educational impacts of a school ... - 3ie
    3ie funded a two-part replication study of Edward Miguel and Michael Kremer's well-known impact evaluation of a school-based deworming programme in Kenya.
  103. [103]
    [PDF] Six Randomized Evaluations of Microcredit - MIT Economics
    Causal evidence on microcredit impacts informs theory, practice, and debates about its effectiveness as a development tool. The six randomized evaluations ...
  104. [104]
    First generation of microcredit RCTs - Microfinance - VoxDev
    Jan 30, 2025 · In this section, we review randomised controlled trials (RCTs) that provide causal evidence on the impacts of microcredit programmes.
  105. [105]
    [PDF] Should the Randomistas (Continue to) Rule?
One source is the existence of externalities in evaluations. There is evidence that having an impact evaluation in place for an ongoing development project ...
  106. [106]
    [PDF] Evaluation of Development Programs: Randomized Controlled ...
Sep 7, 2013 · An RCT evaluation might involve drawing a random sample from the population and assign treatment randomly within this sample. The researcher ...
  107. [107]
    Impact of institutional reform on development outcomes - GSDRC
    Recent studies find that many institutional reforms do not seem to make government function better, often have quite poor results, and rarely lead to ...
  108. [108]
  109. [109]
    the impact evaluation of the institutional reforms of the one-stop ...
    Jan 5, 2022 · This paper examines the impacts of institutional reform of the One-Stop Service (OSS) structures on increases in Indonesia's economic growth.
  110. [110]
    A multicity randomized trial at crime hot spots - PMC - NIH
Mar 28, 2022 · Our study is a randomized trial in policing confirming that intensive training in procedural justice (PJ) can lead to more procedurally just behavior.
  111. [111]
    Full article: The Impact of Training on Use of Force by Police in an ...
    Oct 16, 2024 · We conclude that the PPST curriculum appears effective at reducing use of force by police in a large scale, robust trial.
  112. [112]
    Public sector reforms and their impact on the level of corruption
    May 24, 2021 · The focus of this review is administrative corruption, namely corrupt acts involving civil servants in their dealings with their superiors, ...
  113. [113]
    J-PAL Courses | The Abdul Latif Jameel Poverty Action Lab
J-PAL courses help implementers, policymakers, and researchers become better users and producers of evidence and equip learners with skills in impact evaluation ...
  114. [114]
    Our Impact | Innovations for Poverty Action
    Sep 28, 2022 · IPA is the R&D engine of the development sector, with high-quality research using the same method used in medical trials, ie randomized evaluations.
  115. [115]
    Resources and Tools for Impact Evaluation | IPA
    IPA assembled this set of resources for use in designing and running an impact evaluation. Beginning with the need for a theory-driven evaluation.
  116. [116]
    Identifying When, Why, and How to Use Impact Evaluations | IPA
    This case study provides lessons learned on identifying how, when, and why to conduct an impact assessment in large organizations.
  117. [117]
    The Strategic Impact Evaluation Fund (SIEF) - World Bank
    The World Bank's Strategic Impact Evaluation Fund (SIEF) supports scientifically rigorous research that measures the impact of programs and policies.
  118. [118]
    IFPRI and causal impact evaluation: Evidence for real-life policies
    Sep 25, 2025 · An excellent example is the impact evaluation of the HarvestPlus Reaching End Users (REU) program, which showed that an integrated approach to ...
  119. [119]
    Reinvigorating Impact Evaluation for Global Development
2011. Marking a major step forward for impact evaluations of aid programs, the Millennium Challenge Corporation (MCC) and the US Agency ...
  120. [120]
    [PDF] Randomizing Development: Method or Madness? - Lant Pritchett
    Arguments that RCT research is a good (much less “best”) investment depend on both believing in an implausibly low likelihood that non-RCT research can improve ...
  121. [121]
    Randomized control trials for development? Three problems
    May 11, 2017 · First, there is a systematic bias toward analysis of private goods as opposed to public goods. Private goods are excludable since a seller needs ...
  122. [122]
    [PDF] The Debate about RCTs in Development is Over - Lant Pritchett
    returns to contract teachers from dozens of experiences (Murgai and Pritchett 2006) but also already known but scalability was limited as every single one was ...
  123. [123]
    Some questions of ethics in randomized controlled trials - Khera
    May 26, 2023 · This paper highlights eight areas of concern. RCTs also have a disproportionate influence on shaping research agendas and on policy.
  124. [124]
    Machine learning in policy evaluation: new tools for causal inference
    Mar 1, 2019 · Abstract:While machine learning (ML) methods have received a lot of attention in recent years, these methods are primarily for prediction.
  125. [125]
    A comparison of methods for health policy evaluation with controlled ...
    In our simulations, the generalized synthetic control approach outperformed more commonly used methods (difference‐in‐differences and synthetic control methods ...
  126. [126]
    Using Multiple Outcomes to Improve the Synthetic Control Method
    Feb 21, 2025 · The synthetic control method (SCM) estimates a treated unit's counterfactual untreated outcome via a weighted average of observed outcomes for ...Missing: innovations | Show results with:innovations<|separator|>
  127. [127]
    Examination of the Synthetic Control Method for Evaluating Health ...
    Oct 7, 2015 · This paper examines the synthetic control method in contrast to commonly used difference‐in‐differences (DiD) estimation, in the context of a re‐evaluation of ...
  128. [128]
    Emerging Trends in Impact Evaluation: 7 Innovative Approaches to ...
    Jan 15, 2025 · By adopting trending methodologies such as mixed-methods evaluation, real-time monitoring, and big data analytics, development practitioners can ...
  129. [129]
    Leveraging Imagery Data in Evaluations
    Feb 26, 2024 · This paper explores the potential of imagery data in evaluations and presents various data types and methodologies demonstrating their advantages and ...Missing: big administrative records
  130. [130]
    Using big data for evaluating development outcomes: A systematic ...
    The study maps different sources of big data onto development outcomes (based on SDGs) to identify current evidence base, use and the gaps.
  131. [131]
    [PDF] Geospatial Analysis in Impact Evaluation - 3ie
Geospatial analysis uses data to measure intervention impacts, align data spatially, and process remotely sensed observations, enabling more precise analysis.
  132. [132]
    State of play for big data in impact evaluation - 3ie
    Dec 21, 2023 · This presentation provides an overview of one stock-taking effort recently conducted by 3ie to get a sense of the scope, scale, and applications of big data ...
  133. [133]
    Recommendation 2: Digital Transformation
    Technological advances in Wi-Fi, cell phones, GPS, and satellite imagery have made gathering and sharing data much easier, and new types of software make this ...
  134. [134]
    Bottlenecks for Evidence Adoption | Journal of Political Economy
    We study 30 US cities that ran 73 RCTs with a national nudge unit. Cities adopt a nudge treatment into their communications in 27% of the cases.
  135. [135]
    Policy Evaluation in Polarized Polities: The Case of Randomized ...
    This paper provides a political-economic analysis of policy evaluation. We focus on Randomized Controlled Trials (RCTs) as a subset of policy evaluations.
  136. [136]
    A systematic review of barriers to and facilitators of the use of ...
    Jan 3, 2014 · The most frequently reported barriers to evidence uptake were poor access to good quality relevant research, and lack of timely research output.
  137. [137]
    Scientific evidence and public policy: a systematic review of barriers ...
    Barriers included institutional fragmentation, limited access to actionable data, political resistance to scientific inputs, and lack of incentives for ...
  138. [138]
    Evidence-based policymaking is not like evidence-based medicine ...
    Apr 26, 2017 · The most frequently-reported barriers relate to problems with disseminating high quality information effectively, namely the lack of time, ...
  139. [139]
    The challenges of scaling effective interventions: A path forward for ...
RCT evidence by itself offers an incomplete prediction of the effects of policy, due to heterogenous effects, spillovers and general equilibrium changes, ...
  140. [140]
    Implementing successful small interventions at a large scale is hard
    Mar 19, 2020 · Cost issues can make scaling prohibitive. Programs that have large and promising effects when delivered to small numbers of households or firms ...
  141. [141]
    [PDF] Let's Take the Con Out of Randomized Control Trials in Development
May 1, 2021 · Abstract. The enthusiasm for the potential of RCTs in development rests in part on the assumption that the use of the rigorous evidence that ...
  142. [142]
    If Randomised Control Trials (RCTs) improve global development ...
    Apr 21, 2020 · While the 'randomistas' proffer RCTs as the most rigorous approach to impact evaluation, there has been a pushback from critics on its gold-standard claim.
  143. [143]
    The challenges of scaling effective interventions: A path forward for ...
    We suggest strategies for tightening the link between development research and anti-poverty policy, for example, by changing the practice of RCTs.