Missing data
Missing data refers to the absence of recorded values for variables in a dataset, despite intentions to collect them, arising from factors such as non-response in surveys, equipment failures, or participant dropout in longitudinal studies.[1] The phenomenon is ubiquitous in empirical research across fields including statistics, epidemiology, and the social sciences, where it can introduce bias, reduce statistical power, and undermine causal inferences if mishandled.[2] In 1976, statistician Donald Rubin established a foundational taxonomy of missing data mechanisms based on how the probability of missingness depends on observed or unobserved data: missing completely at random (MCAR), where missingness is independent of all data; missing at random (MAR), where it depends only on observed data; and missing not at random (MNAR), where it depends on the unobserved data itself.[3][4] Distinguishing these mechanisms is critical: MCAR permits unbiased analyses via simple methods like listwise deletion, MAR often requires model-based adjustments such as multiple imputation to preserve validity, and MNAR demands sensitivity analyses acknowledging untestable assumptions about the missingness process.[5] Handling strategies have evolved from ad hoc deletions to sophisticated techniques like multiple imputation by chained equations (MICE) and maximum likelihood estimation, which account for uncertainty in imputed values and yield more efficient estimates under MAR.[6][7] Despite these advances, challenges persist in MNAR scenarios, where no standard method fully mitigates bias without auxiliary information or causal modeling, highlighting the need for preventive designs such as robust data collection protocols that minimize missingness from the outset.[8][9]
Definition and Mechanisms
Core Definition
Missing data refers to the absence of recorded values for one or more variables in an observation or dataset, where such values would otherwise be meaningful for analysis.[6] This issue arises in empirical studies when data points are not collected or stored, distinct from structural absences like deliberate design choices in experimental setups.[10] The presence of missing data complicates statistical inference by potentially distorting parameter estimates, unless appropriately addressed through methods that account for the underlying missingness process.[7]

The foundational framework for missing data analysis, developed by Donald B. Rubin in 1976, classifies missingness mechanisms based on the probability that a data value Y is missing, denoted by the indicator R=1 if missing and R=0 if observed.[10] This probability, P(R \mid Y), determines the ignorability of missingness for likelihood-based inference. Under missing completely at random (MCAR), P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R), meaning missingness is independent of both observed (Y_{\text{obs}}) and missing (Y_{\text{mis}}) values; for example, random equipment failure unrelated to study variables.[4][2] Missing at random (MAR) holds when P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}), so missingness depends only on observed data, allowing valid analysis via the observed-data likelihood under correct model specification.[11][5] Missing not at random (MNAR) occurs otherwise, with P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) depending on unobserved values, introducing non-ignorable bias that requires explicit modeling of the missingness process.[2][12]

This taxonomy, elaborated in subsequent works by Roderick Little and Rubin, underpins methods like complete-case analysis (valid under MCAR), imputation (often assuming MAR), and selection models for MNAR.[7] Distinguishing mechanisms empirically is challenging, as tests for MCAR versus MAR exist but cannot confirm MAR over MNAR without untestable assumptions.[13] Sensitivity analyses are recommended to assess robustness across plausible mechanisms.[14]
Classification of Missingness Mechanisms
The classification of missingness mechanisms in statistical analysis of incomplete data was introduced by Donald B. Rubin in his 1976 paper on inference with missing data. This framework categorizes the processes generating missing values into three distinct types based on the relationship between the missingness indicator and the data values: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).[2] These categories determine the assumptions under which unbiased inferences can be drawn and influence the choice of appropriate imputation or modeling strategies.[5]

Missing completely at random (MCAR) implies that the probability of a data point being missing is independent of both the observed data and the would-be observed values of the missing data.[2] Formally, if R denotes the missingness indicator, Y the full data vector, and X fully observed covariates, MCAR holds when P(R \mid Y, X) = P(R), meaning missingness arises from external factors like random equipment failure without systematic patterns.[4] Under MCAR, complete-case analysis yields unbiased estimates, though with reduced sample size and efficiency.[15]

Missing at random (MAR) relaxes MCAR by allowing the probability of missingness to depend on observed data but not on the missing values themselves, conditional on those observed portions.[2] Mathematically, P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, X) = P(R \mid Y_{\text{obs}}, X), where Y_{\text{obs}} and Y_{\text{mis}} partition the data into observed and missing components.[5] For instance, dropout in longitudinal studies due to age or baseline responses exemplifies MAR, as missingness correlates with recorded variables.[4] MAR permits methods like multiple imputation to recover unbiased results by leveraging observed patterns, assuming the model correctly specifies the dependencies.[2]

Missing not at random (MNAR), also termed nonignorable missingness, occurs when the probability of missing data directly depends on the unobserved values, even after conditioning on observed data: P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, X) \neq P(R \mid Y_{\text{obs}}, X).[2] This mechanism introduces inherent bias, as seen in surveys where nonresponse correlates with unreported sensitive outcomes like income levels exceeding thresholds.[15] Distinguishing MNAR empirically is challenging without auxiliary information or sensitivity analyses, as standard tests conflate it with MAR, and complete-case or naive imputation often fails to mitigate the bias.[5] Rubin's taxonomy underscores that while MCAR and MAR allow ignorability under certain models, MNAR requires explicit modeling of the missingness process for valid inference.[4]

| Mechanism | Probability Dependence | Ignorability | Example |
|---|---|---|---|
| MCAR | Independent of all data | Fully ignorable | Random file corruption[2] |
| MAR | On observed data only | Conditionally ignorable | Missing lab results due to patient demographics[15] |
| MNAR | On missing values | Non-ignorable | Self-censorship in income reporting[4] |
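The three mechanisms in the table above can be illustrated with a small simulation. The sketch below is a hypothetical example using NumPy and pandas (the variables, sample size, and missingness models are invented for illustration): it generates one fully observed dataset and deletes values under each mechanism, so that complete-case means stay close to the truth only under MCAR.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
age = rng.normal(50, 10, n)                                  # fully observed covariate
income = 20_000 + 500 * age + rng.normal(0, 5_000, n)        # variable subject to missingness

def apply_missingness(values, prob):
    """Return a copy of `values` with entries set to NaN with the given probabilities."""
    out = values.copy()
    out[rng.random(len(values)) < prob] = np.nan
    return out

mcar = apply_missingness(income, np.full(n, 0.3))                                  # constant probability
mar = apply_missingness(income, 1 / (1 + np.exp(-(age - 50) / 5)))                 # depends on observed age
mnar = apply_missingness(income, 1 / (1 + np.exp(-(income - income.mean()) / 5_000)))  # depends on income itself

df = pd.DataFrame({"age": age, "income_true": income,
                   "income_mcar": mcar, "income_mar": mar, "income_mnar": mnar})
# Complete-case means: near the true mean under MCAR, shifted under MAR and MNAR.
print(df.filter(like="income").mean())
```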
Historical Development
Pre-Modern Approaches
Early analysts of demographic and vital data, such as John Graunt in his 1662 Natural and Political Observations Made upon the Bills of Mortality, encountered incomplete records from London parish clerks, which often omitted causes of death, underreported events, or contained inconsistencies due to voluntary reporting and clerical errors. Graunt addressed these gaps through manual scrutiny, including physical inspections of records (such as breaking into a locked clerk's office to verify underreporting) and by cross-referencing available tallies to correct obvious discrepancies, effectively applying a form of available-case analysis in which only verifiable observations informed rates of christenings, burials, and diseases.[16][17] This approach allowed derivation of empirical patterns, like excess male burials and seasonal plague variations, without systematic imputation, prioritizing observed data over speculation.[18]

In the late 17th century, Edmond Halley extended such practices in his 1693 construction of the first reliable life table from Breslau (now Wrocław) vital records spanning 1687–1691, which suffered from incomplete coverage, particularly for infants and non-residents. Halley adjusted for undercounts by assuming uniform reporting within observed age groups and extrapolating survival probabilities from complete subsets, focusing on insured lives to mitigate biases from migration and unrecorded deaths.[19] These methods reflected a reliance on deletion of unverifiable cases and simple proportional scaling, common in political arithmetic, where analysts like William Petty, influenced by Graunt, tabulated incomplete Irish hearth tax data by omitting deficient returns and estimating totals from compliant districts. Such ad hoc deletions preserved computational feasibility amid manual calculations but risked biasing estimates toward better-documented subpopulations.

By the 18th and 19th centuries, as censuses and surveys proliferated, practitioners routinely applied listwise deletion, excluding entire records with any missing values to facilitate aggregation. For instance, early British decennial censuses from 1801 onward discarded incomplete household schedules during tabulation, assuming non-response reflected negligible population fractions, while American censuses similarly omitted partial enumerations to compute averages from complete cases only.[20] Rudimentary imputation emerged sporadically, such as substituting modal values or averages from similar locales for absent vital events, as seen in Quetelet's 1835 social physics analyses of Belgian data, where gaps in height or crime records were filled via group means to maintain sample sizes for averaging.[21] These techniques, devoid of probabilistic frameworks, underscored a pragmatic focus on usable data subsets, often introducing unacknowledged selection biases that later formal methods would quantify.[22]
Formalization in the Late 20th Century
The formalization of missing data mechanisms in statistical inference was advanced significantly by Donald B. Rubin in his 1976 paper "Inference and Missing Data," published in Biometrika. Rubin introduced a rigorous framework using missing data indicators R, where R_i = 1 if the i-th observation is missing and R_i = 0 if observed, to classify missingness based on its dependence on observed data Y_{obs} and missing data Y_{mis}. He defined three key mechanisms: missing completely at random (MCAR), where the probability of missingness P(R) is independent of both Y_{obs} and Y_{mis}; missing at random (MAR), where P(R \mid Y_{obs}, Y_{mis}) = P(R \mid Y_{obs}); and missing not at random (MNAR), where missingness depends on Y_{mis} even after conditioning on Y_{obs}.[3] This typology addressed prior oversights in statistical practice, where the generating process of missing values was often ignored, leading to biased inferences under non-MCAR conditions.

Rubin's analysis established that valid likelihood-based inference about parameters \theta in the full data distribution f(Y \mid \theta) is possible while ignoring the missingness mechanism if the data are MAR and the parameters of the full data model are distinct from those of the missingness model (ignorability).[3] These conditions, the weakest general requirements for such inferences, shifted focus from ad hoc deletion methods to mechanism-aware approaches, emphasizing empirical verification of assumptions where feasible.

Building on this foundation, Roderick J.A. Little and Rubin's 1987 book Statistical Analysis with Missing Data synthesized the framework into a comprehensive methodology, integrating Rubin's earlier work with practical tools like multiple imputation. The book formalized multiple imputation as drawing multiple plausible values for Y_{mis} from their posterior predictive distribution given Y_{obs}, then analyzing each completed dataset separately and pooling results to account for between-imputation variability, yielding valid inferences under MAR. This approach contrasted with single imputation by properly reflecting uncertainty, with theoretical guarantees derived from Rubin's Bayesian perspective on data augmentation.[23]

By the 1990s, these concepts influenced broader statistical software and guidelines, such as early implementations in SAS and S-PLUS for maximum likelihood under MAR via expectation-maximization algorithms, though MNAR required specialized sensitivity analyses due to unidentifiability without strong assumptions. Rubin's framework underscored that while MCAR and MAR enable standard methods, MNAR demands explicit modeling of selection, often via pattern-mixture or selection models, highlighting the causal interplay between data generation and observation processes.[24][25]
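Rubin's ignorability argument can be stated compactly. Writing \psi for the parameters of the missingness model, the observed-data likelihood factors under MAR, so inference about \theta can proceed from f(Y_{\mathrm{obs}} \mid \theta) alone when \theta and \psi are distinct. The display below is a schematic restatement in the notation used above, not a quotation from the cited sources.

```latex
% Observed-data likelihood; the second equality uses MAR:
% f(R | Y_obs, Y_mis, psi) = f(R | Y_obs, psi), so that factor leaves the integral.
f(Y_{\mathrm{obs}}, R \mid \theta, \psi)
  = \int f(Y_{\mathrm{obs}}, Y_{\mathrm{mis}} \mid \theta)\,
         f(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \psi)\, dY_{\mathrm{mis}}
  = f(R \mid Y_{\mathrm{obs}}, \psi)\, f(Y_{\mathrm{obs}} \mid \theta).
```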
Causes and Patterns
Practical Causes in Data Collection
In survey-based data collection, nonresponse arises when sampled individuals refuse participation, cannot be contacted, or provide incomplete answers, often due to privacy concerns, time constraints, or survey fatigue; refusal rates in household surveys typically range from 10% to 40%, varying by mode of administration such as telephone versus in-person.[26][27] Inability to respond, stemming from factors like language barriers, cognitive limitations, or absence during contact attempts, further exacerbates unit nonresponse in cross-sectional studies.[28]

Longitudinal studies encounter attrition as a primary cause, where participants drop out between waves due to relocation, death, loss of interest, or competing demands, leading to cumulative missingness rates of 20-50% over multiple rounds in population panels.[29] Item nonresponse, distinct from unit nonresponse, occurs when respondents skip specific questions on sensitive topics like income or health, with skip rates increasing with questionnaire length or perceived intrusiveness.[30]

In experimental and observational settings, technical malfunctions such as equipment failure, sensor errors, or power disruptions result in unobserved measurements; for example, hardware breakdowns in laboratory instruments can nullify data from entire trials, while network interruptions in remote sensing erase telemetry records.[31][32] Human procedural errors during manual recording, including transcription omissions or rushed fieldwork, contribute to sporadic missing values, particularly in resource-limited environments where data entry lags collection.[31] Budgetary or logistical constraints often truncate data collection prematurely, as in underfunded studies where follow-ups are abbreviated, yielding systematically absent observations from hard-to-reach subgroups.[33] Poorly designed protocols, such as ambiguous questions or inadequate sampling frames, induce accidental omissions or failed deliveries in digital surveys.[27] These causes, while sometimes random, frequently correlate with unobserved variables like socioeconomic status, introducing patterns beyond mere randomness.[6]
Observed Patterns and Diagnostics
Observed patterns in missing data describe the structural arrangement of absent values within a dataset, which can reveal potential dependencies or systematic absences. Univariate patterns occur when missingness is confined to a single variable across observations, often arising from isolated measurement failures. Monotone patterns feature nested missingness, where if a value is missing for one variable, all subsequent variables in a sequence (e.g., later time points in longitudinal data) are also missing, commonly seen in attrition scenarios. Arbitrary or non-monotone patterns involve irregular missingness across multiple variables without such hierarchy, complicating analysis due to potential inter-variable dependencies.[34]

Visualization techniques, such as missing data matrices or heatmaps, facilitate identification of these patterns by plotting missing indicators (e.g., 0 for observed, 1 for missing) across cases and variables, highlighting clusters and monotonicity. Influx and outflux statistics quantify data connectivity: influx measures how well a variable's missing values are accompanied by observed values on other variables, while outflux measures how well its observed values connect to missing values elsewhere, indicating its potential usefulness for imputation. Empirical studies, such as those in quality-of-life research, show that non-random patterns often cluster by subgroups, with missingness rates varying from 5-30% in clinical datasets depending on follow-up duration.[34][35]

Diagnostics for missing data mechanisms primarily test the assumption of missing completely at random (MCAR) versus alternatives like missing at random (MAR) or missing not at random (MNAR). Little's MCAR test evaluates whether observed means differ significantly across missing data patterns, using a chi-squared statistic derived from comparisons of subgroup means under the null hypothesis of MCAR; rejection (typically p < 0.05) indicates non-MCAR missingness, though the test assumes multivariate normality and performs poorly with high missingness (>20-30%) or non-normal data.[36] To probe MAR, logistic regression models the missingness indicator as a function of fully observed variables; significant predictors indicate that missingness depends on observed data and is therefore not MCAR. MNAR cannot be directly tested, as it involves unobservable dependencies, necessitating sensitivity analyses that vary assumptions about the missingness model to assess result robustness. Pattern visualization combined with auxiliary variables (e.g., comparing demographics between complete and incomplete cases) provides indirect evidence; for instance, if missingness correlates with observed age or income but not the missing outcome, MAR is plausible. Limitations include low power in small samples and inability to falsify MNAR without external data, emphasizing the need for multiple diagnostic approaches.[37][38][36]
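These diagnostics can be scripted directly. The sketch below is a hypothetical example using pandas and statsmodels (the dataset, column names, and missingness model are invented for illustration): it tabulates missingness patterns and fits a logistic regression of the missingness indicator on fully observed covariates. Significant coefficients are evidence against MCAR, but no such check can rule out MNAR.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical clinical dataset: 'lab_result' has missing values, demographics are complete.
rng = np.random.default_rng(0)
n = 2_000
df = pd.DataFrame({"age": rng.normal(60, 12, n), "female": rng.integers(0, 2, n)})
df["lab_result"] = rng.normal(5 + 0.02 * df["age"], 1)
df.loc[rng.random(n) < 1 / (1 + np.exp(-(df["age"] - 60) / 10)), "lab_result"] = np.nan

# 1. Inspect the missingness pattern: per-variable counts and distinct case-by-variable patterns.
print(df.isna().sum())
pattern = df.isna().astype(int)          # 1 = missing, 0 = observed
print(pattern.value_counts())

# 2. Informal MCAR check: regress the missingness indicator on observed covariates.
y = df["lab_result"].isna().astype(int)
X = sm.add_constant(df[["age", "female"]])
print(sm.Logit(y, X).fit(disp=0).summary())
```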
Consequences for Analysis
Introduction of Bias and Variance Issues
Missing data introduces bias into statistical estimators when the mechanism of missingness violates the missing completely at random (MCAR) assumption, such as under missing at random (MAR) or missing not at random (MNAR) conditions, where missingness depends on observed covariates or on the unobserved values themselves, respectively.[6] In complete-case analysis, which discards units with any missing values, the observed subsample becomes systematically unrepresentative of the full population, leading to inconsistent estimates of parameters like means, regression coefficients, or associations; for instance, if higher-income respondents are more likely to refuse income questions (MNAR), mean income estimates will be biased downward.[39][40] This bias persists even in large samples unless the missingness mechanism is explicitly modeled and accounted for, as naive methods fail to correct for the selection process inherent in the data collection.[39]

Beyond bias, missing data elevates the variance of estimators due to the effective reduction in sample size, which diminishes precision and widens confidence intervals; for example, the variance of the sample mean scales inversely with the number of complete observations, so a 20% missingness rate can increase the variance by up to 25% relative to the full dataset under MCAR.[6] Listwise deletion, a common ad hoc approach, not only amplifies this sampling variance but also underestimates the variance-covariance structure of variables with missing values, propagating errors into downstream parameters like correlations or standard errors in regression models.[40] Imputation methods exacerbate variance issues if not properly adjusted: single imputation treats filled values as known, artificially reducing uncertainty and yielding overly narrow standard errors, whereas multiple imputation aims to restore appropriate variability by incorporating imputation uncertainty, though it requires valid modeling of the missingness to avoid residual bias.[39][41]

These bias and variance distortions collectively inflate the mean squared error (MSE) of predictions or inferences, compromising the reliability of analyses in fields like epidemiology and econometrics, where even modest missingness (e.g., 10-15%) can shift effect sizes by 20% or more if unaddressed.[42] Empirical studies confirm that ignoring non-ignorable missingness often results in both directional bias and inefficient estimators, underscoring the need for sensitivity analyses to assess robustness across plausible missing data mechanisms.[43][39]
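A small simulation makes both effects concrete (the data and numbers are illustrative assumptions, not results from the cited studies): complete-case analysis of an MNAR income variable shifts the estimated mean downward, while a purely random 20% loss leaves the mean unbiased but inflates its sampling variance by roughly 1/0.8 = 1.25.

```python
import numpy as np

rng = np.random.default_rng(1)
income = rng.lognormal(mean=10.5, sigma=0.5, size=50_000)

# MNAR: higher incomes are more likely to be withheld -> complete-case mean is biased downward.
p_miss = 1 / (1 + np.exp(-(np.log(income) - 10.5) / 0.25))
observed_mnar = income[rng.random(income.size) >= p_miss]
print(income.mean(), observed_mnar.mean())       # complete-case mean falls below the true mean

# MCAR: a 20% random loss keeps the mean unbiased but widens its standard error.
observed_mcar = income[rng.random(income.size) >= 0.2]
se_full = income.std(ddof=1) / np.sqrt(income.size)
se_cc = observed_mcar.std(ddof=1) / np.sqrt(observed_mcar.size)
print(se_full, se_cc, (se_cc / se_full) ** 2)    # variance ratio close to 1/0.8 = 1.25
```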
Loss of Statistical Power and Efficiency
Missing data reduces the effective sample size in analyses, leading to a loss of statistical power, which is the probability of correctly rejecting a false null hypothesis in hypothesis testing. This diminution increases the risk of Type II errors, where true effects go undetected due to insufficient evidence. In complete-case analysis, where observations with any missing values are discarded, the sample size shrinks proportionally to the missingness rate; for example, if 20% of data are missing under missing completely at random (MCAR) conditions, power calculations effectively operate on 80% of the original sample, as if the study were underpowered from the outset.[44][45][46]

Even under MCAR, where complete-case estimators remain unbiased, the reduced sample size inflates the variance of estimates, compromising their efficiency relative to full-data counterparts. Efficiency here denotes the precision of estimators, typically assessed via asymptotic relative efficiency or variance ratios; missing data effectively scales the information matrix by the retention proportion, necessitating larger initial samples to match the precision of complete-data analysis. This inefficiency manifests in wider confidence intervals and less reliable inference, particularly in multivariate settings where missingness compounds across variables.[47][48]

Under missing at random (MAR) or missing not at random (MNAR) mechanisms, power losses can be more severe if unaddressed, as partial information from observed data is discarded in simplistic methods, further eroding efficiency without the unbiased guarantee of MCAR. Model-based approaches, such as maximum likelihood estimation, can preserve more efficiency by utilizing all available data, but they require correct specification of the missingness mechanism to avoid compounded power deficits. Empirical studies confirm that ignoring missing data routinely halves power in moderate missingness scenarios (e.g., 25-50% missing), underscoring the need for deliberate handling to maintain analytical rigor.[49][50]
Handling Techniques
Deletion-Based Methods
Deletion-based methods for handling missing data entail the removal of incomplete observations or variables from the dataset prior to analysis, thereby utilizing only the fully observed cases or pairs. These approaches are computationally straightforward and serve as default options in many statistical software packages, such as SPSS and SAS, where listwise deletion is often automatically applied.[6] They avoid introducing assumptions about the underlying data-generating process beyond those required for the substantive model, but they can substantially reduce effective sample size, particularly when missingness is prevalent.[51]

The primary variant is listwise deletion, also known as complete-case analysis, which excludes any observation containing at least one missing value across the variables of interest. This method ensures a consistent sample for all parameters estimated in the model, preserving the integrity of multivariate analyses like regression or factor analysis. For instance, in a dataset with 1,000 cases where 10% have missing values on one predictor, listwise deletion might retain only 900 cases, assuming independence of missingness patterns. It yields unbiased estimates under the missing completely at random (MCAR) assumption, where missingness is unrelated to observed or unobserved data, but introduces bias under missing at random (MAR) or missing not at random (MNAR) mechanisms unless the completers form a representative subsample.[52][53] Moreover, it diminishes statistical power and increases variance, as demonstrated in simulations where power drops by up to 20-30% with 15% missing data under MCAR.[54]

In contrast, pairwise deletion (or available-case analysis) retains data for each pair of variables analyzed, excluding only those specific pairs with missing values. This maximizes information use (for correlations, each pairwise coefficient is computed from all non-missing pairs), potentially retaining more data than listwise deletion when missingness is scattered. However, it risks producing inconsistent sample sizes across estimates (e.g., varying from 800 to 950 cases per pair in a 1,000-case dataset), which can lead to biased standard errors or non-positive definite covariance matrices in procedures like principal component analysis. Pairwise deletion also assumes MCAR for unbiasedness and is less suitable for models requiring fixed samples, such as logistic regression.[55][56]

Less commonly, variable deletion removes entire predictors with excessive missingness (e.g., >50% missing), preserving sample size at the cost of model specification. All deletion methods perform adequately when missing data proportions are low (<5%), but their validity hinges on empirical diagnostics like Little's MCAR test, which rejects MCAR if p < 0.05, signaling potential bias. Critics note that these methods discard potentially informative data, exacerbating inefficiency in small samples or high-dimensional settings, prompting preference for imputation or modeling under MAR.[6][57]
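The contrast between the two deletion variants is easy to see in code. The following pandas sketch (synthetic data, illustrative only) drops incomplete rows for listwise deletion and relies on pandas' default pairwise-complete handling for correlations, then prints the differing pairwise sample sizes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(1_000, 3)), columns=["x1", "x2", "x3"])
for col in df.columns:                               # scatter ~10% missingness per column
    df.loc[rng.random(len(df)) < 0.10, col] = np.nan

# Listwise deletion (complete-case analysis): every row with any missing value is dropped,
# giving one consistent, but smaller, sample for all estimates.
complete_cases = df.dropna()
print(len(df), len(complete_cases))

# Pairwise deletion (available-case analysis): each correlation uses all rows observed
# for that particular pair, so effective sample sizes differ across entries.
print(df.corr())                                            # pairwise-complete correlations
print(df.notna().astype(int).T @ df.notna().astype(int))    # pairwise sample sizes
```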
Imputation Strategies
Imputation strategies replace missing values in a dataset with estimated values derived from observed data, enabling the use of complete-case analyses while attempting to mitigate bias introduced by deletion methods.[58] These approaches range from simple deterministic techniques to sophisticated stochastic procedures that account for uncertainty in the estimates.[59] Single imputation methods generate one replacement value per missing entry, often leading to underestimation of variance and distortion of associations, whereas multiple imputation creates several plausible datasets to propagate imputation uncertainty into inference.[60]

Simple single imputation techniques, such as mean, median, or mode substitution, fill missing values with central tendencies computed from observed cases in the same variable.[61] These methods are computationally efficient and preserve sample size but introduce systematic bias by shrinking variability toward the mean and ignoring relationships with other variables; for instance, mean imputation reduces standard errors by up to 20-30% in simulations under missing at random (MAR) scenarios.[62] Regression-based imputation predicts missing values using linear models fitted on observed predictors, offering improvement over unconditional means by incorporating covariate information, yet it still fails to reflect imputation error, resulting in overly precise confidence intervals.[63] Hot-deck imputation draws replacements from observed values in similar cases, classified as random hot-deck (within strata) or deterministic variants, which better preserves data distributions in empirical studies compared to mean substitution but can propagate errors if donor pools are small.[64]

Multiple imputation (MI), formalized by Donald Rubin in 1987, addresses limitations of single imputation by generating m (typically 5-20) imputed datasets through iterative simulation, analyzing each separately, and pooling results via Rubin's rules to adjust variances for between- and within-imputation variability.[65] Under MAR assumptions, MI yields unbiased estimates and valid inference, outperforming single methods in Monte Carlo simulations where it reduces mean squared error by 10-50% relative to complete-case analysis depending on missingness proportion (e.g., 20% missing).[66] Procedures like multivariate normal MI or chained equations (iterative conditional specification) adapt to data types, with the latter handling non-normal or mixed variables by sequentially modeling each as a function of the others.[58] Empirical comparisons confirm MI's robustness, though it requires larger m for high missingness (>30%) or non-ignorable mechanisms to avoid coverage shortfalls below nominal 95% levels.[67]

Advanced strategies incorporate machine learning, such as k-nearest neighbors (KNN) imputation, which averages values from the k most similar observed cases based on distance metrics, or random forest-based methods that leverage ensemble predictions to capture nonlinear interactions.[68] These perform competitively in high-dimensional settings, with studies showing KNN reducing bias by 15-25% over parametric regression in categorical data, but they demand substantial computational resources and risk overfitting without cross-validation.[69] Selection among strategies hinges on missing data mechanisms, proportions (e.g., <5% favors simple methods for efficiency), and validation via sensitivity analyses, as no universal optimum exists; for example, MI excels under MAR but may falter if data are missing not at random without auxiliary variables.[62]
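A minimal sketch of multiple imputation with Rubin's rules is shown below, using scikit-learn's IterativeImputer with posterior sampling as the imputation engine. The dataset, the target quantity (the mean of the incomplete variable), and the choice of m = 20 imputations are illustrative assumptions rather than recommendations from the cited sources.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)
n = 1_000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)
y[rng.random(n) < 1 / (1 + np.exp(-x))] = np.nan            # MAR: missingness driven by observed x
data = np.column_stack([x, y])

m = 20                                                       # number of imputed datasets
estimates, variances = [], []
for i in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    completed = imputer.fit_transform(data)
    y_imp = completed[:, 1]
    estimates.append(y_imp.mean())                           # per-dataset estimate of E[y]
    variances.append(y_imp.var(ddof=1) / n)                  # and its squared standard error

# Rubin's rules: total variance = within-imputation + (1 + 1/m) * between-imputation variance.
q_bar = np.mean(estimates)
w = np.mean(variances)
b = np.var(estimates, ddof=1)
total_var = w + (1 + 1 / m) * b
print(q_bar, np.sqrt(total_var))
```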
Model-Based Procedures
Model-based procedures for handling missing data involve specifying a joint probability distribution for the observed and missing variables, typically under the missing at random (MAR) assumption, to derive likelihood-based inferences or imputations. These methods leverage parametric or semiparametric models to maximize the observed-data likelihood, avoiding explicit imputation in some cases while accounting for uncertainty in others. Unlike deletion or simple imputation, they integrate missingness into the estimation process, potentially yielding more efficient estimators when the model is correctly specified.[7]

A primary approach is full information maximum likelihood (FIML), which computes parameter estimates by directly maximizing the likelihood function based solely on observed data patterns, without requiring complete cases. FIML is particularly effective for multivariate normal data or generalized linear models, as it uses all available information across cases, reducing bias under MAR compared to listwise deletion. For instance, in regression analyses with missing covariates or outcomes, FIML adjusts standard errors to reflect data incompleteness, maintaining valid inference if the model encompasses the data-generating process.[70][71]

Computationally, the expectation-maximization (EM) algorithm facilitates maximum likelihood estimation when closed-form solutions are unavailable, iterating between an E-step that computes expected values of the complete-data sufficient statistics given the current parameters and an M-step that updates the parameters as if the data were complete. Introduced by Dempster, Laird, and Rubin in 1977, EM converges to local maxima under regularity conditions, with applications in finite mixture models and latent variable analyses featuring missingness. Its efficiency stems from avoiding multiple simulations, though it requires careful initialization to avoid poor local optima.[72][73]

Multiple imputation (MI) extends model-based principles by generating multiple plausible datasets from a posterior predictive distribution under a specified model, followed by separate analyses and pooling of results via Rubin's rules to incorporate imputation uncertainty. Joint modeling approaches, such as multivariate normal imputation, assume a full-data model (e.g., via Markov chain Monte Carlo), while sequential methods like chained equations approximate it by univariate conditional models. Little and Rubin (2020) emphasize MI's robustness for complex data structures, as it preserves multiplicity in inferences, outperforming single imputation in variance estimation; however, results depend on model adequacy, with simulations showing degradation under misspecification.[7][74]

Bayesian model-based methods further generalize these by sampling from the full posterior, incorporating priors on parameters and treating missing values as latent variables, often via data augmentation. This framework unifies maximum likelihood and MI under a probabilistic umbrella, enabling hierarchical modeling for clustered data with missingness. Empirical studies indicate Bayesian imputation tracks complete-data estimates closely when priors are weakly informative, offering advantages in small samples over frequentist alternatives.[74][75]

Overall, model-based procedures excel in efficiency and validity under MAR when the posited model captures substantive relations, but demand diagnostic checks for assumption violations, such as sensitivity analyses to MNAR scenarios. Software implementations, including PROC MIANALYZE in SAS and packages like mice in R, facilitate their application, though users must verify convergence and model fit via information criteria like AIC.[76][77]
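To make the E- and M-steps concrete, the sketch below runs EM for the mean and covariance of a bivariate normal sample in which the second variable is partially missing; the data-generating process and missingness model are invented for illustration and are not drawn from the cited sources.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
mu_true = np.array([0.0, 1.0])
cov_true = np.array([[1.0, 0.6], [0.6, 1.5]])
sample = rng.multivariate_normal(mu_true, cov_true, size=n)
y1, y2 = sample[:, 0], sample[:, 1].copy()
miss = rng.random(n) < 1 / (1 + np.exp(-y1))     # MAR: missingness in y2 depends on observed y1
y2[miss] = np.nan

# EM for the mean and covariance of (y1, y2) with y2 partially missing.
mu = np.array([y1.mean(), np.nanmean(y2)])
sigma = np.cov(y1, np.where(miss, np.nanmean(y2), y2))
for _ in range(200):
    # E-step: expected y2, y2^2 and y1*y2 for missing cases, given y1 and current parameters.
    beta = sigma[0, 1] / sigma[0, 0]
    cond_mean = mu[1] + beta * (y1 - mu[0])
    cond_var = sigma[1, 1] - beta * sigma[0, 1]
    e_y2 = np.where(miss, cond_mean, y2)
    e_y2sq = np.where(miss, cond_mean**2 + cond_var, y2**2)
    # M-step: maximum likelihood updates from the expected sufficient statistics.
    mu = np.array([y1.mean(), e_y2.mean()])
    s11 = np.mean(y1**2) - mu[0] ** 2
    s22 = np.mean(e_y2sq) - mu[1] ** 2
    s12 = np.mean(y1 * e_y2) - mu[0] * mu[1]
    sigma = np.array([[s11, s12], [s12, s22]])

print(mu, sigma)   # should approach the true mean and covariance as n grows
```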
Assumptions, Limitations, and Controversies
Required Assumptions for Validity
The validity of methods for handling missing data depends on untestable assumptions about the missingness mechanism, which describes the relationship between the probability of data being missing and the values of the observed and unobserved variables. These mechanisms are categorized as missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Under MCAR, the probability of missingness is independent of both observed and missing values, permitting unbiased complete-case analysis or simple deletion methods without introducing systematic error, though with potential efficiency loss.[39][78] This assumption is stringent and rarely holds in practice, as it implies no systematic patterns in missingness; it can be checked only partially, through tests like Little's MCAR test, which assess uniformity in observed data distributions but cannot confirm independence from unobserved values.[4]

MAR, a weaker and more plausible assumption, posits that missingness depends only on observed data (including covariates) and not on the missing values themselves, formalized as P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) in the notation introduced above. This enables consistent inference via multiple imputation by chained equations (MICE) or maximum likelihood estimation under the ignorability condition, where the observed-data likelihood factors correctly, provided the model for missingness and outcomes is correctly specified.[39][78] Model-based procedures, such as full information maximum likelihood, rely on MAR for the parameters of the observed data distribution to be identifiable without bias, assuming the parametric form captures the data-generating process adequately; violations, such as omitted variables correlating with both missingness and outcomes, can lead to inconsistent estimates.[7]

MNAR occurs when missingness directly depends on the unobserved values, rendering standard methods invalid without additional, often untestable, structural assumptions about the missingness process, such as selection models or pattern-mixture models that parameterize the dependence. No universally valid approach exists under MNAR; it requires sensitivity analyses comparing results across plausible MNAR scenarios, since the true mechanism cannot be empirically distinguished from MAR using observed data alone.[78][39]

For all mechanisms, auxiliary variables strongly correlated with missingness can enhance robustness under MAR by improving imputation models, but they do not mitigate MNAR bias. Empirical verification of these assumptions is limited; diagnostics like comparing observed patterns across missingness indicators provide evidence against MCAR but cannot falsify MAR or identify MNAR definitively.[4]
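Sensitivity analysis for MNAR departures is often implemented as a delta adjustment in the pattern-mixture spirit: impute under MAR, shift only the imputed values by a range of offsets representing plausible violations, and track how the estimate moves. The sketch below uses scikit-learn's IterativeImputer with a single imputation per delta for brevity; the data, delta grid, and target estimate are illustrative assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(5)
n = 1_000
x = rng.normal(size=n)
y = 1.0 + 0.8 * x + rng.normal(size=n)
y_obs = y.copy()
y_obs[rng.random(n) < 0.3] = np.nan
data = np.column_stack([x, y_obs])
missing = np.isnan(y_obs)

# Impute once under MAR, then shift only the imputed values by a user-chosen delta.
completed = IterativeImputer(random_state=0).fit_transform(data)
for delta in [-1.0, -0.5, 0.0, 0.5, 1.0]:        # delta = 0.0 reproduces the MAR analysis
    y_adj = completed[:, 1].copy()
    y_adj[missing] += delta                       # MNAR departure: missing values differ by delta
    print(delta, y_adj.mean())                    # how sensitive is the estimate to the assumption?
```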
Risks and Criticisms of Common Methods
Deletion-based methods, such as listwise deletion, risk introducing substantial bias when missingness violates the missing completely at random (MCAR) assumption, as the removal of incomplete cases can distort parameter estimates by systematically excluding observations correlated with the missing values.[6] This approach also leads to reduced statistical power and inefficient use of available data, particularly in datasets with high missingness rates, where sample sizes can shrink dramatically and inflate standard errors.[79] Pairwise deletion, while preserving more data for certain analyses, exacerbates inconsistencies in sample composition across correlations, potentially yielding unstable covariance matrices and misleading inference.[6]

Simple imputation techniques, including mean or median substitution, systematically underestimate variability and distort associations by shrinking imputed values toward the center of observed distributions, thereby biasing regression coefficients and confidence intervals even under MCAR conditions.[80] Regression-based single imputation may mitigate some bias but still fails to account for uncertainty in predictions, leading to overconfident estimates and invalid hypothesis tests.[81] Multiple imputation addresses variance underestimation by generating plausible datasets but requires the missing at random (MAR) assumption, which, if unverified, can propagate errors from incorrect imputation models, especially when auxiliary variables inadequately capture dependencies.[82] Critics note that multiple imputation's reliance on repeated simulations demands large observed samples for reliable imputations and can produce nonsensical results if applied mechanistically without domain-specific insight into missing mechanisms.[82][83]

Model-based procedures, like the expectation-maximization (EM) algorithm, assume a specified parametric form for the data-generating process, which introduces bias if the model is misspecified or if missingness is missing not at random (MNAR), as unverifiable dependencies on unobserved values render likelihood-based corrections invalid.[84] Convergence in EM can be slow or fail with extensive missingness (exceeding 50% in some variables) due to iterative instability, and computational demands scale poorly for high-dimensional data.[6] Full information maximum likelihood methods similarly hinge on MAR, yielding asymptotically efficient estimates only under correct specification; deviations, common in real-world MNAR scenarios like selective non-response in surveys, result in attenuated effects or reversed associations.[84]

Across methods, a pervasive criticism is the untestable nature of MAR/MCAR assumptions, fostering overreliance on diagnostics like Little's test that lack power against subtle violations, ultimately undermining causal inferences in non-experimental settings.[85][39]
Debates on Method Selection
Deletion methods, such as listwise deletion, remain popular due to their simplicity and their validity under the missing completely at random (MCAR) mechanism, with unbiasedness extending to missing at random (MAR) settings only in special cases, such as regression analyses where missingness depends solely on fully observed covariates; however, they reduce sample size and statistical power and introduce bias under MAR and missing not at random (MNAR) conditions prevalent in real-world scenarios like survey nonresponse correlated with outcomes.[39][86] Imputation techniques, by contrast, preserve sample size and can enhance efficiency under MAR by filling gaps with predicted values, but critics argue they risk amplifying errors if the imputation model is misspecified or fails to capture complex dependencies; single imputation underestimates variance, while multiple imputation (MI) addresses this by generating several datasets and pooling results, though it demands correct auxiliary-variable inclusion and substantial computational resources.[62][87]

A central contention revolves around the MAR assumption underpinning most imputation and maximum likelihood methods, which is often untestable and optimistic; empirical simulations demonstrate that while MI outperforms deletion in power under MAR with 10-30% missingness, both falter under MNAR without explicit modeling of selection processes, as in Heckman correction or pattern-mixture models, leading to calls for sensitivity analyses to probe robustness rather than defaulting to MAR-based approaches.[88][89] Proponents of model-based procedures like full information maximum likelihood (FIML) highlight their avoidance of explicit imputation, relying instead on likelihood contributions from incomplete cases, yet detractors note similar vulnerability to MAR violations and less intuitiveness in high-dimensional settings compared to flexible MI chains.[39][90]

Empirical comparisons across simulated datasets reveal no universally superior method; for instance, in partial least squares structural equation modeling with up to 20% missing data, MI and predictive mean matching yielded lower bias than mean imputation or deletion under MAR, but generalized additive models excelled in nonlinear MNAR cases, underscoring the need for mechanism-informed selection over rote application.[91][66] Critics of overreliance on MI in observational studies point to its sensitivity to the fraction of missing information (FMI), where high FMI (>0.5) inflates standard errors unless augmented with strong predictors, while advocates counter that empirical evidence from clinical trials favors MI for intent-to-treat analyses when MAR holds plausibly.[92][93]

Ultimately, debates emphasize context-specific trade-offs (deletion for low missingness and verifiable MCAR, imputation for efficiency gains under plausible MAR), prioritizing diagnostics like Little's test and global pattern assessments over arbitrary thresholds like 5% missingness dictating method choice.[94][95]
Recent Developments
Integration with Machine Learning
Machine learning workflows commonly incorporate missing data handling as a preprocessing step, where imputation replaces absent values to enable model training, since many algorithms such as linear regression and neural networks require complete datasets.[31] Tree-based ensemble methods like random forests and gradient boosting machines (e.g., XGBoost) integrate missing data natively through mechanisms such as surrogate splits or treating missingness as a distinct category, allowing predictions without explicit imputation and often preserving performance under moderate missingness rates up to 20-30%.[96][97]

Advanced imputation strategies leverage ML itself, including k-nearest neighbors (KNN) for local similarity-based filling and iterative methods like multiple imputation by chained equations (MICE), which model each variable conditionally on others using regressions or classifications.[93] Model-based approaches such as missForest apply random forests iteratively to impute multivariate data, outperforming simpler mean or median substitutions in preserving data distribution and improving downstream classifier accuracy, as demonstrated in benchmarks with missingness ratios from 0% to 50%.[98][99]

Recent integrations emphasize generative models, including variational autoencoders (VAEs) and generative adversarial networks (GANs) for synthesizing plausible missing values while capturing complex dependencies, particularly effective for high-dimensional data like images or time series where traditional methods distort variance.[100] These methods, evaluated in clinical datasets, reduce imputation error metrics like root mean square error by 10-20% over statistical baselines under missing at random assumptions, though they demand larger samples to avoid overfitting.[101] Ensemble imputation combining multiple ML learners further enhances robustness, with studies confirming superior predictive performance in supervised tasks compared to single algorithms.[31] Empirical assessments highlight that imputation quality directly correlates with ML efficacy, underscoring the need for method selection aligned with missing data mechanisms to mitigate bias amplification in causal inference pipelines.[102]
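The two integration styles, native handling versus impute-then-fit, can be compared in a few lines with scikit-learn; the sketch below is a hypothetical benchmark on synthetic data (the model choices and missingness rate are assumptions, not reproductions of the cited benchmarks).

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2_000) > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan            # ~20% of entries missing

# Tree-based model: NaN inputs are routed natively during split finding, no imputation needed.
native = HistGradientBoostingClassifier(random_state=0)
print(cross_val_score(native, X, y, cv=5).mean())

# Linear model: impute first, then fit, inside one pipeline to avoid leakage across CV folds.
piped = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression(max_iter=1000))
print(cross_val_score(piped, X, y, cv=5).mean())
```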
Advances in Generative and Scalable Methods
Generative adversarial networks (GANs) have emerged as a prominent approach for missing data imputation by pitting a generator against a discriminator to produce realistic synthetic values that align with observed data distributions. Introduced in frameworks like GAIN in 2018, these methods treat imputation as an adversarial game, where the generator fills missing entries while the discriminator identifies them, enabling handling of complex dependencies under missing at random (MAR) assumptions.[103] Recent enhancements, such as improved GAN architectures proposed in 2024, incorporate advanced loss functions and network designs to boost imputation accuracy on tabular datasets, outperforming traditional methods like k-nearest neighbors in metrics such as mean absolute error.[104] Scalable variants, including differentiable GAN-based systems like SCIS from 2022, accelerate training for large-scale data by optimizing gradients directly, reducing computational overhead compared to non-differentiable predecessors.[105]

Variational autoencoders (VAEs) complement GANs by learning latent representations of data, facilitating probabilistic imputation that captures uncertainty in missing values. Models like TVAE, adapted for tabular data, encode observed features into a low-dimensional space and decode imputations, showing superior performance in preserving correlations on benchmarks with up to 50% missingness.[106] Hybrid approaches, such as those combining VAEs with genetic algorithms for hyperparameter tuning, further refine imputation for biomedical datasets, achieving lower root mean squared error than multiple imputation by chained equations (MICE).[107] For scalability, denoising autoencoder-based methods like MIDAS, developed in 2021, enable efficient multiple imputation on datasets exceeding millions of observations by leveraging deep neural networks for rapid inference.[108]

Diffusion models represent a newer generative paradigm, iteratively denoising data to impute missing values by modeling forward and reverse diffusion processes conditioned on observed entries. The DiffPuter framework, introduced in 2024 and accepted at ICLR 2025, integrates diffusion with expectation-maximization to handle arbitrary missing patterns, demonstrating state-of-the-art results on synthetic and real-world benchmarks under MAR and missing not at random (MNAR) scenarios.[109] Tabular-specific diffusion models like TabCSDI, from 2022, address mixed data types and achieve scalability through conditional sampling, with empirical evaluations showing reduced bias in downstream tasks like classification compared to GANs.[110] These methods scale to high-dimensional data by parallelizing diffusion steps, though they require careful tuning of noise schedules to avoid mode collapse in sparse regimes.[111]
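The idea these systems share, reconstructing plausible values for masked entries from the observed ones, can be sketched with a small masked denoising autoencoder in PyTorch; this is a generic illustration on assumed toy data, not an implementation of GAIN, MIDAS, TVAE, TabCSDI, or DiffPuter.

```python
import numpy as np
import torch
from torch import nn

# Toy numeric table with ~20% of entries missing completely at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 8)).astype("float32")
mask = rng.random(X.shape) < 0.2                                 # True where a value is missing
x = torch.tensor(np.where(mask, 0.0, X), dtype=torch.float32)    # zero-fill missing entries
obs = torch.tensor(~mask, dtype=torch.float32)                   # 1 where observed, 0 where missing

# Small denoising autoencoder: reconstruct each row from a randomly corrupted version.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))
optim = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(300):
    optim.zero_grad()
    drop = (torch.rand_like(x) > 0.2).float()             # randomly hide extra cells each step
    recon = model(x * drop)
    loss = ((recon - x) ** 2 * obs).sum() / obs.sum()     # reconstruction error on observed cells only
    loss.backward()
    optim.step()

# Impute: keep observed values, fill missing cells with the network's reconstructions.
with torch.no_grad():
    X_imputed = np.where(mask, model(x).numpy(), X)
```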
Implementation Tools
Statistical Software Packages
SAS provides the PROC MI procedure for multiple imputation of missing data, supporting methods such as parametric regression, logistic regression for classification variables, and fully conditional specification (FCS) for flexible multivariate imputation.[112] This procedure generates multiple imputed datasets, allowing users to assess missing data patterns with the NIMPUTE=0 option and incorporate imputed values into subsequent analyses like PROC MIANALYZE for pooling results.[113] PROC MI handles arbitrary missing data patterns and is particularly effective for datasets assuming missing at random (MAR), though users must verify assumptions empirically.[114]

IBM SPSS Statistics offers a Missing Values module for exploratory analysis, including pattern detection via Analyze Patterns and estimation of missing values using expectation-maximization (EM) algorithms.[115] The software supports multiple imputation through its dedicated procedure, which imputes missing data under MAR assumptions and provides diagnostics for convergence and plausibility.[116] SPSS distinguishes system-missing (absent values) from user-defined missing values, enabling tailored handling in analyses while warning against complete-case deletion biases in large-scale surveys.[117]

Stata's mi command suite facilitates multiple imputation for incomplete datasets, with mi impute chained implementing multivariate imputation by chained equations (MICE) for non-monotone patterns and mi impute monotone for sequential imputation.[118] Users can set mi styles (e.g., wide or flong) to store imputations, explore patterns via mi describe, and combine results using mi estimate for Rubin's rules-based inference.[119] Stata supports passive variables and constraints, making it suitable for complex survey data, but requires careful specification of imputation models to avoid bias under MNAR mechanisms.[120]

R, as an open-source statistical environment, integrates missing data handling through specialized packages rather than core functions, with Amelia performing bootstrapping-based multiple imputation for cross-sectional and time-series data under MAR.[121] The package generates multiple completed datasets efficiently, outperforming single imputation in variance estimation, as validated in simulations with up to 50% missingness.[122] Complementary tools like the mice package enable MICE for flexible predictive mean matching and regression-based imputation across variable types.[123] These packages prioritize empirical diagnostics, such as trace plots for convergence, over the default listwise deletion common in legacy software.[124]

| Software | Key Procedure/Package | Supported Methods | Pattern Handling |
|---|---|---|---|
| SAS | PROC MI | Regression, FCS, Propensity Score | Arbitrary |
| SPSS | Missing Values Analysis | EM, Multiple Imputation | Exploratory, MAR |
| Stata | mi impute | Chained Equations, Monotone | Non-monotone |
| R (Amelia) | amelia() | Bootstrapping | Cross-sectional, Time-series |
Programming Libraries and Frameworks
Several programming libraries in Python facilitate missing data handling, with scikit-learn offering imputation transformers including SimpleImputer for strategies like mean or median substitution and IterativeImputer for multivariate feature modeling via iterative regression.[125] Specialized packages such as MIDASpy extend this to multiple imputation using deep learning methods, achieving higher accuracy in benchmarks compared to traditional approaches for certain datasets.[126] The gcimpute package supports imputation across diverse variable types, including continuous, binary, and truncated data, as detailed in its 2024 Journal of Statistical Software publication.[127]

In R, the mice package implements multiple imputation by chained equations (MICE), generating plausible values from predictive distributions and enabling analysis of uncertainty via pooled results, a method validated in numerous empirical studies since its introduction.[128] Complementary tools like missForest use random forests for nonparametric imputation, performing robustly under missing at random assumptions without requiring normality.[129] The CRAN Missing Data Task View catalogs additional options such as Amelia for expectation-maximization algorithms and naniar for visualization and pattern detection, emphasizing exploration prior to imputation to assess mechanisms like missing completely at random.[123]

Julia provides built-in support for missing values via the missing singleton, with packages like Impute.jl offering interpolation methods for vectors, matrices, and tables, including linear and spline-based approaches suitable for time series or spatial data.[130] The Mice.jl package ports R's MICE functionality, supporting chained equations for multiple imputation in high-performance computing environments.[131]
In machine learning frameworks, scikit-learn's imputers integrate seamlessly into pipelines, allowing preprocessing before model fitting, while emerging tools like MLimputer automate regression-based imputation tailored to predictive tasks.[132] These libraries generally assume mechanisms like missing at random for validity, with users advised to verify assumptions empirically to avoid biased inferences.[31]
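As a hedged illustration of the pipeline integration described above, the sketch below combines per-column-type imputers inside a scikit-learn Pipeline so that imputation parameters are learned only from the training folds during cross-validation; the dataset, columns, and imputer choices are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "age": rng.normal(40, 12, n),
    "score": rng.normal(0, 1, n),
    "group": rng.choice(["a", "b", "c"], n),
})
df.loc[rng.random(n) < 0.15, "age"] = np.nan          # numeric missingness
df.loc[rng.random(n) < 0.10, "group"] = np.nan        # categorical missingness
y = (df["score"] + rng.normal(0, 1, n) > 0).astype(int)

# Different imputers per column type, fitted inside the pipeline so cross-validation
# never borrows information from held-out folds when estimating imputation parameters.
preprocess = ColumnTransformer([
    ("num", KNNImputer(n_neighbors=5), ["age", "score"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["group"]),
])
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])
print(cross_val_score(clf, df, y, cv=5).mean())
```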