
Missing data

Missing data refers to the absence of recorded values for variables in a dataset, despite intentions to collect them, arising from factors such as non-response in surveys, equipment failures, or participant dropout in longitudinal studies. This phenomenon is ubiquitous in empirical research across fields including statistics and the social sciences, where it can introduce bias, reduce statistical power, and undermine causal inferences if mishandled. In 1976, statistician Donald Rubin established a foundational classification of missing data mechanisms based on the probability of missingness depending on observed or unobserved data: missing completely at random (MCAR), where missingness is independent of all data; missing at random (MAR), where it depends only on observed data; and missing not at random (MNAR), where it depends on the unobserved data itself. Distinguishing these mechanisms is critical, as MCAR permits unbiased analyses via simple methods like listwise deletion, whereas MAR often requires model-based adjustments such as multiple imputation to preserve validity, and MNAR demands sensitivity analyses acknowledging untestable assumptions about the missingness process. Handling strategies have evolved from ad hoc deletions to sophisticated techniques like multiple imputation by chained equations (MICE), which account for uncertainty in imputed values and yield more efficient estimates under MAR. Despite advances, challenges persist in MNAR scenarios, where no standard method fully mitigates bias without auxiliary information or causal modeling, highlighting the need for preventive designs such as robust data-collection protocols that minimize missingness.

Definition and Mechanisms

Core Definition

Missing data refers to the absence of recorded values for one or more variables in an observation or dataset, where such values would otherwise be meaningful for analysis. This issue arises in empirical studies when data points are not collected or stored, distinct from structural absences like deliberate design choices in experimental setups. The presence of missing data complicates statistical inference by potentially distorting parameter estimates, unless appropriately addressed through methods that account for the underlying missingness process. The foundational framework for missing data analysis, developed by Donald B. Rubin in 1976, classifies missingness mechanisms based on the probability that a data value Y is missing, denoted by an indicator R with R=1 if missing and R=0 if observed. This probability, P(R \mid Y), determines the ignorability of missingness for likelihood-based inference. Under missing completely at random (MCAR), P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R), meaning missingness is independent of both observed (Y_{\text{obs}}) and missing (Y_{\text{mis}}) values; for example, random equipment failure unrelated to study variables. Missing at random (MAR) holds when P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}), so missingness depends only on observed data, allowing valid analysis via the observed-data likelihood under correct model specification. Missing not at random (MNAR) occurs otherwise, with P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) depending on unobserved values, introducing non-ignorable bias that requires explicit modeling of the missingness process. This taxonomy, elaborated in subsequent works by Roderick Little and Donald Rubin, underpins methods like complete-case analysis (valid under MCAR), imputation (often assuming MAR), and selection models for MNAR. Distinguishing mechanisms empirically is challenging, as tests for MCAR versus MAR exist but cannot confirm MAR over MNAR without untestable assumptions. Sensitivity analyses are recommended to assess robustness across plausible mechanisms.
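
These definitions can be made concrete with a small simulation. The sketch below, using hypothetical parameter values and a synthetic covariate x and outcome y, draws missingness indicators under each of Rubin's three mechanisms and compares the observed mean of y with its true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Fully observed covariate x and outcome y before any values are removed.
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n)

# MCAR: missingness is independent of both x and y.
r_mcar = rng.random(n) < 0.3

# MAR: the probability of missingness depends only on the observed covariate x.
p_mar = 1.0 / (1.0 + np.exp(-(x - 0.5)))
r_mar = rng.random(n) < p_mar

# MNAR: the probability of missingness depends on the (unobserved) value of y itself.
p_mnar = 1.0 / (1.0 + np.exp(-(y - 2.5)))
r_mnar = rng.random(n) < p_mnar

for label, r in [("MCAR", r_mcar), ("MAR", r_mar), ("MNAR", r_mnar)]:
    y_obs = y[~r]
    print(f"{label}: observed mean = {y_obs.mean():.3f}, "
          f"true mean = {y.mean():.3f}, missing = {r.mean():.0%}")
```

Under MCAR the observed mean stays close to the truth; under MAR it drifts because missingness tracks x, which correlates with y, but it can be recovered by modeling y given x; under MNAR the distortion cannot be corrected from the observed data alone.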

Classification of Missingness Mechanisms

The classification of missingness mechanisms in statistical analysis of incomplete data was introduced by Donald B. Rubin in his 1976 paper on inference with missing data. This framework categorizes the processes generating missing values into three distinct types based on the relationship between the missingness indicator and the data values: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). These categories determine the assumptions under which unbiased inferences can be drawn and influence the choice of appropriate imputation or modeling strategies. Missing completely at random (MCAR) implies that the probability of a data point being missing is independent of both the observed and the would-be observed values of the data. Formally, if R denotes the missingness indicator and Y the full data vector, MCAR holds when P(R \mid Y, X) = P(R \mid X), where X are covariates unrelated to missingness, meaning missingness arises from external factors like random equipment failure without systematic patterns. Under MCAR, complete-case analysis yields unbiased estimates, though with reduced sample size and efficiency. Missing at random (MAR) extends MCAR by allowing the probability of missingness to depend on observed data but not on the missing values themselves, conditional on those observed portions. Mathematically, P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, X) = P(R \mid Y_{\text{obs}}, X), where Y_{\text{obs}} and Y_{\text{mis}} partition the data into observed and missing components. For instance, dropout in longitudinal studies that is predicted by previously recorded responses exemplifies MAR, as missingness correlates with recorded variables. MAR permits methods like multiple imputation to recover unbiased results by leveraging observed patterns, assuming the model correctly specifies the dependencies. Missing not at random (MNAR), also termed nonignorable missingness, occurs when the probability of missingness directly depends on the unobserved values, even after conditioning on observed data: P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, X) \neq P(R \mid Y_{\text{obs}}, X). This introduces inherent bias, as seen in surveys where nonresponse correlates with unreported sensitive outcomes such as income levels exceeding reporting thresholds. Distinguishing MNAR empirically is challenging without auxiliary information or sensitivity analyses, as standard tests conflate it with MAR, and complete-case analysis or naive imputation often fails to mitigate the resulting bias. Rubin's framework underscores that while MCAR and MAR allow ignorability under certain models, MNAR requires explicit modeling of the missingness process for valid inference.
Mechanism | Probability dependence | Ignorability | Example
MCAR | Independent of all data | Fully ignorable | Random file corruption
MAR | Depends on observed data only | Conditionally ignorable | Missing results due to demographics
MNAR | Depends on the missing values themselves | Non-ignorable | Nonresponse in income reporting

Historical Development

Pre-Modern Approaches

Early analysts of demographic and vital statistics, such as John Graunt in his 1662 Natural and Political Observations Made upon the Bills of Mortality, encountered incomplete records from parish clerks, which often omitted causes of death, underreported events, or contained inconsistencies due to voluntary reporting and clerical errors. Graunt addressed these gaps through manual scrutiny, including physical inspections of records—such as breaking into a locked clerk's office to verify underreporting—and by cross-referencing available tallies to correct obvious discrepancies, effectively applying a form of available-case analysis where only verifiable observations informed rates of christenings, burials, and diseases. This approach allowed derivation of empirical patterns, like excess male burials and seasonal variations, without systematic imputation, prioritizing observed data over speculation. In the late 17th century, Edmond Halley extended such practices in his 1693 construction of the first reliable life table from Breslau (now Wrocław) vital records spanning 1687–1691, which suffered from incomplete coverage, particularly for infants and non-residents. Halley adjusted for undercounts by assuming uniform reporting within observed age groups and extrapolating survival probabilities from complete subsets, focusing on insured lives to mitigate biases from migration and unrecorded deaths. These methods reflected a reliance on deletion of unverifiable cases and simple proportional scaling, common in political arithmetic, where analysts like William Petty—influenced by Graunt—tabulated incomplete Irish data by omitting deficient returns and estimating totals from compliant districts. Such ad hoc deletions preserved computational feasibility amid manual calculations but risked biasing estimates toward better-documented subpopulations. By the 18th and 19th centuries, as censuses and surveys proliferated, practitioners routinely applied listwise deletion, excluding entire records with any missing values to facilitate aggregation. For instance, early decennial censuses discarded incomplete household schedules during tabulation, assuming non-response reflected negligible population fractions, while other national censuses similarly omitted partial enumerations to compute averages from complete cases only. Rudimentary imputation emerged sporadically, such as substituting averages from similar locales for absent vital events, as seen in Quetelet's 1835 social physics analyses of Belgian data, where gaps in height or crime records were filled via group means to maintain sample sizes for averaging. These techniques, devoid of probabilistic frameworks, underscored a pragmatic focus on usable subsets, often introducing unacknowledged selection biases that later statisticians would quantify.

Formalization in the Late 20th Century

The formalization of missing data mechanisms in statistical inference was advanced significantly by Donald B. Rubin in his 1976 paper "Inference and Missing Data," published in Biometrika. Rubin introduced a rigorous framework using missing data indicators R, where R_i = 1 if the i-th observation is missing and R_i = 0 if observed, to classify missingness based on its dependence on observed data Y_{obs} and missing data Y_{mis}. He defined three key mechanisms: missing completely at random (MCAR), where the probability of missingness P(R) is independent of both Y_{obs} and Y_{mis}; missing at random (MAR), where P(R \mid Y_{obs}, Y_{mis}) = P(R \mid Y_{obs}); and missing not at random (MNAR), where missingness depends on Y_{mis} even after conditioning on Y_{obs}. This addressed prior oversights in statistical practice, where the process generating missing values was often ignored, leading to biased inferences under non-MCAR conditions. Rubin's framework established that valid likelihood-based inference about parameters \theta in the full-data model f(Y \mid \theta) is possible while ignoring the missingness mechanism if data are MAR and the parameters of the full-data model are distinct from those of the missingness model (ignorability). These conditions, the weakest general requirements for such inference, shifted focus from deletion methods to mechanism-aware approaches, emphasizing empirical verification of assumptions where feasible. Building on this foundation, Rubin and Roderick J.A. Little's 1987 book Statistical Analysis with Missing Data synthesized the framework into a comprehensive treatment, integrating Rubin's earlier work with practical tools like multiple imputation. The book formalized multiple imputation as drawing multiple plausible values for Y_{mis} from their predictive distribution given Y_{obs}, then analyzing each completed dataset separately and pooling results to account for between-imputation variability, yielding valid inferences under MAR. This approach contrasted with single imputation by properly reflecting imputation uncertainty, with theoretical guarantees derived from Rubin's Bayesian perspective on missing data. By the 1990s, these concepts influenced broader statistical software and guidelines, including early implementations of maximum likelihood estimation under MAR via expectation-maximization algorithms, though MNAR required specialized sensitivity analyses due to unidentifiability without strong assumptions. Rubin's framework underscored that while MCAR and MAR enable standard methods, MNAR demands explicit modeling of selection, often via pattern-mixture or selection models, highlighting the causal interplay between data generation and observation processes.

Causes and Patterns

Practical Causes in Data Collection

In survey-based data collection, nonresponse arises when sampled individuals refuse participation, cannot be contacted, or provide incomplete answers, often due to privacy concerns, time constraints, or survey fatigue; refusal rates in surveys typically range from 10% to 40%, varying by mode of administration, such as telephone versus in-person. Inability to respond, stemming from factors like language barriers, cognitive limitations, or absence during contact attempts, further exacerbates nonresponse in cross-sectional studies. Longitudinal studies encounter attrition as a primary cause, where participants drop out between waves due to relocation, illness, or competing demands, leading to cumulative missingness rates of 20-50% over multiple rounds in panels. Item nonresponse, distinct from unit nonresponse, occurs when respondents skip specific questions on sensitive topics like income or health, with skip rates increasing with questionnaire length or perceived intrusiveness. In experimental and observational settings, technical malfunctions such as equipment failure, software errors, or power disruptions result in unobserved measurements; for example, hardware breakdowns in laboratory instruments can nullify data from entire trials, while network interruptions in online collection erase records. Human procedural errors during manual recording, including transcription omissions or rushed fieldwork, contribute to sporadic missing values, particularly in resource-limited environments where data entry lags collection. Budgetary or logistical constraints often truncate data collection prematurely, as in underfunded studies where follow-ups are abbreviated, yielding systematically absent observations from hard-to-reach subgroups. Poorly designed protocols, such as ambiguous questions or inadequate sampling frames, induce accidental omissions or failed deliveries in digital surveys. These causes, while sometimes random, frequently correlate with unobserved respondent characteristics, introducing patterns beyond mere randomness.

Observed Patterns and Diagnostics

Observed patterns in missing data describe the structural arrangement of absent values within a dataset, which can reveal potential dependencies or systematic absences. Univariate patterns occur when missingness is confined to a single variable across observations, often arising from isolated measurement failures. Monotone patterns feature nested missingness, where if a value is missing for one variable, all subsequent variables in a sequence (e.g., later time points in longitudinal data) are also missing, commonly seen in dropout scenarios. Arbitrary or non-monotone patterns involve irregular missingness across multiple variables without such hierarchy, complicating analysis due to potential inter-variable dependencies. Visualization techniques, such as missing data matrices or heatmaps, facilitate identification of these patterns by plotting missing indicators (e.g., 0 for observed, 1 for missing) across cases and variables, highlighting clusters, monotonicity, or outflux (connections from observed to missing data). Influx patterns quantify how missing values link to observed data in other variables, aiding in assessing how well the dataset supports imputation. Empirical studies, such as those in quality-of-life research, show that non-random patterns often cluster by subgroups, with missingness rates varying from 5-30% in clinical datasets depending on follow-up duration. Diagnostics for missing data mechanisms primarily test the assumption of missing completely at random (MCAR) versus alternatives like missing at random (MAR) or missing not at random (MNAR). Little's MCAR test evaluates whether observed means differ significantly across missing data patterns, using a chi-squared statistic derived from comparisons of subgroup means under the null hypothesis of MCAR; rejection (typically p < 0.05) indicates non-MCAR missingness, though the test assumes multivariate normality and performs poorly with high missingness (>20-30%) or non-normal data. To distinguish MAR, logistic regression models the missingness indicator as a function of fully observed variables; significant predictors suggest MAR, as missingness depends on observed data but not the missing values themselves. MNAR cannot be directly tested, as it involves unobservable dependencies, necessitating sensitivity analyses that vary assumptions about the missingness model to assess result robustness. Pattern visualization combined with auxiliary variables (e.g., comparing demographics between complete and incomplete cases) provides indirect evidence; for instance, if missingness correlates with observed covariates but not the missing outcome, MAR is plausible. Limitations include low power in small samples and inability to falsify MNAR without external information, emphasizing the need for multiple diagnostic approaches.
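
As an illustration of these diagnostics, the sketch below tabulates missing-data patterns and regresses a missingness indicator on fully observed variables using statsmodels; the dataset and column names (age, score, income) are hypothetical, and the missingness is simulated to depend on age.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2_000
df = pd.DataFrame({
    "age": rng.integers(20, 70, size=n),
    "score": rng.normal(50, 10, size=n),
})
income = 20_000 + 500 * df["age"] + rng.normal(0, 5_000, size=n)
# Income is more likely to be missing for older respondents (a MAR-type pattern).
missing = rng.random(n) < 1.0 / (1.0 + np.exp(-(df["age"] - 45) / 10))
df["income"] = income.where(~missing, np.nan)

# 1. Missing-data pattern: 0 = observed, 1 = missing, counted per unique pattern.
indicator = df.isna().astype(int)
print(indicator.value_counts())

# 2. Logistic regression of the missingness indicator for income on fully
#    observed variables; significant predictors argue against MCAR.
X = sm.add_constant(df[["age", "score"]])
fit = sm.Logit(indicator["income"], X).fit(disp=0)
print(fit.summary().tables[1])
```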

Consequences for Analysis

Introduction of Bias and Variance Issues

Missing data introduces bias into statistical estimators when the mechanism of missingness violates the missing completely at random (MCAR) assumption, such as under missing at random (MAR) or missing not at random (MNAR) conditions, where missingness depends on observed covariates or the unobserved values themselves, respectively. In complete-case analysis, which discards units with any missing values, the observed subsample becomes systematically unrepresentative of the full population, leading to inconsistent estimates of parameters like means, regression coefficients, or associations; for instance, if higher-income respondents are more likely to refuse income questions (MNAR), mean income estimates will be downward biased. This bias persists even in large samples unless the missingness mechanism is explicitly modeled and accounted for, as naive methods fail to correct for the selection process inherent in the data collection. Beyond bias, missing data elevates the variance of estimators due to the effective reduction in sample size, which diminishes precision and widens confidence intervals; for example, the variance of the sample mean scales inversely with the number of complete observations, so a 20% missingness rate can increase variance by up to 25% relative to the full dataset under MCAR. Listwise deletion, a common approach, not only amplifies this sampling variance but also underestimates the variance-covariance structure of variables with missing values, propagating errors into downstream parameters like correlations or standard errors in regression models. Imputation methods exacerbate variance issues if not properly adjusted: single imputation treats filled values as known, artificially reducing variability and yielding overly narrow standard errors, whereas multiple imputation aims to restore appropriate variability by incorporating between-imputation uncertainty, though it requires valid modeling of the missingness to avoid bias. These bias and variance distortions collectively inflate the mean squared error (MSE) of predictions or inferences, compromising the reliability of analyses in applied fields, where even modest missingness (e.g., 10-15%) can shift effect sizes by 20% or more if unaddressed. Empirical studies confirm that ignoring non-ignorable missingness often results in both directionally biased and inefficient estimators, underscoring the need for sensitivity analyses to assess robustness across plausible missing data mechanisms.
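
A short simulation illustrates both effects under MCAR with an assumed 20% missingness rate: the standard error of the mean grows for complete cases, while naive mean imputation overstates precision.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
y = rng.normal(100, 15, n)

# Remove 20% of values completely at random (MCAR).
miss = rng.random(n) < 0.20
y_obs = y[~miss]

# Complete-case analysis: unbiased under MCAR, but the standard error grows
# roughly by a factor of 1/sqrt(1 - 0.20) because fewer observations remain.
se_full = y.std(ddof=1) / np.sqrt(n)
se_cc = y_obs.std(ddof=1) / np.sqrt(y_obs.size)
print(f"SE full data: {se_full:.3f}, SE complete cases: {se_cc:.3f}")

# Naive single (mean) imputation: the point estimate is unchanged, but treating
# the filled-in values as real data understates variability and the SE.
y_imp = np.where(miss, y_obs.mean(), y)
se_naive = y_imp.std(ddof=1) / np.sqrt(n)
print(f"SE after mean imputation (overstated precision): {se_naive:.3f}")
```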

Loss of Statistical Power and Efficiency

Missing data reduces the effective sample size in analyses, leading to a loss of statistical power, which is the probability of correctly rejecting a false null hypothesis in significance testing. This diminution increases the risk of Type II errors, where true effects go undetected due to insufficient sample size. In complete-case analysis, where observations with any missing values are discarded, the sample size shrinks proportionally to the missingness rate; for example, if 20% of data are missing under missing completely at random (MCAR) conditions, power calculations effectively operate on 80% of the original sample, as if the study were underpowered from the outset. Even under MCAR, where complete-case estimators remain unbiased, the reduced sample size inflates the variance of estimates, compromising their efficiency relative to full-data counterparts. Efficiency here denotes the precision of estimators, typically assessed via asymptotic relative efficiency or variance ratios; missing data effectively scales the information matrix by the retention proportion, necessitating larger initial samples to match the precision of complete-data analysis. This inefficiency manifests in wider confidence intervals and less reliable inference, particularly in multivariate settings where missingness compounds across variables. Under missing at random (MAR) or missing not at random (MNAR) mechanisms, power losses can be more severe if unaddressed, as partial information from observed data is discarded in simplistic methods, further eroding efficiency without the unbiasedness guarantee of MCAR. Model-based approaches, such as maximum likelihood estimation, can preserve more efficiency by utilizing all available data, but they require correct specification of the missingness mechanism to avoid compounded power deficits. Empirical studies confirm that ignoring missing data routinely halves power in moderate missingness scenarios (e.g., 25-50% missingness), underscoring the need for deliberate handling to maintain analytical rigor.
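
The power loss from complete-case analysis under MCAR can be approximated by plugging the reduced sample size into a standard power calculation, as in this sketch using statsmodels with an assumed effect size of 0.3 and 100 cases per group.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.3          # assumed Cohen's d for the illustration
n_per_group = 100
alpha = 0.05

for miss_rate in (0.0, 0.25, 0.50):
    # Complete-case analysis under MCAR behaves as if only the retained
    # fraction of each group had been recruited.
    n_eff = int(n_per_group * (1 - miss_rate))
    power = analysis.power(effect_size=effect_size, nobs1=n_eff,
                           alpha=alpha, ratio=1.0)
    print(f"missingness {miss_rate:.0%}: effective n/group = {n_eff}, power = {power:.2f}")
```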

Handling Techniques

Deletion-Based Methods

Deletion-based methods for handling missing data entail the removal of incomplete observations or variables from the dataset prior to analysis, thereby utilizing only the fully observed cases or pairs. These approaches are computationally straightforward and serve as default options in many statistical software packages, where listwise deletion is often applied automatically. They avoid introducing assumptions about the underlying data-generating process beyond those required for the substantive model, but they can substantially reduce effective sample size, particularly when missingness is prevalent. The primary variant is listwise deletion, also known as complete-case analysis, which excludes any observation containing at least one missing value across the variables of interest. This method ensures a consistent sample for all parameters estimated in the model, preserving the integrity of multivariate analyses such as regression. For instance, in a dataset with 1,000 cases where 10% have missing values on one predictor, listwise deletion might retain only 900 cases, assuming independence of missingness patterns. It yields unbiased estimates under the missing completely at random (MCAR) assumption, where missingness is unrelated to observed or unobserved data, but introduces bias under missing at random (MAR) or missing not at random (MNAR) mechanisms unless the completers form a representative subsample. Moreover, it diminishes statistical power and increases variance, as demonstrated in simulations where power drops by up to 20-30% with 15% missing data under MCAR. In contrast, pairwise deletion (or available-case analysis) retains data for each pair of variables analyzed, excluding only those specific pairs with missing values. This maximizes information use—for correlations, it computes each pairwise coefficient from all non-missing pairs—potentially retaining more data than listwise deletion when missingness is scattered. However, it risks producing inconsistent sample sizes across estimates (e.g., varying from 800 to 950 cases per pair in a 1,000-case dataset), which can lead to biased standard errors or non-positive definite covariance matrices in procedures like structural equation modeling. Pairwise deletion also assumes MCAR for unbiasedness and is less suitable for models requiring a fixed analysis sample, such as multiple regression. Less commonly, variable deletion removes entire predictors with excessive missingness (e.g., >50% missing), preserving sample size at the cost of model specification. All deletion methods perform adequately when missing data proportions are low (<5%), but their validity hinges on empirical diagnostics like Little's MCAR test, which rejects MCAR if p < 0.05, signaling potential bias. Critics note that these methods discard potentially informative data, exacerbating inefficiency in small samples or high-dimensional settings, prompting preference for imputation or modeling under MAR.
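
In pandas, the two deletion variants correspond to dropping incomplete rows before computing statistics versus relying on pairwise-complete computations. The sketch below, on simulated data with scattered 10% missingness per column, shows how listwise deletion shrinks the sample while pairwise correlations rest on differing sample sizes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 1_000
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
# Scatter roughly 10% missingness independently across the three columns.
for col in df.columns:
    df.loc[rng.random(n) < 0.10, col] = np.nan

# Listwise deletion (complete-case analysis): drop any row with a missing value.
complete = df.dropna()
print("rows retained under listwise deletion:", len(complete))
print(complete.corr().round(2))

# Pairwise deletion: each correlation uses all rows observed for that pair, so
# different cells of the matrix can rest on different sample sizes.
print(df.corr().round(2))  # pandas computes correlations pairwise by default
observed = df.notna().astype(int)
print(observed.T @ observed)  # jointly observed counts per variable pair
```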

Imputation Strategies

Imputation strategies replace missing values in a dataset with estimated values derived from observed data, enabling the use of complete-case analyses while attempting to mitigate bias introduced by deletion methods. These approaches range from simple deterministic techniques to sophisticated stochastic procedures that account for uncertainty in the estimates. Single imputation methods generate one replacement value per missing entry, often leading to underestimation of variance and distortion of associations, whereas multiple imputation creates several plausible datasets to propagate imputation uncertainty into inference. Simple single imputation techniques, such as mean, median, or mode substitution, fill missing values with central tendencies computed from observed cases in the same variable. These methods are computationally efficient and preserve sample size but introduce systematic bias by shrinking variability toward the mean and ignoring relationships with other variables; for instance, mean imputation reduces standard errors by up to 20-30% in simulations under missing at random (MAR) scenarios. Regression-based imputation predicts missing values using linear models fitted on observed predictors, offering improvement over unconditional means by incorporating covariate information, yet it still fails to reflect imputation error, resulting in overly precise confidence intervals. Hot-deck imputation draws replacements from observed values in similar cases, classified as random hot-deck (within strata) or deterministic variants, which better preserves data distributions in empirical studies compared to mean substitution but can propagate errors if donor pools are small. Multiple imputation (MI), formalized by Donald Rubin in 1987, addresses limitations of single imputation by generating m (typically 5-20) imputed datasets through iterative simulation, analyzing each separately, and pooling results via Rubin's rules to adjust variances for between- and within-imputation variability. Under MAR assumptions, MI yields unbiased estimates and valid inference, outperforming single methods in Monte Carlo simulations where it reduces mean squared error by 10-50% relative to complete-case analysis depending on missingness proportion (e.g., 20% missing). Procedures like multivariate normal MI or chained equations (fully conditional specification) adapt to data types, with the latter handling non-normal or mixed variables by sequentially modeling each as a function of the others. Empirical comparisons confirm MI's robustness, though it requires larger m for high missingness (>30%) or non-ignorable mechanisms to avoid coverage shortfalls below nominal 95% levels. Advanced strategies incorporate machine learning, such as k-nearest neighbors (KNN) imputation, which averages values from the k most similar observed cases based on distance metrics, or random forest-based methods that leverage ensemble predictions to capture nonlinear interactions. These perform competitively in high-dimensional settings, with studies showing KNN reducing imputation error by 15-25% over simple substitution in categorical data, but they demand substantial computational resources and risk overfitting without cross-validation. Selection among strategies hinges on missing data mechanisms, proportions (e.g., <5% favors simple methods for efficiency), and validation via sensitivity analyses, as no universal optimum exists; for example, MI excels under MAR but may falter if data are missing not at random without auxiliary variables.
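
The contrast between unconditional and conditional single imputation can be seen with scikit-learn's imputers. The sketch below, on simulated correlated variables with 30% of one column missing, compares mean substitution, iterative (chained-regression) imputation, and KNN imputation; note that a single IterativeImputer pass produces one completed dataset, not a full multiple imputation.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
X = np.column_stack([x1, x2])
X_miss = X.copy()
X_miss[rng.random(n) < 0.3, 1] = np.nan  # 30% of x2 missing

for name, imputer in [
    ("mean", SimpleImputer(strategy="mean")),
    ("iterative (chained regressions)", IterativeImputer(random_state=0)),
    ("k-nearest neighbors", KNNImputer(n_neighbors=5)),
]:
    X_imp = imputer.fit_transform(X_miss)
    # Mean imputation shrinks the variance of x2 and attenuates its correlation
    # with x1; conditional methods preserve both more faithfully.
    corr = np.corrcoef(X_imp[:, 0], X_imp[:, 1])[0, 1]
    print(f"{name:32s} var(x2)={X_imp[:, 1].var():.2f}  corr={corr:.2f}  "
          f"(true var={X[:, 1].var():.2f}, true corr={np.corrcoef(x1, x2)[0, 1]:.2f})")
```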

Model-Based Procedures

Model-based procedures for handling missing data involve specifying a joint probability distribution for the observed and missing variables, typically under the missing at random (MAR) assumption, to derive likelihood-based inferences or imputations. These methods leverage parametric or semiparametric models to maximize the observed-data likelihood, avoiding explicit imputation in some cases while accounting for uncertainty in others. Unlike deletion or simple imputation, they integrate missingness into the estimation process, potentially yielding more efficient estimators when the model is correctly specified. A primary approach is full information maximum likelihood (FIML), which computes parameter estimates by directly maximizing the likelihood function based solely on observed data patterns, without requiring complete cases. FIML is particularly effective for multivariate normal data or generalized linear models, as it uses all available information across cases, reducing bias under MAR compared to listwise deletion. For instance, in regression analyses with missing covariates or outcomes, FIML adjusts standard errors to reflect data incompleteness, maintaining valid inference if the model encompasses the data-generating process. Computationally, the expectation-maximization (EM) algorithm facilitates maximum likelihood estimation when closed-form solutions are unavailable, iterating between an E-step that imputes expected values for missing data given current parameters and an M-step that updates parameters as if data were complete. Introduced by Dempster, Laird, and Rubin in 1977, EM converges to local maxima under regularity conditions, with applications in finite mixture models and latent variable analyses featuring missingness. Its efficiency stems from avoiding multiple simulations, though it requires careful initialization to avoid poor local optima. Multiple imputation (MI) extends model-based principles by generating multiple plausible datasets from a posterior predictive distribution under a specified model, followed by separate analyses and pooling of results via Rubin's rules to incorporate imputation uncertainty. Joint modeling approaches, such as multivariate normal imputation, assume a full-data model (e.g., fitted via Markov chain Monte Carlo), while sequential methods like chained equations approximate it by univariate conditional models. Little and Rubin (2020) emphasize MI's robustness for complex data structures, as it preserves multiplicity in inferences, outperforming single imputation in variance estimation; however, results depend on model adequacy, with simulations showing degradation under misspecification. Bayesian model-based methods further generalize these by sampling from the full posterior, incorporating priors on parameters and treating missing values as latent variables, often via data augmentation. This framework unifies imputation and estimation under a probabilistic umbrella, enabling hierarchical modeling for clustered data with missingness. Empirical studies indicate Bayesian imputation tracks complete-data estimates closely when priors are weakly informative, offering advantages in small samples over frequentist alternatives. Overall, model-based procedures excel in efficiency and validity under MAR when the posited model captures substantive relations, but demand diagnostic checks for assumption violations, such as sensitivity analyses to MNAR scenarios.
Software implementations, including PROC MIANALYZE in SAS and packages like mice in R, facilitate their application, though users must verify convergence and model fit via information criteria like AIC.
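
The pooling step of multiple imputation is simple to express directly. The sketch below implements Rubin's rules for combining point estimates and variances across imputed datasets; the five coefficients and standard errors are made-up illustration values.

```python
import numpy as np

def pool_rubins_rules(estimates, variances):
    """Combine m point estimates and their within-imputation variances
    (one per imputed dataset) into a pooled estimate and total variance."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    q_bar = estimates.mean()              # pooled point estimate
    u_bar = variances.mean()              # average within-imputation variance
    b = estimates.var(ddof=1)             # between-imputation variance
    t = u_bar + (1 + 1 / m) * b           # total variance (Rubin, 1987)
    return q_bar, t

# Hypothetical results from m = 5 imputed datasets: a regression coefficient
# and its standard error from each completed-data analysis.
coefs = [0.52, 0.48, 0.55, 0.50, 0.47]
ses = [0.10, 0.11, 0.10, 0.12, 0.11]
q, t = pool_rubins_rules(coefs, [s ** 2 for s in ses])
print(f"pooled estimate = {q:.3f}, pooled SE = {np.sqrt(t):.3f}")
```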

Assumptions, Limitations, and Controversies

Required Assumptions for Validity

The validity of methods for handling missing data depends on untestable assumptions about the missingness mechanism, which describes the relationship between the probability of data being missing and the values of the observed and unobserved variables. These mechanisms are categorized as missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Under MCAR, the probability of missingness is independent of both observed and missing values, permitting unbiased complete-case analysis or simple deletion methods without introducing systematic error, though with potential efficiency loss. This assumption is stringent and rarely holds in practice, as it implies no systematic patterns in missingness, verifiable only through tests like Little's MCAR test, which assess uniformity in observed data distributions but cannot confirm independence from unobserved values. MAR, a weaker and more plausible assumption, posits that missingness depends only on observed data (including covariates) and not on the missing values themselves, formalized as the probability of missingness conditioning on the full data equaling the probability conditioning solely on observed data. This enables consistent inference via multiple imputation by chained equations (MICE) or maximum likelihood estimation under the missing-at-random ignorability condition, where the observed-data likelihood factors correctly, provided the model for missingness and outcomes is correctly specified. Model-based procedures, such as full-information maximum likelihood, rely on MAR for the parameters of the observed data distribution to be identifiable without bias, assuming the parametric form captures the data-generating process adequately; violations, such as omitted variables correlating with both missingness and outcomes, can lead to inconsistent estimates. MNAR occurs when missingness directly depends on the unobserved values, rendering standard methods invalid without additional, often untestable, structural assumptions about the missingness process, such as selection models or pattern-mixture models that parameterize the dependence. No universally valid approach exists under MNAR, as it requires sensitivity analyses comparing results across plausible MNAR scenarios, since the true mechanism cannot be empirically distinguished from MAR using observed data alone. For all mechanisms, auxiliary variables strongly correlated with missingness can enhance robustness under MAR by improving imputation models, but they do not mitigate MNAR bias. Empirical verification of these assumptions is limited; diagnostics like comparing observed patterns across missingness indicators provide evidence against MCAR but cannot falsify MAR or identify MNAR definitively.
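
A common form of such sensitivity analysis is delta adjustment, in which missing values are imputed under a range of assumed departures from the observed-data distribution. The sketch below simulates an MNAR scenario with made-up parameter values and shows how the estimated mean shifts as the assumed offset delta varies.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2_000
y = rng.normal(50, 10, n)
# Larger values are more often missing, an MNAR scenario we cannot verify from data.
miss = rng.random(n) < 1.0 / (1.0 + np.exp(-(y - 55) / 5))
y_obs = y[~miss]

# Delta-adjustment sensitivity analysis: impute missing values as the observed
# mean shifted by plausible offsets delta and track how the estimate moves.
for delta in (0.0, 2.5, 5.0, 7.5):
    y_filled = np.concatenate([y_obs, np.full(miss.sum(), y_obs.mean() + delta)])
    print(f"delta = {delta:4.1f}: estimated mean = {y_filled.mean():.2f} "
          f"(true mean = {y.mean():.2f})")
```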

Risks and Criticisms of Common Methods

Deletion-based methods, such as listwise deletion, risk introducing substantial bias when missingness violates the missing completely at random (MCAR) assumption, as the removal of incomplete cases can distort parameter estimates by systematically excluding observations correlated with the missing values. This approach also leads to reduced statistical power and inefficient use of available data, particularly in datasets with high missingness rates, where sample sizes can shrink dramatically and inflate standard errors. Pairwise deletion, while preserving more data for certain analyses, exacerbates inconsistencies in sample composition across correlations, potentially yielding unstable covariance matrices and misleading inference. Simple imputation techniques, including mean or median substitution, systematically underestimate variability and distort associations by shrinking imputed values toward the center of observed distributions, thereby biasing regression coefficients and confidence intervals even under MCAR conditions. Regression-based single imputation may mitigate some bias but still fails to account for uncertainty in predictions, leading to overconfident estimates and invalid hypothesis tests. Multiple imputation addresses variance underestimation by generating plausible datasets but requires the missing at random (MAR) assumption, which, if unverified, can propagate errors from incorrect imputation models, especially when auxiliary variables inadequately capture dependencies. Critics note that multiple imputation's reliance on repeated simulations demands large observed samples for reliable imputations and can produce nonsensical results if applied mechanistically without domain-specific insight into missing mechanisms. Model-based procedures, like the expectation-maximization (EM) algorithm, assume a specified parametric form for the data-generating process, which introduces bias if the model is misspecified or if missingness is missing not at random (MNAR), as unverifiable dependencies on unobserved values render likelihood-based corrections invalid. Convergence in EM can be slow or fail with extensive missingness—exceeding 50% in some variables—due to iterative instability, and computational demands scale poorly for high-dimensional data. Full information maximum likelihood methods similarly hinge on MAR, yielding asymptotically efficient estimates only under correct specification; deviations, common in real-world MNAR scenarios like selective non-response in surveys, result in attenuated effects or reversed associations. Across methods, a pervasive criticism is the untestable nature of MAR/MCAR assumptions, fostering overreliance on diagnostics like Little's test that lack power against subtle violations, ultimately undermining causal inferences in non-experimental settings.

Debates on Method Selection

Deletion methods, such as listwise deletion, remain popular due to their simplicity and validity under missing completely at random (MCAR) or certain missing at random (MAR) mechanisms, where missingness does not depend on unobserved values after conditioning on observed data; however, they reduce sample size and statistical power, potentially introducing bias under missing not at random (MNAR) conditions prevalent in real-world scenarios like survey nonresponse correlated with outcomes. Imputation techniques, by contrast, preserve sample size and can enhance efficiency under MAR by filling gaps with predicted values, but critics argue they risk amplifying errors if the imputation model is misspecified or fails to capture complex dependencies, as single imputation underestimates variance while multiple imputation (MI) addresses this by generating several datasets and pooling results, though it demands correct auxiliary variable inclusion and substantial computational resources. A central contention revolves around the MAR assumption underpinning most imputation and maximum likelihood methods, which is often untestable and optimistic; empirical simulations demonstrate that while MI outperforms deletion in power under MAR with 10-30% missingness, both falter under MNAR without explicit modeling of selection processes, as in Heckman correction or pattern-mixture models, leading to calls for sensitivity analyses to probe robustness rather than defaulting to MAR-based approaches. Proponents of model-based procedures like full information maximum likelihood (FIML) highlight their avoidance of explicit imputation, relying instead on likelihood contributions from incomplete cases, yet detractors note similar vulnerability to MAR violations and less intuitiveness in high-dimensional settings compared to flexible MI chains. Empirical comparisons across simulated datasets reveal no universally superior method; for instance, in partial least squares structural equation modeling with up to 20% missing data, multiple imputation and predictive mean matching yielded lower bias than mean imputation or deletion under MAR, but generalized additive models excelled in nonlinear MNAR cases, underscoring the need for mechanism-informed selection over rote application. Critics of overreliance on multiple imputation in observational studies point to its sensitivity to the fraction of missing information (FMI), where high FMI (>0.5) inflates standard errors unless augmented with strong predictors, while advocates counter that evidence from clinical trials favors MI for intent-to-treat analyses when MAR holds plausibly. Ultimately, debates emphasize context-specific trade-offs—deletion for low missingness and verifiable MCAR, imputation for efficiency gains under a plausible MAR assumption—prioritizing diagnostics like Little's test and global pattern assessments over arbitrary thresholds like 5% missingness dictating method choice.

Recent Developments

Integration with Machine Learning

Machine learning workflows commonly incorporate missing data handling as a preprocessing step, where imputation replaces absent values to enable model training, since many algorithms, such as standard linear models and neural networks, require complete datasets. Tree-based ensemble methods like random forests and gradient boosting machines (e.g., XGBoost) integrate missing data natively through mechanisms such as surrogate splits or treating missingness as a distinct category, allowing predictions without explicit imputation and often preserving performance under moderate missingness rates up to 20-30%. Advanced imputation strategies leverage ML itself, including k-nearest neighbors (KNN) for local similarity-based filling and iterative methods like multiple imputation by chained equations (MICE), which model each variable conditionally on others using regressions or classifications. Model-based approaches such as missForest apply random forests iteratively to impute multivariate data, outperforming simpler mean or median substitutions in preserving data distribution and improving downstream classifier accuracy, as demonstrated in benchmarks with missingness ratios from 0% to 50%. Recent integrations emphasize generative models, including variational autoencoders (VAEs) and generative adversarial networks (GANs) for synthesizing plausible missing values while capturing complex dependencies, particularly effective for high-dimensional data such as images, where traditional methods distort variance. These methods, evaluated in clinical datasets, reduce imputation error metrics like root mean squared error by 10-20% over statistical baselines under missing at random assumptions, though they demand larger samples to avoid overfitting. Ensemble imputation combining multiple ML learners further enhances robustness, with studies confirming superior predictive performance in supervised tasks compared to single algorithms. Empirical assessments highlight that imputation quality directly correlates with ML efficacy, underscoring the need for method selection aligned with missing data mechanisms to mitigate bias amplification in pipelines.
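
A missForest-style workflow can be approximated with scikit-learn by plugging a random-forest estimator into IterativeImputer and chaining it with a downstream model; the sketch below is an approximation of that idea on simulated data, not the missForest package itself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Regression data with roughly 20% of entries removed at random.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
rng = np.random.default_rng(6)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.20] = np.nan

# missForest-style imputation: iterate random-forest regressions over each
# incomplete feature, then feed the completed matrix to the downstream model.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=30, random_state=0),
    max_iter=3,
    random_state=0,
)
model = make_pipeline(imputer, Ridge())
scores = cross_val_score(model, X_miss, y, cv=5, scoring="r2")
print(f"cross-validated R^2 with RF-based imputation: {scores.mean():.2f}")
```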

Advances in Generative and Scalable Methods

Generative adversarial networks (GANs) have emerged as a prominent approach for missing data imputation by pitting a generator against a discriminator to produce realistic synthetic values that align with observed distributions. Introduced in frameworks like GAIN in 2018, these methods treat imputation as an adversarial game, where the generator fills missing entries while the discriminator identifies them, enabling handling of complex dependencies under missing at random (MAR) assumptions. Recent enhancements, such as improved architectures proposed in 2024, incorporate advanced loss functions and network designs to boost imputation accuracy on tabular datasets, outperforming traditional methods like k-nearest neighbors in metrics such as root mean squared error. Scalable variants, including differentiable imputation systems like SCIS from 2022, accelerate training for large-scale data by optimizing gradients directly, reducing computational overhead compared to non-differentiable predecessors. Variational autoencoders (VAEs) complement GANs by learning latent representations of data, facilitating probabilistic imputation that captures uncertainty in missing values. Models like TVAE, adapted for tabular data, encode observed features into a low-dimensional space and decode imputations, showing superior performance in preserving correlations on benchmarks with up to 50% missingness. Hybrid approaches, such as those combining VAEs with genetic algorithms for hyperparameter tuning, further refine imputation for biomedical datasets, achieving lower root mean squared error than multiple imputation by chained equations (MICE). For scalability, denoising autoencoder-based methods like MIDAS, developed in 2021, enable efficient multiple imputation on datasets exceeding millions of observations by leveraging deep neural networks for rapid imputation. Diffusion models represent a newer generative paradigm, iteratively denoising samples to impute missing values by modeling forward and reverse processes conditioned on observed entries. The DiffPuter framework, introduced in 2024 and accepted at ICLR 2025, integrates diffusion with expectation-maximization to handle arbitrary missing patterns, demonstrating state-of-the-art results on synthetic and real-world benchmarks under MAR and missing not at random (MNAR) scenarios. Tabular-specific models like TabCSDI, from 2022, address mixed data types and achieve scalability through conditional sampling, with empirical evaluations showing reduced bias in downstream tasks like classification compared to GANs. These methods scale to high-dimensional data by parallelizing sampling steps, though they require careful tuning of noise schedules to avoid mode collapse in sparse regimes.
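
A deliberately small PyTorch sketch conveys the core idea shared by these neural imputers: train a network to reconstruct rows from their observed entries and use the reconstructions to fill the gaps. It is not GAIN, MIDAS, or a diffusion model, and the data and architecture are purely illustrative.

```python
import numpy as np
import torch
from torch import nn

rng = np.random.default_rng(7)
n, d = 1_000, 10
Z = rng.normal(size=(n, 3))
X = Z @ rng.normal(size=(3, d)) + 0.1 * rng.normal(size=(n, d))  # correlated features
mask = rng.random((n, d)) < 0.3                                   # True = missing

X_t = torch.tensor(X, dtype=torch.float32)
m_t = torch.tensor(mask)
X_in = torch.where(m_t, torch.zeros_like(X_t), X_t)               # zero-fill the inputs

# Small denoising autoencoder: reconstruct each row from its observed part.
model = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, d))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for epoch in range(200):
    opt.zero_grad()
    recon = model(X_in)
    # Train only on observed entries; missing entries have no ground truth.
    loss = ((recon - X_t)[~m_t] ** 2).mean()
    loss.backward()
    opt.step()

# Impute: keep observed values, fill missing entries with the reconstructions.
with torch.no_grad():
    X_imp = torch.where(m_t, model(X_in), X_t).numpy()
rmse = np.sqrt(np.mean((X_imp[mask] - X[mask]) ** 2))
print(f"RMSE on the held-out missing entries: {rmse:.3f}")
```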

Implementation Tools

Statistical Software Packages

SAS provides the PROC MI procedure for multiple imputation of missing data, supporting methods such as parametric regression, logistic regression for classification variables, and fully conditional specification (FCS) for flexible multivariate imputation. This procedure generates multiple imputed datasets, allowing users to assess missing data patterns with the NIMPUTE=0 option and incorporate imputed values into subsequent analyses like PROC MIANALYZE for pooling results. PROC MI handles arbitrary missing data patterns and is particularly effective for datasets assuming missing at random (MAR), though users must verify assumptions empirically. IBM SPSS Statistics offers a Missing Values module for exploratory analysis, including pattern detection via Analyze Patterns and estimation of missing values using expectation-maximization (EM) algorithms. The software supports multiple imputation through its dedicated procedure, which imputes missing data under MAR assumptions and provides diagnostics for convergence and plausibility. SPSS distinguishes system-missing (absent values) from user-defined missing values, enabling tailored handling in analyses while warning against complete-case deletion biases in large-scale surveys. Stata's mi command suite facilitates multiple imputation for incomplete datasets, with mi impute chained implementing multivariate imputation by chained equations (MICE) for non-monotone patterns and mi impute monotone for sequential imputation. Users can set mi styles (e.g., wide or flong) to store imputations, explore patterns via mi describe, and combine results using mi estimate for Rubin's rules-based inference. Stata supports passive variables and constraints, making it suitable for complex survey data, but requires careful specification of imputation models to avoid bias under MNAR mechanisms. R, as an open-source statistical environment, integrates missing data handling through specialized packages rather than core functions, with Amelia performing bootstrapping-based multiple imputation for cross-sectional and time-series data under MAR. The Amelia package generates multiple completed datasets efficiently, outperforming single imputation in variance estimation, as validated in simulations with up to 50% missingness. Complementary tools like the mice package enable MICE with flexible predictive mean matching and regression-based imputation across variable types. These packages prioritize empirical diagnostics, such as trace plots for convergence, over the default listwise deletion common in legacy software.
Software | Key Procedure/Package | Supported Methods | Pattern Handling
SAS | PROC MI | Regression, FCS, Propensity Score | Arbitrary
SPSS | Missing Values Analysis | EM, Multiple Imputation | Exploratory, MAR
Stata | mi impute | Chained Equations, Monotone | Non-monotone
R (Amelia) | amelia() | Bootstrapping | Cross-sectional, Time-series

Programming Libraries and Frameworks

Several programming libraries in Python facilitate missing data handling, with scikit-learn offering imputation transformers including SimpleImputer for strategies like mean or median substitution and IterativeImputer for multivariate feature modeling via iterative regression. Specialized packages such as MIDASpy extend this to multiple imputation using denoising autoencoder methods, achieving higher accuracy in benchmarks compared to traditional approaches for certain datasets. The gcimpute package supports Gaussian copula-based imputation across diverse variable types, including continuous, binary, and truncated data, as detailed in its 2024 Journal of Statistical Software publication. In R, the mice package implements multiple imputation by chained equations (MICE), generating plausible values from predictive distributions and enabling analysis of uncertainty via pooled results, a method validated in numerous empirical studies since its introduction. Complementary tools like missForest use random forests for nonparametric imputation, performing robustly under missing at random assumptions without requiring normality. The CRAN Missing Data Task View catalogs additional options, including packages for expectation-maximization algorithms and naniar for visualization and pattern detection, emphasizing exploration prior to imputation to assess mechanisms like missing completely at random. Julia provides built-in support for missing values via the missing singleton, with packages like Impute.jl offering interpolation methods for vectors, matrices, and tables, including linear and spline-based approaches suitable for time-series or spatial data. The Mice.jl package ports R's MICE functionality, supporting chained equations for multiple imputation in Julia environments. In machine learning workflows, scikit-learn's imputers integrate seamlessly into pipelines, allowing preprocessing before model fitting, while emerging tools like MLimputer automate regression-based imputation tailored to predictive tasks. These libraries generally assume mechanisms like missing at random for validity, with users advised to verify assumptions empirically to avoid biased inferences.
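
One scikit-learn option not shown above is appending explicit missingness indicators during imputation, which lets a downstream model use the pattern of missingness itself as a feature; the minimal example below uses SimpleImputer with add_indicator=True on a tiny made-up array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# add_indicator=True appends binary missingness-indicator columns alongside the
# imputed features, so a downstream model can learn from the missingness pattern.
imp = SimpleImputer(strategy="median", add_indicator=True)
print(imp.fit_transform(X))
```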

    Analyzing and interpreting “imperfect” Big Data in the 1600s
    Feb 17, 2016 · This paper examines the work of John Graunt (1620–1674) in the tabulation of diseases in London and the development of a life table using the “imperfect data”
  17. [17]
    [PDF] John Graun'ts Bills of Mortality - Neonatology on the Web
    Here Graunt tells of having gone to visit a parish clerk whom he suspects of terrible under-reporting. He found the office locked and, after breaking in, found.
  18. [18]
    John Graunt F.R.S. (1620-74): The founding father of human ...
    He quantified the high infant mortality and attempted the calculation of a case fatality rate during an epidemic of fever. He was the first to document the ...
  19. [19]
    Medical Statistics from Graunt to Farr - jstor