
Data dredging

Data dredging, also known as data fishing, data snooping, or p-hacking, refers to the practice of conducting multiple unplanned statistical analyses on a dataset to identify patterns or associations that appear statistically significant, without a predefined hypothesis, thereby increasing the likelihood of false positives. This approach exploits the inherent variability in data and the multiplicity of possible tests, where even random noise can yield results below conventional significance thresholds like p < 0.05, leading to misleading conclusions. In research contexts, data dredging often arises from "researcher degrees of freedom," such as selectively testing subgroups, variables, or outcome measures after data collection, which can distort findings and contribute to the reproducibility crisis in fields such as psychology, medicine, and economics.

For instance, observational studies on hormone replacement therapy initially suggested protective effects against coronary heart disease based on dredged associations, but subsequent randomized controlled trials revealed increased risks, highlighting how dredging ignores confounding factors such as the healthier lifestyles of women who chose the therapy. Similarly, cohort studies of beta-carotene supplements showed apparent reductions in lung cancer risk, yet clinical trials demonstrated an 18% increase, underscoring the unreliability of unplanned explorations.

The consequences of data dredging extend beyond individual studies, inflating the rate of Type I errors (false discoveries) across the published literature and eroding trust in scientific findings. P-value distributions in affected publications often cluster just below 0.05, a hallmark of selective reporting, as seen in meta-analyses of clinical trials where only a fraction of prespecified analyses confirm significance. Ethically, it can lead to retracted papers, wasted resources, and misguided policy decisions, such as promoting ineffective interventions based on spurious correlations.

To mitigate data dredging, researchers are encouraged to prespecify hypotheses, analysis protocols, and variables before data examination, often through study registration on platforms like ClinicalTrials.gov or the Open Science Framework. Additional safeguards include adjusting for multiple comparisons (e.g., using Bonferroni corrections), reporting all conducted tests transparently, and employing stricter significance criteria such as p < 0.001 in exploratory work. Enhanced statistical education and a cultural shift in publishing—prioritizing rigorous study design over novel "significant" results—further help distinguish legitimate exploratory analysis from biased dredging.

Overview

Definition and Terminology

Data dredging, also known as data snooping or data fishing, refers to the misuse of data analysis techniques to identify patterns or relationships within a dataset that are presented as statistically significant without the guidance of a pre-specified hypothesis, often resulting in spurious correlations that do not reflect true underlying effects. The practice involves extensively probing the data through multiple unplanned statistical tests or manipulations until a desirable outcome, such as a low p-value, is achieved, thereby distorting the interpretation of results as confirmatory evidence.

A prominent synonym for data dredging is p-hacking, a term popularized by Simmons, Nelson, and Simonsohn in their 2011 paper, which describes the selective reporting or analysis of data—such as deciding on sample sizes, variable inclusions, or outcome measures after seeing the results—to push p-values below a conventional threshold like 0.05, thereby fabricating apparent statistical significance. Another related term is data-snooping bias, which highlights the risk of overinterpreting chance findings in large datasets as meaningful insights.

Data dredging must be distinguished from legitimate exploratory data analysis (EDA), which involves open-ended examination of data to generate hypotheses and is conducted transparently, with findings clearly labeled as preliminary and requiring subsequent validation through independent confirmatory studies. In contrast, data dredging conceals its exploratory origins, presenting results as if derived from a priori hypotheses to mimic rigorous, hypothesis-driven research.

At its core, data dredging inflates the Type I error rate—the probability of incorrectly rejecting a true null hypothesis—because it involves unadjusted multiple testing, where numerous analyses are performed without correcting for the increased chance of false positives across the ensemble of tests. This relates to the broader multiple comparisons problem, in which the probability of at least one false positive rises rapidly with the number of tests conducted.

Significance in Research

Data dredging, also known as p-hacking, poses a significant threat to the validity of scientific research by systematically inflating the rate of false positive findings. Simulations demonstrate that undisclosed flexibility in data collection and analysis—common practices in data dredging—can elevate false positive rates from the nominal 5% to as high as 60.7% when multiple decisions such as variable selection, sample size adjustments, and covariate inclusion are combined without pre-specification. The problem is evident across diverse disciplines; for example, an analysis of 258,050 test results from psychological studies revealed a consistent excess of p-values just below the 0.05 threshold, indicating widespread p-hacking. Similar patterns have been observed in meta-analyses from other fields, including biology and economics.

In the scientific process, data dredging fundamentally contrasts with confirmatory hypothesis testing, where analyses are pre-planned to test specific, falsifiable predictions against independent data. The post-hoc exploration inherent in dredging allows researchers to adjust methods until statistically significant patterns emerge, thereby capitalizing on chance and violating the principle of falsifiability by tailoring hypotheses to observed results rather than independently verifying them. This practice undermines the reliability of evidence, as it increases Type I errors and erodes confidence in reported associations, making it difficult to distinguish genuine effects from artifacts.

On a field-wide scale, data dredging contributes substantially to the replication crisis, in which many landmark findings fail to reproduce under rigorous conditions. For instance, a large-scale effort to replicate 100 psychological studies found that only 36% yielded significant results, compared to 97% of the originals, highlighting how practices like dredging propagate unreliable knowledge. Such issues extend beyond psychology to medicine, economics, and the social sciences, where meta-analytic evidence shows inflated effect sizes due to selective analysis.

The motivations driving data dredging often stem from intense academic pressures, particularly the "publish or perish" culture that rewards novel, significant results while discouraging null findings. This incentive structure encourages selective reporting and questionable research practices, as researchers face career advancement tied to publication volume and impact, leading to widespread adoption of post-hoc analyses to achieve publishable outcomes.

History

Early Statistical Concepts

The concept of data dredging traces its roots to early 20th-century concerns in statistics regarding multiple comparisons and the risk of spurious findings arising by chance. Karl Pearson, in his foundational work on correlation coefficients during the 1890s and 1900s, explicitly warned about spurious correlations that could arise from improper indexing or heterogeneous data mixtures, emphasizing how such chance associations might mislead interpretations without rigorous controls. These ideas highlighted the dangers of exploring datasets for patterns without accounting for the inflated probability of false positives when numerous tests are conducted.

To address these issues, the Bonferroni correction was introduced in 1936 by the Italian mathematician Carlo Emilio Bonferroni as a method to adjust significance levels for multiple tests, based on inequalities bounding the probability of joint events. The procedure divides the overall significance level α by the number of comparisons m, ensuring the family-wise error rate remains controlled even as testing multiplicity increases. Bonferroni's inequalities provided a conservative yet practical framework for mitigating the risks of chance findings in exploratory analyses.

By the 1960s, biostatisticians and psychologists began critiquing the practical implications of unadjusted multiple testing in applied research. Jacob Cohen's 1962 review of psychological studies revealed alarmingly low statistical power (averaging around 0.48 for medium effects), which exacerbated Type II errors and compounded risks when researchers "fished" through subsets without adjustments, leading to unstable multiple comparisons on small samples. Cohen noted that such exploratory practices often left investigators with unreliable estimates, underscoring the need for power considerations to avoid overinterpreting chance results in unadjusted tests.

The term "data dredging" gained currency in the econometric literature of the 1970s and 1980s to describe biased inference through exhaustive specification searches, in which researchers iteratively test variables until significant results appear, inflating error rates. Edward Leamer's 1978 book Specification Searches formalized these concerns, critiquing ad hoc inference from nonexperimental data as prone to fragility and spurious significance. Michael C. Lovell's 1983 paper further elaborated on "data mining" in this context, simulating how such practices distort inference by capitalizing on chance correlations in macroeconomic datasets.

A key mathematical illustration of these risks is the family-wise error rate (FWER), which quantifies the probability of at least one false positive across m independent tests, each conducted at level α:

\text{FWER} = 1 - (1 - \alpha)^m

This formula demonstrates how unadjusted testing rapidly inflates the overall error rate—for instance, with α = 0.05 and m = 20, the FWER exceeds 0.64—necessitating corrections like Bonferroni's to maintain rigorous control.
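
The arithmetic behind this inflation is easy to reproduce. The short Python sketch below (an illustration, not drawn from any cited source) evaluates the FWER formula for several values of m and shows the Bonferroni per-test threshold α/m that restores control:

```python
# Family-wise error rate for m unadjusted, independent tests at level alpha,
# alongside the Bonferroni per-test threshold that keeps the FWER at alpha.
alpha = 0.05
for m in (1, 5, 20, 100):
    fwer = 1 - (1 - alpha) ** m        # P(at least one false positive)
    bonferroni = alpha / m             # per-test significance level under Bonferroni
    print(f"m = {m:3d}   FWER = {fwer:.3f}   Bonferroni threshold = {bonferroni:.4f}")
```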

Modern Recognition and P-Hacking

The term "p-hacking" emerged prominently in the early to describe the practice of selectively analyzing data in ways that increase the likelihood of obtaining statistically significant results, often through undisclosed flexibility in . This concept gained traction following a seminal paper by Simmons, , and Simonsohn, which demonstrated via computer and experiments how common flexible practices—such as deciding post-hoc whether to include covariates, drop conditions, or continue —could inflate false-positive rates far beyond the nominal 5% level. In one simulation combining multiple such flexibilities, the false-positive rate reached 60.7%, highlighting the ease with which researchers could inadvertently or deliberately produce misleading evidence of effects that do not exist. The reproducibility crisis in scientific research, particularly in psychology and related fields, was starkly triggered by high-profile scandals around the same time, amplifying awareness of data dredging as a systemic issue. In 2011, Dutch social psychologist Diederik Stapel was found to have fabricated data in at least 50 publications over a decade, leading to the retraction of numerous papers and a major investigation by Dutch universities that exposed widespread flaws in social psychology research practices. This scandal, which eroded public trust in the field, coincided with growing calls for reform and set the stage for broader scrutiny of analytical flexibility akin to p-hacking. A 2016 Nature survey of over 1,500 researchers across disciplines further underscored the crisis, revealing that more than 70% had failed to reproduce another scientist's experiments and over 50% had struggled to replicate their own, with many attributing issues to selective reporting and poor transparency. In response to these developments, initiatives like the 2012 launch of the Framework (OSF) by the Center for Open Science aimed to promote transparency by providing a free platform for preregistering studies, sharing data, and archiving protocols, thereby reducing opportunities for p-hacking through mandatory disclosure. By the 2020s, recognition of data dredging extended to the era of and , where vast datasets and automated exacerbate risks of and biased results. For instance, a 2023 study on "fairness hacking" in algorithms drew parallels to p-hacking, showing how researchers could manipulate fairness metrics across multiple evaluation pipelines to fabricate equitable outcomes without genuine improvements. Similarly, during the , p-hacking contributed to biases in rapid research outputs, as evidenced by analyses of studies where post-hoc subgroup analyses and selective reporting led to conflicting efficacy claims, prompting calls for stricter preregistration to mitigate such distortions.

Types

Drawing Conclusions from Data

Drawing conclusions from data in the context of data dredging involves conducting exploratory analyses without a pre-specified hypothesis and then presenting significant findings as if they had been predicted a priori, often without disclosing the exploratory nature of the work. This approach, known as HARKing (Hypothesizing After the Results are Known), was coined by the psychologist Norbert L. Kerr to describe the retrofitting of narratives to post-hoc discoveries, which can mislead readers about the evidential strength of the results.

The process typically begins with unrestricted data exploration, in which researchers perform a wide array of statistical tests to identify patterns or associations that reach conventional significance thresholds, such as p < 0.05, and selectively report only those "hits" while omitting the full scope of analyses conducted. By concealing the iterative search process, this selective reporting creates an illusion of confirmatory evidence, as the reported results appear more robust than they are when viewed in isolation from the broader testing landscape.

Statistically, the practice inflates the false discovery rate (FDR), the proportion of false positives among all significant findings, because multiple unplanned tests increase the likelihood of spurious results unless appropriate adjustments, such as the Benjamini-Hochberg procedure, are applied. For instance, conducting 20 independent tests at α = 0.05 without correction yields an expected 1 false positive by chance alone (20 × 0.05 = 1), and the probability of at least one false positive across the tests rises to approximately 64%, undermining the reliability of any isolated significant outcome. This form of data dredging is particularly common during initial data scans in exploratory phases of research, where it can generate promising leads for subsequent hypothesis-driven studies, provided the exploratory origins are transparently acknowledged to avoid overinterpretation.
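
As a rough illustration of this arithmetic, the simulation below (a hypothetical sketch, not taken from any cited study) generates two groups that differ on none of 20 outcome variables, tests each outcome, and records how often a dredger who reports any "hit" would find something significant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, n_outcomes, alpha, n_sim = 50, 20, 0.05, 2000

datasets_with_a_hit = 0
for _ in range(n_sim):
    # Two groups with NO true difference on any of the 20 outcomes.
    a = rng.normal(size=(n_per_group, n_outcomes))
    b = rng.normal(size=(n_per_group, n_outcomes))
    pvals = [stats.ttest_ind(a[:, j], b[:, j]).pvalue for j in range(n_outcomes)]
    datasets_with_a_hit += any(p < alpha for p in pvals)

print(f"Expected false positives per dataset: {n_outcomes * alpha:.1f}")
print(f"Share of datasets with at least one 'significant' outcome: "
      f"{datasets_with_a_hit / n_sim:.2f} (theory: {1 - (1 - alpha) ** n_outcomes:.2f})")
```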

Optional Stopping

Optional stopping refers to the practice in sequential experiments where data collection continues until a statistical test yields a p-value below a predetermined threshold, such as 0.05, rather than adhering to a pre-specified sample size. This approach violates the assumptions underlying standard fixed-sample testing by incorporating peeking at interim results to decide on continuation or termination, thereby introducing bias into the process.

The mechanism of optional stopping inflates the Type I error rate because each interim analysis functions as an additional test, akin to unplanned multiple comparisons, without appropriate adjustments. For instance, if a researcher checks significance after every 10 subjects up to a maximum of 50, this equates to five potential tests; without correction, the overall Type I error rate can rise from the nominal 5% to approximately 23%, calculated via the family-wise error rate formula for repeated independent tests, 1 - (1 - \alpha)^k, with α = 0.05 and k = 5. Because successive looks share data, the tests are not truly independent and the exact inflation is somewhat lower, but repeated peeking still compounds the probability of obtaining at least one false positive, undermining error control.

In contrast, legitimate sequential testing methods like the sequential probability ratio test (SPRT), developed by Abraham Wald, allow for repeated analyses while controlling error rates through adjusted boundaries that account for the cumulative risk of multiple looks, ensuring the overall Type I error remains at the desired level. Abusing significance thresholds through optional stopping, however, disregards these boundaries, leading to unreliable results. A notable example arises in clinical trials, where interim analyses for early termination based on "promising" trends without proper alpha-spending functions can prematurely halt studies, exaggerating treatment effects and increasing the risk of approving ineffective interventions. Regulatory guidelines emphasize pre-planned adjustments, such as group sequential designs with O'Brien-Fleming boundaries, to mitigate this risk during efficacy or futility assessments.
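
The inflation can be reproduced with a short simulation. The sketch below (hypothetical parameters: interim looks after every 10 subjects per group, up to 50) stops at the first look that reaches p < 0.05; because successive looks share data, the simulated rate comes out below the independence-based 23% figure but still well above the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_sim = 0.05, 5000
looks = [10, 20, 30, 40, 50]          # interim analyses after every 10 subjects per group

false_positives = 0
for _ in range(n_sim):
    # No true effect: both groups are drawn from the same distribution.
    a = rng.normal(size=looks[-1])
    b = rng.normal(size=looks[-1])
    for n in looks:
        if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha:
            false_positives += 1      # declare "significance" and stop collecting data
            break

print(f"Nominal Type I error with a fixed sample size: {alpha:.2f}")
print(f"Simulated Type I error with optional stopping: {false_positives / n_sim:.3f}")
```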

Post-hoc Data Replacement

Post-hoc data replacement involves altering a dataset after initial analysis by replacing missing values, outliers, or other data points with substituted values in order to achieve statistically significant results. This technique, a form of p-hacking, includes selective outlier exclusion using thresholds like 2 standard deviations, or favorable imputation methods such as mean substitution or regression-based filling, chosen after the fact to favor desired outcomes. Researchers may remove observations under the pretext of data cleaning or impute values that align with the hypothesized effect, thereby tweaking p-values without prespecifying the approach.

Such practices introduce substantial bias by distorting the original data distribution and the relationships among variables. For instance, selective mean imputation can artificially shift p-values toward significance by reducing variance and inflating effect sizes, with simulations showing that p-hacking strategies involving selective data replacement and imputation can raise false-positive rates to 30% or more. In psychology, studies reporting outlier removal are associated with higher rates of reporting errors and lower methodological quality, potentially leading to overestimation of effects because the excluded points are those that contradict the hypothesis. The problem is exacerbated when the proportion of missing data is high, where aggressive imputation further inflates Type I error rates.

A common application occurs in survey research, where non-responses are imputed to "balance" demographic groups, often resulting in biased estimates, including overestimation of treatment effects or associations. Surveys indicate that around 38% of researchers admit to excluding data points after seeing their impact on the results, without prior specification, highlighting the prevalence of the practice in fields like psychology. Detecting post-hoc data replacement is challenging, as it requires access to analysis logs or preregistration details to verify whether alterations were prespecified; vague reporting or unreported exclusions often obscure the manipulation, and standard tools like p-curve analysis may fail to identify it in isolation.
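
A minimal simulation of one such strategy appears below (a hypothetical sketch): when the prespecified test on two null groups fails, the analyst "cleans" each group by dropping points beyond 2 standard deviations and reports whichever analysis yields the smaller p-value. A single exclusion rule inflates the false-positive rate only modestly; combining several rules, thresholds, or imputation choices pushes it far higher.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n_sim, n = 0.05, 5000, 30

def trim(x, k=2.0):
    """Drop points more than k sample standard deviations from the mean."""
    z = (x - x.mean()) / x.std(ddof=1)
    return x[np.abs(z) < k]

honest_hits, hacked_hits = 0, 0
for _ in range(n_sim):
    a, b = rng.normal(size=n), rng.normal(size=n)            # no true group difference
    p_raw = stats.ttest_ind(a, b).pvalue
    honest_hits += p_raw < alpha
    # Post-hoc "cleaning": also run the test after trimming outliers in each group,
    # and report whichever version gives the smaller p-value.
    p_best = min(p_raw, stats.ttest_ind(trim(a), trim(b)).pvalue)
    hacked_hits += p_best < alpha

print(f"False-positive rate, prespecified analysis:   {honest_hits / n_sim:.3f}")
print(f"False-positive rate, with post-hoc exclusion: {hacked_hits / n_sim:.3f}")
```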

Post-hoc Grouping

Post-hoc grouping, also known as unplanned subgroup analysis, involves dividing a dataset into subgroups based on characteristics observed after initial inspection, such as quartiles or other emergent categories, and then conducting statistical tests within those subgroups without prior specification in the study protocol. The practice is common in exploratory analyses but raises concerns because it lacks pre-planning, potentially leading to biased interpretations of treatment effects or associations that may not generalize.

A primary issue with post-hoc grouping is inflation of the Type I error rate due to multiple comparisons; for n independent subgroup tests each conducted at significance level α (typically 0.05), the family-wise error rate—the probability of at least one false positive—approaches 1 - (1 - α)^n, which can exceed 0.2 for as few as five subgroups. This multiplicity problem exacerbates the risk of spurious findings, as researchers may selectively report significant subgroups while ignoring non-significant ones, akin to other forms of data manipulation such as post-hoc data replacement.

Post-hoc grouping can also produce misleading results through phenomena like Simpson's paradox, where an overall null effect in the full dataset reverses to apparent significance in subgroups due to confounding variables or unequal subgroup sizes, creating the illusion of heterogeneous effects that do not hold up under validation. For instance, a treatment may show no overall benefit but appear effective in a post-hoc age-based subgroup if older participants have higher baseline risks, masking the true lack of efficacy. The practice is particularly prevalent in clinical trials pursuing "personalized medicine" claims, where post-hoc subgroup analyses are reported in up to 86% of randomized phase III trials to suggest tailored therapies, often without adequate adjustment for multiplicity or external validation, leading to overstated subgroup-specific benefits. Such analyses, while useful for hypothesis generation, require cautious interpretation and preregistration in future studies to mitigate their role in data dredging.
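
The multiplicity problem can be illustrated with a small simulation (a hypothetical sketch using five arbitrary binary baseline characteristics): even though the treatment has no effect at all, testing it separately within every subgroup frequently turns up at least one "responsive" subgroup.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, n_sim, n, n_covariates = 0.05, 2000, 200, 5

trials_with_a_hit = 0
for _ in range(n_sim):
    treat = rng.integers(0, 2, size=n)                    # randomized, truly ineffective treatment
    outcome = rng.normal(size=n)
    covars = rng.integers(0, 2, size=(n, n_covariates))   # arbitrary binary characteristics
    hit = False
    for j in range(n_covariates):
        for level in (0, 1):                              # test the "effect" within each subgroup
            in_subgroup = covars[:, j] == level
            p = stats.ttest_ind(outcome[in_subgroup & (treat == 1)],
                                outcome[in_subgroup & (treat == 0)]).pvalue
            hit = hit or (p < alpha)
    trials_with_a_hit += hit

print(f"Share of null trials with at least one 'significant' subgroup: "
      f"{trials_with_a_hit / n_sim:.2f}")
```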

Hypothesis from Non-Representative Data

One form of data dredging involves selecting atypical or non-representative subsets of data, such as extreme cases or outliers, to generate hypotheses that are then presented as broadly applicable. The method entails isolating unusual data slices—for instance, focusing on a small group of participants exhibiting rare behaviors or outcomes within a larger dataset—to identify patterns or correlations that suggest a new hypothesis. Such selection often occurs post hoc, without prespecification, leading researchers to formulate theories based on these uncharacteristic portions rather than the full dataset.

The primary problem with deriving hypotheses from non-representative data is low external validity: observed correlations or effects in the biased subset fail to generalize to the broader population and often do not replicate in subsequent studies. This undermines the reliability of the generated hypotheses, as the atypical data may reflect selection bias, sampling artifacts, or chance rather than a genuine effect, leading to misguided research directions.

A specific instance occurs in pilot studies, where convenience samples—recruited on the basis of accessibility rather than representativeness—are sometimes used to draw preliminary "evidence" for larger claims, such as treatment efficacy or causal links. These small, non-random samples, often comprising volunteers or easily reachable individuals, can yield unstable effect sizes that overestimate or underestimate true effects, rendering any hypotheses generated from them uninterpretable for broader application. For instance, estimating treatment effects from such pilots may suggest overly optimistic outcomes, prompting full-scale trials that fail due to non-replication.

This practice is closely linked to publication bias, as the "exciting" hypotheses derived from these selective, non-representative subsets are more likely to be reported, while null or inconsistent findings from the full dataset remain unpublished. Studies assessing meta-analyses have shown that such selective reporting of results can distort effect estimates.

Systematic Bias

Systematic bias in data dredging arises from fundamental flaws in study design that predispose analyses to spurious findings, particularly through non-random sampling and the initial omission of variables during exploratory searches. Non-random sampling introduces selection bias by creating datasets that do not represent the broader population, thereby distorting associations between exposures and outcomes. For instance, in observational studies, participants who self-select or are recruited on the basis of unmeasured traits—such as a greater propensity to report symptoms—can create spurious links between unrelated exposures and health complaints, amplifying false signals in subsequent dredging efforts.

Confounding variables, when ignored in post-hoc exploratory analyses, further exacerbate the problem by allowing unadjusted models to detect patterns driven by overlooked mediators rather than true causal links. In observational research, such as regressing exam grades on class attendance without adjusting for ability or motivation, even minor correlations between omitted confounders and the variables of interest can produce p-values approaching zero in large samples, mimicking genuine effects during exploratory probing. This design-level oversight relates closely to analyses of non-representative data, where initial sampling choices compound the issue. A distinctive variant involves post-hoc confounder adjustment after patterns emerge—for example, initially ignoring demographics such as age or socioeconomic status, then retrofitting adjustments to bolster "significant" findings from unadjusted exploratory models, which preserves the underlying bias rather than mitigating it.

The impact of these systematic biases is profound, as they transform exploratory fishing expeditions into ostensibly robust results, often leading to policy or clinical decisions based on artifacts. In epidemiology, selection bias in observational cohorts has historically produced spurious protective associations, such as hormone replacement therapy (HRT) appearing to reduce coronary heart disease risk (a relative risk of about 0.50 in observational data), only for randomized trials to reveal the opposite (a relative risk of about 1.11), owing to unadjusted confounders such as the healthier lifestyles of HRT users. By enabling unadjusted models to yield "significant" outcomes, systematic bias undermines the validity of dredged associations, prioritizing apparent novelty over replicability.
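
The grades-and-attendance example can be made concrete with a short simulation (a hypothetical sketch using statsmodels; the variable names are illustrative): an unmeasured trait drives both attendance and grades, so an unadjusted exploratory regression produces a vanishingly small p-value for attendance even though it has no direct effect, while adjusting for the confounder makes the apparent effect disappear.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 10_000

# An unmeasured trait ("motivation") drives BOTH attendance and grades;
# attendance itself has no direct effect on grades.
motivation = rng.normal(size=n)
attendance = 0.5 * motivation + rng.normal(size=n)
grades = 0.5 * motivation + rng.normal(size=n)

# Unadjusted exploratory model: attendance looks like a strong predictor.
unadjusted = sm.OLS(grades, sm.add_constant(attendance)).fit()
# Model adjusted for the confounder: the apparent effect vanishes.
adjusted = sm.OLS(grades, sm.add_constant(np.column_stack([attendance, motivation]))).fit()

print(f"Unadjusted p-value for attendance: {unadjusted.pvalues[1]:.1e}")
print(f"Adjusted p-value for attendance:   {adjusted.pvalues[1]:.3f}")
```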

Multiple Modeling

Multiple modeling, a form of data dredging, occurs when researchers fit numerous statistical models to the same dataset—for example, varying the combinations of covariates in regression analyses—and report only the one producing the most desirable outcome, typically the lowest p-value, without disclosing the exploratory process. This selective reporting inflates the apparent significance of findings, as the "best" model is chosen from potentially hundreds of iterations.

The primary statistical issue with multiple modeling is overfitting, where the selected model fits the idiosyncrasies and random noise of the specific sample rather than capturing generalizable patterns. Researchers often overlook model selection criteria such as adjusted R^2, which penalizes the inclusion of extraneous variables to prevent complexity-driven improvements in fit, or the Akaike information criterion (AIC), which balances goodness of fit against the number of parameters to favor parsimonious models. Consequently, these models perform well on the analyzed sample but fail to predict or replicate in independent samples, undermining reliability.

To mitigate the multiplicity problem in multiple modeling, corrections such as the Bonferroni adjustment can be applied, multiplying the reported p-value by the number of models tested (or, equivalently, comparing it against a threshold of α/m):

p_{\text{adjusted}} = \min(m \cdot p_{\text{original}}, 1)

where m represents the total number of models considered. This conservative approach controls the family-wise error rate but can become overly stringent when m is large.

In econometrics, multiple modeling manifests as specification searches, in which analysts iteratively refine variable inclusions, transformations, or functional forms to achieve statistically significant coefficients, yielding "ideal" but fragile results that do not hold across datasets or time periods. Such practices were critiqued early on for producing illusory empirical regularities, highlighting the need for pre-specified models to ensure robustness.
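
The sketch below (hypothetical, using single-predictor specifications for simplicity) mimics a specification search: forty candidate regressors, none truly related to the outcome, are tried one at a time and only the best-fitting specification is "reported". The Bonferroni-adjusted p-value shows how unimpressive that best result is once the number of models tried is taken into account.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, n_candidates = 100, 40            # 40 candidate predictors, none related to y

y = rng.normal(size=n)
X = rng.normal(size=(n, n_candidates))

# "Specification search": fit every single-predictor model and keep the best p-value.
pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(n_candidates)]
best = min(pvals)

print(f"Specifications tried:                {n_candidates}")
print(f"Best (selectively reported) p-value: {best:.3f}")
print(f"Bonferroni-adjusted p-value:         {min(best * n_candidates, 1.0):.3f}")
```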

Examples

In Meteorology and Epidemiology

In meteorology, early cloud-seeding experiments exemplified data dredging through selective post-hoc analyses of rainfall and storm data. During Project Cirrus in the late 1940s and early 1950s, researchers conducted trials to enhance precipitation using dry ice and silver iodide, claiming success based on observed increases in targeted areas. However, these interpretations relied on unadjusted examinations of highly variable weather patterns, leading to overstated effects that were later attributed to natural chance fluctuations rather than causal intervention. A prominent case occurred on October 13, 1947, when the seeding of a hurricane off the southeastern U.S. coast coincided with the storm's unexpected directional shift, which brought it ashore near Savannah, Georgia, causing extensive damage and prompting lawsuits; subsequent reviews concluded the change was coincidental, highlighting how post-hoc grouping of data points can mislead conclusions.

In epidemiology, data dredging has similarly distorted interpretations in large-scale observational studies, particularly those on hormone replacement therapy (HRT) in the 1990s. Post-hoc subgroup analyses of cohort data suggested that HRT reduced coronary heart disease risk by 35-50% in postmenopausal women, influencing clinical guidelines and widespread adoption. These apparent benefits arose from exploratory stratifications by age, timing of initiation, and other factors without multiplicity adjustments, but the 2002 Women's Health Initiative randomized trial reversed these findings, showing no cardioprotective effect and increased risks of stroke and breast cancer, thereby exposing the artifacts of unprespecified analyses.

The issue persists in epidemiological investigations that test multiple endpoints without correction, inflating false positives and generating spurious benefits, as seen in vitamin studies linking supplements such as vitamin D or E to lowered risks of cancer, cardiovascular disease, and mortality—associations that often dissolve in confirmatory trials because the multiplicity of testing was overlooked. For instance, early observational reports of vitamin E reducing heart disease risk were undermined by later evidence attributing the signals to chance amid numerous unadjusted outcomes. Meta-analyses from the 2010s reveal that multiplicity is frequently unacknowledged in epidemiological reporting, contributing to low reproducibility rates in the field.

In Psychology and Social Sciences

In psychology, data dredging has been particularly problematic in studies of priming effects, where researchers explore unconscious influences on behavior. A prominent example is John Bargh's 1996 experiment, which suggested that exposing participants to words associated with elderly stereotypes led to slower walking speeds, implying that automatic activation of social stereotypes affects physical actions. However, subsequent replication attempts in the 2010s, including a 2012 study by Doyen and colleagues, failed to reproduce these results, and multiple unpublished efforts also yielded null findings. These failures have been attributed in part to practices like optional stopping—continuing data collection until statistical significance emerges—which inflates false positives in behavioral experiments.

Small sample sizes exacerbate data dredging in psychology, where many studies historically relied on fewer than 50 participants, reducing statistical power and enabling p-hacking to produce up to 50% false positives among published significant results. A seminal 2011 simulation by Simmons, Nelson, and Simonsohn demonstrated this vulnerability: with just a handful of common flexible choices—such as deciding sample size post hoc, excluding outliers, or selecting covariates—researchers could obtain a statistically significant false positive in 60.7% of cases even when no true effect existed, highlighting how undisclosed analytic flexibility undermines replicability. This issue contributed to the broader replication crisis in the field, where practices like HARKing (hypothesizing after the results are known) further compounded selective reporting.

In the social sciences, dredging surfaced dramatically in the 2015 LaCour voter-persuasion study, which claimed that brief conversations with canvassers could persistently shift attitudes toward same-sex marriage, based on fabricated survey data attributed to a nonexistent polling firm. Uncovered by Broockman, Kalla, and Aronow through irreproducible patterns and post-hoc adjustments used to simulate effects, the fraud involved inventing responses and selectively analyzing subsets to achieve significance, leading to the paper's retraction from Science. This case underscored how post-hoc data replacement and grouping in experiments can fabricate supportive evidence, eroding trust in research.

In Finance and Economics

In finance, data dredging has historically led to the identification of illusory profitable strategies through extensive testing of stock-selection rules on shared datasets, particularly from the 1980s onward as computational tools became widely available. For example, analyses of mutual fund performance from 1974 to 1988 suggested a "hot hand" effect, in which funds with strong recent returns continued to outperform in the short term, encouraging investors to chase past winners. However, this apparent persistence was likely an artifact of data snooping across numerous funds and indicators: subsequent out-of-sample evaluations and adjustments for multiple comparisons revealed no genuine skill, with the strategies failing to deliver excess returns beyond benchmarks.

A prominent manifestation of data dredging in financial research involves screening hundreds of technical trading rules—such as moving-average crossovers or momentum signals—on historical price data and reporting only the few that achieve statistical significance (e.g., p < 0.05 among 100 tested rules), ignoring the false positives expected from random variation. This practice, compounded by survivorship bias, in which delisted or underperforming assets are excluded from datasets, creates an illusion of robustness: the "surviving" strategies appear highly profitable in-sample but deteriorate out-of-sample due to overfitting. To counter such biases, Halbert White developed the reality check test in the late 1990s, a bootstrap-based procedure that adjusts p-values to account for multiple model searches in forecasting and trading evaluations, enabling detection of spurious results in financial time series such as stock returns.

In economics, data dredging often arises in the specification of multiple regression models for macroeconomic forecasting, where analysts iteratively test variable combinations to achieve good in-sample fits, resulting in models that overlook critical risks. Leading up to the 2008 financial crisis, many dynamic stochastic general equilibrium (DSGE) models relied on variables like GDP growth and inflation rates drawn from post-World War II data while excluding financial leverage, housing market dynamics, and credit expansion, which led to overly optimistic predictions and a failure to anticipate the downturn. This specification search contributed to systematic underestimation of crisis probabilities, as the tuned models performed well on historical non-crisis periods but broke down when applied to unprecedented financial shocks.
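
A stylized version of such rule screening is sketched below (hypothetical parameters; this is not White's reality check itself, only an illustration of the underlying problem): on simulated driftless price paths, a grid of fifty moving-average crossover rules is tested, and a noticeable fraction of paths yields at least one rule that looks "significantly" profitable purely by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n_days, n_paths, alpha = 2000, 100, 0.05
short_windows = (5, 10, 15, 20, 25)
long_windows = (40, 60, 80, 100, 120, 140, 160, 180, 200, 220)
rules = [(s, l) for s in short_windows for l in long_windows]   # 50 parameterizations

def rule_pnl(prices, returns, short, long):
    """Daily P&L of a long/flat moving-average crossover rule (no look-ahead)."""
    ma_s = np.convolve(prices, np.ones(short) / short, mode="valid")
    ma_l = np.convolve(prices, np.ones(long) / long, mode="valid")
    signal = (ma_s[long - short:] > ma_l).astype(float)   # decided at the end of day t
    return signal[:-1] * returns[long:]                   # applied to day t+1's return

paths_with_winner = 0
share_significant = []
for _ in range(n_paths):
    returns = rng.normal(0.0, 0.01, size=n_days)          # driftless: nothing to exploit
    prices = np.cumprod(1 + returns)
    pvals = [stats.ttest_1samp(rule_pnl(prices, returns, s, l), 0.0).pvalue
             for s, l in rules]
    share_significant.append(np.mean([p < alpha for p in pvals]))
    paths_with_winner += any(p < alpha for p in pvals)

print(f"Rules screened per path: {len(rules)}")
print(f"Average share of rules with p < 0.05: {np.mean(share_significant):.2f}")
print(f"Share of paths with at least one 'winning' rule: {paths_with_winner / n_paths:.2f}")
```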

Consequences

Statistical Pitfalls

Data dredging, by involving numerous exploratory analyses of the same dataset without prior specification, substantially inflates the Type I error rate, the probability of incorrectly rejecting a true null hypothesis. For instance, conducting five independent tests at a significance level of α = 0.05 without adjustment results in a family-wise error rate (FWER) of approximately 1 - (1 - 0.05)^5 ≈ 0.226, or 22.6%, meaning more than one in five such analyses will yield at least one false positive by chance alone. This inflation arises because each test carries its own 5% risk of error, and unadjusted multiple testing compounds these risks across the family of hypotheses.

Adjustments for multiplicity can control the inflated Type I error, but they simultaneously reduce the statistical power to detect true effects in individual tests. For example, applying the Bonferroni correction to five tests divides the α level by 5, lowering the per-test threshold to 0.01 and thereby decreasing power from 80% to about 59% for detecting a medium effect size. In data dredging scenarios, where many potential effects are probed, this power loss exacerbates the challenge of identifying genuine associations amid noise.

To address the prevalence of false positives in large-scale testing, the false discovery rate (FDR) provides a less conservative alternative to FWER control. It is defined as the expected proportion of false positives among all results declared significant, i.e., FDR = (number of false positives) / (total number of positives). The Benjamini-Hochberg procedure controls the FDR at a desired level q by sorting the m p-values in ascending order as p_{(1)} ≤ ... ≤ p_{(m)}, finding the largest index i for which p_{(i)} ≤ (i/m) q, and rejecting the null hypotheses corresponding to the i smallest p-values. This stepwise method balances discovery of true effects with error control, and it is particularly useful in exploratory analyses like data dredging.

In modeling contexts, data dredging promotes overfitting, where models capture idiosyncratic noise in the sample rather than underlying patterns, leading to poor generalizability and failed predictions on new data. Overfitted models exhibit low bias but high variance, performing well on the dredged dataset but degrading sharply on independent samples because of spurious correlations identified through exhaustive searches. Cross-validation mitigates this by partitioning data into training and validation sets, estimating out-of-sample performance to select models that generalize beyond the original dataset.

Data dredging also contributes to publication bias, in which non-significant findings are suppressed, distorting the literature toward positive results. Funnel plots visualize this by plotting effect sizes against study precision (e.g., standard error); in unbiased scenarios they form a symmetrical inverted funnel, whereas asymmetry—with smaller, less precise studies reporting exaggerated effects—signals selective reporting of significant outcomes. Such asymmetry is quantified via Egger's regression of standardized effect estimates against precision, where a non-zero intercept (P < 0.1) indicates bias from suppressed results.
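
For concreteness, a minimal implementation of the Benjamini-Hochberg step-up rule is sketched below (an illustrative implementation, not drawn from any particular library), applied to a hypothetical mix of 90 null p-values and 10 p-values from a genuine effect:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q (BH step-up rule)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ranks, smallest p-value first
    thresholds = q * np.arange(1, m + 1) / m       # (i/m) * q for i = 1..m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])          # largest rank i with p_(i) <= (i/m) q
        reject[order[: k + 1]] = True              # reject the k+1 smallest p-values
    return reject

rng = np.random.default_rng(7)
pvals = np.concatenate([rng.uniform(size=90),              # 90 true nulls
                        rng.uniform(0, 0.001, size=10)])   # 10 genuine effects
print(f"Unadjusted p < 0.05:       {np.sum(pvals < 0.05)}")
print(f"BH rejections at q = 0.05: {np.sum(benjamini_hochberg(pvals))}")
```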

Broader Impacts on Science

Data dredging contributes significantly to the reproducibility crisis in scientific research, in which findings that initially appear robust often fail to replicate upon independent verification. The crisis has led to substantial wasted resources, with estimates indicating that approximately $28 billion is spent annually in the United States alone on preclinical biomedical research that cannot be successfully reproduced. The proliferation of non-reproducible results stemming from data dredging undermines the reliability of scientific knowledge, diverting funding from promising avenues and slowing progress in fields such as drug development.

Ethically, data dredging poses risks by misleading policy decisions and eroding public trust in science. For instance, the U.S. Food and Drug Administration's approval of Duragesic (the fentanyl patch) in 1990 relied on data-dredging techniques, as acknowledged by FDA official Robert Harter, which contributed to the expanded use of potent opioids and exacerbated the ongoing opioid crisis. A 2016 survey of over 1,500 researchers revealed that more than 70% had failed to reproduce another scientist's experiments, highlighting systemic pressures that foster practices like p-hacking and further diminish confidence in scientific outputs.

In specific fields, data dredging has caused tangible harms by delaying genuine discoveries. In psychology, the 2015 replication effort by the Open Science Collaboration found that only 36% of 100 high-profile studies replicated successfully, prompting a shift toward larger sample sizes and preregistration protocols to counteract dredging-induced biases. The transition, while beneficial, illustrates how years of reliance on dredged data obscured true effects and hindered advancement. The advent of machine learning and automated analytics in the 2020s has intensified these issues, with pipelines capable of conducting thousands of exploratory analyses daily, often yielding false hypotheses due to unchecked multiple testing and data leakage. Such automated dredging amplifies the volume of spurious findings, compounding the reproducibility crisis across disciplines.

Remedies

Preventive Measures

Pre-registration serves as a foundational preventive measure against data dredging by requiring researchers to document their hypotheses, methods, and analysis plans in a time-stamped, publicly accessible format prior to data collection or observation. Platforms like the Open Science Framework (OSF.io) enable non-clinical researchers to submit read-only versions of study plans, while ClinicalTrials.gov mandates registration for clinical studies involving human subjects to ensure accountability and reduce the temptation to adjust analyses based on observed patterns. This approach locks in the research design, limiting opportunities for selective reporting or exploratory fishing that could inflate false positives. Since approximately 2015, many peer-reviewed journals have adopted pre-registration as a submission requirement, particularly in fields like psychology and medicine, to curb data dredging and promote reproducible findings; for example, nearly all major journals now require it for clinical trials to align with established reporting standards.

Complementing pre-registration, rigorous study protocols emphasize fixed sample sizes determined via power calculations before the study begins, alongside predefined analysis plans that specify statistical tests and decision rules. Blinding analysts to outcome data during protocol development further safeguards against bias, ensuring that interim peeks or adjustments do not influence the final design.

Transparency practices reinforce these upfront commitments by mandating open data sharing through repositories, allowing independent verification of whether analyses deviated from the registered plan. Researchers should report all statistical tests performed, including nonsignificant ones, to provide a complete picture and avoid the illusion of isolated significant results produced by dredging. Tools like p-curve analysis, which examines the distribution of significant p-values, exemplify this: clustering of p-values just below 0.05 flags likely selective reporting, and full disclosure facilitates such assessments and builds trust in the results. The widespread adoption of simple, standardized templates on platforms like AsPredicted.org, launched in 2015, has notably curbed p-hacking in the social sciences, with empirical evidence indicating that pre-registrations paired with detailed pre-analysis plans substantially reduce signs of selective manipulation in reported test statistics.
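
The logic of p-curve analysis can be illustrated with a small simulation (a hypothetical sketch, not the published p-curve software): among studies of a genuine effect, the significant p-values pile up near zero (a right-skewed curve), whereas among null studies that are reported only when they happen to cross the threshold, the significant p-values are spread roughly evenly across the 0 to 0.05 range.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n_studies, n_per_group = 5000, 30

def significant_pvalues(true_effect):
    """Simulate two-group studies and keep only the p-values that fall below .05."""
    kept = []
    for _ in range(n_studies):
        p = stats.ttest_ind(rng.normal(true_effect, 1, n_per_group),
                            rng.normal(0.0, 1, n_per_group)).pvalue
        if p < 0.05:
            kept.append(p)
    return np.array(kept)

bins = np.arange(0.0, 0.051, 0.01)                # five bins: (0,.01], ..., (.04,.05]
for label, effect in [("genuine effect (d = 0.6)", 0.6),
                      ("null effect, selective reporting", 0.0)]:
    counts, _ = np.histogram(significant_pvalues(effect), bins=bins)
    shares = counts / counts.sum()
    print(label + ":", "  ".join(f"{s:.2f}" for s in shares))
```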

Corrective Techniques

Multiple testing corrections are essential post-hoc methods for adjusting the inflated risk of false positives that arises from data dredging, where numerous hypotheses are tested on the same dataset. The Bonferroni correction, one of the simplest approaches, controls the family-wise error rate (FWER) by dividing the significance level α by the number of tests m, or equivalently multiplying each p-value by m and comparing it to α; this ensures the probability of at least one false rejection across all tests remains at or below α. However, its conservativeness can reduce statistical power, particularly when m is large.

The Holm-Bonferroni procedure offers a less stringent stepwise alternative that also controls the FWER while retaining greater power. It involves sorting the p-values in ascending order and then sequentially comparing each to an adjusted threshold: the smallest to α/m, the next to α/(m-1), and so on, rejecting hypotheses until a non-rejection occurs, after which the remaining hypotheses are retained.

For scenarios where discovering true effects is prioritized over avoiding any false positive, false discovery rate (FDR) methods like the Benjamini-Hochberg (BH) procedure provide a balance by controlling the expected proportion of false rejections among all rejections. The BH method sorts the p-values as p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}, finds the largest index i for which p_{(i)} \leq \frac{i}{m} q, where q is the desired FDR level (often 0.05), and rejects the hypotheses corresponding to the i smallest p-values. The procedure controls the FDR under independence and positive dependence assumptions, and it is more powerful than FWER methods in the high-dimensional settings common to data dredging.

Beyond p-value adjustments, cross-validation and out-of-sample testing help validate dredged findings by assessing generalizability. Cross-validation partitions the data into training and validation folds, fitting models on subsets and evaluating them on held-out portions to estimate performance without overfitting to the full dataset, thereby mitigating the optimism produced by multiple exploratory fits. Similarly, out-of-sample testing reserves a portion of the data, unseen during exploration, for final hypothesis confirmation, providing unbiased evidence of whether an effect persists.

Permutation tests offer a nonparametric corrective tool for computing empirical p-values, especially useful when the assumptions of parametric tests are violated by dredging-induced multiplicity. By randomly reshuffling the data under the null hypothesis many times (e.g., 10,000 permutations) and recalculating the test statistic, the empirical p-value is obtained as the proportion of permuted statistics that exceed the observed one, inherently accounting for dependencies—and, via maximum-statistic variants, for multiple comparisons—without distributional assumptions. Recent advances in safe testing frameworks, such as those developed by Vovk and collaborators, enable error control in flexible, post-hoc analyses without requiring pre-specification of hypotheses or stopping rules. These e-value-based methods use supermartingales to guarantee Type I error bounds at any time during testing, even under optional continuation, making them suitable for the iterative dredging scenarios of modern data science.
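
A basic two-sample permutation test is sketched below (an illustrative implementation; recent SciPy versions also provide a general-purpose scipy.stats.permutation_test): group labels are reshuffled many times under the null hypothesis of no difference, and the empirical p-value is the share of reshuffled mean differences at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(9)

def permutation_pvalue(a, b, n_perm=10_000):
    """Two-sided permutation test for a difference in group means."""
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    exceed = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)                 # relabel observations under the null
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        exceed += diff >= observed
    return (exceed + 1) / (n_perm + 1)                     # add-one correction keeps p > 0

a = rng.normal(0.0, 1.0, size=40)
b = rng.normal(0.5, 1.0, size=40)
print(f"Empirical permutation p-value: {permutation_pvalue(a, b):.4f}")
```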
