
Replication crisis

The replication crisis, also referred to as the reproducibility crisis, is an ongoing methodological challenge in various scientific disciplines in which a substantial number of published research findings cannot be independently replicated by other researchers or even by the original authors, eroding confidence in the reliability of scientific knowledge. This phenomenon highlights systemic issues in research practices that lead to inflated rates of false positives and non-reproducible results across fields. The crisis first drew significant attention in psychology through the 2015 Reproducibility Project: Psychology, a collaborative effort led by the Open Science Collaboration that attempted to replicate 100 experiments from three leading psychology journals published in 2008. Of these, 97% of the original studies reported statistically significant results (p < 0.05), but only 36% of the replication attempts achieved significance, and replication effect sizes were roughly half as large as those in the originals (mean correlation coefficient r declining from 0.403 to 0.197). This stark discrepancy underscored the prevalence of non-replicable findings and prompted widespread scrutiny of psychological research.

Beyond psychology, the replication crisis affects diverse areas including biomedicine, economics, ecology, and the social sciences, as evidenced by failed replications in cancer biology (where only 11% of 53 high-profile studies replicated) and economics (with replication rates around 61% for high-quality laboratory studies). A 2016 survey of more than 1,500 scientists across disciplines found that over 70% had tried and failed to reproduce another researcher's experiments, while more than 50% had failed to reproduce their own work; similar concerns persist, with a 2024 survey finding 72% of biomedical researchers agreeing the field faces a reproducibility crisis. These issues have implications for public trust in science, policy decisions, and resource allocation in research funding.

Contributing factors include publication bias, where journals preferentially publish novel, positive, or statistically significant results, leaving the literature skewed toward false discoveries; questionable research practices such as p-hacking (manipulating data analysis to achieve significance) and HARKing (hypothesizing after the results are known); and underpowered studies with small sample sizes, which miss true effects and, combined with selective publication, raise the share of published findings that are false positives. Additionally, the "publish or perish" incentive structure in academia pressures researchers to prioritize quantity over rigor, exacerbating these problems. In response, the scientific community has initiated reforms such as preregistration of studies, open data sharing, and increased emphasis on replication attempts to enhance transparency and reliability.

Background

Definition and Importance of Replication

Replication in scientific research refers to the process of independently conducting a study to verify the results of a prior investigation, typically by employing the same or closely analogous methods, materials, and analytical procedures. This practice ensures that findings are not artifacts of chance, error, or unique circumstances, thereby providing diagnostic evidence about the validity of previous claims. There are two primary types of replication: direct replication, which involves an exact repetition of the original study's protocol, including the same experimental conditions, participant recruitment, and data analysis steps, to confirm the reliability of specific results; and conceptual replication, which tests the underlying hypothesis or theory using alternative methods, populations, or measures to assess the generalizability of the core idea. Direct replication focuses on reproducibility under identical conditions, while conceptual replication emphasizes robustness across variations; both contribute to the accumulation of reliable knowledge.

The importance of replication lies in its role as a cornerstone of the scientific method, enabling the distinction between genuine effects and statistical noise or false positives, thus building cumulative and trustworthy scientific knowledge. As the philosopher of science Karl Popper emphasized in his principle of falsifiability, scientific theories must be testable and potentially refutable through repeatable experiments; without replicability, isolated findings hold no significance for advancing knowledge. Popper, who lived from 1902 to 1994, argued that reproducibility underpins the objectivity of scientific evidence, allowing independent verification to falsify or corroborate hypotheses. For instance, in a simple laboratory experiment measuring human reaction times to visual stimuli, replication might involve a second researcher recruiting a new group of participants, using the same timing software and stimulus presentation, and applying identical statistical tests to the data; successful replication would confirm the original average reaction time as a reliable benchmark, illustrating how this process validates basic empirical claims.
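The reaction-time example can be made concrete with a short simulation. The sketch below is illustrative only: the design, effect size, sample sizes, and the `run_study` helper are assumptions rather than a prescribed procedure; it simply shows a direct replication applying the same measurement and the same statistical test to a newly collected sample.

```python
# Minimal sketch (hypothetical data) of a direct replication check: the same
# design and the same statistical test are applied to a new sample, and the
# two results are compared.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def run_study(n, true_shift_ms, rng):
    """Simulate reaction times (ms) for control vs. cued conditions and test the difference."""
    control = rng.normal(300, 40, n)                    # baseline reaction times
    cued = rng.normal(300 - true_shift_ms, 40, n)       # cue speeds responses by `true_shift_ms`
    t, p = stats.ttest_ind(cued, control)
    effect = control.mean() - cued.mean()
    return effect, p

# "Original" study and an independent direct replication with new participants,
# identical design and identical analysis.
orig_effect, orig_p = run_study(n=40, true_shift_ms=20, rng=rng)
rep_effect, rep_p = run_study(n=40, true_shift_ms=20, rng=rng)

print(f"original:    effect = {orig_effect:5.1f} ms, p = {orig_p:.3f}")
print(f"replication: effect = {rep_effect:5.1f} ms, p = {rep_p:.3f}")
```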

Historical Origins

The concept of replication emerged as a cornerstone of scientific practice during the early modern period, particularly through the experimental culture fostered by the Royal Society in 17th-century England. Robert Boyle's air-pump experiments in the 1660s exemplified this, as he emphasized detailed reporting and encouraged independent repetitions by witnesses to verify findings on air pressure and vacuums, thereby establishing empirical reliability amid debates with critics like Thomas Hobbes. This approach transformed replication from ad hoc verification into a normative expectation, promoting trust in experimental claims through communal scrutiny rather than solitary authority.

By the 19th and early 20th centuries, replication gained formal prominence in physics, where precise measurements demanded repeated trials to resolve discrepancies and build consensus. The Michelson-Morley experiment of 1887, testing the luminiferous ether, was extensively replicated by Dayton Miller and others in subsequent decades, with improved interferometers confirming the null result and paving the way for Einstein's relativity theory. Concurrently, the professionalization of science led journals like Nature (founded 1869) and Philosophical Transactions to implicitly require verifiable evidence, as editors prioritized reproducible demonstrations to distinguish rigorous work from speculation.

Following World War II, the expansion of quantitative methods in the social sciences and psychology assumed replication as an inherent safeguard, yet systematic checks remained rare amid rapid institutional growth. Fields like experimental psychology grew through federal funding and applied testing, where studies on behavior and cognition were presumed replicable due to controlled lab settings, but this optimism overlooked contextual variations. Key debates in the 1960s and 1970s, notably Lee Cronbach's critiques, highlighted tensions in reliability, distinguishing internal validity (controlled replicability within studies) from external validity (generalizability across settings). In his 1975 address, Cronbach argued for bridging experimental and correlational approaches to enhance the robustness of psychological findings, influencing methodological reforms. Sociologically, replication norms solidified during science's professionalization from the 19th century onward, as universities and academies standardized training to curb fraud and bias, embedding verification in peer review and tenure criteria to legitimize disciplines amid industrialization and specialization.

Early Indicators and Statistics

Early signs of the replication crisis emerged through quantitative analyses in the late 20th and early 21st centuries, revealing systemic issues in research reliability across disciplines. One foundational indicator was the low statistical power of psychological studies, which increases the risk of false negatives and undermines replicability. In a 1962 review by Jacob Cohen, the average statistical power to detect medium-sized effects in abnormal-social psychological research was approximately 0.46, implying a high likelihood of missing true effects. Subsequent pre-2010 surveys confirmed persistently low power; for instance, Peter Sedlmeier and Gerd Gigerenzer's 1989 analysis of psychological studies from 1960 to 1987 found an average power of 0.37, corresponding to a 63% chance of false negatives for medium effects.

In medicine, early tools for evaluating research quality further highlighted reproducibility concerns. The development of the AMSTAR (A MeaSurement Tool to Assess systematic Reviews) instrument in 2007 provided a standardized 11-item checklist to appraise the methodological rigor of systematic reviews, often revealing deficiencies that compromise reproducibility. Initial applications of AMSTAR to non-Cochrane systematic reviews in fields like oncology and cardiology demonstrated low overall quality scores, with many reviews failing to adequately address publication bias, study heterogeneity, or conflicts of interest—factors that erode confidence in replicated findings.

Global statistics from early meta-analyses also pointed to replication failures in applied fields. John P. A. Ioannidis's seminal 2005 paper, "Why Most Published Research Findings Are False," modeled the probability of false positives using factors like power, bias, and pre-study odds, estimating that 50-90% of findings in fields characterized by small effects, low power, and flexible designs could be false. This was echoed in early meta-analyses; for instance, 1990s reviews of observational studies on dietary factors and disease risk often showed inconsistent results across trials, with replication success rates below 50% for associations like antioxidant supplements and cancer prevention. In preclinical oncology, retrospective analyses from 1999-2010 documented stark inconsistencies: C. Glenn Begley and Lee M. Ellis reported in 2012 that only 11% (6 out of 53) of influential preclinical studies from that period could be replicated by an independent team at Amgen, attributing discrepancies to selective reporting and experimental variability. These indicators collectively signaled the need for broader scrutiny of scientific practices.
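Ioannidis's argument can be summarized in one formula: the positive predictive value (PPV) of a significant finding depends on power, the significance threshold, and the pre-study odds that a tested effect is real. The sketch below computes the basic version of that formula (ignoring the additional bias term); the inputs are illustrative assumptions, not estimates for any particular field.

```python
# Sketch of the positive predictive value (PPV) calculation popularized by
# Ioannidis (2005): the probability that a "significant" finding is true,
# given statistical power, the significance threshold, and pre-study odds.
def ppv(power, alpha, prior_odds):
    """PPV = power*R / (power*R + alpha), with R the pre-study odds of a true effect."""
    return (power * prior_odds) / (power * prior_odds + alpha)

# Illustrative values (assumptions, not measurements): a field where 1 in 10
# tested hypotheses is true (R = 1/9), alpha = 0.05, and power near the
# 0.37-0.46 averages reported in the early power surveys cited above.
for power in (0.37, 0.46, 0.80):
    print(f"power={power:.2f}  ->  PPV={ppv(power, 0.05, 1/9):.2f}")
```

Even without any bias, a significant result from such a field has well under an even chance of being true, which is why low power alone can generate a literature that resists replication.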

Prevalence

In Psychology

The replication crisis in psychology became starkly evident through the Reproducibility Project: Psychology, conducted by the Open Science Collaboration from 2011 to 2015, which attempted to replicate 100 experimental and correlational studies originally published in three prominent psychology journals in 2008. Of these, only 36% produced statistically significant results in the same direction as the originals, with replication effect sizes averaging roughly half those reported in the initial studies (mean original effect size r = 0.403; replication r = 0.197). This project highlighted systemic issues in psychological research reproducibility, prompting widespread scrutiny and reform efforts within the field.

Subfields within psychology showed varying degrees of vulnerability to replication failures, with social psychology particularly affected. For instance, priming effects—such as those influencing behavior through subtle cues—replicated successfully in only 17% of cases across 94 studies, with 94% exhibiting smaller effect sizes than the originals. Similarly, the ego depletion hypothesis, positing that self-control is a limited resource that depletes with use, has faced severe challenges, succeeding in just 4 out of 36 major multi-site replication attempts by 2022, a success rate of roughly 11%. In contrast, cognitive psychology demonstrated somewhat higher replicability, with memory studies and related experiments achieving around 48% success in the Reproducibility Project: Psychology, compared to 23% in social psychology, though inconsistencies persist in areas like false memory paradigms.

Surveys have underscored the role of questionable research practices (QRPs) in contributing to non-replication. A 2012 study by John et al., surveying over 2,000 psychologists, found that more than 50% admitted to practices such as failing to report all dependent measures (63%) or selectively reporting analyses that "worked" (56%), which inflate the likelihood of false positives and hinder reproducibility. Recent analyses indicate ongoing challenges, with effect sizes in psychological journals post-2015 remaining approximately halved compared to pre-crisis levels, reflecting a conservative shift in reporting but persistent overestimation in originals. A 2023 meta-analysis of replications across psychology subfields estimated overall failure rates between 40% and 60%, varying by domain, with social psychology at the lower end of replicability. These findings emphasize the need for continued vigilance in psychological research validation.

In Medicine and Biomedical Sciences

The replication crisis in medicine and biomedical sciences manifests prominently in both preclinical and clinical research, where failures to reproduce findings undermine the reliability of evidence used for drug development, treatment decisions, and public health policies. High-profile initiatives have highlighted systemic issues, with preclinical studies in particular showing low reproducibility rates that contribute to wasted resources and delayed therapeutic advances. For instance, pharmaceutical companies have reported significant challenges in validating published results, leading to reevaluations of research pipelines and calls for improved standards.

A landmark effort, the Reproducibility Project: Cancer Biology, launched in 2013 by the Center for Open Science in collaboration with Science Exchange, aimed to replicate key experiments from 50 high-profile cancer biology papers published between 2010 and 2012. By 2021, the project had completed replications for 23 papers, finding that only 46% of the 97 experiments showed statistically significant effects in the expected direction, compared to 87% in the originals; moreover, replicated effect sizes were on average 85% smaller than those initially reported. This outcome underscores the fragility of preclinical cancer research, where positive findings were only half as likely to replicate successfully (40%) as null results (80%), suggesting inflated original effects due to methodological or selective reporting issues. Preclinical failures extend beyond cancer, as evidenced by a 2012 internal analysis at Bayer, which attempted to reproduce 67 landmark publications in oncology, women's health, and cardiovascular disease. The team could validate only 25% of the studies to a level sufficient for further drug development, attributing discrepancies to incomplete experimental details, biological variability, and potential biases in original reporting. Similarly, Amgen scientists reported in 2012 replicating just 6 out of 53 influential cancer studies (11%), reinforcing industry-wide concerns about the translatability of basic research to clinical applications. These corporate audits, though not exhaustive, illustrate how non-reproducible preclinical data can lead to billions in downstream costs, estimated at $28 billion annually in the U.S. alone for irreproducible preclinical research.

In clinical trials, reproducibility challenges arise in confirming drug efficacy and safety, often resulting from selective publication, underpowered designs, and inconsistent protocols across studies. A 2009 analysis estimated that up to 85% of health research funding, including clinical trials, is wasted due to avoidable issues like poor question formulation, non-generalizable designs, and inaccessible data, which hinder independent verification and meta-analyses. For example, many phase III trials for expensive therapeutics fail to replicate prior efficacy signals observed in smaller studies, contributing to regulatory scrutiny and retracted approvals; a 2023 assessment of highly cited clinical research from 2004–2018 found replication rates as low as 40–50% for key claims in top journals. These issues exacerbate the translation gap, where promising preclinical results seldom advance to approved treatments, with only about 5–10% of cancer drugs succeeding in clinical phases.
A 2024 international survey of over 1,900 biomedical researchers revealed widespread acknowledgment of a reproducibility crisis, with 72% agreeing that biomedicine faces severe replicability problems and only 5% estimating that more than 80% of studies are reproducible; respondents attributed low rates (<30% in many estimates) to cultural pressures and methodological flaws across fields like cell biology. In subareas such as neuroscience, functional MRI (fMRI) studies have been particularly affected, with a 2009 analysis by Vul et al. demonstrating that over 50 high-profile papers reported implausibly high correlations (often >0.8) between brain activation and traits like personality or emotion, likely inflated by non-independent selection of peaks—suggesting up to 70% false positive rates when accounting for low statistical power and flexible analyses. These findings prompted methodological reforms, including preregistration and whole-brain corrections, but highlight ongoing vulnerabilities in neuroimaging research that parallel broader biomedical concerns.
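The non-independence problem that Vul et al. describe can be demonstrated with purely random data. The sketch below is a toy illustration under assumed dimensions (20 subjects, 5,000 null voxels): selecting the voxels most correlated with a trait and then summarizing the correlation in those same voxels yields large values even though no true relationship exists.

```python
# Sketch of circular ("non-independent") analysis: voxels are selected because
# they correlate with a trait, and the correlation is then reported in those
# same voxels, inflating the estimate. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_voxels = 20, 5000

trait = rng.normal(size=n_subjects)                 # e.g., a personality score
voxels = rng.normal(size=(n_subjects, n_voxels))    # null brain data: no true link

# Pearson correlation of every voxel with the trait.
vx = (voxels - voxels.mean(0)) / voxels.std(0)
tz = (trait - trait.mean()) / trait.std()
r = vx.T @ tz / n_subjects

selected = np.argsort(r)[-20:]                      # keep the 20 "best" voxels
print(f"mean |r| over all voxels:   {np.abs(r).mean():.2f}")
print(f"mean r in selected voxels:  {r[selected].mean():.2f}  (circular, inflated)")
```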

In Economics and Other Social Sciences

The replication crisis has notably impacted economics, where empirical studies often rely on complex econometric models and large datasets, making verification challenging. A comprehensive assessment of empirical papers published in the American Economic Review's centenary volume revealed that only 29% had been formally replicated, highlighting a low baseline rate for top-tier work in the field. Similarly, a large-scale replication project of laboratory experiments in economics found replication success rates ranging from 61% to 78%, depending on the indicator used, such as statistical significance or effect-size consistency; this still indicates substantial variability and failure in about 22-39% of cases. These findings underscore how the field's emphasis on novel techniques, like difference-in-differences and regression discontinuity designs, can obscure reproducibility problems in the absence of standardized data and code sharing.

Key analyses have pinpointed p-hacking as a contributing factor in empirical economics, where researchers may selectively report results to achieve statistical significance. In a study of over 57,000 hypothesis tests from top economics journals spanning 1963 to 2018, Brodeur et al. identified clear patterns of p-value bunching just below the 0.05 threshold, particularly in causal analyses using instrumental variables and regression discontinuity, suggesting p-hacking influences 20-50% of the distribution of significant results depending on the method. This practice not only inflates false positives but also complicates replication, as original datasets and code are often unavailable or inadequately documented.

In other social sciences, such as sociology and political science, replication challenges arise from reliance on survey data, network analyses, and observational studies of social behavior. In sociology, efforts to computationally reproduce findings from prominent articles have highlighted issues with data and code availability, pointing to undocumented data cleaning steps and software dependencies. Political science faces analogous problems, with large-scale replication projects in the 2020s yielding an overall success rate of approximately 50%, particularly in observational studies where model specifications vary across contexts. For instance, replications of voter mobilization studies have shown inconsistent effects of campaign interventions on turnout, often failing due to heterogeneous samples and unmodeled interactions in multi-level data.

Illustrative examples highlight these issues. The seminal 1994 study by Card and Krueger on the employment effects of a minimum wage increase in New Jersey produced counterintuitive null results on job losses, but subsequent replications using payroll records and alternative estimators have yielded mixed outcomes, with some confirming no significant impact and others finding modest employment reductions of 1-4%, illustrating sensitivities to data sources and measurement error. Similarly, research on income inequality, such as links between the Gini coefficient and outcomes like generosity, has encountered inconsistencies; replications of studies linking higher inequality to reduced generosity have provided mixed evidence, with effect sizes varying widely across cultures and failing to replicate in 40-60% of attempts due to moderating variables like regional differences. Broader trends drawn from recent surveys of social scientists indicate that non-replication rates hover around 50-60% for survey-based research across sociology, political science, and economics, driven by issues like low statistical power in heterogeneous populations and selective reporting of subgroup analyses. These patterns emphasize shared vulnerabilities in the social sciences, where human-centric data amplify variability compared to controlled experimental settings.
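The selective-reporting mechanism behind such p-value patterns can be illustrated with a small simulation. The sketch below is not Brodeur et al.'s diagnostic; it is an assumed setup in which a researcher examines several outcome measures for a treatment with no true effect and reports whichever analysis "works," which pushes the share of significant published results far above the nominal 5%.

```python
# Sketch of selective reporting across alternative outcomes/specifications on
# null data; parameter choices are arbitrary illustrations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_obs, n_outcomes, n_papers = 200, 10, 2000

hits = 0
for _ in range(n_papers):
    x = rng.normal(size=n_obs)                        # "treatment" with no real effect
    # p-value of the treatment-outcome correlation for each candidate outcome
    pvals = [stats.pearsonr(x, rng.normal(size=n_obs))[1] for _ in range(n_outcomes)]
    if min(pvals) < 0.05:                             # publish whichever outcome "worked"
        hits += 1

print(f"null 'papers' with a reportable significant result: {hits / n_papers:.0%} "
      f"(nominal rate would be 5%)")
```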

In Natural and Emerging Fields

In nutrition science, replication issues have been particularly pronounced in studies examining dietary effects, such as the purported links between saturated fat intake and cardiovascular disease. Initial observational and intervention studies from the mid-20th century suggested strong associations, but subsequent meta-analyses have failed to consistently replicate these findings, revealing inconsistencies due to methodological variations, confounding factors like overall diet quality, and selective reporting. For instance, a comprehensive review highlighted that many early claims about saturated fats increasing heart disease risk do not hold under rigorous re-examination, with effect sizes often diminishing or reversing in larger, better-controlled datasets. This has implications for public health guidelines, where non-replicable evidence has influenced long-standing recommendations on fat consumption, prompting calls for preregistration and transparent data sharing to bolster reliability. Representative examples include conflicting results from cohort studies on low-fat diets, where initial protective effects against heart disease were not upheld in replication attempts across diverse populations.

In water resource management, models assessing climate change impacts have exhibited significant replication failures, particularly in cross-site applications, with approximately 50% of projections failing to align when transferred to new geographic or temporal contexts. These discrepancies arise from model sensitivities to local variables like soil hydrology and precipitation patterns, which are often not fully parameterized in original simulations. For example, hydrologic models developed for basin-specific climate scenarios in one region frequently underperform when replicated in other continents' watersheds, highlighting the limits of generalizability in environmental modeling.

Physics and chemistry, while generally more robust due to standardized experimental protocols, are not immune to replication challenges, though failures are rarer and often high-profile. In 2023, claims surrounding cold fusion-like processes, including low-energy nuclear reactions in palladium setups, resurfaced but remained non-replicated despite initial excitement, echoing historical debacles from the 1980s. Similarly, in materials science, complex syntheses like 2D materials (e.g., graphene derivatives) prove especially difficult to duplicate due to subtle variations in fabrication conditions. A 2022 analysis of moiré materials synthesis emphasized that precise replication requires exact control over atomic layering, which is often inadequately documented, leading to inconsistent electronic properties in follow-up studies.

Emerging fields like machine learning and artificial intelligence have amplified the replication crisis through opaque practices, such as undisclosed training data and hyperparameters, resulting in non-replicable models. Reports from 2024 and 2025 highlight that up to 70% of image recognition benchmarks fail replication, often due to data leakage—where test sets inadvertently overlap with training data—or unshared proprietary datasets in large-scale models. For instance, convolutional neural networks achieving state-of-the-art accuracy on standard benchmark datasets frequently underperform in replication efforts because of non-deterministic elements like random initialization and hardware-specific optimizations.
These issues extend beyond performance metrics to broader scientific applications, where ML-driven predictions in fields like climate modeling inherit the same reproducibility pitfalls. Cross-disciplinary analyses from 2025 point to the replication crisis extending deeply into computational fields, including simulations in the natural sciences, where algorithmic choices and software environments exacerbate non-replicability. Lovrić's examination emphasizes that p-hacking and insufficient validation in computational workflows contribute to this expansion, urging standardized pipelines to mitigate risks across physics, environmental modeling, and machine learning. This convergence underscores a systemic need for open-source code and protocols to restore confidence in computational outputs.
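The data-leakage failure mode described above can be reproduced in a few lines. The sketch below uses a synthetic dataset and a simple memorizing classifier (1-nearest-neighbour) purely as an illustration: copying training items into the test set makes the measured accuracy look far better than performance on genuinely unseen data.

```python
# Minimal sketch of test-set leakage with synthetic data: leaked test items are
# exact copies of training items, so a memorizing model scores them perfectly.
import numpy as np

rng = np.random.default_rng(7)

def one_nn_accuracy(train_X, train_y, test_X, test_y):
    """Classify each test point by the label of its nearest training point."""
    dists = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(-1)
    pred = train_y[dists.argmin(axis=1)]
    return (pred == test_y).mean()

# Two noisy, heavily overlapping classes: true generalization is mediocre.
X = rng.normal(size=(600, 20))
y = (X[:, 0] + rng.normal(scale=2.0, size=600) > 0).astype(int)

train_X, train_y = X[:400], y[:400]
clean_test_X, clean_test_y = X[400:], y[400:]

# Leaky evaluation: half of the "test" items are copied from the training set.
leaky_test_X = np.vstack([clean_test_X[:100], train_X[:100]])
leaky_test_y = np.concatenate([clean_test_y[:100], train_y[:100]])

print("accuracy on clean held-out data:",
      one_nn_accuracy(train_X, train_y, clean_test_X, clean_test_y))
print("accuracy with leaked test items:",
      one_nn_accuracy(train_X, train_y, leaky_test_X, leaky_test_y))
```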

Causes

Systemic and Cultural Factors

The expansion of science funding following the Second World War, particularly in the United States through agencies like the National Science Foundation and the National Institutes of Health, increasingly tied grants to the production of novel findings, which diminished institutional support and incentives for replication studies. This shift prioritized groundbreaking discoveries over verification, as funding panels favored high-risk, high-reward projects that promised new knowledge, leaving replication efforts under-resourced and undervalued.

Sociological analyses trace the replication crisis to the erosion of Robert K. Merton's framework of scientific norms, known as CUDOS—communalism (sharing knowledge freely), universalism (impartial evaluation), disinterestedness (objectivity over personal gain), and organized skepticism (rigorous scrutiny). Competitive academic environments have undermined these norms, fostering a culture that elevates novelty and rapid publication over verification and communal critique. In this context, organized skepticism has weakened, as pressures for productivity discourage the time-intensive work of replicating prior results, leading to a performative research culture in which impact metrics overshadow collective reliability. The "publish or perish" culture intensified in recent decades, with hiring and promotions increasingly linked to publication volume rather than depth or rigor. Surveys of faculty indicate that around 68% perceive greater pressure to publish compared to recent years, exacerbating the de-emphasis on replication in favor of prolific output. This systemic pressure has normalized a focus on quantity, where career advancement depends on accumulating papers in high-impact journals, often at the expense of robust verification processes.

Globally, replication norms vary: European systems generally place a stricter emphasis on verification, supported by more balanced funding structures, in contrast to the United States' highly competitive, novelty-driven model that amplifies replication challenges. In Europe, some funding bodies integrate replication considerations into grant evaluations more explicitly than their U.S. counterparts, reflecting cultural differences in prioritizing cumulative knowledge over individual breakthroughs. A contributing cultural factor is the neglect of base rates prevalent in scientific practice, where researchers and evaluators overlook the low prior probabilities of novel effects being true, leading to overconfidence in initial positive findings without adequate replication. This base-rate neglect in scientific culture amplifies the crisis by fostering expectations of high success rates for discoveries that statistically are unlikely to hold, independent of methodological rigor. Publication bias emerges as a symptom of these broader systemic issues, as null results and replication studies are less likely to be disseminated.

Publication and Incentive Structures

One major flaw in the scientific publication system is publication bias, which favors the reporting of positive or novel results over null or negative findings. This bias leads to the "file drawer problem," where studies yielding non-significant results are often left unpublished, distorting the published literature and hindering efforts to replicate or verify claims. Standards of reporting in published papers frequently lack the detail necessary for replication, with recent analyses revealing minimal adoption of transparency practices. For instance, a review of empirical articles in high-impact journals found that fewer than half provided publicly available data (40%), materials (20%), or code (3%), indicating insufficient methodological descriptions to enable independent reproduction.

The "publish or perish" culture exacerbates these issues by prioritizing publication quantity and prestige over rigorous, replicable work. Career advancement metrics, such as journal impact factors and the h-index, reward novel findings in high-profile outlets, often at the expense of confirmatory or replication studies, fostering a system where researchers face pressure to produce eye-catching results to secure jobs, promotions, and tenure. Journal practices further discourage replication, as such studies are rarely published. Before the reform efforts of the 2010s, only about 1.6% of publications explicitly involved replications, reflecting editorial preferences for original, groundbreaking research over verification efforts. Funding incentives compound these structural problems by emphasizing innovation over confirmation. For example, National Institutes of Health (NIH) grant criteria have historically prioritized "transformative" research that promises paradigm shifts, while systematic replication or confirmatory studies receive little dedicated support, limiting the resources available for verification.

Questionable Research Practices

Questionable research practices (QRPs) refer to a range of design, analytic, and reporting choices that researchers make to enhance the chances of obtaining statistically significant results and achieving publication, without crossing into outright fabrication or falsification of data. These practices are often subtle and flexible, allowing researchers to "listen to the data" in ways that capitalize on chance findings while presenting them as confirmatory evidence, as demonstrated in simulations showing how such flexibility can dramatically inflate error rates. Unlike outright fraud, QRPs occupy a gray area where researchers may rationalize them as standard procedure to navigate competitive pressures.

Surveys indicate widespread use of QRPs across fields, particularly in psychology, where self-admission rates for selective reporting of analyses that "work" range from 50% to over 70%. Surveys of researchers in other fields conducted between 2010 and 2020 report witnessed rates for QRPs, including conditional reporting of results, of around 40%. A 2012 survey using truth-telling incentives found that 94% of psychological researchers admitted to engaging in at least one QRP over their career, with specific practices like failing to report all dependent measures (63%) and selectively reporting studies that yielded significant results (46%).

Common examples include HARKing (hypothesizing after the results are known), where researchers formulate or adjust hypotheses post-analysis and present them as pre-planned, which obscures the exploratory nature of the work and biases interpretation toward confirmation. Another is optional stopping, in which data collection continues or stops based on interim statistical significance checks, effectively inflating the chance of false positives without adjusting for multiple testing. These practices are enabled by publication bias, where non-significant results are less likely to be published, further incentivizing flexibility. Simulations illustrate the severe impact of QRPs on scientific validity, demonstrating that even moderate use can inflate the false-positive rate from the nominal 5% to over 50%, as researchers exploit analytic flexibility to report only favorable outcomes. For instance, combining practices like optional stopping with selective outcome reporting can push the likelihood of publishing false positives to 60% or higher in low-power studies. Such inflation undermines replicability, as the reported effects often stem from noise rather than true phenomena, contributing directly to the replication crisis.
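Optional stopping is easy to demonstrate directly. The sketch below uses arbitrary batch sizes, peek counts, and simulation settings: it runs two-group comparisons on data with no true effect and stops as soon as an interim test crosses p < 0.05, which drives the false-positive rate well above the nominal 5%.

```python
# Sketch of optional stopping on null data: peek at the p-value after each
# batch of participants and stop as soon as it is "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def false_positive_rate(peeks, batch, sims=2000):
    hits = 0
    for _ in range(sims):
        a, b = np.empty(0), np.empty(0)
        for _ in range(peeks):
            a = np.concatenate([a, rng.normal(size=batch)])   # group A, true effect = 0
            b = np.concatenate([b, rng.normal(size=batch)])   # group B, true effect = 0
            if stats.ttest_ind(a, b).pvalue < 0.05:           # peek; stop if "significant"
                hits += 1
                break
    return hits / sims

print("single fixed test:", false_positive_rate(peeks=1, batch=20))
print("peeking 10 times: ", false_positive_rate(peeks=10, batch=10))
```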

Statistical and Methodological Issues

One major statistical issue contributing to the replication crisis is low statistical power in experimental designs. Statistical power is defined as 1 − β, where β is the Type II error rate, the probability of failing to detect a true effect. In the social sciences, including psychology, typical power levels range from 0.3 to 0.5, meaning that even true effects have a 50-70% chance of going undetected in a single study, leading to high non-replication rates for genuine findings. For instance, a 2013 analysis of neuroscience studies found a median power of 0.21, exacerbating the risk of missing real effects and inflating apparent ones in published results.

P-hacking, the selective reporting or analysis of data to achieve statistical significance, further undermines replicability by inflating Type I error rates through practices like conducting multiple tests without correction. A common example is multiple testing, where the familywise error rate increases without adjustment; the Bonferroni correction addresses this by dividing the significance level α by the number of tests k, yielding an adjusted threshold of α/k. Simulations demonstrate that such undisclosed flexibility can produce up to 60% false positives even for null effects, directly contributing to non-replicable claims in the literature. Underpowered studies also introduce positive effect size bias, where detected effects are systematically overestimated due to the "winner's curse" phenomenon—significant results arise disproportionately from samples whose estimates exceed the true effect. In some literatures, this bias has led to effect size overestimates of up to three times the true value across low-power studies. Similar patterns appear in other sciences, where small samples amplify sampling error, resulting in inflated estimates that fail to replicate at more realistic scales.

Null hypothesis significance testing (NHST) exacerbates these problems through widespread misinterpretation of p-values, often treated as measures of effect importance or practical significance rather than evidence against the null. A p-value below 0.05 indicates only that the data are unlikely under the null hypothesis, not the probability that the null is true or the size of any alternative effect, yet a 2018 survey found that 99% of psychology researchers and students misinterpreted at least one aspect of p-values. This dichotomous focus on significance thresholds discourages nuanced reporting and promotes cherry-picking, with alternatives like Bayesian methods offering posterior probabilities for hypotheses but seeing limited adoption due to computational demands. Context sensitivity in effect sizes, where results vary across populations or settings, poses another methodological challenge, often overlooked in generalized claims. For example, psychological effects calibrated on WEIRD (Western, Educated, Industrialized, Rich, Democratic) samples—comprising about 96% of publications despite representing only 12% of the global population—frequently diminish or reverse in more diverse groups, reducing cross-study replicability. In meta-analyses attempting to aggregate findings, publication bias distorts pooled estimates by favoring positive results; it is detectable via Egger's test, which regresses standardized effect sizes against their precision to identify asymmetry indicating missing small or null studies. This bias can substantially inflate overall effect sizes in affected fields.
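The interplay of low power and effect-size inflation can be seen in a short simulation. The sketch below uses assumed values (a true standardized effect of 0.2 and 30 participants per group) and shows that, conditional on reaching p < 0.05, the estimated effects are several times larger than the true one—the winner's curse described above.

```python
# Sketch of the "winner's curse" under low power: conditioning on statistical
# significance systematically overestimates the true effect. Values are
# illustrative assumptions, not estimates from any literature.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
true_d, n_per_group, sims = 0.2, 30, 5000    # small true effect, small samples

estimates, significant = [], []
for _ in range(sims):
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(true_d, 1.0, n_per_group)
    d_hat = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    estimates.append(d_hat)
    significant.append(stats.ttest_ind(b, a).pvalue < 0.05)

estimates, significant = np.array(estimates), np.array(significant)
print(f"power (share significant):           {significant.mean():.2f}")
print(f"mean estimated d, all studies:       {estimates.mean():.2f}")
print(f"mean estimated d, significant only:  {estimates[significant].mean():.2f}  (inflated)")
```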
Finally, statistical heterogeneity across studies, quantified by the I^2 statistic as the percentage of total variation due to between-study differences rather than chance, often signals underlying methodological inconsistencies; values exceeding 50% suggest substantial issues, such as unaccounted moderators, complicating reliable synthesis and replication.
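As a worked illustration of the I² calculation, the sketch below applies the standard inverse-variance formulas to a handful of made-up study estimates; the numbers are purely hypothetical.

```python
# Sketch of Cochran's Q and the I^2 heterogeneity statistic from study-level
# effect sizes and standard errors (all values invented for illustration).
import numpy as np

effects = np.array([0.30, 0.10, 0.45, 0.05, 0.25])   # hypothetical study estimates
ses     = np.array([0.10, 0.08, 0.12, 0.09, 0.11])   # their standard errors

w = 1 / ses**2                                        # inverse-variance weights
pooled = np.sum(w * effects) / np.sum(w)              # fixed-effect pooled estimate
Q = np.sum(w * (effects - pooled)**2)                 # Cochran's Q
df = len(effects) - 1
I2 = max(0.0, (Q - df) / Q) * 100                     # % of variation beyond chance

print(f"Q = {Q:.2f} on {df} df, I^2 = {I2:.0f}%")
```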

Consequences

Effects on Scientific Knowledge

The replication crisis has led to substantial wasted resources in scientific research, particularly in biomedical fields where irreproducible preclinical studies consume billions annually. A 2015 analysis estimated that approximately $28 billion per year is spent in the United States on preclinical biomedical research that cannot be successfully repeated, representing about half of total preclinical spending, owing to factors like low reproducibility rates. This financial burden diverts funding from promising avenues, exacerbating inefficiencies in resource allocation across scientific endeavors.

The crisis undermines cumulative scientific knowledge by allowing theories to be constructed on foundations of false positives, resulting in the eventual collapse of entire research paradigms. For instance, the social priming paradigm in social psychology, which posited that subtle environmental cues could unconsciously influence complex behaviors, largely disintegrated following a series of failed replications in the 2010s, prompting widespread reevaluation of related theories. Such breakdowns highlight how non-replicable findings propagate errors, slowing the refinement of theoretical models and hindering genuine progress in understanding phenomena.

In specific fields like psychology, the replication crisis has driven a "credibility revolution" since the mid-2010s, leading to the revision or rejection of a significant share of established results. Large-scale replication projects have shown that only about 36-50% of key psychological effects from prominent journals hold up under rigorous retesting, necessitating updates to textbooks and curricula that previously presented these findings as settled. This erosion extends to broader scientific domains, where irreproducible preclinical results contribute to high failure rates in drug development pipelines; for example, only 11% of landmark cancer papers could be reproduced by one pharmaceutical company, delaying therapeutic innovations and increasing costs for viable treatments.

Long-term analyses reveal the enduring impact on the scientific record, with a substantial portion of highly cited papers from the 2000-2010 period now viewed as questionable due to replication challenges. Studies indicate that non-replicable findings are often cited far more frequently—up to 153 times more than replicable ones—perpetuating flawed knowledge and complicating efforts to build reliable cumulative evidence. This pattern underscores how the crisis not only wastes immediate resources but also distorts the historical record, requiring ongoing efforts to reassess and correct the literature.

Impact on Public Trust and Awareness

The replication crisis has heightened public awareness of issues in scientific reliability, particularly in fields like psychology. Surveys conducted in 2025 indicate that 18% of laypeople have heard of recent failures to replicate studies, with awareness rising to 29% among those exposed to discussions of methodological flaws. This increased visibility has been amplified post-COVID-19, as widespread debate about scientific findings, including vaccine efficacy, has drawn attention to broader concerns over research reliability. For instance, high-profile failures in psychological priming experiments during the 2010s, such as attempts to replicate social priming effects that garnered significant media attention, have contributed to this growing public familiarity with replication challenges.

Erosion of public trust in science has been a notable consequence, with polls showing a marked decline linked to perceptions of irreproducibility. A November 2024 survey found that 76% of Americans reported a great deal (34%) or fair amount (42%) of confidence in scientists, down from 87% in 2020, with the decline partly attributed to events including the replication crisis and the COVID-19 pandemic. This downturn, which accelerated during the pandemic, has been attributed in part to the replication crisis, as revelations of non-reproducible results have fueled doubts about the reliability of scientific claims. The crisis's exposure of systemic issues has thus intertwined with other trust-eroding events, deepening skepticism toward expert consensus.

Media coverage of the replication crisis has further shaped public perceptions, spotlighting scandals and prompting official acknowledgments. In the 2010s, extensive reporting on failed replications of priming studies in psychology, which had previously achieved viral status in outlets like TED Talks and major news publications, highlighted the fragility of celebrated findings and sparked widespread debate. By 2025, this culminated in official statements addressing the "replication crisis" as a threat to public confidence, including an executive order on "Restoring Gold Standard Science" that emphasized reproducibility to rebuild trust in federally funded research. These developments have had tangible societal consequences, including indirect contributions to vaccine hesitancy through perceived scientific unreliability. The crisis has amplified uncertainties in biomedical research, where replication failures foster a general skepticism that exacerbates hesitancy by portraying science as prone to error or bias. On a positive note, however, the heightened awareness has empowered the public to demand more rigorous, transparent research, fostering greater scrutiny of claims and ultimately strengthening societal expectations for evidence-based knowledge.

Institutional and Academic Responses

The credibility revolution in psychology during the 2010s represented a pivotal shift toward prioritizing rigor and transparency in research practices, prompted by large-scale replication failures that highlighted systemic issues in the field. A key component of this movement was the founding of the Open Science Framework (OSF) in 2013 by Brian Nosek and Jeffrey Spies at the University of Virginia, which provides free tools for preregistration, data sharing, and collaborative project management to foster open science. The OSF has since supported major initiatives, such as the Reproducibility Project: Psychology, which attempted to replicate 100 studies from top journals and found only 36% showed statistically significant effects in the same direction.

Journals responded by revising policies to encourage reproducible research. In April 2013, Nature journals implemented updated reporting standards requiring authors to provide detailed methods, statistical analyses, and resource information to enhance transparency and facilitate replication. PLOS followed in March 2014 with a mandatory data availability policy, compelling authors to include statements on how underlying data could be accessed for replication, reanalysis, or validation of findings. Funding agencies introduced measures to enforce rigor. The U.S. National Institutes of Health (NIH) announced plans in 2014 to bolster rigor and reproducibility, issuing guidelines for preclinical research and requiring applicants from 2015 onward to address the strength of prior studies, authentication of key resources, and potential biases in experimental design. The European Research Council (ERC) similarly stresses data management and retention in its requirements, recommending that funded researchers maintain accessible data files to enable reanalysis and verification.

Academic training adapted to these concerns, with U.S. graduate programs in the 2020s increasingly integrating open science and reproducibility into curricula; a survey of APA-accredited doctoral programs found that over 70% offered training on topics like preregistration and data sharing to equip students against replicability challenges. Conferences played a role in dissemination, as seen in the Association for Psychological Science (APS) 2015 annual convention, which included dedicated sessions on replication strategies and open practices to guide researchers in implementing reforms. These institutional efforts often reference preregistration as a core tool for mitigating selective reporting.

Remedies

Reforms in Publishing Practices

To address the replication crisis, several reforms in publishing practices have emerged to mitigate publication bias and questionable research practices by emphasizing methodological rigor over results. These include preregistration of studies, result-blind peer review, mandates for data and code sharing, dedicated journals for metascience and replication, and databases tracking retractions. Such changes aim to shift incentives toward transparent, verifiable research processes.

Preregistration involves researchers publicly committing to their hypotheses, methods, and analysis plans before data collection, typically via platforms like the Open Science Framework (OSF), which launched preregistration capabilities in 2013. This practice distinguishes confirmatory analyses from exploratory ones, reducing the flexibility that enables p-hacking and other questionable research practices (QRPs) by locking in decisions prior to observing outcomes. Empirical evidence shows preregistration improves the credibility of findings by preserving accurate calibration of evidence and minimizing post-hoc adjustments. For instance, the Center for Open Science's initiatives have demonstrated that preregistered studies exhibit higher evidential value and lower rates of bias compared to non-preregistered ones.

Result-blind peer review, proposed as a key reform in 2013, evaluates manuscripts based solely on research questions, methods, and proposed analyses without knowledge of results, thereby countering biases favoring positive or significant outcomes. Journals such as Psychological Science adopted related formats like Registered Replication Reports starting in 2013, where peer review occurs in stages: initial approval of methods before data collection, followed by review of results. This approach had been implemented in over 200 journals across disciplines by the early 2020s, leading to higher-quality evidence and reduced selective reporting. Studies of these formats indicate they enhance replicability by prioritizing scientific merit over novelty.

Open science mandates have further transformed publishing by requiring data, code, and materials to be shared alongside publications, facilitating independent verification. The Transparency and Openness Promotion (TOP) Guidelines, developed in 2015 and widely adopted by 2016, provide a modular framework for journals to enforce levels of transparency across citation, data, code, research design, and analysis transparency. In psychology, numerous high-impact journals, including those from the Association for Psychological Science, have integrated TOP standards, promoting compliance through editorial policies and checklists. Adoption has grown steadily, with surveys indicating that by the late 2010s a significant portion of published articles included availability statements, though full sharing remains variable due to barriers like privacy concerns. These guidelines directly target publication bias by making non-significant or null results verifiable and reusable.

Dedicated metascience journals have emerged to prioritize replication studies and methodological critiques, providing outlets for research that might otherwise face publication hurdles. Meta-Psychology, launched in 2020, exemplifies this by focusing exclusively on the methods, theories, and practices of psychological science, including empirical replications and analyses of replicability factors. Such venues encourage rigorous evaluation of the research ecosystem, with articles often employing Bayesian or meta-analytic approaches to assess replication success rates across fields. Metadata tools like the Retraction Watch Database, established in 2010, track retractions, expressions of concern, and related issues in the scholarly literature, promoting accountability in publishing.
By cataloging over 50,000 retractions and corrections by the mid-2020s, it enables researchers and journals to monitor patterns of misconduct and reliability, informing policy reforms such as enhanced post-publication review. The database's open availability has facilitated meta-analyses revealing spikes in retractions linked to the replication crisis, underscoring the need for proactive standards.

Enhancements in Statistical Methods

In response to the replication crisis, researchers have proposed several enhancements to statistical methods to reduce false positives and improve the reliability of findings. One prominent reform involves tightening the threshold for statistical significance. In 2017, a group of 72 researchers advocated redefining the default p-value threshold for claims of new discoveries from the conventional 0.05 to 0.005, arguing that the stricter cutoff would sharply reduce false positive rates while maintaining acceptable statistical power. The proposal relabels results with p-values between 0.005 and 0.05 as "suggestive evidence," reserving statistical significance for p < 0.005 and encouraging replication before novel results are accepted as definitive.

To address the widespread issue of underpowered studies, which often fail to detect true effects reliably, recommendations emphasize increasing sample sizes to achieve higher statistical power, typically targeting 90% power (1 − β = 0.9) rather than the common 50-60%. For a two-sided test of a single mean (or of paired differences) using the normal approximation, the required sample size is

n = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2 \, \sigma^2}{\delta^2}

where Z_{1-\alpha/2} is the critical value for the significance level α (e.g., 1.96 for α = 0.05), Z_{1-\beta} is the critical value for the desired power (e.g., 1.28 for 90% power), σ is the standard deviation, and δ is the minimum detectable effect; comparing two independent groups requires roughly twice this number per group. Achieving 90% power often requires sample sizes approximately three times larger than those in typical underpowered studies, depending on the effect size and variability.

Misinterpretation of p-values has exacerbated the crisis, leading to overreliance on dichotomous testing. The American Statistical Association's 2016 statement urged a shift toward emphasizing estimation via confidence intervals, which provide a range of plausible effect sizes rather than a binary significant-or-not outcome. This approach promotes better understanding of uncertainty and effect magnitude, with education efforts highlighting that a p-value does not measure the probability that the null hypothesis is true or the size of an effect. Confidence intervals, for instance, allow researchers to assess practical significance alongside statistical evidence, fostering more nuanced interpretations.

In data-intensive fields such as machine learning, where model overfitting can undermine replicability, cross-validation techniques have been adopted to evaluate model robustness. K-fold cross-validation, a standard method, partitions the dataset into k subsets (folds), training the model on k − 1 folds and validating on the held-out fold, repeating this process k times to compute an average performance metric. This resampling approach reduces variance in performance estimates and helps ensure models generalize beyond the training data, with k often set to 5 or 10 for a balance between bias and computation. Its application has become routine in predictive modeling to guard against spurious results that fail replication.

Bayesian statistical methods offer a departure from null hypothesis significance testing (NHST) by incorporating prior knowledge and providing direct probability statements about parameters via posterior distributions. Instead of p-values, Bayesian approaches use credible intervals to quantify uncertainty in effect estimates, allowing for more flexible hypothesis evaluation through Bayes factors or model comparison. John Kruschke's 2014 work exemplifies this transition, demonstrating how Bayesian estimation with priors and posteriors yields richer inferences than NHST, particularly for small samples or complex models. The approach has gained traction in psychology and the social sciences for its ability to accumulate evidence across studies without rigid thresholds. Improvements in meta-analysis techniques aim to correct for publication bias, a key driver of non-replicable findings.
The trim-and-fill method, introduced by Duval and Tweedie, addresses asymmetry by iteratively "trimming" studies with overly large effects (presumed biased) and "filling" in hypothetical missing studies on the opposite side to estimate an unbiased overall effect. This nonparametric approach has been widely implemented in software like Comprehensive Meta-Analysis, providing adjusted effect sizes that better reflect the true literature. While not without limitations, such as sensitivity to the number of studies, it enhances the credibility of meta-analytic syntheses in fields prone to selective reporting.
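As a worked example of the power calculation described above, the sketch below evaluates the normal-approximation formula for an assumed standard deviation and minimum detectable difference; the inputs are illustrative only.

```python
# Sketch of the sample-size formula above using normal-approximation critical
# values; sigma and delta are illustrative assumptions.
from scipy.stats import norm

alpha, power = 0.05, 0.90
sigma, delta = 15.0, 5.0                      # SD of the outcome, minimum effect to detect

z_alpha = norm.ppf(1 - alpha / 2)             # about 1.96 for alpha = 0.05
z_beta = norm.ppf(power)                      # about 1.28 for 90% power

n_one_sample = (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
n_two_sample = 2 * n_one_sample               # per group, comparing two independent means

print(f"one-sample / paired design: n = {n_one_sample:.0f}")
print(f"two-group comparison:       n = {n_two_sample:.0f} per group")
```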

Replication Initiatives and Funding

Organized efforts to address the replication crisis have included large-scale collaborative projects aimed at systematically replicating key findings across multiple laboratories. The Many Labs series, initiated in the early 2010s, exemplifies such initiatives; Many Labs 2, published in 2018, involved 36 samples from 28 different laboratories replicating 28 classic and contemporary psychological effects, achieving a replication success rate of approximately 50% based on statistical significance. Similarly, in economics and the social sciences, the Institute for Replication (I4R), established in the early 2020s, conducts reproductions and replications of influential studies to enhance credibility, including meta-analyses such as a 2024 study of 110 papers that found 85% computational reproducibility.

Funding from philanthropic and governmental sources has been crucial to sustaining these replication efforts. Philanthropic grants totaling more than $1.5 million to the Center for Open Science between 2011 and 2020 supported initiatives in metascience, including projects like Many Labs that aligned scientific practices with values of openness and integrity. In 2023, the U.S. National Science Foundation (NSF) allocated more than $1.8 million across 10 awards to advance reproducibility infrastructure, encompassing replication and metascience programs that encourage high-powered designs and preregistration in the social and behavioral sciences.

Databases have emerged to catalog and track replication attempts, facilitating meta-analytic insights into replicability trends. The Reproducibility Project: Psychology, completed in 2015, established an open database on the Open Science Framework containing replications of 100 studies from top journals, revealing an overall replication rate of 36% and enabling ongoing queries into factors like sample size and effect magnitude. Complementing this, Curate Science, a web-based platform introduced in 2015 and expanded in the 2020s, allows researchers, journals, and institutions to tag and evaluate the transparency and credibility of published studies, promoting community-driven assessments of reproducibility. Guidelines have been developed to involve original authors in replication processes, enhancing the fidelity of attempts. The ARRIVE guidelines, updated in 2018 for reporting animal experiments, recommend that original teams provide detailed protocols, data, and materials to support replications, thereby reducing barriers to verification in biomedical fields.

Educational initiatives in post-secondary institutions have increasingly emphasized replication design to train future researchers. In the 2020s, universities have integrated courses and modules on the replication crisis into undergraduate and graduate curricula, such as workshops teaching preregistration, data sharing, and multi-lab coordination to foster robust study designs. Big team science approaches have further advanced replication in specialized domains. The ManyBabies project, ongoing since 2017, unites over 100 laboratories worldwide to replicate and extend infant cognition studies using standardized protocols, quantifying variability in effects like infants' preference for prosocial agents and achieving high generalizability through diverse samples.

Broader Cultural and Policy Shifts

The replication crisis has prompted a shift toward methodological triangulation, which emphasizes integrating evidence from diverse approaches—such as observational data, experiments, and genetic studies—rather than relying solely on direct replication to validate findings. This strategy, advocated by Munafò and Davey Smith, helps mitigate biases inherent in single methods and builds more robust conclusions by cross-validating results across independent lines of inquiry. In parallel, the crisis has encouraged viewing scientific progress through the lens of complex adaptive systems, in which knowledge evolves dynamically as an interconnected ecosystem of theories, methods, and practices that self-corrects over time. Failed replications serve as signals for revising or refining theories, fostering adaptability rather than treating non-replications as mere failures, as explored in recent analyses linking the crisis to adaptability and systemic resilience in research.

The open science movement has accelerated these changes, with the Transparency and Openness Promotion (TOP) Guidelines undergoing significant updates in 2024 to incorporate verification practices and study types that enhance reproducibility across disciplines. Complementing this, the FAIR principles—ensuring data and materials are Findable, Accessible, Interoperable, and Reusable—have become foundational for sharing resources, promoting collaborative validation beyond isolated labs. On the policy front, the White House Office of Science and Technology Policy (OSTP) issued a 2025 memorandum establishing "Gold Standard Science" requirements, mandating reproducibility standards for federally funded research to ensure transparency and rigor in grant allocations; as of November 2025, the NSF has integrated these into grant review processes, requiring replication plans for high-risk projects. Recent developments extend these principles to machine learning, exemplified by NeurIPS 2025's updated reproducibility checklists that require detailed reporting of computational environments and data handling to address unique challenges in AI validation. Meta-analyses in 2025 indicate tangible progress, with psychological studies showing stronger evidential support through larger sample sizes (up to 100% increases in some subfields) and fewer questionable p-values, reflecting improved reliability post-crisis.

References

  1. [1]
    1,500 scientists lift the lid on reproducibility - Nature
    May 25, 2016 · More than 70% of researchers have tried and failed to reproduce another scientist's experiments, and more than half have failed to reproduce their own ...
  2. [2]
    Estimating the reproducibility of psychological science
    Aug 28, 2015 · Estimating the reproducibility of psychological science. Open Science ... The Open Science Collaboration. Alexander A. Aarts,1 Joanna E ...
  3. [3]
    Low replicability can support robust and efficient science - Nature
    Jan 17, 2020 · The replication crisis is real, but it is less clear how it should be resolved. Here we examine potential solutions by modeling a scientific ...
  4. [4]
    The replication crisis has led to positive structural, procedural, and ...
    Jul 25, 2023 · Researchers have suggested that the replication crisis is, in fact, a “theory crisis”. Low rates of replicability may be explained in part by ...
  5. [5]
    'Publish or perish' culture blamed for reproducibility crisis - Nature
    Jan 20, 2025 · The leading cause cited for that crisis was “pressure to publish”. ...
  6. [6]
    What is replication? - PMC - PubMed Central - NIH
    Mar 27, 2020 · Replication is a study for which any outcome would be considered diagnostic evidence about a claim from prior research. This definition reduces ...
  7. [7]
    Making sense of replications - PMC - NIH
    Jan 19, 2017 · Replication is independently repeating the methodology of a previous study and obtaining the same results. In another sense, the answer is difficult.
  8. [8]
    Direct replication | FORRT - FORRT
    Feb 7, 2025 · Generally, direct replication refers to a new data collection that attempts to replicate original studies' methods as closely as possible.
  9. [9]
    Explicating Exact versus Conceptual Replication | Erkenntnis
    Sep 29, 2021 · What does it mean to replicate an experiment? A distinction is often drawn between 'exact' (or 'direct') and 'conceptual' replication.
  10. [10]
    Examining the Meanings of “Conceptual Replication” and “Direct ...
    The proper way to bolster and extend a theory is by conceptual replication. Whereas proponents of direct replication present theory as constrained (by evidence) ...
  11. [11]
    Rethinking research reproducibility - PMC - NIH
    Dec 5, 2018 · As Karl Popper famously put it: “Single occurrences that cannot be reproduced are of no significance to science” (Popper, 1935).
  12. [12]
    Karl Popper: Philosophy of Science
    Falsification also plays a key role in Popper's proposed solution to David Hume's infamous problem of induction. On Popper's interpretation, Hume's problem ...
  13. [13]
    The Replication Crisis in Psychology - Noba Project
    In science, replication is the process of repeating research to determine the extent to which findings generalize across time and across situations.
  14. [14]
    Robert Boyle on the importance of reporting and replicating ...
    Feb 7, 2020 · Robert Boyle on the importance of reporting and replicating experiments · Boyle's research programme: building theory using empirical research ...
  15. [15]
    The Michelson–Morley experiments of 1881 and 1887 - Book chapter
    Miller's replications of the Michelson–Morley experiment increased the effective length of the interferometer arms from approximately 11 m to 32 m. This ...
  16. [16]
    Reproducibility of Scientific Results
    Dec 3, 2018 · This review consists of four distinct parts. First, we look at the term “reproducibility” and related terms like “repeatability” and “replication”.
  17. [17]
    Psychology after World War II - History of Psychology - iResearchNet
    The post-World War II era marked a critical phase in the development of psychology as both an academic discipline and a professional practice.
  18. [18]
    [PDF] Beyond the two disciplines of scientific psychology - Gwern.net
    Between 1960 and 1970, many of us searched fruitlessly for interactions of abilities in the Thurstone or Guilford systems. One hypothesis. Snow and I pursued ...
  19. [19]
    [PDF] Evaluating the Replicability of Social Priming Studies
    Ninety-four percent of the replications had effect sizes smaller than the effect they replicated and only 17% of the replications reported a significant p-value.
  20. [20]
    Self-control and limited willpower: Current status of ego depletion ...
    As of 2022, there had been 36 major multi-site replication attempts across all of social psychology, and only four had succeeded. A few more had mixed, ...
  21. [21]
    Raise standards for preclinical cancer research - Nature
    Mar 28, 2012 · A team at Bayer HealthCare in Germany last year reported that only about 25% of published preclinical studies could be validated to the point at ...
  22. [22]
    Reproducibility in Cancer Biology: What have we learned? - eLife
    Dec 7, 2021 · As the final outputs of the Reproducibility Project: Cancer Biology are published, it is clear that preclinical research in cancer biology is not as ...
  23. [23]
    More than half of high-impact cancer lab studies could not ... - Science
    Dec 7, 2021 · Data: Reproducibility Project: Cancer Biology. Results from only five papers could be fully reproduced. Other ...
  24. [24]
    The Economics of Reproducibility in Preclinical Research
    Jun 9, 2015 · Nevertheless, we believe a 50% irreproducibility rate, leading to direct costs of approximately US$28B/year, provides a reasonable starting ...
  25. [25]
    Biomedical researchers' perspectives on the reproducibility of ...
    Nov 5, 2024 · Key findings include that 72% of participants agreed there was a reproducibility crisis in biomedicine, with 27% of participants indicating the ...
  26. [26]
    Puzzlingly High Correlations in fMRI Studies of Emotion, Personality ...
    We show that these correlations are higher than should be expected given the (evidently limited) reliability of both fMRI and personality measures.
  27. [27]
  28. [28]
  29. [29]
    Promoting Reproducibility and Replicability in Political Science
    Feb 13, 2024 · Published assessment studies generally report reproducibility rates below 50%, and sometimes the success rate is single-digit (Avelino et al., ...
  30. [30]
    Examining the replicability of online experiments selected by a ...
    Nov 19, 2024 · Overall, 54% of the studies were successfully replicated, with replication effect size estimates averaging 45% of the original effect size ...
  31. [31]
    Minimum Wages and Employment: Replication of Card and Krueger ...
    Apr 10, 2010 · We employ the original Card and Krueger (1994) data and the CIC estimator to reexamine the evidence on the effect of minimum wages on employment ...
  32. [32]
    Replications provide mixed evidence that inequality moderates the ...
    Replications provide mixed evidence that inequality moderates the association between income and generosity, Proc. Natl. Acad. Sci. U.S.A. 117 (16) 8696-8697, ...
  33. [33]
    Meta-analyses in nutrition research: sources of insight or confusion?
    Sep 18, 2017 · Suddenly, eating saturated fat or being overweight no longer look so dangerous. But it's just an artifact of a faulty method. Because meta- ...
  34. [34]
    Dietary Fat and Cardiovascular Disease: Ebb and Flow Over ... - NIH
    A systematic review of the effect of dietary saturated and polyunsaturated fat on heart disease. Nutr Metab Cardiovasc Diseases. 2017;27(12):1060–80. [DOI] ...
  35. [35]
    Are climate models “ready for prime time” in water resources ...
    Oct 13, 2010 · Our response to the question posed in the title of this editorial is that, while they are getting better, climate models are not (up to) ready ...
  36. [36]
    Modeling U.S. water resources under climate change - AGU Journals
    Feb 24, 2014 · We demonstrate a new modeling system that integrates climatic and hydrological determinants of water supply with economic and biological drivers
  37. [37]
    Cold fusion is making a scientific comeback | Popular Science
    Jul 3, 2023 · There is still no generally accepted theory that supports cold fusion; many still doubt that it's possible at all. But those physicists and ...
  38. [38]
    Science Has a Reproducibility Problem. Can Sample Sharing Help?
    Jul 27, 2023 · So difficult is it to precisely repeat the preparation of some moiré materials that in a 2022 review, physicist Chun Ning Lau and her colleagues ...
  39. [39]
    Leakage and the Reproducibility Crisis in ML-based Science
    We argue that there is a reproducibility crisis in ML-based science. We compile evidence of this crisis across fields, identify data leakage as a pervasive ...
  40. [40]
    Is AI leading to a reproducibility crisis in science? - ResearchGate
    The current standard of scientific research in AI has led many prominent AI researchers to warn of a reproducibility crisis (Kapoor & Narayanan, 2023; Ball, ...
  41. [41]
    Go Forth and Replicate: On Creating Incentives for Repeat Studies
    Sep 11, 2017 · Scientists have few direct incentives to replicate other researchers' work, including precious little funding to do replications.
  42. [42]
    The Evolution and Impact of Federal Government Support for R&D in ...
    The calls for increased funding were supported by a strong NIH director, who could point to new scientific understanding of disease processes as the basis for ...
  43. [43]
    The DECAY of Merton's scientific norms and the new academic ethos
    Here, the impact of a performative culture is linked to the need for a large number of academics to align their research interests with funding opportunities.
  44. [44]
  45. [45]
    Might Europe one day again be a global scientific powerhouse ...
    We conclude that research performance in GFIS and in other EU countries is intrinsically low, even in highly selected and generously funded projects.
  46. [46]
    [PDF] The "File Drawer Problem" and Tolerance for Null Results
    Both behavioral researchers and statisti- cians have long suspected that the studies published in the behavioral sciences are a biased sample of the studies ...
  47. [47]
    An empirical assessment of transparency and reproducibility-related ...
    Feb 19, 2020 · Less than half the articles were publicly available (101/250, 40% [34% to 47%]). Minimal adoption of transparency and reproducibility-related ...
  48. [48]
    Fixing the Engine of American Science - Paragon Health Institute
    Nov 27, 2023 · Yet NIH has never systematically funded replication experiments. ... NIH's extramural funding has flowed almost entirely to traditional ...
  49. [49]
    Fifty years of research on questionable research practises in science
    QRPs have been defined as 'design, analytic or reporting practices that have been questioned because of the potential for the practice to be employed with the ...
  50. [50]
    False-Positive Psychology - Joseph P. Simmons, Leif D. Nelson, Uri ...
    (2011). Measuring the prevalence of questionable research practices with incentives for truth-telling. Manuscript submitted for publication. Go to Reference.
  51. [51]
    Measuring the Prevalence of Questionable Research Practices With ...
    Aug 9, 2025 · They are probably more prevalent than most psychologists suspect. In a survey of 2,000 research psychologists, John et al. (2012) reported ...
  52. [52]
    A Systematic Review and Meta-Analysis | Science and Engineering ...
    Jun 29, 2021 · The first meta-analysis to calculate the prevalence of RM and QRPs was conducted by Fanelli (2009), who examined 21 surveys and found that 1.97% ...
  53. [53]
    [PDF] Measuring the Prevalence of Questionable Research Practices With ...
    A survey of over 2,000 psychologists found a surprisingly high percentage engaged in questionable research practices, suggesting it may be a prevailing norm.
  54. [54]
    HARKing: Hypothesizing After the Results are Known - Sage Journals
    HARKing is defined as presenting a post hoc hypothesis (ie, one based on or informed by one's results) in one's research report as if it were, in fact, an a ...
  55. [55]
    2 Catalogue of questionable research practices - How Scientists Lie
    The goal of this catalogue is to detail all questionable research practices (QRPs) that has been identified in the research literature so far.
  56. [56]
    Replication Success Under Questionable Research Practices-a ...
    Even if the positive effect of QRPs on type-I error (T1E) rate, that is, the false positive rate, has already been intensively investigated (Simmons, Nelson ...
  57. [57]
    Questionable research practices may have little effect on replicability
    This strategy of data peeking can easily raise the rate of false positives up to 20% (Simmons et al., 2011). (d) Finally, selective outlier removal can also ...
  58. [58]
    Do studies of statistical power have an effect on the power of studies?
    The long-term impact of studies of statistical power is investigated using J. Cohen's (1962) pioneering work as an example. We argue that the impact is nil; ...
  59. [59]
    Power failure: why small sample size undermines the reliability of ...
    Apr 10, 2013 · We show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low ...
  60. [60]
    (PDF) False-Positive Psychology - ResearchGate
    ... P-hacking refers to undisclosed data practices aimed at achieving statistically significant outcomes (i.e., p values below a threshold like .05; Simmons et ...
  61. [61]
    Underpowered studies and exaggerated effects: A replication and re ...
    Feb 15, 2025 · This study replicates an anchoring study that reported an effect size of a 31% increase in participants' bids.
  62. [62]
    When Null Hypothesis Significance Testing Is Unsuitable for Research
    Null hypothesis significance testing (NHST) has several shortcomings that are likely contributing factors behind the widely debated replication crisis.
  63. [63]
    Misinterpretations of the p-value in psychological research
    Feb 25, 2025 · Misinterpretations of p-values perpetuate overconfidence in research findings, often leading to oversights in clinical trials, misallocation of resources, and ...
  64. [64]
    [PDF] The weirdest people in the world? - Description
    The findings suggest that members of. WEIRD societies, including young children, are among the least representative populations one could find for generalizing ...
  65. [65]
    Bias in meta-analysis detected by a simple, graphical test - The BMJ
    Sep 13, 1997 · A simple analysis of funnel plots provides a useful test for the likely presence of bias in meta-analyses.
  66. [66]
    Quantifying heterogeneity in a meta-analysis - PubMed
    The extent of heterogeneity in a meta-analysis partly determines the difficulty in drawing overall conclusions. This extent may be measured by estimating a ...
  67. [67]
    Failure to replicate | Science
    Jan 9, 2025 · ... priming came crashing down in the course of psychology's replication crisis, a crisis that has since swept through the sciences more generally.
  68. [68]
    How Psychological Study Results Have Changed Since the ...
    May 21, 2025 · The field's response to the replication crisis illustrates self-correction mechanisms. Over the past decade, researchers have embarked on a ...
  69. [69]
    New Center for Open Science Designed to Increase Research ...
    Mar 4, 2013 · Nosek founded the center with Jeffrey Spies, a U.Va. graduate ... The center's signature project is the Open Science Framework website.
  70. [70]
    Reproducibility Project: Psychology - OSF
    We conducted replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original ...
  71. [71]
    Enhancing reproducibility | Nature Methods
    Apr 29, 2013 · New reporting standards for Nature journal authors are intended to improve transparency and reproducibility.
  72. [72]
    Promoting reproducibility by emphasizing reporting - PLOS One
    Jun 14, 2017 · In 2014, an updated Data Availability Policy was implemented across all PLOS journals to encourage validation, reanalysis, and replication of ...
  73. [73]
    Enhancing Reproducibility through Rigor and Transparency
    Sep 9, 2024 · This webpage provides information about the efforts underway by NIH to enhance rigor and reproducibility in scientific research.
  74. [74]
    [PDF] “Toward a FAIR Reproducible Research”
    18The European Research Council (ERC) recommends ”to all its funded researchers that they follow best practice by retaining files of all the research data they ...
  75. [75]
    Open Science Training in APA-accredited Clinical Psychology ... - OSF
    The purpose of the present study is to evaluate the breadth of training on issues related to the replication crisis and open science among Clinical Psychology ...
  76. [76]
    APS: Leading the Way in Replication and Open Science
    Dec 14, 2017 · Read about the breadth of APS activities on advancing replicability and reproducibility in psychological science.
  77. [77]
    Registered Reports - Center for Open Science
    Registered Reports is a publishing format that emphasizes the importance of the research question and the quality of methodology by conducting peer review ...
  78. [78]
    The preregistration revolution - PNAS
    Progress in science relies in part on generating hypotheses with existing observations and testing hypotheses with new observations.
  79. [79]
    Result-Blind Peer Reviews and Editorial Decisions - Hogrefe eContent
    We argue that peer-reviewed journals based on the principle of rigorous evaluation of research proposals before results are known would address this problem ...
  80. [80]
    TOP Guidelines - Center for Open Science
    The TOP Guidelines are a policy framework for open science, including seven research practices, two verification practices, and four verification study types.
  81. [81]
    Transparency and Openness Promotion
    TOP Guidelines help make science and data more accessible through multiple aspects of research planning and reporting for journals to follow, ...
  82. [82]
    Redefine statistical significance | Nature Human Behaviour
    Sep 1, 2017 · We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries.
  83. [83]
    Statistical notes for clinical researchers: Sample size calculation 1 ...
    The sample size was calculated as 16 subjects per group: n₁ = 2(Z_{α/2} + Z_β)²σ² / (mean₁ − mean₂)² = 2 × (1.96 + 0.84)² × 10² / 10² = 15.68 ≈ 16 ...
  84. [84]
    [PDF] p-valuestatement.pdf - American Statistical Association
    March 7, 2016. The American Statistical Association (ASA) has released a “Statement on Statistical Significance and P-Values” with six principles underlying ...
  85. [85]
    The ASA Statement on p-Values: Context, Process, and Purpose
    Jun 9, 2016 · The ASA Board decided to take up the challenge of developing a policy statement on p-values and statistical significance.
  86. [86]
    3.1. Cross-validation: evaluating estimator performance - Scikit-learn
    In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same ...
  87. [87]
    A Gentle Introduction to k-fold Cross-Validation
    Oct 4, 2023 · In this tutorial, you will discover a gentle introduction to the k-fold cross-validation procedure for estimating the skill of machine learning models.
  88. [88]
    Cross validation for model selection: A review with examples from ...
    Nov 13, 2022 · Choosing k in k-fold cross validation: It is sometimes claimed that there is a bias-variance trade-off when selecting the value of k in k-fold ...
  89. [89]
    The Bayesian New Statistics: Hypothesis testing, estimation, meta ...
    Feb 7, 2017 · The New Statistics emphasizes a shift of emphasis away from null hypothesis significance testing (NHST) to “estimation based on effect sizes, ...
  90. [90]
    The Bayesian New Statistics: Hypothesis testing, estimation, meta ...
    The article reviews frequentist and Bayesian approaches to hypothesis testing and to estimation with confidence or credible intervals.
  91. [91]
    Trim and fill: A simple funnel-plot-based method of testing ... - PubMed
    Trim and fill is a nonparametric method using funnel plots to estimate missing studies and adjust for publication bias in meta-analysis.
  92. [92]
    The trim-and-fill method for publication bias - PubMed Central - NIH
    Jun 7, 2019 · The trim-and-fill method is a popular tool to detect and adjust for publication bias. Simulation studies have been performed to assess this method.
  93. [93]
    [PDF] meta trimfill — Nonparametric trim-and-fill analysis of publication bias
    The main goal of the trim-and-fill method is to evaluate the impact of publication bias on our final inference. The idea of the method is to iteratively ...
  94. [94]
    Institute For Replication
    I4R works with researchers to reproduce and replicate research findings to improve transparency and trust in social science research.
  95. [95]
    Center for Open Science - Openness, integrity, and reproducibility
    Center for Open Science, Inc. Grant Amount. $1,500,000. Funding Area. Human Sciences. Department. Human Sciences.
  96. [96]
    US National Science Foundation Shows Commitment to Year of ...
    Sep 28, 2023 · In 2023, through their public-access program, the NSF has funded 10 awards totaling over $1.8 million. ... replication, and verification.
  97. [97]
    Curate Science
    Curate Science is an initiative to strengthen science by developing web apps/tools to curate the transparency and credibility of research.
  98. [98]
    Teaching the Replication Crisis and Open Science in ... - OSF
    Feb 17, 2025 · Teaching the Replication Crisis and Open Science in Introduction to Psychology. Authors. Charlotte Rebecca Pennington and Madeleine Pownall.
  99. [99]
    ManyBabies
    ManyBabies (MB) is a collaborative project for replication and best practices in developmental psychology research.
  100. [100]
    Robust research needs many lines of evidence - Nature
    Jan 23, 2018 · Replication is not enough. Marcus R. Munafò and George Davey Smith state the case for triangulation.
  101. [101]
    Replication Crisis in Psychology, Second-Order Cybernetics, and ...
    Jan 8, 2025 · This article aims to reconceptualize the replication crisis as not merely a problem of flawed methods, lack of scientific rigor, or questionable researcher ...
  102. [102]
    New Preprint Introduces Major Update to the TOP Guidelines
    Sep 18, 2024 · The Center for Open Science (COS) is pleased to introduce a major update to the Transparency and Openness Promotion (TOP) Guidelines, dubbed TOP 2025.
  103. [103]
    OSTP Issues Agency Guidance for Gold Standard Science
    Jun 23, 2025 · White House Office of Science and Technology Policy Director Michael Kratsios issued guidance to federal agencies on incorporating Gold Standard Science tenets ...
  104. [104]
    NeurIPS Paper Checklist Guidelines
    2025 ... The NeurIPS Paper Checklist is designed to encourage best practices for responsible machine learning research, addressing issues of reproducibility, ...