
Reproducibility

Reproducibility is a cornerstone of scientific research, referring to the ability of independent researchers to obtain consistent results when repeating an experiment or analysis under similar conditions, whether using the original data and methods (computational reproducibility) or new data to verify findings (replicability). This principle ensures that scientific findings can be independently verified and built upon, advancing knowledge reliably across disciplines from physics to the social sciences. Despite its centrality, reproducibility has faced significant challenges, often termed the "reproducibility crisis," which highlights widespread difficulties in replicating published results. A 2016 survey of 1,576 scientists published in Nature revealed that more than 70% had failed to reproduce another researcher's experiments, while over 50% had failed to reproduce their own work. The crisis gained prominence with John P. A. Ioannidis's influential 2005 paper in PLOS Medicine, which argued mathematically that most published research findings are likely false due to factors such as low statistical power, small effect sizes, bias, and flexible study designs that inflate false positives. The issue is particularly acute in biomedical research, where irreproducible results waste resources and undermine public trust in science, and concerns have persisted into the 2020s in areas such as artificial intelligence.

Several key factors contribute to poor reproducibility, including inadequate access to data, protocols, and materials; misidentified biological reagents; difficulties managing complex datasets; suboptimal research practices; cognitive biases; and a competitive academic culture that prioritizes novel positive results over rigorous replication. To address these, initiatives such as the American Society for Cell Biology's multi-tiered framework, which encompasses direct replication (same conditions), analytic replication (reanalysis of original data), systematic replication (varied models), and conceptual replication (different methods), promote structured approaches to replication. Broader efforts include pre-registration of studies, data sharing, and enhanced training, as recommended by the National Academies of Sciences, Engineering, and Medicine, to foster transparency and rigor without stifling innovation.

Definitions and Terminology

Core Definitions

Reproducibility is the ability to obtain consistent results by applying the same methodology, inputs, and conditions as those used in the original study, thereby verifying the reliability of the findings. This principle underpins the scientific process by ensuring that reported outcomes are not artifacts of unique circumstances but can be reliably demonstrated again. In practice, reproducibility serves as a foundational check against errors, biases, or variability in execution. According to the National Academies of Sciences, Engineering, and Medicine (NASEM), reproducibility specifically refers to computational reproducibility: obtaining consistent results using the same input data, computational steps, methods, code, and conditions of analysis. Terminology varies across fields; for example, some standards (such as the ACM's) define reproducibility more broadly as involving different teams or setups, while this article aligns with NASEM for consistency.

A key distinction within reproducibility lies between exact replication and conceptual replication. Exact replication seeks to recreate the original study under conditions as identical as possible, aiming for precise duplication of procedures, materials, and environment to confirm the specific results. In contrast, conceptual replication tests the same underlying hypothesis using similar but varied methods, populations, or settings, emphasizing generalizability over literal repetition. While exact replication is often idealized in computational contexts for bit-for-bit consistency, conceptual replication is particularly valuable in empirical fields for assessing robustness across contexts.

The scope of reproducibility varies between empirical sciences and computational research. In empirical sciences such as physics, it involves repeating experiments or field observations under controlled conditions to achieve results within statistical margins of error. Conversely, computational reproducibility focuses on ensuring that software code, datasets, and analysis pipelines yield the same outputs when rerun in the same computational environment. This distinction highlights how reproducibility adapts to the nature of the inquiry, from physical experimentation to digital determinism. A basic reproducibility check can be formalized mathematically: if the original result R is derived from method M applied to data D, then reproduction demands a result R' such that R' \approx R under the identical M and D, where the approximation accounts for acceptable numerical or statistical tolerances.

Reproducibility is often distinguished from repeatability, which refers to the ability to obtain consistent results from the same experiment or analysis under nearly identical conditions, typically by the same team over a short period. While NASEM defines reproducibility narrowly as computational verification with the same inputs, broader usages (for example, in metrology) emphasize consistency across different laboratories or implementations with minor variations, though this aligns more closely with replicability in NASEM terms. The distinction matters in practice: repeatability might confirm a measurement's precision in one setup, but reproducibility across facilities tests its broader reliability. Reproducibility also differs from replicability, which involves independent recreation of the study by others using new data but similar methods to address the same question, aiming to verify the finding's validity beyond the original context.
Generalizability, meanwhile, extends further by assessing whether results apply to broader populations, settings, or conditions not tested in the original study, such as extrapolating clinical trial outcomes to diverse patient groups. For instance, a reproducible psychological experiment might yield the same effect when the original code and data are rerun, a replicable one might confirm the effect with fresh participants, and a generalizable one might hold across cultural contexts. Robustness is another related but distinct concept, defined as the stability of results under intentional perturbations or alternative plausible methods, ensuring reliability against variations that could reasonably arise. Unlike reproducibility's focus on methodological consistency to achieve the same outcome, robustness tests a finding's resilience, such as whether an analysis yields similar conclusions when using different but valid assumptions. In machine learning, for example, a robust model maintains performance despite noisy inputs, whereas reproducibility ensures the exact training process can be rerun to produce identical model outputs. The table below summarizes these related terms; a brief code sketch following the table illustrates the R' \approx R check in practice.
Term | Time Scale | Conditions | Example Field
Repeatability | Short-term | Identical setup, same team | Laboratory measurements in chemistry
Reproducibility | N/A | Same inputs/data/code, identical conditions | Computational biology analysis reruns
Replicability | Variable | New data, similar methods | Psychological experiments
Generalizability | Broad | New contexts/populations | Clinical trials in medicine
Robustness | Perturbation | Alternative plausible variations | Machine learning models
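To make the R' \approx R check described above concrete, the following minimal sketch (with a hypothetical run_analysis function standing in for the method M) treats a rerun as reproduced when the recomputed result agrees with the original within a stated numerical tolerance; bit-for-bit reproduction corresponds to a tolerance of zero.

```python
import math

def run_analysis(data):
    # Hypothetical deterministic analysis M: here, simply the sample mean.
    return sum(data) / len(data)

def is_reproduced(original_result, data, rel_tol=1e-9, abs_tol=0.0):
    """Rerun the same method M on the same data D and compare R' to R.

    The tolerances encode the 'acceptable numerical or statistical
    tolerances' mentioned in the text; setting both to zero demands
    exact, bit-for-bit agreement.
    """
    rerun_result = run_analysis(data)          # R'
    return math.isclose(rerun_result, original_result,
                        rel_tol=rel_tol, abs_tol=abs_tol)

if __name__ == "__main__":
    D = [4.1, 3.9, 4.0, 4.2, 3.8]
    R = run_analysis(D)                        # original result R
    print("reproduced:", is_reproduced(R, D))  # True for a deterministic M
```

In practice such a comparison would be applied to each reported quantity, with tolerances chosen to reflect the numerical or statistical precision of the original study.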

Historical Context

Origins in Scientific Method

The concept of reproducibility emerged as a foundational principle of the scientific method during the Scientific Revolution, particularly through Francis Bacon's advocacy for systematic experimentation in his 1620 work Novum Organum. Bacon criticized traditional scholastic methods for their reliance on unverified authorities and proposed an inductive approach that began with careful observations and controlled experiments to build reliable knowledge. He emphasized the need for experiments that could be repeated under similar conditions to verify hypotheses and eliminate errors, grounding his program of natural history in collections of reproducible facts. This framework aimed to ensure that scientific claims were founded on verifiable evidence rather than speculation, marking a shift toward empirical rigor in inquiry.

In the 17th century, Galileo Galilei and René Descartes further advanced the emphasis on repeatable observations and reproducible experiments, integrating them into the evolving scientific method. Galileo's work, such as his telescopic observations and inclined-plane experiments detailed in Dialogues Concerning Two New Sciences (1638), demonstrated the value of quantitative measurements and repeatable trials in confirming mechanical principles, like the uniform acceleration of falling bodies. He advocated publishing detailed experimental accounts to allow others to replicate and verify results, setting a precedent for transparency in scientific reporting. Similarly, Descartes, in his Discourse on the Method (1637), outlined rules for methodical doubt and experimentation, stressing that hypotheses must be tested through reproducible observations to achieve certainty, blending rational deduction with empirical verification. Their contributions underscored reproducibility as essential for distinguishing true natural laws from illusory perceptions.

By the early 19th century, reproducibility became more formalized in laboratory practice, particularly in chemistry and physics, through standardized protocols that ensured consistent outcomes. Justus von Liebig's establishment of a teaching laboratory at the University of Giessen in the 1820s revolutionized chemical education by implementing structured analytical methods and apparatus, such as his kaliapparat for organic analysis, which allowed students and researchers to replicate experiments with precision and reliability. This model promoted reproducibility by training practitioners in uniform techniques, reducing variability in results and enabling widespread verification of chemical compositions. In physics, Michael Faraday exemplified replication protocols in his electromagnetic researches, meticulously documenting apparatus designs, procedural variations, and visual diagrams in works like his 1821 paper on electromagnetic rotation, facilitating exact reproductions by contemporaries such as Ampère. These practices solidified reproducibility as a cornerstone of experimental science, ensuring findings could be independently confirmed.

The influence of peer review in early scientific journals, notably the Philosophical Transactions of the Royal Society launched in 1665, reinforced the requirement for reproducible methods by institutionalizing scrutiny of experimental descriptions. Editor Henry Oldenburg introduced a referee system in which submissions were evaluated for clarity and verifiability, ensuring that reported procedures were detailed enough for replication by skilled practitioners.
This process, applied to accounts of phenomena like Boyle's air pump experiments, helped filter unreliable claims and elevated standards for scientific communication, embedding reproducibility in the communal validation of knowledge.

Evolution in the 20th and 21st Centuries

In the early 20th century, reproducibility in scientific research advanced significantly through the integration of statistical methods into experimental design. Ronald A. Fisher's 1925 publication, Statistical Methods for Research Workers, introduced key concepts such as analysis of variance and randomized experimental designs, which emphasized replication of observations to account for variability and ensure results reflected broader populations rather than isolated instances. These principles provided a rigorous framework for testing hypotheses and reducing error, fundamentally shaping reproducible practices in fields such as agriculture and biology.

Following World War II, standardization efforts in biology further solidified reproducibility by establishing consistent protocols for research and product development. The World Health Organization, founded in 1948, developed international biological standards and requirements for substances such as vaccines and sera, ensuring uniformity and reliability in testing and manufacturing across nations. In the United States, the National Institutes of Health (NIH) underwent significant expansion after 1948, implementing peer-reviewed funding mechanisms and guidelines that promoted standardized methodologies in biomedical research, thereby enhancing the replicability of experimental outcomes.

From the 1980s to the 2000s, the rise of computational research introduced new dimensions to reproducibility, particularly the need to manage software and data dependencies. The concept of "reproducible research," coined by geophysicist Jon Claerbout in 1992, advocated archiving code, data, and workflows alongside publications to allow exact recreation of results. This era also saw the emergence of version control systems, exemplified by Git's creation in 2005, which facilitated collaborative tracking of code changes and mitigated issues from evolving software environments.

In the 2010s, open-science initiatives and empirical surveys drove further evolution in reproducibility standards. PLOS ONE, launched in 2006, contributed to open-access publishing and later implemented a mandatory data availability policy in 2014, requiring authors to make supporting data publicly accessible. A 2007 analysis of cancer publications found that articles with publicly shared data received 69% more citations than those without. Key surveys from 2011 to 2015, including the Open Science Collaboration's 2015 attempt to replicate 100 psychology studies (succeeding in only 36% of cases), underscored field-specific challenges and prompted widespread adoption of preregistration and transparency measures. A 2016 Nature poll of more than 1,500 scientists revealed that over 70% had failed to reproduce others' experiments and more than 50% their own, highlighting the need for systemic reforms across disciplines.

Post-2016 developments continued to advance reproducibility through institutional reports and funded initiatives. The National Academies of Sciences, Engineering, and Medicine released a 2019 report, Reproducibility and Replicability in Science, which defined key terms, identified barriers, and recommended practices like better training and incentives for replication to enhance scientific reliability. In the 2020s, efforts have included NIH-funded replication studies in preclinical research (as of 2025) and international initiatives such as Nature's exploration of reproducibility in the social sciences, alongside a 2024 survey reaffirming persistent challenges in replicating published work.

Importance and Challenges

Role in Scientific Validity

Reproducibility serves as a foundational mechanism for falsification in the scientific method, as articulated by Karl Popper in his 1934 work Logik der Forschung (later published in English as The Logic of Scientific Discovery in 1959), where he proposed that scientific theories must be testable and potentially refutable through empirical observation. This criterion demands that experiments yielding falsifying results can be independently repeated under similar conditions to verify or challenge the original findings, ensuring that apparent falsifications are not artifacts of unique circumstances or errors. Without reproducibility, the ability to rigorously test and potentially disprove hypotheses is undermined, rendering scientific claims vulnerable to confirmation bias and impeding the demarcation between empirical science and pseudoscience. Furthermore, reproducibility facilitates cumulative knowledge building by allowing subsequent researchers to rely on validated prior results as a stable foundation for new investigations, thereby accelerating theoretical advancement and innovation across fields.

The role of reproducibility extends to broader institutional impacts, influencing funding decisions, policy formulation, and public trust in science. Funders, including major agencies like the National Institutes of Health (NIH), increasingly prioritize reproducible research in grant evaluations to maximize the return on public investments, as irreproducible findings lead to wasted resources and delayed progress. For instance, studies have shown that high retraction rates, often linked to irreproducibility, correlate with eroded confidence in scientific outputs, prompting policy reforms such as mandatory data-sharing requirements to restore accountability. This erosion affects public trust more broadly, as surveys indicate that awareness of reproducibility issues diminishes societal reliance on expert advice during crises, underscoring the need for verifiable findings to sustain support for research endeavors.

Reproducibility offers key benefits that enhance scientific rigor, including the reduction of biases such as confirmation bias and selective reporting, which can skew interpretations of data. It enables robust meta-analyses by providing access to data and methods that can be reanalyzed across studies, yielding more reliable estimates and identifying patterns that individual experiments might miss. Additionally, it supports interdisciplinary validation, allowing experts from diverse fields to scrutinize and adapt findings, thereby strengthening cross-domain applications and mitigating field-specific limitations.

Ethically, reproducibility forms a cornerstone of the responsible conduct of research (RCR), as emphasized in guidelines from the Office of Research Integrity (ORI) under the U.S. Department of Health and Human Services, which integrate it into training on data management, scientific rigor, and conflicts of interest to prevent misconduct and ensure ethical practice. These principles align with mandates for RCR training, requiring institutions to foster practices that promote verifiable outcomes and uphold the moral obligations of researchers to the scientific community and society.

The Reproducibility Crisis

The reproducibility crisis refers to the observation that a substantial proportion of scientific studies cannot be independently replicated, undermining confidence in published results across various disciplines. The term gained prominence in the early 2010s, particularly following high-profile reports highlighting systemic failures in reproducing key findings, marking a shift from isolated concerns to widespread recognition of the issue. This awareness was intensified by a 2012 report from Amgen researchers, who attempted to replicate 53 landmark preclinical cancer studies and succeeded in only 6 cases, an irreproducibility rate of approximately 89%.

Field-specific investigations have provided empirical evidence of the crisis's scope. In psychology, the Open Science Collaboration's 2015 large-scale replication effort targeted 100 studies from top journals and achieved a success rate of 36%, with replication effect sizes significantly smaller than the originals. Similarly, in cancer biology, the Amgen findings indicated less than 50% reproducibility for influential studies, often due to insufficient methodological detail. In economics, a 2016 replication project covering 18 laboratory experiments published in leading journals yielded a 61% success rate, still highlighting variability and challenges in confirming results.

Several underlying factors contribute to the crisis, including publication bias favoring novel positive results, p-hacking through selective data analysis to achieve statistical significance, and resource constraints that limit comprehensive replications. These practices, often incentivized by "publish or perish" pressures, reduce statistical power and inflate false positives. The crisis persisted into the 2020s, notably during the COVID-19 pandemic, when rapid preprint dissemination amplified quality concerns; a 2021 analysis documented a retraction rate of 0.065% for COVID-19 publications, more than six times the baseline scientific average, signaling broader quality and reproducibility problems in expedited research. More recent evidence includes a 2025 reproducibility project in Brazil that failed to validate dozens of biomedical studies, and surveys in which 72% of biomedicine researchers agreed that a significant crisis exists.

Measures and Assessment

Quantitative Metrics

Quantitative metrics provide objective, numerical assessments of reproducibility by quantifying the agreement, consistency, or predictive accuracy between original studies and their replications. These metrics are essential for evaluating the reliability of scientific findings across disciplines, particularly in fields such as psychology, biomedicine, and statistics, where variability in results can undermine validity. Common approaches include measures of agreement between repeated measurements, comparisons of standardized effect magnitudes, prediction-based indices, and Bayesian model validation techniques. These tools enable researchers to determine statistically the extent to which results can be reliably reproduced, often revealing replication rates as low as 36-50% in large-scale replication efforts.

The intraclass correlation coefficient (ICC) is a widely used statistic for measuring the reproducibility of quantitative outcomes across multiple replications or raters. It assesses the proportion of total variance attributable to between-subject differences relative to within-subject variability, with values ranging from 0 (no reproducibility) to 1 (perfect reproducibility). The formula for the ICC in a one-way random-effects model is given by: \text{ICC} = \frac{\text{MS}_B - \text{MS}_W}{\text{MS}_B + (k-1)\text{MS}_W} where \text{MS}_B is the mean square between subjects, \text{MS}_W is the mean square within subjects (error), and k is the number of replicates per subject. In reproducibility studies, ICC values above 0.75 indicate excellent agreement, while those below 0.5 suggest poor reproducibility; for instance, biomedical measurement tools often achieve ICCs around 0.8-0.9 when protocols are standardized. This metric is particularly valuable for continuous outcomes in clinical and experimental settings, as it accounts for both systematic and random error in replication attempts.

Effect size consistency evaluates reproducibility by comparing standardized measures of effect magnitude, such as Cohen's d, between original and replication studies. Cohen's d quantifies the difference between group means in standard deviation units, with small (d \approx 0.2), medium (d \approx 0.5), and large (d \approx 0.8) effects as conventional benchmarks. Reproducibility is assessed by checking whether the replication effect size falls within the 95% confidence interval of the original or by computing the ratio of replication to original effect sizes; in the Reproducibility Project: Psychology, 47% of original effect sizes fell within the 95% confidence intervals of the replication effect sizes, highlighting inflated original effects and reduced consistency upon replication. This approach prioritizes practical significance over p-values, revealing that even statistically significant replications often show diminished effect sizes (roughly halved on average), which underscores power issues in under-resourced studies.

The replication success rate, as framed through prediction intervals, quantifies the expected proportion of replications that align with original findings by calculating the percentage of replication effect sizes falling within a 95% prediction interval derived from the original study's statistics. Introduced in analyses of psychological replications, this metric accounts for sampling variability and power; Patil et al. (2016) applied it to the Reproducibility Project: Psychology data, finding that 77% of replication effect sizes were within the predicted interval, far higher than the 36% rate of significant p-values in direct tests.
This provides a more lenient yet statistically grounded measure of reproducibility, emphasizing plausible ranges over binary success/failure, and is especially useful for meta-analyses where original studies have heterogeneous sample sizes.

Bayesian metrics, such as posterior predictive checks (PPCs), assess model reproducibility by simulating new data from the posterior distribution and comparing it to observed data. PPCs generate replicated datasets \tilde{y} from the model parameters \theta drawn from the posterior p(\theta | y), then evaluate discrepancy measures T(y, \theta) (e.g., mean or variance) to compute a posterior predictive p-value (PPP) as the proportion of simulated discrepancies exceeding the observed one; PPP values near 0.5 indicate good model fit and reproducibility, while extremes suggest misspecification. In reproducibility contexts, PPCs verify whether a Bayesian model can consistently generate data patterns matching empirical observations across independent runs, as demonstrated in admixture modeling where PPCs rejected ill-fitting models with PPP < 0.05. This approach enhances reproducibility by incorporating prior knowledge and uncertainty quantification, complementing frequentist metrics in complex, hierarchical data analyses.
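Two of the frequentist metrics above, the one-way ICC and the prediction-interval check, can be illustrated with a short, self-contained sketch. The ICC function implements the formula given earlier, and the prediction-interval helper follows the general logic described for the Patil et al. metric (an original estimate, a replication estimate, and their standard errors); the function names and the toy numbers are illustrative, not taken from any cited study.

```python
import math
from statistics import mean

def icc_oneway(measurements):
    """One-way random-effects ICC for a list of per-subject replicate lists.

    Implements ICC = (MS_B - MS_W) / (MS_B + (k - 1) * MS_W), assuming the
    same number of replicates k for every subject.
    """
    n = len(measurements)
    k = len(measurements[0])
    grand = mean(x for subject in measurements for x in subject)
    subject_means = [mean(subject) for subject in measurements]

    ss_between = k * sum((m - grand) ** 2 for m in subject_means)
    ss_within = sum((x - m) ** 2
                    for subject, m in zip(measurements, subject_means)
                    for x in subject)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def within_prediction_interval(orig_est, orig_se, rep_est, rep_se, z=1.96):
    """Check whether a replication estimate falls inside the 95% interval
    implied by the original estimate and both standard errors, in the
    spirit of the prediction-interval metric described in the text."""
    half_width = z * math.sqrt(orig_se ** 2 + rep_se ** 2)
    return abs(rep_est - orig_est) <= half_width

if __name__ == "__main__":
    # Three replicate measurements for each of four subjects (toy data).
    data = [[10.1, 10.3, 10.2], [12.0, 11.8, 12.1],
            [9.5, 9.7, 9.4], [11.2, 11.0, 11.3]]
    print(f"ICC = {icc_oneway(data):.3f}")

    # Illustrative effect sizes: original d = 0.60 (SE 0.15),
    # replication d = 0.35 (SE 0.12).
    print("replication consistent:",
          within_prediction_interval(0.60, 0.15, 0.35, 0.12))
```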

Qualitative Evaluations

Qualitative evaluations of reproducibility involve interpretive and procedural assessments that rely on expert judgment, structured checklists, and transparency reviews rather than purely numerical metrics. These methods emphasize the clarity, completeness, and adherence to best practices in research reporting and execution, helping to identify potential sources of variability or bias that could undermine replication efforts. By focusing on narrative and audit-based approaches, qualitative evaluations complement quantitative metrics, such as replication success rates, by providing contextual insight into methodological rigor.

Peer review checklists serve as a cornerstone of qualitative assessment, offering standardized criteria to evaluate the transparency and detail of research protocols and reports. For instance, the ARRIVE guidelines, developed in 2010 by the National Centre for the Replacement, Refinement and Reduction of Animals in Research (NC3Rs), provide a 20-item checklist for reporting in vivo animal experiments, covering aspects such as study design, randomization, blinding, and statistical methods to enhance reproducibility. These guidelines have been widely adopted in biomedical journals, with a 2020 update organizing them into essential and recommended items to further improve reporting quality and facilitate independent replication.

In systematic reviews, narrative synthesis methods allow assessors to qualitatively appraise the overall quality of evidence across studies, integrating descriptive insights on reproducibility factors such as methodological consistency and risk of bias. The GRADE (Grading of Recommendations Assessment, Development and Evaluation) approach, established in the early 2000s and refined through ongoing revisions, structures this evaluation by rating the certainty of evidence on domains such as inconsistency, indirectness, and publication bias, often through expert consensus discussions. Studies have shown that GRADE assessments exhibit good inter-rater reliability when applied by trained reviewers, making it a reproducible tool for synthesizing qualitative judgments on evidence robustness in fields like medicine and public health.

Lab audits and transparency scoring systems provide ongoing qualitative oversight by examining research practices and documentation for openness and verifiability. The Open Science Framework (OSF) badges system, introduced by the Center for Open Science in 2013, awards digital badges to publications that demonstrate preregistration, data sharing, or code availability, serving as a visual audit of transparency that encourages reproducible practices without mandating numerical outcomes. These badges have been integrated into over 100 journals and have correlated with increased rates of data accessibility, as evidenced by uptake in psychology and other disciplines.

Emerging AI-assisted qualitative checks are enhancing protocol validation by automating reviews of research descriptions for completeness and adherence to reproducibility standards. For example, the APPRAISE-AI tool, developed in 2023, uses machine learning to evaluate primary studies of clinical AI models, scoring items such as data-source documentation and validation procedures through natural language processing of manuscripts, and achieving high accuracy in identifying gaps that affect replicability. Such tools streamline expert reviews while maintaining a focus on interpretive quality, particularly in rapidly evolving fields like AI-driven research.
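As a rough illustration of how audit-style checks can be made systematic, the sketch below scores a manuscript against a small transparency checklist. The items and the report format are hypothetical composites inspired by the kinds of criteria discussed above; they are not the official ARRIVE, GRADE, or OSF badge instruments, which are applied by trained reviewers.

```python
# A minimal sketch of an audit-style transparency check. The items and the
# report structure are illustrative, not the official ARRIVE, GRADE, or
# OSF-badge criteria.

CHECKLIST = {
    "preregistration_linked": "Is a preregistration or protocol linked?",
    "data_available": "Are the underlying data deposited and cited?",
    "code_available": "Is the analysis code archived with a persistent ID?",
    "randomization_reported": "Are randomization procedures described?",
    "blinding_reported": "Is blinding (or its absence) stated?",
}

def audit(manuscript_metadata: dict) -> dict:
    """Return which checklist items a manuscript satisfies plus a tally."""
    results = {item: bool(manuscript_metadata.get(item, False))
               for item in CHECKLIST}
    results["items_met"] = sum(results[item] for item in CHECKLIST)
    return results

if __name__ == "__main__":
    submission = {"data_available": True, "code_available": True}
    report = audit(submission)
    for item, question in CHECKLIST.items():
        print(f"[{'x' if report[item] else ' '}] {question}")
    print(f"{report['items_met']} of {len(CHECKLIST)} items satisfied")
```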

Practices for Achieving Reproducibility

Methodological Approaches

One key methodological approach to enhancing reproducibility is pre-registration of studies, which entails publicly documenting research plans, hypotheses, sample sizes, and analysis strategies prior to data collection. This practice mitigates selective reporting and p-hacking by establishing a time-stamped record that distinguishes confirmatory from exploratory analyses, thereby increasing transparency and reducing the flexibility to alter plans post hoc based on observed results. The AsPredicted platform, launched in 2015, exemplifies this by providing a simple, standardized template for pre-registration that generates a single-page, time-stamped PDF, facilitating easy creation and verification while allowing delayed public release to protect intellectual property. Adoption of pre-registration has grown significantly, with over 1,200 submissions per month on platforms like AsPredicted by the late 2010s, demonstrating its role in bolstering scientific integrity across fields such as psychology and economics.

Detailed methodology reporting is another foundational protocol for reproducibility, ensuring that experimental procedures, materials, and statistical analyses are described with sufficient precision to allow independent replication. The Consolidated Standards of Reporting Trials (CONSORT), first published in 1996, provides a structured checklist and flow diagram specifically for randomized controlled trials, covering aspects such as participant eligibility, intervention details, randomization methods, and outcome measures to facilitate assessment of trial validity. This standard addresses historical deficiencies in reporting, where incomplete descriptions often obscured potential biases, and has been endorsed by major journals to standardize transparency in clinical research outputs. By mandating explicit subheadings for protocol, assignment, and analysis, CONSORT enables readers to evaluate the rigor of methods, thereby supporting reproducible interpretation of results.

Randomization and blinding techniques are essential for minimizing systematic biases in experimental design and execution, ensuring that treatment effects are attributable to interventions rather than confounding factors. Randomization, pioneered by Ronald A. Fisher in his 1925 work on experimental design, involves assigning participants or units to groups using chance-based methods (e.g., simple or stratified random allocation) to balance known and unknown covariates across conditions, thereby validating inferential statistics. Blinding, or masking, complements this by concealing group assignments from participants, investigators, or analysts to prevent performance, detection, or expectation biases; for instance, double-blinding hides allocations from both subjects and researchers during outcome assessment. When these techniques are properly implemented, for example through allocation concealment that prevents prediction of assignments, systematic reviews have found that unblinded trials yield odds ratios for treatment effects roughly 17% larger than double-blinded trials.

Effective data management practices, including versioning and comprehensive documentation, further promote reproducibility by maintaining the integrity and traceability of research artifacts throughout the research lifecycle.
The FAIR principles, introduced in 2016, outline guidelines for making data findable (e.g., via persistent identifiers), accessible (through standardized protocols), interoperable (using shared formats and vocabularies), and reusable (with detailed metadata and provenance information). Versioning tracks iterative changes to datasets and code, often via tools that log modifications with timestamps, while documentation includes rich annotations describing collection methods, processing steps, and assumptions to enable independent verification. These practices keep data usable for replication, as evidenced by their adoption in data repositories where versioned datasets support reproducible workflows and reduce errors from ambiguous records. Quantitative indicators, such as reuse rates in shared repositories, can verify adherence to these principles by measuring accessibility and citation impact.
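The sketch below illustrates two of the practices described in this subsection in simplified form: a seeded, permuted-block randomization whose allocation sequence can be regenerated exactly, and a minimal provenance record combining a file checksum with free-text documentation. The block size, seed, filenames, and metadata fields are illustrative choices, not prescribed by CONSORT or the FAIR principles.

```python
import hashlib
import json
import random
from datetime import datetime, timezone

def blocked_allocation(n_participants, block_size=4, seed=20240101):
    """Permuted-block randomization with a fixed seed so that the
    allocation sequence itself is reproducible. Block size and seed
    are illustrative."""
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < n_participants:
        block = (["treatment"] * (block_size // 2) +
                 ["control"] * (block_size // 2))
        rng.shuffle(block)
        sequence.extend(block)
    return sequence[:n_participants]

def describe_dataset(path):
    """Write a minimal provenance record: checksum, timestamp, and notes.
    The fields are a sketch of FAIR-style metadata, not a standard."""
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    record = {
        "file": path,
        "sha256": digest,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "collection_notes": "describe instruments, protocol version, units",
    }
    with open(path + ".metadata.json", "w") as fh:
        json.dump(record, fh, indent=2)
    return record

if __name__ == "__main__":
    # Regenerating this list with the same seed yields the same allocation.
    print(blocked_allocation(10))
```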

Tools and Technologies

Containerization technologies such as Docker, introduced in 2013, enable the packaging of software applications along with their dependencies into portable, isolated environments, ensuring that computational experiments can be executed consistently across different systems without variation in the underlying infrastructure. This approach addresses common reproducibility issues arising from differences in operating systems, library versions, or hardware configurations, allowing researchers to share self-contained images that replicate the exact runtime conditions of their original analyses.

Notebook systems like Jupyter, launched in 2014, facilitate reproducible workflows by integrating executable code, visualizations, and narrative text within a single interactive document, enabling readers to rerun analyses step by step and verify outputs directly. These environments support literate programming paradigms, where code cells are executed in sequence to produce reproducible results, and companion tools allow conversion to static formats for sharing while preserving the ability to execute the notebook in compatible kernels.

Version control systems such as Git, developed in 2005, track changes to code and data files over time, providing a historical record that supports auditing and rollback to specific states, which is essential for documenting the evolution of reproducible research pipelines. Complementing this, archiving platforms like Zenodo, established in 2013, offer persistent storage for datasets, code, and software with automatically assigned Digital Object Identifiers (DOIs), ensuring long-term accessibility and citability while integrating with version control repositories for comprehensive provenance tracking.

In the 2020s, AI-assisted tools such as GitHub Copilot, released in 2021, have been used to enhance code quality, readability, and functionality, potentially aiding reproducibility by reducing certain implementation errors, though evidence on the overall impact is mixed. Similarly, blockchain technologies are being piloted for data integrity in scientific workflows, leveraging immutable ledgers to verify the authenticity and unaltered state of datasets, as explored in blockchain-based provenance tracking for clinical trials and research collaboration as of 2024-2025.
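Containers, notebooks, and version control can be complemented by a lightweight, scripted record of the computational environment. The sketch below, assuming a project that uses Git and standard Python packaging, gathers interpreter and package versions, the current commit hash (via the standard git rev-parse HEAD command), and SHA-256 hashes of input files into a JSON manifest; the manifest structure is an illustrative convention, not a standard format.

```python
import hashlib
import json
import platform
import subprocess
import sys
from importlib import metadata

def environment_manifest(packages, data_files=()):
    """Capture a snapshot of the computational environment.

    A lightweight complement to container images: records the interpreter,
    selected package versions, the current Git commit, and SHA-256 hashes
    of input files. Package and file names are whatever the caller passes.
    """
    manifest = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {},
        "git_commit": None,
        "data_hashes": {},
    }
    for name in packages:
        try:
            manifest["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            manifest["packages"][name] = "not installed"
    try:
        manifest["git_commit"] = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        pass  # not a Git repository, or git not available
    for path in data_files:
        with open(path, "rb") as fh:
            manifest["data_hashes"][path] = hashlib.sha256(fh.read()).hexdigest()
    return manifest

if __name__ == "__main__":
    print(json.dumps(environment_manifest(["numpy", "pandas"]), indent=2))
```

Archiving such a manifest alongside the code and data (for example, in the same Git repository or Zenodo deposit) gives later researchers a concrete starting point for recreating the original runtime conditions.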

Case Studies and Examples

Successful Reproductions

One prominent example of successful reproduction in physics is the detection of gravitational waves by the Laser Interferometer Gravitational-Wave Observatory (LIGO). On September 14, 2015, the two LIGO detectors in Hanford, Washington, and Livingston, Louisiana, simultaneously observed a signal consistent with the merger of two black holes approximately 1.3 billion light-years away, marking the first direct detection of the gravitational waves predicted by general relativity. This initial observation was immediately corroborated by independent analysis of data from both detectors, confirming the signal's astrophysical origin through consistent waveform matches and the exclusion of instrumental artifacts. Subsequent observations further validated the discovery. In December 2015, LIGO detected a second gravitational wave event from another binary black hole merger, announced in June 2016, which replicated the waveform characteristics and strain amplitude patterns of the first event, strengthening confidence in the detection methodology and analysis pipelines. These reproductions across multiple events and detectors built robust scientific consensus, culminating in the 2017 Nobel Prize in Physics awarded to Rainer Weiss, Barry C. Barish, and Kip S. Thorne for their decisive contributions to LIGO and the observation of gravitational waves.

In psychology, the Many Labs projects exemplify successful multi-site reproductions that confirmed numerous behavioral effects under standardized, high-powered conditions. The inaugural Many Labs project in 2014, involving 36 laboratories, replicated 13 classic and contemporary psychological findings, such as the effect of smiling on emotional experience and the gain-loss theory of attraction, achieving successful replication (a significant effect in the expected direction) for 10 of the 13 effects, with effect sizes comparable to the originals in most cases. This effort demonstrated that coordinated replication across diverse samples and settings can reliably reproduce effects when protocols are preregistered and adequately powered (average power >90%), fostering greater trust in foundational social and cognitive effects. Follow-up efforts like Many Labs 2 (2018), spanning 36 countries and 68 samples, targeted 28 effects and confirmed 14 (50%) with statistically significant results in the predicted direction, while providing precise estimates for all, which advanced understanding of generalizability. These projects not only verified specific mechanisms, such as the impact of similarity on liking, but also highlighted how rigorous, collaborative replication enhances the field's cumulative knowledge.

In computational science, the reproducibility of climate models has been advanced through shared multi-model ensembles in the Intergovernmental Panel on Climate Change (IPCC) assessments. The Sixth Assessment Report (AR6), released in 2021, relied on the Coupled Model Intercomparison Project Phase 6 (CMIP6), in which over 30 international modeling groups contributed standardized simulations using common forcing scenarios and protocols, enabling direct comparison and reproduction of global warming projections. This ensemble approach reproduced observed historical climate trends, such as the 1.1°C global surface temperature rise since pre-industrial times, with close agreement across models, confirming human-induced influences with very high confidence.
The transparent data archiving in the Earth System Grid Federation allowed independent verification, underpinning the report's consensus on future risks such as sea-level rise.

These successful reproductions have profoundly shaped scientific progress by establishing reliable foundations for theory and policy. The gravitational-wave confirmations opened a new era of multimessenger astronomy, enabling routine detections that number over 200 events as of 2025. In psychology, the Many Labs outcomes spurred methodological reforms, increasing preregistration adoption and elevating replicable findings to core curriculum status. Similarly, CMIP6's reproducible ensembles informed assessments of the Paris Agreement's climate targets, demonstrating how verified models drive international consensus on mitigation strategies. Collectively, tying awards like the Nobel Prize to reproducible work incentivizes rigor, ensuring that advances endure scrutiny.

Notable Irreproductions

In social psychology, the 2010 study on "power posing" by Dana R. Carney, Amy J. C. Cuddy, and Andy J. Yap claimed that adopting brief high-power nonverbal poses could elevate testosterone levels, reduce cortisol, and increase feelings of power and risk tolerance. Subsequent replication attempts between 2015 and 2018 consistently failed to reproduce these hormonal and behavioral effects. For instance, a 2015 large-scale study by Eva Ranehill and colleagues with over 200 participants found no significant impact on hormones or risk-taking, attributing the original results to potential confounds such as self-reported feelings rather than physiological changes. A 2017 analysis of 11 new experiments further confirmed no positive effects on behavioral measures such as task performance or hormone levels, highlighting issues with statistical power in the original work. In response, Carney issued a 2016 statement disavowing belief in the power posing effects, though she argued against full retraction of the original paper given its methodological transparency at the time. This case exemplifies how irreproducibility can erode confidence in influential findings without formal retraction, prompting broader scrutiny of nonverbal behavior research.

In stem cell biology, the 2014 claim of stimulus-triggered acquisition of pluripotency (STAP) cells by Haruko Obokata and colleagues at Japan's RIKEN institute asserted that subjecting somatic cells to mild stress, such as acid baths or mechanical pressure, could reprogram them into pluripotent stem cells with potential for regenerative therapies. The protocol proved non-reproducible from the outset, with independent labs worldwide, including groups at Harvard and elsewhere, unable to generate STAP cells despite following the described methods. Obokata herself failed to reproduce the results under supervised conditions at RIKEN in late 2014, and investigations revealed inconsistencies in image handling and data records. The two Nature papers were retracted in July 2014 after all co-authors agreed the findings could not be validated, citing irreproducible experiments and selective image use as key flaws. This rapid debunking, occurring within months, underscored vulnerabilities in high-stakes stem cell research, where overhyped claims can divert resources from viable alternatives like induced pluripotent stem cells.

In political science, the 2014 study by Michael J. LaCour and Donald P. Green in Science reported that brief conversations with gay canvassers could persistently increase support for same-sex marriage among opponents, based on a large-scale canvassing field experiment with over 500 doors canvassed. The results were fabricated; LaCour admitted to inventing data from a nonexistent survey firm and misrepresenting participant incentives, with no raw data available for verification. Green, upon discovering the irregularities, requested retraction in May 2015, which Science issued editorially despite LaCour's initial resistance, noting the paper's reliance on unverifiable claims. This irreproduction exposed flaws in the verification of survey data during peer review, as the apparent statistical robustness masked the absence of underlying evidence.

These notable irreproductions have led to significant consequences, including formal retractions that damaged institutional reputations and prompted reforms. In the STAP case, RIKEN's president Ryoji Noyori resigned in 2015 amid public outcry, and the institute implemented stricter misconduct guidelines, including mandatory data audits and ethics training to prevent future lapses. Funding losses followed, with Japanese grants for stem cell research scrutinized more rigorously, contributing to a decade-long push for greater research integrity in national science policy.
For LaCour, the scandal halted his UCLA doctoral candidacy and career prospects, as he had falsely claimed over $700,000 in grants from research foundations, leading to investigations into grant reporting integrity. The power posing controversy, while not resulting in retraction, influenced funding decisions in behavioral science, with reviewers increasingly demanding preregistration and replication plans to avoid supporting non-robust effects. Overall, these cases have accelerated policy changes, such as enhanced retraction databases and journal mandates for data sharing, to mitigate the broader reproducibility crisis.
