
Statistical significance

Statistical significance is a fundamental concept in statistical hypothesis testing that assesses whether observed results in a study are unlikely to have arisen from random variation alone, thereby providing evidence against the null hypothesis. It is quantified primarily through the p-value, which represents the probability of observing data at least as extreme as that obtained, given that the null hypothesis is true; a p-value below a conventional significance level, such as 0.05, is interpreted as indicating statistical significance. The origins of statistical significance trace back to the early 20th century, pioneered by British statistician and geneticist Ronald A. Fisher during his work at the Rothamsted Experimental Station. In his seminal 1925 book Statistical Methods for Research Workers, Fisher introduced the idea of using p-values to evaluate the strength of evidence against a hypothesis of no effect, selecting the 0.05 level as a practical benchmark because it approximates the point where results lie beyond two standard deviations from the mean in a normal distribution, roughly a 1-in-20 chance occurrence. This threshold gained widespread adoption in fields such as agriculture, biology, and medicine, though Fisher himself viewed it as a guideline rather than a rigid rule, emphasizing the need for judgment in interpretation. Subsequent developments by Jerzy Neyman and Egon Pearson in the 1930s refined the framework into the Neyman-Pearson approach, which formalized hypothesis testing with explicit error rates (Type I and Type II errors), contrasting with Fisher's more inductive approach focused on p-values. Today, statistical significance is computed using various tests (e.g., t-tests, chi-square tests) that derive the p-value from the test statistic's distribution under the null hypothesis, often assuming normality or other conditions. Despite its ubiquity, the concept has faced criticism for potential misinterpretation; for instance, a statistically significant result does not quantify effect size, practical relevance, or the probability that the tested hypothesis is true. In response to widespread misuse, the American Statistical Association issued a 2016 statement outlining six key principles for p-values, stressing that they indicate incompatibility with a statistical model but should not drive decisions alone, and advocating complementary approaches such as confidence intervals and effect size estimates to ensure robust scientific inference. Recent discussions, including a 2021 ASA task force statement, further urge moving beyond binary "significant/non-significant" dichotomies to improve reproducibility and to emphasize the distinction between statistical evidence and substantive importance across disciplines, from the social sciences to clinical trials.

Fundamentals

Definition

Statistical significance refers to the determination that an observed effect in a sample is unlikely to have occurred due to random chance alone, serving as a key criterion in hypothesis testing to infer whether the results reflect a genuine underlying relationship or difference in the population. This assessment evaluates the consistency of the observed data with a null hypothesis of no effect, providing evidence against the possibility that the observed outcome arose merely from sampling variability. A critical distinction exists between statistical significance and practical or clinical significance: the former addresses whether an effect is detectable beyond chance, while the latter evaluates the effect's magnitude and its meaningfulness in real-world applications. For instance, a pharmaceutical study might demonstrate statistical significance for a treatment that extends average survival by just 10 minutes, yet this tiny gain may lack practical value for patients or healthcare systems due to its negligible impact on outcomes or costs. Such cases highlight how large sample sizes can yield statistical significance for trivially small effects, underscoring the need to consider both types of significance together. At its core, statistical significance builds on foundational concepts of probability, which measures the likelihood of specific outcomes in uncertain processes, and sampling distributions, which model the expected range of variation in sample statistics drawn from a population. These elements enable researchers to quantify uncertainty and determine whether sample evidence reliably points to a population-level effect.

Key Components in Hypothesis Testing

Hypothesis testing begins with the formulation of two competing hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis, denoted H_0, represents the default assumption of no effect, no difference, or no association between variables in the population. It serves as the baseline against which evidence is evaluated, positing that any observed variation in sample data arises solely from random variation rather than a systematic effect. For instance, in testing whether a coin is fair, H_0 would state that the probability of heads is exactly 0.5. This concept was introduced by Ronald A. Fisher as a tool for assessing the improbability of observed data under the assumption of no real effect. The alternative hypothesis, denoted H_1 or H_a, posits the existence of an effect, difference, or association that contradicts the null. It encapsulates the researcher's claim or expectation, guiding the direction of the inquiry. Alternatives can be one-sided, specifying the direction of the effect (e.g., the coin is biased toward heads, with probability > 0.5), or two-sided, allowing for deviation in either direction (e.g., the coin is biased, with probability ≠ 0.5). This framework for specifying alternatives was formalized by Jerzy Neyman and Egon Pearson to enable the design of tests that maximize the detection of true effects while controlling error rates. Central to hypothesis testing are the risks associated with decision-making under uncertainty, embodied in Type I and Type II errors. A Type I error occurs when the null hypothesis is true but is incorrectly rejected, representing a false positive conclusion. The probability of committing a Type I error is denoted by \alpha, often set at a conventional level like 0.05, which defines the significance threshold for rejecting H_0. Conversely, a Type II error happens when the null hypothesis is false but fails to be rejected, resulting in a false negative. Its probability, \beta, depends on factors such as sample size and the magnitude of the true effect. Neyman and Pearson introduced these error types to quantify the reliability of tests, emphasizing the trade-off between controlling \alpha and minimizing \beta. To balance these risks, the concept of statistical power is employed, defined as 1 - \beta, which measures the probability of correctly rejecting a false null hypothesis. Higher power indicates greater ability to detect true effects, achieved through larger samples or more sensitive tests. This metric underscores the importance of designing studies that not only limit false positives but also enhance detection of meaningful differences. These components collectively frame the hypothesis testing process as a structured decision preceding the assessment of statistical significance. The null and alternative hypotheses delineate the question, error probabilities set the boundaries for acceptable risk, and statistical power ensures interpretability. By assuming H_0 initially, researchers collect data to evaluate whether the observed result is sufficiently improbable under H_0 to warrant rejection, thereby informing conclusions about the population. This logical sequence, integrating Fisher's significance testing with Neyman and Pearson's error-controlled approach, forms the foundation for rigorous statistical inference.
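To make these error rates concrete, the following Python sketch simulates repeated one-sample t-tests, first when H_0 is true (estimating the Type I error rate) and then when a true effect of 0.5 standard deviations exists (estimating power). The sample size, effect size, and number of simulations are illustrative assumptions, not values drawn from the sources discussed here.

```python
# A minimal simulation sketch of Type I error and power (illustrative settings).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n, n_sims = 0.05, 30, 10_000   # significance level, sample size, simulated studies

def rejection_rate(true_mean):
    """Fraction of simulated studies in which H0: mu = 0 is rejected at level alpha."""
    rejections = 0
    for _ in range(n_sims):
        sample = rng.normal(true_mean, 1.0, n)          # population SD fixed at 1
        _, p = stats.ttest_1samp(sample, popmean=0.0)   # two-sided one-sample t-test
        rejections += p < alpha
    return rejections / n_sims

print(f"Type I error rate (H0 true):   {rejection_rate(0.0):.3f}")  # close to alpha = 0.05
print(f"Power (true effect of 0.5 SD): {rejection_rate(0.5):.3f}")  # roughly 0.75 for n = 30
```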

Historical Context

Origins in Early Statistics

The concept of statistical significance traces its roots to early probabilistic reasoning in the early 18th century, with John Arbuthnot providing one of the first explicit applications to empirical data. In his 1710 paper published in the Philosophical Transactions of the Royal Society, Arbuthnot examined christening records in London from 1629 to 1710, observing a consistent excess of male births over females each year. Assuming an equal probability of male and female births under random chance (a model with p=0.5), he calculated the probability of this pattern occurring by chance alone as extraordinarily low, specifically less than 1 in 2^{82} for the 82 years of data. Arbuthnot concluded that such regularity could not be attributed to chance but rather to divine providence, marking an early use of probability to assess the implausibility of random variation in observed outcomes. Building on these foundations, Pierre-Simon Laplace advanced the theoretical underpinnings of assessing evidence against chance in the late 18th and early 19th centuries through his development of inverse probability. In his 1774 memoir "Mémoire sur la probabilité des causes par les événements," Laplace formalized the idea of inferring the probability of underlying causes from observed events, using Bayesian-like principles to update beliefs based on data. This work, expanded in his 1812 Théorie Analytique des Probabilités, applied probability theory to astronomical observations and demographic records, enabling quantitative judgments about whether deviations from expected patterns were likely due to chance or systematic causes. Laplace's approach emphasized the ratio of likelihoods under competing hypotheses, providing a framework for what would later evolve into significance testing by quantifying the improbability of observations under a null assumption of chance. By the early 20th century, Karl Pearson synthesized these ideas into more structured statistical tools, notably through his development of the chi-squared test and the concept of the probable error. In his 1900 paper in the Philosophical Magazine, Pearson introduced the chi-squared statistic as a measure to determine whether observed deviations in categorical data from an expected distribution could reasonably be ascribed to random sampling. The test computes the sum of squared differences between observed and expected frequencies, scaled by expected values, to yield a quantity whose distribution under the null hypothesis approximates a chi-squared form for large samples. Concurrently, Pearson's work in the Philosophical Transactions elaborated on the probable error, a measure rooted in Gaussian theory representing the deviation within which half the estimates would fall, offering a practical way to gauge the reliability of statistical constants like means and correlations against sampling variability. These contributions formalized probabilistic assessments of fit, bridging early probability calculations to systematic significance testing. Discussions on these emerging significance-like ideas gained prominence at the Statistical Congress in Paris, where statisticians including Pearson debated the role of probability in distinguishing systematic patterns from random fluctuations in social and demographic data. Convened amid the Exposition Universelle, the congress featured presentations on probabilistic methods for demographic and vital statistics, highlighting the need for criteria to evaluate whether observed discrepancies warranted rejection of chance-based explanations. These exchanges underscored the growing consensus on using threshold probabilities to guide scientific inference, setting the stage for broader adoption in the 20th century.

Evolution in the 20th Century

The concept of statistical significance began to take formal shape in the early 1920s through the work of Ronald A. Fisher. In a 1921 presentation to the Royal Society of London, Fisher outlined foundational ideas for theoretical statistics, emphasizing the role of probability in assessing deviations from expected results under a null hypothesis, which laid the groundwork for modern significance testing. Fisher's ideas gained wider traction with the publication of his seminal book Statistical Methods for Research Workers in 1925, where he introduced the p-value as a measure of the probability of observing data as extreme as that obtained, assuming the null hypothesis is true, and advocated significance levels around 0.05 to guide scientific inference in fields such as agriculture and biology. This approach formalized significance testing as a tool for research workers, promoting its use without rigid decision rules, and rapidly influenced experimental practices by providing accessible methods for evaluating evidence against hypotheses. In 1933, Jerzy Neyman and Egon Pearson advanced the framework by developing the Neyman-Pearson lemma, which established the likelihood ratio test as the most powerful method for distinguishing between two simple hypotheses while controlling the Type I error rate (alpha level) at a fixed significance level, such as 0.05. This work shifted the focus toward a decision-theoretic framework, emphasizing Type I and Type II errors and fixed significance levels to optimize test power, contrasting with Fisher's more inductive, p-value-based approach. The Neyman-Pearson formulation provided a structured alternative that complemented and sometimes rivaled Fisher's methods, becoming integral to hypothesis testing theory. Following World War II, statistical significance testing saw widespread adoption in experimental design across the sciences, particularly in agriculture, medicine, and the social sciences, as institutions like Rothamsted Experimental Station under Fisher's influence integrated these methods into randomized trials and data analysis protocols. In the 1950s, Neyman critiqued Fisher's significance testing for its lack of emphasis on power and alternative hypotheses, arguing in publications that it failed to adequately address decision-making under uncertainty and should incorporate explicit error control, intensifying the philosophical divide between the two schools. By the 1960s, Bayesian approaches began challenging both the Fisherian and Neyman-Pearson frameworks, with statisticians like Dennis Lindley and Leonard Savage promoting prior probabilities and posterior inference as superior for incorporating subjective knowledge and avoiding arbitrary significance thresholds in evidence evaluation.

Computation and Interpretation

Calculating P-Values

The p-value represents the probability of observing sample data at least as extreme as the data actually obtained, assuming the null hypothesis H_0 is true. This measure quantifies the evidence against H_0 provided by the sample, serving as the foundation for assessing statistical significance in hypothesis testing. In general, the p-value is computed from the sampling distribution of a test statistic under H_0. The formula is given by p = P(T \geq t_{\text{obs}} \mid H_0) for a one-tailed test in the upper tail, where T denotes the random variable for the test statistic and t_{\text{obs}} is its observed value; for a two-tailed test, the p-value accounts for both tails by doubling the one-tailed probability or integrating over the relevant regions. The exact form depends on the distribution assumed under H_0, such as the normal or t-distribution. Common test statistics include the z-score for scenarios with large sample sizes or known population variance, calculated as z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}, where \bar{x} is the sample mean, \mu_0 is the hypothesized mean under H_0, \sigma is the population standard deviation, and n is the sample size. For smaller samples or when the population variance is unknown, the t-statistic is used instead: t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, with s denoting the sample standard deviation; this follows the t-distribution with n-1 degrees of freedom. The process of calculating a p-value follows these steps: (1) formulate the null hypothesis H_0 and alternative hypothesis H_a; (2) select the appropriate test based on data characteristics, such as the data's distribution and sample size; (3) compute the test statistic from the sample data using the relevant formula; (4) obtain the p-value by evaluating the cumulative probability from the test statistic's distribution, typically via statistical software, tables, or computational functions like those in R or Python's SciPy library. This computation assumes the data meet the test's prerequisites, such as independence and approximate normality for t-tests. For illustration, consider a one-sample t-test evaluating whether the mean IQ of a sample of 25 students differs from the population mean of 100. The sample yields \bar{x} = 105 and s = 15. The test statistic is t = \frac{105 - 100}{15 / \sqrt{25}} = \frac{5}{3} \approx 1.667, with 24 degrees of freedom. For a two-tailed test, the p-value is 2 \times P(T_{24} > 1.667) \approx 0.109, obtained from t-distribution tables or software, indicating the probability of observing such a deviation under H_0.
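The worked IQ example can be reproduced directly from the summary statistics; the short Python sketch below, using SciPy's t-distribution functions, is one way to carry out steps (3) and (4). The numbers simply restate the example above.

```python
# One-sample t-test p-value from summary statistics (values from the IQ example above).
from scipy import stats

x_bar, mu_0, s, n = 105.0, 100.0, 15.0, 25     # sample mean, null mean, sample SD, sample size

t_obs = (x_bar - mu_0) / (s / n ** 0.5)        # test statistic: 5 / 3 ≈ 1.667
df = n - 1                                     # degrees of freedom

p_one_tailed = stats.t.sf(t_obs, df)           # upper-tail area beyond t_obs
p_two_tailed = 2 * stats.t.sf(abs(t_obs), df)  # both tails

print(f"t = {t_obs:.3f}, one-tailed p = {p_one_tailed:.3f}, two-tailed p = {p_two_tailed:.3f}")
# t = 1.667, one-tailed p ≈ 0.054, two-tailed p ≈ 0.109
```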

Interpreting Test Results

Interpreting the results of a hypothesis test begins with applying the decision rule to the calculated p-value. The standard rule is to reject the null hypothesis H_0 if the p-value is less than or equal to the chosen significance level \alpha, which is often set at 0.05. If this condition is met, the result is labeled as "statistically significant," indicating that the observed data provide sufficient evidence against H_0 at the specified \alpha level. This threshold balances the risk of Type I error, the probability of incorrectly rejecting a true null hypothesis, against the need to detect meaningful effects, though it does not prove the alternative hypothesis is true. Confidence intervals serve as a complementary tool for interpretation, offering a range of plausible values for the parameter of interest rather than a binary decision. A 95% confidence interval (CI) that excludes the null value (e.g., 0 for a difference in means) corresponds directly to statistical significance at \alpha = 0.05, as it implies the null would be rejected in a two-sided test. For instance, if a 95% CI for a treatment effect is (0.2, 0.8), the null value of 0 lies outside, supporting significance; conversely, a CI like (-0.1, 0.3) that includes 0 suggests non-significance. Confidence intervals enhance judgment by revealing the precision and magnitude of the estimate, helping avoid overreliance on p-values alone and highlighting potential practical implications. The interpretation of test results also depends on whether a one-tailed or two-tailed test was used, as this affects how the p-value is computed and what it represents. In a one-tailed test, the alternative hypothesis specifies a direction (e.g., \mu > \mu_0), so the p-value measures the probability in only one tail of the distribution, making the test more powerful for detecting effects in the predicted direction but requiring strong justification for the directional assumption to prevent bias. A two-tailed test, in contrast, examines both tails for any deviation from the null (e.g., \mu \neq \mu_0), yielding a p-value that is typically twice that of a one-tailed test for the same data, which provides a more conservative interpretation suitable for exploratory analyses without preconceived direction. Misapplying the tail choice can lead to inflated significance claims, underscoring the need to align the test with the research question's intent. Visual aids, such as density plots of the test statistic's distribution under the null hypothesis, aid in contextualizing p-values by depicting the tail area(s) corresponding to the observed statistic. These plots show the null distribution with the critical region shaded, where the p-value is the integrated area beyond the observed statistic; small areas indicate rarity under H_0. For a two-tailed test, shading occurs in both extremes; for one-tailed, only in the relevant tail. Such visualizations clarify that p-values quantify extremeness rather than effect magnitude, helping researchers avoid pitfalls like equating low p-values with large practical importance or ignoring the role of sample size in shrinking tail areas.
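The correspondence between the \alpha = 0.05 decision rule and a 95% confidence interval can be checked numerically. The sketch below compares two hypothetical groups with a pooled-variance t-test and builds the matching 95% CI for the difference in means; the group sizes, means, and random seed are illustrative assumptions.

```python
# Decision rule vs. 95% confidence interval for a difference in means (illustrative data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treatment = rng.normal(0.5, 1.0, 40)   # hypothetical treatment-group outcomes
control = rng.normal(0.0, 1.0, 40)     # hypothetical control-group outcomes

t_obs, p_value = stats.ttest_ind(treatment, control)   # two-sided, pooled-variance t-test

n1 = n2 = 40
diff = treatment.mean() - control.mean()
sp2 = ((n1 - 1) * treatment.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))                  # pooled standard error of the difference
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
ci = (diff - t_crit * se, diff + t_crit * se)

print(f"p = {p_value:.4f}, 95% CI for the difference = ({ci[0]:.2f}, {ci[1]:.2f})")
# The null value 0 lies outside this 95% CI exactly when the two-sided p-value is below 0.05.
```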

Applications

Role in Scientific Research

Statistical significance serves as a fundamental gatekeeper in the scientific method, enabling researchers to assess whether empirical data from experiments, surveys, or observational studies provide compelling evidence against the null hypothesis (H0), thus supporting inferences about underlying phenomena beyond mere chance variation. This process integrates seamlessly into hypothesis testing by quantifying the probability that observed results would occur under H0, allowing scientists to evaluate the plausibility of alternative explanations and advance knowledge across disciplines. By establishing a threshold for evidence strength, it helps distinguish systematic effects from random noise, ensuring that conclusions drawn from sample data can be generalized to broader populations with quantified uncertainty. In scientific practice, statistical significance informs the acceptance or rejection of theoretical models and shapes practical applications, such as policy or treatment recommendations derived from empirical findings. For instance, in experiments testing behavioral interventions, a significant result may support a new theoretical account of behavior change, prompting its incorporation into educational or clinical guidelines. Similarly, in public health research, it guides decisions on resource allocation by identifying interventions with reliable effects, thereby influencing evidence-based policies without relying solely on anecdote or expert judgment. This role underscores its utility in bridging raw data and actionable insights, fostering rigorous evaluation of competing ideas. A standard workflow incorporating statistical significance begins with researchers formulating clear research questions and hypotheses based on existing theory, followed by designing studies and collecting data through controlled methods like randomized trials or representative sampling to minimize bias. Analysis then involves selecting an appropriate statistical test, computing the p-value to measure evidence against H0, and interpreting the outcome, rejecting H0 if the p-value falls below a predetermined significance level; results are reported alongside effect sizes and confidence intervals in peer-reviewed papers for transparency and peer validation. This structured approach supports reproducibility, as subsequent studies can build upon or challenge the reported significance to refine scientific understanding.

Field-Specific Thresholds

In various scientific disciplines, the choice of significance threshold, denoted \alpha, reflects the balance between the risk of false positives (Type I errors) and the field's tolerance for uncertainty, often influenced by the consequences of erroneous conclusions. In the social sciences, such as psychology and sociology, \alpha = 0.05 remains the conventional threshold for establishing statistical significance, allowing a 5% chance of incorrectly rejecting the null hypothesis. In the physical sciences, the conventional \alpha = 0.05 is often used, though stricter levels such as \alpha = 0.01 or the 5-sigma standard (p ≈ 3 × 10^{-7}) are applied in subfields like particle physics for major discoveries. Genomics research, particularly in genome-wide association studies (GWAS), adopts a highly stringent \alpha = 5 \times 10^{-8} or lower, based on Bonferroni correction for multiple testing across approximately 1 million independent genetic markers, to control the family-wise error rate. Recent analyses (as of 2024) suggest this threshold may yield 20-30% false positive rates in large cohorts, prompting discussions on further adjustments. These variations arise from differing stakes and research paradigms within each field. In medicine and clinical research, where decisions impact patient safety, regulatory bodies like the U.S. Food and Drug Administration (FDA) require p ≤ 0.05 for statistical significance in pivotal trials supporting drug approval, but they also emphasize clinical and practical relevance to avoid over-reliance on p-values alone. This cautious approach distinguishes confirmatory clinical trials, which demand low error rates to protect patients, from exploratory studies in fields like psychology, where higher thresholds may suffice due to less immediate consequences. In economics and econometrics, a 10% significance level (\alpha = 0.10) is sometimes accepted alongside the standard 5% to assess result robustness, especially in observational data prone to confounding factors. To handle multiple comparisons, which can artificially increase the overall Type I error rate, fields like genomics and neuroimaging routinely apply adjustments such as the Bonferroni correction. This method divides the overall \alpha by the number of tests performed (k), yielding an adjusted threshold of \alpha / k, ensuring the cumulative probability of false positives remains controlled. For instance, in a study with 100 hypotheses, \alpha = 0.05 would be corrected to 0.0005 per test. A notable example in particle physics is the "5-sigma" rule for discovery claims, equivalent to a p-value of approximately 3 \times 10^{-7} (or \alpha \approx 2.87 \times 10^{-7} for one-tailed tests), as adopted by the CERN experiments to confirm phenomena like the Higgs boson with near-certainty.
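The two conventions discussed above, a Bonferroni-adjusted \alpha and the 5-sigma rule, reduce to one-line calculations, sketched below with SciPy; the 100-test example mirrors the illustration in the text.

```python
# Bonferroni-adjusted alpha and the one-tailed p-value implied by a 5-sigma deviation.
from scipy import stats

alpha, k = 0.05, 100
bonferroni_alpha = alpha / k              # 0.05 / 100 = 0.0005 per test

p_five_sigma = stats.norm.sf(5)           # upper-tail area beyond 5 standard deviations
print(f"Bonferroni-adjusted alpha for {k} tests: {bonferroni_alpha}")
print(f"5-sigma one-tailed p-value: {p_five_sigma:.2e}")   # about 2.87e-07
```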

Criticisms and Limitations

Common Misinterpretations

One prevalent misinterpretation is the belief that a low p-value definitively proves the null hypothesis (H₀) is false or that the observed effect is both real and substantial. In reality, a p-value represents the probability of obtaining results at least as extreme as those observed, assuming H₀ is true; it provides evidence against H₀ under the specified model and assumptions but does not confirm its falsity or quantify the magnitude of any alternative effect. This misconception can lead researchers to overstate findings, treating statistical significance as conclusive proof rather than probabilistic evidence. Another common error involves the dichotomization of results based on an arbitrary threshold, such as p < 0.05, where outcomes below this cutoff are deemed "significant" or "real" while those above are dismissed as meaningless, often ignoring the effect's size, precision, or practical importance. The American Statistical Association (ASA) emphasizes that decisions should not hinge solely on crossing such a threshold, as this binary approach oversimplifies the continuous nature of evidence and can distort scientific inference. Statistical significance is frequently mistaken for evidence of causation, particularly in observational studies where a significant correlation between variables is interpreted as one causing the other. However, significance only indicates an association unlikely under the null, without establishing causal direction or ruling out confounders; causation requires additional considerations like experimental design or causal modeling. The 2016 ASA statement also highlights misuses such as p-hacking, where researchers manipulate data collection, analysis, or reporting, such as selectively stopping data collection after achieving significance or testing multiple outcomes without adjustment, to artificially inflate the chance of obtaining a low p-value, thereby increasing false positives. Seminal work demonstrates that undisclosed flexibility in these practices can raise false-positive rates from 5% to over 60%, undermining the reliability of published results.
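One p-hacking mechanism can be illustrated with a small simulation: if a study measures several independent outcomes under a true null and reports whichever test crosses p < 0.05, the study-level false-positive rate climbs well above the nominal 5%. The number of outcomes, group sizes, and simulation count below are illustrative assumptions, not figures from the work cited above.

```python
# Simulated false-positive inflation from testing multiple outcomes and reporting any "hit".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_outcomes, n_per_group = 2_000, 10, 20
false_positive_studies = 0

for _ in range(n_studies):
    # Every outcome is pure noise: the null hypothesis is true for all ten tests.
    group_a = rng.normal(size=(n_outcomes, n_per_group))
    group_b = rng.normal(size=(n_outcomes, n_per_group))
    p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue
    false_positive_studies += (p_values < 0.05).any()

print(f"Study-level false-positive rate: {false_positive_studies / n_studies:.2f}")
# Roughly 0.40 rather than the nominal 0.05, matching 1 - 0.95**10 for independent tests.
```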

Relationship to Effect Size

Statistical significance indicates whether an observed effect is likely due to chance, but it does not convey the magnitude or practical importance of that effect. Effect size addresses this limitation by quantifying the strength or magnitude of the phenomenon under study in a standardized way, independent of sample size. One common effect size measure for comparing means between two groups is Cohen's d, defined as the standardized difference between the two population means divided by the pooled standard deviation: d = \frac{\mu_1 - \mu_2}{\sigma}, where \mu_1 and \mu_2 are the population means and \sigma is the pooled standard deviation. This measure allows comparison of effects across studies or contexts. For assessing relationships between continuous variables, the Pearson correlation coefficient r serves as an effect size, representing the strength and direction of the linear association, with values ranging from -1 to 1. In epidemiology, the odds ratio is frequently used as an effect size to quantify the association between an exposure and an outcome, calculated as the ratio of the odds of the outcome in the exposed group to the odds in the unexposed group. Statistical power, the probability of correctly rejecting the null hypothesis when it is false, depends on the effect size, sample size (n), and significance level (\alpha): power = f(effect size, n, \alpha). Larger effect sizes or sample sizes increase power, making it easier to detect true effects. Jacob Cohen proposed conventional benchmarks for interpreting effect sizes in the behavioral sciences: for Cohen's d, 0.2 is small, 0.5 medium, and 0.8 large; for r, 0.1, 0.3, and 0.5 correspond to small, medium, and large effects, respectively. These guidelines emphasize that even small effects can achieve statistical significance with sufficiently large sample sizes, as the standard error decreases with increasing n, potentially leading to overemphasis on trivial findings.
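The interaction between effect size, sample size, and significance can be seen by holding Cohen's d fixed at the "small" benchmark of 0.2 and letting the per-group sample size grow; for equal groups of size n with equal variances, the two-sample t-statistic is approximately d\sqrt{n/2}. The sketch below is a rough illustration under those simplifying assumptions, with arbitrary sample sizes.

```python
# A fixed small effect (Cohen's d = 0.2) becomes "significant" as sample size grows.
import numpy as np
from scipy import stats

d = 0.2                                       # small effect by Cohen's benchmarks
for n in (50, 200, 800):                      # per-group sample sizes (illustrative)
    t_stat = d * np.sqrt(n / 2)               # two-sample t with equal groups: t ≈ d * sqrt(n/2)
    p = 2 * stats.t.sf(abs(t_stat), df=2 * n - 2)
    print(f"n per group = {n:4d}, t = {t_stat:.2f}, two-tailed p = {p:.5f}")
# p is about 0.32 at n = 50, 0.046 at n = 200, and 0.00007 at n = 800, while d never changes.
```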

Impact on Reproducibility

The reproducibility crisis in scientific research, which gained prominence in the 2010s and remains an ongoing challenge as of 2025, refers to the widespread difficulty in replicating published findings across various fields. In psychology, a landmark large-scale replication effort attempted to reproduce 100 studies from three high-impact journals published in 2008, finding that only 36% of the original effects replicated with statistical significance at the p < 0.05 level, with effect sizes in replications being about half the magnitude of those in the originals. Similar challenges emerged in other disciplines; for instance, a 2016 survey of over 1,500 scientists across fields like biology and physics reported that more than 70% had failed to reproduce another scientist's experiments, and over 50% had failed to reproduce their own. More recently, a January 2025 survey of over 1,600 biomedical researchers found that 72% believed their field was facing a reproducibility crisis, with 62% attributing it to the "publish or perish" culture. A key contributor to this crisis is the conventional reliance on statistical significance thresholds, particularly the p < 0.05 cutoff, which drives publication bias by favoring studies that report statistically significant results while suppressing null or non-significant findings. This selective reporting inflates the prevalence of false positives in the published literature, as non-significant results are less likely to be submitted or accepted for publication, distorting the scientific record. Compounding this issue is the "garden of forking paths" in data analysis, where researchers face multiple decision points, such as variable selections, subgroup analyses, or outcome measures, without pre-specifying them, increasing the chance of obtaining p < 0.05 by chance alone and leading to non-reproducible patterns that appear significant in initial studies but fail upon replication. John Ioannidis's influential 2005 analysis formalized these concerns through a Bayesian-inspired model demonstrating that the positive predictive value (PPV), the probability that a significant research finding is true, often falls below 50% under realistic conditions of low prior probability, small study power, and bias, implying that most published findings in many fields are false. His model highlights how flexible analyses and the pressure to achieve significance exacerbate false discovery rates, contributing directly to reproducibility failures observed in empirical replication projects. To mitigate these impacts, reforms emphasizing preregistration of hypotheses and analyses, committing to specific plans before data collection, and greater transparency in reporting have gained traction. The Transparency and Openness Promotion (TOP) guidelines, introduced as a modular framework in 2015 and widely adopted by journals thereafter, promote practices such as data sharing and preregistration to reduce selective reporting and enhance verifiability.
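The arithmetic behind the positive predictive value argument is simple Bayes-rule bookkeeping and can be sketched as follows; the prior probabilities and power values are illustrative assumptions rather than estimates from Ioannidis's paper, and bias is ignored.

```python
# Positive predictive value of a "significant" finding under assumed prior and power (no bias).
def ppv(prior, power, alpha=0.05):
    """Probability that a result significant at level alpha reflects a true effect."""
    true_positives = power * prior            # truly non-null hypotheses correctly detected
    false_positives = alpha * (1 - prior)     # truly null hypotheses crossing the threshold
    return true_positives / (true_positives + false_positives)

print(f"Low prior (0.10), low power (0.35):   PPV = {ppv(0.10, 0.35):.2f}")   # about 0.44
print(f"High prior (0.50), high power (0.80): PPV = {ppv(0.50, 0.80):.2f}")   # about 0.94
```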

Modern Reforms and Alternatives

Publishing and Overuse Issues

One major issue in the publishing of statistical research is the file-drawer problem, where studies yielding non-significant results are disproportionately left unpublished, leading to a distorted representation of the scientific literature that overemphasizes positive findings. Robert Rosenthal introduced this concept in 1979, noting that journals tend to publish the approximately 5% of studies that produce Type I errors (false positives), while the remaining 95% of null results remain in researchers' file drawers, biasing meta-analyses and cumulative knowledge. Journal editorial practices exacerbate this bias, as many outlets explicitly or implicitly favor manuscripts with statistically significant results over those reporting null findings. An experimental vignette study across authors, reviewers, and editors demonstrated that statistically significant outcomes are more likely to be recommended for publication, with non-significant results facing higher rejection rates at every stage of the review process. This selective emphasis creates skewed incentives for researchers, encouraging the suppression or modification of non-significant data to meet publication thresholds. A related questionable research practice is HARKing (Hypothesizing After the Results are Known), in which researchers formulate or adjust hypotheses post hoc based on observed data and then present them in publications as if they were a priori predictions. Norbert Kerr first described HARKing in 1998 as a common tactic to inflate the appearance of statistical significance, which undermines transparency and increases the risk of false positives by capitalizing on chance findings without proper correction for multiple testing. Analyses of p-value distributions in high-impact journals from the 2020s reveal patterns consistent with these practices, including clustering of p-values just below the 0.05 threshold, indicative of selective reporting or p-hacking. For instance, Brodeur et al. examined over 22,000 test statistics from top economics journals and found a significant excess of p-values in the 0.045–0.05 bin compared to a uniform distribution expected under no bias, suggesting widespread manipulation to achieve significance. Such distortions not only perpetuate the overuse of statistical significance but also contribute to broader issues in scientific reproducibility.

Proposals for Change

In response to widespread criticisms of dichotomous significance testing, the American Statistical Association (ASA) released a statement in 2016 articulating six principles to promote more nuanced interpretations of p-values and discourage their misuse as binary decision tools. These principles emphasize that p-values indicate the incompatibility of observed data with a specified statistical model but do not measure the probability that a hypothesis is true, the likelihood that data arose by chance alone, or the size or importance of an effect. They further stress that p-values cannot reliably determine the presence or absence of an effect, require careful study planning and full reporting to yield valid results, and should not drive oversimplified reject-or-accept decisions regarding the null hypothesis. The ASA's guidance aims to foster a shift away from rigid thresholds toward integrated consideration of substantive and statistical evidence in research. Building on such calls for reform, a 2018 proposal by Daniel Benjamin and dozens of colleagues advocated redefining the conventional threshold for statistical significance from 0.05 to 0.005, particularly for novel or exploratory claims, to substantially reduce the rate of false positive discoveries in scientific literature. This stricter criterion would classify p-values between 0.005 and 0.05 as suggestive but not definitive evidence, encouraging replication before strong assertions of new findings and addressing the high false discovery risk associated with the traditional 5% level. The authors acknowledged that maintaining statistical power under the stricter threshold would require larger samples, but argued that the reduction in false positives would justify the cost, and that the change aligns better with the evidentiary standards in fields like genomics, where far more stringent thresholds are routine. A complementary reform emphasizes estimation over hypothesis testing, advocating a paradigm shift toward reporting effect sizes and confidence intervals to convey the magnitude, precision, and practical relevance of results rather than focusing on p-values. Geoff Cumming's 2014 overview of "the new statistics" highlights how confidence intervals provide a range of plausible values for parameters, enabling researchers to assess uncertainty and overlap between studies more intuitively than binary significance tests. This approach, which also incorporates meta-analysis for synthesizing evidence across studies, has gained traction as a way to mitigate the limitations of p-values by prioritizing descriptive inference and avoiding the allure of dichotomous outcomes. Cumming illustrates its application through examples in psychology, showing how effect sizes like Cohen's d quantify practical importance alongside interval estimates. Efforts to implement these reforms have included journal policies discouraging or prohibiting p-value reporting to break reliance on significance testing. In 2015, Basic and Applied Social Psychology became the first major journal to ban p-values, null hypothesis significance testing procedures, and confidence intervals in submissions, requiring instead detailed descriptions of data, methods, and effect magnitudes to promote transparency and substantive interpretation. This policy, intended to curb p-hacking and enhance research quality, drew significant debate for potentially complicating statistical communication, though it exemplified institutional action toward estimation-focused practices.

Bayesian and Other Approaches

Bayesian hypothesis testing provides an alternative to frequentist p-value approaches by quantifying the relative evidence for competing hypotheses through the Bayes factor, defined as BF = \frac{P(\text{data}|H_1)}{P(\text{data}|H_0)}, where H_0 and H_1 represent the null and alternative hypotheses, respectively. This measure, pioneered by Harold Jeffreys in his 1961 work Theory of Probability, compares the marginal likelihood of the data under each hypothesis without relying on long-run frequencies. Bayes factors can be interpreted on a scale ranging from decisive evidence against H_0 (BF > 100) to strong support for H_0 (BF < 1/100), offering a continuous assessment rather than a binary decision. Central to Bayesian inference is the computation of posterior probabilities, which update prior beliefs about hypotheses using Bayes' theorem: P(H|\text{data}) = \frac{P(\text{data}|H) P(H)}{P(\text{data})}, where P(H) is the prior probability, P(\text{data}|H) is the likelihood, and P(\text{data}) is the marginal likelihood. This framework incorporates subjective or objective priors to yield direct probabilities for hypotheses, such as the posterior odds \frac{P(H_1|\text{data})}{P(H_0|\text{data})} = BF \times \frac{P(H_1)}{P(H_0)}, enabling researchers to express uncertainty in probabilistic terms. Other alternatives to traditional significance testing include likelihood ratio tests, which evaluate the ratio of the maximized likelihood under the full model to that under a restricted model, providing evidence against the null without reliance on fixed p-value thresholds. Equivalence testing, often implemented via the two one-sided tests (TOST) procedure, shifts the null hypothesis to the claim that a meaningful difference exceeds a predefined equivalence bound (e.g., |\mu_1 - \mu_2| > \delta), testing instead for practical non-inferiority or similarity. Bayesian methods offer advantages in handling uncertainty by avoiding the dichotomization inherent in p-value cutoffs, as illustrated by the Jeffreys-Lindley paradox: in large samples, a statistically significant p-value (e.g., p < 0.05) may coexist with a Bayes factor favoring the null due to diffuse priors spreading probability mass. This paradox, first highlighted by Dennis Lindley in 1957, underscores how Bayesian approaches better accommodate prior information and provide coherent evidence accumulation across studies. Adoption of Bayesian methods has grown since 2020, particularly in machine learning for uncertainty quantification in models like Gaussian processes and in clinical trials for adaptive designs that incorporate historical data, with recent analyses (as of 2024) indicating that approximately 50% of Bayesian trials started in the last five years despite overall low usage.
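As a concrete illustration of a Bayes factor, the sketch below tests a fair coin (H_0: p = 0.5) against an alternative that places a uniform prior on p, for which the marginal likelihood has a closed form. The uniform prior and the 60-heads-in-100-flips data are illustrative choices, not a canonical default analysis.

```python
# Bayes factor for a binomial test: H0: p = 0.5 versus H1: p ~ Uniform(0, 1).
from math import comb

def bayes_factor_10(k, n):
    """BF_10 = P(data | H1) / P(data | H0) for k heads in n flips."""
    m_h0 = comb(n, k) * 0.5 ** n     # likelihood of the data under the point null p = 0.5
    m_h1 = 1 / (n + 1)               # marginal likelihood under a uniform prior on p
    return m_h1 / m_h0

bf = bayes_factor_10(k=60, n=100)    # 60 heads in 100 flips
posterior_odds = bf * 1.0            # prior odds of 1 (hypotheses equally plausible a priori)
print(f"BF_10 = {bf:.2f}, posterior odds for H1 = {posterior_odds:.2f}")
# BF_10 ≈ 0.91: the data slightly favor the null under this prior, even though the two-sided
# binomial p-value is about 0.057, illustrating the divergence discussed above.
```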
