
p-value

In statistics, the p-value is the probability of obtaining a result at least as extreme as the one observed in the data, assuming that the null hypothesis is true. This measure serves as an indicator of how compatible the observed data are with a specified statistical model, such as the null hypothesis of no effect or no difference. Brought into routine scientific use by British statistician Ronald Fisher in the 1920s as part of his development of significance testing methods, the p-value emerged from Fisher's work on agricultural experiments and became a cornerstone of inferential statistics. Fisher's 1925 book Statistical Methods for Research Workers formalized its use, proposing thresholds like p < 0.05 to gauge the "strength of evidence" against the null hypothesis, though he emphasized it as a continuous measure rather than a strict cutoff.

The p-value plays a central role in hypothesis testing across scientific disciplines, where it helps researchers assess whether observed results are likely due to chance under the null hypothesis. For instance, a small p-value (e.g., below 0.05) suggests the data are incompatible with the null model, prompting consideration of alternative hypotheses, but it does not quantify the probability that the null hypothesis is true or false, nor does it measure the size or importance of any effect. Despite its ubiquity in peer-reviewed literature—appearing in the vast majority of quantitative research papers—the p-value has faced significant criticism for frequent misinterpretation, such as equating low p-values with practical significance or proof of causation.

In response to these issues, the American Statistical Association (ASA) issued a landmark statement in 2016 outlining six principles for proper p-value use: it indicates data incompatibility with a model; does not measure hypothesis probability; should not drive binary decisions; requires contextual reporting; does not reflect effect size; and alone does not gauge evidence strength. This guidance underscores the need for complementary approaches, including confidence intervals, effect sizes, and Bayesian methods, to enhance reproducibility and avoid practices like "p-hacking" (manipulating analyses to achieve desired p-values). Ongoing debates highlight the p-value's value when applied judiciously, while advocating for reforms in statistical education and reporting to mitigate its misuse in fields like medicine, psychology, and the social sciences.

Fundamentals

Definition

In statistical hypothesis testing, the p-value is defined in relation to a null hypothesis H_0, which posits a specific condition such as no effect or no difference in the population parameter, and a test statistic T, which summarizes the observed data to assess compatibility with H_0. The formal definition of the p-value is the probability of obtaining a test statistic at least as extreme as the observed value t_{\text{obs}}, calculated under the assumption that H_0 is true; mathematically, this is expressed as p = P(T \geq t_{\text{obs}} \mid H_0) for a one-sided test or p = P(|T| \geq |t_{\text{obs}}| \mid H_0) for a two-sided test. The notion of "extreme" refers to values of the test statistic that are further from the null hypothesis in the direction specified by the alternative hypothesis; in one-tailed tests, extremity is measured in a single direction (e.g., greater than or less than the null expectation), whereas two-tailed tests consider deviations in both directions to capture any significant difference regardless of sign. For a simple null hypothesis where the parameter under H_0 is fully specified, the p-value is directly computed from the sampling distribution of T under H_0. For a composite null hypothesis, where the null parameter space \Theta_0 includes multiple values, the p-value is generalized as the supremum over \Theta_0 of the probability of observing data at least as extreme as the observed data under each possible null parameter: p = \sup_{\theta \in \Theta_0} P(\text{more extreme data} \mid \theta).
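To make the definition concrete, the sketch below computes one-sided and two-sided p-values for a standard normal test statistic with SciPy; the observed value z_obs = 1.8 is an arbitrary illustration rather than data from the article.

```python
from scipy import stats

# Hypothetical observed value of a standard normal test statistic T
z_obs = 1.8

# One-sided p-value: P(T >= z_obs | H0), i.e. the upper-tail (survival) probability 1 - F_T(z_obs)
p_one_sided = stats.norm.sf(z_obs)

# Two-sided p-value: P(|T| >= |z_obs| | H0), counting both tails of the null distribution
p_two_sided = 2 * stats.norm.sf(abs(z_obs))

print(f"one-sided p = {p_one_sided:.4f}")   # ~0.0359
print(f"two-sided p = {p_two_sided:.4f}")   # ~0.0719
```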

Interpretation

The p-value represents the probability of obtaining the observed data, or data more extreme, assuming the null hypothesis H_0 is true, thereby serving as a measure of the compatibility of the data with H_0. In the frequentist framework, a small p-value indicates evidence against H_0, suggesting that the observed results are unlikely under this assumption, but it does not quantify the probability that H_0 is true, the probability that the alternative hypothesis is true, or the likelihood of the data given the alternative. Nor does it measure the size of an effect or the importance of a result, as a small p-value can arise from small effects with large sample sizes or large effects with small samples. This interpretation is strictly conditional on H_0 being true; thus, a small p-value does not imply that the effect is large or that H_0 is false, only that the data are surprising if H_0 holds.

In contrast to Bayesian approaches, where posterior probabilities update beliefs about hypotheses by incorporating prior information and directly estimating the probability that H_0 or the alternative is true, the frequentist p-value does not provide such a belief update or a direct probability for the hypotheses themselves. Frequentist inference treats parameters as fixed unknowns and focuses on long-run error rates over repeated sampling, whereas Bayesian methods yield interpretable probabilities about parameters via the posterior distribution derived from Bayes' theorem. The American Statistical Association emphasizes that p-values alone do not measure evidence for or against a specific model or hypothesis in the Bayesian sense, highlighting the distinct philosophical foundations of the two paradigms.

Significance thresholds, such as \alpha = 0.05, are arbitrary conventions introduced by Ronald Fisher to guide decisions on whether to consider deviations significant, rather than providing definitive proof of rejection or acceptance of H_0. Fisher described the 0.05 level as a convenient limit corresponding roughly to twice the standard deviation, but noted flexibility for stricter levels like 0.01 if preferred, underscoring that such cutoffs should not mechanistically dictate conclusions. The ASA further warns against relying solely on whether a p-value crosses these thresholds, as this can mislead inference and ignores the broader context of study design and data quality.

Statistical Foundations

Sampling Distribution Under the Null Hypothesis

Under the null hypothesis H_0, the p-value for a continuous test statistic follows a uniform distribution on the interval [0, 1]. This property arises because the p-value is constructed as the probability of observing a test statistic at least as extreme as the observed value, assuming H_0 is true. For a one-sided right-tailed test, let T denote the test statistic with cumulative distribution function (CDF) F_T under H_0. The p-value is then given by p = 1 - F_T(t_{\text{obs}}), where t_{\text{obs}} is the observed value. To see why p is uniformly distributed, consider the random variable p = 1 - F_T(T). By the probability integral transform, if F_T is continuous, then F_T(T) \sim \text{Uniform}(0,1), and thus 1 - F_T(T) also follows \text{Uniform}(0,1). A sketch of the proof involves computing the CDF of p: P(p \leq \alpha) = P(1 - F_T(T) \leq \alpha) = P(T \geq F_T^{-1}(1 - \alpha)) = 1 - F_T(F_T^{-1}(1 - \alpha)) = 1 - (1 - \alpha) = \alpha, for 0 < \alpha < 1, confirming uniformity. For two-sided tests, common in practice (e.g., for symmetric distributions like the normal or t-distribution), the p-value is typically defined as p = 2 \min(1 - F_T(t_{\text{obs}}), F_T(t_{\text{obs}})). Under H_0, this construction also results in a uniform distribution on [0, 1] for continuous test statistics, as the folding of the distribution preserves the uniformity property via the probability integral transform applied to the absolute value or equivalent transformations. In practice, simulations under H_0 generate many p-values, and their histogram appears flat across [0, 1], visually demonstrating the uniform density.

For discrete test statistics, the p-value distribution is not exactly uniform but stochastically greater than or equal to the uniform distribution on [0, 1]. Specifically, P(p \leq \alpha \mid H_0) \leq \alpha for any significance level \alpha, ensuring the test remains conservative and controls the Type I error rate at or below \alpha. With large sample sizes, the discreteness diminishes, and the distribution approaches uniformity.
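The flat histogram described above can be checked with a small simulation; the sketch below assumes a two-sided one-sample t-test applied to normal data whose true mean equals the null value, so every simulated dataset satisfies H_0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate many datasets under H0 (true mean = 0) and collect two-sided t-test p-values
p_values = np.array([
    stats.ttest_1samp(rng.normal(loc=0.0, scale=1.0, size=30), popmean=0.0).pvalue
    for _ in range(10_000)
])

# Under H0 with a continuous test statistic, p-values are approximately Uniform(0, 1):
# each bin of width 0.1 should contain roughly 10% of the simulated p-values.
counts, _ = np.histogram(p_values, bins=10, range=(0.0, 1.0))
print(counts / len(p_values))                                        # all entries close to 0.10
print(f"fraction with p <= 0.05: {np.mean(p_values <= 0.05):.3f}")   # close to 0.05
```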

Behavior Under Alternative Hypotheses

Under the alternative hypothesis H_1, the distribution of the p-value deviates markedly from the uniformity seen under the null hypothesis H_0, becoming skewed toward zero with increased density near 0 as the true effect size grows or sample size increases. This skewness arises because the test statistic under H_1 tends to produce more extreme values, making small p-values more probable and reflecting stronger evidence against H_0. For a one-sided standardized normal test statistic, the probability density function of the p-value under H_1 is g_{\delta}(p) = \frac{\phi(Z_p - \sqrt{n} \delta)}{\phi(Z_p)}, where Z_p = \Phi^{-1}(1-p) is the (1-p)-quantile of the standard normal distribution, \phi is the standard normal density function, n is the sample size, and \delta is the standardized effect size under H_1. The corresponding cumulative distribution function is G_{\delta}(p) = 1 - \Phi(Z_p - \sqrt{n} \delta), which is stochastically smaller than the uniform, confirming the leftward shift. For two-sided tests, the distribution under H_1 is more complex, often involving the distribution of the minimum of two tail probabilities, but still concentrates toward zero with increasing power. Smaller observed p-values under H_1 thus provide stronger evidence against H_0, with the overall distribution's concentration near 0 directly tied to the test's power, defined as 1 - \beta = G_{\delta}(\alpha) at significance level \alpha for the one-sided case. As power increases, the density near 0 rises, enhancing the test's ability to detect deviations from H_0. The expected value of the p-value under H_1, E_{\delta}(P), is strictly less than 0.5 and decreases toward 0 for larger effects or sample sizes; for instance, with n=15 and \delta=1/3, it equals approximately 0.181. The median p-value under H_1 is also less than 0.5 and typically smaller than the expected value, further emphasizing the downward bias. For composite null hypotheses, the p-value is conservatively defined as the supremum of the tail probabilities over the null parameter space, \sup_{\theta \in H_0} P(T \geq t_{\text{obs}} \mid \theta), which can inflate p-values under H_1 by considering the least favorable null scenario, potentially reducing power. This behavior stems from the test statistic following a non-central distribution under H_1, such as the non-central normal, t, or chi-squared, which shifts the tails and compresses the p-value distribution toward lower values compared to the central distribution under H_0.
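A minimal simulation, using the one-sided z-test setting with the n = 15 and \delta = 1/3 values quoted above, illustrates how the p-value distribution shifts toward zero under H_1 and how G_{\delta}(\alpha) equals the test's power; the seed and replication count are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, delta = 15, 1 / 3        # sample size and standardized effect size under H1 (values from the text)

# Simulate one-sided z-test p-values when H1 is true:
# the standardized mean of n N(delta, 1) observations gives Z ~ N(sqrt(n) * delta, 1).
z = rng.normal(loc=np.sqrt(n) * delta, scale=1.0, size=100_000)
p_values = stats.norm.sf(z)                                     # p = 1 - Phi(Z)

print(f"mean p-value under H1:   {p_values.mean():.3f}")        # ~0.181, matching E_delta(P)
print(f"median p-value under H1: {np.median(p_values):.3f}")    # smaller than the mean

# Closed-form CDF from the text: G_delta(p) = 1 - Phi(Z_p - sqrt(n)*delta), with Z_p = Phi^{-1}(1 - p)
alpha = 0.05
power = 1 - stats.norm.cdf(stats.norm.ppf(1 - alpha) - np.sqrt(n) * delta)
print(f"G_delta(0.05), i.e. power:            {power:.3f}")
print(f"simulated fraction with p <= 0.05:    {np.mean(p_values <= alpha):.3f}")
```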

Computation

Exact Calculation Methods

Exact calculation methods for p-values involve directly computing the probability of observing data as extreme as or more extreme than the sample under the null hypothesis, using the precise sampling distribution or exhaustive enumeration, without relying on large-sample approximations. These methods are particularly valuable for small or discrete samples where asymptotic approaches may lead to inaccuracies.

In discrete tests, such as the binomial test, the exact p-value is obtained by summing the probabilities of all outcomes under the null hypothesis that are as extreme as or more extreme than the observed outcome. For a two-sided test of a fair coin (null probability p = 0.5), if the observed number of heads is k in n flips, the p-value is calculated as p = 2 \times \min\left( \sum_{i=0}^{k} P(X = i \mid n, p=0.5), \sum_{i=k}^{n} P(X = i \mid n, p=0.5) \right), where P(X = i \mid n, p=0.5) = \binom{n}{i} (0.5)^n. This ensures the p-value reflects the exact tail probabilities of the binomial distribution. For tests based on continuous reference distributions, such as the one-sample t-test assuming normality, the exact p-value uses the cumulative distribution function (CDF) of the t-distribution with df = n-1 degrees of freedom. The two-sided p-value is p = 2 \times \min( F(t_{obs}; df), 1 - F(t_{obs}; df) ), where F is the CDF and t_{obs} is the observed t-statistic.

Permutation tests provide an exact method for comparing groups or assessing associations by enumerating or sampling from the reference distribution under the null hypothesis of exchangeability. The p-value is the proportion of permuted datasets (out of all possible rearrangements) where the test statistic is greater than or equal to the observed statistic: p = \frac{1 + \# \{ \text{permutations with } T^* \geq T_{obs} \}}{N + 1}, where N is the number of permutations (often all \binom{n}{k} for complete exactness) and T^* is the statistic from a permuted sample. This approach is distribution-free and exact when all permutations are enumerated. Exact p-values maintain the nominal type I error rate precisely in small samples, avoiding inflation or deflation that can occur with approximate methods.
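The sketch below illustrates two of these approaches with NumPy and SciPy: the two-sided binomial tail sum, and a permutation test approximated by randomly sampled rearrangements (a Monte Carlo version of the full enumeration described above). The group data are invented purely for illustration.

```python
import numpy as np
from scipy import stats

# --- Exact binomial test: two-sided p-value for k heads in n flips under p0 = 0.5 ---
n, k, p0 = 10, 8, 0.5
lower_tail = stats.binom.cdf(k, n, p0)          # P(X <= k)
upper_tail = stats.binom.sf(k - 1, n, p0)       # P(X >= k)
p_binom = min(1.0, 2 * min(lower_tail, upper_tail))
print(f"exact binomial two-sided p = {p_binom:.4f}")   # ~0.1094

# --- Permutation test for a difference in group means (sampled permutations) ---
rng = np.random.default_rng(2)
group_a = np.array([4.1, 5.0, 6.2, 5.5, 4.8])   # hypothetical data
group_b = np.array([6.8, 7.1, 5.9, 7.4, 6.5])
observed = abs(group_a.mean() - group_b.mean())

pooled = np.concatenate([group_a, group_b])
n_perm, n_a = 10_000, len(group_a)
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)
    stat = abs(perm[:n_a].mean() - perm[n_a:].mean())
    count += stat >= observed

# Add-one correction (as in the formula above) so the estimate is never exactly zero
p_perm = (count + 1) / (n_perm + 1)
print(f"permutation p = {p_perm:.4f}")
```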

Asymptotic Approximations

Asymptotic approximations provide efficient methods for computing p-values when exact distributions are difficult or computationally intensive to obtain, particularly in large-sample settings. These techniques rely on large-sample asymptotic theory, most notably the central limit theorem, which posits that under certain conditions, the distribution of a test statistic converges to a known limiting distribution—such as the standard normal or chi-squared—as the sample size n increases. For instance, in the z-test for a population mean, the test statistic z_{\text{obs}} is asymptotically standard normal under the null hypothesis, allowing the two-sided p-value to be approximated as p \approx 2 \left(1 - \Phi(|z_{\text{obs}}|)\right), where \Phi denotes the cumulative distribution function of the standard normal distribution. Similarly, for the chi-squared goodness-of-fit test under the multinomial model, the p-value is approximated as p \approx 1 - F(\chi^2_{obs}; df), where F is the CDF of the chi-square distribution with appropriate degrees of freedom; this approximation is reliable when expected frequencies are sufficiently large (e.g., all at least 1, with no more than 20% less than 5).

Bootstrap methods offer a nonparametric alternative for estimating the p-value distribution empirically through resampling. Introduced by Bradley Efron in 1979, the bootstrap involves repeatedly drawing samples with replacement from the observed data to generate an empirical distribution of the test statistic, from which the p-value is computed as the proportion of bootstrap statistics at least as extreme as the observed one. This approach is particularly useful for complex statistics where asymptotic normality may not hold exactly, providing a flexible approximation without strong parametric assumptions. Monte Carlo simulations extend these ideas to cases with intractable distributions by generating random samples from the null hypothesis to approximate the sampling distribution of the test statistic. The p-value is then estimated as the fraction of simulated statistics that are more extreme than the observed value, offering a practical solution for non-standard tests where closed-form approximations are unavailable. The validity of these asymptotic approximations hinges on verifying underlying conditions, such as independence of observations, finite variance, and sufficiently large sample sizes, to ensure the limiting distributions are reliable. While exact methods serve as a precise baseline for small samples, asymptotic and simulation-based approaches scale better for large or intricate models.
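As a hedged illustration, the following sketch computes an asymptotic chi-squared goodness-of-fit p-value for invented die-roll counts and cross-checks it with a Monte Carlo simulation under the uniform null; the counts and simulation size are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# --- Chi-squared goodness-of-fit: asymptotic p-value for hypothetical die-roll counts ---
observed = np.array([18, 22, 16, 25, 20, 19])
expected = np.full(6, observed.sum() / 6)
chi2_obs = np.sum((observed - expected) ** 2 / expected)
p_asymptotic = stats.chi2.sf(chi2_obs, df=len(observed) - 1)   # 1 - F(chi2_obs; df)
print(f"asymptotic chi-squared p = {p_asymptotic:.4f}")

# --- Monte Carlo check of the same test: simulate counts under the uniform null ---
n_sim, n_rolls = 20_000, observed.sum()
sims = rng.multinomial(n_rolls, [1 / 6] * 6, size=n_sim)
chi2_sim = np.sum((sims - expected) ** 2 / expected, axis=1)
p_monte_carlo = (np.sum(chi2_sim >= chi2_obs) + 1) / (n_sim + 1)
print(f"Monte Carlo p = {p_monte_carlo:.4f}")   # close to the asymptotic value
```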

Applications in Hypothesis Testing

Role in Decision-Making

In the hypothesis testing framework, the p-value facilitates decision-making by quantifying the compatibility of observed data with the null hypothesis H_0. The process begins with the computation of a test statistic from the sample data, followed by derivation of the p-value as the probability of obtaining a result at least as extreme as observed, assuming H_0 is true. This p-value is then compared to a pre-specified significance level \alpha (commonly 0.05); if p < \alpha, H_0 is rejected in favor of the alternative hypothesis H_a, indicating that the data provide sufficient evidence against H_0 to warrant the decision. The significance level \alpha serves as a control for Type I errors, defined as the long-run proportion of false rejections of a true H_0 across repeated tests, ensuring that the decision rule limits the risk of erroneous conclusions to at most \alpha. In the Neyman-Pearson framework, these decisions are framed as optimal rules that balance Type I error (\alpha) and Type II error (\beta) rates by defining critical regions where the likelihood ratio favors H_a over H_0, with the p-value determining entry into such regions. The nature of the decision also depends on whether the test is one-sided or two-sided, aligned with the directional specificity of the research question encoded in H_a. One-sided tests compute the p-value by considering extremity in only the predicted direction (e.g., greater than a value), yielding a more sensitive threshold for rejection when directionality is theoretically justified, whereas two-sided tests account for deviations in either direction, doubling the one-sided p-value to reflect bidirectional uncertainty. Unlike the binary outcome of rejection or failure to reject, the p-value offers a continuous measure of evidential strength against H_0, allowing nuanced interpretation where values closer to zero suggest progressively stronger incompatibility with the null, though the formal decision remains dichotomous based on \alpha. This gradation supports the Neyman-Pearson emphasis on error-controlled actions rather than probabilistic beliefs about hypotheses.
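A minimal sketch of this decision rule, assuming a t-distributed test statistic with illustrative values t_obs = 1.9 and 24 degrees of freedom, shows how the one-sided p-value is half the two-sided one and how the two can lead to different decisions at \alpha = 0.05.

```python
from scipy import stats

alpha = 0.05                 # pre-specified significance level
t_obs, df = 1.9, 24          # hypothetical observed t-statistic and degrees of freedom

# One-sided p-value (H_a: mean greater than the null value) and two-sided p-value
p_one = stats.t.sf(t_obs, df)
p_two = 2 * stats.t.sf(abs(t_obs), df)   # double the one-sided tail for a symmetric statistic

for label, p in [("one-sided", p_one), ("two-sided", p_two)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{label}: p = {p:.4f} -> {decision}")
# Here the one-sided test rejects while the two-sided test does not, reflecting the
# greater sensitivity of a directional test when the direction is specified in advance.
```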

Practical Examples

A classic example of p-value application is the binomial test for assessing coin fairness. Suppose a coin is flipped 10 times, resulting in 8 heads observed. The null hypothesis H_0 states that the probability of heads is p = 0.5 (fair coin), while the alternative H_1 posits p \neq 0.5 (biased coin). The test statistic is the number of heads, and under H_0, it follows a binomial distribution with parameters n=10 and p=0.5. The two-sided p-value is calculated as 2 \times P(X \geq 8 \mid n=10, p=0.5) = 2 \times \sum_{k=8}^{10} \binom{10}{k} (0.5)^{10} = 2 \times (45 + 10 + 1)/1024 = 112/1024 \approx 0.109. Since 0.109 > 0.05, the null hypothesis is not rejected at the 5% significance level, indicating insufficient evidence of bias. To contextualize, the effect size can be measured using Cohen's h, approximately 0.63 here, suggesting a moderate deviation from fairness despite the non-significant p-value.

Another common scenario involves the independent samples t-test for comparing means between two groups, such as petal lengths from two iris species. Sample data yield a mean of 1.46 cm (SD = 0.206, n=25) for species 1 and 5.54 cm (SD = 0.569, n=25) for species 2. The null hypothesis H_0 is that the population means are equal (\mu_1 = \mu_2), against H_1: \mu_1 \neq \mu_2. The t-statistic is computed as t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_p^2 (1/n_1 + 1/n_2)}, where s_p^2 is the pooled variance, yielding t ≈ -33.719 with df ≈ 30.196. The two-sided p-value from the t-distribution is below 2.2 × 10^{-16}. As p < 0.001, H_0 is rejected, supporting a difference in means. The effect size, Cohen's d ≈ 9.53, indicates a very large practical difference, far beyond statistical significance alone.

For multi-group comparisons, one-way ANOVA tests the equality of means across three or more groups, exemplified by sprint times differing by smoking status. Data include means of 6.411 seconds (n=261) for nonsmokers, 6.835 seconds (n=33) for past smokers, and 7.121 seconds (n=59) for current smokers. The H_0 assumes equal population means across groups (\mu_1 = \mu_2 = \mu_3), versus H_1 that at least one differs. The F-statistic is the ratio of between-group to within-group mean square, F(2, 350) = 9.209. The p-value is the right-tail probability under the F-distribution with 2 and 350 degrees of freedom, p < 0.001. Thus, H_0 is rejected, indicating significant differences. The effect size, partial eta-squared ≈ 0.05, reflects a small to moderate overall impact of smoking status on sprint time.
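The coin-fairness example above can be reproduced directly; the sketch below sums the binomial tail probabilities by hand and cross-checks the result with SciPy's exact binomial test (which yields the same value here because the null probability 0.5 makes the distribution symmetric).

```python
from math import comb
from scipy import stats

n, k, p0 = 10, 8, 0.5

# Direct tail sum from the text: 2 * P(X >= 8 | n=10, p=0.5)
upper_tail = sum(comb(n, i) * p0**n for i in range(k, n + 1))   # (45 + 10 + 1) / 1024
p_manual = 2 * upper_tail
print(f"manual two-sided p = {p_manual:.4f}")       # 112/1024 ~ 0.1094

# Cross-check with SciPy's exact binomial test
result = stats.binomtest(k, n, p0, alternative="two-sided")
print(f"scipy binomtest p  = {result.pvalue:.4f}")
```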

Pitfalls and Misuses

Common Interpretive Errors

One of the most prevalent interpretive errors surrounding p-values is the misconception that a p-value represents the probability that the null hypothesis is true, denoted as P(H_0 \mid \text{data}). In reality, the p-value is the probability of observing data at least as extreme as the actual data assuming the null hypothesis is true, P(\text{data} \mid H_0), and it provides no direct measure of the posterior probability of the null. This confusion leads researchers to erroneously conclude that a small p-value, such as 0.05, implies only a 5% chance the null is correct, thereby overstating the evidence against it. Such misinterpretation is widespread across disciplines, including psychology and medicine, where it undermines proper Bayesian updating and evidence assessment.

Another common error is the dichotomization of p-values, where results are rigidly classified as "statistically significant" if p < 0.05 and dismissed otherwise, ignoring the continuous nature of the evidence and the magnitude of effects. This threshold, arbitrarily set at 0.05, fosters a false sense of certainty and discourages nuanced evaluation of practical importance or effect sizes. Applied researchers, including statisticians, frequently misuse p-values in this way, leading to distorted inferences and reduced replicability in fields like the social and biomedical sciences. The practice amplifies selective reporting, as borderline p-values near 0.05 are more likely to be highlighted or manipulated than those slightly above the threshold.

This dichotomization contributes to the replication crisis, as studies with p-values just below 0.05 often reflect low statistical power, particularly for small effect sizes common in behavioral and biomedical research. For instance, meta-analyses reveal that typical studies have statistical power of around 21%, meaning even a significant p-value of 0.05 provides only weak evidence against the null hypothesis and has roughly a 50% chance of replicating under similar conditions for modest effects. Low power inflates false positives and overestimates effect sizes, explaining why large-scale replication efforts, such as those in psychology, succeed in only about 36% of cases at the 0.05 level. These issues highlight how interpretive reliance on threshold p-values exacerbates non-replicability across sciences.

P-hacking represents a deliberate interpretive and methodological error, involving the manipulation of data collection, analysis choices, or reporting to achieve a p-value below 0.05, often through practices like optional stopping, selective outcome variables, or covariate inclusion. This abuse creates illusory significance, publishing false positives that mislead meta-analyses and waste resources on non-replicable findings. Simulations and empirical surveys indicate p-hacking substantially inflates the rate of significant results in fields prone to flexibility in analysis, further eroding trust in published research.

Prior to the 2016 American Statistical Association (ASA) statement, scientific journals heavily overemphasized p-values as the primary criterion for publication, treating p < 0.05 as definitive proof of importance and sidelining effect sizes or study context. This gatekeeping role encouraged p-hacking and the file-drawer problem, where non-significant results were suppressed, skewing the literature toward inflated effects. The statement explicitly warned against such overreliance, noting that p-values alone cannot substantiate claims and should be supplemented with broader evidence assessment to restore scientific integrity.
A follow-up 2021 ASA President's Task Force statement reinforced these principles, emphasizing that p-values and significance tests are valid but should not be the sole basis for scientific conclusions, and advocating for transparency, replication, and full contextual reporting to mitigate misuses.
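As an illustration of how optional stopping (one common form of p-hacking) inflates false positives, the simulation below repeatedly tests accumulating normal data with no true effect and stops as soon as p < 0.05; the sample sizes and peeking schedule are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def optional_stopping_significant(max_n=100, start_n=10, step=10, alpha=0.05):
    """Simulate one study under H0 (true mean 0), peeking at the data every `step`
    observations and stopping as soon as p < alpha: a simple form of p-hacking."""
    data = rng.normal(loc=0.0, scale=1.0, size=max_n)
    for n in range(start_n, max_n + 1, step):
        if stats.ttest_1samp(data[:n], popmean=0.0).pvalue < alpha:
            return True
    return False

n_studies = 5_000
false_positive_rate = np.mean([optional_stopping_significant() for _ in range(n_studies)])
print(f"false-positive rate with optional stopping: {false_positive_rate:.3f}")  # well above 0.05
```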

Issues with Multiple Comparisons

When conducting multiple hypothesis tests in a single study, the probability of encountering at least one false positive—known as the family-wise error rate (FWER)—increases substantially without appropriate adjustments. For k independent tests each conducted at significance level \alpha, the FWER approximates 1 - (1 - \alpha)^k, which approaches 1 as k grows large even for modest \alpha = 0.05. One straightforward method to control the FWER is the Bonferroni correction, which adjusts each p-value by multiplying it by the number of tests k and capping at 1, yielding the adjusted p-value p_i' = \min(k p_i, 1); a null hypothesis is rejected if p_i' \leq \alpha. This procedure ensures the overall FWER does not exceed \alpha regardless of the dependence structure among the tests.

A less conservative alternative to the single-step Bonferroni method is the Holm-Bonferroni procedure, a stepwise approach that orders p-values from smallest to largest and applies sequentially decreasing thresholds \alpha / (k - i + 1) for the i-th smallest p-value, stopping when a p-value exceeds its threshold. This method maintains FWER control at level \alpha while being uniformly more powerful than Bonferroni, as it avoids overly penalizing early significant results.

In scenarios involving many tests, such as large-scale studies, controlling the FWER can be overly stringent, leading to low statistical power; instead, the false discovery rate (FDR)—the expected proportion of false positives among all rejected hypotheses—offers a more flexible alternative. The Benjamini-Hochberg procedure controls the FDR by sorting p-values and rejecting all hypotheses up to the largest i where p_{(i)} \leq (i/k) \alpha, providing less conservative control suitable for high-dimensional data like genome-wide association studies. These corrections involve trade-offs: FWER methods like Bonferroni and Holm-Bonferroni are conservative, ensuring strict control over any false positives but reducing statistical power, whereas FDR approaches like Benjamini-Hochberg enhance power at the cost of allowing some false discoveries, which is preferable when many true effects are anticipated.
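A small sketch of these corrections, applied to an invented vector of eight p-values, shows the Bonferroni adjustment and a hand-rolled Benjamini-Hochberg step-up rule; in practice a library routine such as statsmodels' multipletests would typically be used instead.

```python
import numpy as np

# Hypothetical p-values from k = 8 independent tests
p = np.array([0.001, 0.008, 0.012, 0.045, 0.060, 0.120, 0.350, 0.700])
k, alpha = len(p), 0.05

# Bonferroni: adjusted p-value min(k * p_i, 1); controls the FWER at level alpha
bonferroni_adj = np.minimum(k * p, 1.0)
print("Bonferroni rejections:", p[bonferroni_adj <= alpha])

# Benjamini-Hochberg: reject all hypotheses up to the largest i with p_(i) <= (i/k) * alpha;
# this controls the FDR and is less conservative than FWER control
order = np.argsort(p)
sorted_p = p[order]
thresholds = (np.arange(1, k + 1) / k) * alpha
passing = np.nonzero(sorted_p <= thresholds)[0]
if passing.size > 0:
    cutoff = passing.max()                      # index of the largest i satisfying the condition
    print("BH rejections:", sorted_p[: cutoff + 1])
else:
    print("BH rejections: none")
```

With these invented p-values, Bonferroni rejects only the smallest one while Benjamini-Hochberg rejects the three smallest, illustrating the power gain from FDR control.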

Historical Context

Origins and Early Development

The concept of the p-value has roots in pre-20th-century statistical practices, particularly in the framework of inverse probability, where early probabilists like Pierre-Simon Laplace (1774–1829) sought to infer causes from observed effects by calculating probabilities of data under assumed hypotheses. This approach laid groundwork for assessing the improbability of observations under a null model, though it lacked formalization as a tail probability. In the late 19th century, Francis Galton popularized the "probable error" (PE)—a measure equivalent to about 0.6745 standard deviations—as a way to quantify variability in anthropometric data, using multiples of PE (e.g., 3PE) to gauge unlikely deviations and foreshadowing significance thresholds.

The p-value was formally introduced by Karl Pearson in 1900 within the context of his chi-squared goodness-of-fit test, published in The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. Pearson described the p-value as the probability of observing data at least as extreme as that obtained, assuming the fitted model holds, thereby providing a criterion to judge whether deviations from expected frequencies could reasonably arise by random sampling. This innovation, detailed in his seminal paper "X. On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such that it Can Be Reasonably Supposed to Have Arisen from Random Sampling," marked the p-value's entry into statistical practice for testing distributional assumptions.

Early adoption expanded in the 1900s and 1920s through contributions from William Sealy Gosset and Ronald A. Fisher. In 1908, Gosset, writing as "Student," developed the t-distribution for small-sample inference on means, incorporating p-value-like probabilities to assess the "probable error" of estimates in agricultural experiments at the Guinness brewery. Fisher advanced the p-value in the 1920s, notably through his exact test for 2x2 contingency tables—illustrated in a famous thought experiment involving a lady tasting tea to distinguish milk-first versus tea-first preparation—and formalized in his 1925 book Statistical Methods for Research Workers. There, Fisher proposed the p-value as a measure of evidential weight against the null hypothesis, suggesting a 0.05 threshold for practical convenience while emphasizing its continuous nature over rigid cutoffs.

In the 1930s, Jerzy Neyman and Egon Pearson built on these foundations with their likelihood ratio framework for hypothesis testing, introduced in their 1933 paper "On the Problem of the Most Efficient Tests of Statistical Hypotheses." Unlike Fisher's focus on p-values as degrees of evidence, the Neyman-Pearson approach treated them within a decision-theoretic framework, emphasizing control of error rates (Type I at α=0.05) and power against alternatives, which contrasted sharply with Fisher's inductive, non-decisionistic interpretation. This duality shaped the p-value's dual roles in evidential assessment and formal testing during its early development.

Modern Debates and Reforms

In the 2010s, concerns over the reproducibility of scientific findings intensified, with studies attributing low replication rates to an over-reliance on the conventional p < 0.05 threshold. The Open Science Collaboration's large-scale replication attempt in psychology, for instance, successfully reproduced only 36% of original effects from 100 studies, highlighting how dichotomous significance testing often leads to inflated false positives and fragile results. This reproducibility crisis prompted widespread debate, as similar patterns emerged across fields like medicine and the social sciences, where selective reporting of significant p-values contributed to non-replicable claims.

In response, the American Statistical Association (ASA) issued a landmark statement in 2016, clarifying key misconceptions and advocating for reformed practices. The statement emphasized that a p-value is not the probability that the null hypothesis is true given the data—P(H_0|data)—but rather the probability of observing data at least as extreme as those obtained, assuming the null is true. It urged researchers to avoid treating p < 0.05 as a bright-line threshold for dichotomizing results, instead prioritizing estimation of effect sizes, uncertainty via confidence intervals, and contextual interpretation over rigid significance testing. This guidance influenced statistical education and policy, promoting a shift toward more nuanced inference.

Several reform proposals emerged to address these issues, including outright bans on p-values and stricter thresholds. David Trafimow, as editor of Basic and Applied Social Psychology, announced in 2015 a policy prohibiting the publication of p-values, arguing they encourage mechanical hypothesis testing without substantive insight and recommending alternatives like Bayesian methods or confidence intervals. Conversely, Daniel Benjamin and colleagues proposed in 2018 redefining statistical significance at p < 0.005 for novel findings, aiming to reduce false positives while maintaining computational feasibility, though this sparked debate over increased false negatives. These ideas underscore ongoing tensions between tradition and innovation in statistical practice.

By 2025, major journals had incorporated these debates into updated reporting guidelines, mandating effect sizes alongside p-values to ensure comprehensive evaluation of results. For example, the Journal of Marketing revised its policy effective March 2025, requiring authors to report exact p-values (to three decimal places) without significance asterisks, paired with domain-appropriate effect sizes like Cohen's d or elasticities, to highlight practical importance beyond statistical significance. In high-dimensional settings, such as genomics or neuroimaging, p-values face additional challenges, including exacerbated multiple testing problems where the sheer volume of tests inflates Type I error rates despite corrections like Bonferroni, often rendering traditional thresholds unreliable without false discovery rate control or alternative metrics.

Connection to Confidence Intervals

In frequentist statistical inference, p-values and confidence intervals exhibit a duality, where a (1 - \alpha) \times 100\% confidence interval for a parameter \theta comprises all null values \theta_0 for which the p-value of testing H_0: \theta = \theta_0 exceeds \alpha. This means the interval excludes precisely those \theta_0 values that would lead to rejection of the null hypothesis at significance level \alpha. Formally, for a test of H_0: \theta = \theta_0, the condition p < \alpha holds if and only if \theta_0 lies outside the (1 - \alpha) \times 100\% confidence interval for \theta. For instance, when \alpha = 0.05, this equivalence states that p < 0.05 if and only if \theta_0 is not contained in the 95% confidence interval. This correspondence arises because confidence intervals are constructed by inverting hypothesis tests, collecting all non-rejected null values. Confidence intervals thus offer a range of plausible values for \theta consistent with the data at the chosen confidence level, whereas p-values evaluate the evidence against a single point null hypothesis. For invertible tests—those where the test statistic is monotone in the parameter, such as the t-test—this duality yields an exact one-to-one relationship, enabling straightforward derivation of intervals from test procedures. A key advantage of confidence intervals over p-values alone is their emphasis on the magnitude and precision of estimates, promoting interpretation of effect sizes and uncertainty ranges rather than reliance on arbitrary significance thresholds.
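The duality can be verified numerically. The sketch below builds a 95% t-based confidence interval for a mean from simulated data (the sample and seed are arbitrary) and checks that null values just inside the interval have p-values above 0.05 while values just outside fall below it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(loc=2.0, scale=1.0, size=25)      # hypothetical sample

# 95% confidence interval for the mean, built from the t-distribution
n, mean, se = len(x), x.mean(), stats.sem(x)
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)

def p_value(theta0):
    """Two-sided one-sample t-test p-value for H0: mu = theta0."""
    return stats.ttest_1samp(x, popmean=theta0).pvalue

# Duality check: theta0 inside the 95% CI  <=>  p-value >= 0.05 (up to numerical precision)
for theta0 in [ci_low - 0.01, ci_low + 0.01, mean, ci_high - 0.01, ci_high + 0.01]:
    inside = ci_low <= theta0 <= ci_high
    print(f"theta0 = {theta0:6.3f}  inside CI: {inside!s:5}  p = {p_value(theta0):.4f}")
```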

Distinction from Effect Sizes

An effect size quantifies the magnitude of a difference or the strength of a relationship between variables, providing a measure of practical or substantive importance that remains independent of sample size. For instance, Cohen's d is a widely used standardized effect size for comparing means, calculated as the difference between two group means divided by the pooled standard deviation:
d = \frac{\mu_1 - \mu_2}{\sigma}
where \mu_1 and \mu_2 are the population means of the two groups, and \sigma is the pooled standard deviation. This metric focuses on the scale of the difference relative to variability, offering insight into whether the effect is meaningful in real-world terms, regardless of how many observations were collected.
In contrast, the p-value indicates the probability of observing data as extreme as that obtained, assuming the null hypothesis is true, and is highly sensitive to sample size. Even trivial effect sizes can yield statistically significant p-values (e.g., p < 0.05) when the sample is large enough, as increased n reduces the standard error and boosts the test statistic, such as in a t-test where p derives from the statistic t = ( \bar{x}_1 - \bar{x}_2 ) / (s \sqrt{2/n}). Thus, a small effect might appear "significant" solely due to a large sample, misleading interpretations if magnitude is ignored. Effect sizes, being invariant to sample size, complement p-values by revealing whether the detected effect warrants attention beyond mere statistical detection. Jacob Cohen provided interpretive guidelines for effect sizes in behavioral sciences, classifying Cohen's d as small (0.2), medium (0.5), or large (0.8), emphasizing that these are conventional benchmarks rather than universal thresholds. Relying solely on p-values can be misleading, as it conflates evidence against the null with the effect's practical relevance; for example, a highly significant p-value from a large study might correspond to a negligible d < 0.2, indicating minimal real-world impact. Professional standards, such as those from the American Psychological Association (APA), advocate reporting both p-values and effect sizes for comprehensive statistical reporting, ensuring results convey not only statistical reliability but also substantive meaning. This dual reporting promotes better evaluation of findings across studies, avoiding overemphasis on arbitrary significance thresholds.
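A brief simulation illustrates the distinction: with a true difference of only 0.05 standard deviations, a negligible effect by Cohen's benchmarks, the two-sample t-test p-value becomes tiny once the groups are large enough while Cohen's d stays near 0.05; the sample sizes and effect are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def cohens_d(a, b):
    """Cohen's d computed with a pooled standard deviation."""
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# A tiny true difference in means (0.05 standard deviations) at two very different sample sizes
for n in [50, 100_000]:
    a = rng.normal(loc=0.00, scale=1.0, size=n)
    b = rng.normal(loc=0.05, scale=1.0, size=n)
    result = stats.ttest_ind(a, b)
    print(f"n = {n:>7}: p = {result.pvalue:.2e}, Cohen's d = {cohens_d(a, b):+.3f}")
# With n = 100,000 per group the p-value is tiny even though d remains near 0.05,
# illustrating why effect sizes should be reported alongside p-values.
```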
