Fisher's method
Fisher's method is a statistical technique for combining p-values obtained from multiple independent hypothesis tests to produce an overall assessment of significance, particularly useful when testing the same null hypothesis across different datasets or experiments. The method computes a test statistic defined as -2 \sum_{i=1}^{k} \ln(p_i), where p_i are the individual p-values and k is the number of tests; under the null hypothesis, this statistic follows a chi-squared distribution with 2k degrees of freedom.[1]

Developed by British statistician and geneticist Sir Ronald A. Fisher, the method was first suggested in the 1925 edition of his influential book Statistical Methods for Research Workers, where he proposed using the product of the individual probabilities to obtain "a single test of the significance of the aggregate." Fisher elaborated on the approach in a 1948 article in The American Statistician, emphasizing its application to independent tests and the chi-squared approximation for determining the combined p-value.[2]

The procedure assumes that the tests are independent and that the p-values are uniformly distributed under the null hypothesis, making it particularly powerful for detecting subtle effects when multiple lines of evidence converge against the null.[1] In practice, Fisher's method transforms small individual p-values into an even smaller combined p-value, enhancing the ability to detect signals in scenarios where no single test reaches conventional significance thresholds like 0.05.[3] It has been shown to be asymptotically optimal among common combination methods under certain conditions, such as when effect sizes are equal across tests, due to its high Bahadur relative efficiency.[1] However, its performance can degrade with dependent tests or highly unequal effect sizes, prompting extensions like weighted versions or adaptations for correlation structures.[4]

The method finds broad application in fields requiring evidence synthesis,
including meta-analyses of clinical trials, genomic studies for gene enrichment, and bioinformatics for integrating multi-omics data.[3] For instance, in large-scale genomics, it helps identify pathways by pooling p-values from association tests across traits or datasets.[5] Implementations are available in statistical software such as R's poolr package and Python's SciPy library, facilitating its routine use while accounting for nuances like one-sided versus two-sided p-values.[6] Despite its strengths, users must verify independence assumptions, as violations can inflate type I error rates, and alternative methods like Stouffer's Z-score may be preferable for dependent data.[7]
Introduction
Definition and Purpose
Fisher's method is a statistical technique used in meta-analysis to aggregate p-values from multiple independent hypothesis tests, each providing evidence against a common null hypothesis. It combines the p-values p_1, p_2, \dots, p_k from k such tests into a single test statistic, enabling a more powerful assessment of the overall evidence than any individual test alone. This approach is particularly valuable when individual tests may yield non-significant results due to limited sample sizes or effect magnitudes, yet collectively suggest a stronger signal.[1] The primary purpose of Fisher's method is to enhance statistical power for detecting shared effects across studies or experiments, making it suitable for fields such as genomics, where thousands of tests are performed to identify associations between genetic variants and traits, and epidemiology, where evidence from multiple cohorts or endpoints is pooled to evaluate risk factors. By focusing solely on p-values, the method is non-parametric, requiring no assumptions about the underlying effect sizes, test distributions, or parametric forms beyond the uniformity of p-values under the null hypothesis. This flexibility allows its application to diverse data types and test statistics without needing raw data or standardized effect measures.[1][8] Named after Ronald A. Fisher, the method was developed to combine probabilities in the context of experimental design, as introduced in his seminal work on statistical methods for research.[1]
Historical Background
Ronald A. Fisher introduced the method for combining p-values in his seminal 1925 book Statistical Methods for Research Workers, where he proposed using the product of probabilities from independent tests to assess overall significance in replicated experiments. This approach allowed researchers to aggregate evidence from multiple similar tests, transforming individual p-values into a single chi-squared statistic under the null hypothesis. Fisher elaborated on the inferential principles underlying this technique in his 1956 work Statistical Methods and Scientific Inference, emphasizing its role in inductive reasoning for scientific discovery.[9] The method emerged amid the foundational debates on statistical inference in the 1920s and 1930s, particularly the Neyman-Pearson framework versus Fisher's significance testing paradigm. Fisher advocated combining p-values specifically for synthesizing results from homogeneous replicated experiments, contrasting with Neyman and Pearson's focus on power and error rates in hypothesis testing.[10] This period of contention shaped modern statistical practice, with Fisher's method positioning evidence accumulation as central to rejecting null hypotheses based on improbability alone. During his tenure at the Rothamsted Experimental Station from 1919 to 1943, Fisher applied the method to analyze agricultural field trials and biological assays, where vast datasets from long-term experiments required integrating multiple significance tests to draw robust conclusions about crop yields and treatments.[11] These early uses demonstrated its practicality in handling variability in experimental data, influencing the station's adoption of randomized designs and significance testing protocols. 
The method gained wider prominence in the mid-20th century alongside the development of meta-analysis techniques, particularly through extensions by statisticians like William Cochran, though Fisher consistently stressed its suitability for scenarios assuming homogeneous effects across studies rather than heterogeneous ones.[11] This emphasis underscored its original intent for controlled, replicated scientific inquiries rather than broad syntheses of diverse evidence.
Mathematical Formulation
For Independent Test Statistics
Fisher's method applies specifically when the test statistics from multiple hypothesis tests are independent, providing a way to aggregate evidence against a common null hypothesis across the tests. The core of the method is the computation of a combined test statistic from the individual p-values obtained from these tests. This statistic leverages the logarithmic transformation of the p-values to produce a quantity that follows a known distribution under the null hypothesis, enabling a unified assessment of significance.[7] The test statistic is defined as \chi^2 = -2 \sum_{i=1}^k \ln(p_i), where p_i denotes the p-value from the i-th independent test, for i = 1, \dots, k, and k is the number of tests. This formula arises from the property that, under the null hypothesis, each p_i is uniformly distributed on [0, 1], implying that -2 \ln(p_i) follows a chi-squared distribution with 2 degrees of freedom. Since the tests are independent, the sum of these independent chi-squared random variables yields \chi^2 distributed as chi-squared with 2k degrees of freedom. The derivation thus transforms the product of p-values into a tractable chi-squared form.[7][1] Key assumptions underpinning the method include the independence of the underlying test statistics, ensuring no correlation between the p-values, and the uniform distribution of each p_i under the null hypothesis for continuous test statistics. Additionally, the method assumes a homogeneous alternative hypothesis, where the direction and strength of evidence against the null are consistent across tests, to maintain optimal power; p-values should derive from one-sided or two-sided tests as appropriate to the research context, with two-sided p-values commonly used when the direction of effect is unspecified.
Violations of independence can distort the reference distribution, though the method is robust to moderate departures under certain conditions.[7][1] To apply the method, first collect the p-values p_i from each of the k independent tests, ensuring they are computed consistently (e.g., all two-sided if applicable). Next, compute the test statistic \chi^2 using the summation formula. Finally, compare \chi^2 to the critical value from the chi-squared distribution with 2k degrees of freedom at the desired significance level \alpha, or derive the combined p-value as the survival function of this distribution evaluated at \chi^2; reject the null if the combined p-value is below \alpha. This procedure integrates the evidence from all tests into a single decision rule, enhancing detection power when multiple lines of evidence align against the null.[7][1]
Distribution and Computation
Under the null hypothesis, assuming the individual p-values are independent and uniformly distributed on [0,1], the test statistic -2 \sum_{i=1}^k \log p_i follows a chi-squared distribution with 2k degrees of freedom exactly.[7] This result holds for p-values derived from continuous test statistics, providing a precise distributional basis for inference without reliance on asymptotic approximations.[12] The combined p-value is computed as the survival function of the chi-squared distribution evaluated at the observed test statistic, that is, \text{p-value} = P(\chi^2_{2k} > -2 \sum_{i=1}^k \log p_i), which can be obtained directly from the cumulative distribution function (CDF) of the chi-squared distribution. Since the \chi^2_{2k} distribution is equivalent to a gamma distribution with shape parameter k and scale parameter 2, the p-value may alternatively be expressed using the regularized upper incomplete gamma function for numerical evaluation. For small k (such as 2 to 10), exact critical values and p-values can be referenced from published tables derived from the distribution.[7] Simulation methods, involving repeated generation of uniform p-values and recomputation of the statistic, offer an additional approximate approach for verification when k is small, though direct CDF evaluation is typically sufficient and more efficient.[1] Numerical computation requires care with very small p-values, as \log(0) is undefined and products of small p-values can cause underflow. To address this, the test statistic is calculated in logarithmic space by summing -2 \log p_i, and any zero p-values are conventionally replaced with a small positive value (e.g., machine epsilon or 10^{-16}) to ensure stability. For large k, standard statistical software employs optimized algorithms for the chi-squared CDF, maintaining accuracy without excessive computational cost.
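These numerical precautions can be sketched in a short Python function (illustrative; the function name is not standard, and the 10^{-16} floor for zero p-values is one conventional choice, not a fixed rule):

```python
# A minimal sketch of Fisher's method computed in log space.
# The floor for zero p-values is an illustrative convention.
import math
from scipy.stats import chi2

def fisher_combine(pvals, floor=1e-16):
    """Combine independent p-values via Fisher's method, summing in log space."""
    clipped = [max(p, floor) for p in pvals]          # guard against log(0)
    stat = -2.0 * sum(math.log(p) for p in clipped)   # -2 * sum(ln p_i)
    df = 2 * len(clipped)                             # chi-squared with 2k df
    return stat, chi2.sf(stat, df)                    # survival function = combined p

stat, p = fisher_combine([0.02, 0.001, 0.15])
```

Because \chi^2_{2k} equals a gamma distribution with shape k and scale 2, `scipy.stats.gamma.sf(stat, k, scale=2)` gives the same value as `chi2.sf(stat, 2*k)` and can serve as a numerical cross-check.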
The distributional result remains valid across all k, with practical accuracy of the chi-squared form improving for larger k in cases involving discrete p-values or minor deviations from uniformity, though it is inherently exact under ideal conditions.[1]
Applications
In Meta-Analysis
Fisher's method plays a central role in meta-analysis by aggregating p-values from multiple independent studies that test the same null hypothesis, thereby increasing the statistical power to detect an overall effect where individual studies may lack sufficient evidence alone. This approach is particularly valuable in systematic reviews, such as those evaluating drug efficacy, where it synthesizes evidence from disparate trials without requiring access to raw data beyond the reported p-values. By transforming and combining these p-values, the method produces a single test statistic that follows a known distribution under the null hypothesis, enabling a unified assessment of significance across the body of research.[3] One key advantage of Fisher's method in meta-analysis is its simplicity, as it requires only the p-values from each study and does not necessitate estimates of effect sizes or their variances, making it computationally straightforward and applicable even when detailed study data are unavailable. It is especially suitable for scenarios involving homogeneous effects across studies, where the assumption of independence holds, allowing for robust detection of subtle signals that might be obscured in single analyses. This efficiency has made it a preferred choice over more complex methods in resource-limited settings, though its performance shines in balanced datasets without extreme outliers.[13][3] The method finds widespread application in fields like genetics, where it is routinely used to combine signals from genome-wide association studies (GWAS) to identify genetic variants associated with traits or diseases. In clinical trials, it supports the integration of results from randomized controlled trials to assess treatment outcomes, such as in oncology or pharmacology. 
Similarly, in environmental science, Fisher's method aids in synthesizing evidence from ecological studies, for instance, evaluating the impact of pollutants on biodiversity across multiple sites. These applications leverage the method's ability to handle diverse datasets while assuming study independence.[14] In practice, the workflow for applying Fisher's method in meta-analysis begins with selecting relevant studies that provide independent data and report p-values for the hypothesis of interest, ensuring exclusion of overlapping samples to maintain the independence assumption. Researchers then extract these p-values, compute the combined test statistic as described in the mathematical formulation, and derive the overall p-value from its chi-squared distribution. The final step involves reporting the combined p-value alongside sensitivity analyses to confirm robustness, providing a clear summary of the aggregated evidence for decision-making in policy or further research.[15][3]
Modern Examples
In genomics research, Fisher's method has been applied to combine p-values from multiple genome-wide association studies (GWAS) to enhance the detection of rare variants associated with complex diseases. For instance, a 2021 study evaluated Fisher's method alongside other combination techniques for identifying incomplete associations in rare variant analyses, finding it robust but demonstrating that weighted variants like wFisher showed superior power in scenarios with moderate effects across diverse genomic datasets, such as those from prostate cancer cohorts. This approach allows researchers to aggregate evidence from independent GWAS on cancer susceptibility, improving statistical power for rare variants that might otherwise be overlooked in individual studies.[3] In epidemiology, particularly during the COVID-19 pandemic, Fisher's method has facilitated the aggregation of statistical signals from hypothesis tests across geographic regions to detect outbreaks without centralizing sensitive data. A 2023 study on federated epidemic surveillance applied Fisher's method to combine p-values from local tests on hospitalization counts reported to the U.S. Department of Health and Human Services (HHS), achieving high recall rates (up to 99.2% true positives detected at the week of the true surge) in semi-synthetic COVID-19 data spanning multiple regions. This enabled timely identification of hospitalization trends while preserving privacy, outperforming uncombined local analyses in unevenly distributed data scenarios.[16] Microbiome analysis has leveraged Fisher's method for meta-analyses that integrate p-values from diverse statistical tests to assess the significance of taxa in gut health studies. 
A 2022 investigation compared Fisher's method with alternatives like Stouffer's for combining p-values from methods such as ANCOM-BC and Wilcoxon tests on gut microbiome datasets, noting challenges with type I error control due to p-value correlations and recommending alternatives like the Cauchy test for better performance in heterogeneous samples. This application has supported efforts to identify key microbial signatures in gut dysbiosis across multiple cohorts, though with caveats on assumptions.[17] In federated learning frameworks, weighted variants of Fisher's method have been employed for privacy-preserving statistical inference in distributed healthcare data. The aforementioned 2023 federated surveillance study extended Fisher's method through site-specific weighting to combine p-values from local models on HHS hospitalization data, detecting trends with 99.2% true positive rates (at the week of the surge) during COVID-19 surges across decentralized custodians. This weighted approach mitigated biases from varying regional sample sizes, enabling robust trend detection in real time without data sharing, and has implications for broader applications in multi-institutional health monitoring.[16]
Limitations and Assumptions
Independence Requirement
The independence requirement in Fisher's method stipulates that the individual hypothesis tests must be statistically independent, such that the resulting p-values are uncorrelated under the null hypothesis. This condition holds when the tests do not share underlying data or variables that could induce correlation, such as overlapping samples or common covariates across analyses.[5] The rationale for this assumption lies in its role in deriving the null distribution of the combined test statistic. Specifically, independence guarantees that the quantity -2 \sum_{i=1}^k \ln(p_i) follows a \chi^2 distribution with 2k degrees of freedom, enabling accurate computation of the combined p-value; any correlation among the p-values disrupts this property and leads to an incorrect reference distribution.[5][1] To assess whether the independence assumption is met prior to applying the method, researchers can compute the correlation matrix of the underlying test statistics and verify that off-diagonal elements are negligible or zero, or conduct empirical simulations to evaluate the joint behavior of the p-values under the null.[4] Fisher's method is particularly suitable for scenarios involving disjoint datasets, where tests draw from completely separate samples, such as independent clinical trials evaluating the same intervention across different populations or laboratory experiments replicated under non-overlapping conditions.[1]
Consequences of Violations
Violations of the independence assumption in Fisher's method, particularly positive dependence among the test statistics or p-values, lead to an anti-conservative test where the Type I error rate is inflated beyond the nominal level, resulting in combined p-values that are too small and an increased likelihood of false positives.[12] For instance, when p-values exhibit positive correlation, the standard chi-squared approximation underestimates the true distribution, causing the method to reject the null hypothesis more frequently than intended.[18] In contrast, negative dependence renders the test conservative, producing combined p-values that are larger than they would be under independence, thereby reducing statistical power and making it harder to detect true effects.[19] Empirical simulations demonstrate the severity of these violations, with pre-2020 studies showing that under moderate positive correlations (e.g., 0.3 to 0.5), the Type I error rate can approximately double the nominal 0.05 level, especially at small significance thresholds.[12] To address such issues, preliminary mitigation can involve sensitivity analyses that evaluate the robustness of results to assumed dependence structures, though comprehensive remedies require specialized extensions beyond the standard method.[18]Extensions
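A small Monte Carlo sketch can illustrate this inflation (all settings are illustrative: k = 5 equicorrelated z-statistics with correlation 0.5 are combined as if independent, and the empirical rejection rate under the null is compared with the nominal level):

```python
# Monte Carlo illustration: Fisher's method applied to positively
# dependent p-values inflates the Type I error above the nominal level.
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(0)
k, rho, n_sim, alpha = 5, 0.5, 5000, 0.05

# Equicorrelated z-statistics under H0: z_i = sqrt(rho)*w + sqrt(1-rho)*e_i
shared = rng.standard_normal((n_sim, 1))
noise = rng.standard_normal((n_sim, k))
z = np.sqrt(rho) * shared + np.sqrt(1 - rho) * noise

pvals = 2 * norm.sf(np.abs(z))              # two-sided p-values
stat = -2 * np.log(pvals).sum(axis=1)       # Fisher statistic per replicate
p_combined = chi2.sf(stat, 2 * k)

type1 = float((p_combined < alpha).mean())  # empirical Type I error (> 0.05 here)
```

With these settings the empirical rate lands well above the nominal 0.05, consistent with the simulation results cited above.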
Handling Dependent Statistics
When the test statistics in Fisher's method are dependent, the standard chi-squared distribution no longer holds under the null hypothesis, leading to inflated type I error rates if independence is assumed. Adaptations extend the method by accounting for the correlation structure, typically through adjustments to the test statistic's distribution or empirical estimation of the null. These approaches maintain the core idea of combining transformed p-values but modify the inference procedure to preserve validity. Brown's method addresses known dependence structures by assuming the underlying test statistics follow a multivariate normal distribution with a specified covariance matrix. It approximates the null distribution of the Fisher statistic, -2 ∑ log(p_i), as a scaled chi-squared distribution where the scale factor and degrees of freedom are adjusted based on the covariance. Specifically, the degrees of freedom are modified using a Satterthwaite-type approximation that incorporates the trace and quadratic form of the covariance matrix of the transformed p-values, ensuring the first two moments match those of the approximated distribution. This method is particularly suitable when the dependence is fully specified, such as in designed experiments with known correlations. Kost's approach extends Brown's method to cases with unknown covariance matrices, providing analytical approximations for the scale and degrees of freedom via polynomial regressions on the correlation coefficients. It derives the covariance of the log-transformed p-values through numerical integration and matches moments to a scaled chi-squared, offering improved accuracy for moderate to high positive correlations among test statistics. For highly correlated datasets, such as those from genomic studies, Kost's approximations perform well.[20] Weighted versions of Fisher's method incorporate correlations by generalizing the combining function with weights derived from the dependence structure. 
For instance, the weighted inverse chi-square method modifies the statistic to ∑ w_i (-2 log(p_i)), where weights w_i are chosen based on the estimated correlations to adjust for non-independence, approximating the null as a weighted sum of chi-squared variables. This allows flexibility in emphasizing tests with varying reliability or correlation levels, often using data-derived weights from the covariance matrix. Generalized forms, such as the weighted sum of transformed p-values, further adapt the approach for arbitrary dependence while preserving asymptotic properties. These extensions are commonly applied in scenarios with overlapping samples, such as multi-omics analyses integrating genomics and transcriptomics data where features share biological pathways, or multi-site clinical trials with correlated outcomes across locations. In multi-omics, for example, they combine p-values from gene set enrichment across layers like proteomics and metabolomics, where dependencies arise from shared experimental conditions. Computational demands rise with the number of tests k, as covariance estimation requires O(k^2) operations, making them feasible for moderate k (up to hundreds) but challenging for very large-scale data without approximations. In terms of performance, these methods maintain nominal type I error rates under dependence, unlike the standard Fisher's approach which can exceed 5% error for correlations ρ > 0.2. However, they often exhibit reduced power compared to the independent case, with losses proportional to the average correlation; for instance, Empirical Brown's method (an adaptation for estimated covariances), introduced in 2016, maintains type I error control and outperforms unadjusted Fisher in correlated settings.[20][21] Permutation-based resampling can complement analytical methods by empirically deriving the null distribution under observed dependence, though it increases computational cost for large k.Alternative Combination Methods
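The moment-matching idea behind the Brown/Kost corrections can be sketched as follows (illustrative only; the function names are hypothetical, and the covariance polynomial is the approximation reported by Kost and McDermott for non-negative correlations):

```python
# Sketch of a Brown/Kost-style correction: match the first two moments of
# the Fisher statistic to a scaled chi-squared distribution c * chi2_f.
import numpy as np
from scipy.stats import chi2

def _kost_cov(r):
    # Approximate cov(-2 ln p_i, -2 ln p_j) from the correlation r of the
    # underlying statistics (polynomial approximation, assumed r >= 0).
    return 3.263 * r + 0.710 * r**2 + 0.027 * r**3

def brown_combine(pvals, corr):
    """Fisher-type combination referred to a scaled chi-squared null."""
    pvals = np.asarray(pvals, dtype=float)
    k = len(pvals)
    stat = -2.0 * np.log(pvals).sum()

    mean = 2.0 * k             # E[stat] under H0
    var = 4.0 * k              # independent part: Var(-2 ln p_i) = 4 each
    for i in range(k):
        for j in range(i + 1, k):
            var += 2.0 * _kost_cov(corr[i][j])

    c = var / (2.0 * mean)     # scale factor
    f = 2.0 * mean**2 / var    # adjusted degrees of freedom
    return chi2.sf(stat / c, f)
```

With an identity correlation matrix the correction vanishes (c = 1, f = 2k) and the standard Fisher p-value is recovered; positive correlations enlarge the combined p-value, counteracting the anti-conservatism described above.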
While Fisher's method remains a cornerstone for combining independent p-values under the assumption of homogeneity, several alternative approaches have been developed to address scenarios involving heterogeneity, dependence, or different sensitivity to p-value distributions. These methods aggregate p-values in distinct ways, offering robustness or power advantages depending on the data characteristics.[22] The harmonic mean p-value (HMP) provides a robust alternative, particularly for combining dependent tests where correlations are unknown. It computes the harmonic mean of the p-values (weighting each inversely, as 1/p), emphasizing smaller p-values while downweighting larger ones, and has been shown to control the family-wise error rate more powerfully than conservative corrections like Bonferroni in dependent settings. This method, introduced by Wilson in 2019, is especially useful in genomic studies with correlated signals.[23] Edgington's method offers a straightforward additive approach, summing the individual p-values and comparing the total to its null distribution, the Irwin-Hall distribution of a sum of k independent uniform variables (supported on 0 to k). Proposed by Edgington in 1972, it treats all p-values more equally than Fisher's logarithmic transformation, making it less sensitive to outliers and suitable for balanced contributions across studies.[24] Tippett's method, dating back to 1931, focuses on the most extreme evidence: it refers the minimum p-value p_min to its null distribution, giving the combined p-value 1 - (1 - p_min)^k (equivalently, comparing p_min with the adjusted threshold 1 - (1 - \alpha)^{1/k}); this is particularly effective when signals are sparse or extreme but less powerful for diffuse effects.[22] Selection among these methods depends on the underlying assumptions: Fisher's method excels with homogeneous, independent tests, whereas alternatives like Edgington's or Tippett's are preferable for heterogeneous effects, and the HMP for potential dependence.
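These alternatives can be sketched in a few lines (illustrative implementations; the unweighted harmonic mean shown here is only an approximately valid p-value when it is small, and the function names are hypothetical):

```python
# Illustrative implementations of three p-value combination alternatives.
import math

def tippett(pvals):
    """Tippett's method: combined p from the minimum p-value."""
    k = len(pvals)
    return 1.0 - (1.0 - min(pvals)) ** k

def edgington(pvals):
    """Edgington's method: combined p from the sum of p-values (Irwin-Hall CDF)."""
    k, s = len(pvals), sum(pvals)
    return sum((-1) ** j * math.comb(k, j) * (s - j) ** k
               for j in range(int(s) + 1)) / math.factorial(k)

def harmonic_mean_p(pvals):
    """Unweighted harmonic mean p-value (approximately valid only when small)."""
    return len(pvals) / sum(1.0 / p for p in pvals)
```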
Recent trends in the 2020s emphasize hybrid methods for large-scale data, such as the Cauchy combination test, which transforms p-values via the Cauchy distribution and sums them for analytic p-value computation under arbitrary dependence, enhancing power in big data applications like high-throughput screening.[22]
Interpretation
Assessing Combined Significance
The combined p-value from Fisher's method is typically evaluated against a significance threshold of α = 0.05 to determine whether there is sufficient evidence to reject the overall null hypothesis.[3] When performing multiple such combined tests, adjustments for multiple testing, such as the Bonferroni correction (dividing α by the number of tests), are recommended to control the family-wise error rate and reduce the risk of false positives. This conservative approach ensures robustness in applications like meta-analyses involving numerous hypotheses. Power analysis for Fisher's method reveals that its statistical power to detect true effects depends on the individual effect sizes and sample sizes of the constituent tests; it generally offers higher power than single tests, particularly for detecting small effects across multiple studies by aggregating subtle evidence.[3] Simulations indicate that the method excels when effect sizes are equal across tests but may underperform relative to weighted alternatives if effect sizes vary substantially.[1] Due to the -2 log transformation, Fisher's method is highly sensitive to small individual p-values, which can dominate the combined statistic and drive significance even if only a few tests show strong evidence, while large p-values contribute minimally.[25] This asymmetry necessitates careful interpretation, accounting for the quality, heterogeneity, and potential outliers among the input studies to avoid overemphasizing isolated strong results. In reporting results, it is standard to include the number of combined tests k, a summary of the individual p-values (such as their range, median, or distribution), the combined χ² statistic, its degrees of freedom (2k), and the final p-value to provide transparency and allow reproducibility. 
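A small helper can gather the reporting items listed above in one place (a sketch; the function and field names are illustrative, not a standard interface):

```python
# Collect the quantities recommended for reporting a Fisher combination.
import math
from scipy.stats import chi2

def fisher_report(pvals):
    """Summarize a Fisher combination: k, p-value range, statistic, df, combined p."""
    stat = -2 * sum(math.log(p) for p in pvals)
    df = 2 * len(pvals)
    return {
        "k": len(pvals),
        "p_range": (min(pvals), max(pvals)),
        "chi2_statistic": stat,
        "df": df,
        "combined_p": float(chi2.sf(stat, df)),
    }
```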
A key risk of misinterpretation is assuming that combined significance implies a large effect size or causal relationship; in reality, it only assesses the joint probability under the null and requires separate evaluation of effect magnitudes and study designs for broader inferences.[26]
Practical Implementation
Fisher's method is commonly implemented in statistical software packages that facilitate meta-analysis and p-value combination. In R, the 'metap' package provides a straightforward function for Fisher's method, allowing users to input p-values and obtain the combined test statistic and significance level. Similarly, Python's SciPy library includes the scipy.stats.combine_pvalues function, which supports Fisher's method via chi-squared distribution calculations for efficient computation. For genomics applications, the Bioconductor suite in R offers specialized tools that integrate p-value combination within workflows built on packages such as 'limma' or 'topGO', enabling scalable analysis of high-throughput data.
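A minimal example with SciPy's built-in implementation (the p-values here are hypothetical) agrees with the direct formula:

```python
# Combine four hypothetical p-values with Fisher's method via SciPy.
import math
from scipy.stats import chi2, combine_pvalues

pvals = [0.08, 0.12, 0.03, 0.20]
stat, p_combined = combine_pvalues(pvals, method='fisher')

# Cross-check against the direct formula: -2 * sum(ln p_i), chi-squared with 2k df.
manual_stat = -2 * sum(math.log(p) for p in pvals)
manual_p = chi2.sf(manual_stat, df=2 * len(pvals))
```

Note that none of the individual p-values reaches 0.05, yet the combined p-value does, illustrating the pooling of evidence described in the lead.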
Practical application begins with data preparation, where individual p-values from independent tests must be extracted and verified to be uniformly distributed under the null hypothesis. The computation step involves applying the formula to obtain the combined -2 log(p) statistic and referencing it against a chi-squared distribution with 2k degrees of freedom, where k is the number of tests. Visualization aids interpretation, such as forest plots in meta-analysis software to display individual and combined effect sizes alongside p-values.
Recent advancements include applications of Fisher's method in federated learning for privacy-preserving analysis, such as in epidemic surveillance where p-values are combined across distributed datasets without sharing raw data.[27]
Best practices emphasize documenting the independence assumption and conducting diagnostics, such as permutation tests, to check for potential dependencies among p-values. Researchers should report the full set of input p-values, the combined statistic, degrees of freedom, and p-value to ensure reproducibility.
Open-source implementations have been widely accessible since the early 2010s, with ongoing updates in packages like 'metap' (version 1.5 released in 2021; current version 1.12 as of 2025) to handle large-scale data through parallel processing and optimized algorithms.