
Multiple comparisons problem

The multiple comparisons problem, also known as the multiplicity problem, arises in statistical testing when multiple hypotheses are tested simultaneously on the same dataset, leading to an inflated probability of committing a Type I error (incorrectly rejecting a true null hypothesis) compared to testing a single hypothesis. This inflation occurs because the overall family-wise error rate (FWER), the probability of at least one false rejection across all tests, can exceed the nominal significance level (e.g., α = 0.05) even if each individual test is controlled at that level; for instance, with 10 independent tests at α = 0.05, the FWER can approach 40% if all nulls are true. In practice, the problem is prevalent in fields like genomics, neuroimaging, and clinical trials, where high-dimensional data often require testing thousands of hypotheses, potentially yielding hundreds of false positives without adjustment; for example, at α = 0.05, testing 10,000 true null hypotheses could result in about 500 spurious rejections assuming independence.

To address this, statisticians employ correction methods that control either the FWER, which strictly limits the chance of any false positives (e.g., via the conservative Bonferroni procedure that divides α by the number of tests, or the stepwise Holm method), or the false discovery rate (FDR), which tolerates some false positives while controlling their expected proportion among rejections (e.g., the Benjamini-Hochberg procedure). These approaches balance the trade-off between reducing false positives and maintaining statistical power to detect true effects, though overly conservative corrections like Bonferroni can lead to Type II errors by failing to identify genuine differences.

Philosophically, handling multiple comparisons involves debates over whether to focus on individual test error rates (suitable for pre-planned, few comparisons) or family-wise control (for exploratory analyses with many tests), with some researchers arguing that the problem is overstated in applied settings where null hypotheses are rarely exactly true and where multilevel modeling or Bayesian approaches can naturally incorporate multiplicity through shrinkage and hierarchical structures, often obviating the need for ad hoc corrections. Common procedures also include the Newman-Keuls test for ordered means and the least significant difference (LSD) method, though these have limitations in controlling error rates for more than a few comparisons. Overall, appropriate adjustment is crucial to ensure reliable inferences, particularly in an era of big data where unadjusted p-values can mislead scientific conclusions.

Historical Background

Early Recognition

The multiple comparisons problem gained initial recognition in the 1950s as researchers grappled with the inflated risk of false positives when conducting numerous statistical tests on the same dataset, particularly in the context of analysis of variance (ANOVA). John W. Tukey played a pivotal role in formalizing this issue through his 1953 memorandum "The Problem of Multiple Comparisons," which offered the first systematic exploration of simultaneous inference procedures and emphasized the need for methods to construct confidence intervals that hold jointly across multiple comparisons. Concurrently, Henry Scheffé introduced a versatile method for evaluating all possible linear contrasts within ANOVA frameworks, enabling researchers to assess differences among means while controlling overall error rates, as detailed in his 1953 paper. These foundational works shifted attention from isolated hypothesis testing to the challenges of multiplicity in experimental design.

Early applications of multiple comparisons procedures were especially prevalent in agricultural experiments, where ANOVA had become a staple for evaluating treatment effects in randomized field trials since the 1920s, but post-hoc analyses in the 1950s highlighted the need for safeguards against erroneous conclusions from repeated pairwise tests. For instance, in studies involving multiple treatment or variety comparisons, Tukey's studentized range test emerged as a practical tool for identifying significant differences among group means, building on the studentized range statistic to account for the number of comparisons. These methods addressed the practical demands of agricultural research, where overlooking multiplicity could lead to misguided recommendations on farming practices.

A straightforward yet conservative strategy to mitigate the multiple comparisons problem was the application of the Bonferroni inequality, a probabilistic bound stating that the probability of at least one false rejection across m tests satisfies P\left( \bigcup_{i=1}^m A_i \right) \leq \sum_{i=1}^m P(A_i), where A_i is the event of rejecting the i-th null hypothesis. This implies dividing the desired overall level \alpha by m, adjusting each test's significance level to \alpha / m to ensure the overall error probability remains at or below \alpha. While effective as a simple upper bound, this approach often proved overly stringent, reducing statistical power in scenarios with many tests.

Despite their innovations, early methods like Tukey's studentized range test exhibited limitations, including over-conservatism when sample sizes varied across groups, which inflated the effective confidence levels and resulted in fewer detected differences than warranted. Such conservatism stemmed from the test's reliance on the maximum range among means, making it less efficient for unbalanced designs common in agricultural settings, though it remained a standard tool for all-pairwise comparisons.

Key Milestones and Conferences

The introduction of the false discovery rate (FDR) by Yoav Benjamini and Yosef Hochberg in their 1995 paper represented a pivotal advancement in multiple comparisons, offering a less conservative alternative to family-wise error rate (FWER) control that enhanced power in scenarios involving numerous hypotheses. This innovation addressed limitations of earlier methods like Bonferroni corrections, facilitating broader applications in fields generating high-dimensional data.

The first International Conference on Multiple Comparison Procedures (MCP) convened in Tel Aviv, Israel, from June 23 to 26, 1996, at Tel Aviv University, organized by Yoav Benjamini along with committee members including Juliet Popper Shaffer. This event marked the inaugural dedicated gathering for researchers focused on multiple comparisons procedures, spurred by the recent surge in methodological developments. Subsequent MCP conferences have occurred roughly every two to four years, with the series reaching its 12th iteration in Bremen, Germany, from August 30 to September 2, 2022, and its 13th from August 12 to 15, 2025. These gatherings have served as essential platforms for disseminating cutting-edge research, fostering collaborations, and promoting consistent terminology across the discipline through invited talks, proceedings, and discussions on unified frameworks for error control.

In the 2000s, Bradley Efron's contributions further propelled the field, particularly through empirical Bayes methods tailored for large-scale hypothesis testing, as outlined in his 2007 work integrating effect-size estimation with FDR assessments to balance power and error rates. This approach complemented FDR innovations by providing adaptive tools for high-throughput data, influencing subsequent conference agendas on scalable inference.

Fundamental Concepts

Problem Definition

In statistical hypothesis testing, researchers often formulate a null hypothesis H_0 (typically asserting no effect or no difference) and an alternative hypothesis H_a (asserting an effect or difference), then compute a p-value representing the probability of obtaining the observed data (or more extreme) assuming H_0 is true. If the p-value falls below a pre-specified significance level \alpha (commonly 0.05), the null hypothesis is rejected in favor of H_a, indicating statistical significance. This framework controls the Type I error rate, the probability of falsely rejecting a true H_0, at \alpha for a single test.

The multiple comparisons problem arises when conducting m such hypothesis tests simultaneously on the same dataset, which inflates the overall Type I error rate beyond the nominal \alpha. Without adjustment, the probability of at least one false rejection across all tests increases dramatically; for independent tests where all nulls are true, this probability is 1 - (1 - \alpha)^m. For instance, with \alpha = 0.05 and m = 20, the chance of at least one spurious significant result exceeds 64%, even if no true effects exist. This inflation occurs because the individual error probabilities compound across tests, leading to unreliable inferences and potential spurious discoveries that undermine scientific validity.

Motivational examples illustrate this risk vividly. In clinical research, a study might compare the efficacy of multiple drug doses (e.g., low, medium, high) against a control across several endpoints, yielding numerous p-values; unadjusted analyses could flag a false positive for one dose, misleading treatment decisions. Similarly, in educational studies evaluating different teaching methods (e.g., online vs. in-person vs. hybrid) on various outcomes like test scores and retention, ignoring multiplicity might produce illusory significant improvements, prompting misguided policy changes. These scenarios highlight how routine practices in fields like medicine and the social sciences amplify the problem, necessitating safeguards beyond per-test controls.

While controlling the per-comparison error rate (PCER), the expected proportion of false positives among all m tests, bounded by \alpha, maintains the nominal level for each individual test, it fails to address the cumulative risk of errors across the family of tests. PCER control is akin to testing each hypothesis in isolation, where the overall false positive rate is \alpha m_0 / m (with m_0 true nulls), but this permits a high likelihood of at least one error when m is large, as in comparing 20 true nulls at \alpha = 0.05, where the probability of at least one false rejection nears 64%. In contrast, global error control is essential for multiple testing to preserve the integrity of inferences, ensuring the experiment-wide Type I error does not exceed acceptable levels despite the increased testing volume.
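The inflation formula above is easy to verify directly. The following minimal Python sketch (illustrative only, assuming independent tests with all null hypotheses true) reproduces the roughly 64% figure for m = 20:

```python
# Family-wise error rate for m independent tests of true nulls, each at level alpha.
def fwer_unadjusted(alpha: float, m: int) -> float:
    """Probability of at least one false rejection among m independent true nulls."""
    return 1.0 - (1.0 - alpha) ** m

alpha = 0.05
for m in (1, 5, 10, 20, 100):
    print(f"m = {m:4d}: unadjusted FWER = {fwer_unadjusted(alpha, m):.3f}")
# m = 20 prints about 0.642, matching the ~64% quoted above.
```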

Classification of Tests

Multiple comparison tests can be classified logically, by the structure of the hypotheses being tested, or chronologically, by the order in which tests are conducted. Logical classification groups tests based on their relational structure, such as all-pairs comparisons among group means in an analysis of variance (ANOVA), where every pair of means is evaluated simultaneously to identify differences, or many-one comparisons that focus on contrasts between multiple treatments and a single control. Chronological classification, in contrast, organizes tests sequentially, often through stepwise procedures that adjust significance levels based on prior outcomes, allowing for adaptive decision-making as tests progress.

Key types of multiple comparison procedures include closed testing procedures, which consider all possible intersections of hypotheses to ensure strong control of error rates, and step-up or step-down methods that iteratively reject or retain hypotheses. Closed testing procedures maintain logical coherence by requiring that an individual hypothesis is rejected only if every intersection hypothesis containing it is also rejected, which prevents contradictory decisions across the hypothesis family. Step-down methods, such as Holm's procedure, begin with the smallest p-value and progressively relax the significance threshold for the remaining hypotheses, while step-up methods, like Hochberg's, start from the largest p-value and tighten thresholds as they move downward; both enhance power over single-step approaches under certain conditions. An example of a logically structured test is Dunnett's procedure, which specifically compares multiple treatment means to a control mean while controlling the family-wise error rate, making it suitable for experimental designs where the control serves as a reference. Intersection-union tests form another category, in which the null hypothesis is the union of individual nulls and rejection requires evidence against every individual null hypothesis, often used in contexts like bioequivalence testing. Marcus et al. (1976) established that such closed testing families, when coherent, provide strong control of the family-wise error rate without sacrificing power in ordered settings like ANOVA.

The dependency structure among tests, whether independent or positively dependent, significantly influences error inflation in multiple comparisons. For independent tests, the family-wise error rate (FWER) under the complete null hypothesis is 1 - (1 - \alpha)^m \approx m\alpha for m tests and small \alpha, leading to substantial inflation as m grows. In contrast, positive dependence, where test statistics are positively correlated (e.g., due to shared covariates), tends to reduce the actual FWER compared to the independent case because larger intersection probabilities decrease the union probability of false rejections, though it can complicate power calculations for alternatives.
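The effect of dependence can be illustrated with a small simulation. The sketch below (a hypothetical example, not drawn from the sources) generates equicorrelated z-statistics under the complete null and estimates the realized FWER, showing that stronger positive correlation lowers it relative to the independent case:

```python
# Empirical FWER for m equicorrelated z-tests under the complete null hypothesis.
import numpy as np
from scipy.stats import norm

def empirical_fwer(m=20, rho=0.0, alpha=0.05, n_sim=20000, seed=0):
    rng = np.random.default_rng(seed)
    # Equicorrelated normals: z = sqrt(rho) * shared factor + sqrt(1 - rho) * noise
    shared = rng.standard_normal((n_sim, 1))
    noise = rng.standard_normal((n_sim, m))
    z = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise
    crit = norm.ppf(1.0 - alpha / 2.0)            # two-sided per-test critical value
    return (np.abs(z) > crit).any(axis=1).mean()  # fraction of runs with >= 1 rejection

for rho in (0.0, 0.5, 0.9):
    print(f"rho = {rho:.1f}: empirical FWER = {empirical_fwer(rho=rho):.3f}")
```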

Error Control Frameworks

Family-Wise Error Rate

The family-wise error rate (FWER) is defined as the probability of making at least one Type I error (false positive) across a family of m simultaneously conducted tests, formally expressed as FWER = Pr(V > 0), where V denotes the number of false rejections. This criterion aims to control the overall probability of any false rejection within the family at a designated level α, such that FWER ≤ α, thereby providing a conservative safeguard against erroneous conclusions in multiple testing scenarios.

FWER control can be categorized as weak or strong. Weak control limits the FWER to α only under the complete null hypothesis, where all null hypotheses are true, which is a less stringent requirement. In contrast, strong control ensures the FWER remains bounded by α under any arbitrary configuration of true and false null hypotheses, offering robust protection regardless of the underlying truth pattern; this is typically achieved through structured procedures like closed testing.

The mathematical foundation for FWER control often relies on the union bound (Boole's inequality), which states that the probability of at least one false rejection is at most the sum of the individual Type I error probabilities: FWER ≤ ∑_{i=1}^m Pr(Type I error for test i). If each test is conducted at level α/m, the bound gives FWER ≤ m × (α/m) = α, ensuring control at level α but at the cost of conservatism. For example, with m = 5 tests and desired FWER ≤ 0.05, the adjusted significance level per test becomes α/m = 0.05/5 = 0.01, reducing the chance of individual detections but guaranteeing no more than a 5% risk of any family-wide error.

In confirmatory settings, such as clinical trials, FWER control is particularly advantageous because it prioritizes avoiding any false positives, thereby maintaining high positive predictive value and aligning with regulatory expectations for reliable evidence before widespread treatment adoption. This strict error management is critical when the consequences of erroneous inferences could impact patient safety or public health.
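The arithmetic of the union bound for the m = 5 example can be written out directly; the snippet below is a trivial sketch using only the quantities defined above:

```python
# Union-bound arithmetic behind Bonferroni-style FWER control (m = 5, alpha = 0.05).
alpha, m = 0.05, 5
per_test_level = alpha / m                            # 0.01 per test
fwer_union_bound = m * per_test_level                 # <= 0.05 under any dependence
fwer_if_independent = 1 - (1 - per_test_level) ** m   # about 0.049 under independence
print(per_test_level, fwer_union_bound, round(fwer_if_independent, 4))
```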

False Discovery Rate

The false discovery rate (FDR) is defined as the expected proportion of false discoveries among the rejected null hypotheses, formally expressed as FDR = E[V/R | R > 0] P(R > 0), where V denotes the number of false discoveries (incorrectly rejected true null hypotheses) and R the total number of rejections. This measure controls the expected false positive proportion among discoveries, providing a balance between detecting true effects and limiting erroneous claims in large-scale testing.

The Benjamini-Hochberg procedure, introduced in 1995, establishes a framework for controlling the FDR at a specified level, making it particularly suitable for exploratory analyses where many true alternative hypotheses are anticipated among a large number of tests. Unlike more conservative approaches, this method allows for a controlled proportion of errors while maximizing the detection of signals.

Storey (2002) distinguished the positive false discovery rate (pFDR), defined as pFDR = E[V/R | R > 0], from the standard FDR by conditioning solely on the event of at least one rejection, which aligns more closely with Bayesian interpretations of error rates in discovery settings. To enhance power, Storey's approach estimates the proportion of true null hypotheses, π₀, using spline-based methods that model the distribution of p-values under the null, enabling adaptive adjustments to the FDR control.

In scenarios with signal sparsity, where only a small fraction of null hypotheses are false, the FDR offers substantial advantages over family-wise error rate (FWER) controls, which are stricter in guaranteeing no false positives. For instance, with m = 1000 tests and an FDR level of 0.05, the procedure can reject several times more hypotheses than an FWER method at the same significance level, increasing true discoveries while maintaining the targeted error proportion.
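Storey's estimator of π₀ at a fixed tuning parameter λ is π̂₀(λ) = #{p_i > λ} / (m(1 − λ)). The sketch below uses synthetic data chosen only to illustrate the formula (the spline smoothing over λ used in the full method is omitted), applying it to a simulated mixture of null and alternative p-values:

```python
# Estimate the proportion of true nulls pi0 from simulated p-values (lambda fixed at 0.5).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m, m1 = 1000, 100                                   # 1000 tests, 100 true effects
z = np.concatenate([rng.standard_normal(m - m1),    # null z-scores
                    rng.normal(loc=3.0, size=m1)])  # shifted alternatives
p = 2 * norm.sf(np.abs(z))                          # two-sided p-values

lam = 0.5
pi0_hat = np.mean(p > lam) / (1 - lam)              # Storey's estimator at lambda = 0.5
print(f"estimated pi0 = {min(pi0_hat, 1.0):.3f} (true value 0.9)")
```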

Controlling Procedures

FWER-Based Methods

Family-wise error rate (FWER)-based methods aim to control the probability of making at least one Type I error across a family of m tests at a designated level α. These procedures provide strong control of the FWER, ensuring that the overall error rate does not exceed α regardless of the true configuration of null and alternative hypotheses.

Single-Step Procedures

Single-step methods apply a uniform adjustment to all p-values or significance thresholds before conducting any tests, making them straightforward but often conservative. The Bonferroni correction, based on an inequality published by Carlo Emilio Bonferroni in 1936 and applied to multiple comparisons by Olive Jean Dunn in 1961, divides the overall significance level α by the number of tests m. A test i is rejected if its p-value p_i satisfies p_i ≤ α / m. This procedure controls the FWER at level α under arbitrary dependence structures among the tests, as it relies solely on the union bound given by Boole's inequality.

The Šidák correction, proposed by Zbyněk Šidák in 1967, offers a slightly less conservative alternative under the assumption of independence among the tests. It adjusts the per-test significance level to α_i = 1 - (1 - α)^{1/m}, so a test is rejected if p_i ≤ 1 - (1 - α)^{1/m}. This formula derives from the exact probability calculation for the intersection of independent events under the null, providing exact FWER control at α when the tests are independent and approximate control otherwise. For small α, the Šidák threshold is close to the Bonferroni level α / m.
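A short sketch comparing the two single-step thresholds (the Šidák value assumes independent tests, as noted above):

```python
# Per-test significance thresholds for overall level alpha under m tests.
def bonferroni_threshold(alpha: float, m: int) -> float:
    return alpha / m

def sidak_threshold(alpha: float, m: int) -> float:
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

alpha = 0.05
for m in (5, 20, 100):
    print(m, bonferroni_threshold(alpha, m), round(sidak_threshold(alpha, m), 6))
# The Sidak threshold is slightly larger (less strict) but remains very close to alpha/m.
```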

Stepwise Procedures

Stepwise methods sequentially adjust thresholds based on ordered p-values, improving power over single-step approaches while maintaining FWER control. The Holm-Bonferroni step-down procedure, developed by Sture Holm in 1979, orders the p-values in ascending order as p_{(1)} ≤ p_{(2)} ≤ ⋯ ≤ p_{(m)}. It begins by testing whether p_{(1)} ≤ α / m; if rejected, it proceeds to p_{(2)} ≤ α / (m-1), continuing until p_{(k)} > α / (m-k+1) for some k, at which point all remaining hypotheses are retained. This sequentially rejective approach controls the FWER at α for any dependence structure and is uniformly more powerful than the Bonferroni method.

The Hochberg step-up procedure, introduced by Yosef Hochberg in 1988, works in the reverse order: it starts with the largest p-value and tests p_{(m)} ≤ α, then p_{(m-1)} ≤ α / 2, and so on, comparing p_{(k)} to α / (m-k+1); once some p_{(k)} meets its threshold, that hypothesis and all those with smaller p-values are rejected. It provides strong FWER control at α when the test statistics exhibit positive dependence, such as positive regression dependence, which is common in applications like genomic studies. The procedure rejects every hypothesis rejected by Holm's method and possibly more, making it at least as powerful, though its FWER guarantee requires independence or suitable positive dependence rather than holding under arbitrary dependence.
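Minimal reference implementations of the two stepwise rules are sketched below (for illustration only; established libraries such as R's p.adjust or statsmodels provide vetted versions):

```python
# Holm step-down and Hochberg step-up decisions for a list of p-values.
import numpy as np

def holm(pvals, alpha=0.05):
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                      # ascending p-values
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):         # rank 0 is tested against alpha/m
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break                              # stop at the first non-rejection
    return reject

def hochberg(pvals, alpha=0.05):
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)[::-1]                # descending: largest p-value first
    reject = np.zeros(m, dtype=bool)
    for step, idx in enumerate(order):         # step 0 tests p_(m) against alpha/1
        if p[idx] <= alpha / (step + 1):
            reject[p <= p[idx]] = True         # reject it and all smaller p-values
            break
    return reject

pvals = [0.040, 0.045, 0.049]                  # all below alpha, none below alpha/3
print(holm(pvals))                             # [False False False]
print(hochberg(pvals))                         # [ True  True  True]
```

The example p-values illustrate a case where the step-up rule rejects hypotheses that the step-down rule retains.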

Other Specialized Methods

For specific experimental designs, tailored FWER-controlling procedures enhance applicability. Tukey's honestly significant difference (HSD) test, originally proposed by John W. Tukey in 1949, is designed for all pairwise comparisons among k means following a one-way ANOVA, assuming equal variances and sample sizes. It rejects the null hypothesis for a pair if the absolute difference in sample means exceeds q_{α,k,ν} \cdot s / \sqrt{n}, where q_{α,k,ν} is the critical value of the studentized range distribution with ν error degrees of freedom, s is the pooled standard deviation (the square root of the ANOVA mean square error), and n is the sample size per group. This method controls the FWER exactly under normality and equal variances.

Dunnett's test, developed by Charles W. Dunnett in 1955, focuses on comparing k-1 treatment means to a single control mean, often in one-sided settings. For the one-sided case, it rejects when a treatment mean exceeds the control mean by more than d_{α} \cdot s \sqrt{2/n}, where d_{α} is a critical value from Dunnett's multivariate t distribution tailored to the number of comparisons and the error degrees of freedom. This procedure controls the FWER at α under normality, providing higher power than Bonferroni for control-focused designs.
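For routine use, these tests are available in standard libraries. The example below uses statsmodels' pairwise_tukeyhsd on made-up data (group labels and values are illustrative, not from the source); Dunnett-type comparisons against a control are similarly available in recent versions of common statistics packages:

```python
# Tukey HSD for all pairwise comparisons among three synthetic groups.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(42)
values = np.concatenate([rng.normal(10, 2, 30),   # group A
                         rng.normal(12, 2, 30),   # group B (shifted mean)
                         rng.normal(10, 2, 30)])  # group C
groups = ["A"] * 30 + ["B"] * 30 + ["C"] * 30

# All pairwise comparisons with the family-wise error rate held at alpha = 0.05
result = pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05)
print(result.summary())
```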

Implementations and Considerations

These FWER-based methods are widely implemented in statistical software. In R, the p.adjust function in the base stats package supports Bonferroni, Holm, and Hochberg adjustments via the method argument, applying them to a vector of p-values and returning adjusted values for FWER control. Similarly, SAS's PROC GLM and PROC ANOVA procedures include options for Tukey's HSD, Dunnett's test, and Bonferroni adjustments within post-hoc analyses, outputting adjusted p-values or confidence intervals.

A key advantage of FWER methods is their strong guarantee against any false positives, making them suitable for confirmatory analyses where Type I errors must be minimized. However, they suffer from power loss as m increases, becoming overly conservative (the per-test α drops to impractically low levels for m > 100) and potentially missing true effects in large-scale testing. Stepwise variants like Holm mitigate this somewhat by recycling α, but overall these methods trade power for stringent error control.
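For Python users, statsmodels offers an analogue of R's p.adjust through multipletests; the sketch below applies the three FWER methods discussed above to an arbitrary set of p-values (in statsmodels, Hochberg's step-up is labeled 'simes-hochberg'):

```python
# Adjusted p-values and rejection decisions for several FWER-controlling methods.
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.012, 0.020, 0.045, 0.30]
for method in ("bonferroni", "holm", "simes-hochberg"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:15s}", list(reject), [round(p, 3) for p in p_adj])
```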

FDR-Based Methods

The Benjamini-Hochberg (BH) procedure is a seminal method for controlling the false discovery rate (FDR) at a specified level q^*. To apply it, the m p-values are ordered as p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}, and the largest k such that p_{(k)} \leq \frac{k}{m} q^* is found; all hypotheses with p-values up to p_{(k)} are then rejected. Under the assumption of independence among test statistics, this procedure guarantees that the FDR is controlled at level q^*; in fact it is controlled at level (m_0/m) q^*, where m_0 is the number of true nulls. The control extends to settings with positive regression dependence on the subset of true nulls (PRDS), where the p-values under the null hypotheses are positively dependent, without modification to the procedure.

Adaptive methods build on the BH procedure by estimating the proportion of true null hypotheses, \pi_0, to improve power while maintaining FDR control. Storey's q-value approach estimates \pi_0 using methods such as the histogram of p-values (exploiting the approximate uniformity of null p-values above a tuning parameter \lambda < 1) or bootstrap resampling, then computes adjusted values q_i = \min \left( p_i \frac{m \hat{\pi}_0}{k}, 1 \right), where k is the rank of p_i among the ordered p-values and \hat{\pi}_0 is the estimate (with monotonicity enforced across ranks). These q-values can then be thresholded at level q^* to control the positive FDR, a variant that conditions on at least one rejection.

The knockoff framework provides a model-agnostic (Model-X) approach for exact FDR control in settings with arbitrary dependence among features or tests. It generates knockoff copies of the original variables that mimic their joint distribution while ensuring exchangeability with the originals under the null, allowing construction of test statistics W_j (e.g., based on differences in importance scores between each variable and its knockoff) whose signs are symmetric under the null. The FDR is then estimated from the signs of the knockoff statistics via the conservative estimate \widehat{\text{FDR}} = \frac{1 + \# \{ j : W_j < 0 \}}{\max \{ 1, \# \{ j : W_j > 0 \} \}}, and the selection threshold on the W_j is chosen so that this estimate does not exceed q^*, which yields FDR control at that level.

Post-2020 developments have integrated Bayesian perspectives into FDR control, enhancing adaptability for complex dependencies in high-dimensional settings like feature selection. For instance, Bayesian extensions of knockoffs incorporate prior distributions over models to select variables while controlling FDR, achieving higher stability and power compared to frequentist knockoffs in simulations. Similarly, local FDR methods, which estimate the posterior probability that each hypothesis is null given its p-value or test statistic, have been refined using empirical Bayes mixtures for one- and two-sided tests, providing decision-theoretic thresholds that boost power under sparsity. These approaches are particularly suited for machine learning applications, where local FDR aids in interpreting importance scores from black-box models. More recent advances as of 2025 include feedback-enhanced online multiple testing procedures that control FDR in sequential settings, such as online decision-making, and closure-based methods that provide necessary and sufficient principles for error control in multiple testing.
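A compact sketch of the BH step-up rule itself follows (independence or PRDS assumed, as discussed above; production analyses would typically rely on library implementations such as statsmodels' 'fdr_bh' method):

```python
# Benjamini-Hochberg step-up rule at level q_star, returning a boolean rejection vector.
import numpy as np

def benjamini_hochberg(pvals, q_star=0.05):
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                              # ascending p-values
    thresholds = q_star * np.arange(1, m + 1) / m      # (k/m) * q_star for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])               # largest rank meeting its threshold
        reject[order[: k + 1]] = True                  # reject all hypotheses up to that rank
    return reject

pvals = [0.0001, 0.004, 0.019, 0.095, 0.20, 0.41, 0.63, 0.78, 0.90, 0.99]
print(benjamini_hochberg(pvals, q_star=0.05))          # rejects the two smallest p-values
```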

Large-Scale Applications

Genomics and High-Throughput Data

In high-throughput genomic experiments, such as DNA microarray and RNA sequencing (RNA-seq) analyses, researchers routinely conduct simultaneous statistical tests on over 20,000 genes to detect differential expression or other associations, often in the context of sparse signals where only a small fraction, typically less than 1%, of hypotheses are truly alternative. This scale amplifies the multiple comparisons problem, causing severe inflation of false positives; for instance, applying a nominal significance level of 0.05 without correction could yield up to 1,000 spurious discoveries in a typical experiment. Such sparsity, inherent to biological variation and technical noise in these assays, necessitates tailored error control to balance discovery power against erroneous claims, as uncorrected testing would overwhelm downstream validation efforts.

To address these challenges, false discovery rate (FDR)-based methods have gained prominence in genomics over stricter family-wise error rate (FWER) controls, prioritizing power in environments with few true effects. The Benjamini-Hochberg (BH) procedure, introduced in 1995, dominates applications due to its simplicity and ability to maintain FDR at a desired level (e.g., 5%) while retaining substantially more discoveries than conservative FWER alternatives like Bonferroni, which often prove underpowered for large m. Building on this, Storey (2002) developed an empirical Bayes approach that estimates the proportion of true null hypotheses (π₀, often close to 1 in genomic data) to refine FDR thresholds, enhancing sensitivity without excessive false positives; this approach has been widely adopted through tools like the qvalue package for microarray and RNA-seq analysis.

A key application arises in genome-wide association studies (GWAS), where up to 10 million or more single nucleotide polymorphisms (SNPs) are tested for trait associations, demanding rigorous control amid linkage disequilibrium and population structure. Here, the Bonferroni approach persists for its conservatism, particularly when detecting rare variants, yielding a standard genome-wide threshold of α ≈ 5 × 10^{-8} based on approximately 1 million independent tests across the genome. This threshold ensures low false positive rates but can miss subtle effects, prompting hybrid uses of FDR in exploratory phases.

Criticisms of multiple testing in genomics highlight persistent risks of p-hacking, where iterative adjustments to analysis pipelines, such as gene filtering or covariate selection, can selectively emphasize significant results, undermining reproducibility amid the field's high-dimensional data. To mitigate this, pre-registration of study protocols and analysis plans has been promoted in biomedical research to foster transparency and curb questionable research practices, with the NIH emphasizing such practices in clinical trials and grants.
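The contrast between uncorrected, Bonferroni, and BH analyses at this scale can be seen in a toy simulation. The numbers below are synthetic, chosen only to echo the scale described above (10,000 genes, about 1% true effects):

```python
# Synthetic differential-expression screen: count discoveries and false positives.
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(7)
m, m1 = 10_000, 100
z = np.concatenate([rng.standard_normal(m - m1),    # true nulls
                    rng.normal(loc=4.0, size=m1)])  # truly differential genes
p = 2 * norm.sf(np.abs(z))
is_null = np.arange(m) < (m - m1)

for label, method in [("uncorrected", None),
                      ("Bonferroni", "bonferroni"),
                      ("BH, FDR 5%", "fdr_bh")]:
    reject = (p < 0.05) if method is None else multipletests(p, alpha=0.05, method=method)[0]
    print(f"{label:12s}: {reject.sum():5d} calls, {(reject & is_null).sum():4d} false positives")
```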

Emerging Fields and Challenges

In machine learning, the multiple comparisons problem arises prominently in hyperparameter tuning and feature selection, where numerous candidate configurations or variables are evaluated, necessitating controls like the false discovery rate (FDR) to avoid spurious findings. For instance, along lasso regression paths, sequential selection procedures enable FDR control by stopping at appropriate knots along the regularization path, addressing the issue of false discoveries that occur early in the process. Recent advancements include knockoff methods integrated with deep neural networks, which generate knockoff features to identify nonlinear causal relations while controlling FDR in high-dimensional settings, as demonstrated in biological applications.

Beyond traditional domains, the multiple comparisons problem extends to the social sciences, particularly in large-scale A/B testing, where online platforms evaluate numerous variants simultaneously, increasing the risk of false positives without proper adjustments like Bonferroni corrections. In particle physics, discovery claims, such as those at the Large Hadron Collider, rely on stringent thresholds like the five-sigma standard to account for multiple comparisons across vast search spaces, mitigating the probability of erroneous detections. These applications highlight challenges posed by dependencies in large-scale data, where correlated observations complicate error rate controls and require tailored procedures to maintain validity.

Criticisms of multiple testing practices center on over-reliance on p-values, which has fueled the reproducibility crisis since the early 2010s by enabling p-hacking and inflating false positives across fields. Gaps persist in standardized software for handling complex dependencies; for example, R's multtest package offers robust FDR and family-wise error rate (FWER) controls, whereas Python's statsmodels provides multipletests functions but lacks equivalent depth for advanced graphical models, hindering interdisciplinary adoption. Future directions emphasize integrating multiple testing with causal inference, as seen in 2024 frameworks that combine knockoffs with causal effect estimation for valid FDR control of heterogeneous effects, and methods that handle unequally powered tests through parametric procedures accommodating unequal variances. As of 2025, ongoing developments include machine learning-enhanced knockoff methods for variable selection in high-dimensional data.
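Two of the numerical points above are easy to reproduce; the snippet below (values computed here, not taken from the cited studies) shows the one-sided p-value implied by the five-sigma criterion and the unadjusted versus Bonferroni-bounded false-positive risk when many A/B variants are truly equivalent:

```python
# Five-sigma p-value and multiple-variant A/B testing arithmetic.
from scipy.stats import norm

print(f"one-sided p-value at 5 sigma: {norm.sf(5.0):.2e}")   # ~2.9e-07

alpha, k = 0.05, 20                      # 20 equivalent variants tested at alpha = 0.05
print(f"unadjusted chance of >= 1 false positive: {1 - (1 - alpha) ** k:.2f}")
print(f"Bonferroni-adjusted bound:                {k * (alpha / k):.2f}")
```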

References

  1. [1]
    A general introduction to adjustment for multiple comparisons - PMC
    In the present paper, we provide a brief review on mathematical framework, general concepts and common methods of adjustment for multiple comparisons.
  2. [2]
    Multiple comparisons: philosophies and illustrations
    In this review, I demonstrate the statistical issue embedded in multiple comparisons, and I summarize the philosophies of handling this issue.
  3. [3]
    [PDF] Why We (Usually) Don't Have to Worry About Multiple Comparisons
    Jul 13, 2009 · Applied researchers often find themselves making statistical inferences in settings that would seem to require multiple comparisons adjustments.
  4. [4]
  5. [5]
    Multiple Comparison Procedures—Cutting the Gordian Knot
    Mar 1, 2015 · Multiple comparison procedures (MCPs), or mean separation tests, have been the subject of great controversy since the 1950s.
  6. [6]
    7.4.7.1. Tukey's method - Information Technology Laboratory
    In other words, the Tukey method is conservative when there are unequal sample sizes. ... The Tukey method uses the studentized range distribution. Suppose we ...
  7. [7]
    Controlling the False Discovery Rate: a Practical and Powerful - jstor
    (g) It is known that Hochberg's method offers a more powerful alternative to the traditional Bonferroni method. Nevertheless, it is important to note that the ...
  8. [8]
    Bioinformatik - MCP Conference
    1st INTERNATIONAL CONFERENCE ON MULTIPLE COMPARISON. 1. INTERNATIONALE KONFERENZ FÜR MULTIPLE METHODEN. TEL AVIV UNIVERSITY, ISRAEL 23-26 JUNE, 1996.
  9. [9]
    Yoav Benjamini - Vita - School of Mathematical Sciences
    Feb 19, 2018 · 1995-2015 Member of the organizing committees of the International Conferences on Multiple Comparisons, Tel Aviv (1996), Berlin (2000) ...
  10. [10]
    [PDF] Juliet Popper Shaffer - Berkeley Statistics
    Mar 13, 2019 · Member of organizing committee of first International Conference on Multiple Comparisons, Tel Aviv,. Israel, June 1996. Appointed to ...
  11. [11]
    MCP Conference 2022 – 12th International Conference on Multiple ...
    The 12th International Conference on Multiple Comparison Procedures will take place from 30. August 2022 to 02. September 2022 in Bremen, Germany.
  12. [12]
    Recent Developments in Multiple Comparison Procedures
    This volume is a collection of 11 papers, covering a broad range of topics in Multiple Comparisons, which were invited for the conference. The goal of this ...
  13. [13]
    What is the proper way to apply the multiple comparison test? - PMC
    Based on F-distribution, it is a method for performing simultaneous, joint pairwise comparisons for all possible pairwise combinations of each group mean [6].
  14. [14]
    On closed testing procedures with special reference to ordered ...
    This paper presents a method for stepwise multiple testing procedures with fixed experimentwise error, requiring hypotheses to be closed under intersection. It ...
  15. [15]
    A Multiple Comparison Procedure for Comparing Several ...
    A multiple comparison procedure for comparing several treatments with a control. Charles W. Dunnett American Cyanamid Company. Pages 1096-1121.
  16. [16]
    Effects of dependence in high-dimensional multiple testing problems
    Feb 25, 2008 · Maximal elements are completely dependent structures without any conditional independence constraints, that is every entry of inverse of the ...
  17. [17]
    4.2 - Controlling Family-wise Error Rate | STAT 555
    Pr(V > 0) is called the family-wise error rate or FWER. It is easy to show that if you declare tests significant for p<α ...
  18. [18]
    A primer on strong vs weak control of familywise error rate - Proschan
    Feb 27, 2020 · A primer on strong vs weak control of familywise error rate. Michael ... Multiple comparisons. In: R D'Agostino, L Sullivan, J Massaro ...
  19. [19]
    A primer on strong vs weak control of familywise error rate - PubMed
    Apr 30, 2020 · A primer on strong vs weak control of familywise error rate ... Stat Med. 2020 Apr 30;39(9):1407-1413. doi: 10.1002/sim.8463. Epub 2020 Feb 27.
  20. [20]
    [PDF] Lecture 19 Multiple Testing 1 Example: Comparing Restaurant Quality
    the union bound) to bound the familywise error rate. Let I0 be the indices of the null hypotheses that are true, then we have. FWER = P(∪i∈I0 (pi ≤ α/k)) ...
  21. [21]
    Controlling type I error rates in multi-arm clinical trials - NIH
    This factor would motivate one to control the total chance of making any type I error, known as the family-wise error rate (FWER). The disadvantage of ...
  22. [22]
  23. [23]
    Controlling the False Discovery Rate: A Practical and Powerful ...
    A simple sequential Bonferronitype procedure is proved to control the false discovery rate for independent test statistics.
  24. [24]
    A direct approach to false discovery rates - Royal Statistical Society
    Aug 12, 2002 · This new approach offers increased applicability, accuracy and power. We apply the methodology to both the positive false discovery rate pFDR and FDR, and ...
  25. [25]
    Etymologia: Bonferroni Correction - PMC - NIH
    The Bonferroni correction compensates for multiple comparisons by dividing the significance level by the number of comparisons.
  26. [26]
    [PDF] Multiple Comparisons Among Means
    The purpose of this paper, then, is to suggest and evaluate a simple use of the Student t statistic for simultaneous confidence intervals for linear combi-.
  27. [27]
    Multiple significance tests: the Bonferroni method - The BMJ
    Jan 21, 1995 · The probability that two correlated variables both give non-significant differences when the null hypothesis is true is now greater than (1-( ...
  28. [28]
    [PDF] The Bonferonni and Šidák Corrections for Multiple Comparisons
    α[PT] = 1−(1−α[PF])1/C . This formula—derived assuming independence of the tests—is some- times called the Šidàk equation. It shows that in order to reach ...
  29. [29]
    A Simple Sequentially Rejective Multiple Test Procedure - jstor
    In this paper we will study multiple test procedures and we will use the most common type of protection against error of the first kind by requiring the tests ...
  30. [30]
    Comparisons of Methods for Multiple Hypothesis Testing in ... - NIH
    In this report, the authors aimed to develop a multiple hypothesis testing strategy to maximize power while controlling Type I error.
  31. [31]
    A general introduction to adjustment for multiple comparisons - Chen
    In the present paper, we provide a brief review on mathematical framework, general concepts and common methods of adjustment for multiple comparisons.
  32. [32]
    [PDF] Hochberg's step-up method: cutting corners off Holm's step-down ...
    Holm's method and Hochberg's method for multiple testing can be viewed as step-down and step-up versions of the Bonferroni test. We show that both are.
  33. [33]
    [PDF] Tukey's Honestly Significant Difference (HSD) Test
    An easy and frequently used pairwise comparison technique was developed by Tukey under the name of the honestly significant difference (hsd) test.
  34. [34]
    p.adjust Adjust P-values for Multiple Comparisons - RDocumentation
    The `p.adjust` function adjusts p-values for multiple comparisons using methods like Bonferroni, Holm, Hochberg, Hommel, BH, and BY.
  35. [35]
    The GLM Procedure - Multiple Comparisons - SAS Help Center
    Oct 28, 2020 · Multiple-comparison procedures (MCPs), also called mean separation tests, give you more detailed information about the differences among the means.
  36. [36]
    IS THE BONFERRONI CORRECTION REALLY SO BAD? - NIH
    The Bonferroni correction (1) for multiple testing is sometimes criticized as being overly conservative. The correction is indeed conservative, and there ...
  37. [37]
    [PDF] Multiple Hypothesis Testing Procedures in Global Test
    Advantages: Strongly controls FWER; Does not require that the tests be independent. Disadvantages: Power decreases significantly (too conservative) as m ...
  38. [38]
    The control of the false discovery rate in multiple testing under ...
    Benjamini and Hochberg suggest that the false discovery rate may be the appropriate error rate to control in many applied multiple testing problems.
  39. [39]
    The positive false discovery rate: a Bayesian interpretation and the q ...
    Storey. "The positive false discovery rate: a Bayesian interpretation and the q-value." Ann. Statist. 31 (6) 2013 - 2035, December 2003. https://doi.org ...
  40. [40]
    Controlling the false discovery rate via knockoffs - Project Euclid
    October 2015 Controlling the false discovery rate via knockoffs. Rina Foygel ... This paper introduces the knockoff filter, a new variable selection ...
  41. [41]
    Bayesian variable selection using Knockoffs with applications to ...
    Sep 18, 2022 · In this study, we propose a multiple testing procedure that unifies key concepts in computational statistics, namely Model-free Knockoffs, ...
  42. [42]
    Multiple testing and its applications to microarrays - PMC - NIH
    Multiple testing in microarrays involves controlling false positives when testing thousands of genes, using criteria like familywise error rates, false ...
  43. [43]
    A practical guide to methods controlling false discoveries in ...
    Jun 4, 2019 · The Benjamini and Hochberg step-up procedure (BH) [13, 23] was the first method proposed to control the FDR. Soon afterwards, the q-value was ...
  44. [44]
    The (in)famous GWAS P-value threshold revisited and updated for ...
    Jan 6, 2016 · Although the genome-wide significance P-value threshold of 5 × 10−8 has become a standard for common-variant GWAS, it has not been updated to ...
  45. [45]
    The Extent and Consequences of P-Hacking in Science - PMC - NIH
    Mar 13, 2015 · One type of bias, known as “p-hacking,” occurs when researchers collect or select data or statistical analyses until nonsignificant results become significant.
  46. [46]
    [PDF] Sequential Selection Procedures and False Discovery Rate Control
    Apr 25, 2014 · In this paper we introduce new testing procedures that address this problem, and control the False Discovery Rate (FDR) in the ordered setting. ...
  47. [47]
    False discoveries occur early on the Lasso path - Project Euclid
    It can be observed that the issue of FDR control becomes more severe when the ... Rate optimal multiple testing procedure in high-dimensional re- gression.
  48. [48]
    Deep neural networks with knockoff features identify nonlinear ...
    Deep neural networks with knockoff features identify nonlinear causal relations and estimate effect sizes in complex biological systems. Zhenjiang Fan ...
  49. [49]
    A/B Testing Gone Wrong: How to Avoid Common Mistakes ... - Medium
    Jun 18, 2024 · Similarly, when you run multiple A/B tests, the chance of finding at least one statistically significant result by pure chance increases. This ...
  50. [50]
    Common pitfalls in statistical analysis: The perils of multiple testing
    Multiple testing refers to situations where a dataset is subjected to statistical testing multiple times - either at multiple time-points or through multiple ...
  51. [51]
    Origin of "5$\sigma$" threshold for accepting evidence in particle ...
    Jul 3, 2012 · The 5σ threshold originated in the 1960s particle physics work, promoted by a 1968 Rosenfeld article, to account for multiple comparisons.
  52. [52]
    Five sigma revisited - CERN Courier
    Jul 3, 2023 · The standard criterion for claiming a discovery in particle physics is that the observed effect should have the equivalent of a five standard- ...
  53. [53]
    The role of the p-value in the multitesting problem - PMC - NIH
    The reproducibility crisis is a direct consequence of the misleading statistical conclusions. In this paper, the authors revisit some of the controversies on ...
  54. [54]
    The Role of p-Values in Judging the Strength of Evidence and ...
    p-Values are viewed by many as the root cause of the so-called replication crisis, which is characterized by the prevalence of positive scientific findings that ...
  55. [55]
    Causal Inference Meets Deep Learning: A Comprehensive Survey
    In this survey, we provide a comprehensive and structured review of causal inference methods in deep learning.
  56. [56]
    [PDF] Causal Inference for Genomic Data with Multiple Heterogeneous ...
    Apr 22, 2025 · Overall, the results in Figure 2 demonstrate the valid FDP control of the proposed multiple testing procedure for various causal estimands.
  57. [57]
    Multiple comparisons of treatment against control under unequal ...
    Dunnett's test is used to test such differences and assumes equal variances of the response variable for each group.