
Look-elsewhere effect

The look-elsewhere effect (LEE) is a statistical phenomenon in hypothesis testing that arises when multiple observations or tests are conducted across a parameter space to search for a signal, thereby increasing the probability of observing a spurious significant result purely by chance, compared with testing at a single predefined location. This effect inflates the apparent significance unless explicitly corrected for, potentially leading to false claims of discovery in scientific analyses. The effect is especially prominent in high-energy physics, cosmology, and astronomy, where experiments routinely scan broad regions, such as mass spectra, energy bins, or spatial coordinates, for rare events or new particles, amplifying the risk of mistaking background fluctuations for signals. A notable example is the discovery of the Higgs boson at the Large Hadron Collider, where the ATLAS and CMS collaborations accounted for the LEE by evaluating global significance across the searched mass range, ensuring the reported five-sigma threshold reflected the full search scope rather than a local excess alone. Failure to address the LEE can undermine the reliability of results, as demonstrated in cases like early claims of supersymmetric particles that did not survive multiplicity corrections. To counteract the LEE, researchers use methods like Monte Carlo simulations to estimate the effective number of independent trials (the trial factor) and compute global p-values, or approximations such as adding the expected number of upward fluctuations to the local p-value for large datasets. More advanced unified approaches integrate frequentist profile likelihood ratios with Bayesian priors to provide consistent significance assessments across diverse search strategies. These corrections are standardized in reviews and ensure that discovery thresholds, like the conventional five-sigma level, maintain a low false-positive rate experiment-wide.

Definition and Background

Core Definition

The look-elsewhere effect refers to a bias in statistical inference that arises when searching for signals or testing hypotheses across multiple regions, parameters, or outcomes, leading to an inflated probability of detecting a false positive that appears significant purely by chance. In such scenarios, researchers examine a broad space, such as a range of possible values in a mass spectrum, without initially accounting for the cumulative risk of random fluctuations mimicking a genuine effect anywhere within that space. This phenomenon is particularly relevant in exploratory analyses where the location or form of a potential signal is not specified in advance, increasing the likelihood of erroneous claims of statistical significance. The core mechanism of the look-elsewhere effect stems from the multiplicity inherent in scanning large spaces or datasets, where the overall probability of a spurious significant result grows with the number of implicit or explicit tests conducted. Here, "look-elsewhere" emphasizes the danger of overlooking that a local fluctuation, which might seem rare in isolation, becomes more probable when considering the entire search space. For instance, a deviation that has a low probability at any single point can collectively produce an apparent outlier somewhere due to the volume explored. An illustrative analogy is fishing in multiple ponds: the odds of reeling in a catch by sheer luck rise substantially if one tries many locations, even if the waters are nearly empty of fish. This effect contrasts starkly with single hypothesis testing, where a conventional threshold like 0.05 directly indicates a 5% false-positive rate for that isolated test. In the presence of multiplicity, however, the global Type I error rate escalates across the searched space, potentially turning a modest local p-value into a misleading overall claim unless multiplicity is addressed. The effect relates to the classical multiple comparisons problem but specifically highlights the challenges of continuous or broad searches.
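The inflation described here is easy to see numerically. The sketch below is illustrative only (the bin count, threshold, and number of pseudo-experiments are arbitrary choices, not taken from any real analysis): it simulates background-only "searches" over many independent bins and counts how often at least one bin looks significant at the 0.05 level.

```python
import math
import random

random.seed(42)

ALPHA = 0.05            # local significance level for a single test
N_BINS = 50             # number of independent places we "look"
N_EXPERIMENTS = 20_000  # background-only pseudo-experiments

false_positives = 0
for _ in range(N_EXPERIMENTS):
    # Largest upward fluctuation among the bins (pure background, no signal).
    z_max = max(random.gauss(0.0, 1.0) for _ in range(N_BINS))
    # One-sided p-value of that largest fluctuation, viewed in isolation.
    p_local = 0.5 * math.erfc(z_max / math.sqrt(2.0))
    if p_local < ALPHA:
        false_positives += 1

observed = false_positives / N_EXPERIMENTS
expected = 1.0 - (1.0 - ALPHA) ** N_BINS  # probability of a false alarm somewhere
print(f"simulated rate of a 'significant' bin somewhere: {observed:.3f}")
print(f"analytic 1 - (1 - alpha)^k:                      {expected:.3f}")
```

With 50 independent bins, a spurious "5% significant" excess appears somewhere in over 90% of background-only experiments, which is the fishing-in-many-ponds analogy made quantitative.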

Historical Development

The look-elsewhere effect traces its origins to the early 20th-century recognition of the multiple comparisons problem in statistics, where performing numerous hypothesis tests on the same dataset increases the chance of false positives without appropriate adjustments. In fields such as astronomy and physics, informal awareness of this issue emerged during the mid-1900s, particularly as researchers analyzed large datasets from observations and early accelerator experiments, where signals were sought across multiple energy bins or parameter ranges without systematic corrections for multiplicity. Formalization of the concept in particle physics began in the late 1960s, with a key milestone being A.H. Rosenfeld's 1968 review that proposed the 5σ threshold to account for multiple comparisons in resonance searches. This was extended in the 1970s, when physicists at facilities like CERN explicitly discussed the "trial factor" to quantify the inflation of significance from scanning broad parameter spaces, such as mass spectra in resonance searches. Another milestone was R.B. Davies' 1977 paper, which provided a foundational statistical framework for hypothesis testing when a nuisance parameter is present only under the alternative hypothesis, enabling precise estimation of tail probabilities in continuous search spaces relevant to high-energy experiments. This work was extended in Davies' 1987 publication, offering asymptotic approximations for level-crossing probabilities that became essential for correcting p-values in multiplicity-heavy analyses. By the 1980s, the trial factor was integrated into experiment protocols, as seen in discussions of event multiplicity and background fluctuations in proton-antiproton collider data, ensuring global significances accounted for the effective number of independent trials. The concept evolved into mainstream statistical discourse by the 1990s, with particle physicists advocating its broader application beyond high-energy physics (HEP), influencing treatments of the multiple testing problem in diverse scientific fields.
Workshops like the inaugural PHYSTAT series in 2000 further propelled this development, with Louis Lyons' 2003 contribution explicitly naming and analyzing the "look-elsewhere effect" in the context of discovery thresholds. This historical progression shaped publishing standards in particle physics journals, reinforcing the 5σ discovery criterion to mitigate the look-elsewhere effect across model-dependent searches.

Statistical Principles

Multiple Comparisons Problem

In null hypothesis significance testing, researchers posit a null hypothesis H_0 representing no effect or no difference, and compute a p-value as the probability of observing data at least as extreme as the sample, assuming H_0 is true. A low p-value, typically below a significance level \alpha (often 0.05), leads to rejecting H_0 in favor of an alternative hypothesis. The multiple comparisons problem arises when multiple hypothesis tests are conducted on the same dataset, inflating the overall chance of false positives. For k independent tests each at significance level \alpha, the family-wise error rate (FWER), the probability of at least one false rejection across all tests, is 1 - (1 - \alpha)^k, which exceeds \alpha and grows rapidly with k. Controlling the FWER emphasizes limiting the risk of any false positive in the family of tests, a conservative approach suited to confirmatory analyses where even one error undermines credibility. In contrast, the false discovery rate (FDR) controls the expected proportion of false positives among all rejected hypotheses, offering greater power for large-scale testing at the cost of potentially more errors. While FWER prioritizes strict control over any Type I error, FDR balances discovery with error management, particularly in genomics or neuroimaging where many tests are routine. The look-elsewhere effect manifests this problem in scenarios involving scans over continuous parameter spaces, such as signal frequency or position, effectively multiplying the number of implicit tests beyond discrete cases and further elevating false positive risks. This continuous multiplicity demands careful adjustment to maintain valid inference, akin to, but more challenging than, finite multiple testing.
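A few lines of arithmetic make the FWER formula concrete; the values of k below are illustrative, and the "adjusted" column shows the effect of the Bonferroni-style remedy of testing each hypothesis at \alpha/k.

```python
# FWER growth for k independent tests at level alpha, and the effect of a
# Bonferroni-style adjustment (illustrative values only).
alpha = 0.05
for k in (1, 5, 20, 100):
    fwer_naive = 1.0 - (1.0 - alpha) ** k         # each test run at alpha
    fwer_adjusted = 1.0 - (1.0 - alpha / k) ** k  # each test run at alpha / k
    print(f"k={k:4d}  naive FWER={fwer_naive:.4f}  adjusted FWER={fwer_adjusted:.4f}")
```

At k = 100 the uncorrected family-wise error rate is about 0.99, while the adjusted rate stays just under the nominal 0.05.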

Trial Factor Concept

The trial factor, often denoted f or N, quantifies the look-elsewhere effect by representing the effective number of independent tests performed across a search space, such as the number of resolution bins in discrete data or the effective volume of parameter space explored in continuous cases. It serves as a multiplier relating the local significance, calculated assuming a signal at a specific point, to the global significance, which accounts for the possibility of the signal appearing anywhere in the searched region. In high-energy physics, for instance, this factor arises when scanning for resonances without a precisely known mass, effectively increasing the chance of spurious findings due to multiple comparisons. Estimating the trial factor depends on the nature of the search space. For discrete cases, such as binned histograms, it is approximately equal to the number of bins, providing a straightforward count of potential test locations, though this assumes independence between bins. In continuous parameter spaces, like mass spectra, estimation requires more sophisticated approaches, including calculation of the expected number of "upcrossings", points where the test statistic exceeds a threshold, using methods such as Davies' bound on tail probabilities for chi-squared processes. Monte Carlo simulations are commonly employed to compute this by generating background-only datasets, fitting the test statistic across the space, and determining the distribution of its maximum value, often requiring around 10^7 trials for precise results at high significance levels like 5σ. These simulations help approximate the trial factor as f \approx \langle N(c) \rangle, where \langle N(c) \rangle is the mean number of upcrossings at the observed threshold c, scaled from a lower reference threshold via an exponential approximation. In practice, the trial factor adjusts the local p-value to preserve overall control of the family-wise error rate (FWER), ensuring that the global p-value reflects the true probability of an excess occurring somewhere in the search space under the null hypothesis.
Specifically, the global p-value is approximately the local p-value multiplied by the trial factor (p_{\text{global}} \approx f \times p_{\text{local}}) for small local p-values, meaning that to achieve a desired global significance (e.g., p_{\text{global}} = 2.87 \times 10^{-7} for 5σ), the local threshold must be stricter by a factor of f. This adjustment prevents overinterpretation of local fluctuations as discoveries and is essential in fields like particle physics for claiming evidence. However, defining the trial factor faces challenges, particularly in establishing the independence of tests, as nearby points in the search space are often correlated, such as adjacent mass bins sharing similar background contributions, potentially leading to over-correction (overestimating f) or under-correction (underestimating it). Numerical resolution in simulations can also introduce biases in upcrossing counts, and the choice of search region boundaries remains subjective, complicating precise quantification in complex analyses.
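The p_{\text{global}} \approx f \times p_{\text{local}} scaling can be sketched in a few lines. The trial factor of 100 below is hypothetical, and the two helper functions are written purely for this example (a numerical bisection stands in for a proper inverse-normal routine):

```python
import math

def p_from_z(z: float) -> float:
    """One-sided Gaussian tail probability for significance z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def z_from_p(p: float) -> float:
    """Invert p_from_z by bisection (adequate for illustration)."""
    lo, hi = 0.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if p_from_z(mid) > p:
            lo = mid   # mid not yet significant enough
        else:
            hi = mid
    return 0.5 * (lo + hi)

trial_factor = 100.0   # hypothetical f for the searched region
z_local = 5.0          # local significance of the observed excess
p_local = p_from_z(z_local)
p_global = min(1.0, trial_factor * p_local)  # p_global ~ f * p_local
z_global = z_from_p(p_global)
print(f"local : Z = {z_local:.2f}, p = {p_local:.3e}")
print(f"global: Z = {z_global:.2f}, p = {p_global:.3e}")
```

A 5σ local excess (p ≈ 2.87 × 10⁻⁷) searched over a hypothetical 100 effective trials degrades to a global significance of roughly 4σ, illustrating why the local threshold must be tightened by a factor of f to claim a given global significance.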

Mathematical Formulation

Probability of False Positives

The look-elsewhere effect elevates the probability of false positives by considering multiple potential locations or hypotheses tested simultaneously under the null hypothesis. In the discrete case, where m independent statistical tests are performed, each with a local false positive probability p (the local significance level), the probability of at least one false positive across all tests is 1 - (1 - p)^m. For small p, this approximates to mp, indicating that the effective false positive rate scales linearly with the number of trials m, which serves as the trial factor in this context. In continuous parameter spaces, such as scanning over a signal parameter like mass in high-energy physics, the situation is more nuanced due to correlations between nearby tests. The effective trial factor f is approximately the search range divided by the resolution width (in the same units). This leads to an approximate relation for the global significance Z_{\text{global}}, which accounts for the look-elsewhere effect: Z_{\text{global}} \approx \sqrt{Z_{\text{local}}^2 - 2 \ln f}, where Z_{\text{local}} is the significance at the observed peak. This asymptotic approximation holds for high thresholds and highlights how scanning broadens the effective search space, reducing the interpreted significance of a local excess. Under the Gaussian approximation, the look-elsewhere effect is modeled using the distribution of the maximum of a random field over the scanned region, which governs the probability of exceeding a threshold under the null hypothesis. The excursion probability, the probability that the supremum of the field exceeds a level u (corresponding to a local significance), can be approximated using methods from random field theory, often involving the expected number of upcrossings of the level. When analytical approximations are insufficient, particularly for complex or multidimensional search spaces with non-Gaussian backgrounds, approaches such as toy Monte Carlo simulations are employed to empirically estimate the distribution of maxima under the null hypothesis.
These involve generating numerous pseudo-experiments (e.g., 10^6 to 10^7 trials) by sampling from the background-only model, computing the test statistic (e.g., the profile likelihood ratio) over the parameter space in each, and recording the maximum value to construct the null distribution of the global maximum. The global p-value is then the fraction of simulated maxima exceeding the observed value, directly quantifying the false positive probability while accounting for the full look-elsewhere effect.
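A minimal version of such a toy procedure might look like the following. As a deliberate simplification, independent unit-Gaussian bins stand in for the full profile-likelihood machinery, and the observed excess, bin count, and toy count are invented for illustration:

```python
import math
import random

random.seed(1)

N_BINS = 30        # stand-in for independent resolution elements in the scan
N_TOYS = 50_000    # background-only pseudo-experiments

# Null distribution of the maximum local significance over the scan.
maxima = [max(random.gauss(0.0, 1.0) for _ in range(N_BINS))
          for _ in range(N_TOYS)]

z_observed = 3.0   # hypothetical local excess seen in the real data
p_global_mc = sum(1 for m in maxima if m >= z_observed) / N_TOYS

p_local = 0.5 * math.erfc(z_observed / math.sqrt(2.0))
p_global_trial = min(1.0, N_BINS * p_local)  # trial-factor approximation
print(f"p_local                 = {p_local:.4e}")
print(f"p_global (toy MC)       = {p_global_mc:.4e}")
print(f"p_global (f * p_local)  = {p_global_trial:.4e}")
```

For this toy setup the Monte Carlo estimate and the f × p_local approximation agree closely; real analyses need far more toys (hence the 10^6 to 10^7 figure) because discovery-level global p-values are tiny.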

Correction Techniques

To address the look-elsewhere effect, statistical corrections adjust thresholds or p-values to control the overall Type I error rate across multiple tests or search regions. These methods range from simple conservative adjustments to more sophisticated approaches tailored to continuous parameter spaces common in fields like particle physics. The Bonferroni correction provides a straightforward way to account for multiple comparisons by dividing the desired global significance level \alpha by the number of tests m (or trial factor f), yielding an adjusted threshold \alpha/m, or equivalently multiplying the local p-value p_L by m to obtain the global p-value p_G = m \cdot p_L. This method ensures the family-wise error rate, the probability of at least one false positive, is controlled at \alpha, but it can be overly conservative, especially when m is large, leading to reduced statistical power. In practice, for discrete tests, m equals the number of regions; for continuous searches, it approximates the effective number of independent resolution elements. For scenarios involving a large number of tests, where controlling the FWER is too stringent, false discovery rate (FDR) methods offer a less conservative alternative by targeting the expected proportion of false positives among significant results. The Benjamini-Hochberg procedure, a widely adopted FDR control technique, sorts p-values in ascending order and rejects null hypotheses for which p_{(i)} \leq (i/m) q, where q is the desired FDR level, i is the rank, and m is the total number of tests; this controls the FDR at q. In searches with many potential signals, such as genome-wide scans or broad resonance hunts, Benjamini-Hochberg balances power and error control better than Bonferroni, though it assumes independence or positive dependence among tests. In particle physics, where searches often span continuous parameter spaces, advanced techniques unify local and global significance assessments to mitigate the look-elsewhere effect more efficiently.
The Gross-Vitells method employs a unified approach based on the expected number of upcrossings of the likelihood ratio test (LRT) statistic, approximating the global p-value as p_G \approx P(\chi^2_1 > q) + \langle N(q_0) \rangle \cdot e^{-(q - q_0)/2}, where q is the observed maximum of the statistic, q_0 is a lower reference threshold, and \langle N(q_0) \rangle is the mean number of upcrossings at q_0, estimated via a modest number of simulations; the method thus combines likelihood ratios for local peaks with global trial factors. This is particularly effective for ordering statistics over a parameter grid, where the maximum LRT value across regions determines significance, avoiding Bonferroni's conservatism. Alternatively, score-based likelihood methods, as in Pilla et al., use geometric tube formulas to compute global thresholds, incorporating nuisance parameters via the score function for multidimensional scans. Best practices for applying these corrections emphasize pre-defining search regions or parameter grids before analysis to minimize post-hoc adjustments and clearly delineate the trial factor, thereby enhancing reproducibility and reducing bias. For complex cases with non-trivial correlations or continuous spaces, simulations, such as toy Monte Carlo datasets, are recommended to empirically determine global distributions and validate corrections, ensuring accurate control of false positives without excessive computational overhead.
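The Benjamini-Hochberg step-up rule described above fits in a few lines of code; the p-values here are invented for illustration, and a Bonferroni comparison is included to show the difference in conservatism:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg step-up
    procedure at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank i (1-based) with p_(i) <= (i / m) * q
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * q:
            k = rank
    # Reject the k smallest p-values (step-up: everything below the cutoff rank).
    return sorted(order[:k])

# Hypothetical p-values: a few strong candidates among mostly-null results.
p_vals = [0.001, 0.008, 0.012, 0.04, 0.2, 0.35, 0.5, 0.7, 0.85, 0.95]
rejected_bh = benjamini_hochberg(p_vals, q=0.05)
rejected_bonf = [i for i, p in enumerate(p_vals) if p <= 0.05 / len(p_vals)]
print("Benjamini-Hochberg rejects:", rejected_bh)    # -> [0, 1, 2]
print("Bonferroni rejects:        ", rejected_bonf)  # -> [0]
```

On this example Benjamini-Hochberg rejects three hypotheses while Bonferroni, holding the family-wise error rate at 0.05, rejects only the single smallest p-value, illustrating the power versus strictness trade-off described above.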

Applications and Examples

In High-Energy Physics

In high-energy physics experiments at the Large Hadron Collider (LHC) at CERN, the look-elsewhere effect arises prominently in searches for new particles, such as the Higgs boson, where broad scans over mass ranges (typically 110–600 GeV) and multiple decay channels are conducted. These searches must account for the increased probability of false positives due to testing numerous hypotheses, resulting in trial factors that can reach 10^4 to 10^6 in comprehensive analyses covering extensive parameter spaces. For instance, in beyond-Standard-Model scenarios like supersymmetry, the multidimensional nature of signal models amplifies this effect, necessitating careful correction to avoid mistaking statistical fluctuations for genuine signals. A landmark example is the 2012 discovery of the Higgs boson, where the ATLAS and CMS collaborations explicitly adjusted for the look-elsewhere effect to establish global significance. ATLAS observed a local significance of 5.9σ at a mass of 126.5 GeV in the combined H → γγ and H → ZZ channels using 4.5–4.8 fb⁻¹ of 7 TeV data and 5.8–5.9 fb⁻¹ of 8 TeV data, but after correcting for the look-elsewhere effect over the range 110–600 GeV via trial-factor estimation, the global significance was 5.1σ; narrowing the range to 110–150 GeV yielded 5.3σ. Similarly, CMS reported a local significance of 5.0σ at 125.5 GeV across the γγ, ZZ, and WW channels with 5.1 fb⁻¹ at 7 TeV and 5.3 fb⁻¹ at 8 TeV, adjusted to a global 4.6σ over 115–130 GeV to account for multiple bins and channels. The combined ATLAS-CMS result achieved the required 5σ global significance, confirming the discovery while mitigating the risk of spurious signals from unadjusted local excesses. Challenges in handling the look-elsewhere effect at the LHC include background fluctuations in complex event topologies, such as varying jet multiplicities or angular distributions, which can produce apparent signals across tested regions if trial factors are underestimated.
For example, in dijet searches, discrepancies in angular correlations might yield local peaks that, without correction, inflate significance due to the broad parameter space explored. These issues are particularly acute in high-luminosity runs, where increased data volume heightens the potential for chance alignments mimicking new physics. ATLAS and CMS adhere to standardized guidelines for discovery claims, mandating a global significance greater than 5σ that fully incorporates the look-elsewhere effect through Monte Carlo-based background modeling. This involves generating large ensembles of pseudo-experiments to simulate fluctuations across all searched dimensions, ensuring trial factors are robustly estimated and applied via methods like the profile likelihood ratio test. Such practices, rooted in frequentist statistics, have become the standard for LHC analyses, balancing sensitivity to subtle signals with rigorous control of false discovery rates.
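As a rough back-of-envelope check, not the collaborations' actual procedure (which relies on pseudo-experiments), the trial factor implied by the quoted ATLAS local and global significances can be estimated from p_global ≈ f × p_local; the helper function is written for this example.

```python
import math

def p_one_sided(z: float) -> float:
    """One-sided Gaussian tail probability for significance z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Local and global significances quoted above for ATLAS (110-600 GeV range).
p_local = p_one_sided(5.9)
p_global = p_one_sided(5.1)

# Rough implied trial factor, using p_global ~ f * p_local.
implied_f = p_global / p_local
print(f"p_local  = {p_local:.2e}")
print(f"p_global = {p_global:.2e}")
print(f"implied trial factor f = {implied_f:.0f}")
```

The implied trial factor comes out near 100, a plausible order of magnitude for an effective number of independent mass-resolution elements over a several-hundred-GeV search window.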

In Other Scientific Fields

In gravitational-wave astronomy, the look-elsewhere effect is particularly relevant in searches conducted by observatories such as LIGO, where analyses scan vast frequency-time parameter spaces using large template banks to detect signals from compact binary coalescences. These banks can contain thousands to millions of templates, introducing substantial trial factors that inflate the probability of false positives if unaccounted for. To mitigate this, researchers estimate trial factors based on the effective volume of the searched parameter space and apply adjustments via methods like effective chi-squared distributions, ensuring robust significance assessment for detections. In genomics, the look-elsewhere effect arises prominently in genome-wide association studies (GWAS), which test associations between traits and millions of single nucleotide polymorphisms (SNPs) across the genome, leading to heightened risks of spurious association peaks. Standard practice involves controlling the false discovery rate (FDR) through procedures like the Benjamini-Hochberg method, which adjusts p-values to maintain a targeted proportion of false positives among significant results, thereby addressing the multiplicity inherent in scanning the entire genome. This approach balances discovery power with error control, as demonstrated in large-scale studies identifying true genetic associations. In neuroimaging, voxel-based analyses of functional magnetic resonance imaging (fMRI) data exemplify the look-elsewhere effect by performing independent statistical tests on thousands of voxels across the brain, resulting in inflated false positive rates due to spatial multiplicity. Standard corrections typically employ cluster-level inference, where significance is determined not by individual voxels but by the extent of contiguous suprathreshold clusters, with thresholds derived from permutation-based null distributions to control the family-wise error rate. This method, while reducing Type I errors, has been shown to suffer from inflated false positives under non-Gaussian spatial autocorrelations common in fMRI data, prompting refinements in analysis packages such as SPM and FSL.
In economics and finance, the look-elsewhere effect contributes to biases from data dredging, as researchers or traders test vast arrays of strategies, such as momentum, value, or correlation-based models, across numerous assets and time periods, leading to overfitting and illusory discoveries. To counteract this, multiple testing frameworks like false discovery rate control or bootstrap-based reality checks adjust for the number of trials, calibrating p-values to distinguish genuine anomalies from noise in large datasets such as CRSP. Analyses of millions of simulated trading strategies reveal that without such corrections, up to 50% of reported "significant" effects may be false positives, underscoring the need for out-of-sample validation.
