
F-statistics

In population genetics, F-statistics, also known as fixation indices, are a class of measures that quantify how genetic variation is partitioned within and among populations, particularly as a result of inbreeding and population structure. Introduced by the American geneticist Sewall Wright in the 1920s as part of his work on inbreeding coefficients, these statistics provide a framework for understanding evolutionary processes such as genetic drift, gene flow, and population subdivision. The core F-statistics are FIS (the inbreeding coefficient within subpopulations), FST (the fixation index measuring differentiation among subpopulations), and FIT (total inbreeding relative to the overall population), related hierarchically by the equation FST = (FIT - FIS) / (1 - FIS). They are computed from heterozygosity levels: observed within individuals (HI), expected within subpopulations (HS), and expected in the total population (HT). For example, FST = (HT - HS) / HT. Widely applied in fields such as human population genetics and conservation biology, F-statistics help infer population structure and gene flow, but they assume equilibrium conditions that may not always hold.

Historical Background

Sewall Wright's Formulation

Sewall Wright initiated the development of the concepts underlying F-statistics in the early 1920s during his tenure at the U.S. Department of Agriculture's Bureau of Animal Industry, where he investigated the effects of inbreeding in livestock to improve breeding practices. His work focused on calculating inbreeding coefficients using path analysis, a statistical method he introduced to trace correlations through pedigrees and quantify the probability of identity by descent in offspring. This approach was initially applied to domestic animals, addressing the challenge of maintaining genetic diversity under controlled mating systems.

Wright's experiments with guinea pigs at the USDA exemplified these efforts, involving over a decade of systematic inbreeding through full-sib matings across multiple generations. These studies revealed a progressive decline in viability and vigor, alongside increased differentiation among inbred families, underscoring inbreeding's role in reducing heterozygosity and amplifying genetic drift within small populations. Complementing this, his analyses of cattle pedigrees demonstrated how limited bull dispersal and small herd sizes fostered inbreeding, leading to correlated deviations in traits from breed optima and informing practical recommendations for crossbreeding to restore vigor.

By 1951, Wright had expanded these ideas into broader population genetic theory in his Galton Lecture, published as "The genetical structure of populations," where he formalized F-statistics as ratios of variances to describe hierarchical population subdivision. This formulation built directly on his earlier work, providing a framework to partition genetic variation in structured populations beyond simple pedigrees.

The original motivation for F-statistics was to quantify deviations from panmixia (random mating across the entire population) arising from inbreeding within subpopulations and random drift between them in finite groups. Wright emphasized their utility in both artificial settings, like livestock breeds, and natural scenarios, drawing on his guinea pig results to illustrate drift's cumulative effects. He further exemplified this through theoretical models of island populations, where isolated demes exchange limited migrants, mimicking subdivision in wild species and highlighting drift's evolutionary role under restricted gene flow.

Development and Refinements

The advent of molecular techniques, particularly protein electrophoresis in the 1960s, enabled direct measurement of genotypic heterozygosity at enzyme loci, shifting the application of F-statistics from inferred phenotypic traits to observable genetic variation in natural populations. This methodological advance, exemplified by surveys of Drosophila populations, revealed unexpectedly high levels of genetic diversity and made it possible to test population structure models that had previously been limited by data availability.

In the 1970s, Masatoshi Nei refined F-statistics to accommodate multi-allelic loci by reformulating them as ratios of gene diversities (expected heterozygosities) rather than correlations between uniting gametes, extending Wright's original biallelic framework. These heterozygosity-based definitions provided a unified approach to quantifying subdivision in populations with complex allelic variation.

By the 1980s, refinements focused on hierarchical F-statistics, which partition genetic variation across levels such as individuals within subpopulations and subpopulations within the total population, building on Wright's earlier correlations. Bruce S. Weir and C. Clark Cockerham introduced unbiased estimators for these hierarchical parameters (FIT, FST, FIS) that account for finite sample sizes and multilocus data, improving the accuracy of population structure inference from electrophoretic and other genotypic datasets.

Core Concepts and Definitions

Inbreeding and Fixation Coefficients

The inbreeding coefficient, originally formulated by Wright, quantifies the probability that two homologous alleles in an individual are identical by descent, meaning they are copies of the same ancestral allele rather than arising independently. This measure captures the extent to which non-random mating, such as consanguineous unions, increases the likelihood of homozygosity within individuals compared to expectations under random mating.

Fixation coefficients, also rooted in Wright's framework, represent the proportion of total genetic variation at a locus that arises from non-random mating or population substructure, leading to reduced heterozygosity across the broader population. These coefficients reflect how factors like genetic drift or limited gene flow cause alleles to become "fixed" (homozygous) more frequently than anticipated, thereby partitioning variation in a non-random manner.

A key distinction exists between inbreeding coefficients, which primarily address within-population effects such as mating among relatives that elevate homozygosity in subpopulations, and fixation coefficients, which emphasize between-population differentiation driven by barriers to genetic exchange. Inbreeding operates at the individual or local level to shift genotype frequencies away from panmictic expectations, while fixation highlights global structure in which subpopulations diverge in allele frequencies.

These concepts presuppose an understanding of Hardy-Weinberg equilibrium (HWE), a null model assuming random mating, no evolutionary forces, and infinite population size. Under HWE the expected heterozygosity (He), the proportion of heterozygous individuals predicted from allele frequencies, matches the observed heterozygosity (Ho), the proportion actually measured in the population. Departures from HWE, particularly when Ho falls below He, signal inbreeding or population substructure, and HWE provides the baseline against which inbreeding and fixation are assessed.
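As a rough illustration of the comparison between Ho and He described above, the following sketch computes both quantities for a single diallelic locus from genotype counts and reports the heterozygote deficit as F = 1 - Ho/He (the relation given in the Mathematical Equations section). The function name and the example counts are hypothetical and chosen only for illustration.

```python
def inbreeding_coefficient(n_AA, n_Aa, n_aa):
    """Return (p, H_o, H_e, F) for one diallelic locus from genotype counts.

    Hypothetical helper for illustration; assumes a single population
    and a locus that is polymorphic in the sample.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    h_obs = n_Aa / n                  # observed heterozygosity (Ho)
    h_exp = 2 * p * (1 - p)           # expected heterozygosity under HWE (He)
    if h_exp == 0:
        raise ValueError("locus is monomorphic; F is undefined")
    return p, h_obs, h_exp, 1 - h_obs / h_exp


# A sample with a strong heterozygote deficit (made-up counts).
p, h_obs, h_exp, f = inbreeding_coefficient(40, 20, 40)
print(f"p={p:.2f} Ho={h_obs:.2f} He={h_exp:.2f} F={f:.2f}")
```

With genotype counts in exact Hardy-Weinberg proportions (for instance 25, 50, 25) the same function returns F = 0, which is the baseline the text refers to.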

Notation: F_IS, F_ST, F_IT

In population genetics, the F-statistics introduced by Wright provide a framework for quantifying the partitioning of genetic variation in structured populations, using notation that reflects different levels of inbreeding and differentiation.

The notation F_{IS} denotes the inbreeding coefficient within subpopulations, which measures the extent of deviation from Hardy-Weinberg equilibrium (HWE) within individual subpopulations due to non-random mating or other local processes. It represents the correlation between uniting gametes relative to those drawn at random from the same subpopulation, or equivalently, the average reduction in heterozygosity within subpopulations compared to HWE expectations.

The notation F_{ST}, often called the fixation index, quantifies the genetic differentiation among subpopulations relative to the total population. It captures the proportion of total genetic variation attributable to differences in allele frequencies between subpopulations, reflecting the effects of limited gene flow, genetic drift, or selection across population boundaries.

The notation F_{IT} represents the total inbreeding coefficient of individuals relative to the entire population, indicating the overall deviation from HWE when the whole structured population is treated as a single unit. It measures the correlation between uniting gametes relative to gametes drawn at random from the total population, encompassing both within- and between-subpopulation effects.

These notations are interconnected hierarchically: the total inbreeding F_{IT} decomposes into the within-subpopulation component F_{IS} and the between-subpopulation differentiation F_{ST}, as expressed by the equation

F_{IT} = F_{IS} + F_{ST}(1 - F_{IS})

This relationship is an additive decomposition adjusted for the interaction between local inbreeding and population structure, allowing researchers to partition overall genetic correlations across levels.
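The hierarchical relationship can be checked numerically. The short sketch below (function names are hypothetical) combines F_IS and F_ST into F_IT using the equation above, and inverts it to recover F_ST; the input values are made up for illustration.

```python
def total_inbreeding(f_is, f_st):
    """F_IT from the decomposition F_IT = F_IS + F_ST * (1 - F_IS)."""
    return f_is + f_st * (1 - f_is)


def differentiation(f_it, f_is):
    """F_ST recovered from F_IT and F_IS: (F_IT - F_IS) / (1 - F_IS)."""
    return (f_it - f_is) / (1 - f_is)


f_it = total_inbreeding(f_is=0.10, f_st=0.20)
print(round(f_it, 4))                          # 0.28
print(round(differentiation(f_it, 0.10), 4))   # 0.2

# A negative F_IS (heterozygote excess) partly offsets the structure term.
print(round(total_inbreeding(f_is=-0.10, f_st=0.20), 4))  # 0.12
```

The same numbers satisfy the multiplicative form 1 - F_IT = (1 - F_IS)(1 - F_ST), for example 0.72 = 0.9 x 0.8 in the first case.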

Theoretical Basis

Partition of Genetic Variation

In population genetics, F-statistics provide a framework for partitioning the total genetic variance observed in a population into distinct components that reflect different levels of genetic structure. The total genetic variance, denoted σ²_total, is decomposed into the variance within individuals (σ²_I), the variance within subpopulations (σ²_S), and the variance between subpopulations (σ²_ST). This partitioning originates from Sewall Wright's work on the genetic structure of populations, where he emphasized that such a decomposition allows researchers to quantify the effects of inbreeding, population subdivision, and overall population structure.

The F-statistics are defined as ratios of these variance components, highlighting their roots in variance analysis. For instance, F_ST is the ratio of the between-subpopulation variance to the total variance, F_ST = σ²_ST / σ²_total, which measures the proportion of genetic variation attributable to differences among subpopulations. Similarly, F_IS and F_IT capture the relative contributions of within-subpopulation and total inbreeding effects. This variance-based approach reflects Wright's original conceptualization in the context of quantitative genetics, where random genetic drift and population structure lead to the accumulation of differences between groups.

Conceptually, the partitioning model assumes neutral loci in diploid organisms, where genetic variation at a locus is determined by allele frequencies. Under the infinite alleles model, each mutation introduces a unique allele, simplifying the variance decomposition by focusing on heterozygosity and identity probabilities across hierarchical levels: individuals, subpopulations, and the total population. In contrast, the infinite sites model allows multiple mutations to accumulate at different sites within the same locus, which can complicate partitioning but still permits the variance to be broken down into the same components, particularly under equilibrium conditions like those of the neutral infinite alleles model. This distinction is crucial for understanding how drift-driven processes, such as restricted gene flow, contribute to σ²_ST over time in subdivided populations.

To illustrate, consider a diploid population divided into subpopulations experiencing genetic drift without selection or mutation. The within-individual variance σ²_I represents heterozygosity at the individual level, often idealized as zero under complete homozygosity from inbreeding, while σ²_S captures variation among individuals within a subpopulation due to local drift. The between-subpopulation component σ²_ST then accumulates as subpopulations diverge, with F_ST quantifying the share of the total variance that this divergence explains. This hierarchical partitioning has been foundational for modeling neutral evolution in structured populations, as demonstrated in simulations and theoretical derivations for multi-allelic loci.

Mathematical Equations

The F-statistics, originally formulated by Wright, can be expressed through measures of heterozygosity, which compare observed genotype frequencies with those expected under Hardy-Weinberg equilibrium (HWE). A general inbreeding coefficient F is defined as the deviation from HWE within a population:

F = 1 - \frac{H_o}{H_e}

where H_o is the observed heterozygosity (the proportion of heterozygous individuals) and H_e is the expected heterozygosity under HWE. For a diallelic locus with allele frequencies p and q = 1 - p, H_e = 2pq.

In the context of population structure, Wright's hierarchical F-statistics relate heterozygosity across levels. The total inbreeding coefficient F_{IT} measures deviation at the individual level relative to the total population:

F_{IT} = 1 - \frac{H_I}{H_T}

where H_I is the observed heterozygosity averaged across individuals and H_T is the expected heterozygosity in the total population. Similarly, the within-subpopulation inbreeding coefficient and the between-subpopulation differentiation are

F_{IS} = 1 - \frac{H_I}{H_S}, \qquad F_{ST} = 1 - \frac{H_S}{H_T}

with H_S the expected heterozygosity within subpopulations. These satisfy the decomposition

F_{IT} = F_{IS} + F_{ST}(1 - F_{IS})

or equivalently

1 - F_{IT} = (1 - F_{IS})(1 - F_{ST})

which partitions total genetic variation into components due to within-subpopulation inbreeding and among-subpopulation differences.

An alternative variance-based formulation emphasizes allele frequency differences across subpopulations. For F_{ST}, it is given by

F_{ST} = \frac{\text{Var}(p)}{\bar{p}(1 - \bar{p})}

where \text{Var}(p) is the variance of the allele frequency p among subpopulations, \bar{p} is the mean allele frequency, and \bar{p}(1 - \bar{p}) is the total binomial variance under HWE in the overall population. This is equivalent to the heterozygosity ratio because H_S \approx 2\bar{p}(1 - \bar{p}) - 2\text{Var}(p) and H_T \approx 2\bar{p}(1 - \bar{p}), so that 1 - F_{ST} = H_S / H_T.
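The heterozygosity definitions above translate directly into a few lines of code. The following is a minimal sketch for one diallelic locus, assuming genotype counts are available for each subpopulation and that subpopulations are weighted equally (a simplifying assumption of this example, not something the definitions require). The function name and the counts are hypothetical.

```python
def f_statistics(subpops):
    """Compute H_I, H_S, H_T and F_IS, F_ST, F_IT for one diallelic locus.

    subpops: list of (n_AA, n_Aa, n_aa) genotype counts, one per subpopulation.
    Subpopulations are weighted equally (assumption of this sketch).
    """
    h_obs, h_exp, freqs = [], [], []
    for n_AA, n_Aa, n_aa in subpops:
        n = n_AA + n_Aa + n_aa
        p = (2 * n_AA + n_Aa) / (2 * n)   # allele frequency in this subpopulation
        freqs.append(p)
        h_obs.append(n_Aa / n)            # observed heterozygosity
        h_exp.append(2 * p * (1 - p))     # expected heterozygosity under HWE
    k = len(subpops)
    h_i = sum(h_obs) / k                  # H_I: mean observed heterozygosity
    h_s = sum(h_exp) / k                  # H_S: mean expected within subpopulations
    p_bar = sum(freqs) / k
    h_t = 2 * p_bar * (1 - p_bar)         # H_T: expected in the pooled population
    return {
        "H_I": h_i, "H_S": h_s, "H_T": h_t,
        "F_IS": 1 - h_i / h_s,
        "F_ST": 1 - h_s / h_t,
        "F_IT": 1 - h_i / h_t,
    }


# Two subpopulations, each in Hardy-Weinberg proportions, with p = 0.7 and p = 0.3.
stats = f_statistics([(49, 42, 9), (9, 42, 49)])
for name, value in stats.items():
    print(f"{name} = {value:.3f}")
```

In this example each subpopulation is internally at HWE, so F_IS is zero and all of F_IT comes from F_ST. The F_ST value also agrees with the variance form: Var(p) = 0.04 (population variance of 0.7 and 0.3) divided by 0.5 x 0.5 gives 0.16.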

Measuring Population Differentiation

Interpretation of F_ST

The fixation index F_{ST}, a key measure in F-statistics, quantifies the proportion of genetic variation attributable to differences between subpopulations, ranging from 0 to 1. A value of 0 indicates no genetic differentiation, corresponding to a panmictic (randomly mating) population where allele frequencies are homogeneous across subpopulations due to unrestricted gene flow. Conversely, F_{ST} = 1 signifies complete differentiation, with populations fixed for different alleles and no shared variation, often resulting from prolonged separation without gene flow.

Sewall Wright offered qualitative guidelines for interpreting F_{ST} values: values below 0.05 suggest little genetic differentiation, 0.05 to 0.15 moderate differentiation, 0.15 to 0.25 great differentiation, and values exceeding 0.25 very great differentiation. These thresholds, derived from empirical and theoretical considerations in subdivided populations, help assess the extent of population structure but should be read in the context of species-specific life history and geography, as they are broad heuristics rather than strict boundaries.

In neutral evolutionary models, F_{ST} primarily reflects the balance between genetic drift, which increases differentiation by randomly fixing alleles in finite populations, and gene flow, which reduces it by exchanging alleles. A common approximation in Wright's island model relates F_{ST} to migration-drift equilibrium as

F_{ST} \approx \frac{1}{1 + 4Nm}

where N is the effective population size and m is the per-generation migration rate; a low F_{ST} thus implies high gene flow counteracting drift. Other factors, such as selection favoring local adaptations or mutation introducing new variation, can shift F_{ST} away from neutral expectations, though in strictly neutral scenarios drift and migration dominate.
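To give a feel for the island-model approximation, the sketch below evaluates F_ST for a few values of Nm and inverts the formula to obtain the number of migrants per generation implied by a given F_ST. The function names are hypothetical, and the inversion inherits every assumption of the island model (equilibrium, neutrality, equal deme sizes), so the resulting Nm is best read as an order-of-magnitude indication only.

```python
def fst_island(n_e, m):
    """Equilibrium F_ST under Wright's island model: 1 / (1 + 4*N*m)."""
    return 1.0 / (1.0 + 4.0 * n_e * m)


def migrants_per_generation(fst):
    """Invert the approximation to get Nm = (1/F_ST - 1) / 4."""
    if not 0 < fst <= 1:
        raise ValueError("F_ST must lie in (0, 1]")
    return (1.0 / fst - 1.0) / 4.0


for nm in (0.1, 1, 10):
    # Only the product N*m matters, so fix N = 1000 and vary m.
    print(f"Nm = {nm:>4}: F_ST = {fst_island(1000, nm / 1000):.3f}")

print(migrants_per_generation(0.2))   # 1.0
```

One migrant per generation (Nm = 1) corresponds to F_ST = 0.2 under this approximation, regardless of the population size itself.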

Hierarchical F-Statistics

Hierarchical F-statistics extend the classical framework to populations organized in multi-level nested structures, such as individuals within subpopulations, subpopulations within regions, and regions within a broader total population. This approach partitions genetic variation across multiple hierarchical levels, allowing researchers to quantify differentiation at each stratum beyond the simple two-level (individual-population) design originally proposed by Wright. For instance, in a three-level hierarchy, F_CT measures differentiation among major regions (e.g., continents or geographic clusters), F_SC captures variation among subpopulations within those regions (e.g., local demes or islands), and F_IS assesses inbreeding or deviation from Hardy-Weinberg expectations within individual subpopulations. These indices are derived from variance components analogous to those of the analysis of molecular variance (AMOVA), where total genetic variance is decomposed into additive contributions from each level.

In a full hierarchical model, the overall fixation index F_total quantifies total inbreeding relative to the global population and is expressed as

F_{\text{total}} = 1 - \frac{H_{\text{individual}}}{H_{\text{total}}}

where H_{\text{individual}} is the observed heterozygosity within individuals (the lowest level) and H_{\text{total}} is the expected heterozygosity across the entire population. This encompasses nested partitions of heterozygosity, such that differentiation at higher levels compounds with that at lower ones. For example, the complements of the fixation indices multiply across strata, so that 1 - F_ST = (1 - F_SC)(1 - F_CT), reflecting cumulative structure. For an arbitrary number of k levels, the approach generalizes through recursive variance partitioning, where each F_{i,j} represents the correlation between alleles at level i relative to level j, enabling scalable analysis of complex structures like subdivided demes.

These statistics find application in structured populations where gene flow varies by scale, such as island models where islands form subpopulations within oceanic regions, or metapopulations with demes nested in patches. For example, in a study of the subterranean termite Reticulitermes flavipes with a four-level hierarchy (individuals within colonies within transects within sites), hierarchical F-statistics revealed strong differentiation among colonies overall (F_CT = 0.311), minimal differentiation among transects within sites (F_SC = 0.024), and a negative F_IS of -0.319 within colonies, indicating excess heterozygosity due to colony founding by outbred pairs. This framework helps dissect evolutionary processes like isolation by distance in metapopulations, making it possible to weigh the contribution of regional barriers against that of local ones.
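A three-level partition can be sketched from allele frequencies alone. The example below assumes a diallelic locus, equal weighting of regions and of subpopulations within regions, and defines each index as a heterozygosity ratio (F_SC = 1 - H_S/H_C, F_CT = 1 - H_C/H_T, F_ST = 1 - H_S/H_T, where H_C is the expected heterozygosity within regions). The symbol H_C, the function name, and the frequencies are introduced here for illustration and are not taken from the studies cited in the text.

```python
def hierarchical_f(regions):
    """Three-level heterozygosity partition for one diallelic locus.

    regions: list of regions, each a list of subpopulation allele frequencies.
    Regions, and subpopulations within a region, are weighted equally
    (assumption of this sketch).
    """
    def het(p):
        return 2 * p * (1 - p)

    h_s_by_region, h_c_by_region, region_freqs = [], [], []
    for subs in regions:
        h_s_by_region.append(sum(het(p) for p in subs) / len(subs))
        p_region = sum(subs) / len(subs)
        region_freqs.append(p_region)
        h_c_by_region.append(het(p_region))

    h_s = sum(h_s_by_region) / len(regions)       # within subpopulations
    h_c = sum(h_c_by_region) / len(regions)       # within regions
    h_t = het(sum(region_freqs) / len(regions))   # total
    return {
        "F_SC": 1 - h_s / h_c,   # subpopulations within regions
        "F_CT": 1 - h_c / h_t,   # regions within total
        "F_ST": 1 - h_s / h_t,   # subpopulations within total
    }


# Two regions with two subpopulations each (made-up frequencies).
result = hierarchical_f([[0.1, 0.3], [0.7, 0.9]])
for name, value in result.items():
    print(f"{name} = {value:.4f}")
```

Because every index is a ratio of adjacent heterozygosity levels, the product rule 1 - F_ST = (1 - F_SC)(1 - F_CT) holds by construction; here most of the structure sits between regions rather than among subpopulations within them.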

Estimation Techniques

Classical Methods from Allele Frequencies

Classical methods for estimating F-statistics rely on observed allele frequencies from codominant markers, such as allozymes or microsatellites, to quantify genetic differentiation and inbreeding in structured populations. These approaches, developed before the widespread use of genomic data, use moment-based estimators derived from analyses of variance in allele frequencies across subpopulations. The estimators are designed to provide unbiased assessments under assumptions of neutrality and random sampling, making them foundational for early population genetic studies.

A seminal contribution to these methods is the work of Weir and Cockerham (1984), who proposed unbiased estimators for F-statistics using an analysis of variance (ANOVA) framework applied to genotype data. For FST (denoted θ in their notation), the estimator in its simplest allele-based form is

\hat{\theta} = \frac{\text{MSB} - \text{MSE}}{\text{MSB} + (n - 1)\,\text{MSE}}

where MSB is the mean square between subpopulations, MSE is the mean square error within subpopulations, and n is the (average) sample size per subpopulation. This formula partitions the total genetic variance into components attributable to differences among subpopulations (MSB) and within them (MSE), providing a direct measure of differentiation that accounts for finite sample sizes and multiple alleles. The estimators are computed locus by locus and then combined across loci to obtain overall F-statistics.

For the inbreeding coefficient FIS (denoted f in their notation), the classical estimator is

\hat{f} = \frac{H_e - H_o}{H_e}

where H_o is the observed heterozygosity and H_e is the expected heterozygosity under Hardy-Weinberg equilibrium, averaged over loci. This measures the deficit of heterozygotes within subpopulations relative to expectations, reflecting non-random mating or Wahlund effects. Weir and Cockerham's framework extends this to incorporate genotype frequencies directly, ensuring consistency with the overall correlation-based definition of F-statistics.

These methods assume an infinite alleles model for mutation, neutrality with no selection acting on loci, and random sampling of individuals from subpopulations. For multi-allelic loci, the estimators handle complexity by computing the variance of allele frequencies relative to the expected variance under Hardy-Weinberg proportions, which allows contributions to be summed across alleles without assuming diallelic systems. This keeps the estimators applicable to highly polymorphic markers, though they can be sensitive to rare alleles if sample sizes are small.

As an illustrative example for a diallelic locus, FST can be estimated simply as the variance in allele frequencies across subpopulations divided by \bar{p}(1 - \bar{p}), which is half the expected heterozygosity in the total population:

F_{ST} = \frac{\text{Var}(p_i)}{\bar{p}(1 - \bar{p})}

where p_i is the frequency of the allele in subpopulation i, and \bar{p} is the mean frequency across all subpopulations. This formula, rooted in Wright's original partitioning of variance, highlights how differentiation arises from drift-induced fluctuations in allele frequencies, and it corresponds to the quantity that the Weir-Cockerham estimator targets in two-allele cases.
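The ANOVA-style estimator can be sketched for the simplest case: one diallelic locus, with data given as allele counts per subpopulation. This is the allele-based (haploid or random-union-of-gametes) simplification, not the full genotype-based Weir and Cockerham estimator with its within-individual component. In this sketch the n in the formula is taken to be n_c, the average sample size per subpopulation corrected for unequal sizes, as in the standard allele-based form; the function name and the counts are hypothetical.

```python
def theta_hat(counts):
    """Moment estimator of F_ST (theta) from allele counts at one diallelic locus.

    counts: list of (allele_count, sample_size) per subpopulation, where
    sample_size is the number of gene copies sampled.
    Simplified allele-based form; ignores the within-individual level.
    """
    r = len(counts)                                   # number of subpopulations
    n_tot = sum(n for _, n in counts)
    p_bar = sum(x for x, _ in counts) / n_tot         # pooled allele frequency
    # Average sample size, corrected for unequal sample sizes.
    n_c = (n_tot - sum(n * n for _, n in counts) / n_tot) / (r - 1)
    # Mean square between subpopulations.
    msb = sum(n * (x / n - p_bar) ** 2 for x, n in counts) / (r - 1)
    # Mean square within subpopulations.
    mse = sum(n * (x / n) * (1 - x / n) for x, n in counts) / (n_tot - r)
    return (msb - mse) / (msb + (n_c - 1) * mse)


print(round(theta_hat([(70, 100), (30, 100)]), 4))   # differentiated samples
print(round(theta_hat([(50, 100), (50, 100)]), 4))   # identical frequencies
print(theta_hat([(100, 100), (0, 100)]))             # fixed difference
```

Two properties are worth noting. With identical sample frequencies the estimate is slightly negative, which is expected of an unbiased estimator when the true value is near zero. With only two subpopulations the estimate is also noticeably larger than the plain ratio Var(p) / (p̄(1 - p̄)) computed with a population variance, because the estimator corrects for the small number of subpopulations sampled.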

Modern Approaches with Molecular Data

With the advent of high-throughput sequencing technologies, modern estimation of F-statistics has shifted toward single nucleotide polymorphisms (SNPs) and whole-genome sequences, enabling finer-scale analyses of genetic differentiation. These data types allow for genome-wide scans that capture local variation patterns, such as those left by selective sweeps or admixture events, far beyond the resolution of traditional markers. A key application is the window-based F_ST scan, where the genome is divided into sliding windows (typically 50 to 100 kb) and a localized F_ST is computed in each, identifying regions of elevated differentiation indicative of local adaptation or barriers to gene flow.

Several software packages facilitate these computations, tailored to large genomic datasets. Arlequin implements F-statistics for SNPs and sequences, supporting input from VCF files and providing options for pairwise and hierarchical analyses. GENEPOP, updated for modern formats, computes F-statistics from multilocus data including SNPs, with exact tests for differentiation. For admixture-focused f4-statistics, ADMIXTOOLS uses block-jackknife resampling on genome-wide data to test treeness and admixture proportions. VCFtools offers efficient bulk computation of Weir and Cockerham's F_ST across populations directly from VCF files, suitable for whole-genome data. Additionally, ANGSD with realSFS enables F_ST estimation from low-coverage whole-genome sequencing by modeling site frequency spectra without explicit genotype calling, accommodating uncertainty in allele frequencies.

Bias corrections are essential when using ascertained SNPs, as discovery schemes (e.g., from arrays) oversample common variants and can bias F_ST, with the direction depending on the discovery panel. Methods adjust for this by reweighting allele frequencies based on ascertainment protocols or by restricting analysis to variant subsets that are free of ascertainment bias. Linkage disequilibrium (LD) can also bias F_ST estimates downward; corrections involve filtering linked SNPs or applying LD-pruned subsets to ensure independence. Bootstrapping or jackknifing over genomic regions provides confidence intervals, accounting for sampling variance in large datasets.

For hierarchical structures, the Analysis of Molecular Variance (AMOVA) framework extends F-statistics to multi-level partitions using genomic data, estimating variance components analogous to F_CT (among groups), F_SC (among subpopulations within groups), and F_ST (among subpopulations relative to the total). Implemented in tools like Arlequin, AMOVA on SNPs quantifies nested differentiation, such as in metapopulations, with significance tested via permutation. This approach integrates whole-genome sequences by treating haplotypes or pairwise distances as input, enhancing resolution for complex hierarchies.
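A window-based scan of the kind described above can be sketched in a few lines. The example assumes two populations, per-SNP allele frequencies that are already estimated, and non-overlapping windows rather than sliding ones. Within each window the per-SNP numerators (H_T - H_S) and denominators (H_T) are summed before taking the ratio, a common way to avoid letting low-diversity SNPs dominate. It does not reproduce the estimators implemented by any of the packages named in the text; the function name and the data are hypothetical.

```python
def windowed_fst(positions, freqs_pop1, freqs_pop2, window=50_000):
    """Per-window F_ST for two populations as a ratio of summed components.

    positions: SNP coordinates (bp); freqs_pop1/freqs_pop2: allele frequencies.
    Uses non-overlapping windows and equal population weights
    (assumptions of this sketch).
    """
    sums = {}
    for pos, p1, p2 in zip(positions, freqs_pop1, freqs_pop2):
        p_bar = (p1 + p2) / 2
        h_t = 2 * p_bar * (1 - p_bar)                       # total heterozygosity
        h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # mean within-pop
        start = (pos // window) * window                    # window start coordinate
        num, den = sums.get(start, (0.0, 0.0))
        sums[start] = (num + (h_t - h_s), den + h_t)
    return {
        start: (num / den if den > 0 else float("nan"))
        for start, (num, den) in sorted(sums.items())
    }


positions = [1_000, 20_000, 60_000, 90_000]
pop1 = [0.5, 0.4, 0.9, 0.8]
pop2 = [0.5, 0.6, 0.1, 0.2]
for start, fst in windowed_fst(positions, pop1, pop2).items():
    print(f"window starting at {start}: F_ST = {fst:.3f}")
```

In this toy input the second window stands out with a much higher F_ST than the first, which is the kind of signal a genome scan looks for. In practice such outliers would still need to be compared against the genome-wide distribution before being interpreted.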

Applications in Population Genetics

Human Population Studies

In human population genetics, F-statistics have been instrumental in quantifying the apportionment of genetic diversity across global populations. A seminal analysis by Lewontin in 1972, based on 17 genetic markers from diverse human groups, revealed that approximately 85% of variation occurs within local populations, with only about 15% distributed between populations, corresponding to an overall F_ST of roughly 0.15. This finding underscored the limited genetic differentiation among humans compared to other species, emphasizing shared ancestry despite geographic separation.

At the continental scale, F_ST values between major human groups typically range from 0.10 to 0.12, indicating moderate differentiation driven by historical migration and drift. Within continents, these values drop significantly, often below 0.05, reflecting ongoing gene flow and recent shared histories. Such patterns highlight how F_ST captures the subtle structuring of human diversity, with higher differentiation in comparisons involving African populations due to their deeper ancestral roots.

F-statistics have illuminated key aspects of human migration history, including the Out-of-Africa expansion. Gradients in F_ST values, showing increasing differentiation with geographic distance from Africa, support a serial founder model in which migrating groups experienced successive bottlenecks, reducing diversity outward from the continent. In admixed populations like African Americans, F_ST analyses reveal complex ancestry proportions, with typical values around 0.008 between African Americans and West African reference groups, reflecting 15-25% European admixture from historical events. These case studies demonstrate F_ST's utility in tracing admixture events and migration routes.

Modern genomic datasets, such as those from the 1000 Genomes Project, refine these insights with high-resolution F_ST estimates, revealing subtle subcontinental structure, with values of 0.056 to 0.063 reported between pairs of broad continental superpopulations. These lower figures, influenced by dense SNP coverage and rare variant effects, confirm the overall low level of human differentiation while highlighting fine-scale patterns, such as elevated F_ST in isolated groups.

Conservation and Evolutionary Biology

In conservation genetics, F-statistics play a crucial role in assessing population fragmentation and inbreeding in endangered species. For instance, pairwise F_ST values among cheetah (Acinonyx jubatus) subspecies often exceed 0.2, with the highest recorded at 0.497 between the Northwest African A. j. hecki and the Asiatic A. j. venaticus, signaling severe isolation and elevated risks of inbreeding depression due to reduced gene flow. These high F_ST estimates, derived from genome-wide data, underscore the need for subspecies-specific management strategies to prevent further loss of genetic diversity in the remaining populations.

F-statistics also facilitate evolutionary inferences, such as estimating population divergence times under drift models without mutation, where F_ST reflects the accumulation of genetic differences over time since isolation. In addition, elevated F_ST in specific genomic regions can reveal barriers to gene flow, as seen in butterflies, where F_ST outliers identify "genomic islands of divergence" indicative of restricted migration between species. Such applications extend to plants and animals, helping delineate evolutionary boundaries shaped by ecological or geographic constraints.

Representative examples highlight varying F_ST levels across taxa. In island endemics like Darwin's finches (Geospiza spp.), moderate mean F_ST values of around 0.057 across species reflect interisland differentiation driven by limited dispersal and historical radiation, with higher values (e.g., 0.125 in the warbler finch) pointing to localized isolation. Conversely, many marine species exhibit low F_ST (often <0.01) due to extensive larval dispersal; for example, teleost fishes like Atlantic cod (Gadus morhua) show minimal differentiation across broad ranges, consistent with high connectivity despite geographic separation.

F-statistics are also used alongside clustering approaches in tools like the STRUCTURE software, which uses multilocus genotypes to detect population clusters and admixture, aiding conservation planning by identifying distinct evolutionary units for protection. This Bayesian clustering method, applied to non-human species, complements F_ST by revealing subtle structure in fragmented habitats, as in studies of hybrid zones and migrant detection.
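The drift-only relationship between F_ST and divergence time mentioned above can be made concrete. Under a standard Wright-Fisher model with complete isolation, no mutation, and a constant effective size N, identity by descent accumulates as F_t = 1 - (1 - 1/(2N))^t, which gives t ≈ -2N ln(1 - F_ST) when solved for time. The sketch below uses that textbook result; the function names are hypothetical, and the estimate is only as good as the assumed N and the assumption that no gene flow has occurred since the split.

```python
import math


def expected_fst(generations, n_e):
    """Expected F_ST after t generations of pure drift in isolates of size N."""
    return 1 - (1 - 1 / (2 * n_e)) ** generations


def divergence_time(fst, n_e):
    """Approximate generations since isolation: t = -2N * ln(1 - F_ST)."""
    if not 0 <= fst < 1:
        raise ValueError("F_ST must lie in [0, 1)")
    return -2 * n_e * math.log(1 - fst)


fst = expected_fst(generations=500, n_e=1000)
print(f"F_ST after 500 generations at N=1000: {fst:.3f}")
print(f"recovered divergence time: {divergence_time(fst, 1000):.0f} generations")
```

Because time and population size enter only through the ratio t/(2N), the same F_ST is compatible with a short split between small populations or a long split between large ones, which is why an independent estimate of N is needed.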

Limitations and Considerations

Assumptions and Biases

F-statistics rely on several key assumptions to accurately measure population differentiation. Primarily, they assume neutral evolution, where genetic variation among populations arises solely from genetic drift and gene flow, without confounding effects from natural selection or mutation biases that could systematically alter allele frequencies. Additionally, the model presumes random sampling of individuals from discrete populations, with no substructure within sampling units and Mendelian inheritance at the loci studied. These assumptions underpin the interpretation of F_ST as the proportion of genetic variance attributable to between-population differences under equilibrium conditions.

Violations of these assumptions can significantly bias F_ST estimates. For instance, deviations from neutrality due to balancing selection, which maintains polymorphism within populations through mechanisms like heterozygote advantage or frequency-dependent selection, typically deflate F_ST by elevating within-subpopulation heterozygosity relative to the total. Conversely, positive or divergent selection can inflate F_ST at affected loci by accelerating allele frequency divergence. Mutation biases, such as those favoring certain alleles, or non-random sampling (e.g., due to family structure) can also inflate estimates by mimicking drift-induced variance. Such violations highlight the importance of testing neutrality at candidate loci, often through comparisons with genome-wide neutral expectations.

Several biases further compromise the reliability of F-statistics. Ascertainment bias is prevalent in single nucleotide polymorphism (SNP) data, where markers are selected for being polymorphic in a discovery panel; this skews the data toward common alleles and can systematically bias F_ST across populations. Small sample sizes exacerbate upward bias in some estimators, particularly when subpopulation sizes are unequal, as rare alleles are more prone to apparent fixation or loss in the sample, inflating apparent differentiation.

Statistical challenges arise from the non-normal distribution of F_ST under finite sample sizes and complex demography, which violates assumptions in likelihood-based methods. Consequently, permutation tests are recommended to evaluate significance, reshuffling alleles or individuals among populations to generate empirical null distributions and assess whether the observed differentiation exceeds chance expectations. Linkage disequilibrium among loci reduces their effective independence, inflating the variance of multi-locus F_ST estimates and potentially overstating significance if linked markers are not accounted for in analyses. As noted under estimation methods, bias corrections such as weighting by sample size can mitigate some of these issues but require careful implementation.
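The permutation approach described above can be sketched for the simplest case of two populations and one diallelic locus, with each sampled gene copy coded as 0 or 1. Population labels are reshuffled among the pooled alleles, F_ST is recomputed each time, and the p-value is the share of permutations with a statistic at least as large as the observed one. The heterozygosity-based F_ST used here, the function names, the fixed random seed, and the data are assumptions of this example and do not correspond to a particular package's procedure.

```python
import random


def fst_two_pops(a, b):
    """Heterozygosity-based F_ST for two samples of 0/1-coded alleles."""
    p1, p2 = sum(a) / len(a), sum(b) / len(b)
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0                             # monomorphic: no differentiation
    h_s = p1 * (1 - p1) + p2 * (1 - p2)        # mean of 2p(1-p) over two pops
    return 1 - h_s / h_t


def permutation_p_value(a, b, n_perm=999, seed=0):
    """Return (observed F_ST, one-sided permutation p-value)."""
    rng = random.Random(seed)                  # fixed seed for reproducibility
    observed = fst_two_pops(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                    # reassign alleles to populations
        if fst_two_pops(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Count the observed arrangement itself so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


pop_a = [1] * 18 + [0] * 2    # allele frequency 0.9
pop_b = [1] * 4 + [0] * 16    # allele frequency 0.2
obs, p = permutation_p_value(pop_a, pop_b)
print(f"observed F_ST = {obs:.3f}, permutation p-value = {p:.3f}")
```

Shuffling alleles, as done here, tests differentiation under the additional assumption of random mating within populations; shuffling whole individuals (genotypes) instead avoids that assumption and is generally preferred when genotype data are available.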

Alternative Measures

While F-statistics provide a foundational for assessing , alternative measures have been developed to address specific limitations in scenarios involving multi-allelic loci, high rates, or complex evolutionary histories. These alternatives often emphasize different aspects of , such as allelic richness or distance-based variances, offering complementary insights into structure. One prominent alternative is Jost's D, introduced to correct the underestimation of differentiation by F_{ST} in systems with multiple alleles per locus. Unlike F_{ST}, which is based on heterozygosity and can saturate at high levels of differentiation, Jost's D quantifies the standardized difference in allelic diversity between populations, providing a more unbiased estimate when numbers are high. The for Jost's D is given by D = \frac{n (H_T - H_S)}{(n-1) (1 - H_S)}, where n is the number of subpopulations, H_T is the total genetic diversity across all subpopulations, and H_S is the average genetic diversity within subpopulations; here, diversity is typically measured as expected heterozygosity (or equivalent measures like 1 minus the probability of identity by descent) to emphasize allelic turnover rather than raw heterozygosity. This measure ranges from 0 (no differentiation) to 1 (complete differentiation) and performs better under the infinite alleles model with high mutation rates. Other indices include Nei's G_{ST}, an analog to F_{ST} that extends gene diversity partitioning to multi-allelic data by calculating the proportion of total attributable to between-population differences as G_{ST} = (H_T - H_S)/H_T, where H_T and H_S are gene diversities. For distance-based analyses, particularly with molecular data like haplotypes or sequences, \Phi_{ST} from analysis of molecular variance (AMOVA) serves as an F_{ST} equivalent, incorporating genetic distances to partition variance among populations and accounting for phylogenetic relationships among . 
In studies of admixture and complex demographic histories, f_4-statistics are used within admixture graph frameworks to detect gene flow by evaluating correlations in allele frequencies across four populations, with a significantly nonzero f_4(A,B;C,D) indicating gene flow events that violate a tree-like population history.
Alternatives like Jost's D are particularly useful when F_{ST} fails due to high mutation rates, which increase within-population diversity and cause F_{ST} to underestimate true differentiation, or unequal allele frequencies that bias heterozygosity-based metrics. For instance, in microbial or highly mutable systems, D better captures allelic divergence without saturation effects. Similarly, \Phi_{ST} is preferred for non-additive distance data, while f_4-statistics excel in reconstructing admixture graphs for species with reticulate evolution, such as humans.
Comparisons between F_{ST} and Jost's D show that, for multi-allelic loci, F_{ST} plateaus (saturates) at high levels of differentiation and can remain below 0.3 even when over 80% of allelic diversity is partitioned between populations, whereas D continues to increase monotonically toward 1, providing a more sensitive measure of extreme isolation. This difference arises because heterozygosity-based F_{ST} cannot exceed 1 - H_S, so its upper bound shrinks as within-population heterozygosity rises with allelic richness, while D scales directly with differences in the effective number of alleles. Some simulation studies report that D correlates more strongly with actual gene flow rates under diverse mutation-drift equilibria.
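The f_4-statistic mentioned in this section can be sketched in a few lines, assuming the commonly used definition of f_4(A,B;C,D) as the average over loci of (p_A - p_B)(p_C - p_D), where p_X is the allele frequency in population X. The function name and the input layout (one list of per-locus frequencies per population) are assumptions for illustration. The sketch computes only the point estimate and does not assess significance, which in practice requires a standard error estimated across the genome.

```python
def f4(freq_a, freq_b, freq_c, freq_d):
    """Point estimate of f_4(A, B; C, D).

    Each argument is a list of allele frequencies for one population,
    one value per locus, with loci in the same order in all four lists.
    """
    loci = list(zip(freq_a, freq_b, freq_c, freq_d))
    if not loci:
        raise ValueError("at least one locus is required")
    # Average the product of the two allele-frequency differences over loci
    return sum((a - b) * (c - d) for a, b, c, d in loci) / len(loci)


# Hypothetical frequencies at two loci, made up for this example
pop_a = [0.2, 0.8]
pop_b = [0.4, 0.6]
pop_c = [0.1, 0.9]
pop_d = [0.5, 0.5]
print(f4(pop_a, pop_b, pop_c, pop_d))  # approximately 0.08
```

Under a strictly tree-like history in which A and B form one pair and C and D the other, the two differences are uncorrelated and the expected value is zero, so a consistent departure from zero is what points to gene flow.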
