Genetic distance

Genetic distance is a quantitative measure in population genetics of the extent of genetic divergence between populations, individuals, or species, typically based on differences in allele frequencies, sequence identities, or shared alleles across loci. It summarizes accumulated genetic differences and reflects evolutionary processes such as mutation, genetic drift, and selection acting over time.

One of the most widely used measures is Nei's standard genetic distance, introduced in 1972, which defines the distance D as D = -\ln I, where I is the normalized identity of genes between two populations, ranging from 1 (identical populations) to 0 (no shared alleles). D therefore ranges from 0 to \infty. If the rate of gene substitution is constant, D is linearly related to the time since divergence, and the measure applies to diverse organisms regardless of ploidy or mating system. Other common metrics include the shared alleles distance, calculated as one minus half the average number of shared alleles per locus, which is particularly useful for assessing individual-level similarity in human studies.

Genetic distances are used to estimate divergence times, construct phylogenetic trees, and test hypotheses such as isolation by distance, in which genetic differentiation correlates with geographic separation. In human genetics, they help evaluate population structure and the probability that individuals from different groups are more genetically similar than individuals from the same group, informing biomedical research and ancestry inference. The measures differ in their statistical robustness, with Nei's distance showing favorable properties for moderate sample sizes in evolutionary analyses.

Fundamentals

Definition and Biological Basis

Genetic distance is a quantitative measure of genetic divergence between two populations, individuals, or biological sequences, reflecting the accumulation of differences in their genetic material over time. This divergence arises primarily from four evolutionary processes: mutation, which introduces new genetic variants; genetic drift, which causes random fluctuations in allele frequencies; natural selection, which favors certain variants; and gene flow, which exchanges genetic material between populations. Genetic distance quantifies variation at the genotypic level (the underlying DNA sequences and allele frequencies) rather than in phenotypic traits, which are observable characteristics influenced by both genotype and environment.

The biological basis of genetic distance lies in the genetic variation within and between organisms, originating from differences in the nucleotide sequences of DNA or in the frequencies of alleles at specific loci. Under neutral evolution, where mutations are neither advantageous nor disadvantageous, genetic distance serves as a proxy for the time elapsed since two lineages diverged, because neutral mutations accumulate at a relatively constant rate. This accumulation reflects the balance between mutation introducing variation and genetic drift fixing or eliminating it in isolated populations, which produces measurable divergence even without directional selection or migration.

The concept traces its origins to the work of Sewall Wright in the 1920s, particularly his development of inbreeding coefficients in studies of livestock breeding, which laid the groundwork for quantifying genetic structure across populations. His F-statistics, introduced to describe correlations between uniting gametes and population subdivision, provided an early framework for assessing genetic differences among subpopulations.

Genetic distance is a relative rather than absolute measure. For sequence data it is often expressed in units such as substitutions per site, which normalizes for sequence length and enables comparisons across diverse taxa.

Importance in Genetics

Genetic distance is a fundamental tool in population genetics for quantifying biodiversity: by measuring the extent of genetic differentiation among populations or species, it helps assess overall genetic diversity and variability within ecosystems. It enables researchers to infer evolutionary histories through the reconstruction of phylogenetic relationships, revealing patterns of divergence over time based on accumulated genetic differences. It also helps identify barriers to gene flow, such as geographic isolation or ecological factors, because elevated differentiation indicates restricted genetic exchange between groups.

In conservation genetics, genetic distance is used to evaluate divergence in threatened species, allowing distinct populations to be prioritized as Evolutionarily Significant Units so that unique genetic lineages are not lost; in the mountain pygmy-possum, for example, it guided successful genetic rescue efforts. In forensic genetics, recombination (map) distances between short tandem repeat (STR) loci are used to refine kinship analyses, improving accuracy in paternity and complex relatedness determinations without relying on extensive family studies. In agriculture, genetic distance quantifies differentiation among livestock breeds using metrics such as FST, informing breeding programs and conservation strategies; for instance, studies of sheep and other livestock breeds have reported FST values ranging from 0.06 to 0.17, highlighting breed-specific relationships relevant to sustainable management. A further application is tracing human migrations: Y-chromosome genetic distances have revealed multiple early movements of modern humans, including northward expansions dated to 25-30 thousand years ago.

The evolutionary significance of genetic distance lies in its support for the molecular clock hypothesis, which posits that genetic differences between species accumulate at a relatively constant rate, making distance a proxy for the time since divergence from a common ancestor. Under this framework, greater genetic distances correspond to longer divergence times, enabling estimates of evolutionary timescales; for example, cytochrome b genes in birds diverge at approximately 2% per million years, which allows phylogenetic events to be calibrated.

Unlike phenotypic distance, which reflects observable trait differences shaped by environmental and gene-environment effects, genetic distance provides a more objective measure of divergence based on genome-wide molecular markers. It is therefore preferred for deep-time evolutionary comparisons, where phenotypic data may mislead because genotype-phenotype correlations are low.

Measurement Methods

Genetic Markers

Genetic markers are variable DNA sequences or proteins used to detect genetic differences among individuals and populations, providing the empirical foundation for measuring genetic distances. These markers capture polymorphisms that reflect evolutionary processes such as mutation, drift, and gene flow, enabling inferences about relatedness without relying on phenotypic traits.

Early genetic markers relied on protein electrophoresis, introduced in the 1960s, which analyzed allozymes: enzymatic variants encoded by different alleles at protein-coding loci. Allozymes required no prior genomic knowledge and allowed direct assessment of functional genetic variation, but they showed low polymorphism (being limited to coding regions) and gave no information on non-coding diversity, making them less suitable for fine-scale population studies. By the 1970s and 1980s, the shift to DNA-based markers began with restriction fragment length polymorphisms (RFLPs), which detected sequence variation via enzyme digestion but were labor-intensive. The invention of the polymerase chain reaction (PCR) in the 1980s revolutionized marker development and paved the way for more accessible techniques.

In the 1990s, dominant markers such as randomly amplified polymorphic DNA (RAPD) and amplified fragment length polymorphism (AFLP) gained prominence for their simplicity and because they needed no prior sequence information; RAPDs used short arbitrary primers for quick amplification, while AFLPs combined restriction digestion with selective amplification for higher multiplex ratios. Both had drawbacks: RAPDs suffered from low reproducibility due to sensitivity to PCR conditions, and as dominant markers neither could distinguish heterozygotes from homozygotes. They subsequently declined in favor of co-dominant alternatives.

Microsatellites, or short tandem repeats (STRs), emerged as highly polymorphic co-dominant markers, offering superior resolution for parentage and relatedness analysis owing to their multi-allelic nature (often 10-20 alleles per locus), though they are prone to homoplasy from mutational slippage and are unevenly distributed across the genome. Single nucleotide polymorphisms (SNPs), biallelic substitutions abundant across the genome, are stable and can be genotyped at high throughput on arrays; they excel in large-scale population structure studies but require more loci for equivalent power because each locus is less variable. Mitochondrial DNA (mtDNA) serves as a uniparental marker with a high mutation rate, and its non-recombining, haploid inheritance facilitates detection of recent demographic events and phylogeographic patterns; however, it underrepresents nuclear diversity, is susceptible to symbiont-induced selective sweeps, and may not reflect biparental history. The advent of next-generation sequencing (NGS) in the mid-2000s enabled whole-genome sequencing, offering comprehensive polymorphism detection at the price of high computational and financial cost.

Marker selection emphasizes three criteria: neutrality, to avoid biases from selection; high polymorphism, to ensure sufficient variability for differentiation; and minimal linkage disequilibrium (LD), to promote independence among loci, ideally by choosing markers from unlinked genomic regions where recombination has broken down associations. Neutral markers, such as many non-coding SNPs or microsatellites, are better proxies for demographic processes, while polymorphic loci (e.g., those with heterozygosity > 0.5) maximize statistical power. Avoiding LD, assessed via metrics such as r² < 0.2, prevents overestimation of structure from correlated alleles. These criteria guide the design of marker panels, balancing resolution against feasibility in empirical studies. The resulting markers feed into distance calculations via allele counting or frequency estimation.
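
As an illustration of the LD criterion, the following minimal sketch computes r² for a pair of biallelic loci from phased haplotype counts, using the standard definition r² = D² / (p_A(1 - p_A) p_B(1 - p_B)) with D = p_AB - p_A p_B. The function name, the count-based input, and the helper that applies the 0.2 threshold are assumptions made for this example, not part of any particular software package.

```python
def ld_r_squared(n_AB, n_Ab, n_aB, n_ab):
    """r^2 between two biallelic loci from counts of the four haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / total            # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total    # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / total    # frequency of allele B at locus 2
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("at least one locus is monomorphic")
    d = p_AB - p_A * p_B           # coefficient of linkage disequilibrium
    return d * d / denom


def approximately_independent(r2, threshold=0.2):
    """Hypothetical helper: apply an r^2 cutoff such as the 0.2 mentioned above."""
    return r2 < threshold


print(ld_r_squared(50, 0, 0, 50))    # complete association between the loci
print(ld_r_squared(25, 25, 25, 25))  # no association
```

In practice genotype data are often unphased, in which case haplotype frequencies must first be estimated; this sketch assumes they are already known.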

Calculation Processes

The calculation of genetic distances begins with the collection of raw genetic data, typically from genetic markers such as single nucleotide polymorphisms (SNPs) or microsatellites, obtained through genotyping or sequencing across populations or individuals. For sequence-based data, the first step is sequence alignment, which identifies homologous positions and accounts for insertions or deletions so that comparable sites are analyzed. For allele frequency data, frequencies are estimated for each locus within each population, often by direct allele counting.

Data preparation is crucial and includes handling missing data, which can be addressed through pairwise deletion (excluding sites with missing values only from the affected comparisons), mean imputation, or model-based methods, in order to minimize bias in distance estimates. For analyses requiring rooted phylogenetic trees, an outgroup (a distantly related taxon) is selected to define the root position, providing a reference for evolutionary directionality.

Distances can be computed pairwise, comparing two entities at a time, and either per locus or as multi-locus distances, which aggregate information across loci to yield a composite measure per pair and reduce the variance caused by single-locus sampling. To account for multiple substitutions at the same site (multiple hits) in sequence data, corrections are applied using probabilistic substitution models, such as the Jukes-Cantor model, to estimate the underlying number of events from the observed differences. These models assume that mutations are neutral, meaning that evolutionary changes occur without selective pressure, an assumption that underpins many distance calculations. The process culminates in a symmetric distance matrix in which each entry is the corrected genetic divergence between a pair of taxa or populations.
Statistical robustness is enhanced through techniques like bootstrapping, where data are resampled with replacement multiple times (typically 1,000 or more replicates) to generate a distribution of distance estimates, from which confidence intervals are derived to assess uncertainty. These workflows are commonly implemented in phylogenetic software pipelines that integrate data preprocessing, model application, and matrix output, facilitating downstream analyses like tree construction.
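
The workflow described above (pairwise deletion of missing sites, a symmetric distance matrix, and bootstrap resampling of alignment columns) can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the uncorrected p-distance rather than a model-based correction, treats `-`, `N`, and `?` as missing data, and derives a simple percentile interval; all function names are hypothetical.

```python
import random

MISSING = {"-", "N", "?"}  # assumed codes for gaps and unknown bases


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping sites missing in either sequence."""
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a in MISSING or b in MISSING:
            continue  # pairwise deletion: drop the site for this pair only
        compared += 1
        differing += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def distance_matrix(seqs):
    """Symmetric matrix as a dict keyed by (name, name)."""
    return {(x, y): p_distance(seqs[x], seqs[y]) for x in seqs for y in seqs}


def bootstrap_interval(seq_a, seq_b, replicates=1000, alpha=0.05, seed=0):
    """Percentile interval from resampling alignment columns with replacement."""
    rng = random.Random(seed)
    n = len(seq_a)
    estimates = []
    for _ in range(replicates):
        cols = [rng.randrange(n) for _ in range(n)]
        estimates.append(p_distance([seq_a[c] for c in cols],
                                    [seq_b[c] for c in cols]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * replicates)]
    upper = estimates[int((1 - alpha / 2) * replicates) - 1]
    return lower, upper


seqs = {"A": "ACGTACGTAC", "B": "ACGTTCGTAC", "C": "AC-TACGAAC"}
matrix = distance_matrix(seqs)
print(matrix[("A", "B")], matrix[("A", "C")])
print(bootstrap_interval(seqs["A"], seqs["B"]))
```

Resampling columns rather than individual bases keeps homologous sites paired, which is the usual convention for alignment bootstraps.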

Specific Measures

Nucleotide Sequence Distances

Nucleotide sequence distances quantify evolutionary divergence between aligned DNA sequences by correcting the raw proportion of differing sites for multiple, unobserved substitutions that occur over time. These model-based measures treat nucleotide substitution as a Markov process in which each site evolves independently. By estimating the expected number of substitutions per site, they provide a more accurate gauge of genetic distance than the uncorrected p-distance; the correction is negligible for very closely related sequences and grows in importance as divergence increases and multiple hits begin to mask true differences.

The Jukes-Cantor model, introduced in 1969, is the simplest such approach, positing equal nucleotide frequencies (0.25 each) and a uniform substitution rate among all nucleotide pairs. Correcting for unobserved changes yields the formula d = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p \right), where p is the observed proportion of differing sites between two sequences. The model assumes stationarity (constant base composition over evolutionary time) and time-reversibility, meaning the substitution process is statistically the same forward and backward in time. These assumptions permit closed-form estimation but limit applicability when base frequencies are biased or rates are heterogeneous.

Building on this, the Kimura two-parameter (K2P) model, proposed in 1980, accounts for the fact that transitions (A↔G or C↔T) typically occur at higher rates than transversions (A↔C, A↔T, G↔C, G↔T). It separates the observed differences into P (the proportion of sites showing transitions) and Q (the proportion showing transversions), with the distance calculated as d = -\frac{1}{2} \ln\left( (1 - 2P - Q) \sqrt{1 - 2Q} \right). Like the Jukes-Cantor model, it relies on stationarity and reversibility, but it is more accurate for nucleotide data exhibiting transition bias, a common pattern in vertebrate mitochondrial DNA (mtDNA) analyses.
The Kimura three-parameter model, an extension from 1981, further relaxes these assumptions by distinguishing three substitution rate classes: transitions and two separate classes of transversions (rates often denoted α for transitions and β and γ for the two transversion classes). The distance is d = -\frac{1}{4} \ln\left( (1 - 2P - 2Q_1)(1 - 2P - 2Q_2)(1 - 2Q_1 - 2Q_2) \right), where P, Q_1, and Q_2 are the observed proportions of sites showing transitions and each of the two transversion types. Setting Q_1 = Q_2 = Q/2 recovers the two-parameter formula. The model retains stationarity and reversibility while capturing more of the variation in substitution patterns, and it remains computationally tractable for pairwise comparisons.

These models are particularly suited to mtDNA phylogenetics because of the elevated substitution rates and transition biases of mitochondrial genomes, enabling robust divergence estimates in population and evolutionary studies. For highly divergent sequences, however, the corrections break down: as observed differences saturate (under Jukes-Cantor, as p approaches 0.75), the argument of the logarithm approaches zero, the estimate acquires a very large variance, and the distance becomes undefined at or beyond that limit. Ignoring rate variation across sites additionally causes true distances to be underestimated, so the models are less reliable for deep divergences without further adjustment.
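
The Jukes-Cantor and Kimura two-parameter corrections can be sketched in a few lines. The code below is a minimal illustration, assuming gap-free handling by simply skipping any site that is not A, C, G, or T; the function names are chosen for this example. It uses the formulas d = -(3/4) ln(1 - 4p/3) and d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)).

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_proportions(seq_a, seq_b):
    """Return (P, Q): proportions of sites showing transitions and transversions."""
    sites = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    return transitions / sites, transversions / sites


def jukes_cantor(p):
    """Jukes-Cantor distance from the observed proportion of differing sites."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated, distance undefined")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """Kimura two-parameter distance from transition and transversion proportions."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("logarithm argument is non-positive: distance undefined")
    return -0.5 * math.log(a * math.sqrt(b))


P, Q = transition_transversion_proportions("AACCGGTTAC", "GACAGGTTAC")
print(P, Q, jukes_cantor(P + Q), kimura_2p(P, Q))
```

When transitions make up exactly one third of the observed differences (P = p/3, Q = 2p/3), the two formulas give the same value, which is a convenient sanity check on an implementation.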

Allele Frequency Distances

Allele frequency distances quantify genetic differentiation between populations by comparing the proportions of alleles at multiple loci, typically using codominant markers such as allozymes or microsatellites. These measures are particularly useful in multilocus studies where the data are frequency distributions rather than aligned sequences, allowing assessment of differentiation due to drift, migration, or selection. Because they are computed from allele frequencies, they give a probabilistic view of population structure without requiring assumptions about mutation mechanisms beyond basic models.

Nei's standard genetic distance, introduced in 1972, is designed for codominant alleles and assumes the infinite alleles model, in which each mutation produces a new allele. For a single locus the formula is D = -\ln\left( \frac{\sum_i a_{i1} a_{i2}}{\sqrt{\left( \sum_i a_{i1}^2 \right) \left( \sum_i a_{i2}^2 \right)}} \right), where a_{ij} is the frequency of the i-th allele in population j. The argument of the logarithm is the normalized identity I = J_{XY} / \sqrt{J_X J_Y}. For multilocus data, the sums J_X, J_Y, and J_{XY} are each averaged over loci before the ratio and the logarithm are taken. This distance estimates the average number of allele substitutions per locus and is well suited to allozyme data, because its logarithmic scale is expected to increase approximately linearly with evolutionary time.

Nei's D_A distance, from 1983, is formulated as D_A = 1 - \frac{1}{L} \sum_{l=1}^L \sum_i \sqrt{a_{i1} a_{i2}}, where L is the number of loci and the inner sum runs over the alleles at locus l. It is a bounded measure, ranging from 0 to 1, that has proved robust for phylogenetic reconstruction from microsatellite data. It is distinct from the sample-size-corrected estimator of the standard distance, which Nei published in 1978 to reduce bias in gene identity estimates from small samples.

Reynolds, Weir, and Cockerham's distance, proposed in 1983, is based on coancestry coefficients and focuses on short-term divergence under drift alone.
The underlying coancestry coefficient, often denoted \theta, is commonly estimated for a pair of populations as \theta = \frac{\sum_l \sum_i (p_{1li} - p_{2li})^2}{2 \sum_l \left(1 - \sum_i p_{1li} p_{2li}\right)}, where p_{jli} is the frequency of allele i at locus l in population j. The distance is then often reported as D = -\ln(1 - \theta), which is expected to increase roughly linearly with time under pure drift. It provides a standardized estimate of differentiation suitable for allozyme surveys emphasizing recent population history.

A simpler alternative, Nei's minimum genetic distance from 1973, avoids the logarithmic transformation of the standard version: D_m = \frac{J_X + J_Y}{2} - J_{XY}, where J_X and J_Y are the average gene identities within the two populations and J_{XY} is the average gene identity between them. This variant is computationally straightforward and applicable to allozyme data for preliminary assessments of divergence.
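
The Nei distances can be computed directly from per-locus allele frequency tables. The sketch below is illustrative only: it assumes each population is given as a list of dictionaries (one per locus, mapping allele names to frequencies), uses the standard identity form I = J_XY / sqrt(J_X J_Y) with the J terms averaged over loci, and applies no sample-size correction. The function names are hypothetical.

```python
import math


def _gene_identities(pop_x, pop_y):
    """Return (J_X, J_Y, J_XY), each averaged over loci."""
    jx = jy = jxy = 0.0
    for fx, fy in zip(pop_x, pop_y):
        alleles = set(fx) | set(fy)
        jx += sum(fx.get(a, 0.0) ** 2 for a in alleles)
        jy += sum(fy.get(a, 0.0) ** 2 for a in alleles)
        jxy += sum(fx.get(a, 0.0) * fy.get(a, 0.0) for a in alleles)
    n_loci = len(pop_x)
    return jx / n_loci, jy / n_loci, jxy / n_loci


def nei_standard(pop_x, pop_y):
    """Nei's standard distance D = -ln(J_XY / sqrt(J_X * J_Y))."""
    jx, jy, jxy = _gene_identities(pop_x, pop_y)
    if jxy == 0:
        return math.inf  # no shared alleles at any locus
    return -math.log(jxy / math.sqrt(jx * jy))


def nei_minimum(pop_x, pop_y):
    """Nei's minimum distance D_m = (J_X + J_Y) / 2 - J_XY."""
    jx, jy, jxy = _gene_identities(pop_x, pop_y)
    return (jx + jy) / 2 - jxy


def nei_da(pop_x, pop_y):
    """Nei's D_A = 1 - (1/L) * sum over loci and alleles of sqrt(x_i * y_i)."""
    total = 0.0
    for fx, fy in zip(pop_x, pop_y):
        total += sum(math.sqrt(fx.get(a, 0.0) * fy.get(a, 0.0))
                     for a in set(fx) | set(fy))
    return 1 - total / len(pop_x)


x = [{"A": 0.5, "a": 0.5}]   # one locus, two alleles at equal frequency
y = [{"A": 1.0}]             # same locus, fixed for allele A
print(nei_standard(x, y), nei_minimum(x, y), nei_da(x, y))
```

For the toy populations above, J_X = 0.5, J_Y = 1, and J_XY = 0.5, so the standard distance equals (1/2) ln 2 and the minimum distance equals 0.25.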

Geometric and Other Distances

Geometric distances in population genetics treat allele frequency data as points in a multidimensional space, where each population is represented by a vector of frequencies for multiple loci or alleles. This geometric interpretation allows standard distance metrics, such as those derived from Euclidean geometry, to be used to quantify divergence. These measures are particularly useful for visualizing population structure with ordination techniques, and they build on the same frequency data used by allele frequency distances.

One seminal geometric measure is the Cavalli-Sforza chord distance, introduced in 1967, which models allele frequencies as points on a hypersphere to respect the constraint that frequencies sum to 1. Taking the square roots of the frequencies places each population on the unit hypersphere, and the distance is the length of the chord between the two points. In its commonly cited per-locus form, D = \frac{2}{\pi} \sqrt{2 \left(1 - \sum_i \sqrt{x_i y_i}\right)}, where x_i and y_i are the frequencies of allele i at the locus in the two populations; multilocus values are obtained by combining loci. The measure assumes that genetic differences arise primarily from random genetic drift, and the scaling is chosen so that complete gene substitution corresponds roughly to a distance of 1. For small divergences under pure drift, the squared chord distance increases approximately in proportion to the fixation index F_{ST}.

The Euclidean distance offers a simpler geometric approach, computing the straight-line separation between frequency vectors: D = \sqrt{\sum_i (x_i - y_i)^2}, where x_i and y_i are allele frequencies across loci. This measure is straightforward and computationally efficient, but it treats frequencies as independent coordinates without enforcing the simplex constraint, which can inflate distances when the number of alleles varies across loci. It was widely applied in early population genetic studies because it is easy to interpret in low-dimensional projections.

Rogers' distance, proposed in 1972, is a normalized variant of the Euclidean distance. The formula is D = \frac{1}{L} \sum_l \sqrt{\frac{1}{2} \sum_i (x_{li} - y_{li})^2}, where L is the number of loci; the factor of one half bounds the per-locus value between 0 and 1. Developed for allozyme data, it performs well for codominant markers and shows strong correlations with tree-building methods under drift models, though it can be sensitive to sampling error in rare alleles.

Other geometric and alternative distances include the Czekanowski distance, a Manhattan-like metric that sums absolute differences normalized by the total, given by D = \frac{\sum_i |x_i - y_i|}{\sum_i (x_i + y_i)}; it is also used for abundance data and reduces to the Sørensen-Dice dissimilarity for binary presence/absence data. For repeat-length data, particularly microsatellites, the Goldstein distance (1995) follows the stepwise mutation model by averaging squared differences in mean repeat length across loci: (\delta \mu)^2 = \frac{1}{L} \sum_l (\mu_l - \nu_l)^2, where \mu_l and \nu_l are the mean repeat counts at locus l in the two populations. This measure is effective for estimating divergence times but assumes symmetric stepwise mutation.

These simple geometric formulas share limitations: they are sensitive to the number of alleles per locus (distances tend to increase with more alleles even under similar divergence), and they lack any correction for the mutation process, making them less suitable for highly polymorphic markers without normalization.
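
The geometric measures are simple enough to compare side by side in code. The sketch below is a minimal illustration; it assumes the commonly cited forms, namely the chord distance with the 2/π scaling averaged over loci, Rogers' distance as the per-locus square root of half the summed squared differences averaged over loci, and the Goldstein distance as a mean over loci. Populations are lists of per-locus dictionaries mapping alleles to frequencies, and all names are chosen for this example.

```python
import math


def _pairs(pop_x, pop_y):
    """Yield, per locus, a list of (x_i, y_i) frequency pairs over all alleles."""
    for fx, fy in zip(pop_x, pop_y):
        yield [(fx.get(a, 0.0), fy.get(a, 0.0)) for a in set(fx) | set(fy)]


def euclidean(pop_x, pop_y):
    """Straight-line distance over all alleles at all loci."""
    return math.sqrt(sum((x - y) ** 2
                         for locus in _pairs(pop_x, pop_y) for x, y in locus))


def chord(pop_x, pop_y):
    """Cavalli-Sforza chord distance, (2/pi)-scaled and averaged over loci."""
    total = 0.0
    for locus in _pairs(pop_x, pop_y):
        cos_theta = min(1.0, sum(math.sqrt(x * y) for x, y in locus))
        total += (2 / math.pi) * math.sqrt(2 * (1 - cos_theta))
    return total / len(pop_x)


def rogers(pop_x, pop_y):
    """Rogers' distance: per-locus sqrt(0.5 * sum (x - y)^2), averaged over loci."""
    total = sum(math.sqrt(0.5 * sum((x - y) ** 2 for x, y in locus))
                for locus in _pairs(pop_x, pop_y))
    return total / len(pop_x)


def goldstein(mean_repeats_x, mean_repeats_y):
    """(delta mu)^2: mean squared difference in mean repeat count across loci."""
    diffs = [(mx - my) ** 2 for mx, my in zip(mean_repeats_x, mean_repeats_y)]
    return sum(diffs) / len(diffs)


fixed_a = [{"A": 1.0}]
fixed_b = [{"B": 1.0}]
print(euclidean(fixed_a, fixed_b), chord(fixed_a, fixed_b), rogers(fixed_a, fixed_b))
print(goldstein([10.0, 12.0], [12.0, 12.0]))
```

The toy comparison of two populations fixed for different alleles shows the different scalings: the Euclidean distance is sqrt(2), Rogers' distance is 1, and the scaled chord distance is 2·sqrt(2)/π, about 0.90.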

Applications

Population Genetics and Structure

In population genetics, genetic distance is a fundamental tool for inferring population structure: distance matrices are generated from genotype data and then subjected to clustering algorithms such as the unweighted pair group method with arithmetic mean (UPGMA) to delineate subpopulations. This approach reveals hierarchical relationships among groups, often highlighting subtle subdivisions driven by limited gene flow or local adaptation. UPGMA dendrograms constructed from Nei's genetic distance have been applied to diverse taxa to identify clusters that align with ecological or geographic boundaries.

Genetic distances also enable estimation of gene flow, where smaller distances between populations signal ongoing migration and reduced differentiation. In Wright's island model, which assumes symmetric migration among demes, low genetic differentiation (quantifiable through distance metrics or related fixation indices) corresponds to high values of Nm, the product of the effective population size N and the migration rate m, indicating effective gene exchange that homogenizes allele frequencies across demes. This framework has been instrumental in assessing connectivity in fragmented habitats, such as oceanic archipelagos or metapopulations.

Genetic distance is closely related to F-statistics, which measure deviations from Hardy-Weinberg expectations caused by subdivision; distances derived from allele frequencies approximate these statistics and aid in computing admixture proportions by modeling shared ancestry in admixed populations. For example, distance-based methods complement FST calculations in estimating the fractional contributions of source populations, providing inferences about recent mixing events.

Applications extend to landscape genetics, where genetic distances are correlated with geospatial data to link population differentiation to environmental barriers.
Studies on riverine species, such as stream insects, show that physical barriers can yield elevated genetic distances between the populations they separate compared with populations in continuous habitat, thereby quantifying the barrier's effect on contemporary gene flow. This approach has illuminated how linear landscape features constrain dispersal in aquatic taxa, fostering differentiation over short geographic scales.
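
The clustering step mentioned above can be sketched with a compact UPGMA implementation. This is a minimal illustration, assuming a full symmetric distance matrix given as a list of lists and returning the dendrogram as nested tuples of the form (left, right, height); the representation and the function name are choices made for this example rather than the interface of any particular package.

```python
def upgma(names, matrix):
    """Build a UPGMA tree as nested (left, right, height) tuples."""
    # cluster id -> (subtree, number of leaves in the subtree)
    clusters = {i: (names[i], 1) for i in range(len(names))}
    dist = {(i, j): matrix[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        # merge the closest pair of clusters
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])
        (tree_i, n_i) = clusters.pop(i)
        (tree_j, n_j) = clusters.pop(j)
        # distance from the merged cluster to each remaining cluster is the
        # average over all leaf pairs, i.e. a size-weighted mean
        merged = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            merged[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        dist = {pair: d for pair, d in dist.items()
                if i not in pair and j not in pair}
        dist.update(merged)
        # node height is half the distance between the merged clusters
        clusters[next_id] = ((tree_i, tree_j, d_ij / 2), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


names = ["A", "B", "C"]
matrix = [[0, 2, 6],
          [2, 0, 6],
          [6, 6, 0]]
print(upgma(names, matrix))
```

UPGMA assumes a constant rate of divergence along all branches (an ultrametric tree), so it suits data that behave in a clock-like way; when rates differ among lineages, methods such as neighbor-joining are usually preferred.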

Evolutionary Biology and Phylogenetics

In evolutionary biology, genetic distances serve as fundamental inputs for reconstructing phylogenetic trees, which depict the branching patterns of evolutionary history among taxa. Distance-based methods such as neighbor-joining use pairwise genetic distances to cluster taxa iteratively while minimizing total branch length, thereby approximating the topology that best fits the observed dissimilarities. This approach assumes that genetic distances reflect evolutionary divergence and is particularly efficient for large datasets, performing well relative to alternative methods under certain models. Similarly, least-squares methods such as the Fitch-Margoliash algorithm optimize tree topologies by minimizing the squared differences between the observed genetic distances and those predicted by the tree, providing a statistical framework for assessing fit and uncertainty in phylogenetic inference.

The molecular clock hypothesis posits that genetic distances accumulate at a relatively constant rate over time under neutral evolution, allowing distances to be calibrated against known divergence times to estimate evolutionary timescales. For instance, the nucleotide divergence between humans and chimpanzees is approximately 1.24%, corresponding to a split around 6-7 million years ago, which has been used to calibrate clocks across primates. Such calibrations rely on fossil anchors and multiple loci to account for rate variation, enabling the inference of divergence events across broader phylogenetic scales. Kimura's two-parameter distances, which correct for multiple substitutions, are often employed in clock-like analyses to better approximate neutral evolution.

Evolutionary forces profoundly shape patterns of genetic distance by influencing how divergence accumulates or erodes across lineages. Mutation is the primary driver increasing genetic distance, as it introduces nucleotide changes or allelic variants that accumulate in proportion to time under neutral conditions, forming the baseline for distance measures.
Genetic drift, particularly in small populations, accelerates the growth of distance between lineages through random allele frequency shifts, leading to fixation of differing variants and loss of within-lineage diversity faster than mutation alone would predict. Natural selection modulates these patterns directionally: purifying selection constrains distance by eliminating deleterious variants and maintaining similarity at functional sites, while positive or divergent selection can amplify distance by favoring adaptive differences in response to varying environments. Migration, or gene flow, counteracts divergence by homogenizing allele frequencies across populations, thereby reducing genetic distances and potentially obscuring phylogenetic signal in interconnected taxa.

In taxonomy, genetic distances aid species delimitation by identifying thresholds of divergence indicative of separate species. For birds, a sequence distance of 2-3% is commonly used as a threshold to distinguish species, reflecting neutral mutations accumulated over sufficient time for speciation, though the appropriate threshold varies by taxon and must be integrated with other data for accuracy.
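
The molecular clock calibration described in this section amounts to simple arithmetic: if two lineages diverge from each other at a constant pairwise rate r (substitutions per site per million years, both lineages combined), a distance d implies a divergence time T = d / r, and a known T conversely implies r = d / T. The sketch below illustrates this under the strict-clock assumption; the function names are hypothetical, and the example values are the ones quoted in the text (about 2% per million years for avian cytochrome b, and 1.24% divergence for a human-chimpanzee split 6-7 million years ago).

```python
def divergence_time_myr(distance, pairwise_rate_per_myr):
    """Time since divergence (Myr), assuming a strict clock.

    distance: substitutions per site between the two lineages.
    pairwise_rate_per_myr: divergence accumulated per site per Myr,
        counting both lineages together.
    """
    if pairwise_rate_per_myr <= 0:
        raise ValueError("rate must be positive")
    return distance / pairwise_rate_per_myr


def implied_pairwise_rate(distance, time_myr):
    """Pairwise divergence rate per Myr implied by a calibrated split time."""
    if time_myr <= 0:
        raise ValueError("time must be positive")
    return distance / time_myr


# Two bird lineages differing by 5% at cytochrome b, with a 2% per Myr clock
print(divergence_time_myr(0.05, 0.02))

# Rate implied by 1.24% divergence and splits 6 and 7 Myr ago
print(implied_pairwise_rate(0.0124, 6.0), implied_pairwise_rate(0.0124, 7.0))
```

If a rate is quoted per lineage rather than per pair, the pairwise rate is twice that value, a frequent source of factor-of-two errors in calibrations.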

Human and Medical Genetics

In human genetics, genetic distance measures applied to autosomal DNA have been instrumental in tracing ancestry and quantifying admixture proportions, particularly in populations with recent intercontinental mixing. For instance, analyses of African-European admixture in African American populations reveal varying degrees of European ancestry, typically ranging from 15-25%, through comparisons of allele frequencies across genomic segments. These distances help reconstruct historical migration patterns and individual ancestry by identifying shared chromosomal blocks inherited from diverse ancestral sources. Such approaches, which often use frequency-based metrics such as Nei's genetic distance for population-level ancestry inference, provide a framework for understanding ancestry without relying solely on self-reported information.

In forensic genetics, pairwise genetic distances between individuals are widely used to establish relationships, such as parentage or sibship, by quantifying identity-by-descent (IBD) segments shared across the genome. For example, methods based on an index of chromosome sharing calculate distances from the length and number of IBD blocks, enabling accurate discrimination between degrees of relatedness even with limited marker sets such as short tandem repeats (STRs). This application is critical in legal contexts, including disaster victim identification and paternity disputes, where distances below certain thresholds confirm close familial ties with high confidence.

Within medical genetics, genetic distance is used to assess relatedness in genome-wide association studies (GWAS) and in family-based analyses of inherited disorders, helping to control for confounding factors such as cryptic relatedness that can inflate false positives. In GWAS cohorts with potential familial connections, pairwise SNP-based distances identify and adjust for shared ancestry, improving the detection of disease-associated variants.
In family-based analyses, such measures quantify haplotype sharing to trace inheritance patterns; a notable example is the use of genetic distances around the BRCA1 and BRCA2 loci to map founder mutations in breast and ovarian cancer families, where phylogenetic trees based on shared chromosomal segments reveal common ancestral origins among carriers.

Since the 2010s, whole-genome sequencing from large reference projects has enabled the computation of fine-scale genetic distances, uncovering subtle population structure within continents, such as sub-regional variation in European or African groups. However, these distance-based clustering methods raise ethical concerns about privacy, because aggregated genomic data can inadvertently reveal individual identities or sensitive ancestry information, prompting calls for stronger consent protocols and safeguards in public repositories.

Limitations and Tools

Limitations of Measures

Genetic distance measures, particularly those based on nucleotide sequences, are prone to saturation, where multiple substitutions at the same site obscure the true number of evolutionary changes, leading to underestimation of distances between distantly related taxa. The problem arises because observed differences plateau as sequences become increasingly randomized, rendering simple correction models ineffective for deep divergences.

Ascertainment bias further compromises measures derived from single nucleotide polymorphisms (SNPs), because SNPs are often selected in specific populations or by criteria that favor common variants, distorting estimates of diversity and population differentiation. For instance, this bias can inflate genetic distances between underrepresented populations and reduce the power to detect subtle structure within discovery cohorts.

Many distance measures rely on assumptions that are frequently violated in real data. The Jukes-Cantor model's premise of equal substitution rates across nucleotide types, for example, fails when transition-transversion biases or unequal rates predominate, resulting in biased distance estimates. Model misspecification exacerbates these problems: simple parametric distances inadequately capture complex evolutionary processes such as rate variation or structural variation, often leading to systematic errors in phylogenetic inference. At loci influenced by selection, such models can misestimate divergence because they overlook reduced neutral variation or altered patterns driven by adaptive pressure.

The fixation index F_{ST}, defined as F_{ST} = \frac{\sigma^2_p}{\bar{p}(1-\bar{p})} where \sigma^2_p is the variance of allele frequencies across populations and \bar{p} is the mean allele frequency, serves as a robust alternative or complement to traditional genetic distances by standardizing differentiation relative to expected heterozygosity.
Unlike raw distances, F_{ST} directly quantifies the proportion of genetic variance attributable to between-population differences, making it preferable for assessing structure in scenarios with unequal allele frequencies or non-neutral evolution, where distances may conflate the effects of drift and selection.

Genetic distances are also highly sensitive to sample size: small samples amplify variance and lead to unreliable estimates of divergence, particularly for rare alleles or low-diversity loci. This sensitivity underscores the importance of large, balanced sampling to stabilize measures in population studies.
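
The F_ST definition given above, F_ST = σ²_p / (p̄(1 - p̄)), can be evaluated directly for a single biallelic locus. The sketch below is a minimal illustration that assumes equally weighted subpopulations and known (not sampled) allele frequencies, so it omits the sample-size corrections used by practical estimators. The second function applies Wright's island-model approximation F_ST ≈ 1 / (4Nm + 1), a standard textbook relationship that this article does not state explicitly and that is included here as an assumption.

```python
def fst_biallelic(freqs):
    """F_ST for one biallelic locus from subpopulation allele frequencies."""
    n = len(freqs)
    p_bar = sum(freqs) / n                            # mean allele frequency
    if p_bar <= 0 or p_bar >= 1:
        raise ValueError("locus is monomorphic across all subpopulations")
    var_p = sum((p - p_bar) ** 2 for p in freqs) / n  # variance among subpopulations
    return var_p / (p_bar * (1 - p_bar))


def nm_from_fst(fst):
    """Migrants per generation under the island model: Nm = (1/F_ST - 1) / 4."""
    if not 0 < fst < 1:
        raise ValueError("F_ST must lie strictly between 0 and 1")
    return (1 / fst - 1) / 4


fst = fst_biallelic([0.4, 0.6])
print(fst, nm_from_fst(fst))
print(fst_biallelic([0.0, 1.0]))  # subpopulations fixed for different alleles
```

With frequencies 0.4 and 0.6 the variance is 0.01 and p̄(1 - p̄) is 0.25, giving F_ST of about 0.04, which under the island-model approximation corresponds to roughly six migrants per generation.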

Computational Software

Several software packages are available for calculating genetic distances, supporting analyses from molecular sequences to population-level allele frequencies. MEGA, first released in 1993 as a free tool, estimates evolutionary distances for nucleotide and amino acid sequences using models such as the p-distance and corrected distances like Jukes-Cantor. It accepts input in formats including FASTA and outputs pairwise distance matrices or phylogenetic trees, facilitating phylogenetic inference.

For population genetics, Arlequin, developed since 1998, computes pairwise genetic distances such as Reynolds' distance and Slatkin's linearized F_{ST} from population samples. It supports inputs such as microsatellite or DNA sequence data in structured project files and provides outputs including distance matrices for further analysis, such as Mantel tests. Similarly, GENEPOP, originating in the 1990s, specializes in F_{ST}-based distances and isolation-by-distance regressions from genotype data supplied in Genepop-format files. Its outputs include genetic distance matrices that can be paired with geographic data for correlation analyses.

Advanced tools extend these capabilities to phylogenetic and admixture scenarios. PHYLIP, a longstanding package distributed since 1980, infers phylogenies from distance matrices using methods like Fitch-Margoliash, and accepts both sequence alignments and precomputed distance matrices as input. ADMIXTURE, introduced in 2009, estimates ancestry proportions from multilocus SNP genotype data, implicitly incorporating distances to model population structure without requiring precomputed matrices. PLINK 2.0, in ongoing alpha development since 2017, calculates pairwise relatedness measures for GWAS datasets in VCF format, outputting genomic relationship matrices for large-scale population analyses. TreeMix, released in 2012, builds trees with migration edges from allele frequency covariances, using genome-wide data to infer admixture edges alongside genetic distances.

Integration with programming environments enhances flexibility. The R package adegenet computes several genetic distances between populations, such as Nei's or Rogers' distances, from genpop objects derived from various marker types, with outputs suitable for multivariate visualization. Cloud-based workflow platforms, updated through 2025, offer workflows incorporating these tools for scalable genetic distance calculations, supporting VCF and other standard inputs via web interfaces for collaborative analyses.
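The p-distance and Jukes-Cantor correction mentioned for sequence-based tools can be sketched in a few lines of Python. This is an illustrative implementation of the standard formulas, not the code used by MEGA or any other package. The function names are hypothetical, and the sketch assumes the two sequences are already aligned and that sites containing gaps or ambiguous bases are simply skipped.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned DNA sequences.

    Assumption: sites where either sequence has a gap or an ambiguous
    base (anything other than A, C, G, T) are excluded from the count.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - (4/3) * p).

    The correction is undefined once p reaches 0.75, the level expected
    for fully saturated sequences, so infinity is returned in that case.
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    p = p_distance("ACGTACGTAC", "ACGTACGTAA")  # one difference in ten sites
    print(p, round(jukes_cantor(p), 4))
```

The corrected distance always exceeds the observed proportion of differences, and the gap widens as p approaches 0.75, which reflects the saturation problem for deep divergences.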
