Linkage disequilibrium
Linkage disequilibrium (LD) refers to the non-random association of alleles at different loci, usually on the same chromosome, where certain combinations of alleles occur more or less frequently than expected by chance under random segregation.[1] The term was introduced by Lewontin and Kojima in 1960; LD quantifies the deviation from linkage equilibrium and is a fundamental concept in population genetics.[1] The standard measure of LD is the coefficient D, defined as D = p_AB - p_A × p_B, where p_AB is the frequency of the haplotype carrying both alleles A and B, and p_A and p_B are the frequencies of alleles A and B, respectively; this metric was originally introduced by Robbins in 1918.[2]
LD arises from evolutionary forces that create non-random allele associations, including genetic drift in finite populations, natural selection, mutation, population bottlenecks, and admixture, and it persists longest between closely linked loci, where recombination is rare.[1] Recombination gradually erodes LD over generations, with the expected decay following D_t = D_0 (1 - c)^t, where c is the recombination rate and t is the number of generations; however, factors that suppress recombination, such as chromosomal inversions, can maintain it.[2] Additional normalized measures are commonly used to compare LD patterns across populations or genomic regions: D' (introduced by Lewontin in 1964), which scales D by its maximum possible value given the allele frequencies so that its absolute value ranges from 0 to 1, and r² (introduced by Hill and Robertson in 1968), the squared correlation coefficient between loci, which indicates the proportion of variance explained.[2] In humans, LD typically extends over short distances (e.g., tens to hundreds of kilobases) but forms haplotype blocks in regions of low recombination, such as the major histocompatibility complex (MHC).[1]
The study of LD has profound applications in genomics and medicine, enabling fine-scale mapping of disease-associated genes through association studies, as demonstrated in the identification of the diastrophic dysplasia locus via LD analysis in Finnish isolates.[1] It underpins genome-wide association studies (GWAS) by leveraging correlated markers to detect causal variants for complex traits and diseases, such as breast cancer susceptibility loci.[1] Furthermore, LD patterns reveal historical demographic events, estimate effective population sizes, and detect signatures of positive or balancing selection, with empirical data from projects like the International HapMap Consortium highlighting its utility in understanding human genetic diversity.[2]
Introduction
Definition and Basic Principles
Linkage disequilibrium (LD) refers to the nonrandom association of alleles at different loci in a population, representing a deviation from the independent assortment expected under random mating. This statistical phenomenon measures the extent to which the frequency of a particular allele at one locus predicts the frequency of an allele at another locus, beyond what would be anticipated from their individual allele frequencies. The term "linkage disequilibrium" was first introduced by Lewontin and Kojima in 1960 to describe this non-equilibrium state in the context of two-locus population genetics models. A key distinction exists between genetic linkage and linkage disequilibrium. Genetic linkage describes the physical proximity of loci on the same chromosome, which reduces the probability of recombination between them during meiosis and thus tends to keep alleles together across generations. In contrast, LD is a population-level statistical association that can occur between linked loci but does not strictly require physical linkage; it can also arise between unlinked loci due to other evolutionary forces. LD emerges from various population genetic processes, including mutation, which introduces new allele combinations; natural selection, which can favor or disfavor specific haplotype combinations; genetic drift, which randomly alters allele frequencies in finite populations; and population structure, such as admixture between subpopulations with differing allele frequencies. Over time, LD decays primarily through recombination, which shuffles alleles between loci and promotes the approach to linkage equilibrium, where alleles assort independently according to their marginal frequencies. To illustrate, consider two loci each with two alleles (A/a and B/b) in a population. 
In a state of complete LD, such as in a small population with no recombination events, only three haplotypes (e.g., AB, Ab, and aB) might be observed, with the fourth (ab) absent, because the initial allele combinations persist without shuffling. This contrasts with linkage equilibrium, where all four haplotypes occur at frequencies equal to the product of their respective allele frequencies (e.g., frequency of AB = frequency of A × frequency of B), reflecting random association. Such patterns highlight how LD provides insights into recent evolutionary history and population dynamics.
Biological Significance
Linkage disequilibrium (LD) plays a central role in genome-wide association studies (GWAS) by enabling the identification of common single nucleotide polymorphisms (SNPs) that serve as proxies, or "tags," for causal variants influencing complex traits and diseases. In GWAS, causal variants are often not directly genotyped but are detected through their correlation with nearby SNPs due to LD, which reduces the number of markers needed for comprehensive coverage of genetic variation. This tagging efficiency is particularly effective in regions of high LD, such as haplotype blocks, allowing researchers to infer associations across the genome with fewer assays. For instance, in European-ancestry populations, common SNPs can tag a large proportion of common causal variants, facilitating the discovery of loci associated with conditions like type 2 diabetes and schizophrenia. From an evolutionary perspective, LD provides insights into historical demographic events and selective pressures shaping genetic diversity. Recent admixture between populations with differing allele frequencies generates long-range LD that decays over generations due to recombination, allowing estimation of admixture timing and proportions; for example, LD patterns in African Americans reflect admixture events from the transatlantic slave trade. Population bottlenecks and founder effects increase LD by reducing haplotype diversity and effective recombination, as seen in isolated human groups like the Finnish population, where extended LD signals past reductions in effective population size. Additionally, natural selection can either generate or preserve LD: positive selection on a beneficial allele creates "hitchhiking" LD with linked neutral variants, while balancing selection maintains LD around functional loci. High LD thus indicates low effective recombination rates influenced by these forces, offering a window into evolutionary history. 
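The idea that admixture LD decays at a predictable rate, and can therefore be used to date a mixing event, can be sketched with the decay relation D_t = D_0 (1 - c)^t from the article's lead. This is a deterministic illustration only: the function names are hypothetical, drift and sampling noise are ignored, and real admixture-dating methods fit decay curves across many marker pairs rather than inverting a single pair.

```python
import math


def ld_after_generations(d0: float, c: float, t: int) -> float:
    """Expected D after t generations of random mating: D_t = D_0 * (1 - c)**t."""
    if not 0.0 <= c <= 0.5:
        raise ValueError("recombination fraction c must lie in [0, 0.5]")
    return d0 * (1.0 - c) ** t


def generations_since(d0: float, dt: float, c: float) -> float:
    """Invert the decay relation: t = ln(D_t / D_0) / ln(1 - c)."""
    if not 0.0 < c <= 0.5:
        raise ValueError("recombination fraction c must lie in (0, 0.5]")
    if d0 == 0 or dt / d0 <= 0:
        raise ValueError("D_0 and D_t must be non-zero and share the same sign")
    return math.log(dt / d0) / math.log(1.0 - c)


if __name__ == "__main__":
    # Hypothetical values: initial D of 0.2 between markers with c = 0.01.
    for t in (0, 10, 50, 100):
        print(t, round(ld_after_generations(0.2, 0.01, t), 4))
```

Because the decay is geometric, unlinked loci (c = 0.5) lose half of their disequilibrium every generation, whereas tightly linked loci retain it for many generations; this contrast is what makes long-range LD a signal of recent admixture.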
LD has practical applications beyond medical genetics, including forensics, ancestry inference, and conservation biology. In forensics, LD enables matching of genetic profiles across disparate marker sets, such as short tandem repeats (STRs) used in criminal databases and SNPs from whole-genome sequencing, by imputing untyped loci and computing match probabilities; this approach achieves over 98% accuracy for identifying the same individual in disjoint datasets.[3] For ancestry inference, admixture-induced LD (ALD) between unlinked markers allows reconstruction of ancestral proportions and dating of mixing events, as in admixed Latin American populations where ALD decay reveals European, African, and Indigenous contributions from colonial eras. In conservation genetics, elevated LD serves as an indicator of inbreeding and small effective population sizes in threatened species, where genome-wide LD patterns help assess genetic health and guide breeding programs to mitigate erosion of diversity.[4]
Despite these utilities, LD varies substantially across populations and genomic regions, which poses challenges for study design and interpretation. LD typically extends over shorter distances in African populations than in European populations because of larger ancestral effective population sizes and older coalescence times, so diverse cohorts need denser marker coverage to avoid missing associations.[5] Within genomes, recombination hotspots and coldspots create heterogeneous LD blocks, with low-recombination regions such as pericentromeric regions showing extended LD influenced by selection or structural variants. This variability underscores the need for population-specific reference panels in analyses, as transferring LD patterns from one group to another can lead to biased tagging and reduced power in GWAS or ancestry studies.[5]
Historical Context
Early Discoveries
The foundational theoretical concepts of linkage disequilibrium (LD), the non-random association of alleles at different loci in populations, emerged in the early 20th century within population genetics, building on observations of physical linkage. Thomas Hunt Morgan's experiments with Drosophila melanogaster in the 1910s provided empirical evidence for linked inheritance, demonstrating that genes on the same chromosome recombine at rates depending on their physical proximity. This work established the chromosomal basis of linkage, which underlies the evolutionary forces that generate LD, though LD itself concerns population-level allele associations rather than direct measures of recombination. Morgan's chromosome theory of inheritance, developed through breeding analyses including his 1910 report on sex-linked inheritance, marked a pivotal shift from Mendelian principles to physical mechanisms that influence allele segregation. In 1918, R.B. Robbins introduced the coefficient D in theoretical calculations for two linked loci, quantifying the deviation from random haplotype frequencies expected under linkage equilibrium (D = p_AB - p_A × p_B). Early models by R.A. Fisher and others in the 1920s–1940s assumed infinite populations and predicted that recombination would erode LD over generations, leading to random allele combinations unless maintained by other forces like selection. These theoretical advancements laid the groundwork for understanding LD as a dynamic parameter shaped by population processes, distinct from physical linkage. In human populations, LD was first empirically observed through blood group studies during the 1950s and 1960s, revealing unexpected correlations between alleles at nearby loci. Pioneering work on systems like ABO and Rh blood groups uncovered non-random haplotype frequencies, such as associations between blood group O and peptic ulcer disease, attributed to LD rather than independent assortment. 
These findings, compiled in comprehensive surveys like Arthur Mourant's 1954 catalog of global blood group distributions, highlighted how population history and selection could maintain allele associations, influencing early interpretations of human genetic variation. Such observations extended principles of linkage to humans, emphasizing LD's role in disease susceptibility and ancestry tracing.
The conceptual integration of these discoveries culminated in the late 20th century through influential textbooks that formalized LD within population genetics. Daniel L. Hartl and Andrew G. Clark's 1989 edition of Principles of Population Genetics synthesized early theoretical and empirical observations into a cohesive framework, defining LD measures like the coefficient D and discussing its decay under recombination and drift. This text bridged pre-molecular insights with quantitative models, establishing LD as a core parameter for studying evolutionary forces in finite populations.
Advances in the Molecular Era
The advent of molecular biology techniques in the 1980s and 1990s, particularly polymerase chain reaction (PCR), revolutionized linkage disequilibrium (LD) studies by enabling the precise amplification of specific DNA segments, which facilitated the direct genotyping of closely linked loci and improved haplotype resolution in population samples.[1] Concurrently, the discovery of single nucleotide polymorphisms (SNPs) as abundant genetic markers allowed for more accurate phasing of haplotypes, shifting from indirect inference to empirical observation of LD patterns.[1] The completion of the Human Genome Project in 2003 provided a comprehensive reference sequence, enabling systematic mapping of LD across the genome and revealing structured blocks of correlated variants.[6]
A pivotal milestone came with the International HapMap Project, launched in 2002 and culminating in its Phase I release in 2005, which genotyped over 1 million SNPs in 269 individuals from four global populations to catalog haplotype blocks and tag SNPs, demonstrating consistent LD structures with low haplotype diversity within blocks.[7] This resource highlighted population-specific LD variations, such as longer blocks in non-African ancestries due to historical bottlenecks.[5] Building on this, the final phase of the 1000 Genomes Project, published in 2015, characterized 2,504 individuals from 26 populations using low-coverage whole-genome sequencing combined with deep exome sequencing and dense microarray genotyping. It refined fine-scale LD by showing rapid decay in African ancestries (where a typical common variant has only ~8 tagging variants) compared to slower decay in East Asians (~20 tagging variants), and enabled >95% imputation accuracy for rare variants through enhanced haplotype diversity resolution.[8]
Technological progress further transformed LD analysis, transitioning from labor-intensive gel electrophoresis-based Sanger sequencing in the 1990s, which limited studies to small genomic regions, to next-generation sequencing (NGS) platforms in the 2000s and 2010s, which supported genome-wide scans of millions of variants simultaneously and uncovered subtle LD gradients shaped by recombination hotspots. NGS integration with large cohorts allowed for unprecedented resolution of LD decay and population stratification effects.[9] In recent years, CRISPR-Cas9 has been integrated into experimental designs to manipulate and validate LD-associated variants from genome-wide association studies, such as editing regulatory elements in linkage blocks to dissect causal mechanisms in disease models, thereby bridging observational LD patterns with functional outcomes.[10] Simultaneously, AI-driven approaches, including deep transfer learning, have advanced LD prediction in diverse ancestries by modeling population-specific patterns from limited data, improving cross-ancestry imputation accuracy and reducing biases in polygenic risk assessments for underrepresented groups.[11]
Foundational Concepts
Genetic Nomenclature
In population genetics, discussions of linkage disequilibrium (LD) commonly employ standardized notation for loci, alleles, and haplotypes to facilitate clear communication across studies. Two linked loci are typically designated as A and B, where each is assumed to be biallelic unless specified otherwise. At locus A, the alleles are denoted A₁ and A₂; similarly, at locus B, the alleles are B₁ and B₂. This notation allows for the identification of the four possible haplotypes: A₁B₁, A₁B₂, A₂B₁, and A₂B₂. Allele frequencies are symbolized with subscripted probabilities, such as p_{A_1} for the frequency of allele A₁ and p_{B_1} for B₁, where p_{A_2} = 1 - p_{A_1} and p_{B_2} = 1 - p_{B_1} under the assumption of biallelic variation. Haplotype frequencies are represented as g_{A_1B_1} (or equivalently p_{11} in some contexts) for the joint occurrence of A₁ and B₁ on the same chromosome, with analogous symbols for the other combinations: g_{A_1B_2}, g_{A_2B_1}, and g_{A_2B_2}. These sum to unity, reflecting the complete set of gametic types in the population. The LD coefficient D is defined in terms of these haploid (gametic) frequencies as D = g_{A_1B_1} - p_{A_1} p_{B_1}, capturing the deviation from independence expected under linkage equilibrium. This formulation applies to haploid phases or inferred haplotypes, but in diploid organisms like humans, where genotypes (e.g., A₁A₂/B₁B₂) are directly observed, haplotype frequencies must be estimated—often via methods like the expectation-maximization (EM) algorithm—to compute D. 
Extensions to diploid genotype data involve composite measures, such as those accounting for Hardy-Weinberg equilibrium, but the core D remains rooted in gametic associations to avoid confounding by within-locus disequilibria.[12][13]
For multi-allelic loci, where more than two alleles exist at A or B, the notation generalizes by specifying allele pairs (e.g., D_{A_i B_j} = g_{A_i B_j} - p_{A_i} p_{B_j}) to avoid ambiguity, as pairwise LD can vary across combinations. This pairwise approach is preferred over composite multi-allelic measures in most analyses to maintain interpretability. Such conventions, emphasizing precise subscripting and phase-specific frequencies, are upheld in foundational population genetics texts to ensure consistency in theoretical and empirical work.
Allele and Haplotype Frequencies
Allele frequencies represent the marginal probabilities of specific alleles at a genetic locus within a population and form the basis for assessing associations between loci in linkage disequilibrium studies. For a biallelic locus with alleles A_1 and A_2, the frequency of A_1, denoted p_{A_1}, is estimated from genotype counts as p_{A_1} = \frac{2n_{A_1A_1} + n_{A_1A_2}}{2N}, where n denotes the observed counts of each genotype and N is the number of diploid individuals in the sample.[14] In the context of multi-locus data, these marginal frequencies are equivalently computed from haplotype counts normalized by the total number of haplotypes (twice the sample size for diploids), such that p_{A_1} = g_{A_1 \cdot} = \sum_j g_{A_1 B_j}, where g_{ij} are the proportions of haplotypes carrying allele i at the first locus and j at the second.[14] This estimation assumes random sampling and is robust under Hardy-Weinberg equilibrium, providing a prerequisite for evaluating deviations indicative of non-random associations.[1] Haplotype frequencies quantify the joint occurrence of allele combinations across linked loci on a single chromosome and are essential for precise linkage disequilibrium calculations. 
When genotype data are phased—meaning the chromosomal origin of alleles is known—haplotype frequencies are directly estimated by counting the occurrences of each unique combination and dividing by the total number of haplotypes in the sample.[14] For unphased diploid genotypes, where phase is ambiguous, frequencies are inferred using maximum-likelihood methods such as the expectation-maximization (EM) algorithm, which iteratively refines estimates by maximizing the likelihood of observed genotypes given assumed haplotype distributions.[14] This approach, introduced for molecular haplotype data, converges reliably for moderate sample sizes and low numbers of alleles, yielding unbiased estimates even with missing phase information.[14]
Under the assumption of genetic independence between loci (linkage equilibrium), the expected frequency of a specific haplotype is the product of its constituent marginal allele frequencies; for example, the expected frequency of haplotype A_1B_1 is p_{A_1} \times p_{B_1}.[1] This product rule derives from the null model of no association and serves as a benchmark for detecting excess or deficit of particular haplotype combinations in observed data.[1]
Core Measures of Linkage Disequilibrium
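Every measure that follows is computed from haplotype frequencies, so it may help to see how the EM estimation described under Allele and Haplotype Frequencies could work in the simplest case. The sketch below assumes two biallelic loci, random mating, and genotypes coded as (number of A_1 alleles, number of B_1 alleles); the function name, the coding, and the starting values are illustrative choices, not taken from any cited software. Only double heterozygotes, coded (1, 1), are ambiguous in phase, so the E-step only has to apportion those individuals.

```python
# Haplotype order used throughout: [A1B1, A1B2, A2B1, A2B2].
# Unambiguous genotypes map directly to a pair of haplotype indices.
FIXED_PHASE = {
    (2, 2): (0, 0), (2, 1): (0, 1), (2, 0): (1, 1),
    (1, 2): (0, 2), (1, 0): (1, 3),
    (0, 2): (2, 2), (0, 1): (2, 3), (0, 0): (3, 3),
}


def em_haplotype_frequencies(genotype_counts, max_iter=200, tol=1e-10):
    """Estimate two-locus haplotype frequencies from unphased genotype counts.

    genotype_counts maps (a, b) -> number of individuals, where a and b are
    the counts (0, 1, or 2) of alleles A1 and B1 carried by an individual.
    """
    n_individuals = sum(genotype_counts.values())
    if n_individuals == 0:
        raise ValueError("no genotypes supplied")
    total_haplotypes = 2 * n_individuals

    # Haplotype counts contributed by phase-unambiguous genotypes.
    known = [0.0, 0.0, 0.0, 0.0]
    for genotype, n in genotype_counts.items():
        if genotype == (1, 1):
            continue
        for h in FIXED_PHASE[genotype]:
            known[h] += n
    n_double_het = genotype_counts.get((1, 1), 0)

    freqs = [0.25, 0.25, 0.25, 0.25]  # arbitrary starting point
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is A1B1/A2B2
        # rather than A1B2/A2B1, given the current frequencies.
        cis = freqs[0] * freqs[3]
        trans = freqs[1] * freqs[2]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: expected haplotype counts, renormalized to frequencies.
        counts = [
            known[0] + w * n_double_het,
            known[1] + (1 - w) * n_double_het,
            known[2] + (1 - w) * n_double_het,
            known[3] + w * n_double_het,
        ]
        updated = [c / total_haplotypes for c in counts]
        converged = max(abs(x - y) for x, y in zip(updated, freqs)) < tol
        freqs = updated
        if converged:
            break
    return freqs
```

EM can settle on a local optimum, so in practice several starting points are usually tried; a symmetric start such as the one used here is stationary when the sample consists only of double heterozygotes.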
The D Statistic
The D statistic provides a fundamental measure of linkage disequilibrium (LD) for pairs of loci in haploid or gametic data, quantifying the deviation between observed and expected haplotype frequencies under the assumption of independence. As formalized by Lewontin and Kojima (1960) in their analysis of complex polymorphisms,[12] it captures non-random associations between alleles at two loci. For biallelic loci, consider the first locus with alleles A (frequency p_1) and a (frequency 1 - p_1), and the second with alleles B (frequency q_1) and b (frequency 1 - q_1). Let g_{11} denote the observed frequency of the AB haplotype. The D statistic is then given by D = g_{11} - p_1 q_1, where the term p_1 q_1 represents the expected frequency of AB if the loci were independent.[15] This formulation assumes haplotype (gametic) frequencies are directly observable or estimable, as in population surveys of phase-known data. This measure arises naturally from a 2×2 contingency table of haplotype counts, where LD corresponds to the covariance between indicator random variables for the alleles. Define I_A = 1 if allele A is present at the first locus (and 0 otherwise), and I_B = 1 if B is present at the second locus. The covariance is \operatorname{Cov}(I_A, I_B) = \mathbb{E}[I_A I_B] - \mathbb{E}[I_A] \mathbb{E}[I_B] = g_{11} - p_1 q_1 = D, reflecting the statistical dependence between the loci; under independence, g_{11} = p_1 q_1 and D = 0. The D statistic thus serves as a direct indicator of excess or deficit in specific haplotype combinations relative to random assortment.
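The covariance identity can be checked numerically. The following sketch assumes phased haplotypes written as two-character strings (for example "AB" or "aB"); that encoding and the function name are illustrative assumptions. It computes D once as g_{11} - p_1 q_1 and once as the covariance of the indicator variables I_A and I_B.

```python
def d_two_ways(haplotypes):
    """Return (D from frequencies, D as indicator covariance) for phased data.

    Each haplotype is a two-character string: first character "A" or "a",
    second character "B" or "b".
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    i_a = [1 if h[0] == "A" else 0 for h in haplotypes]  # indicator I_A
    i_b = [1 if h[1] == "B" else 0 for h in haplotypes]  # indicator I_B

    p1 = sum(i_a) / n                                # frequency of A
    q1 = sum(i_b) / n                                # frequency of B
    g11 = sum(x * y for x, y in zip(i_a, i_b)) / n   # frequency of AB

    d_direct = g11 - p1 * q1
    # Population covariance (divide by n, not n - 1) so it matches D exactly.
    d_cov = sum((x - p1) * (y - q1) for x, y in zip(i_a, i_b)) / n
    return d_direct, d_cov


if __name__ == "__main__":
    # Hypothetical sample of 10 haplotypes: 4 AB, 1 Ab, 2 aB, 3 ab.
    sample = ["AB"] * 4 + ["Ab"] + ["aB"] * 2 + ["ab"] * 3
    print(d_two_ways(sample))  # both values are approximately 0.1
```

The covariance here divides by n rather than n - 1; with the sample-variance convention the two numbers would differ by a factor of n / (n - 1).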
The sign of D conveys the direction of the association: positive D signifies an excess of coupling haplotypes (AB and ab) compared to repulsion types (Ab and aB), implying overrepresentation of like allele pairs, while negative D indicates the reverse.[16] The magnitude is bounded by allele frequencies. Because g_{11} can be no larger than \min(p_1, q_1) and no smaller than \max(0, p_1 + q_1 - 1), the possible values range from D_{\min} = -\min(p_1 q_1, (1 - p_1)(1 - q_1)) to D_{\max} = \min(p_1 (1 - q_1), (1 - p_1) q_1), ensuring D cannot exceed the feasible limits set by marginal probabilities.[15] Although defined for biallelic loci, the D statistic extends to multi-allelic cases through pairwise computation, where for allele i at the first locus (frequency p_i) and allele j at the second (frequency q_j), D_{ij} = g_{ij} - p_i q_j, allowing assessment of disequilibria across all allele pairs.[17]
Properties and Calculations of D
The linkage disequilibrium coefficient D extends naturally to diploid populations through estimation from observed genotype frequencies, assuming random mating and Hardy-Weinberg equilibrium within subpopulations. In this context, for two biallelic loci with alleles A/a and B/b, the haplotype frequency p_{AB} is inferred from the joint genotype frequencies, and D_{AB} = p_{AB} - p_A p_B, where p_A and p_B are marginal allele frequencies. Maximum likelihood estimation of D_{AB} from diploid data involves iterative procedures that account for the nine possible genotype combinations under codominance, ensuring the estimates satisfy the constraints of haplotype frequencies summing to unity. For cases with dominant markers, simplified formulas apply, such as D = f_{22} - (N_{.2} N_{2.})/N^2 for two dominant loci, where f_{22} is the estimated frequency of the double recessive haplotype and N denotes sample size.[18]
A key property of D is that its magnitude is invariant to allele relabeling within loci: swapping the allele labels at one locus (e.g., A and a) changes the sign of D but preserves its absolute value, reflecting the symmetric nature of haplotype associations. The possible values of D are bounded by -\min(p_A p_B, (1 - p_A)(1 - p_B)) ≤ D ≤ \min(p_A (1 - p_B), (1 - p_A) p_B), ensuring it remains feasible given marginal allele frequencies. In subdivided populations, D decomposes into within-population and between-population components, as proposed by Ohta: the total disequilibrium D_T = D_{IS} + D_{ST}, where D_{IS} averages the disequilibrium across subpopulations and D_{ST} captures differentiation due to allele frequency differences among them.
This decomposition highlights how migration, drift, and selection influence global versus local LD patterns.[15][1]
For multi-locus systems, calculations involving D reveal relations such as the product D_{AB} \times D_{CD} approximating the covariance between non-overlapping haplotype pairs in the absence of higher-order interactions, facilitating extensions to composite measures of LD across multiple loci. The composite LD coefficient for diploids, \Delta_{AB}, sums the gametic disequilibrium D_{AB} and the non-gametic disequilibrium D_{A/B} (the association between alleles carried on different gametes within an individual): \Delta_{AB} = D_{AB} + D_{A/B}. It can be estimated directly from observed genotype counts without phasing and without assuming Hardy-Weinberg equilibrium, and it reduces to D_{AB} under random mating.[19]
Sampling variance of D in finite diploid populations provides insight into estimation precision; under null LD (D = 0) and codominant loci, \text{Var}(D) = p(1-p) q(1-q) / N, where p and q are allele frequencies and N is the number of diploid individuals, reflecting the impact of sample size on reliability. For dominant loci, the variance adjusts accordingly, such as \text{Var}(D) = p(1-p) q(2-q) / (2N) for one codominant and one dominant locus, emphasizing the need for larger samples when dominance obscures heterozygotes. These formulas derive from multinomial sampling distributions of genotype counts, enabling hypothesis tests for LD significance.[18]
Normalized Measures
D' Measure
The D' measure normalizes the linkage disequilibrium coefficient D by dividing it by the largest magnitude D can attain, in the observed direction, given the allele frequencies at the two loci, thereby providing a bounded estimate of association strength that is less sensitive to frequency variation.[20] This normalization addresses the limitations of D, which can range widely depending on allele frequencies even under similar evolutionary histories.[20] The formula for D' is defined as D' = \begin{cases} \frac{D}{D_{\max}} & \text{if } D > 0, \\ \frac{D}{|D_{\min}|} & \text{if } D < 0, \\ 0 & \text{if } D = 0, \end{cases} where D_{\max} = \min(p_1 q_2, q_1 p_2) and D_{\min} = -\min(p_1 p_2, q_1 q_2), with p_1 and p_2 denoting the frequencies of one allele at each locus and q_1 = 1 - p_1, q_2 = 1 - p_2.[20] D' thus ranges from -1 to 1, where absolute values near 1 signify complete linkage disequilibrium (i.e., the observed association reaches the theoretical maximum constrained by allele frequencies), while values near 0 indicate linkage equilibrium.[20] In practice the absolute value |D'| is usually reported. This interpretation highlights the proportion of achievable disequilibrium, emphasizing structural constraints rather than raw deviation.[1]
A key advantage of D' is its ability to account for allele frequency imbalances, yielding more comparable estimates of disequilibrium across loci with varying minor allele frequencies than unnormalized measures. It is particularly useful for detecting signatures of recent mutations, as elevated D' values often persist around novel variants before recombination erodes the signal. In computational applications, such as genome-wide association studies, D' is typically calculated pairwise for single nucleotide polymorphism pairs within sliding windows (e.g., 50-100 kb) to identify haplotype blocks and recombination hotspots.[21]
r² Measure
The r^2 measure quantifies linkage disequilibrium (LD) as the squared Pearson correlation coefficient between alleles at two loci, providing a standardized assessment of how much variation at one locus predicts variation at the other. It is defined as r^2 = \frac{D^2}{p_1 (1 - p_1) q_1 (1 - q_1)}, where D represents the LD coefficient (serving as the covariance between indicator variables for alleles at the two loci), p_1 is the frequency of one allele at the first locus, and q_1 is the frequency of one allele at the second locus.[22] This formulation normalizes D^2 by the product of the variances at each locus, yielding values between 0 (no LD) and 1 (complete LD), although the maximum attainable value still depends on the allele frequencies.[22]
In genetic association studies, r^2 is interpreted as the proportion of variance in one single nucleotide polymorphism (SNP) explained by another, making it particularly valuable for identifying proxy or tag SNPs that capture common genetic variation efficiently. An r^2 value of approximately 0.8 is often used as a threshold to designate SNPs as strong proxies, ensuring high predictive power while reducing genotyping needs in genome-wide scans. Its advantages include a direct link to statistical power for detecting associations via tagging strategies, as higher r^2 enhances the ability to infer unobserved variants, and robustness to varying sample sizes compared to unnormalized measures. With phased data, r^2 is computed directly from haplotype counts. When genotypes are unphased (as in typical diploid data), r^2 is instead estimated from haplotype frequencies inferred via the expectation-maximization (EM) algorithm, which iteratively maximizes the likelihood of observed genotypes to approximate true haplotype distributions; such estimates carry additional uncertainty from phase inference, which can bias LD estimates if ignored.[23]
Additional Measures and Interpretations
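The r² tagging threshold discussed in the r² Measure section can be illustrated with a small greedy selection over phased haplotypes. This is a simplified sketch rather than the procedure of any particular tagging tool: the SNP names are hypothetical, alleles are coded 0/1 per haplotype, and the greedy rule (repeatedly pick the SNP that covers the most still-untagged SNPs at r² ≥ 0.8) is only one of several reasonable strategies.

```python
def r_squared(x, y):
    """r^2 between two SNPs given 0/1 alleles on the same phased haplotypes."""
    n = len(x)
    p = sum(x) / n
    q = sum(y) / n
    denom = p * (1 - p) * q * (1 - q)
    if denom == 0:
        return 0.0  # r^2 is undefined for a monomorphic SNP; treated as 0 here
    d = sum(a * b for a, b in zip(x, y)) / n - p * q
    return d * d / denom


def greedy_tags(snps, threshold=0.8):
    """Map each chosen tag SNP to the sorted list of SNPs it covers.

    snps maps a SNP name to its list of 0/1 alleles across haplotypes.
    """
    untagged = set(snps)
    tags = {}
    while untagged:
        # Choose the SNP covering the most untagged SNPs; ties are broken
        # alphabetically because max() keeps the first maximum it sees.
        best = max(
            sorted(untagged),
            key=lambda s: sum(
                r_squared(snps[s], snps[t]) >= threshold for t in untagged
            ),
        )
        covered = {
            t for t in untagged
            if t == best or r_squared(snps[best], snps[t]) >= threshold
        }
        tags[best] = sorted(covered)
        untagged -= covered
    return tags


if __name__ == "__main__":
    haplotypes = {  # hypothetical data: six phased haplotypes, four SNPs
        "snp1": [1, 1, 0, 0, 1, 0],
        "snp2": [1, 1, 0, 0, 1, 0],
        "snp3": [0, 0, 1, 1, 0, 1],
        "snp4": [1, 0, 1, 0, 1, 0],
    }
    print(greedy_tags(haplotypes))
```

In this toy data snp2 is identical to snp1 and snp3 is its exact complement, so both have r² = 1 with snp1 and a single tag covers all three, while snp4 is only weakly correlated and needs its own tag.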
d and ρ Measures
The d measure provides a normalized assessment of linkage disequilibrium by scaling the basic D statistic relative to the variance at the second locus, given by the formula d = \frac{D}{p_B (1 - p_B)} where p_B is the frequency of allele B at the second locus (often the disease or trait locus).[24] This formulation, introduced by Nei and Li, yields a value that emphasizes the deviation from independence in a manner suited for association studies, facilitating comparison in contexts where one locus is of particular interest.[24] The measure ranges from 0 to 1, with values approaching 1 indicating strong LD, and it is defined for specific allele frequency configurations (e.g., where D ≥ 0).[24]
The ρ measure extends normalization by scaling D relative to a product of allele frequencies at both loci, defined as \rho = \frac{D}{(1 - p_A) p_B} where p_A and p_B are the frequencies of alleles A and B, respectively, and 1 - p_A is the frequency of the alternative allele at the first locus; it requires D ≥ 0 with p_B \leq p_A and p_B \leq 1 - p_B.[24] Under these conditions the denominator equals D_{\max} = \min(p_A (1 - p_B), (1 - p_A) p_B), which is why this adjustment, from Collins and Morton, equals |D'| in applicable domains; it isolates recombination-generated LD and aids theoretical comparisons across loci.[24]
Although less frequently employed than D' or r² in empirical studies, d and ρ offer valuable insights in theoretical population genetics models, especially for dissecting LD components in scenarios involving allele frequency asymmetries.[24] For instance, d's normalization by one locus's variance supports its use in simulations where a trait locus is fixed, enhancing reliability in analyses of association mapping.[24]
Limits and Ranges of LD Measures
Linkage disequilibrium (LD) measures such as D, D', and r^2 have theoretical bounds that depend on the underlying allele frequencies, influencing their interpretability and application in genetic analyses.[25] The raw D statistic, defined as D = p_{AB} - p_A p_B, exhibits allele frequency-dependent limits, with its possible range constrained by the marginal allele frequencies p_A and p_B. Specifically, the bounds are \max(-p_A p_B, -(1-p_A)(1-p_B)) \leq D \leq \min(p_A (1-p_B), (1-p_A) p_B).[25] In balanced biallelic cases where p_A = p_B = 0.5, these bounds simplify to -0.25 \leq D \leq 0.25, providing a symmetric interval around zero under linkage equilibrium.[25] The normalized measure D', computed as D' = D / D_{\max} (where D_{\max} is the maximum possible |D| given the allele frequencies), standardizes D to remove much of its frequency dependence. Its absolute value ranges from 0 to 1 regardless of allele frequencies, though the normalization is asymmetric: for positive D, D_{\max} = \min(p_A (1-p_B), (1-p_A) p_B); for negative D, it uses the corresponding minimum.[25] This fixed range makes D' particularly useful for comparing LD strength across loci with varying allele frequencies, as it reaches its extremes (0 or 1) at complete linkage equilibrium or disequilibrium, respectively.[25] In contrast, r^2 = D^2 / (p_A (1-p_A) p_B (1-p_B)) also normalizes D but incorporates the variance of the alleles, yielding a range of 0 to 1 in theory, as it represents the squared correlation between loci. However, its maximum attainable value is highly sensitive to allele frequencies and rarely reaches 1 except when p_A = p_B or p_A = 1 - p_B. 
Under a uniform distribution of allele frequencies, the expected maximum r^2 is approximately 0.43051, peaking at about 0.53091 when the minor allele frequency is around 0.301.[26] With rare alleles or high recombination rates, r^2 approaches 0 rapidly, limiting its power to detect weak LD in such scenarios.[26]
Different LD measures saturate at varying strengths depending on allele frequency balance, affecting their suitability for detecting subtle associations. For instance, D' maintains sensitivity to weak LD even with imbalanced frequencies (e.g., one rare allele), where it can exceed 0.8 more readily than r^2, which saturates below 0.5 in ~86% of frequency configurations. D, while informative for absolute deviation, lacks normalization and thus varies widely without direct comparability. The following table summarizes these properties:

| Measure | Theoretical Range | Allele Frequency Dependence | Saturation Behavior | Example in Imbalanced Frequencies (e.g., p_A = 0.1, p_B = 0.5) |
|---|---|---|---|---|
| D | [\max(-p_A p_B, -(1-p_A)(1-p_B)), \min(p_A (1-p_B), (1-p_A) p_B)] | High; bounds shrink with rarity | Saturates at frequency-specific maxima, e.g., 0.05 here | Max D = 0.05 (range -0.05 to 0.05) |
| D' | [-1, 1] (absolute 0 to 1) | Low; normalized to max possible D | Reaches 1 at complete LD across frequencies; detects weak LD better | Can reach 1 for full coupling; sensitive to small deviations despite rarity[25] |
| r^2 | [0, 1] | High; max <1 unless balanced | Often <0.5; drops with frequency disparity or recombination | Max r^2 \approx 0.11; underestimates weak LD with rare alleles[26] |
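The example column of the table can be reproduced from the bounds on D given in this section. The sketch below uses hypothetical function names and takes only the two allele frequencies; it returns the feasible range of D and the largest r² that any haplotype configuration can reach for those frequencies.

```python
def d_bounds(p_a: float, p_b: float):
    """Feasible (D_min, D_max) for allele frequencies p_A and p_B."""
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    d_min = -min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    return d_min, d_max


def max_r_squared(p_a: float, p_b: float) -> float:
    """Largest attainable r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B))."""
    d_min, d_max = d_bounds(p_a, p_b)
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return max(d_min ** 2, d_max ** 2) / denom


if __name__ == "__main__":
    print(d_bounds(0.1, 0.5))       # approximately (-0.05, 0.05)
    print(max_r_squared(0.1, 0.5))  # approximately 0.11
    print(max_r_squared(0.5, 0.5))  # 1.0 when the frequencies match
```

At any of these extremes |D'| equals 1 by construction, which is the contrast the table draws: D' saturates at complete LD regardless of frequency, whereas r² is capped near 0.11 when p_A = 0.1 and p_B = 0.5.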
Examples and Applications
Two-Locus Two-Allele Model
The two-locus two-allele model provides a foundational framework for analyzing linkage disequilibrium (LD) by considering two genetic loci, each with two possible alleles, and examining the frequencies of the resulting four haplotypes. This model assumes a population where alleles at locus A are denoted A (frequency p_A) and a (frequency 1 - p_A), and at locus B are B (frequency p_B) and b (frequency 1 - p_B). The haplotypes are AB, Ab, aB, and ab, with their frequencies representing the joint probabilities of allele combinations on the same chromosome.
Consider an illustrative example with the following haplotype frequencies: P(AB) = 0.4, P(Ab) = 0.1, P(aB) = 0.2, and P(ab) = 0.3. The marginal allele frequencies are calculated as p_A = P(AB) + P(Ab) = 0.5, p_a = 1 - p_A = 0.5, p_B = P(AB) + P(aB) = 0.6, and p_b = 1 - p_B = 0.4. Under linkage equilibrium, the expected haplotype frequencies would equal the products of the marginal allele frequencies, such as p_A p_B = 0.5 \times 0.6 = 0.3 for AB. The observed deviations from these expectations indicate LD.
The LD coefficient D quantifies this deviation for the AB haplotype as D = P(AB) - p_A p_B = 0.4 - 0.5 \times 0.6 = 0.1. To normalize D for allele frequency dependence, D' is computed as D / D_{\max}, where D_{\max} = \min(p_A p_b, p_a p_B) = \min(0.5 \times 0.4, 0.5 \times 0.6) = 0.2, yielding D' = 0.1 / 0.2 = 0.5. The squared correlation coefficient r^2 is given by r^2 = D^2 / (p_A p_a p_B p_b) = 0.1^2 / (0.5 \times 0.5 \times 0.6 \times 0.4) = 0.01 / 0.06 \approx 0.167. These values demonstrate moderate LD in the example. The following table illustrates the observed haplotype frequencies alongside expectations under linkage equilibrium, highlighting the positive deviation for AB and ab, and negative for Ab and aB:

| Haplotype | Observed Frequency | Expected Frequency under Linkage Equilibrium |
|---|---|---|
| AB | 0.4 | 0.3 |
| Ab | 0.1 | 0.2 |
| aB | 0.2 | 0.3 |
| ab | 0.3 | 0.2 |
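The arithmetic of this worked example can be reproduced with a short function that takes the four haplotype frequencies and returns the allele frequencies together with D, D', and r². The function and argument names are illustrative; D' is signed here, using the positive-D and negative-D normalizers described under Limits and Ranges of LD Measures.

```python
def ld_summary(f_AB: float, f_Ab: float, f_aB: float, f_ab: float) -> dict:
    """Compute p_A, p_B, D, D', and r^2 from two-locus haplotype frequencies."""
    total = f_AB + f_Ab + f_aB + f_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    p_a = f_AB + f_Ab  # frequency of allele A
    p_b = f_AB + f_aB  # frequency of allele B
    d = f_AB - p_a * p_b

    # Normalizer depends on the sign of D.
    if d > 0:
        d_limit = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_limit = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_limit if d_limit > 0 else 0.0

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return {"p_A": p_a, "p_B": p_b, "D": d, "D_prime": d_prime, "r2": r2}


if __name__ == "__main__":
    result = ld_summary(0.4, 0.1, 0.2, 0.3)
    # Approximately D = 0.1, D' = 0.5, r^2 = 0.167, matching the text above.
    print({name: round(value, 3) for name, value in result.items()})
```

Swapping the coupling and repulsion frequencies, for example `ld_summary(0.1, 0.4, 0.3, 0.2)`, keeps r² unchanged but flips the sign of D, which illustrates that the sign depends only on which alleles are labeled A and B.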