
Molecular phylogenetics

Molecular phylogenetics is the branch of evolutionary biology that employs molecular data, such as DNA, RNA, and protein sequences, to infer the phylogenetic relationships and evolutionary histories among organisms, populations, or genes. By analyzing genetic similarities and differences, it constructs phylogenetic trees, diagrammatic representations of evolutionary divergence in which branches symbolize lineages and nodes indicate speciation or divergence events. This approach has revolutionized systematics by providing quantifiable, heritable markers that overcome limitations of morphological data, such as homoplasy arising from convergence.

The field traces its origins to the mid-20th century, building on earlier immunological and protein-based studies, notably the serological tests that George H. F. Nuttall used in 1904 to assess primate relationships. A pivotal advance came in the 1960s with the work of Émile Zuckerkandl and Linus Pauling, who proposed using amino acid sequences to estimate divergence times under a molecular clock, which assumes approximately constant evolutionary rates. The advent of DNA sequencing technologies in the late 1970s, pioneered by Frederick Sanger, propelled molecular phylogenetics forward, enabling large-scale sequence comparisons and shifting the discipline from qualitative to statistical inference. By the 1980s, computational tools such as the neighbor-joining algorithm had formalized tree building, marking the transition to modern phylogenetics.

Central to molecular phylogenetics are several key methods for tree reconstruction, each addressing the challenges of sequence evolution under models of nucleotide or amino acid substitution. Maximum parsimony seeks the tree requiring the fewest evolutionary changes, a concept formalized by Walter Fitch in 1971. Distance-based methods, such as neighbor-joining introduced by Saitou and Nei in 1987, use pairwise genetic distances derived from models like Jukes-Cantor (1969) to cluster taxa efficiently.
More sophisticated model-based approaches include maximum likelihood, developed by Joseph Felsenstein in 1981, which evaluates tree topologies by maximizing the probability of observing the data given an evolutionary model, and Bayesian inference, advanced in the late 1990s and early 2000s with software like MrBayes (2001), which incorporates prior probabilities for robust uncertainty estimation. These methods are often paired with bootstrap resampling to assess branch support and with fossil-calibrated molecular clocks to date divergences.

Applications of molecular phylogenetics extend across biology, informing fields from conservation to medicine. It has clarified major evolutionary transitions, such as the human-chimpanzee divergence approximately 6–7 million years ago based on genomic analyses. In epidemiology, it traces pathogen origins, like HIV-1's zoonotic jump from chimpanzees circa 1900–1930. Phylogenomics, an extension using genome-wide data, addresses complex issues like incomplete lineage sorting and horizontal gene transfer, enhancing resolution in deep-time phylogenies. Despite challenges like long-branch attraction and rate heterogeneity, ongoing advances in sequencing and computation continue to refine its accuracy and scope.

Fundamentals

Definition and Scope

Molecular phylogenetics is the branch of phylogenetics that reconstructs the evolutionary histories of organisms by analyzing sequences of DNA, RNA, or proteins, with a focus on molecular synapomorphies, that is, shared derived characters at the molecular level. This approach identifies evolutionary relationships through homologous molecular sequences that reflect common ancestry, enabling the construction of phylogenetic trees that depict branching patterns, either rooted (designating an ancestral node) or unrooted (showing relative divergences without a specified root). Pioneering work by Émile Zuckerkandl and Linus Pauling laid the groundwork for using molecular data to document evolutionary history.

The scope of molecular phylogenetics extends from the analysis of single genes to comprehensive comparisons of entire genomes, a field termed phylogenomics, and applies across diverse evolutionary scales, including population-level variation and deep-time divergences between ancient lineages. It uses sequence homology to trace relationships at higher taxonomic ranks, forming a core component of molecular systematics and evolutionary biology.

In contrast to classical phylogenetics, which depends on observable phenotypic traits and morphological similarities, molecular phylogenetics relies on quantifiable rates of sequence change. The neutral theory of molecular evolution, introduced by Motoo Kimura, exemplifies this view by positing that most molecular changes are due to random genetic drift rather than natural selection. Molecular sequences provide discrete, heritable characters for comparison, offering key benefits such as enhanced resolution for distinguishing closely related species and elucidating ancient divergences where fossil or morphological evidence is limited or absent.

Molecular Data Sources

Molecular phylogenetics primarily utilizes sequences from DNA, RNA, and proteins as data sources to infer evolutionary relationships among organisms. DNA sequences, derived from nuclear, mitochondrial, or chloroplast genomes, serve as the most common molecular markers due to their abundance and variability across taxa. RNA sequences, particularly ribosomal RNA (rRNA), provide conserved regions suitable for broad comparative analyses, while protein sequences, translated from coding regions, offer amino acid-level resolution that captures functional constraints on evolution. These data types are selected based on the phylogenetic question: nuclear DNA for genome-wide patterns, mitochondrial DNA for maternal lineages in animals, and chloroplast DNA for plant-specific histories.

Key properties of these data sources influence their suitability for different evolutionary questions. DNA sequences exhibit high variability, particularly in non-coding regions and at synonymous sites, making them suitable for resolving recent divergences where substitutions accumulate rapidly; for instance, animal mitochondrial DNA evolves 5-10 times faster than nuclear DNA, enabling fine-scale population studies. In contrast, protein sequences are more conserved due to selection pressures on amino acid functionality, providing robust signals for deep phylogenies spanning millions of years. RNA molecules, such as rRNA, balance conservation and variability through structured domains. At the population level, haplotypes or alleles serve as comparable units for tracing lineage-specific inheritance patterns. These properties also inform evolutionary rate inferences, as variable DNA sites help calibrate molecular clocks for divergence timing.

Acquisition of these molecular data has evolved from targeted methods to high-throughput approaches. Traditional polymerase chain reaction (PCR) amplification isolates specific loci from extracted genomic material, allowing precise sequencing of genes like rRNA or mitochondrial markers in limited samples.
Modern next-generation sequencing (NGS) enables genome-scale data generation by parallelizing millions of reads, facilitating phylogenomic analyses with reduced bias and increased resolution for complex datasets. Sequence alignment is essential prior to analysis to identify homologous positions across taxa. Representative examples illustrate the application of these data sources. Cytochrome c, a highly conserved protein, was among the first used in early molecular studies to reconstruct eukaryotic phylogenies based on amino acid differences, revealing branching patterns consistent with classical taxonomy. The 16S rRNA gene, an RNA marker with conserved core regions, revolutionized microbial phylogenetics by enabling the classification of prokaryotes into domains, as demonstrated in foundational analyses that uncovered the Archaea. For species-level identification, the cytochrome c oxidase I (COI) gene from mitochondrial DNA acts as a DNA barcode, offering rapid discrimination due to its moderate mutation rate and universal primers. Several considerations affect the reliability of these data in phylogenetic inference. Homoplasy, the independent convergence or reversal of sequence states, can obscure true relationships, with rates varying by data type—higher in rapidly evolving DNA than in constrained proteins. Insertion-deletion mutations (indels) introduce gaps in alignments, potentially adding informative characters but complicating homology assessment if not properly coded. In RNA data, secondary structures formed by base pairing must be accounted for, as they influence substitution patterns and alignment accuracy in conserved regions like rRNA stems and loops.

Historical Development

Early Foundations (Pre-1980s)

The foundations of molecular phylogenetics emerged in the mid-20th century, driven by advances in biochemistry that allowed comparisons of molecular sequences to infer evolutionary relationships, shifting the focus from morphological traits to quantifiable genetic and protein data. In their 1962 work on "molecular disease" and evolution, Émile Zuckerkandl and Linus Pauling posited that mutations in protein sequences, akin to those causing genetic disorders like sickle cell anemia, could serve as markers of evolutionary change, linking molecular alterations directly to phylogenetic history. This idea laid the groundwork for using proteins as evolutionary documents, emphasizing how sequence variations accumulate over time to reflect divergence among species.

Pioneering work by Zuckerkandl and Pauling in 1965 formalized the molecular clock hypothesis, proposing that mutations in proteins and nucleic acids accumulate at a relatively constant rate, enabling the estimation of evolutionary timelines through sequence comparisons. They applied this to protein sequences, particularly hemoglobins, to estimate primate phylogenies, suggesting that differences in amino acid sequences could quantify divergence times and test traditional taxonomic hierarchies based on morphology. Their comparison of primate hemoglobins highlighted closer relatedness among humans, chimpanzees, and gorillas than previously thought from morphology.

Early techniques for molecular comparison included protein electrophoresis in the 1950s, which separated proteins based on charge and size to detect variations, and immunological methods that measured antigenic distances between proteins from different species. These approaches, such as microcomplement fixation, provided initial quantitative estimates of genetic similarity without full sequencing. By the 1970s, DNA-DNA hybridization emerged as a complementary technique, in which the thermal stability of hybrid DNA duplexes from different species indicated their sequence similarity, allowing estimation of evolutionary rates for broader taxonomic groups.

A landmark event was the 1967 publication by Walter M. Fitch and Emanuel Margoliash of a least-squares method using cytochrome c sequences from multiple species to construct phylogenetic trees, minimizing deviations between observed and inferred evolutionary distances. This approach demonstrated the feasibility of tree building from molecular data, revealing patterns like equidistant divergence in vertebrates. That same year, Vincent Sarich and Allan C. Wilson published the first molecular phylogeny of primates using immunological distances from serum albumins, estimating the human-chimpanzee divergence at about 5 million years ago, far more recent than morphological estimates, and upending classical views of hominid evolution.

This molecular turn enabled a conceptual shift from qualitative morphological assessments to quantitative evolutionary rates. Relatively constant rates, typically around 0.1% amino acid substitutions per site per million years for many proteins, provided a clock-like metric for timing splits, though calibrations varied by lineage and were informed by fossil evidence. These pre-1980s innovations established molecules as reliable phylogenetic tools, influencing later DNA sequencing efforts by proving the power of sequence data to resolve deep evolutionary questions.

Expansion and Modernization (1980s-Present)

The 1980s marked a pivotal era in molecular phylogenetics, driven by technological breakthroughs that enabled the routine generation of DNA sequence data. The invention of the polymerase chain reaction (PCR) in 1983 by Kary Mullis revolutionized nucleic acid amplification, allowing researchers to produce sufficient quantities of target DNA for sequencing and analysis from minimal starting material. Concurrently, Sanger sequencing, developed in 1977 by Frederick Sanger and colleagues, saw widespread adoption throughout the 1980s as automated versions became commercially available, facilitating the sequencing of longer DNA fragments and enabling the first systematic comparisons of ribosomal RNA (rRNA) genes across diverse taxa. A landmark achievement came in 1986 when Carl Woese published an rRNA-based phylogenetic tree that underscored the deep evolutionary divergences among prokaryotes, laying the groundwork for recognizing Archaea as a distinct domain. By the late 1980s and into the 1990s, initial whole-genome comparisons emerged, such as those involving small viral and mitochondrial genomes, which provided early insights into genome-wide evolutionary patterns beyond single genes.

The 2000s ushered in phylogenomics, characterized by the integration of multi-gene datasets to reconstruct more robust evolutionary histories. This shift was propelled by the completion of the Human Genome Project in 2003, which not only sequenced the entire human genome but also established infrastructure for high-throughput genomic analyses, enabling comparative studies across species. Seminal reviews highlighted how genome-scale data resolved longstanding ambiguities in animal and microbial phylogenies, such as the placement of bilaterian lineages, by analyzing hundreds of orthologous genes simultaneously. These multi-locus approaches reduced stochastic errors from single-gene analyses and improved resolution of deep divergences, marking a transition from gene-centric to genome-wide inference.
From the 2010s onward, next-generation sequencing (NGS) technologies, exemplified by Illumina's platform introduced in 2005 and commercialized in 2006, dramatically increased data throughput, allowing for metagenomic surveys of unculturable microbes and complex communities. This facilitated real-time phylogenetic tracking in viral epidemics; for instance, NGS enabled detailed reconstruction of HIV transmission networks by capturing intra-host diversity and inter-individual spread. Similarly, during the COVID-19 pandemic starting in 2020, NGS-driven phylogenetics allowed global monitoring of SARS-CoV-2 variants, revealing rapid evolutionary dynamics and informing public health responses through continuous genome surveillance. Post-2020, integration of machine learning has optimized tree search algorithms, using neural networks to predict optimal topologies from vast datasets and accelerate inference under complex models. These advancements precipitated paradigm shifts in the field, notably from single-gene phylogenies to genome-wide analyses, which better capture reticulate evolution but introduce challenges like incomplete lineage sorting (ILS). ILS occurs when ancestral polymorphisms persist through speciation events, leading to discordant gene trees; coalescent models, formalized in the multispecies coalescent framework, address this by modeling gene coalescence within species trees, enabling more accurate species-level inferences. This coalescent-based approach has become essential for resolving rapid radiations and hybridization events, transforming molecular phylogenetics into a data-rich, computationally intensive discipline.

Theoretical Foundations

Principles of Phylogenetic Inference

Molecular phylogenetics reconstructs evolutionary histories by inferring hypotheses of ancestor-descendant relationships among organisms using molecular data, such as DNA sequences. These relationships are typically represented as phylogenetic trees, which are branching diagrams illustrating the divergence of lineages from common ancestors. Cladograms depict these relationships without indicating the amount of evolutionary change along branches, focusing solely on the topology of splits, while phylograms incorporate branch lengths proportional to the extent of change, such as nucleotide substitutions.

The primary goal of phylogenetic inference is to identify the tree topology that best explains the observed molecular data under a chosen optimality criterion: maximum parsimony favors the tree requiring the fewest evolutionary changes, whereas maximum likelihood selects the tree that maximizes the probability of the data under a specified evolutionary model. To establish directionality in these trees, outgroup rooting is employed, where a distantly related taxon (the outgroup) is included to identify the root, thereby orienting the tree relative to the ingroup of interest and distinguishing ancestral from derived states.

Central to this process is the distinction between homology (similarity in molecular characters due to shared ancestry) and homoplasy, which arises from independent evolution via convergence, parallelism, or reversal. In molecular data, character state changes, such as transitions (purine-to-purine or pyrimidine-to-pyrimidine substitutions) versus transversions (purine-to-pyrimidine substitutions or the reverse), are evaluated to trace evolutionary transformations while minimizing homoplasy. Confidence in inferred trees is assessed through bootstrap resampling, a statistical technique that repeatedly samples the alignment sites with replacement to generate pseudoreplicates; the proportion of replicates supporting a particular branch indicates its robustness.
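The bootstrap procedure described above can be illustrated with a minimal sketch. It resamples alignment columns with replacement and reports the fraction of pseudoreplicates in which a user-supplied test holds. The function names, the toy alignment, and the `groups_a_with_b` predicate (a crude stand-in for actually rebuilding a tree from each replicate) are assumptions made for illustration, not part of any particular package.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement to form one pseudoreplicate."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[c] for c in columns)
            for taxon, seq in alignment.items()}

def hamming(s1, s2):
    """Number of aligned positions at which two sequences differ."""
    return sum(a != b for a, b in zip(s1, s2))

def groups_a_with_b(alignment):
    """Toy stand-in for tree building: is A strictly closer to B than to C or D?"""
    ab = hamming(alignment["A"], alignment["B"])
    return ab < min(hamming(alignment["A"], alignment[x]) for x in ("C", "D"))

def bootstrap_support(alignment, supports_branch, n_replicates=200, seed=1):
    """Fraction of pseudoreplicates for which supports_branch(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(supports_branch(bootstrap_replicate(alignment, rng))
               for _ in range(n_replicates))
    return hits / n_replicates

# Hypothetical four-taxon alignment used only for illustration.
ALIGNMENT = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACTTGC",
    "D": "TCGAACTTGC",
}

if __name__ == "__main__":
    print(bootstrap_support(ALIGNMENT, groups_a_with_b))
```

In a real analysis the predicate would run a full tree reconstruction on each pseudoreplicate and check whether the branch of interest appears in the resulting tree.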
Phylogenetic inference operates within a statistical framework that treats tree topologies as testable hypotheses, allowing comparisons of alternative arrangements through metrics like parsimony scores or likelihood values to determine the most supported evolutionary scenario. Monophyly—the condition where a group comprises a common ancestor and all its descendants—is rigorously tested using molecular synapomorphies, which are shared derived character states (e.g., unique sequence motifs) that corroborate the group's unity and distinguish it from outgroups. These analyses presuppose common descent, the shared ancestry of the taxa under study, and gradualism in molecular evolution, where changes accumulate incrementally over time rather than in abrupt shifts, enabling the reliable reconstruction of branching patterns from sequence divergences.

Evolutionary Substitution Models

Evolutionary substitution models describe the probabilistic processes by which nucleotides or amino acids change along phylogenetic branches, providing the mathematical foundation for correcting observed sequence differences to infer evolutionary distances and likelihoods under a given tree topology. These models assume a Markov process in which the probability of a substitution depends only on the current state and the elapsed time, enabling the computation of transition probabilities between character states over evolutionary time.

The simplest model is the Jukes-Cantor (JC69) model, which posits equal rates of substitution among all four nucleotides and equal stationary frequencies of 0.25 for each base. Under this one-parameter model, the evolutionary distance d between two sequences is estimated from the proportion p of observed differences as d = -\frac{3}{4} \ln \left(1 - \frac{4}{3} p \right), which corrects for multiple substitutions at the same site. This distance represents the expected number of substitutions per site, including multiple and back substitutions that the raw proportion p fails to count; it is defined only for p < 3/4. The JC69 model serves as a baseline for more complex scenarios but fits poorly when substitution rates differ among nucleotide pairs or base frequencies are unequal.

The Kimura two-parameter (K80) model extends JC69 by distinguishing between transitions (purine-to-purine or pyrimidine-to-pyrimidine changes) and transversions (purine-to-pyrimidine or vice versa), with transitions occurring at rate \alpha and transversions at rate \beta, where \alpha is typically the larger. Let P be the proportion of transitional differences and Q the proportion of transversional differences; the evolutionary distance K is then K = -\frac{1}{2} \ln \left( (1 - 2P - Q) \sqrt{1 - 2Q} \right). This two-parameter model better captures empirical patterns in DNA evolution, where transitions are typically 2-10 times more frequent than transversions.
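As a small illustration of the two distance formulas above, the following sketch counts transitional and transversional differences between two aligned sequences and applies the JC69 and K80 corrections. The function names and example sequences are hypothetical; sites containing anything other than A, C, G, or T are simply skipped, which is one simple convention among several.

```python
import math

PURINES = {"A", "G"}

def count_differences(seq1, seq2):
    """Return (P, Q): proportions of transitional and transversional differences."""
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a != b:
            # Same chemical class (both purines or both pyrimidines) -> transition.
            if (a in PURINES) == (b in PURINES):
                transitions += 1
            else:
                transversions += 1
    return transitions / compared, transversions / compared

def jc69_distance(p):
    """JC69: d = -(3/4) ln(1 - (4/3) p); requires p < 3/4."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def k80_distance(P, Q):
    """K80: K = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q))."""
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))

if __name__ == "__main__":
    P, Q = count_differences("ACGTACGTAC", "ATGTACGCAA")
    print("P =", P, "Q =", Q)
    print("JC69 distance:", jc69_distance(P + Q))
    print("K80 distance:", k80_distance(P, Q))
```

When transitions make up exactly one third of all differences (P = p/3, Q = 2p/3), the K80 expression reduces to the JC69 formula, which is a convenient sanity check on an implementation.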
For greater flexibility, the general time-reversible (GTR) model accommodates unequal substitution rates among the six possible nucleotide pairs and unequal stationary base frequencies \pi_A, \pi_C, \pi_G, \pi_T. With six relative rate parameters (five of them free) and four frequency parameters (three free, since they sum to 1), GTR is the most parameter-rich reversible model for nucleotides. Its rate matrix Q has off-diagonal entries Q_{ij} = \mu_{ij} \pi_j for i \neq j, with symmetric rates \mu_{ij} = \mu_{ji} and diagonal entries chosen so that each row sums to zero, normalized such that the expected rate is 1. The transition probability matrix over time t is derived from the matrix exponential P(t) = \exp(Qt).

Codon substitution models, such as the Goldman-Yang (GY94) model, address protein-coding genes by treating codons as states in a 61-state Markov process (excluding stop codons), incorporating nonsynonymous (d_N) and synonymous (d_S) substitution rates via the ratio \omega = d_N / d_S. The GY94 model uses a nucleotide-level process scaled by codon degeneracy and \omega, enabling detection of selective pressures; the instantaneous rate from codon i to codon j differing by one nucleotide is proportional to the target nucleotide frequency, and is additionally multiplied by \omega if the change is nonsynonymous. This approach improves inference for coding sequences compared to nucleotide models alone.

Parameters in these models are typically estimated via maximum likelihood (ML), maximizing the likelihood L = \prod_{sites} P(\text{data at site} \mid \text{tree, model}) over branch lengths, rates, and frequencies, often using the expectation-maximization algorithm or numerical optimization. To account for rate heterogeneity across sites, the gamma distribution (+Γ) models continuous variation with shape parameter \alpha, discretized into 4-8 categories for computational efficiency; lower \alpha indicates greater heterogeneity. The proportion of invariant sites (+I) complements this by estimating a fraction \pi_I of sites that never change. The two are often combined as +Γ+I, though their parameters can be non-identifiable without careful estimation.
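To make the likelihood product above concrete, here is a minimal sketch of the per-site likelihood computed by Felsenstein's pruning recursion. To stay self-contained it assumes the JC69 model, whose transition probabilities have a closed form, rather than GTR, which would require a matrix exponential. The tree encoding (a leaf is a taxon name, an internal node is a tuple of `(child, branch_length)` pairs) and the function names are assumptions made for this example.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """JC69 probability of state j after branch length t, starting from state i."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial_likelihoods(node, site):
    """Return {x: P(observed tips below node | node has state x)}."""
    if isinstance(node, str):  # leaf: probability 1 for the observed base
        return {x: 1.0 if x == site[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:  # internal node: tuple of (child, branch_length) pairs
        child_l = partial_likelihoods(child, site)
        for x in BASES:
            result[x] *= sum(jc69_prob(x, y, t) * child_l[y] for y in BASES)
    return result

def log_likelihood(tree, alignment):
    """Sum of per-site log-likelihoods, i.e. ln of the product over sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for k in range(n_sites):
        site = {taxon: seq[k] for taxon, seq in alignment.items()}
        l_root = partial_likelihoods(tree, site)
        # Weight root states by the JC69 stationary frequencies (0.25 each).
        total += math.log(sum(0.25 * l_root[x] for x in BASES))
    return total

if __name__ == "__main__":
    # Hypothetical three-taxon tree: ((A:0.1, B:0.2):0.05, C:0.3)
    tree = (((("A", 0.1), ("B", 0.2)), 0.05), ("C", 0.3))
    alignment = {"A": "ACGT", "B": "ACGA", "C": "ACTT"}
    print(log_likelihood(tree, alignment))
```

An ML search would repeatedly call such a function while adjusting branch lengths and model parameters; this sketch only evaluates a fixed tree.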
Model selection among candidates like JC69, K80, GTR, or codon models uses information criteria such as the Akaike information criterion (AIC = -2 \ln L + 2k, where k is the number of parameters) or the Bayesian information criterion (BIC = -2 \ln L + k \ln n, with n sites), penalizing complexity to favor parsimonious fits that minimize expected prediction error. Lower AIC or BIC values indicate better models, guiding phylogenetic analyses toward those that balance goodness of fit against the risk of overfitting.
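The two criteria can be compared with a few lines of code. In the sketch below the log-likelihood values and alignment length are invented purely for illustration, and the parameter counts include only the free substitution-model parameters (0 for JC69, 1 for K80, 8 for GTR), on the assumption that branch lengths are shared by all candidates on a fixed tree.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: -2 ln L + 2k."""
    return -2.0 * log_likelihood + 2.0 * k

def bic(log_likelihood, k, n_sites):
    """Bayesian information criterion: -2 ln L + k ln n."""
    return -2.0 * log_likelihood + k * math.log(n_sites)

# Hypothetical fits: model -> (ln L, number of free model parameters).
CANDIDATES = {
    "JC69": (-2050.0, 0),
    "K80": (-2010.0, 1),
    "GTR": (-2005.0, 8),
}
N_SITES = 1000  # hypothetical alignment length

def best_model(criterion):
    """Name of the candidate with the lowest score under the given criterion."""
    return min(CANDIDATES, key=lambda name: criterion(*CANDIDATES[name]))

if __name__ == "__main__":
    for name, (lnl, k) in CANDIDATES.items():
        print(name, aic(lnl, k), round(bic(lnl, k, N_SITES), 2))
    print("best by AIC:", best_model(aic))
    print("best by BIC:", best_model(lambda lnl, k: bic(lnl, k, N_SITES)))
```

With these made-up numbers the small gain in likelihood from GTR does not offset its seven extra parameters, so both criteria prefer K80.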

Methodological Approaches

Data Acquisition and Alignment

Molecular phylogenetics relies on the acquisition of high-quality molecular sequences, primarily DNA or RNA, from organisms of interest. Traditional methods for targeted gene acquisition involve polymerase chain reaction (PCR) amplification, which selectively amplifies specific genomic regions such as mitochondrial genes like COI or 16S rRNA using designed primers. This technique, developed in the 1980s, enables the isolation of homologous sequences from diverse taxa for comparative analysis, though it is limited to predefined loci and can introduce biases from primer mismatches.

Next-generation sequencing (NGS) technologies have revolutionized data acquisition by enabling high-throughput shotgun sequencing of entire genomes or transcriptomes without prior knowledge of target regions. Platforms like Illumina, which uses reversible terminator chemistry for short-read sequencing (typically 50-300 bp), provide cost-effective, high-coverage data suitable for phylogenomic studies involving thousands of loci. In contrast, PacBio's single-molecule real-time (SMRT) sequencing, available since the early 2010s, generates long reads (up to 20 kb or more), facilitating the resolution of repetitive regions and structural variants that challenge short-read assemblies in phylogenetic contexts.

Following acquisition, rigorous quality control is essential to mitigate sequencing errors and artifacts. Raw NGS reads undergo trimming of low-quality bases, adapter removal, and error correction using algorithms that model base-calling inaccuracies, often reducing error rates from ~1% in Illumina data to below 0.1%. For de novo assembly, sequences are reconstructed into contigs (continuous fragments representing genomic regions) using tools like SPAdes, a de Bruijn graph-based assembler optimized for small genomes and single-cell data, which handles uneven coverage and chimeric reads effectively. Once assembled, sequences require multiple sequence alignment (MSA) to identify homologous positions for subsequent phylogenetic analysis.
Progressive alignment methods, such as ClustalW introduced in 1994, build alignments stepwise by first computing pairwise distances and then adding sequences in the order given by a guide tree, incorporating affine gap penalties to model the separate costs of opening and extending insertions/deletions (indels). More advanced tools like MUSCLE (2004) enhance accuracy through iterative refinement and an improved profile scoring function, achieving higher accuracy for divergent sequences while efficiently handling indels via dynamic programming. Similarly, MAFFT combines iterative refinement with a fast Fourier transform (FFT) approximation for rapid detection of homologous segments, offering robust indel handling and scalability for large datasets in phylogenetics.

Alignment faces several challenges, particularly with RNA sequences, where secondary structures like hairpins can cause misalignment because folds are often conserved more strongly than primary sequence. Contamination during PCR or sequencing can produce chimeric sequences (artifacts fusing unrelated fragments) that propagate errors into alignments, necessitating validation via read mapping and duplicate removal. Recent advances include single-molecule sequencing platforms like Oxford Nanopore Technologies (ONT), commercialized in the 2010s, which provide real-time, long-read (up to megabase) data without amplification biases, enabling direct RNA sequencing and improved resolution of phylogenetic markers in complex microbial communities.
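The dynamic programming step that underlies these aligners can be sketched for the simplest case: a global pairwise alignment in the style of Needleman and Wunsch. Note that this sketch deliberately uses a single linear gap penalty rather than the affine open/extend scheme mentioned above, and the scoring values and function name are arbitrary choices for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

Multiple sequence aligners apply the same recurrence to profiles (columns of already aligned sequences) instead of single sequences, following the guide tree.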

Tree Reconstruction Techniques

Tree reconstruction techniques in molecular phylogenetics involve algorithms that infer evolutionary relationships from aligned sequence data, typically by optimizing criteria such as evolutionary distance, parsimony, or likelihood under specified models. These methods can be broadly categorized into distance-based, character-based, and probabilistic approaches, each with distinct assumptions and computational strategies. Distance-based methods convert aligned sequences into a matrix of pairwise distances before tree building, while character-based methods directly utilize site-specific substitutions. Probabilistic methods, including maximum likelihood and Bayesian inference, incorporate statistical models to evaluate tree topologies and branch lengths. Distance-based methods construct trees by hierarchically clustering taxa based on estimated evolutionary distances, assuming additivity or ultrametric properties in the data. The unweighted pair group method with arithmetic mean (UPGMA), introduced in 1958, builds rooted trees by successively joining the closest pairs of taxa or clusters, averaging distances arithmetically, and assumes a molecular clock where evolutionary rates are constant across lineages. This assumption limits its accuracy for non-clock-like data, but it remains computationally efficient for exploratory analyses. A more flexible alternative is the neighbor-joining (NJ) algorithm, developed in 1987, which constructs unrooted trees without assuming a clock by minimizing the total branch length through iterative joining of neighboring taxa based on corrected distances. NJ is widely used for its speed and robustness to rate variation, though it can be sensitive to long-branch attraction in highly divergent datasets. Character-based methods infer trees directly from aligned character states, such as nucleotides or amino acids, without intermediate distance matrices. 
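Before turning to character-based methods in detail, the UPGMA procedure described above can be made concrete with a short sketch. It repeatedly merges the closest pair of clusters and averages distances weighted by cluster size. The input format (a dictionary of pairwise distances), the nested-tuple output, and the example distances are assumptions chosen for illustration.

```python
from itertools import combinations

def upgma(dist):
    """UPGMA clustering.

    dist maps unordered taxon pairs (a, b) to distances.
    Returns (tree, height): tree is a nested tuple, and height maps each
    cluster to its node height (half the distance between merged clusters).
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    size = {taxon: 1 for pair in d for taxon in pair}
    height = {taxon: 0.0 for taxon in size}
    active = sorted(size)
    while len(active) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        new = (a, b)
        height[new] = d[frozenset((a, b))] / 2.0
        size[new] = size[a] + size[b]
        active = [c for c in active if c not in (a, b)]
        for c in active:
            # Size-weighted average equals the mean over all leaf pairs.
            d[frozenset((new, c))] = (size[a] * d[frozenset((a, c))]
                                      + size[b] * d[frozenset((b, c))]) / size[new]
        active.append(new)
    return active[0], height

# Hypothetical ultrametric distances among four taxa.
DISTANCES = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
    ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
}

if __name__ == "__main__":
    tree, height = upgma(DISTANCES)
    print(tree, height[tree])
```

Because these example distances are exactly clock-like, UPGMA recovers the generating tree; with unequal rates across lineages the same procedure can group taxa incorrectly, which is the motivation for neighbor-joining.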
Maximum parsimony (MP) seeks the tree requiring the fewest evolutionary changes (steps) to explain the observed data, using algorithms like Fitch parsimony to score topologies efficiently. Formulated in 1971, Fitch's method employs a two-pass dynamic programming approach: a bottom-up pass that intersects the possible ancestral states at internal nodes (taking their union and counting one change where the intersection is empty), followed by a top-down pass that resolves ambiguities, enabling rapid evaluation of candidate trees. MP is intuitive and avoids explicit model assumptions, but finding the optimal tree is NP-hard, so large datasets are computationally demanding: exact branch-and-bound is feasible only for small numbers of taxa, and larger problems rely on heuristic searches such as branch swapping or genetic algorithms.

Maximum likelihood (ML) instead optimizes the probability of observing the data given a tree and substitution model, typically using heuristic searches such as hill-climbing or genetic algorithms to explore topologies. Pioneered for DNA sequences in 1981, ML provides statistical rigor and parameter estimation but demands significant computation, especially for site-heterogeneous models.

Bayesian approaches treat phylogenetic inference as a posterior probability estimation problem, integrating prior distributions over trees and parameters via Markov chain Monte Carlo (MCMC) sampling. MrBayes, released in 2001, implements MCMC to sample from the posterior distribution of trees under mixed models, providing credible sets and posterior probabilities for clades as measures of support. This method excels in incorporating uncertainty and handling complex models but requires careful assessment of chain convergence. For multi-locus data affected by incomplete lineage sorting, coalescent-based models like *BEAST extend Bayesian inference to simultaneously estimate gene trees and species trees. Introduced in 2010 within the BEAST framework, *BEAST uses MCMC to sample species trees under the multi-species coalescent, accounting for gene tree discordance, though it is computationally demanding for large numbers of loci.
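The bottom-up pass of Fitch's method, which is all that is needed to obtain a parsimony score, can be sketched as follows. The top-down pass that assigns specific ancestral states is omitted. Trees are assumed to be rooted, binary, and written as nested 2-tuples of taxon names; the example alignment is hypothetical.

```python
def fitch_site(node, states):
    """Bottom-up Fitch pass for one site.

    Returns (state_set, changes) for the subtree rooted at node.
    """
    if isinstance(node, str):  # leaf: its observed state, no changes
        return {states[node]}, 0
    left, right = node
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    common = left_set & right_set
    if common:  # children agree on at least one state: no extra change
        return common, left_changes + right_changes
    # Disjoint sets: take the union and count one substitution.
    return left_set | right_set, left_changes + right_changes + 1

def parsimony_score(tree, alignment):
    """Total number of changes required by the tree, summed over all sites."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[k] for taxon, seq in alignment.items()})[1]
        for k in range(n_sites)
    )

# Hypothetical four-taxon alignment and two competing topologies.
ALIGNMENT = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAG"}
TREE_1 = (("A", "B"), ("C", "D"))
TREE_2 = (("A", "C"), ("B", "D"))

if __name__ == "__main__":
    print(parsimony_score(TREE_1, ALIGNMENT), parsimony_score(TREE_2, ALIGNMENT))
```

A parsimony search would evaluate this score over many candidate topologies and keep the tree, or trees, with the lowest value.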
Several software packages facilitate these techniques, particularly for large datasets where exact searches are infeasible. PHYLIP, developed since the 1980s, supports distance-based, parsimony, and likelihood methods through a suite of modular programs, enabling batch processing for up to thousands of taxa. For ML inference on massive alignments, RAxML employs randomized accelerated searches with optimized likelihood calculations, achieving high accuracy on datasets with thousands of sequences via parallelization.

To handle computational bottlenecks in ML, approximation methods like quartet puzzling decompose the data into quartets (subsets of four taxa), reconstruct the maximum likelihood tree for each quartet, and then assemble the quartet trees using stepwise addition and repeated puzzling steps that yield branch support. Proposed in 1996 by Strimmer and von Haeseler, this approach replaces a search over a super-exponential number of tree topologies with the evaluation of a number of quartets that grows only polynomially with the number of taxa, while providing puzzle support values analogous to bootstraps.

Hybrid methods, such as supertrees, integrate information from multiple source trees on overlapping subsets of taxa to produce a comprehensive phylogeny, useful when direct analysis of all data is impractical. Supertree construction often employs matrix representation with parsimony (MRP), encoding source trees as binary character matrices and resolving the most parsimonious supertree. Detailed in a 2004 volume, these methods have enabled large-scale phylogenies, like the mammal tree of life, by combining hundreds of partial trees, though they can propagate errors from the source trees if overlap is sparse.
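The MRP encoding mentioned above is simple enough to sketch directly. Each informative clade of each source tree becomes one binary character: taxa inside the clade are coded 1, taxa present in that source tree but outside the clade are coded 0, and taxa absent from the source tree are coded as missing (?). The nested-tuple tree format and function names are assumptions for illustration, and the all-zero outgroup row that is usually added to root the analysis is left out.

```python
def leaves_and_clades(tree):
    """Return (leaf_set, clades): every internal node contributes its leaf set."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = leaves_and_clades(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades

def mrp_matrix(source_trees):
    """Encode rooted source trees as an MRP character matrix {taxon: string}."""
    parsed = [leaves_and_clades(tree) for tree in source_trees]
    all_taxa = sorted(set().union(*(leaves for leaves, _ in parsed)))
    matrix = {taxon: "" for taxon in all_taxa}
    for leaves, clades in parsed:
        for clade in clades:
            if clade == leaves:
                continue  # the root clade holds every taxon: uninformative
            for taxon in all_taxa:
                if taxon in clade:
                    matrix[taxon] += "1"
                elif taxon in leaves:
                    matrix[taxon] += "0"
                else:
                    matrix[taxon] += "?"  # taxon missing from this source tree
    return matrix

if __name__ == "__main__":
    # Two hypothetical source trees with partially overlapping taxa.
    sources = [(("A", "B"), "C"), (("B", "C"), "D")]
    for taxon, row in mrp_matrix(sources).items():
        print(taxon, row)
```

The resulting matrix would then be analyzed with a parsimony program to obtain the supertree; that search step is outside the scope of this sketch.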

Tree Evaluation and Validation

Tree evaluation and validation in molecular phylogenetics involves quantitative methods to assess the statistical support for inferred tree topologies and the robustness of phylogenetic inferences to variations in data or method. These techniques help determine the reliability of clades and overall topology, distinguishing signal from noise in molecular data. Support measures, topological tests, congruence assessments, and visualization tools are essential for interpreting results, while recent Bayesian advances provide additional checks for model adequacy.

Support measures quantify the stability of individual clades or branches within a phylogenetic tree. The bootstrap method, introduced by Felsenstein in 1985, resamples the original alignment with replacement to generate pseudoreplicates, from which trees are reconstructed; the proportion of replicates supporting a particular clade is its bootstrap value, often interpreted as a measure of confidence (e.g., thresholds between 70% and 95% are commonly used to indicate strong support). Similarly, jackknife resampling deletes a subset of sites (typically 37%) without replacement to produce pseudoreplicates, yielding jackknife support percentages that assess node robustness, particularly in parsimony-based analyses, where it can outperform the bootstrap in detecting weak support.

Topological tests compare the likelihoods or parsimony scores of alternative tree topologies to evaluate whether differences are statistically significant. The Kishino-Hasegawa (KH) test, developed in 1989, compares the log-likelihoods of two trees under the same substitution model and asks whether their difference departs significantly from zero, using the variance of the per-site log-likelihood differences and treating the test statistic as approximately normally distributed. For scenarios involving multiple candidate trees, the Shimodaira-Hasegawa (SH) test extends this approach by incorporating a correction for multiple comparisons, generating null distributions via nonparametric bootstrap resampling to avoid overconfidence in the best tree.
These topological tests are particularly valuable when evaluating constrained topologies against unconstrained ones, such as those incorporating prior biological hypotheses.
Congruence assessment evaluates whether separate data partitions (e.g., genes or genomic regions) yield compatible phylogenetic signals, which is crucial for multi-locus analyses. The partition homogeneity test, also known as the incongruence length difference (ILD) test, proposed by Farris et al. in 1994, compares the summed lengths of the most parsimonious trees for the original partitions with a null distribution obtained by randomly repartitioning the combined dataset into partitions of the same sizes; a significantly shorter summed length for the original partitions indicates incongruence, suggesting that combining the data may mislead. To gauge phylogenetic informativeness, spectral methods decompose sequence data into frequency components to detect and quantify phylogenetic signal versus noise, as in covariance-based spectral analyses that capture phylogenetic signal even when structural information is not fully retained. Complementary approaches, like Townsend's phylogenetic informativeness profiling (2007), plot site-specific contributions to resolution over time, highlighting markers optimal for specific divergences.
Visualization techniques summarize support across multiple trees, aiding interpretation of uncertainty. Consensus trees aggregate clades from a set of trees (e.g., bootstrap replicates): a strict consensus keeps only clades found in every tree, while a majority-rule consensus keeps clades found in more than half of them, with branch labels indicating the percentage of trees supporting each clade. These summaries provide a compact representation of the prevalent topology without implying a single "true" tree. In Bayesian analyses, clade credibility is assessed via posterior probabilities, where the proportion of posterior tree samples containing a clade reflects its probability under the model; values exceeding 0.95 often denote high credibility, as implemented in tools like MrBayes.
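As a rough illustration of how consensus summaries are tallied, the sketch below counts how often each clade appears across a set of trees and then filters by frequency. Each tree is represented simply as a set of clades (frozensets of taxon names). That representation, the function names, and the four toy trees are assumptions made for the example, and the same tally applied to posterior samples would give clade posterior probabilities.

```python
from collections import Counter


def clade_counts(trees):
    """Count the number of trees in which each clade appears."""
    return Counter(clade for clades in trees for clade in clades)


def majority_rule_clades(trees, threshold=0.5):
    """Clades found in more than `threshold` of the trees, with their frequencies."""
    n = len(trees)
    return {c: k / n for c, k in clade_counts(trees).items() if k / n > threshold}


def strict_clades(trees):
    """Clades found in every tree."""
    n = len(trees)
    return {c for c, k in clade_counts(trees).items() if k == n}


AB, BC = frozenset("AB"), frozenset("BC")
ABC, ABD, ABCD = frozenset("ABC"), frozenset("ABD"), frozenset("ABCD")

# Four toy trees on taxa A-E, e.g. from bootstrap replicates (made-up data).
trees = [
    {AB, ABC, ABCD},
    {AB, ABC, ABCD},
    {BC, ABC, ABCD},
    {AB, ABD, ABCD},
]
majority = majority_rule_clades(trees)
strict = strict_clades(trees)
for clade, freq in sorted(majority.items(), key=lambda item: sorted(item[0])):
    print(sorted(clade), f"{100 * freq:.0f}%")
```

Clades above the 50% threshold are guaranteed to be mutually compatible, which is why the majority-rule set can always be drawn as a single tree.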
Recent advances in Bayesian phylogenetics, developed largely since 2010, include posterior predictive checks, which simulate replicate datasets from the fitted posterior distribution to evaluate model fit. These checks compare statistics of the observed data (e.g., tree imbalance or site pattern frequencies) to their posterior predictive distributions, identifying discrepancies that indicate inadequate models; for instance, they can flag violations of substitution model assumptions in multi-gene datasets. Such methods strengthen validation by carrying the uncertainty in the fitted model through to the assessment of fit.
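The comparison step of a posterior predictive check can be sketched as follows. Simulating datasets from the posterior is assumed to be done by the inference software, so the simulated statistics here are placeholder numbers; the summary statistic (number of distinct site patterns) and both function names are illustrative choices, not a prescribed test.

```python
def count_site_patterns(alignment):
    """Number of distinct alignment columns, a simple summary statistic."""
    return len(set(zip(*alignment.values())))


def posterior_predictive_p(observed_stat, simulated_stats):
    """Fraction of simulated statistics at least as large as the observed one."""
    extreme = sum(1 for s in simulated_stats if s >= observed_stat)
    return extreme / len(simulated_stats)


# Toy alignment with one repeated column (made-up data).
toy_alignment = {"t1": "AACA", "t2": "AAGA", "t3": "ATGA"}
n_patterns = count_site_patterns(toy_alignment)

# Placeholder statistics from datasets simulated under the fitted model.
simulated = [10, 11, 13, 12, 9, 14, 10, 11]
p_value = posterior_predictive_p(12, simulated)
print(n_patterns, p_value)
```

Values of this tail proportion close to 0 or 1 mean the observed statistic sits in the extremes of what the model predicts, which is the kind of discrepancy the check is meant to expose.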

Applications

In Systematics and Taxonomy

Molecular phylogenetics has played a pivotal role in cladistics by providing genetic evidence that reveals paraphyletic groupings in traditional classifications, thereby necessitating revisions to reflect monophyletic clades. For instance, classical taxonomy treated Reptilia as excluding birds, rendering it paraphyletic, but molecular analyses of ribosomal RNA and protein-coding genes have demonstrated that birds nest within the reptilian clade as the sister group to crocodilians, forming the monophyletic Archosauria, a result consistent with the descent of birds from dinosaurs. This shift underscores how sequence-based phylogenies prioritize shared derived characters over superficial morphological traits, aligning classifications more closely with evolutionary history.
In species delimitation, molecular phylogenetics employs genetic distance thresholds to identify boundaries between taxa, often using DNA barcoding with the mitochondrial COI gene, where a sequence divergence of about 2% serves as an empirical cutoff for distinguishing animal species based on intra- versus interspecific variation. This approach has revolutionized taxonomy by enabling rapid identification of cryptic species that are morphologically indistinguishable, though it is most effective within an integrative taxonomy framework that combines molecular, morphological, ecological, and behavioral evidence to resolve ambiguities. For example, barcoding has uncovered hidden diversity in insects and marine invertebrates, refining species counts and supporting conservation priorities.
For microbial taxonomy, the 16S rRNA gene has become the cornerstone marker due to its conserved structure and variable regions, allowing the delineation of bacterial phyla and the construction of the tree of life for prokaryotes, as pioneered by Carl Woese's analyses that defined the domains Bacteria, Archaea, and Eukarya.
This molecular approach has been essential for classifying unculturable microbes, which comprise over 99% of bacterial diversity, by enabling their detection through metagenomic sequencing of environmental samples without the need for cultivation. Metagenomics has thus expanded the known bacterial phylum count from a handful to over 100, revealing novel lineages in habitats like deep-sea vents and soil microbiomes.
Notable case studies illustrate these applications: molecular clock estimates from nuclear and mitochondrial sequences place the human-chimpanzee divergence at approximately 6-7 million years ago, in agreement with fossil evidence, and clarify hominid relationships within Primates. Similarly, mitochondrial DNA phylogenies have elucidated whale evolution, showing that mysticete (baleen) whales diverged from odontocetes (toothed whales) around 34-36 million years ago, while broader molecular analyses place cetaceans within Artiodactyla, closest to hippopotamuses.
The broader impact of molecular phylogenetics on systematics is evident in its transformation of the Linnaean hierarchy, replacing artificial groupings with clade-based classifications supported by genomic data. A landmark example is the 1997 proposal of the Ecdysozoa clade, encompassing arthropods, nematodes, and other molting animals, based on small subunit rRNA sequences that united these phyla in a monophyletic group, overturning prior morphological schemes and influencing metazoan phylogeny ever since. Next-generation sequencing has further scaled these efforts, facilitating phylogenomic analyses of thousands of loci for robust taxonomic revisions across kingdoms.
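The distance-threshold idea behind DNA barcoding, mentioned earlier in this section, can be illustrated with a small sketch. It uses the uncorrected p-distance (proportion of differing sites) and a 2% cutoff as a default parameter; the function names and the toy sequences are hypothetical, and real barcoding workflows typically use aligned COI fragments and model-corrected distances.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping gaps and ambiguous bases."""
    valid = set("ACGT")
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in valid and b in valid
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def exceeds_barcode_gap(seq_a, seq_b, threshold=0.02):
    """True if the divergence is above the chosen threshold (an assumed 2% here)."""
    return p_distance(seq_a, seq_b) > threshold


# Toy 100 bp reference and two variants (made-up, pre-aligned sequences).
reference = "ACGT" * 25
one_change = "G" + reference[1:]
bases = list(reference)
bases[0], bases[10], bases[20] = "G", "A", "T"
three_changes = "".join(bases)

print(p_distance(reference, one_change), exceeds_barcode_gap(reference, one_change))
print(p_distance(reference, three_changes), exceeds_barcode_gap(reference, three_changes))
```

A fixed cutoff is a heuristic: young species pairs can fall below it and variable species can exceed it, which is one reason the text recommends combining barcodes with other evidence.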

In Broader Biological and Interdisciplinary Contexts

Molecular phylogenetics extends into evolutionary biology by enabling the estimation of divergence times through relaxed molecular clock models, which accommodate variable evolutionary rates across lineages. The BEAST software package implements Bayesian methods using Markov chain Monte Carlo sampling to infer time-calibrated phylogenies from molecular sequences, allowing researchers to date speciation events while accounting for rate heterogeneity. For instance, these approaches have reconstructed the timeline of mammalian diversification, revealing that relaxed clocks provide more accurate estimates than strict clocks when rate variation is present. Additionally, molecular phylogenetics identifies adaptive evolution via the nonsynonymous-to-synonymous substitution ratio (dN/dS), where values greater than 1 indicate positive selection driving functional changes. Seminal likelihood ratio tests applied to codon models, such as those in the PAML suite, have detected such signals in primate lysozyme genes, highlighting lineage-specific adaptations to dietary shifts.
In population genetics, molecular phylogenetics facilitates the inference of gene flow and admixture events by integrating phylogenetic trees with population structure analyses. The STRUCTURE software uses multilocus genotype data to assign individuals to ancestral populations and estimate admixture proportions, revealing historical migration patterns in human and non-human species. Phylogeographic studies, often employing mitochondrial DNA (mtDNA) sequences, map intraspecific genetic lineages to geographic distributions, bridging population genetics and systematics. John Avise's foundational work demonstrated how mtDNA phylogenies uncover post-glacial recolonization routes in North American fauna, illustrating barriers to gene flow.
Interdisciplinary applications of molecular phylogenetics span medicine, conservation, and forensics.
In medical phylogenetics, pathogen evolution is tracked using whole-genome phylogenies to understand transmission dynamics, as seen in HIV-1 studies where trees reveal quasispecies diversity within hosts and epidemic spread across populations. Conservation efforts leverage single nucleotide polymorphisms (SNPs) from molecular phylogenies to assess relatedness among endangered species, informing breeding programs and prioritizing genetic diversity preservation. For example, genomic analyses of threatened felids have identified distinct lineages to guide reintroduction strategies. In forensics, DNA barcoding of cytochrome c oxidase I (COI) genes constructs phylogenetic references for species identification in poaching investigations, enabling rapid detection of illegal wildlife trade.
Recent examples underscore the real-time utility of molecular phylogenetics in global challenges. During the COVID-19 pandemic, GISAID-facilitated phylogenies of SARS-CoV-2 genomes from 2020 onward traced variant emergence and international introductions, supporting public health responses like variant surveillance. In agriculture, phylogenomic analyses of crop genomes elucidate domestication histories, such as the multiple origins of maize from teosinte ancestors, aiding breeding for resilient varieties.
Economically, molecular phylogenetics informs drug target identification by comparing parasite-host evolutionary trees, highlighting unique pathogen pathways absent in hosts. For protozoan parasites like Plasmodium, ortholog-based phylogenomics has prioritized essential genes for antimalarial development, reducing trial-and-error costs in neglected disease therapies.
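The likelihood ratio tests for positive selection mentioned earlier in this section compare the fit of two nested codon models, one of which allows dN/dS to exceed 1. A minimal sketch of the arithmetic follows, assuming the usual chi-squared approximation for twice the log-likelihood difference. The log-likelihood values are invented for illustration, the function name is hypothetical, and the appropriate degrees of freedom depend on which models are compared.

```python
import math


def lrt_p_value(lnl_null, lnl_alt, df):
    """Likelihood ratio statistic and chi-squared p-value for nested models.

    Closed-form tail probabilities are used, so only df = 1 or df = 2
    is supported in this sketch.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2.0))
    elif df == 2:
        p = math.exp(-stat / 2.0)
    else:
        raise ValueError("this sketch handles only df = 1 or df = 2")
    return stat, p


# Hypothetical log-likelihoods: null model without, alternative with, a class of
# sites where dN/dS > 1 (two extra parameters assumed, so df = 2).
stat, p = lrt_p_value(lnl_null=-1045.3, lnl_alt=-1040.1, df=2)
print(f"2*delta lnL = {stat:.2f}, p = {p:.4f}")
```

A small p-value indicates that the model allowing positive selection fits significantly better; identifying which sites or lineages are responsible requires further analysis in the codon-model software itself.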

Challenges and Future Directions

Key Limitations

Molecular phylogenetics faces several biological challenges that can lead to discrepancies between gene trees and species trees. Horizontal gene transfer (HGT) is particularly prevalent in bacteria and plants, where genes are exchanged across lineages, violating the assumption of strictly vertical inheritance and complicating the reconstruction of accurate species phylogenies. For instance, up to about 20% of a bacterial genome may consist of recently acquired foreign genes, which can obscure long-term evolutionary signals and result in phylogenetic discordance. Similarly, incomplete lineage sorting (ILS) occurs when ancestral polymorphisms persist through speciation events, leading to gene tree discordance that does not reflect the species tree topology. This phenomenon is especially pronounced in rapidly radiating lineages, such as bryophytes, where ILS has been invoked to explain up to 99% of gene trees differing from the species tree.
Methodological biases further undermine phylogenetic inference. Long-branch attraction (LBA) in maximum parsimony methods causes fast-evolving, distantly related lineages to be artifactually grouped together because convergent changes are misinterpreted as shared synapomorphies, particularly in the Felsenstein zone, where two long branches are separated by a short internal branch. Model misspecification, such as failing to account for pairwise epistasis in substitution models, can introduce systematic biases that favor incorrect tree topologies, even as dataset size increases.
Data limitations pose additional hurdles, especially for ancient divergences. Substitution saturation arises when multiple changes at the same site in deep-time analyses erode the phylogenetic signal, affecting up to 50% of loci in some phylogenomic datasets and reducing branch support by 4.8% on average.
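Substitution saturation can be made concrete with the Jukes-Cantor model, under which the observed proportion of differing sites p and the true number of substitutions per site d are related by d = -(3/4) ln(1 - 4p/3). The sketch below, with hypothetical function names, shows how p levels off near 0.75 as d grows, so that large true distances become nearly indistinguishable in the observed data.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance from the observed proportion of differences."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75); at 0.75 the sites are saturated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def expected_p(d):
    """Expected observed proportion of differences after d substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


# Observed differences approach the 0.75 ceiling as true divergence increases.
for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"d = {d:>4}: expected p = {expected_p(d):.4f}")
```

Near the ceiling, a tiny sampling error in p translates into a very large change in the corrected distance, which is one way to see why saturated loci carry little usable signal for deep divergences.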
Alignment errors are exacerbated in highly divergent sequences, where inaccuracies in inferring positional homology propagate into downstream phylogenetic estimates, increasing topological error across methods.
Violations of model assumptions also affect results. Non-neutral evolution, driven by positive selection or other forces, can cause incongruence between markers, as seen in non-neutral loci that yield topologies significantly different from those of neutral markers in fish phylogenies. The assumption of constant substitution rates across lineages is frequently violated, and non-clock-like rate variation can be indistinguishable from convergent evolution in some cases, leading to biased divergence time estimates.
Empirical challenges include insufficient taxon sampling and computational demands. Low taxon sampling can introduce undersampling bias, where sparse representation of lineages reduces accuracy more than expected under resource constraints, though increasing sequence length per taxon often mitigates this better than adding taxa. For large datasets, computational intractability arises from the intensive requirements of orthology prediction, coalescent-based analyses, and model selection, rendering full phylogenetic inference infeasible for thousands of genes without approximations.
Phylogenomics represents a major advance in molecular phylogenetics, enabling the construction of species trees from whole-genome data across thousands of loci to better account for processes like incomplete lineage sorting and gene tree discordance. Summary coalescent methods, such as ASTRAL, combine many gene trees into a single species tree and are statistically consistent under the multispecies coalescent model, remaining practical even for large datasets. Extensions like ASTRAL-Pro apply quartet-based inference to multicopy gene families, allowing robust phylogenomic analyses despite paralogy and orthology challenges.
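Quartet-based summary methods rest on a simple observation: under the multispecies coalescent, the most frequent unrooted topology that gene trees induce on any four species is expected to match the species tree. The toy sketch below tallies induced topologies for one quartet. It is not ASTRAL's algorithm, and the split-based tree representation, the function names, and the gene trees are assumptions made for illustration.

```python
from collections import Counter


def quartet_topology(leaves, splits, quartet):
    """Topology a gene tree induces on four taxa, as a frozenset of two pairs.

    `splits` lists one side of each internal branch; returns None if the
    quartet is unresolved in this tree.
    """
    a, b, c, d = quartet
    for side in splits:
        other = leaves - side
        for x, y, z, w in ((a, b, c, d), (a, c, b, d), (a, d, b, c)):
            pair, rest = {x, y}, {z, w}
            if (pair <= side and rest <= other) or (pair <= other and rest <= side):
                return frozenset({frozenset(pair), frozenset(rest)})
    return None


def tally_quartet(gene_trees, quartet):
    """Count the induced topologies of `quartet` across gene trees."""
    counts = Counter()
    for leaves, splits in gene_trees:
        topology = quartet_topology(leaves, splits, quartet)
        if topology is not None:
            counts[topology] += 1
    return counts


four = set("ABCD")
gene_trees = [
    (four, [set("AB")]),
    (four, [set("AB")]),
    (four, [set("AC")]),
    (set("ABCDE"), [set("AB"), set("DE")]),
    (four, [set("AD")]),
]
counts = tally_quartet(gene_trees, ("A", "B", "C", "D"))
dominant = counts.most_common(1)[0][0]
print(sorted(sorted(pair) for pair in dominant), counts[dominant])
```

Here three of the five gene trees support grouping A with B, so that resolution dominates; summary methods combine such evidence over all quartets to score candidate species trees.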
Computational innovations have accelerated phylogenetic inference through machine learning techniques, including deep learning architectures for sequence alignment and tree optimization, which reduce computational demands on massive datasets. For instance, PhyloTune accelerates updates to existing phylogenies, and related work uses machine learning to guide heuristic tree searches, reducing the time spent optimizing candidate trees. Updated maximum likelihood software such as IQ-TREE 3 supports mixture models and site-heterogeneous evolution, scaling to phylogenomic datasets with millions of sites.
Integration of multi-omics data, such as transcriptomics and proteomics alongside genomics, has expanded phylogenetic resolution by capturing functional evolutionary signals beyond nucleotide sequences alone. This approach reveals correlated evolutionary patterns across omics layers, as seen in studies of plant lineages where combined datasets clarify divergence timings and adaptive traits.
Prospective directions include rapid phylogenetics for outbreak surveillance, exemplified by placement tools that add new viral sequences to existing trees in under a minute to track pathogen evolution during pandemics. AI-driven model selection, such as in ModelRevelator, uses supervised learning to recommend substitution models rapidly, bypassing exhaustive testing for diverse datasets. Efforts to resolve longstanding polytomies in the tree of life leverage recoded amino acid phylogenomics to disentangle deep divergences previously obscured by saturation.
Ethical and technical frontiers emphasize privacy protections in big data human phylogenetics, where aggregated genomic sequences risk re-identification without robust anonymization protocols. Scalable open-source tools, including recent IQ-TREE updates, facilitate accessible inference while addressing horizontal gene transfer through advanced partitioning.

References

  1. [1]
    [PDF] Molecular Phylogenetics - UCL Discovery
    Molecular Phylogenetics. INTRODUCTION. Molecular phylogenetics is the science of using molecular data (DNA and protein sequences) to.
  2. [2]
    Molecular phylogenetics: principles and practice - Nature
    Mar 28, 2012 · Here, we review the major methods of phylogenetic analysis, including parsimony, distance, likelihood and Bayesian methods.
  3. [3]
    Molecular Phylogenetics - Genomes - NCBI Bookshelf - NIH
    Molecular phylogenetics predates DNA sequencing by several decades. It is derived from the traditional method for classifying organisms according to their ...
  4. [4]
    Molecules as documents of evolutionary history - ScienceDirect.com
    Different types of molecules are discussed in relation to their fitness for providing the basis for a molecular phylogeny. ... Zuckerkandl and Pauling, 1962. E.
  5. [5]
    Phylogenomics - PubMed
    Phylogenomics aims at reconstructing the evolutionary histories of organisms taking into account whole genomes or large fractions of genomes.
  6. [6]
    What is Molecular Phylogenetics? - AZoLifeSciences
    Nov 3, 2022 · Molecular phylogenetics is a branch of phylogenetics aimed at tracing evolutionary relationships between organisms at higher taxonomic ranks.
  7. [7]
  8. [8]
    A Phylogenomic Approach Based on PCR Target Enrichment ... - NIH
    Feb 1, 2016 · In this paper, we present a method that utilizes microfluidic PCR and HTS to generate large amounts of sequence data suitable for phylogenetic analyses.
  9. [9]
    Next-Generation Sequencing Technology: Current Trends and ... - NIH
    Next-generation sequencing (NGS) is a powerful tool used in genomics research. NGS can sequence millions of DNA fragments at once.
  10. [10]
    PRIMARY STRUCTURE AND EVOLUTION OF CYTOCHROME C
    PRIMARY STRUCTURE AND EVOLUTION OF CYTOCHROME C. E. MargoliashAuthors Info & Affiliations. October 15, 1963. 50 (4) 672-679.
  11. [11]
    Phylogenetic structure of the prokaryotic domain - PNAS
    A phylogenetic analysis based upon ribosomal RNA sequence characterization reveals that living systems represent one of three aboriginal lines of descent.
  12. [12]
    Biological identifications through DNA barcodes - Journals
    We establish that the mitochondrial gene cytochrome c oxidase I (COI) can serve as the core of a global bioidentification system for animals.
  13. [13]
    Homoplasy in genome-wide analysis of rare amino acid replacements
    We provide a direct estimate of the level of homoplasy caused by parallel changes and reversals among the RGC_CAMs using 462 alignments of orthologous genes ...
  14. [14]
    Incorporating indels as phylogenetic characters: Impact for ...
    Insertion and deletion events (indels) provide a suite of markers with enormous potential for molecular phylogenetics. Using many more indel characters than ...
  15. [15]
    Use of rRNA Secondary Structure in Phylogenetic Studies to Identify ...
    rRNA secondary structures are used to guide decisions about homologous positions in phylogenetic studies, as their conservation exceeds that of nucleotides.
  16. [16]
    [PDF] Molecular Disease, Evolution, and Genic Heterogeneity - Evolocus
    189. Page 2. 190. EMILE ZUCKERKANDL AND LINUS PAULING of biological integration. There they are found to be closely linked, with no sharp borderline between ...
  17. [17]
    Molecules as documents of evolutionary history - PubMed
    Molecules as documents of evolutionary history. J Theor Biol. 1965 Mar;8(2):357-66. doi: 10.1016/0022-5193(65)90083-4. Authors. E Zuckerkandl, L Pauling. PMID ...<|control11|><|separator|>
  18. [18]
    Construction of Phylogenetic Trees - Science
    MARGOLIASH, E, PRIMARY STRUCTURE AND EVOLUTION OF CYTOCHROME C, PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA 50: 672 (1963).
  19. [19]
    Immunological Time Scale for Hominid Evolution - Science
    Our calculations lead to the suggestion that, if man and Old World monkeys last shared a common ancestor 30 million years ago, then man and African apes shared ...
  20. [20]
    Kary B. Mullis – Facts - NobelPrize.org
    In 1985, Kary Mullis invented the process known as polymerase chain reaction (PCR), in which a small amount of DNA can be copied in large quantities over a ...
  21. [21]
    Human Genome Project Fact Sheet
    Jun 13, 2024 · In 2003, the Human Genome Project produced a genome sequence that accounted for over 90% of the human genome. It was as close to complete as ...
  22. [22]
    History of Illumina Sequencing & Solexa Technology
    The first Solexa sequencer, the Genome Analyzer, was launched in 2006 and gave scientists the power to sequence 1 gigabase (Gb) of data in a single run. 2007.
  23. [23]
    Molecular tools for studying HIV transmission in sexual networks
    Phylogenetics is frequently used for studies of population-based HIV transmission. The purpose of this review is to highlight current utilities and ...
  24. [24]
    The evolution of SARS-CoV-2 | Nature Reviews Microbiology
    Apr 5, 2023 · In this Review, we consider the evolution of SARS-CoV-2 at different scales, the phases of the COVID-19 pandemic, factors that drive the ...
  25. [25]
    Harnessing machine learning to guide phylogenetic-tree search ...
    Mar 31, 2021 · Optimizing the set of branch lengths for each candidate tree is computationally intensive, adding another layer of complexity to this endeavor.
  26. [26]
    IS A NEW AND GENERAL THEORY OF MOLECULAR ...
    Jan 14, 2009 · By building on the foundation laid by concepts of gene trees and coalescent theory, and by taking cues from recent trends in multilocus ...
  27. [27]
    Phylogenetic Inference - Stanford Encyclopedia of Philosophy
    Dec 8, 2021 · Phylogenetic principles predict different patterns of monophyly on these competing hypotheses, which can be tested empirically and used by ...Phylogenetic Inference in... · Phylogenetic Inference and...
  28. [28]
    CONFIDENCE LIMITS ON PHYLOGENIES: AN APPROACH USING ...
    It involves resampling points from one's own data, with replacement, to create a series of bootstrap samples of the same size as the original data. Each of ...
  29. [29]
  30. [30]
    Molecular phylogenetic, population genetic and demographic ...
    Oct 6, 2020 · PCR amplification and sequencing. To amplify partial mitochondrial DNA fragments of CO1 and 16S rRNA, and PCR was carried out using the ...Missing: techniques | Show results with:techniques
  31. [31]
    Understanding PCR Processes to Draw Meaningful Conclusions ...
    Aug 20, 2019 · The parameter ai is equivalent to the measure of amplification efficiency, E that is often reported for an individual species in qPCR studies.
  32. [32]
    Using next-generation sequencing approach for discovery and ...
    This review focuses on the latest applications, advances and opportunities of NGS in the discovery and characterization of MM in plants.
  33. [33]
    SPAdes: A New Genome Assembly Algorithm and Its Applications to ...
    We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V−SC ...Missing: control phylogenetics
  34. [34]
    CLUSTAL: a package for performing multiple sequence alignment ...
    CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene. 1988 Dec 15;73(1):237-44. doi: 10.1016/0378-1119(88)90330-7. Authors.Missing: paper | Show results with:paper
  35. [35]
    MUSCLE: multiple sequence alignment with high accuracy ... - NIH
    We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation ...
  36. [36]
    A simple method to control over-alignment in the MAFFT multiple ...
    We present a new feature of the MAFFT multiple alignment program for suppressing over-alignment (aligning unrelated segments).2 Methods · 3 Results · 3.2 Benchmarks
  37. [37]
    Challenges and approaches to predicting RNA with multiple ...
    This review describes an efficient combinatorially complete method and three free energy minimization approaches to predicting RNA structures with more than ...Missing: chimeras | Show results with:chimeras
  38. [38]
    a method for the determination of ncRNA ends using chimeric reads ...
    Mar 12, 2014 · We developed a bioinformatic method, called Vicinal, to precisely map the ends of numerous fruitfly, mouse and human ncRNAs.Missing: Challenges | Show results with:Challenges
  39. [39]
    Using nanopore sequencing to identify fungi from clinical samples ...
    Jun 16, 2023 · A nanopore sequencing technology generates long sequencing reads without a theoretical length limit. DNA metabarcoding with long amplicons could ...
  40. [40]
    Multiple Comparisons of Log-Likelihoods with Applications to ...
    H Shimodaira, M Hasegawa; Multiple Comparisons of Log-Likelihoods with Applications to Phylogenetic Inference, Molecular Biology and Evolution, Volume 16,
  41. [41]
    TESTING SIGNIFICANCE OF INCONGRUENCE - Wiley Online Library
    Cladistics · Volume 10, Issue 3 pp. 315-319 Cladistics. Free Access. TESTING SIGNIFICANCE OF INCONGRUENCE. James S. Farris,. James S. Farris. Naturhistoriska ...
  42. [42]
    Phylogenetic Analysis Based on Spectral Methods - Oxford Academic
    Through simulations, we show that the covariance-based methods effectively capture phylogenetic signal even when structural information is not fully retained.
  43. [43]
    Profiling Phylogenetic Informativeness - Oxford Academic
    Abstract. The resolution of four controversial topics in phylogenetic experimental design hinges upon the informativeness of characters about the historica.
  44. [44]
    Posterior Predictive Bayesian Phylogenetic Model Selection
    Abstract. We present two distinctly different posterior predictive approaches to Bayesian phylogenetic model selection and illustrate these methods using e.
  45. [45]
    DNA Barcoding Will Often Fail to Discover New Animal Species over ...
    DNA barcoding can fail to discover new species if isolation is less than 4 million generations ago, and will bias against discovering young species.
  46. [46]
    Metagenomics for studying unculturable microorganisms: cutting the ...
    Aug 1, 2005 · As they limited the comparative database to cultured microorganisms, it was not surprising that they did not identify any 16S rRNA gene ...
  47. [47]
    Estimation of Divergence Times for Major Lineages of Primate Species
    We estimated that the human lineage diverged from the chimpanzee, gorilla, orangutan, Old World monkey, and New World monkey lineages approximately 6 MYA.
  48. [48]
    Mitochondrial Phylogenetics and Evolution of Mysticete Whales
    This study has three main aims: (1) to determine and document the complete mtDNA sequences of ten Mysticeti species; (2) to clarify the uncertain phylogenetic ...
  49. [49]
    Evidence for a clade of nematodes, arthropods and other moulting ...
    May 29, 1997 · The results suggest that ecdysis (moulting) arose once and support the idea of a new clade, Ecdysozoa, containing moulting animals: arthropods, ...Missing: discovery | Show results with:discovery
  50. [50]
    BEAST: Bayesian evolutionary analysis by sampling trees - PMC
    BEAST is a powerful and flexible evolutionary analysis package for molecular sequence variation. It also provides a resource for the further development of new ...
  51. [51]
    Accurate Model Selection of Relaxed Molecular Clocks in Bayesian ...
    Oct 22, 2012 · For more detailed information on these methods and the way they are implemented in BEAST (Drummond et al. 2012), we refer interested readers to ...
  52. [52]
    Likelihood ratio tests for detecting positive selection and application ...
    The tests were applied to the lysozyme genes of 24 primate species. The dN/dS ratios were found to differ significantly among lineages.
  53. [53]
    An overview of STRUCTURE: applications, parameter settings, and ...
    STRUCTURE both identifies populations from the data and assigns individuals to that population representing the best fit for the variation patterns found.Missing: phylogenetics | Show results with:phylogenetics
  54. [54]
    The Mitochondrial DNA Bridge Between Population Genetics and ...
    Nov 1, 1987 · INTRASPECIFIC PHYLOGEOGRAPHY: The Mitochondrial DNA Bridge Between Population Genetics and Systematics. J C Avise, J Arnold, R M Ball, ...
  55. [55]
    The evolution of HIV: Inferences using phylogenetics - PMC
    Phylogenies allow researchers to determine patterns of the extensive genetic diversity of HIV to examine human-scale ecological and epidemiological processes.Missing: seminal | Show results with:seminal
  56. [56]
    Conservation of biodiversity in the genomics era
    Sep 11, 2018 · “Conservation genomics” encompasses the idea that genome-scale data will improve the capacity of resource managers to protect species.
  57. [57]
    Assessing the utility of DNA barcoding in wildlife forensic cases ...
    DNA barcoding has been identified as a rapid and practical molecular tool that can be used to identify species due to species-specific variation in short ...
  58. [58]
    Tracing the origins of SARS-COV-2 in coronavirus phylogenies - NIH
    The etiological agent of COVID-19 was rapidly identified at the beginning of the pandemic, and by January 26, 2020, 10 viral genomes had been sequenced (Lu et ...
  59. [59]
    Advances in Genomics Approaches Shed Light on Crop ...
    Jul 30, 2021 · Crop domestication is a major agricultural advance ensuring food security for human society. Domestication is the result of phenotypic and ...Missing: phylogenomics | Show results with:phylogenomics
  60. [60]
    Protozoan genomics for drug discovery - PMC - NIH
    Comparing the metabolic pathways of parasites and their hosts facilitates the identification of new drug targets.
  61. [61]
    Horizontal Gene Transfer and the History of Life - PMC - NIH
    Horizontal gene transfer is a major evolutionary force that constantly reshapes microbial genomes. Emerging phylogenetic methods use information about ...
  62. [62]
    Extensive Genome-Wide Phylogenetic Discordance Is Due to ...
    Mar 3, 2021 · It is well known that identifying the genomic footprints of gene flow is difficult in the presence of incomplete lineage sorting (ILS) among the ...
  63. [63]
    Heterotachy and long-branch attraction in phylogenetics
    Oct 6, 2005 · Probabilistic methods have progressively supplanted the Maximum Parsimony (MP) method for inferring phylogenetic trees.
  64. [64]
    Robustness of Phylogenetic Inference to Model Misspecification ...
    May 27, 2021 · We present a simulation study demonstrating that accuracy increases with alignment size even if the additional sites are epistatically coupled.
  65. [65]
    Excluding Loci With Substitution Saturation Improves Inferences ...
    This phenomenon, known as substitution saturation, is recognized as one of the primary obstacles to deep-time phylogenetic inference using genome-scale data ...
  66. [66]
    The effect of alignment uncertainty, substitution models and priors in ...
    Nov 6, 2019 · Alignment uncertainty can be more influential on phylogenetic tree estimation than the specific tree reconstruction methods used [7,8,9,10], and ...
  67. [67]
    Evaluating the Effects of Non-Neutral Molecular Markers on ...
    If they are not, and selection does negatively affect phylogenetic inference (either by violating assumptions of the models or representing convergent evolution) ...
  68. [68]
    Distinguishing Between Convergent Evolution and Violation of ... - NIH
    An alternative to the molecular clock assumption is a non-clock-like tree. Removing the molecular clock approximately doubles the number of parameters for large ...
  69. [69]
    Taxon Sampling, Bioinformatics, and Phylogenomics - PMC - NIH
    Taxon sampling is often thought to be of extreme importance for phylogenetic inference, and increased sampling of taxa is commonly advocated as a solution ...
  70. [70]
    Phylogenomics — principles, opportunities and pitfalls of big‐data ...
    Dec 16, 2019 · Phylogenetics is the science of reconstructing the evolutionary history of life on Earth. Traditionally, phylogenies were constructed using morphological data ...Taxon Sampling: A Crucial... · Target Enrichment Sequencing · I Have A Phylogeny! What Now...<|control11|><|separator|>
  71. [71]
    Phylogenomics articles within Nature Reviews Genetics
    This Review discusses the biological and analytical factors that lead to incongruence, methodological advances to identify and resolve incongruence, and avenues ...
  72. [72]
    ASTRAL: genome-scale coalescent-based species tree estimation
    We present ASTRAL, a fast method for estimating species trees from multiple genes. ASTRAL is statistically consistent, can run on datasets with thousands of ...
  73. [73]
    ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy
    Sep 4, 2020 · We first propose a measure of quartet similarity between single-copy and multicopy trees that accounts for orthology and paralogy.
  74. [74]
    Phylogenetic Methods Meet Deep Learning - Oxford Academic
    Sep 19, 2025 · Abstract. Deep learning (DL) has been widely used in various scientific fields, but its integration into phylogenetics has been slower, ...
  75. [75]
    PhyloTune: An efficient method to accelerate phylogenetic updates ...
    Jul 26, 2025 · For nearly two decades, molecular phylogenies have primarily relied on data from a few genes obtained through PCR amplification and Sanger ...
  76. [76]
    [PDF] IQ-TREE 3: Phylogenomic Inference Software using ... - EcoEvoRxiv
    Mar 3, 2025 · IQ-TREE 3 significantly extends version 2 with new features, including mixture models as an alternative to partitioned models, gene and site.
  77. [77]
    Progress in phylogenetics, multi-omics and flower coloration studies ...
    Feb 2, 2024 · We describe advances in phylogenetic reconstruction and genome sequencing of Rhododendron species. The metabolic pathways of flower color are ...
  78. [78]
    Multi-omics analysis reveals the evolution, function, and regulatory ...
    Dec 20, 2024 · The phylogenetic relationship, (α-SPF, (σ-SPF, (β-SPF, γ-SPF))), suggests that α-SPF represents the most ancestral form, followed by σ-SPF.
  79. [79]
  80. [80]
    ModelRevelator: Fast phylogenetic model estimation via deep learning
    We introduce ModelRevelator, a machine learning-based approach to model selection for phylogenetic inference.
  81. [81]
    Resolving tricky nodes in the tree of life through amino acid recoding
    Nov 15, 2022 · Abstract. Genomic data allowed a detailed resolution of the Tree of Life, but "tricky nodes" such as the root of the animals remain unresolved.
  82. [82]
    Ethical Considerations in Global HIV Phylogenetic Research - PMC
    The choice of meta-data variables used in phylogenetic analysis is an important ethical decision. Phylogenetic analyses are often based upon individual-level ...
  83. [83]
    IQ-TREE: Efficient phylogenomic software by maximum likelihood
    Efficient software for phylogenomic inference. Latest release 3.0.1 (May 5, 2025); legacy release 2.4.0 (February 7, 2025).