
Computational phylogenetics

Computational phylogenetics is an interdisciplinary field that applies computational algorithms, statistical models, and optimization techniques to infer evolutionary relationships among organisms, genes, or other taxa, typically represented as phylogenetic trees or networks based on molecular sequence data such as DNA, RNA, or proteins. This approach addresses the challenges of reconstructing evolutionary histories from complex, noisy data by developing methods that estimate tree topologies, branch lengths, and ancestral states while accounting for processes like substitution rates, horizontal gene transfer, and incomplete lineage sorting. The field emerged in the mid-20th century alongside advances in molecular biology, with early computational methods building on serological and protein sequence comparisons in the 1960s and 1970s, evolving into more sophisticated algorithms by the 1980s, such as Joseph Felsenstein's computationally feasible likelihood calculation for phylogenetic inference. Key methods include distance-based approaches like neighbor-joining, which construct trees from pairwise evolutionary distances; character-based techniques such as maximum parsimony, which seek trees minimizing evolutionary changes; maximum likelihood, which optimizes tree parameters to maximize the probability of observed data under evolutionary models; and Bayesian inference, which incorporates prior probabilities to sample posterior distributions of trees. These methods often rely on models like the Jukes-Cantor or general time-reversible substitution models to describe sequence evolution, ensuring statistical consistency where estimation errors decrease with larger datasets. Beyond biology, computational phylogenetics extends to historical linguistics for reconstructing language trees from lexical data and to other domains such as textual criticism for tracing manuscript descent. Its applications span drug discovery by identifying conserved targets across species, conservation biology through assessments of genetic diversity and extinction risks, and epidemiology via outbreak tracking and vaccine design against evolving viruses.
Challenges persist in handling massive datasets from phylogenomics, modeling reticulate evolution via networks, and achieving computational scalability, driving ongoing innovations in high-performance algorithms and machine learning integrations.

Introduction

Definition and Scope

Computational phylogenetics is the application of computational algorithms, statistical models, and optimization techniques to infer evolutionary relationships among organisms, genes, or other taxa, typically represented as phylogenetic trees or networks that hypothesize their historical divergences. This field integrates principles from computer science, statistics, and evolutionary biology to analyze biological data and reconstruct phylogenies, which serve as testable hypotheses about evolutionary relatedness rather than definitive truths. Unlike traditional phylogenetics, which often relied on manual interpretation of morphological traits, computational phylogenetics emphasizes scalable methods to handle large datasets, incorporating probabilistic frameworks for assessing uncertainty and robustness. The scope of computational phylogenetics encompasses diverse data sources, including molecular sequences (such as DNA and protein), morphological characters, and fossil records, enabling the reconstruction of evolutionary histories across scales from microbial pathogens to entire biodiversity. It finds applications in systematics for classifying taxa based on shared ancestry, epidemiology for tracing pathogen transmission and evolution, and conservation biology for prioritizing genetic diversity and identifying evolutionarily significant units. These efforts contribute to broader understandings of biodiversity patterns and the dynamics of disease spread, informing strategies to mitigate extinction risks and control outbreaks. A prominent example is the rapid reconstruction of phylogenies during the COVID-19 pandemic, where computational tools analyzed thousands of viral genomes to map global transmission routes and variant emergence, demonstrating the field's capacity for real-time evolutionary surveillance.

Historical Development

The field of computational phylogenetics emerged in the mid-20th century as researchers sought quantitative methods to infer evolutionary relationships, transitioning from manual taxonomic practices to algorithmic approaches. In the 1950s and 1960s, early distance-based methods laid foundational groundwork; notably, the unweighted pair group method with arithmetic mean (UPGMA) was introduced by Robert R. Sokal and Charles D. Michener in 1958 for clustering phenetic data, gaining computational traction in the 1960s with the advent of accessible computers for numerical taxonomy. Concurrently, Anthony W. F. Edwards and Luigi L. Cavalli-Sforza proposed a minimum evolution approach in 1964 to reconstruct evolutionary trees by minimizing the total branch length, marking an initial shift toward optimization criteria in phylogenetic inference. These developments in the 1960s and 1970s emphasized parsimony and minimum-evolution principles, enabling the analysis of morphological and early genetic data through programs that automated tree construction. The 1980s and 1990s witnessed a surge in molecular data availability, driven by advances in DNA sequencing, which propelled computational phylogenetics into a more data-intensive era. Joseph Felsenstein's 1981 paper formalized maximum likelihood estimation for phylogenetic trees from DNA sequences, providing a statistical framework that accounted for evolutionary models and became a cornerstone for rigorous inference. In 1987, Naruya Saitou and Masatoshi Nei developed the neighbor-joining algorithm, an efficient distance-matrix method that relaxed the molecular clock assumption of UPGMA and facilitated rapid tree reconstruction from large datasets. Software tools proliferated during this period, with Felsenstein's PHYLIP package, first released in 1980 and iteratively updated through the 1990s, offering a comprehensive suite for distance, parsimony, and likelihood analyses, widely adopted by the research community. From the 2000s onward, computational phylogenetics integrated probabilistic frameworks and high-throughput technologies, addressing complexities in genomic-scale data.
Bayesian methods gained prominence with the 2001 release of MrBayes by John P. Huelsenbeck and Fredrik Ronquist, which employed Markov chain Monte Carlo (MCMC) sampling to estimate posterior probabilities of trees under complex models, enhancing uncertainty quantification in inference. The advent of next-generation sequencing (NGS) around 2005 revolutionized the field by enabling phylogenomics, where whole-genome data facilitated species tree estimation amid gene tree discordance, as reviewed in applications spanning multilocus datasets. Post-2020, shifts toward machine learning and artificial intelligence have accelerated progress; deep learning techniques, for instance, now aid in model selection and tree reconstruction from massive alignments, improving scalability and accuracy in phylogenomic analyses.

Phylogenetic Data and Homology

Coding Characters

In computational phylogenetics, character coding involves transforming observable biological traits into discrete or quantitative units suitable for algorithmic analysis, ensuring that these units reflect potential evolutionary homology rather than superficial similarities due to convergence or parallelism (analogies). Characters are defined as features that vary among taxa, with states representing specific forms or conditions of those features, and the process prioritizes hypotheses of shared ancestry over functional or positional resemblances alone. This coding establishes the foundational data matrix for methods like parsimony or likelihood, where accurate homology assessment is critical to avoid conflating independent evolutionary events. Character coding schemes vary by data type and analytical goals. Binary coding assigns presence (1) or absence (0) to a trait, commonly used for morphological features to simplify analysis and test for shared derivations. Multistate coding extends this to multiple states, either unordered (where any state change is equally likely) or ordered (reflecting gradual transformations, such as increasing complexity), allowing representation of qualitative variations like color patterns or segment numbers. Continuous characters, such as measurements of size or shape, are often discretized to avoid unwarranted assumptions about the tempo of evolutionary change; common methods include gap coding, where intervals are defined based on gaps in the observed variation to create multistate categories, ensuring states are phylogenetically informative without arbitrary thresholds. For sequence alignments, gap coding treats insertions/deletions as additional states, coding gaps as distinct characters to capture positional homology. Characters may be weighted or unweighted depending on perceived reliability or evidential strength; unweighted schemes treat all characters equally to minimize subjective bias, while weighted approaches assign higher values to characters with greater perceived evolutionary significance, such as those less prone to homoplasy.
Weighting schemes, like successive approximations, iteratively adjust character weights based on fit indices (such as the consistency index) during tree search, but they require justification to prevent overemphasis on convergent traits. Key challenges in character coding include distinguishing informative characters—those that vary across taxa and potentially support monophyletic groups—from uninformative ones, such as autapomorphies unique to single taxa that contribute no resolution. Informative characters must exhibit variation that aligns with homology criteria, like position, structure, and continuity from common ancestors, while excluding analogies that arise from similar selective pressures. Circularity poses a significant risk, where coding decisions implicitly assume phylogenetic relationships (e.g., grouping similar states based on preconceived clades), leading to biased matrices that reinforce initial hypotheses rather than testing them objectively. To mitigate this, coding should precede tree inference and rely on explicit, repeatable criteria independent of any assumed phylogeny. A representative example is morphological traits in vertebrate phylogenetics, such as the presence or absence of mammary glands. In analyses of amniote evolution, this might code mammary glands as absent (0) in reptiles and present (1) in mammals, hypothesizing them as a synapomorphy for Mammalia, while multistate coding could incorporate variations in gland structure or function across taxa to test finer hypotheses of homology. Such coding draws from ontogenetic and positional evidence to avoid analogical pitfalls, like mistaking independently evolved reproductive traits in unrelated lineages.
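As a minimal sketch of the coding concepts above, the snippet below builds a small hypothetical character matrix (the taxa and states are illustrative, not a published dataset) and screens each column for parsimony informativeness — a character is informative only when at least two states each occur in at least two taxa, so autapomorphies are flagged as uninformative:

```python
from collections import Counter

# Rows: taxa; columns: discrete character states (all values hypothetical).
# char 0: mammary glands (0 = absent, 1 = present)
# char 1: hair (0 = absent, 1 = present)
# char 2: an autapomorphy unique to one taxon
matrix = {
    "lizard":    [0, 0, 0],
    "crocodile": [0, 0, 0],
    "mouse":     [1, 1, 0],
    "human":     [1, 1, 1],
}

def is_parsimony_informative(states):
    """A character is parsimony-informative when at least two states
    each occur in at least two taxa, so it can favor one grouping."""
    counts = Counter(states)
    return sum(1 for c in counts.values() if c >= 2) >= 2

columns = list(zip(*matrix.values()))
flags = [is_parsimony_informative(col) for col in columns]
print(flags)  # the autapomorphic third character is uninformative
```

Such a screen is routinely applied before parsimony analysis, since uninformative characters add no resolving power to the tree search.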

Molecular, Morphological, and Genomic Data

Molecular data in computational phylogenetics primarily consists of DNA, RNA, and protein sequences derived from organisms, which provide quantifiable characters for inferring evolutionary relationships. DNA sequences, often from mitochondrial or nuclear genes, offer high resolution due to their abundance of variable sites, while RNA sequences, such as ribosomal RNA (rRNA), are valued for their conservation across taxa, enabling comparisons among distantly related species. Protein sequences, translated from coding regions, capture functional constraints and can reveal evolutionary signals at the amino acid level, and retain phylogenetic signal longer than nucleotide data for deep divergences due to slower evolutionary rates and reduced saturation. Preparation of molecular data begins with multiple sequence alignment to identify homologous positions, a critical step that influences downstream phylogenetic analyses. Widely adopted tools include Clustal Omega, which uses progressive alignment based on a guide tree to handle DNA and protein sequences efficiently, and MAFFT, known for its speed and accuracy in aligning large datasets through fast Fourier transform-based refinement. Error correction is essential, particularly for raw sequencing reads, involving quality filtering and trimming to mitigate artifacts like base-calling errors. Multigene approaches concatenate sequences from multiple loci to increase phylogenetic signal, whereas whole-genome data encompass broader coverage but require handling extensive variation. Morphological data involve discrete traits observed in extant organisms or preserved in fossils, such as skeletal structures, shapes, or anatomical features, coded as binary or multistate characters for phylogenetic reconstruction. These data are particularly useful for incorporating fossil evidence, bridging gaps in the fossil record where molecular data are unavailable.
However, morphological datasets typically feature fewer characters—often hundreds compared to thousands in molecular sets—leading to challenges like higher homoplasy rates and sensitivity to subjective character selection. Genomic data in phylogenomics leverage next-generation sequencing (NGS) technologies, introduced post-2005, to generate datasets from thousands of loci across the genome, dramatically expanding the scale of phylogenetic inference. NGS enables rapid, cost-effective production of massive sequence volumes, facilitating whole-genome comparisons and resolving contentious relationships previously limited by single-gene studies. Handling such data involves strategies like concatenation, which combines loci into a supermatrix assuming a single species tree, or multispecies coalescent models, which account for gene tree discordance due to incomplete lineage sorting by integrating individual locus histories. Preparation emphasizes robust alignment of orthologous regions and error mitigation through de novo or reference-based assembly to ensure accurate inference amid high data complexity.

Defining Homology

In computational phylogenetics, homology refers to traits or character states shared among taxa due to common ancestry, distinguishing them from analogous features arising independently through convergent evolution. Determining homology is foundational, as only homologous characters provide reliable signals for reconstructing evolutionary relationships. Traditional criteria for assessing homology emphasize ontogenetic similarity (shared developmental pathways), positional similarity (corresponding topological locations in the organism), and structural similarity (comparable composition and configuration). These criteria, originally proposed by Remane, serve as initial hypotheses of primary homology based on observable resemblances in structure or development. To rigorously test these hypotheses, Colin Patterson formalized three criteria—similarity, conjunction, and congruence—that apply across biological data types, including molecular sequences. The similarity test evaluates whether features match in detail beyond superficial resemblance, such as sequence identity or structural folds; the conjunction test checks for non-overlapping correspondences to avoid pseudohomologies from composite traits; and the congruence test assesses whether the hypothesized homology fits consistently within a phylogenetic tree derived from independent characters, rejecting it if it implies excessive homoplasy. Primary homology statements, posited a priori via similarity-based criteria like Remane's, are thus refined into secondary homologies through congruence testing during phylogenetic analysis. Computational methods operationalize these criteria for large-scale data. For nucleotide or protein sequences, reciprocal best hits (RBH) identify putative orthologs by requiring mutual top-scoring alignments between query and subject databases, approximating orthology under Patterson's similarity and conjunction tests; this approach underpins tools like OrthoMCL for ortholog inference.
BLAST (Basic Local Alignment Search Tool) further supports sequence homology detection by computing statistically significant alignments based on expectation (E) values, enabling initial similarity assessments across genomes. For protein structures, where sequence divergence may obscure homology, methods like DALI perform pairwise structural alignments to quantify three-dimensional similarity via distance matrices, revealing conserved folds as evidence of deep homology. Recent advances, such as structure prediction with AlphaFold (as of 2024), enhance homology detection by enabling structural comparisons for distantly related sequences. Challenges persist, particularly in morphological data where secondary homology—confirmed post-phylogenetic analysis—can conflict with primary hypotheses based on overt similarity, leading to debates over interpretation. In such data, positional homology criteria risk misattribution when conserved positions arise convergently, as ontogenetic pathways may diverge despite topological conservation, complicating automated assessments. For instance, alignments might initially flag homology for paralogous genes, but congruence with an independent tree topology (e.g., from multi-locus data) can validate or refute it by checking fit, ensuring only states shared through common ancestry are retained.
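The reciprocal-best-hit criterion described above can be sketched in a few lines. In this minimal illustration, the gene names and similarity scores are hypothetical stand-ins for the bit scores a real BLAST run would produce; the logic simply keeps pairs whose top hits point at each other:

```python
# Hypothetical best-hit tables: gene -> (top-scoring hit, score).
genome_a_vs_b = {  # best hit of each genome-A gene against genome B
    "a1": ("b1", 250.0),
    "a2": ("b3", 180.0),
    "a3": ("b2", 90.0),
}
genome_b_vs_a = {  # best hit of each genome-B gene against genome A
    "b1": ("a1", 245.0),
    "b2": ("a1", 88.0),   # b2's best hit is a1, not a3: no reciprocity
    "b3": ("a2", 175.0),
}

def reciprocal_best_hits(ab, ba):
    """Keep pairs (a, b) where a's best hit is b AND b's best hit is a."""
    pairs = []
    for a, (b, _) in ab.items():
        if b in ba and ba[b][0] == a:
            pairs.append((a, b))
    return sorted(pairs)

print(reciprocal_best_hits(genome_a_vs_b, genome_b_vs_a))
# a1-b1 and a2-b3 pass the reciprocity test; a3-b2 fails
```

The asymmetric a3/b2 pair illustrates how RBH filters out likely paralog matches that one-directional best hits would accept.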

Representations of Evolutionary Relationships

Phylogenetic Trees

Phylogenetic trees are branching diagrams that represent the evolutionary relationships among a set of taxa, illustrating patterns of descent from common ancestors. These trees model hierarchical relationships under the assumption of bifurcating or multifurcating divergence events, where branches symbolize lineages and nodes indicate divergence points. Internal nodes typically represent hypothetical ancestral populations, while terminal nodes (leaves) correspond to observed taxa, such as species or gene sequences. Phylogenetic trees can be classified by their structure and scaling. Bifurcating trees, also known as binary trees, feature internal nodes with exactly two descendant branches, maximizing resolution of relationships and assuming dichotomous divergence. In contrast, polytomous (or multifurcating) trees have internal nodes with three or more branches, indicating unresolved polytomies where multiple lineages diverge simultaneously or data insufficiently resolve the branching order. Trees are further distinguished as rooted or unrooted: rooted trees include a designated root representing the common ancestor, often determined using an outgroup to polarize the direction of evolution from past to present; unrooted trees lack a root, focusing solely on relative branching patterns without implying ancestry. Regarding scaling, cladograms depict only topological relationships without branch lengths, emphasizing grouping patterns, whereas phylograms incorporate branch lengths proportional to evolutionary change, such as substitutions per site or time. Key properties of phylogenetic trees include concepts from cladistics that define group integrity. A monophyletic group, or clade, comprises an ancestor and all its descendants, forming a complete subtree on the tree. Paraphyletic groups include an ancestor but exclude some descendants, resulting in an incomplete grouping, such as reptiles excluding birds despite shared ancestry. Polyphyletic groups aggregate taxa from multiple ancestors without including the common forebears, often arising from convergent evolution. Trees are commonly represented in computational formats for analysis and storage.
The Newick format, a parenthesis-based notation, encodes tree topology and optional branch lengths, enabling machine-readable interchange; for example, a simple tree with three taxa A, B, and C might be written as (A:0.1,(B:0.2,C:0.4):0.3);. Regarding distance properties, additive trees satisfy the four-point condition, under which path distances between leaves sum correctly along unique paths, allowing reconstruction from distance matrices without assuming equal rates. Ultrametric trees, a subset of additive trees, impose a molecular clock by requiring all leaves to be equidistant from the root, aligning tips at the same temporal horizon. A representative example is a simplified phylogenetic tree of primate evolution, focusing on great apes. In this rooted, bifurcating phylogram, the root represents the common ancestor of orangutans (Pongo), gorillas (Gorilla), chimpanzees (Pan), and humans (Homo). Branching first separates orangutans, followed by a split between gorillas and the lineage leading to chimpanzees and humans, with branch lengths reflecting approximate genetic divergence times: orangutan split 14.0–15.5 million years ago, gorilla split 7.6–9.3 million years ago, and human-chimpanzee split 5.5–6.3 million years ago (as of 2025 estimates). This tree illustrates monophyly within Homininae, with gorillas, chimpanzees, and humans sharing a recent common ancestor.
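The Newick example in the text can be handled with a tiny recursive-descent parser. The sketch below (a minimal illustration supporting only labels and branch lengths, not comments or quoting) parses the string and computes root-to-leaf path lengths; an ultrametric tree would yield equal depths for all leaves, which this additive example does not:

```python
def parse_newick(s):
    """Parse a simple Newick string into (children, label, length) tuples."""
    pos = 0

    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1
            children.append(node())
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # consume ')'
        start = pos
        while s[pos] not in ":,();":  # optional node label
            pos += 1
        label = s[start:pos]
        length = 0.0
        if s[pos] == ":":             # optional branch length
            pos += 1
            start = pos
            while s[pos] not in ",();":
                pos += 1
            length = float(s[start:pos])
        return (children, label, length)

    return node()

def leaf_depths(tree, depth=0.0):
    """Accumulate branch lengths from the root down to each leaf."""
    children, label, length = tree
    depth += length
    if not children:
        return {label: depth}
    out = {}
    for child in children:
        out.update(leaf_depths(child, depth))
    return out

tree = parse_newick("(A:0.1,(B:0.2,C:0.4):0.3);")
depths = leaf_depths(tree)
print({k: round(v, 6) for k, v in depths.items()})  # {'A': 0.1, 'B': 0.5, 'C': 0.7}
```

Because the three leaves sit at unequal depths, the example tree is additive but not ultrametric.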

Phylogenetic Networks

Phylogenetic networks extend the representational power of phylogenetic trees by accommodating reticulate evolution, such as hybridization, horizontal gene transfer, and recombination, which introduce conflicts in hierarchical relationships among taxa. Unlike trees, which assume bifurcating descent, networks allow for non-tree-like structures to visualize and model evolutionary incompatibilities or explicit reticulation events in datasets. Trees represent special cases of networks without reticulations, providing a baseline for comparison when evolution is strictly vertical. Splits networks are a combinatorial generalization of phylogenetic trees that depict incompatibilities in distance or character data through parallel edges emanating from nodes, where nodes do not necessarily correspond to ancestral taxa but rather highlight conflicting signal partitions. These implicit networks represent evolutionary history without specifying explicit ancestry, focusing instead on splits—bipartitions of the taxon set induced by the data. In contrast, hybridization networks, a type of reticulate network, explicitly model evolutionary histories with directed edges denoting descent or reticulate events like hybridization; they reconcile observed data, such as sets of gene trees, by incorporating reticulation nodes—vertices with in-degree greater than one, indicating multiple parental contributions from ancestors. Key methods for constructing phylogenetic networks include split decomposition, which decomposes a distance matrix into a sum of elementary splits to reveal compatible and conflicting components, thereby generating a splits network that quantifies data noise or reticulation. Neighbor-Net builds planar splits networks agglomeratively from distance matrices, extending the neighbor-joining algorithm to produce circular orderings of taxa and weighted splits that approximate additive distances while highlighting reticulate signals.
Software such as SplitsTree implements these methods, enabling visualization and analysis of networks from sequences, distances, or trees, with support for both implicit splits and explicit reticulate models. Phylogenetic networks find applications in modeling horizontal gene transfer (HGT), where reticulate edges capture gene acquisitions across lineages, and recombination, which generates mosaic genomes visualized as overlapping splits or reticulation nodes in sequence data. They also facilitate reconciling multiple trees into a single network, inferring shared reticulate histories that explain topological discordance without assuming a single tree. For instance, in bacterial phylogenomics, a splits network constructed from concatenated protein sequences of alpha-proteobacteria reveals reticulate patterns indicative of HGT events, such as gene transfers shaping mitochondrial ancestors, where conflicting splits highlight exchanges missed by tree-based analyses.

Distance-Matrix Methods

UPGMA and WPGMA

The Unweighted Pair Group Method with Arithmetic Mean (UPGMA) is a hierarchical clustering algorithm used in computational phylogenetics to construct rooted phylogenetic trees from a distance matrix, assuming a constant rate of evolution across lineages, known as the molecular clock hypothesis. This method produces ultrametric trees, where all leaves are equidistant from the root, reflecting equal evolutionary time from a common ancestor. Introduced for taxonomic classification, UPGMA has been widely applied in phylogenetics for its simplicity and efficiency in handling stepwise clustering of taxa based on average pairwise distances. The algorithm proceeds as follows: begin with a distance matrix in which each taxon forms its own cluster; identify the pair of clusters with the smallest average distance; merge them into a new cluster; and update the matrix by computing the average distance from the new cluster to all remaining clusters, weighted by the number of taxa in each subcluster. This process repeats until all taxa are joined in a single cluster, yielding a rooted tree (dendrogram) that represents the hierarchical relationships. The key distance update formula, which ensures that the average linkage treats all original taxa equally, is given by d(U, V) = \frac{n_X \cdot d(X, V) + n_Y \cdot d(Y, V)}{n_X + n_Y}, where U is the new cluster formed by merging subclusters X and Y, V is any other cluster, and n_X and n_Y are the number of taxa in X and Y, respectively. The Weighted Pair Group Method with Arithmetic Mean (WPGMA) is a variant of UPGMA that treats merged clusters equally regardless of their sizes. Developed as part of numerical taxonomy frameworks, WPGMA follows the same stepwise clustering procedure but updates distances using a simple arithmetic mean between the subclusters, without weighting by taxon count. The distance update formula is d(U, V) = \frac{d(X, V) + d(Y, V)}{2}, which gives equal weight to the distances from each subcluster, making it suitable when cluster size differences should not bias the averaging.
Both methods are limited by their reliance on the molecular clock assumption, leading to inaccurate topologies when evolutionary rates vary substantially across lineages, as rate heterogeneity can cause long branches to cluster incorrectly. UPGMA, in particular, is sensitive to such violations because it enforces ultrametricity, potentially producing misleading trees for datasets with uneven evolutionary tempos.
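The stepwise procedure and the weighted update formula above can be sketched directly. This is a minimal illustration on a small hypothetical distance matrix (not an optimized implementation); merge heights are half the merging distance, giving the ultrametric property:

```python
def upgma(labels, d):
    """UPGMA on a dict d[frozenset({x, y})] of pairwise distances.
    Returns (nested-tuple tree, root height)."""
    clusters = {name: (name, 0.0) for name in labels}  # name -> (tree, height)
    sizes = {name: 1 for name in labels}
    dist = dict(d)
    while len(clusters) > 1:
        # pick the pair of clusters with the smallest average distance
        x, y = min(
            ((a, b) for a in clusters for b in clusters if a < b),
            key=lambda pair: dist[frozenset(pair)],
        )
        dxy = dist[frozenset((x, y))]
        new = x + "+" + y
        height = dxy / 2.0  # ultrametric merge height
        # weighted update: d(U,V) = (nX*d(X,V) + nY*d(Y,V)) / (nX + nY)
        for v in clusters:
            if v not in (x, y):
                dist[frozenset((new, v))] = (
                    sizes[x] * dist[frozenset((x, v))]
                    + sizes[y] * dist[frozenset((y, v))]
                ) / (sizes[x] + sizes[y])
        sizes[new] = sizes[x] + sizes[y]
        clusters[new] = ((clusters[x][0], clusters[y][0]), height)
        for z in (x, y):
            del clusters[z], sizes[z]
    ((tree, height),) = clusters.values()
    return tree, height

labels = ["A", "B", "C", "D"]
d = {
    frozenset(("A", "B")): 2.0, frozenset(("A", "C")): 6.0,
    frozenset(("A", "D")): 6.0, frozenset(("B", "C")): 6.0,
    frozenset(("B", "D")): 6.0, frozenset(("C", "D")): 4.0,
}
tree, height = upgma(labels, d)
print(tree, height)  # (('A', 'B'), ('C', 'D')) 3.0
```

Swapping the weighted update line for the plain mean `(dist[...x...] + dist[...y...]) / 2` converts this sketch into WPGMA.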

Neighbor-Joining

The neighbor-joining (NJ) algorithm is a distance-based method for constructing unrooted phylogenetic trees from a matrix of pairwise evolutionary distances between operational taxonomic units (OTUs), such as species or sequences, without assuming a constant rate of evolution (molecular clock). Introduced by Saitou and Nei in 1987, it iteratively identifies and joins pairs of OTUs that minimize the estimated total branch length of the tree, effectively reducing a star phylogeny (where all OTUs connect to a central node) to a fully resolved unrooted tree. This approach corrects for unequal evolutionary rates among lineages by adjusting distances to account for the overall divergence pattern across the dataset. The algorithm proceeds in iterative steps. First, for a set of n OTUs, an n × n matrix Q of adjusted distances is computed, where each entry penalizes pairs with high net divergence relative to the rest of the taxa: Q_{i,j} = (n-2) d_{i,j} - \sum_{k=1}^{n} d_{i,k} - \sum_{k=1}^{n} d_{j,k} Here, d_{i,j} is the observed distance between OTUs i and j. The pair (i, j) with the minimum Q_{i,j} is selected as neighbors and joined into a new composite node u. The distance matrix is then updated by removing i and j, adding u, and recalculating distances from u to remaining OTUs m as d_{u,m} = \frac{1}{2} (d_{i,m} + d_{j,m} - d_{i,j}), with the Q-matrix recomputed for the reduced set. This process repeats until only two nodes remain, yielding the tree topology. Branch lengths are estimated using least-squares criteria to fit the observed distances. For the joined nodes i and j forming u, the lengths of the branches from u to i and u to j are: v_{u,i} = \frac{1}{2} \left( d_{i,j} + \frac{1}{n-2} \left( \sum_{k \neq i,j} d_{i,k} - \sum_{k \neq i,j} d_{j,k} \right) \right) v_{u,j} = d_{i,j} - v_{u,i} Similar formulas apply for distances from u to other nodes. These estimates ensure the tree minimizes deviations from the input distances under an additive model.
NJ offers key advantages over clock-assuming methods like UPGMA, as it robustly handles cases of unequal evolutionary rates, producing more accurate topologies in simulations with rate variation. Its time complexity of O(n^3) makes it computationally efficient for large datasets, enabling rapid tree reconstruction even for hundreds of taxa. Computer simulations in the original work demonstrated that NJ recovered correct topologies in over 90% of cases with moderate rate heterogeneity, outperforming several contemporary methods. For example, Saitou (1991) applied NJ to distance matrices derived from mitochondrial DNA sequences of hominoids (humans, chimpanzees, gorillas, and orangutans), reconstructing a phylogeny that clustered humans with chimpanzees, supporting their close relationship with branch lengths reflecting sequence divergence estimates of about 5-7% between these species.
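One NJ iteration can be sketched numerically. The snippet below uses a hypothetical 4-taxon distance matrix, computes the Q criterion in its standard Studier-Keppler form (full row sums), selects the pair to join, and applies the branch-length formulas; with full row sums the term \sum_k d_{i,k} - \sum_k d_{j,k} equals the restricted-sum difference, so the branch lengths match the formulas in the text:

```python
import itertools

taxa = ["A", "B", "C", "D"]
D = {  # hypothetical pairwise distances
    ("A", "B"): 5.0, ("A", "C"): 9.0, ("A", "D"): 9.0,
    ("B", "C"): 10.0, ("B", "D"): 10.0, ("C", "D"): 8.0,
}

def d(i, j):
    """Symmetric lookup with d(i, i) = 0."""
    return 0.0 if i == j else D.get((i, j), D.get((j, i)))

n = len(taxa)
row = {i: sum(d(i, k) for k in taxa) for i in taxa}  # row sums R_i

# Q_{i,j} = (n-2) d_{i,j} - R_i - R_j
Q = {(i, j): (n - 2) * d(i, j) - row[i] - row[j]
     for i, j in itertools.combinations(taxa, 2)}
pair = min(Q, key=Q.get)  # neighbors to join
i, j = pair

# branch lengths from the new node u to i and j
v_ui = 0.5 * d(i, j) + (row[i] - row[j]) / (2 * (n - 2))
v_uj = d(i, j) - v_ui
print(pair, v_ui, v_uj)  # ('A', 'B') 2.0 3.0
```

Joining A and B at distances 2.0 and 3.0 from the new node, then collapsing the matrix with the d_{u,m} update, would complete the first of the n - 3 iterations needed to resolve this tree.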

Fitch-Margoliash Method

The Fitch-Margoliash method is a distance-matrix approach to phylogenetic tree reconstruction that employs a weighted least-squares criterion to estimate tree topology and branch lengths from a matrix of pairwise evolutionary distances. Introduced by Walter M. Fitch and Emanuel Margoliash in 1967, it seeks to find the tree that minimizes the sum of squared differences between observed distances and those predicted by the tree. This method does not assume a molecular clock, allowing for unequal evolutionary rates across lineages, and is particularly suited for datasets where distances are derived from molecular sequences such as cytochrome c. The algorithm begins with an initial star tree, where all taxa connect to a central node, and proceeds via an iterative search that builds the tree through progressive addition of taxa. At each step, a new taxon is inserted into every possible branch of the current tree, and branch lengths are optimized to minimize the least-squares criterion for the updated topology. Tree rearrangement, such as local optimizations via the nearest-neighbor interchange, may be applied to explore alternative topologies and escape local minima. Branch lengths are estimated by solving for path distances that best fit the observed matrix; for a triplet of taxa (i, j, k), the length of the branch to i is calculated as l_i = \frac{d_{ij} + d_{ik} - d_{jk}}{2}, with adjustments to ensure non-negative values. The overall objective function minimized is: \sum_{i < j} (d_{ij}^{\text{obs}} - d_{ij}^{\text{tree}})^2 where d_{ij}^{\text{obs}} is the observed distance between taxa i and j, and d_{ij}^{\text{tree}} is the sum of branch lengths along the path connecting them in the tree; weights can be applied to emphasize shorter, more reliable distances. This process continues until no further improvements in the criterion are found, yielding an unrooted tree with optimized branch lengths.
To root the resulting unrooted tree, an outgroup—a taxon evolutionarily distant from the ingroup—is incorporated into the distance matrix, allowing the root to be placed on the branch leading to the outgroup and clarifying the direction of evolution. This outgroup approach helps mitigate potential artifacts, such as long-branch attraction, where rapidly evolving lineages might otherwise cluster erroneously due to inflated distances. Early computational implementations of the method were provided in the PHYLIP (Phylogeny Inference Package) software suite, particularly through the FITCH program, which supports the core least-squares optimization and has been widely used since the 1980s for distance-based analyses.
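The three-point formula quoted above can be worked through on a small example. This sketch uses hypothetical distances and applies the formula symmetrically to all three taxa of a triplet, clamping any negative estimate to zero as the text describes:

```python
def three_point_lengths(d_ij, d_ik, d_jk):
    """Branch lengths from the internal node of a triplet (i, j, k) so that
    path lengths exactly reproduce the three pairwise distances."""
    l_i = (d_ij + d_ik - d_jk) / 2.0
    l_j = (d_ij + d_jk - d_ik) / 2.0
    l_k = (d_ik + d_jk - d_ij) / 2.0
    # clamp to zero: negative branch lengths are not biologically meaningful
    return max(l_i, 0.0), max(l_j, 0.0), max(l_k, 0.0)

# hypothetical distances: d(i,j)=4, d(i,k)=6, d(j,k)=8
l_i, l_j, l_k = three_point_lengths(4.0, 6.0, 8.0)
print(l_i, l_j, l_k)  # 1.0 3.0 5.0
```

For an additive triplet, the fitted lengths satisfy l_i + l_j = d_ij, l_i + l_k = d_ik, and l_j + l_k = d_jk exactly, so the least-squares residual is zero; with noisy data the clamping and weighting steps become relevant.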

Optimality-Based Methods

Maximum Parsimony

Maximum parsimony is an optimality criterion in computational phylogenetics that selects the phylogenetic tree requiring the minimum number of evolutionary changes, or "steps," to explain the observed variation in character data across taxa. This principle embodies Occam's razor by favoring the simplest hypothesis consistent with the data, assuming that fewer changes are more likely than more complex scenarios. Character data, including morphological traits or molecular sequences, are scored such that each site or trait constitutes an independent character with discrete states. For unordered characters, where any state change incurs equal cost, the Fitch algorithm computes the parsimony score for a given tree topology by performing two passes: a bottom-up pass to intersect possible states at internal nodes and a top-down pass to assign states minimizing changes. This dynamic programming approach runs in linear time relative to the number of nodes and characters, making it efficient for scoring trees. In contrast, the Sankoff algorithm handles ordered characters or weighted transformations by maintaining cost matrices at each node, calculating the minimum cost to reach each possible state via dynamic programming during a post-order traversal. These algorithms enable the evaluation of candidate trees by summing the minimum steps required across all characters. The parsimony score for a tree T is defined as the sum of the minimum number of state changes over all characters: s(T) = \sum_{i=1}^{m} l_i(T) where m is the total number of characters, and l_i(T) is the minimum number of steps needed to explain the states of character i on tree T. Phylogenetic inference under maximum parsimony involves searching the space of possible trees to find one (or more) minimizing s(T). For small datasets with up to 10-12 taxa, exhaustive search evaluates all possible topologies, which number (2n-5)!! for n leaves in an unrooted binary tree.
For larger datasets, exact methods like branch-and-bound prune the search space by establishing upper and lower bounds on the score during tree exploration, guaranteeing an optimal solution while avoiding full enumeration whenever the bounds prune effectively. This approach starts from an initial tree and systematically adds branches while discarding subtrees exceeding the current best score. Heuristic strategies, such as stepwise addition, build an initial tree by sequentially adding taxa to the best position on the current tree, followed by branch swapping (e.g., nearest-neighbor interchange) to refine the topology. The branch-and-bound algorithm provides an exact solution for parsimony problems with limited taxa, leveraging the Fitch or Sankoff scoring to bound partial trees. For datasets involving unaligned sequences, the Sankoff-Morel-Cedergren algorithm extends parsimony via dynamic programming to simultaneously optimize alignments and tree topologies, treating insertions and deletions as state changes. This direct optimization approach minimizes the total number of transformations, including substitutions and indels, across all sites. Software implementing maximum parsimony includes PAUP*, which supports exhaustive, branch-and-bound, and heuristic searches for both morphological and molecular data. TNT offers advanced heuristic algorithms optimized for large datasets, emphasizing rapid tree searches under parsimony. For unaligned sequences, POY and its predecessor MALIGN perform direct optimization, integrating alignment costs into the parsimony score to handle variable-length data.
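As a concrete illustration, the Fitch scoring pass described above can be sketched in a few lines; the bottom-up pass alone suffices to obtain the parsimony score for one character. The tree encoding (nested tuples) and taxon names are illustrative choices, not from any particular package.

```python
def fitch_score(tree, states):
    """Minimum number of state changes (parsimony score) for one character
    on a rooted binary tree.

    tree: nested tuples of leaf names, e.g. (("t1", "t2"), ("t3", "t4")).
    states: dict mapping leaf name -> observed character state.
    """
    def postorder(node):
        # Leaf: its state set is the observed state; no changes yet.
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_cost = postorder(node[0])
        right_set, right_cost = postorder(node[1])
        common = left_set & right_set
        if common:                       # intersection: no extra change needed
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1  # union: +1 step

    _, score = postorder(tree)
    return score

# One site with states A, A, G, G: a single change suffices on this topology.
tree = (("t1", "t2"), ("t3", "t4"))
site = {"t1": "A", "t2": "A", "t3": "G", "t4": "G"}
print(fitch_score(tree, site))  # prints 1
```

Summing `fitch_score` over all characters gives s(T) for a candidate topology, which a search procedure (exhaustive, branch-and-bound, or heuristic) then minimizes over tree space.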

Maximum Likelihood

Maximum likelihood (ML) estimation in computational phylogenetics is a statistical framework for inferring phylogenetic trees by selecting the tree topology and branch lengths that maximize the probability of observing the given sequence data under a specified model of nucleotide or amino acid substitution. Introduced by Joseph Felsenstein in 1981, this approach treats the alignment of homologous sequences as realizations of a Markov process along the branches of the tree, where the likelihood is computed as the probability Pr(data | tree, model). The method relies on Felsenstein's pruning algorithm, a dynamic programming technique that efficiently calculates partial likelihoods at each node by summing over possible ancestral states, propagating from the leaves to the root while avoiding exhaustive enumeration of all state combinations. The inference process involves evaluating the likelihood for multiple candidate trees, typically starting from an initial topology such as a neighbor-joining tree, followed by heuristic optimization to explore the vast tree space. Common search strategies include hill-climbing algorithms that iteratively apply topological rearrangements like nearest-neighbor interchanges (NNI) to improve the likelihood score, or stochastic methods such as genetic algorithms that maintain a population of trees and evolve them through crossover and mutation operations to escape local optima. The likelihood function for an alignment is formulated as the product over all sites of the probability of the observed site pattern given the tree and model: L = \prod_{i=1}^{n} P(D_i \mid T, \theta), where D_i is the data at site i, T is the tree, and \theta includes model parameters like substitution rates and branch lengths; to account for identical site patterns, the product is often over unique patterns raised to their frequencies.
This computation incorporates transition probability matrices derived from Markov models of evolution, which specify the probability of changing from one state (e.g., nucleotide) to another over a branch of given length. A key advantage of ML is its ability to explicitly model multiple substitutions at the same site, which parsimony-based methods overlook, leading to more accurate branch length estimates and tree topologies under conditions of high sequence divergence. Additionally, ML accommodates among-site rate variation, such as through the discrete gamma model, which assigns sites to rate categories to better fit heterogeneous evolutionary rates and improve inference robustness. Widely used software for ML phylogenetics includes RAxML, which employs randomized accelerated searches for rapid inference on large datasets with support for mixed models and parallelization, and PhyML, which uses a fast hill-climbing algorithm with NNI rearrangements to achieve high topological accuracy on datasets up to thousands of taxa.
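A minimal sketch of Felsenstein's pruning algorithm for a single alignment column under the Jukes-Cantor (JC69) model, where the transition probabilities have a closed form. The tree encoding, branch lengths, and observed bases are illustrative, not from any real dataset:

```python
import math

BASES = "ACGT"

def jc_prob(b):
    """JC69 transition probabilities over a branch of length b
    (expected substitutions per site): returns (p_same, p_diff)."""
    e = math.exp(-4.0 * b / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e

def site_likelihood(tree, branch, obs):
    """Pruning algorithm for one site on a rooted tree of nested tuples.
    branch: dict keyed by leaf name or subtree tuple -> branch length;
    obs: dict leaf name -> observed base."""
    def partial(node):
        # Conditional likelihoods L[x] = P(data below node | state x at node)
        if isinstance(node, str):
            return [1.0 if x == obs[node] else 0.0 for x in BASES]
        out = [1.0] * 4
        for child in node:
            p_same, p_diff = jc_prob(branch[child])
            L = partial(child)
            for i in range(4):
                # Sum over the child's states y of P(y | x, b) * L_child[y]
                out[i] *= sum((p_same if i == j else p_diff) * L[j]
                              for j in range(4))
        return out

    root = partial(tree)
    return sum(0.25 * l for l in root)  # uniform root frequencies under JC69

tree = (("t1", "t2"), "t3")
branch = {"t1": 0.1, "t2": 0.1, ("t1", "t2"): 0.05, "t3": 0.2}
lik = site_likelihood(tree, branch, {"t1": "A", "t2": "A", "t3": "G"})
```

Multiplying (or summing in log space) such per-site values over all columns yields the alignment likelihood L that the tree search maximizes.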

Bayesian and Probabilistic Approaches

Bayesian Inference

Bayesian inference in phylogenetics applies Bayes' theorem to estimate the posterior probability distribution of phylogenetic trees and associated parameters given sequence data, incorporating prior knowledge and quantifying uncertainty across the parameter space. The core principle is that the posterior probability of a tree T given data D is proportional to the product of the likelihood of the data under the tree and the prior probability of the tree, allowing integration over all possible trees rather than selecting a single point estimate. Formally, P(T \mid D) = \frac{P(D \mid T) \cdot P(T)}{P(D)}, where P(D \mid T) is the likelihood (computed similarly to maximum likelihood methods but evaluated across many trees), P(T) is the prior on trees, and P(D) is the marginal likelihood serving as a normalizing constant. Priors on trees are often uniform, assigning equal probability to all possible topologies, though more informative priors can be used for branch lengths or other parameters to reflect evolutionary assumptions. To sample from this complex, high-dimensional posterior, Bayesian methods employ Markov chain Monte Carlo (MCMC) algorithms, primarily the Metropolis-Hastings sampler, which generates a chain of tree proposals and accepts or rejects them based on the posterior ratio to approximate the distribution. Tree space exploration relies on proposal operators such as nearest-neighbor interchange (NNI), which swaps adjacent subtrees, and subtree prune-and-regraft (SPR), which prunes a subtree and reattaches it to a different branch, enabling efficient traversal of discrete tree topologies while maintaining detailed balance. These MCMC runs typically discard an initial "burn-in" period and use multiple chains to assess convergence, providing samples from which posterior summaries like credible intervals for branch lengths or divergence times are derived.
Prominent software implementing these approaches includes MrBayes, introduced in 2001 for general Bayesian phylogenetic inference using MCMC on nucleotide and amino acid data. An extension, MrBayes 3, accommodates mixed evolutionary models across data partitions, allowing simultaneous analysis of heterogeneous datasets like combined genes with different substitution rates. BEAST, released in 2007, extends this framework to time-calibrated phylogenies, incorporating relaxed clock models and priors on evolutionary rates for dating speciation events. More recent developments include BEAST X (2025), which integrates molecular phylogenetic reconstruction with complex trait evolution and phylogeographic modeling for enhanced scalability. In applications involving population genetics, BEAST integrates coalescent models as tree priors, which describe the genealogy of gene copies within species to infer species trees while accounting for incomplete lineage sorting and varying effective population sizes.
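The Metropolis-Hastings acceptance rule can be illustrated with a deliberately tiny example: sampling the posterior of a single branch length from the number of mismatched sites between two sequences under JC69. The exponential prior rate, proposal width, and data counts are all illustrative choices, and the symmetric random-walk proposal makes the acceptance ratio equal to the posterior ratio:

```python
import math
import random

def log_post(b, k, n):
    """Unnormalized log posterior for branch length b given k mismatched
    sites out of n under JC69, with an Exp(rate=10) prior on b
    (all numbers here are illustrative)."""
    if b <= 0:
        return float("-inf")
    p = 0.75 * (1.0 - math.exp(-4.0 * b / 3.0))  # per-site mismatch probability
    return k * math.log(p) + (n - k) * math.log(1.0 - p) - 10.0 * b

def metropolis(k, n, steps=20000, burn_in=5000, seed=1):
    """Metropolis-Hastings with a symmetric Gaussian random-walk proposal;
    the Hastings correction cancels for symmetric proposals."""
    rng = random.Random(seed)
    b, samples = 0.1, []
    for _ in range(steps):
        prop = b + rng.gauss(0.0, 0.02)
        # Accept with probability min(1, posterior ratio).
        if math.log(rng.random()) < log_post(prop, k, n) - log_post(b, k, n):
            b = prop
        samples.append(b)
    return samples[burn_in:]             # discard the burn-in period

posterior = metropolis(k=30, n=200)
mean_b = sum(posterior) / len(posterior)  # posterior mean branch length
```

Real samplers such as those in MrBayes or BEAST work the same way in principle, but propose changes to full tree topologies (via NNI or SPR moves) and many parameters jointly, and monitor convergence across multiple heated chains.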

Model Selection

Model selection in computational phylogenetics involves choosing the most appropriate evolutionary substitution model to accurately describe the process of nucleotide or amino acid changes along a phylogenetic tree, particularly for maximum likelihood and Bayesian inference methods. This step is crucial because an inadequate model can lead to biased parameter estimates and incorrect tree topologies. Models vary in complexity, from simple ones assuming equal substitution rates to more sophisticated ones accounting for rate heterogeneity and base frequency differences. Common nucleotide substitution models include the Jukes-Cantor (JC69) model, which assumes equal rates of substitution among the four nucleotides and equal base frequencies. The Kimura 2-parameter (K80) model extends this by distinguishing between transitions and transversions, allowing different rates for each. The general time-reversible (GTR) model is more flexible, incorporating six substitution rates and unequal base frequencies, making it suitable for diverse datasets. To address rate variation across sites, the gamma distribution (+Γ) is often added, modeling among-site rate heterogeneity with a shape parameter α. Additionally, a proportion of invariable sites (+I) accounts for sites that do not evolve, improving fit for conserved sequences. These extensions, such as GTR + Γ + I, are frequently selected for their balance of realism and computational feasibility. Selection criteria evaluate models based on their fit to the data while penalizing excessive complexity to prevent overfitting. The Akaike information criterion (AIC) is defined as \text{AIC} = -2 \ln L + 2k, where \ln L is the log-likelihood and k is the number of parameters, providing a measure that favors parsimonious models. The Bayesian information criterion (BIC) is similar but applies a stronger penalty: \text{BIC} = -2 \ln L + k \ln n, with n as the number of sites, making it more conservative for larger datasets.
Hierarchical likelihood ratio tests (hLRTs) compare nested models sequentially using a chi-squared distribution to assess significance, though they can be sensitive to assumptions about parameter boundaries. Automated tools facilitate model selection by testing multiple models and applying these criteria. jModelTest evaluates up to 203 models for DNA data using AIC, BIC, and hLRTs, supporting parallel computing for efficiency. ModelFinder, integrated into IQ-TREE, rapidly selects models by incorporating branch-length estimation and mixture models, outperforming earlier tools in accuracy and speed for large alignments. For protein-coding genes, codon models like the Goldman-Yang model (GY94) account for synonymous and nonsynonymous substitutions, enabling detection of selection pressures while accommodating site-specific rate variation. In recent analyses of genomic data, partitioned models allow different substitution parameters for distinct data blocks (e.g., gene partitions or codon positions), improving fit by accommodating evolutionary heterogeneity. Tools like PartitionFinder automate partitioning scheme selection alongside model choice using AIC or BIC, essential for phylogenomic datasets with thousands of loci.
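The two criteria differ only in their complexity penalty, which a short sketch makes concrete. All log-likelihood values and parameter counts below are invented for illustration, and k counts substitution-model parameters only (branch lengths are omitted for simplicity):

```python
import math

def aic(lnL, k):
    """Akaike information criterion: -2 ln L + 2k."""
    return -2.0 * lnL + 2.0 * k

def bic(lnL, k, n):
    """Bayesian information criterion: -2 ln L + k ln n (n = number of sites)."""
    return -2.0 * lnL + k * math.log(n)

n_sites = 1000
# name: (log-likelihood, free substitution-model parameters) -- illustrative
models = {"JC69": (-5200.0, 0), "K80": (-5150.0, 1), "GTR+G": (-5140.0, 9)}

best_aic = min(models, key=lambda m: aic(*models[m]))
best_bic = min(models, key=lambda m: bic(models[m][0], models[m][1], n_sites))
# With these numbers AIC prefers the richer GTR+G, while BIC's stronger
# ln(n) penalty favors the simpler K80 -- the conservativeness noted above.
```

Tools such as jModelTest and ModelFinder apply exactly this comparison, but after actually optimizing the log-likelihood of each candidate model on the data.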

Evaluating Phylogenetic Hypotheses

Measures of Nodal Support

In computational phylogenetics, measures of nodal support evaluate the robustness of clades or branches in a phylogenetic tree, particularly under parsimony criteria, by quantifying how much the tree topology must change to alter a given resolution. These metrics focus on the internal stability of the tree without relying on data resampling, providing insights into the evidential weight supporting specific bipartitions. Bremer support, also termed the decay index, is a widely used measure that indicates the number of extra parsimony steps required to collapse a clade into a polytomy in the consensus of suboptimal trees. Introduced by Bremer, this index reflects the clade's congruence with the data, where higher values denote greater stability. For instance, a Bremer value of 5 signifies that the clade persists as resolved in all trees up to five steps longer than the most parsimonious tree, highlighting the additional homoplasy needed to refute it. The retention index, specific to parsimony analyses, assesses the overall fit of characters to the tree by measuring the fraction of apparent synapomorphies that are retained without homoplasy. Developed by Farris, it is defined as the ratio of retained synapomorphies to the total potential synapomorphies beyond the minimum required steps, offering a global view of data retention amid incongruence. This index complements Bremer support by emphasizing character efficiency across the entire tree rather than individual nodes. To compute Bremer support, the analysis identifies the shortest tree lacking the clade in question and subtracts the length of the optimal tree from it, yielding the step difference as a direct count of evidential excess. However, these measures have limitations. In datasets exhibiting substitution saturation, where multiple changes obscure signal in DNA sequences, Bremer support can overestimate nodal robustness by underestimating hidden homoplasy.
Furthermore, if characters lack true independence—such as in pseudoreplication scenarios where correlated sites mimic additional evidence—both Bremer support and retention index may inflate perceived clade stability without capturing biological variance.

Bootstrapping and Jackknifing

Bootstrapping is a nonparametric resampling technique introduced to phylogenetics by Felsenstein in 1985 to assess the reliability of inferred phylogenetic trees by estimating confidence intervals on clades. The method generates pseudoreplicate datasets by sampling characters from the original alignment with replacement, maintaining the same number of taxa and characters as the original data, and then reconstructing a phylogenetic tree for each pseudoreplicate using the same optimality criterion, such as maximum parsimony or maximum likelihood. The bootstrap support for a particular clade is calculated as the percentage of pseudoreplicate trees that contain that clade, providing a measure of how consistently the data supports the grouping across resampled datasets. The support value is formally defined as \text{Bootstrap support} = \left( \frac{\text{number of pseudoreplicates supporting the clade}}{\text{total number of pseudoreplicates}} \right) \times 100. In practice, analyses typically involve 100 to 1000 bootstrap replicates to achieve stable estimates, with 1000 often recommended for precise p-value approximations in phylogenetic studies. Clades with bootstrap support exceeding 70% are generally considered well-supported, based on empirical evaluations showing this threshold corresponds to a low false-positive rate under various evolutionary models. Jackknifing, adapted to phylogenetics as a resampling method that deletes subsets of characters without replacement, was developed to evaluate clade stability, particularly in parsimony analyses, as detailed by Farris et al. in 1996. Typically, approximately 37% (or e^{-1}) of the characters are randomly removed in each replicate to create a reduced dataset, after which a tree is inferred and the process is repeated multiple times, often 1000 or more, to generate a distribution of results.
Support is then assessed via strict consensus (requiring the clade in all supporting replicates) or relaxed consensus (allowing a frequency threshold, similar to bootstrap proportions), with clades appearing in over 50% of jackknife replicates deemed robust in parsimony contexts. Both methods assume character independence and can yield inconsistent support under long-branch attraction or heterogeneous evolutionary rates, as they may underestimate variability when site patterns are correlated or model assumptions are violated.
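The resampling step itself is simple; the sketch below separates it from tree inference, which is supplied as a caller-provided function standing in for a real search program such as PAUP* or RAxML (the toy interface is an assumption for illustration, not any package's API):

```python
import random

def bootstrap_columns(alignment, rng):
    """One pseudoreplicate: resample alignment columns with replacement.
    alignment: dict taxon -> sequence string, all sequences equal length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {taxon: "".join(seq[c] for c in cols)
            for taxon, seq in alignment.items()}

def bootstrap_support(alignment, infer_clades, replicates=100, seed=0):
    """Percentage of pseudoreplicates whose inferred tree contains each clade.
    infer_clades: function alignment -> set of clades (frozensets of taxon
    names); a stand-in for a full tree search under parsimony or likelihood."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(replicates):
        for clade in infer_clades(bootstrap_columns(alignment, rng)):
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: 100.0 * c / replicates for clade, c in counts.items()}
```

With 1000 replicates and a real inference routine, the returned percentages are the familiar bootstrap values annotated on published trees; jackknifing differs only in drawing a reduced column subset without replacement.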

Posterior Probability and Consensus Trees

In Bayesian phylogenetic inference, the posterior probability of a clade represents the proportion of trees in the posterior distribution that contain that clade, providing a measure of support derived from the integration of prior probabilities and the likelihood of the data under a specified evolutionary model. This probability is estimated empirically as the frequency with which the clade appears across trees sampled via Markov chain Monte Carlo (MCMC) methods during the analysis. Posterior probabilities greater than 0.95 are conventionally interpreted as indicating strong support for a clade, analogous to a 95% credible interval in Bayesian statistics, though this threshold assumes adequate MCMC convergence and model fit. To summarize the uncertainty inherent in the posterior distribution of trees, consensus trees are constructed by aggregating compatible clades from the sampled topologies. The majority-rule consensus tree includes all clades that appear in more than 50% of the posterior samples, offering a balanced representation of prevalent relationships while highlighting areas of discordance. In contrast, the strict consensus tree retains only clades present in 100% of the samples, resulting in a more conservative summary that collapses unresolved polytomies. The Adams consensus tree, which focuses on maximizing the preservation of compatible clades across internal branches, is particularly useful for rooted trees where leaf labels differ minimally, providing a subtree that embeds elements from all input trees without contradiction. Despite their utility, posterior probabilities can overestimate clade support due to biases in MCMC sampling, such as insufficient chain length or improper heating schemes that fail to explore the full posterior adequately, leading to inflated confidence in suboptimal topologies. 
To address this, analysts often map the posterior to credible sets—collections of trees encompassing a specified probability mass (e.g., 95%)—rather than relying on point estimates like maximum a posteriori trees, enabling a more robust depiction of phylogenetic uncertainty. For instance, in analyses performed with MrBayes, the software outputs a majority-rule consensus tree annotated with posterior probabilities for each node, where values near 1.0 denote high confidence and lower values (e.g., 0.6–0.8) signal potential instability, facilitating direct interpretation of clade reliability from the MCMC samples. Recent advancements, such as quartet sampling, extend these concepts to species tree inference by evaluating posterior support through the distribution of induced quartets from posterior gene trees, distinguishing between lack of signal and conflicting evidence in multispecies coalescent models.
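Given the clade sets extracted from a post-burn-in MCMC sample, the majority-rule summary reduces to counting bipartition frequencies. A minimal sketch (the toy "posterior sample" below is invented):

```python
from collections import Counter

def majority_rule(sampled_trees, threshold=0.5):
    """Clades appearing in more than `threshold` of the sampled trees,
    mapped to their frequencies (estimated posterior probabilities).
    Each tree is represented as a set of clades, i.e. frozensets of
    taxon names, as would be extracted from an MCMC tree sample."""
    n = len(sampled_trees)
    counts = Counter(clade for tree in sampled_trees for clade in tree)
    return {clade: c / n for clade, c in counts.items() if c / n > threshold}

# Toy posterior sample of four trees on taxa A-D
sample = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},  # one conflicting topology
    {frozenset("AB"), frozenset("ABC")},
]
consensus = majority_rule(sample)
# AB appears with frequency 1.0, ABC with 0.75; ABD (0.25) is dropped.
strict = majority_rule(sample, threshold=0.999)  # strict consensus: AB only
```

Raising the threshold to 1.0 (approximated by 0.999 above) recovers the strict consensus, which keeps only clades present in every sampled tree.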

Challenges and Limitations

Homoplasy and Horizontal Gene Transfer

Homoplasy refers to the similarity among taxa in discrete characters that arises independently rather than from common ancestry, posing a significant challenge to tree-based phylogenetic inference by introducing noise that can mislead reconstructions. It manifests through three primary mechanisms: parallelism, where related lineages evolve the same trait independently; convergence, where unrelated lineages develop similar traits due to shared selective pressures; and reversal, where a derived trait reverts to an ancestral state. In computational phylogenetics, homoplasy is quantified using the consistency index (CI), defined for a character as \text{CI} = \frac{m}{s}, where m is the minimum number of evolutionary steps required to explain the character distribution (typically the number of observed states minus one), and s is the actual number of steps observed on the inferred tree. Lower CI values indicate higher levels of homoplasy, reflecting greater incongruence between the character data and the phylogenetic hypothesis. This metric, originally proposed in the context of parsimony analysis, helps evaluate the fit of data to trees and guides method selection in handling noisy datasets. Horizontal gene transfer (HGT), a form of reticulate evolution, further complicates phylogenetic reconstruction by allowing genetic material to move across lineages outside vertical inheritance, often resulting in gene trees that conflict with species trees. In computational phylogenetics, HGT is primarily detected by identifying incongruences among gene trees inferred from individual loci, where unexpected topologies or branch length anomalies signal transfer events. Specialized tools facilitate this process; for instance, HGTector employs a non-parametric approach based on the distribution of BLAST search hits against reference genomes to score genes for potential HGT, enabling genome-wide screening with high sensitivity and specificity. 
Similarly, methods like RIxML integrate parametric phylogenetic modeling to assess transfer probabilities by comparing reconciled gene and species trees. These approaches are particularly effective in microbial genomes, where HGT rates can exceed 10% of gene content. To mitigate the impacts of homoplasy and HGT, computational strategies include long-branch removal, which involves excluding taxa or sites with excessively long branches to reduce artifacts like long-branch attraction that exacerbate homoplasy signals in distance-based or parsimony methods. For HGT-induced reticulation, phylogenetic network methods provide a workaround by modeling evolutionary histories as directed acyclic graphs that accommodate transfer edges alongside bifurcating trees. A prominent example of HGT's role in phylogenetics is the dissemination of antibiotic resistance genes among bacterial phyla, such as the blaNDM-1 gene transferred via plasmids, which creates mosaic gene trees discordant with core genome phylogenies and drives rapid adaptation in pathogens like Klebsiella pneumoniae.
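The consistency index defined above is a one-line calculation per character, and the ensemble version over a whole matrix uses the same ratio of summed minima to summed observed steps (the values below are illustrative):

```python
def consistency_index(min_steps, observed_steps):
    """CI = m / s for one character: m is the minimum conceivable number of
    steps (number of observed states minus one), s the number of steps
    actually required on the inferred tree. CI = 1 means no homoplasy."""
    return min_steps / observed_steps

def ensemble_ci(characters):
    """Ensemble CI over a matrix given as (m, s) pairs: sum(m) / sum(s)."""
    return sum(m for m, _ in characters) / sum(s for _, s in characters)

# A binary character (m = 1) needing 2 steps on the tree shows homoplasy:
ci = consistency_index(1, 2)                     # 0.5: one extra, homoplastic change
overall = ensemble_ci([(1, 1), (1, 2), (3, 3)])  # 5/6 across three characters
```

Low CI values flag characters (or whole datasets) whose conflict with the tree may reflect convergence, parallelism, reversal, or reticulate processes such as HGT.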

Hybridization, Introgression, and Incomplete Lineage Sorting

Hybridization, introgression, and incomplete lineage sorting (ILS) represent key processes that generate discordance between gene trees and species trees in computational phylogenetics, particularly in eukaryotic lineages where reticulate evolution and coalescent stochasticity challenge traditional bifurcating models. These phenomena arise from gene flow across species boundaries and incomplete ancestral polymorphism resolution, leading to anomalous gene trees (AGTs) that do not match the underlying species topology. Addressing them requires coalescent-based methods that account for both reticulation and population-level variation, enabling more accurate species tree inference from genomic data. Hybridization involves the merging of divergent genomes, often resulting in allopolyploidy in plants, where interspecific crosses followed by chromosome doubling create new species with combined parental contributions. This process generates reticulate phylogenies, as subgenomes retain distinct evolutionary histories, detectable through methods like the D-statistic, which identifies asymmetric allele patterns indicative of admixture. In allopolyploids, hybridization fosters rapid speciation by combining adaptive traits, but it complicates phylogenetic reconstruction due to mosaic gene trees. Introgression refers to the post-speciation transfer of genetic material between diverged lineages, typically through occasional hybridization events followed by repeated backcrossing. This unidirectional or bidirectional exchange can introduce adaptive alleles, as seen in animal systems, and is quantified using D-statistics, which measure imbalances in shared derived alleles (ABBA vs. BABA site patterns) across a reference tree. Under a null model of no introgression, D ≈ 0; significant deviations (e.g., |D| > 0.05 with Z-scores > 3) signal introgression, robust even for ancient events spanning millions of years.
Computational tools apply these statistics genome-wide to map introgressed tracts, distinguishing them from ILS by their asymmetric topology frequencies. Incomplete lineage sorting occurs when ancestral polymorphisms persist through speciation events due to insufficient time for fixation, resulting in gene trees that randomly resolve among possible topologies with equal probability under the coalescent. The multispecies coalescent (MSC) model formalizes this by embedding gene genealogies within the species tree, modeling coalescences backward in time across its branches. Seminal implementations include ASTRAL, which efficiently infers species trees by maximizing quartet consistency from unrooted gene trees, handling thousands of loci rapidly while being statistically consistent under the MSC. Complementarily, *BEAST employs Bayesian co-estimation of gene trees and the species tree, incorporating effective population sizes (N_e) for ancestral branches to account for ILS variability. The probability of ILS for an ancestral branch, representing the chance that two lineages fail to coalesce within the branch length t (in generations), is approximated as e^{-t / (2N_e)}, where N_e is the effective population size; this decreases with longer branches or smaller populations, reducing discordance. A prominent example is Neanderthal introgression into modern humans, where genomic analyses revealed 1–2% of non-African genomes derive from Neanderthals via ancient interbreeding ~50,000 years ago, detected through D-statistics showing excess archaic alleles in Eurasians relative to Africans. This introduced beneficial variants, such as those enhancing immune function, but was limited by selection against deleterious segments, illustrating how introgression shapes human adaptation under coalescent frameworks.
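Both quantities discussed above reduce to short calculations. A sketch with invented site-pattern counts, computing D and the per-branch ILS probability e^{-t/(2N_e)}:

```python
import math

def d_statistic(abba, baba):
    """ABBA-BABA D for taxa ordered (P1, P2, P3, outgroup):
    D = (ABBA - BABA) / (ABBA + BABA). Under ILS alone the two discordant
    site patterns are equally likely, so D is expected to be ~0."""
    return (abba - baba) / (abba + baba)

def p_ils(t, Ne):
    """Probability that two lineages fail to coalesce within an ancestral
    branch of t generations, given diploid effective population size Ne."""
    return math.exp(-t / (2.0 * Ne))

# Illustrative counts: an excess of ABBA sites suggests P2-P3 gene flow.
d = d_statistic(abba=1200, baba=800)   # 0.2

# Longer branches or smaller populations reduce the chance of ILS:
p_short = p_ils(t=10_000, Ne=10_000)   # ~0.61: discordance likely
p_long = p_ils(t=100_000, Ne=10_000)   # ~0.007: discordance rare
```

In practice, significance of D is assessed with block-jackknife Z-scores over genomic windows to account for linkage between nearby sites.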

Taxon Sampling, Missing Data, and Continuous Characters

In computational phylogenetics, taxon sampling strategies balance density—adding more closely related taxa to resolve shallow divergences—and breadth—incorporating diverse outgroups to anchor deep nodes. Dense sampling enhances accuracy by breaking long branches and reducing stochastic errors in branch length estimation, particularly under maximum likelihood methods. In contrast, broad sampling improves resolution of ancient splits but risks introducing artifacts if rapidly evolving lineages are overrepresented. A key challenge arises from long-branch attraction (LBA), where parsimony or certain likelihood models erroneously group distant taxa with elevated substitution rates, as first demonstrated in simulations of unequal branch lengths. Increasing taxon density mitigates LBA by subdividing problematic branches, thereby diluting the signal from homoplasious sites. Missing data in phylogenetic matrices, often arising from incomplete sequencing or alignment gaps, can bias inference if non-randomly distributed, though empirical studies indicate tolerance up to 60-70% overall missingness when taxa evolve slowly and data are not patchily absent. High proportions of missing data reduce resolving power by limiting detectable substitutions, potentially exacerbating LBA in sparse regions, but do not inherently introduce systematic error beyond decreased precision. To address this, imputation methods based on low-rank approximations of distance or trait matrices estimate absent entries, preserving phylogenetic signal in large datasets. Machine learning approaches, including autoencoders, further refine imputations for incomplete distance matrices by learning evolutionary patterns from observed data. Recent advances as of 2025 include machine learning models for imputing missing phylogenetic data, improving accuracy in phylogenomics. Continuous characters, such as morphological measurements or quantitative traits, require specialized models to account for gradual change along branches.
Squared-change parsimony minimizes the sum of squared differences in trait values across tree edges to reconstruct ancestral states, providing a least-squares optimization suitable for gradual trait evolution on fixed phylogenies. This approach excels in resolving intermediate states but can overestimate changes on long branches. Complementarily, Felsenstein's model treats continuous traits as evolving under Brownian motion—a random walk with variance proportional to branch length—enabling ancestral state estimation via independent contrasts that remove phylogenetic non-independence. These contrasts compute differences between sister taxa, scaled by branch lengths, to infer evolutionary rates without assuming a specific root state. Software like RAxML accommodates missing data and gaps by treating them as unknown states during likelihood calculations, avoiding deletion of incomplete sites while optimizing memory for supermatrices with up to 90% absence through parallelization. Sensitivity analyses, involving iterative removal or imputation of subsets, further assess robustness by comparing topologies across data partitions. Although historically sparse, recent fossil discoveries have begun to clarify arboreal trait retention in early primates, aiding inferences about ecological selectivity following the Cretaceous-Paleogene boundary. Such improvements underscore the need for integrated molecular-fossil approaches.
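Felsenstein's independent contrasts can be computed in a single post-order traversal. The sketch below (tree shape, branch lengths, and trait values all invented for illustration) returns the standardized contrasts along with the weighted ancestral estimate at each internal node:

```python
import math

def contrasts(node, branch, trait):
    """Independent contrasts on a rooted binary tree under Brownian motion.
    node: leaf name or (left, right) tuple; branch: dict node -> branch
    length; trait: dict leaf name -> trait value.
    Returns (ancestral estimate, effective branch length, contrasts list)."""
    if isinstance(node, str):
        return trait[node], branch[node], []
    x1, v1, c1 = contrasts(node[0], branch, trait)
    x2, v2, c2 = contrasts(node[1], branch, trait)
    # Standardized contrast between the two daughter lineages:
    c = (x1 - x2) / math.sqrt(v1 + v2)
    # Weighted-average ancestral estimate; the node's own branch is
    # lengthened by v1*v2/(v1+v2) to absorb the estimation variance.
    x = (x1 / v1 + x2 / v2) / (1.0 / v1 + 1.0 / v2)
    v = branch.get(node, 0.0) + v1 * v2 / (v1 + v2)
    return x, v, c1 + c2 + [c]

tree = (("t1", "t2"), "t3")
branch = {"t1": 1.0, "t2": 1.0, ("t1", "t2"): 1.0, "t3": 2.0}
trait = {"t1": 4.0, "t2": 6.0, "t3": 10.0}
root_est, _, cs = contrasts(tree, branch, trait)
```

The returned contrasts are mutually independent under the Brownian model, so they can be fed directly into ordinary regression to test for correlated trait evolution.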

Integrating Fossils and Temporal Aspects

Role of Fossils in Phylogenetics

Fossils play a crucial role in computational phylogenetics by providing empirical constraints on evolutionary timelines and tree topologies, particularly through calibration and the integration of morphological data. In phylogenetic analyses, fossils serve as minimum age constraints for internal nodes, anchoring molecular divergence estimates to the geological record and preventing unrealistic temporal placements of clades. This is essential because molecular data alone often lack absolute temporal scaling, leading to broad confidence intervals in divergence times. For instance, stratigraphic data from fossils establish hard minimum bounds for divergence events, improving the accuracy of divergence time estimates when incorporated into Bayesian frameworks. Beyond calibration, fossils enable total evidence dating, which combines morphological characters from extinct taxa with molecular sequences from extant species to simultaneously infer phylogeny and divergence times. This approach treats fossils as terminal taxa in analyses, leveraging their morphological traits to resolve ambiguities in molecular trees and account for extinct lineages. Morphological parsimony, applied to character matrices derived from fossil specimens, is a key method here; it reconstructs evolutionary relationships by minimizing changes across discrete traits, such as skeletal features, thereby placing fossils within broader phylogenies. Complementing this, fossilized birth-death models integrate fossil occurrence data directly into the phylogenetic process, modeling speciation, extinction, and fossil sampling rates to estimate node ages while accounting for the incompleteness of the fossil record. These methods enhance resolution, as fossils increase the number of informative characters and reduce polytomies in morphological datasets.
Despite these benefits, incorporating fossils presents challenges, including the inference of ghost lineages—hypothetical branches implied by phylogenetic topology but unsupported by direct evidence—and the reliance on minimum node ages, which can underestimate true divergence times if fossils represent late-branching events. Ghost lineages arise from gaps in the fossil record, potentially inflating perceived evolutionary durations and complicating calibration. To mitigate this, analysts use supermatrices that link morphological and molecular data, allowing fossils to be scored alongside extant taxa for combined optimization, though fragmentary preservation often limits completeness. A prominent example is the dinosaur-bird transition, where cladistic analyses of theropod fossils, including troodontids and dromaeosaurids, have resolved paravians as a monophyletic group bridging non-avian dinosaurs and modern birds through shared traits such as feathering.

Molecular Clock and Dating Methods

The molecular clock hypothesis posits that molecular sequences evolve at a relatively constant rate over time, allowing the estimation of divergence times between species based on genetic differences. This concept was first proposed by Émile Zuckerkandl and Linus Pauling, who observed that amino acid substitutions in hemoglobin accumulated at a steady pace, akin to the ticking of a clock, enabling the use of genetic data to infer evolutionary timelines. Under the strict molecular clock model, the rate of substitution is assumed to be identical across all branches of the phylogenetic tree, facilitating straightforward calculations of divergence times. The basic equation for estimating time t between two lineages is t = \frac{d}{2\mu}, where d is the observed genetic distance (e.g., number of substitutions per site) and \mu is the substitution rate per site per unit time; the factor of 2 accounts for substitutions accumulating independently in each lineage. However, empirical evidence revealed substantial rate variation across lineages due to factors like generation time, metabolic rate, and environmental pressures, leading to the development of relaxed clock models that permit rate heterogeneity while smoothing variations to estimate reliable divergence times. Relaxed clocks are categorized into correlated models, where rates on adjacent branches are similar (e.g., an autocorrelated log-normal relaxed clock in which a branch's rate is drawn from a log-normal distribution centered on its parent's rate), and uncorrelated models, where rates are independently drawn from a common distribution (e.g., the uncorrelated log-normal or uncorrelated exponential relaxed clock). These models better accommodate real-world rate heterogeneity without assuming constancy. Divergence time estimation using relaxed clocks often employs penalized likelihood or Bayesian frameworks, incorporating fossil evidence as calibration points to anchor the timeline.
Penalized likelihood, introduced by Sanderson, maximizes a likelihood function penalized by a smoothing term that discourages abrupt rate changes, balancing fit to the data against rate smoothness; fixed or flexible smoothing parameters govern the estimation of branch-specific rates and node ages. Bayesian methods, such as those implemented in MCMCtree, integrate prior distributions on rates and divergence times via Markov chain Monte Carlo (MCMC) sampling, treating fossil calibrations as probabilistic priors—hard bounds enforce strict minimum or maximum ages (e.g., truncated at fossil dates), while soft bounds allow flexibility (e.g., gamma or birth-death priors) to account for uncertainty in fossil ages. In uncorrelated relaxed clocks, rates for each branch are sampled independently from a distribution, such as the log-normal, enabling robust handling of rate variation.

Popular software for these analyses includes BEAST2, which supports Bayesian relaxed clock models with MCMC integration for joint estimation of phylogenies, rates, and dates, often using uncorrelated log-normal or exponential distributions. LSD (Least-Squares Dating) provides a faster, non-Bayesian alternative that minimizes squared deviations in rates and dates under a Gaussian model closely related to the Langley-Fitch clock model, making it suitable for large datasets with rate variation. Fossil calibrations serve as essential priors in these methods, typically specifying minimum ages from first appearances and maximum bounds from geological context, though soft bounds are preferred to avoid over-constraining the model. A notable advancement is total-evidence dating, which simultaneously analyzes molecular sequences and morphological data from extant and fossil taxa under relaxed clock models, treating fossils as tip-calibrated data points to improve age estimates without relying solely on node calibrations; this approach, applied to groups like the Hymenoptera, has estimated crown-group origins in the Permian around 280 million years ago.
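The penalized-likelihood trade-off can be illustrated numerically. The toy sketch below (an assumed Poisson model of per-branch substitution counts, not Sanderson's r8s implementation) scores a set of candidate branch rates as log-likelihood minus a smoothing-weighted roughness penalty on parent-child rate changes:

```python
import math

def penalized_loglik(branches, smoothing):
    """Sanderson-style penalized likelihood score (toy illustration).

    branches: list of dicts with keys
        'subs'        observed substitution count on the branch,
        'duration'    branch duration in time units,
        'rate'        candidate rate for the branch,
        'parent_rate' rate of the parent branch (None for the root branch).
    Likelihood: Poisson(subs | rate * duration), log-summed over branches.
    Penalty: smoothing * sum of squared parent-child rate differences.
    """
    loglik, penalty = 0.0, 0.0
    for b in branches:
        mean = b['rate'] * b['duration']
        loglik += b['subs'] * math.log(mean) - mean - math.lgamma(b['subs'] + 1)
        if b['parent_rate'] is not None:
            penalty += (b['rate'] - b['parent_rate']) ** 2
    return loglik - smoothing * penalty
```

A large smoothing value drives the optimum toward clock-like (equal) rates, while a small value lets each branch fit its own count; r8s-style cross-validation chooses the smoothing level from the data.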

References

  1. [1]
    [PDF] Introduction to Computational Phylogenetics
    Computational phylogenetics involves methods for estimating phylogenetic trees, which represent evolutionary histories, in biology and linguistics.
  2. [2]
    Computational Phylogenetics | Annual Reviews
    This is a review on Computational Phylogenetics by Claire Bowern, published in the Annual Review of Linguistics, Volume 4, 2018.
  3. [3]
    Phylogenetic Inference - Stanford Encyclopedia of Philosophy
    Dec 8, 2021 · Among the earliest algorithmic methods for inferring phylogenies are those known as “distance” methods. “Distance” could represent whichever ...
  4. [4]
    Common Methods for Phylogenetic Tree Construction and Their ...
    May 11, 2024 · In this review, we summarize common methods for constructing phylogenetic trees, including distance methods, maximum parsimony, maximum likelihood, Bayesian ...
  5. [5]
    [PDF] Phylogenetics: Applications, Software and Challenges
    The inference of phylogenies with computational methods has many important applications in medical and biological research, such as drug discovery and ...
  6. [6]
    [PDF] Advances in Computational Methods for Phylogenetic Networks in ...
    Aug 27, 2018 · Abstract Phylogenetic networks extend phylogenetic trees to allow for modeling reticulate evolutionary processes such as hybridization.
  7. [7]
    Inferring Phylogenies - Joseph Felsenstein - Oxford University Press
    Inferring Phylogenies explains clearly the assumptions and logic of making inferences about phylogenies, and using them to make inferences about evolutionary ...
  8. [8]
    Phylogenomics — principles, opportunities and pitfalls of big‐data ...
    Dec 16, 2019 · Phylogenetics is the science of reconstructing the evolutionary history of life on Earth. Traditionally, phylogenies were constructed using morphological data ...
  9. [9]
    [PDF] Modeling Evolutionary Relationships: Advances in Computational ...
    Jun 29, 2024 · Computational phylogenetics also intersects with conservation biology, where it is used to assess genetic diversity, identify evolutionary ...
  10. [10]
    The Use of Phylogenetic Diversity in Conservation Biology and ...
    The use of phylogenetic tools and studies has strongly increased in the last two decades especially in conservation biology and community ecology.
  11. [11]
    Phylogenetic network analysis of SARS-CoV-2 genomes - PNAS
    Apr 8, 2020 · This is a phylogenetic network of SARS-CoV-2 genomes sampled from across the world. These genomes are closely related and under evolutionary selection in their ...
  12. [12]
    Paper 27 (1964) - Reconstruction of evolutionary trees
    A. W. F. Edwards: This joint paper was read by AWFE at the meeting of the Systematics Association in Liverpool in April 1964. It introduces the statistical model ...
  13. [13]
    Evolutionary trees from DNA sequences: A maximum likelihood ...
    The application of maximum likelihood techniques to the estimation of evolutionary trees from nucleic acid sequence data is discussed.
  14. [14]
    MRBAYES: Bayesian inference of phylogenetic trees | Bioinformatics
    Summary: The program MRBAYES performs Bayesian inference of phylogeny using a variant of Markov chain Monte Carlo. Availability: MRBAYES, including the source ...
  15. [15]
    Molecular Phylogenetics - Genomes - NCBI Bookshelf - NIH
    DNA yields more phylogenetic information than protein. The two DNA sequences differ at three positions but the amino acid sequences differ at only one position.
  16. [16]
    [PDF] Molecular Phylogenetics: Using DNA, RNA and Protein Se
    Molecular phylogenetics uses DNA, RNA, and protein sequences to analyze evolutionary history, reconstruct relationships, and understand evolutionary change.
  17. [17]
    Review Article Phylogenetic Inferences from Molecular Sequences
    Abstract. Conflicting results often accompany phylogenetic analyses of RNA, DNA, or protein sequences across diverse species.
  18. [18]
    Multiple sequence alignment with the Clustal series of programs - NIH
    The Clustal series are programs used for multiple alignment of nucleic acid and protein sequences, and for preparing phylogenetic trees.
  19. [19]
    MAFFT: a novel method for rapid multiple sequence alignment ... - NIH
    A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods.
  20. [20]
    Role of Morphological Data in Phylogeny Reconstruction
    We live in the age of comparative genomics, and it may seem that there is not much point in reconstructing phylogenies using morphological data anymore.
  21. [21]
    Morphological Phylogenetics in the Genomic Age - ScienceDirect.com
    Oct 5, 2015 · Including the characters of interest during tree reconstruction and the problems of circularity and bias in studies of character evolution.
  22. [22]
    Applications of next-generation sequencing to phylogeography and ...
    Next-generation sequencing (NGS) clearly holds promise for fast and cost-effective generation of multilocus sequence data for phylogeography and phylogenetics.
  23. [23]
    Homology in Classical and Molecular Biology1 - Oxford Academic
    Colin Patterson,. Museum (Natural History), London SW7 5BD, England ... The three tests of homology are similarity, conjunction, and congruence. Testing.
  24. [24]
    Choosing BLAST options for better detection of orthologs as ...
    Motivation: The analyses of the increasing number of genome sequences requires shortcuts for the detection of orthologs, such as Reciprocal Best Hits (RBH), ...
  25. [25]
    Why ontogenetic homology criteria can be misleading - PubMed
    May 15, 2011 · As the basis for comparative biology, correctly assigning character homology is critical. Yet, identifying homologous characters in practice ...
  26. [26]
  27. [27]
    J. Felsenstein, Inferring Phylogenies, Sinauer Assoc., 2004, pp. xx + ...
    Aug 6, 2025 · The Neighbor-Joining (NJ) method of Saitou and Nei is the most widely used distance based method in phylogenetic analysis.
  28. [28]
  29. [29]
    A Molecular Phylogeny of Living Primates | PLOS Genetics
    We conduct a phylogenetic analysis to determine the origin, evolution, patterns of speciation, and unique features in genome divergence among primate lineages.
  30. [30]
    Application of Phylogenetic Networks in Evolutionary Studies
    This article reviews the terminology used for phylogenetic networks and covers both split networks and reticulate networks, how they are defined, and how they ...
  31. [31]
    A new and useful approach to phylogenetic analysis of distance data
    Split decomposition: A new and useful approach to phylogenetic analysis of distance data ... Bandelt and Dress, 1992. Bandelt H.-J., Dress A.W.M.. A canonical ...
  32. [32]
    Neighbor-Net: An Agglomerative Method for the Construction of ...
    We present Neighbor-Net, a distance based method for constructing phylogenetic networks that is based on the Neighbor-Joining (NJ) algorithm of Saitou and Nei.
  33. [33]
    A Genome Phylogeny for Mitochondria Among α-Proteobacteria and ...
    Abstract. Analyses of 55 individual and 31 concatenated protein data sets encoded in Reclinomonas americana and Marchantia polymorpha mitochondrial genomes.
  34. [34]
    [PDF] A Statistical Method for Evaluating Systematic Relationships
    38, pt. 2: http://www.biodiversitylibrary.org/item/23745.
  35. [35]
    the principles and practice of numerical classification : Sneath ...
    Jun 22, 2020 · Numerical taxonomy; the principles and practice of numerical classification ; Publication date: 1973 ; Topics: Numerical taxonomy, Classification.
  36. [36]
  37. [37]
  38. [38]
    Reconstruction of molecular phylogeny of extant hominoids from ...
    ... DNA phylogenies are reconstructed by applying the neighbor-joining method to these distance matrices. The chimpanzee is clustered with the human in most of ...
  39. [39]
    Construction of Phylogenetic Trees - Science
    The method for constructing phylogenetic trees uses mutation distances estimated from cytochrome c sequences.
  40. [40]
    Toward Defining the Course of Evolution - jstor
    Abstract. Fitch, W. M. (Dept. of Physiological Chemistry, Univ. of Wisconsin, Madison, Wisconsin 53706), 1971. Toward defining the course of evolution ...
  41. [41]
    Exactly Computing the Parsimony Scores on Phylogenetic Networks ...
    In this article, we use maximum parsimony to score phylogenetic networks based on the minimum number of state changes across a subset of edges of the network.
  42. [42]
    Locating the vertices of a steiner tree in an arbitrary metric space
    Mar 22, 1975 · D. Sankoff, “Minimal mutation trees of sequences”, SIAM Journal on Applied Mathematics 28 (1975) 35–42.
  43. [43]
    [PDF] Introduction Phylogenetic Trees - cs.Princeton
    The branch-and-bound method (as applied here) counts the number of changes for an initial tree (e.g., an initial tree may be obtained using the neighbor-joining ...
  44. [44]
    [PDF] Stat 882: Statistical Phylogenetics – Lecture 3 Contents
    • Exact Methods. – Exhaustive search. – Branch and bound methods. • Heuristic methods. – Divide-and-conquer. – Stepwise addition and branch swapping. – ...
  45. [45]
    [PDF] Exactly Computing the Parsimony Scores on Phylogenetic Networks ...
    Using the principle of maximum parsimony, we can score phylogenetic networks based on the minimum number of state changes across a subset of edges of the ...
  46. [46]
    POY: Phylogenetic Analysis Program | AMNH
    Insertions, deletions, and rearrangements, can then be included in the overall tree score (under Maximum Parsimony), or in the model (under Maximum Likelihood).
  47. [47]
    [PDF] paupmanual.pdf - PhyloSolutions
    The last section provides background information concerning parsimony, maximum likelihood, and distance methods. When you need more specific information ...
  48. [48]
    Efficiency of Parallel Direct Optimization - ScienceDirect
    In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, ...
  49. [49]
    Representation in stochastic search for phylogenetic tree ...
    This study introduces a new stochastic search algorithm that operates over an alternative representation of trees, namely as permutations of taxa.
  50. [50]
    RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses ...
    RAxML-VI-HPC is a program for large phylogenetic inference using maximum likelihood, with optimizations for speed and parallel processing, and mixed models.
  51. [51]
    A biologist's guide to Bayesian phylogenetic analysis - PMC - NIH
    Here, we summarize the major features of Bayesian phylogenetic inference and discuss Bayesian computation using Markov chain Monte Carlo (MCMC).
  52. [52]
    Efficiency of Markov Chain Monte Carlo Tree Proposals in Bayesian ...
    The main limiting factor in Bayesian MCMC analysis of phylogeny is typically the efficiency with which topology proposals sample tree space.
  53. [53]
    MrBayes 3: Bayesian phylogenetic inference under mixed models
    Abstract. MrBayes 3 performs Bayesian phylogenetic analysis combining information from different data partitions or subsets evolving under different stochastic ...
  54. [54]
    BEAST: Bayesian evolutionary analysis by sampling trees
    Nov 8, 2007 · Here we present BEAST: a fast, flexible software architecture for Bayesian analysis of molecular sequences related by an evolutionary tree. A ...
  55. [55]
    Confidence Limits on Phylogenies: An Approach Using the Bootstrap
    to infer the variability of the estimate. This paper will explore the use of the bootstrap in inferring phylogenies, where it leads to a practical method ...
  56. [56]
    Probability distribution of molecular evolutionary trees - PubMed - NIH
    The posterior probability provides a natural measure of the reliability of the estimated phylogeny. Two example data sets are analyzed to infer the ...
  57. [57]
    [PDF] A Classification of Consensus Methods for Phylogenetics
    The strict consensus tree is ((a, b, c), d, e, f) and the majority rule tree is ((a, b, c), ((d, e),f)). Every greedy consensus tree for a collection refines ...
  58. [58]
    Consensus Techniques and the Comparison of Taxonomic Trees
    A method is defined and demonstrated for each of two different types of trees: rooted, fully labelled trees, and rooted trees with unlabelled internal nodes.
  59. [59]
    Overcredibility of molecular phylogenies obtained by Bayesian ...
    We show by computer simulation that posterior probabilities in Bayesian analysis can be excessively liberal when concatenated gene sequences are used.
  60. [60]
    Systematic Exploration of the High Likelihood Set of Phylogenetic ...
    We use the 95% credible set of tree topologies from the golden runs as our assumed high posterior density set. Despite the length of these runs, we cannot ...
  61. [61]
    Quartet Sampling distinguishes lack of support from conflicting ...
    Feb 14, 2018 · Quartet Sampling is an efficient synthesis of phylogenetic tests that offers more comprehensive and specific information on branch support than conventional ...
  62. [62]
    Consistency, Characters, and the Likelihood of Correct Phylogenetic ...
    Archie. Homoplasy excess ratios: New indices for measuring levels of homoplasy in phylogenetic systematics and a critique of the consistency index ... Farris.
  63. [63]
    Measuring homoplasy I: comprehensive measures of maximum and ...
    Jun 25, 2024 · Perhaps the most commonly used homoplasy measure, the consistency index (Kluge and Farris, 1969), CI (character ci = m/s), also depends on ...
  64. [64]
    Farris, J. S. The retention index and rescaled consistency index ...
    Aug 6, 2025 · The retention index (r or ri) is widely applied to compute the fraction of apparent synapomorphy in the character that is retained as synapomorphy on the tree.
  65. [65]
    A New Phylogenomic Approach For Quantifying Horizontal Gene ...
    Jul 24, 2020 · HGT tangles the conventional universal Tree of Life, turning it into a network of Evolution. HGT is pervasive and some estimates of the genes ...
  66. [66]
    HGTphyloDetect: facilitating the identification and phylogenetic ... - NIH
    Feb 8, 2023 · HGTector is a customized pipeline for genome-wide detection of HGT events based on sequence homology search hit distribution statistics, but ...
  67. [67]
    HGTector: an automated method facilitating genome-wide discovery ...
    Aug 26, 2014 · A new computational method of rapid, exhaustive and genome-wide detection of HGT was developed, featuring the systematic analysis of BLAST hit distribution ...
  68. [68]
    an automated method facilitating genome-wide discovery of putative ...
    Aug 26, 2014 · HGTector is an effective tool for initial or standalone large-scale discovery of candidate HGT-derived genes.
  69. [69]
    Long-branch attraction and the phylogeny of true water bugs ...
    We used three complementary strategies to test and alleviate the effects of LBA: (1) the removal of distant outgroups from the analysis; (2) the addition of ...
  70. [70]
    Predicting horizontal gene transfers with perfect transfer networks
    Feb 6, 2024 · We introduce perfect transfer networks, which are phylogenetic networks that can explain the character diversity of a set of taxa.
  71. [71]
    The transfer of antibiotic resistance genes between evolutionarily ...
    Jun 3, 2025 · In this study, we use phylogenetic trees reconstructed from half a million ARG sequences to detect horizontal transfers between bacterial phyla.
  72. [72]
    Incomplete Lineage Sorting in Mammalian Phylogenomics
    Sep 8, 2016 · Abstract. The impact of incomplete lineage sorting (ILS) on phylogenetic conflicts among genes, and the related issue of whether to account ...
  73. [73]
    ASTRAL: genome-scale coalescent-based species tree estimation
    Other statistically consistent species-tree estimation methods include BEST (Liu, 2008) and *BEAST ...
  74. [74]
    Testing for Ancient Admixture between Closely Related Populations
    In this paper, we present a test for ancient admixture that exploits the asymmetry in the frequencies of the two nonconcordant gene trees in a three-population ...
  75. [75]
    Phylogenomic approaches to detecting and characterizing ... - NIH
    Detecting introgression using gene tree frequencies. The D-statistic. A widely used method for inferring introgression is the D-statistic, or—perhaps because ...
  76. [76]
    Gene flow analysis method, the D-statistic, is robust in a wide ...
    Jan 8, 2018 · The D-statistic, as a method to detect gene flow, is robust against a wide range of genetic distances (divergence times) but it is sensitive to population size.
  77. [77]
    Bayesian Inference of Species Trees from Multilocus Data
    Nov 11, 2009 · The new method is named *BEAST (pronounced “star beast”). Methods. Given data D, we define the posterior distribution of the complete species ...
  78. [78]
    The impact of taxon sampling on phylogenetic inference - NIH
    Abstract. Over the past two decades, there has been a long-standing debate about the impact of taxon sampling on phylogenetic inference.
  79. [79]
    Impact of Missing Data on Phylogenies Inferred from Empirical ...
    Aug 28, 2012 · These analyses demonstrate that missing data perturb phylogenetic inference slightly beyond the expected decrease in resolving power.
  80. [80]
    Machine learning based imputation techniques for estimating ...
    Jul 20, 2020 · In this paper, we propose two statistical and machine learning (ML) based methods for imputing missing values in distant matrices. These methods ...
  81. [81]
    [PDF] Machine learning based imputation techniques for estimating ...
    Jan 21, 2020 · In this study, we successfully adapted this idea for imputing missing entries in a distance matrix for phylogenetic estimation. Autoencoder (AE ...
  82. [82]
    Phylogenies and the Comparative Method | The American Naturalist: Vol 125, No 1
    Summary of Felsenstein's 1985 model for continuous character evolution and ancestral state reconstruction.
  83. [83]
    RAxML version 8: a tool for phylogenetic analysis and post ... - NIH
    RAxML (Randomized Axelerated Maximum Likelihood) is a popular program for phylogenetic analysis of large datasets under maximum likelihood.
  84. [84]
    Ecological selectivity and the evolution of mammalian substrate ...
    Oct 11, 2021 · Uncertainty surrounding the phylogenetic position of such fossils presents further challenges with respect to interpreting their implications ...
  85. [85]
    Calibrating the Tree of Life: fossils, molecules and evolutionary ... - NIH
    Aug 8, 2009 · However, fossils from closely related groups may be used as calibration points if taxa representing them are included in the phylogenetic tree, ...
  86. [86]
    A Total-Evidence Approach to Dating with Fossils, Applied to ... - NIH
    We compare node dating using nine calibration points derived from the fossil record with total-evidence dating based on 343 morphological characters.
  87. [87]
    Fossils improve phylogenetic analyses of morphological characters
    May 5, 2021 · Our results show that fossil taxa improve phylogenetic analysis of morphological datasets, even when highly fragmentary.
  88. [88]
    The fossilized birth–death process for coherent calibration of ... - PNAS
    Jul 9, 2014 · In these cases, fossils still are useful for calibrating a molecular phylogeny of extant species provided that calibration nodes are identified ...
  89. [89]
    Fossil ghost ranges are most common in some of the oldest ... - NIH
    The presence of very numerous and/or extensive ghost ranges is often believed to imply spurious phylogenies or a misleadingly patchy fossil record, or both. It ...
  90. [90]
    Paravian Phylogeny and the Dinosaur-Bird Transition: An Overview
    We here present a review of the taxonomic composition and main anatomical characteristics of those theropod families closely related with early birds.
  91. [91]
    Estimating Absolute Rates of Molecular Evolution and Divergence ...
    Penalized likelihood takes a parameter-rich model that would ordinarily overfit the data and constrains fluctuations in its parameters by a roughness penalty.
  92. [92]
    Molecular‐clock methods for estimating evolutionary rates and ...
    Oct 7, 2014 · This is usually referred to as a strict or global molecular clock. The strict-clock model has only a single parameter: the rate of evolution.
  93. [93]
    Bayesian Estimation of Species Divergence Times Under a ...
    We implement a Bayesian Markov chain Monte Carlo algorithm for estimating species divergence times that uses heterogeneous data from multiple gene loci.
  94. [94]
    BEAST Software - Bayesian Evolutionary Analysis Sampling Trees ...
    BEAST is a cross-platform program for Bayesian analysis of molecular sequences using MCMC. It is entirely orientated towards rooted, time-measured phylogenies.
  95. [95]
    Fast dating using least-squares criteria and algorithms
    Here, we present very fast dating algorithms, based on a Gaussian model closely related to the Langley–Fitch molecular-clock model.
  96. [96]
    Total-Evidence Approach to Dating with Fossils, Applied to the Early ...
    We compare node dating using nine calibration points derived from the fossil record with total-evidence dating based on 343 morphological characters.