
Substitution model

In phylogenetics and molecular evolution, a substitution model is a probabilistic model, often Markovian, that describes the rates and patterns of nucleotide, codon, or amino acid replacements over evolutionary time. These models account for biological realities such as unequal substitution rates (e.g., transitions versus transversions), heterogeneous base or amino acid frequencies, and site-specific rate variation, enabling the calculation of evolutionary distances, likelihoods of phylogenetic trees, and inference of adaptive processes from aligned sequence data. The foundations of substitution models trace back to the late 1960s, with the Jukes-Cantor model as the pioneering approach for DNA sequences, assuming equal substitution probabilities among the four nucleotides and uniform base frequencies. This was soon refined by Kimura's two-parameter model in 1980, which distinguished higher transition rates from transversion rates to better capture observed mutational biases. For proteins, Dayhoff and colleagues introduced the first empirical amino acid substitution matrix in 1978, derived from closely related sequences to estimate relative mutabilities and replacement probabilities. By the 1980s and 1990s, more complex models emerged, including the Hasegawa-Kishino-Yano (HKY) model for DNA and the general time-reversible (GTR) model, which allows different rates for the six types of nucleotide substitutions while satisfying time-reversibility and is widely used due to its flexibility in fitting diverse datasets. Substitution models vary by data type and complexity: DNA models range from simple (e.g., Jukes-Cantor with one rate parameter) to advanced (e.g., GTR + invariant sites + gamma-distributed rates, with up to 10 parameters); codon models, such as those by Muse and Gaut (1994), incorporate synonymous/nonsynonymous distinctions to detect selection via dN/dS ratios; and amino acid models include empirical matrices like JTT (1992) or LG (2008), often augmented with structural or physicochemical constraints for greater realism.
Model selection, typically via criteria like the Akaike information criterion (AIC), is critical to avoid under- or over-parameterization, as improper choices can bias tree topologies and branch lengths in maximum likelihood or Bayesian analyses. These models underpin software like IQ-TREE and PhyML, facilitating applications from viral phylogenomics to conservation genetics.

Fundamentals

Definition and role

Substitution models in evolutionary biology provide probabilistic descriptions of how discrete character states, such as nucleotides in DNA or amino acids in proteins, evolve and substitute for one another along branches of a phylogenetic tree. These models are formulated as continuous-time Markov chains, which assume that the future state of a character depends solely on its current state and the elapsed time, independent of prior history. This framework captures the stochastic nature of evolutionary changes, modeling substitutions as rare events occurring at rates that vary between states. The primary role of substitution models is to facilitate likelihood-based inference in phylogenetics, allowing researchers to estimate evolutionary distances between sequences, reconstruct phylogenetic trees, and infer ancestral character states. By quantifying the probability of observing a given alignment of sequences under a hypothesized evolutionary process, these models underpin maximum likelihood methods for tree estimation and support Bayesian approaches that incorporate prior distributions on parameters and topologies. This enables robust assessments of evolutionary relationships and rates across diverse taxa. At the heart of the substitution process is the instantaneous rate matrix Q, which defines the relative rates of transitions between character states at any moment. The probability of a specific substitution occurring over an evolutionary time interval t is then determined by the transition probability matrix P(t) = \exp(Qt), which integrates these instantaneous rates to yield time-dependent probabilities. This exponential formulation ensures that the model generates realistic evolutionary trajectories while maintaining computational tractability for phylogenetic analyses.
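The relationship P(t) = exp(Qt) can be sketched numerically. This is a minimal illustration using the Jukes-Cantor rate matrix; the rate α and branch length t are arbitrary made-up values:

```python
import numpy as np
from scipy.linalg import expm

# Jukes-Cantor rate matrix for DNA: every off-diagonal rate is alpha,
# and the diagonal makes each row sum to zero.
alpha = 0.1
Q = alpha * (np.ones((4, 4)) - 4 * np.eye(4))

# Transition probabilities over a time interval t: P(t) = exp(Qt).
t = 2.0
P = expm(Q * t)

# Each row of P(t) is a probability distribution over end states.
assert np.allclose(P.sum(axis=1), 1.0)

# For JC69 the closed form P_ii(t) = 1/4 + 3/4 * exp(-4*alpha*t) holds.
assert abs(P[0, 0] - (0.25 + 0.75 * np.exp(-4 * alpha * t))) < 1e-9
```

As t grows, every row of P(t) converges to the equilibrium frequencies (here uniform 0.25), reflecting the loss of information about the starting state.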

Historical development

The development of substitution models began with the Jukes-Cantor model in 1969 for DNA sequences, which assumed equal rates of substitution among all nucleotides and provided a simple correction for multiple hits in phylogenetic distance estimation. This was soon refined by Motoo Kimura's two-parameter model in 1980, which distinguished between transitions (purine-to-purine or pyrimidine-to-pyrimidine changes) and transversions, better capturing the observed bias in mutation types across nucleotide sequences. Early empirical approaches for protein sequences followed, pioneered by Margaret Dayhoff and colleagues in 1978, who constructed the first point accepted mutation (PAM) matrices based on observed substitutions in closely related proteins. These matrices, derived from a limited dataset of 71 protein families, represented the initial quantitative framework for estimating evolutionary distances in proteins by accounting for multiple substitutions at the same site. Key advancements in the 1980s integrated these models with phylogenetic tree inference, notably through Joseph Felsenstein's 1981 pruning algorithm, which enabled efficient computation of likelihoods for evolutionary trees under substitution models by recursively calculating probabilities from tree tips inward. Building on this, the Hasegawa-Kishino-Yano model in 1985 introduced time-reversibility for DNA substitutions, allowing for unequal base frequencies and separate rates for transitions and transversions while facilitating applications in divergence time estimation. A shift toward more empirical models occurred in the 1990s, exemplified by the Jones-Taylor-Thornton (JTT) matrix for proteins in 1992, which automated the generation of substitution rates from a larger database of aligned sequences, improving accuracy over earlier matrices for distant divergences.
Recent trends emphasize data-specific empirical matrices, as demonstrated in a 2023 PeerJ study that evaluated methods for estimating custom substitution models in protein phylogenetics, highlighting their superior fit and computational feasibility for diverse datasets. Post-2020 milestones include the incorporation of structural data into substitution models, such as the 3Di matrices published in 2025, which derive substitution rates from three-dimensional protein structures to address limitations in sequence-based approaches for deep phylogenies.

Mathematical Foundations

Core framework

Substitution models in phylogenetics are fundamentally based on continuous-time homogeneous Markov chains, which describe the probabilistic evolution of discrete characters—such as nucleotides or amino acids—across phylogenetic branches. The state space S is finite, typically with |S| = 4 for DNA (corresponding to A, C, G, T) or |S| = 20 for proteins, and the process assumes that the substitution rate depends only on the current state and time elapsed, independent of prior history. This framework allows modeling evolutionary changes as a stochastic process where transitions between states occur at rates specified by an instantaneous rate matrix. The core of the model is the instantaneous rate matrix Q = (q_{ij}), a |S| \times |S| matrix where the off-diagonal entries q_{ij} (for i \neq j) represent the instantaneous rates of substitution from state i to state j, expressed relative to the overall rate of change. The diagonal entries are defined as q_{ii} = -\sum_{j \neq i} q_{ij} to ensure the rows sum to zero, reflecting that Q is the infinitesimal generator of the process. The evolutionary distance along a branch is parameterized by its length t, often in expected substitutions per site, so the scaled rate matrix is Qt. The probability of transitioning from state i to state j over time t is given by the transition probability matrix P(t) = (p_{ij}(t)), where p_{ij}(t) = \Pr(X(t) = j \mid X(0) = i). For a homogeneous process, this is computed as the matrix exponential: P(t) = \exp(Qt). Direct computation of the exponential is typically achieved through eigendecomposition, assuming Q is diagonalizable. Let Q = V \Lambda V^{-1}, where \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_{|S|}) contains the eigenvalues \lambda_k (with \lambda_1 = 0 corresponding to the stationary distribution), and V holds the right eigenvectors as columns. Then, P(t) = V \,\mathrm{diag}\left( e^{\lambda_1 t}, e^{\lambda_2 t}, \dots, e^{\lambda_{|S|} t} \right) V^{-1}. This decomposition facilitates efficient numerical evaluation, especially for small state spaces like DNA.
A key component is the equilibrium frequency vector \pi = (\pi_i)_{i \in S}, the stationary distribution of the chain, which satisfies \pi Q = 0 and \sum_{i \in S} \pi_i = 1. It represents the long-term proportions of each state under the model and is given by the left eigenvector of Q corresponding to the zero eigenvalue, normalized. In phylogenetic likelihood calculations, the probability of observing a particular alignment at a site incorporates \pi as the initial distribution at the root and the P(t) matrices along each branch to propagate probabilities to observed states.
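The eigendecomposition route and the stationary-frequency property can both be checked numerically. The sketch below builds an HKY-style rate matrix from invented equilibrium frequencies and an invented transition/transversion bias κ, purely for illustration:

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical HKY-like rate matrix (states ordered A, C, G, T).
pi = np.array([0.3, 0.2, 0.2, 0.3])      # assumed equilibrium frequencies
kappa = 2.0                              # assumed transition/transversion bias
# Transitions (A<->G, C<->T) get relative rate kappa; transversions get 1.
R = np.array([[0.0, 1.0, kappa, 1.0],
              [1.0, 0.0, 1.0, kappa],
              [kappa, 1.0, 0.0, 1.0],
              [1.0, kappa, 1.0, 0.0]])
Q = R * pi                               # q_ij = r_ij * pi_j for i != j
np.fill_diagonal(Q, -Q.sum(axis=1))      # rows sum to zero

# Eigendecomposition route: P(t) = V diag(exp(lambda_k t)) V^{-1}.
lam, V = np.linalg.eig(Q)
t = 0.5
P = (V @ np.diag(np.exp(lam * t)) @ np.linalg.inv(V)).real
assert np.allclose(P, expm(Q * t))       # agrees with the matrix exponential

# pi is the left null vector of Q (pi Q = 0), and rows of P(t) converge to pi.
assert np.allclose(pi @ Q, 0.0)
P_long = (V @ np.diag(np.exp(lam * 50.0)) @ np.linalg.inv(V)).real
assert np.allclose(P_long, pi)           # every row -> stationary frequencies
```

For DNA-sized matrices the eigendecomposition is computed once per Q and reused for every branch length, which is why likelihood software favors it over repeated calls to a general matrix exponential.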

Key properties

Substitution models in phylogenetics rely on several core assumptions that facilitate computational tractability and simplify parameter estimation. Stationarity posits that the probability distribution of character states remains invariant over time, implying that the process has reached an equilibrium where the stationary frequencies π govern the long-term behavior of the Markov chain. Under this assumption, the root distribution of the phylogenetic tree is set equal to the equilibrium distribution π, ensuring consistency in likelihood calculations across the tree. Reversibility further simplifies the model by enforcing detailed balance, where the joint probability of transitioning from state i to j equals that from j to i when weighted by the stationary frequencies: \pi_i q_{ij} = \pi_j q_{ji} for the instantaneous rate matrix Q. This property allows the use of unrooted trees in likelihood calculations, as the direction of time becomes irrelevant, and halves the number of independent parameters in Q by symmetrizing the off-diagonal elements relative to π. Homogeneity assumes that substitution rates, as captured by the Q matrix, are constant across all sites and evolutionary lineages, enabling a single rate matrix to describe the entire process. Violations of homogeneity, such as site-specific rate variation, necessitate more complex heterogeneous models to account for differing evolutionary dynamics. These assumptions have profound implications for phylogenetic computation. Reversibility, in particular, underpins Felsenstein's pruning algorithm, which efficiently computes site likelihoods by leveraging the "pulley principle" to propagate probabilities bidirectionally across the tree without enumerating all root positions. Non-reversible models, lacking this symmetry, demand additional parameters and more intensive algorithms for likelihood evaluation, increasing computational demands.
To construct a time-reversible Q matrix, the off-diagonal elements q_{ij} (for i \neq j) are typically set proportional to \pi_j times an exchangeability rate, ensuring the detailed balance condition \pi_i q_{ij} = \pi_j q_{ji}; the diagonal elements q_{ii} are then adjusted to make each row sum to zero, with the overall scale normalized by the expected rate of substitution. This approach, as in the general time-reversible (GTR) model, guarantees reversibility while allowing empirical estimation of exchangeabilities and frequencies.
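The GTR-style construction just described can be sketched in a few lines; the exchangeability values and frequencies below are illustrative, not estimates from any real dataset:

```python
import numpy as np

# q_ij = s_ij * pi_j for i != j, with a symmetric exchangeability matrix s.
pi = np.array([0.35, 0.15, 0.25, 0.25])     # assumed A, C, G, T frequencies
s = np.array([[0.0, 1.2, 4.0, 0.9],         # illustrative exchangeabilities
              [1.2, 0.0, 0.8, 4.5],
              [4.0, 0.8, 0.0, 1.0],
              [0.9, 4.5, 1.0, 0.0]])

Q = s * pi                                  # off-diagonal rates
np.fill_diagonal(Q, -Q.sum(axis=1))         # rows sum to zero

# Normalize so the expected substitution rate mu = -sum_i pi_i q_ii equals 1,
# making branch lengths expected substitutions per site.
mu = -np.dot(pi, np.diag(Q))
Q /= mu

# Detailed balance holds: pi_i q_ij == pi_j q_ji for all i, j.
flux = pi[:, None] * Q
assert np.allclose(flux, flux.T)
```

Because s is symmetric, reversibility is guaranteed by construction; estimation then reduces to fitting the six exchangeabilities and the frequencies.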

Time scales

Substitution models in molecular phylogenetics often incorporate the molecular clock hypothesis, which assumes a constant rate of nucleotide or amino acid substitutions across lineages over time. This assumption implies that the expected number of substitutions is proportional to the time since divergence, facilitating the estimation of evolutionary timescales from genetic data. The validity of the molecular clock is typically tested using relative rate tests, such as those proposed by Tajima, which compare substitution rates between pairs of lineages to detect significant deviations from rate constancy. In substitution models, evolutionary distances are measured in units of expected substitutions per site, where branch lengths in phylogenetic trees represent the anticipated number of changes along a lineage. These relative timescales are scaled to absolute time units through calibration with external evidence, such as fossil records providing minimum divergence ages or geological events marking vicariance. The overall substitution rate \lambda in a model is derived from the rate matrix Q and stationary frequencies \pi as \lambda = -\sum_i \pi_i q_{ii}, where q_{ii} are the diagonal elements of Q; absolute times t are then obtained by dividing branch lengths by \lambda after calibration. Deviations from a strict clock, such as rate heterogeneity across lineages, are accommodated by relaxed clock models, which allow substitution rates to vary while drawing them from distributions like log-normal or gamma. These models, implemented in software such as BEAST, enable more flexible inference of divergence times by modeling rate variation as uncorrelated across branches. Limitations of uniform rate assumptions, including violations due to lineage-specific evolutionary pressures, are addressed through local clock approaches that permit distinct rates within predefined phylogenetic partitions, as introduced for primate phylogenies.
More recent advancements post-2020 incorporate heterogeneous rate evolution across lineages to further mitigate clock violations in complex datasets.
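The calibration step above amounts to simple unit conversion; this tiny worked example uses invented numbers (a branch length and a fossil-calibrated rate) purely to show the arithmetic:

```python
# Converting a relative branch length to absolute time via calibration.
# Both numbers are hypothetical, not from any real dataset.
branch_length = 0.05     # expected substitutions per site along the branch
rate = 1.0e-9            # calibrated substitutions per site per year

t_years = branch_length / rate   # about 5e7 years (50 million years)
```

In practice the calibrated rate carries uncertainty, so Bayesian dating software integrates over rate priors rather than dividing by a point estimate.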

Model Selection and Parameters

Mechanistic vs. empirical

Substitution models in phylogenetics are broadly classified into mechanistic and empirical categories, each grounded in distinct approaches to describing evolutionary change. Mechanistic models derive their structure from biophysical principles of mutation, such as the elevated rates of transitions over transversions due to the biochemical properties of DNA replication and repair enzymes. These models emphasize underlying biological processes, resulting in fewer parameters that enhance generalizability across diverse datasets. For instance, the HKY85 model incorporates a transition/transversion bias parameter alongside unequal base frequencies to reflect mutational spectra observed in real sequences. Empirical models, by contrast, are parameterized directly from observed substitution frequencies in real sequence alignments, typically via counting or maximum likelihood estimation on large datasets of related sequences. This data-driven fitting enables empirical models to accommodate intricate patterns, including site-to-site rate heterogeneity that arises from varying selective pressures. Pioneering examples include the Dayhoff matrix for amino acid substitutions, constructed from alignments of closely related proteins to estimate exchange rates. Such models prioritize statistical fidelity over explicit biological mechanisms, often requiring more parameters to mirror empirical complexities. The distinction between these approaches manifests in key trade-offs during phylogenetic analysis. Mechanistic models like HKY85 promote interpretability by linking parameters to evolutionary processes, but they may underfit heterogeneous data compared to empirical counterparts such as GTR+Γ, which better capture asymmetries and rate variation. Model selection often relies on information criteria like AIC and BIC to balance fit against complexity, favoring empirical models when data volume supports their added parameters.
Historically, mechanistic models dominated prior to the genomic era, reflecting limited sequence availability; the shift toward empirical paradigms accelerated post-Dayhoff in the late 1970s, as growing databases enabled precise estimation from alignments. More recently, data-specific empirical models—fitted directly to individual datasets—have demonstrated superior performance over general empirical ones in protein-based phylogenies, reducing error in reconstruction.

Tree topologies and parameters

In phylogenetic inference, substitution models are integrated with tree structures by treating branch lengths as the expected number of substitutions per site along each evolutionary branch. This parameterization allows the model to quantify the amount of evolutionary change occurring over time or along lineages. The instantaneous rate matrix Q, which defines the substitution rates between states, is used to compute the transition probability matrices P(t_e) for each branch length t_e, enabling the evaluation of how observed data at the tips are likely to have evolved from common ancestors. The likelihood of the sequence alignment given a tree topology T and model parameters \theta (including branch lengths, Q, and equilibrium frequencies \pi) assumes independence across sites in the alignment. For each site, the site-specific likelihood is computed by summing over all possible hidden ancestral states at internal nodes, using Felsenstein's pruning algorithm—a dynamic programming approach that recursively calculates partial likelihoods from the leaves toward the root. This avoids the exponential computational cost of enumerating all ancestral configurations. The algorithm starts at the tips with observed states and propagates conditional likelihoods inward, yielding the full site likelihood as: L_s = \sum_{i} \pi_i \cdot L_i(\text{root}), where L_i(\text{root}) is the conditional likelihood at the root for state i, obtained by multiplying and summing transition probabilities along descendant subtrees. The overall data likelihood is then the product across all sites S: L = \prod_{s=1}^S L_s. This framework applies to both rooted and unrooted tree topologies, with unrooted trees being common in substitution model applications since the likelihood is invariant to root placement under time-reversible assumptions. Key parameters in this integration include the branch lengths t_e (one per edge), the off-diagonal elements of Q (defining relative substitution rates), and the equilibrium frequencies \pi (a vector summing to 1).
For a binary unrooted tree with n leaves, there are 2n − 3 branch lengths, while Q and \pi contribute a fixed number depending on the state space (e.g., 6 free rates + 3 frequencies for DNA under GTR). The total parameter count thus scales linearly with tree size, primarily driven by branches, making large-tree inference computationally intensive. Estimation of tree topology, branch lengths, and model parameters occurs jointly via maximum likelihood optimization. Deterministic hill-climbing methods, such as nearest-neighbor interchanges and subtree-pruning-regrafting in PhyML, iteratively rearrange the topology and adjust parameters to maximize the likelihood starting from an initial tree (e.g., from neighbor-joining). Alternatively, Bayesian MCMC samplers, as in MrBayes, explore the parameter space stochastically by proposing topology swaps, branch length updates, and rate shifts, approximating the posterior under a prior distribution (e.g., uniform on topologies, gamma on rates). These approaches ensure robust inference but require careful initialization to avoid local optima in the vast tree space.
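The pruning recursion can be sketched for a single alignment site on a tiny rooted tree ((A,B),(C,D)) under Jukes-Cantor; the branch lengths and observed states here are made up for illustration:

```python
import numpy as np
from scipy.linalg import expm

STATES = "ACGT"
# JC69 rate matrix scaled so the expected substitution rate is 1.
Q = (1.0 / 3.0) * (np.ones((4, 4)) - 4 * np.eye(4))
pi = np.full(4, 0.25)

def tip(state):
    """Conditional likelihood vector for an observed tip state."""
    L = np.zeros(4)
    L[STATES.index(state)] = 1.0
    return L

def combine(L_left, t_left, L_right, t_right):
    """Partial likelihoods at a parent node from its two children."""
    P_l, P_r = expm(Q * t_left), expm(Q * t_right)
    return (P_l @ L_left) * (P_r @ L_right)

# One site with pattern A, A, C, C at tips A, B, C, D; invented branch lengths.
node_ab = combine(tip("A"), 0.1, tip("A"), 0.1)
node_cd = combine(tip("C"), 0.2, tip("C"), 0.2)
root = combine(node_ab, 0.05, node_cd, 0.05)

# Site likelihood: sum over root states weighted by equilibrium frequencies.
site_likelihood = float(pi @ root)
assert 0.0 < site_likelihood < 1.0
```

The recursion touches each branch once, so the cost per site is linear in the number of branches instead of exponential in the number of internal nodes.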

Selection methods

Selecting an appropriate substitution model is crucial in phylogenetic analysis to prevent under-parameterization, which may lead to biased inferences, or over-parameterization, which can reduce statistical power and increase computational demands. Model selection methods evaluate candidate models based on their ability to fit the data while penalizing excessive complexity, ensuring a balance that optimizes likelihood-based tree estimation. Common criteria include the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and hierarchical likelihood ratio tests (hLRT), each balancing model fit and parsimony differently. The AIC penalizes model complexity moderately, favoring models that explain the data well without excessive parameters, and is calculated as \text{AIC} = -2 \ln L + 2k, where \ln L is the log-likelihood and k is the number of parameters. In contrast, the BIC applies a stronger penalty for larger datasets, given by \text{BIC} = -2 \ln L + k \ln n, where n is the number of sites, making it more conservative and less prone to selecting overly complex models. The hLRT compares nested models via likelihood ratios to test for significant improvements in fit, such as adding rate heterogeneity, but it performs poorly when invariable sites are present and is less accurate overall compared to BIC. Studies on simulated datasets show that BIC consistently outperforms AIC and hLRT in accuracy and precision for selecting evolutionary models in phylogenetics. Automated tools facilitate model selection by evaluating multiple candidates efficiently. jModelTest implements AIC, BIC, and hLRT to select among 203 substitution models, incorporating calculation heuristics for speed, and is widely used for DNA data. For protein sequences, ProtTest applies similar criteria (AIC and BIC) to choose from 112 models, optimizing for protein alignments via PhyML likelihood calculations. IQ-TREE's ModelFinder provides an ultrafast alternative, 10 to 100 times faster than jModelTest or ProtTest, supporting AIC, BIC, and the corrected AICc while handling mixtures and partitions.
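The AIC and BIC formulas above are simple to apply once log-likelihoods are in hand. This toy comparison uses hypothetical log-likelihoods and the parameter counts used elsewhere in this article (1 for JC69; 10 for GTR+Γ as 6 rates + 3 frequencies + a gamma shape):

```python
import math

# Hypothetical fitted models: log-likelihood lnL and parameter count k.
candidates = {
    "JC69":  {"lnL": -3425.8, "k": 1},
    "GTR+G": {"lnL": -3310.2, "k": 10},
}
n_sites = 1200

def aic(lnL, k):
    """Akaike information criterion: -2 lnL + 2k (lower is better)."""
    return -2.0 * lnL + 2.0 * k

def bic(lnL, k, n):
    """Bayesian information criterion: -2 lnL + k ln(n) (lower is better)."""
    return -2.0 * lnL + k * math.log(n)

for name, m in candidates.items():
    print(name, round(aic(m["lnL"], m["k"]), 1),
          round(bic(m["lnL"], m["k"], n_sites), 1))
```

With these invented numbers the likelihood gain of GTR+Γ dwarfs both penalties, so AIC and BIC agree; the criteria diverge mainly when improvements are marginal relative to the added parameters.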
Key considerations in model selection involve balancing statistical fit against complexity to avoid overfitting, particularly for datasets with varying evolutionary rates; for instance, BIC is preferred for larger alignments to maintain robustness. Recent advancements in IQ-TREE (version 3.0, 2025) incorporate partition-specific selection, allowing independent model optimization across genomic partitions to better accommodate mixed data types and rate heterogeneity, building on partitioned-model support developed from 2023 onward. Despite these advances, standard selection methods assume site-independent evolution, which fails for heterogeneous data exhibiting time-varying substitution processes across sites or clades, potentially leading to phylogenetic artifacts like long-branch attraction.

Models by Data Type

DNA models

Substitution models for DNA sequences describe the process of nucleotide substitutions (point mutations) over evolutionary time, assuming a Markovian framework where the rate of change depends on the current state. These models account for patterns such as equal or unequal substitution rates and base frequencies, enabling accurate estimation of evolutionary distances and phylogenetic trees from aligned nucleotide sequences. Early models assumed equal rates among all nucleotides, while later extensions incorporated distinctions between transition and transversion substitutions as well as site-specific rate heterogeneity. The Jukes-Cantor model (JC69), introduced in 1969, is the simplest DNA substitution model, assuming equal substitution rates among all nucleotides and equal equilibrium base frequencies (π_A = π_C = π_G = π_T = 0.25). It features a single parameter, α, representing the overall substitution rate, leading to a symmetric rate matrix Q with off-diagonal entries α (for i ≠ j) and diagonal entries -3α. The probability of no change at a site after time t is given by: P_{ii}(t) = \frac{1 + 3\exp(-4\alpha t)}{4} This model corrects for multiple substitutions but underestimates distances when transitions occur more frequently than transversions. The Kimura two-parameter model (K80), proposed in 1980, extends JC69 by distinguishing between transitions (purine-to-purine or pyrimidine-to-pyrimidine changes) and transversions (purine-to-pyrimidine or pyrimidine-to-purine changes), which typically occur at different rates. It introduces a transition/transversion ratio parameter κ (where κ > 1 indicates more transitions) alongside the overall rate, resulting in two free parameters. Equilibrium frequencies remain equal at 0.25, and the model improves distance estimation for datasets showing transition bias, such as mitochondrial DNA. Building on K80, the Hasegawa-Kishino-Yano model (HKY85), developed in 1985, further relaxes assumptions by allowing unequal equilibrium base frequencies π = (π_A, π_C, π_G, π_T) while retaining the transition/transversion distinction via κ.
With five parameters (four for π and one for κ), HKY85 captures both compositional heterogeneity and substitution type bias, making it suitable for DNA datasets where base frequencies vary. It is widely used as a baseline for more complex models due to its balance of simplicity and realism. The Felsenstein 1981 model (F81) addresses unequal base frequencies π without distinguishing transition and transversion rates, assuming equal rates scaled by the target nucleotide's frequency. Featuring four parameters (the π values), F81 is nested within HKY85 and performs well when transition bias is minimal but compositional bias is present, such as in some viral genomes. The Tamura-Nei model (TrN or TN93), introduced in 1993, generalizes previous models by allowing distinct rates for the two types of transitions (A↔G and C↔T) while all transversions share a single rate, plus unequal base frequencies π. This provides flexibility for datasets with complex patterns, though it requires more data to estimate reliably. TrN is particularly effective for control regions of mitochondrial DNA exhibiting strong transition biases. To account for among-site rate variation, common extensions include the discrete gamma distribution (+Γ), proposed by Yang in 1994, which approximates a continuous gamma-distributed rate variation across sites using k categories of rates drawn from a gamma shape parameter α (typically 0.5–1 for DNA). This addition improves model fit for heterogeneous sequences like protein-coding genes, where synonymous sites evolve faster than nonsynonymous ones. Additionally, the invariant sites model (+I), developed by Felsenstein and Churchill in 1996, posits a proportion p_inv of sites that never substitute, often combined with +Γ (+Γ+I) to model both zero-rate sites and variable rates. These extensions enhance phylogenetic accuracy by addressing unobserved substitutions and rate heterogeneity without assuming uniform evolution across the alignment.
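Inverting the JC69 formula for P_{ii}(t) gives the familiar distance correction d = −(3/4) ln(1 − 4p/3), which recovers the expected number of substitutions per site d from the observed proportion of differing sites p. A small numeric check, with p chosen arbitrarily:

```python
import math

# Jukes-Cantor distance correction from an observed difference proportion p.
p = 0.30                                     # made-up observed value
d = -0.75 * math.log(1.0 - 4.0 * p / 3.0)    # about 0.383 substitutions/site

# Round trip: generate p from the model at a known alpha*t, then recover
# d = 3*alpha*t (three possible changes per site, each at rate alpha).
alpha_t = 0.2
p_model = 0.75 * (1.0 - math.exp(-4.0 * alpha_t))
d_recovered = -0.75 * math.log(1.0 - 4.0 * p_model / 3.0)
assert abs(d_recovered - 3.0 * alpha_t) < 1e-12
```

The correction diverges as p approaches 0.75, the saturation point where JC69 sequences become indistinguishable from random.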

Two-state models

Two-state substitution models provide a foundational framework for analyzing evolutionary changes in binary data, such as presence/absence traits or biallelic polymorphisms, by assuming transitions between two states (denoted 0 and 1) follow a continuous-time Markov process. These models simplify more complex multi-state frameworks, like those for DNA, by reducing the state space while capturing essential substitution dynamics. The basic form, analogous to the Jukes-Cantor model but for binary traits, features a rate matrix Q with a single parameter \alpha > 0 governing symmetric substitutions: Q = \begin{pmatrix} -\alpha & \alpha \\ \alpha & -\alpha \end{pmatrix} This matrix implies equal instantaneous rates for changes in either direction, with diagonal elements ensuring row sums of zero. The transition probabilities under this model derive from the matrix exponential P(t) = e^{Qt}, yielding the probability of changing from state 0 to 1 over evolutionary time t as: P(0 \to 1, t) = \frac{1 - e^{-2\alpha t}}{2}. The symmetric structure assumes stationarity under equal equilibrium frequencies (\pi_0 = \pi_1 = 1/2), facilitating distance-based phylogenetic estimation. Known as the Cavender-Farris-Neyman (CFN) model, this framework was developed for morphological characters in cladistic analysis, assuming equal substitution rates to model character state changes along phylogenetic trees. It applies to scenarios like simple binary traits in morphology, where the equal rates assumption simplifies likelihood computations for tree inference without requiring complex parameter estimation. Extensions incorporate unequal rates to account for biased substitution processes, modifying the rate matrix to: Q = \begin{pmatrix} -\alpha & \alpha \\ \beta & -\beta \end{pmatrix}, where \alpha and \beta represent forward and backward rates, respectively, allowing for directional biases such as higher rates in one direction.
This generalized two-state model is particularly useful for single-nucleotide polymorphism (SNP) data, where ascertainment biases or asymmetries necessitate non-symmetric transitions, improving accuracy in phylogenetic inference from biallelic sites. It also supports analyses in simple morphological datasets by accommodating rate heterogeneity in binary character evolution. Recent advancements integrate two-state models with network frameworks in phylogenomics, extending the CFN model to phylogenetic networks that capture reticulate evolution through hybridization or introgression. These network-based approaches, developed post-2023, enable inference of admixture graphs from genomic data, addressing complexities like incomplete lineage sorting in structured populations.
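Both forms of the two-state model can be verified in a few lines: the symmetric CFN closed form against the matrix exponential, and the stationary frequencies of the asymmetric extension. The rates α and β are arbitrary illustrative values:

```python
import numpy as np
from scipy.linalg import expm

# Symmetric two-state (CFN) model: closed form vs. matrix exponential.
alpha, t = 0.4, 1.5
Q = np.array([[-alpha, alpha],
              [alpha, -alpha]])
P = expm(Q * t)
closed_form = (1.0 - np.exp(-2.0 * alpha * t)) / 2.0
assert abs(P[0, 1] - closed_form) < 1e-12

# Asymmetric extension with forward rate alpha and backward rate beta:
# stationary frequencies are (beta, alpha) / (alpha + beta).
beta = 0.1
Q2 = np.array([[-alpha, alpha],
               [beta, -beta]])
P_long = expm(Q2 * 1000.0)               # effectively t -> infinity
assert np.allclose(P_long[:, 1], alpha / (alpha + beta))
assert np.allclose(P_long[:, 0], beta / (alpha + beta))
```

The asymmetric stationary distribution follows from flux balance, π_0 α = π_1 β, which is the two-state form of detailed balance.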

Amino acid models

Amino acid substitution models describe the process of evolutionary change among the 20 standard amino acids in protein sequences, providing a framework for inferring phylogenetic relationships from aligned protein data. These models extend the principles of nucleotide substitution models by accounting for the biochemical diversity and functional constraints of proteins, which result in more complex rate matrices compared to the simpler four-state DNA models. Empirical approaches dominate, deriving substitution probabilities from observed alignments of related proteins, while recent developments incorporate structural information to enhance accuracy for divergent sequences. The foundational empirical matrices are 20×20 symmetric matrices constructed from large sets of protein alignments. The Dayhoff PAM (Point Accepted Mutation) matrices, introduced in 1969 and refined in 1978, were derived from 71 groups of closely related proteins, counting accepted mutations to estimate relative substitution rates, with higher-numbered matrices like PAM250 extrapolated for greater evolutionary distances. Subsequent models improved upon this by using larger, more diverse datasets and advanced estimation methods. The JTT matrix (1992) was generated from 2,707 sequences in 462 alignments via a parsimony-based approach to approximate phylogenetic trees, yielding better performance for distant homologs. The WAG matrix (2001) employed maximum likelihood estimation on 2,256 families from the BRKALN database, optimizing rates across global alignments for enhanced generality. Most recently, the LG matrix (2008) refined these using 500,000 sequences in a concatenated alignment of 15,000 trees, incorporating a broader taxonomic range and demonstrating superior fit in phylogenetic likelihood calculations.
These matrices capture patterns of conservative substitutions, such as those between physicochemically similar amino acids (e.g., hydrophobic to hydrophobic), reflecting selective pressures on protein structure and function. Mechanistic elements in some models integrate physicochemical properties to inform rates, bridging empirical data with biological realism. For instance, the BLOSUM (BLOcks SUbstitution Matrix) series, while primarily designed for sequence alignment scoring, derives log-odds scores from conserved blocks in alignments, implicitly weighting substitutions by local structural context and evolutionary distances, and has been adapted for phylogenetic modeling. The core parameters of these time-reversible models consist of 190 off-diagonal exchangeability rates (forming the symmetric relative rate matrix) and 19 free equilibrium frequencies π (normalized to sum to 1), with the full rate matrix Q normalized such that the expected rate of substitution equals 1; rate variation across sites is often modeled using a discrete Γ distribution with four categories. Empirical fitting of these parameters, as detailed in broader model frameworks, relies on maximum likelihood optimization from observed alignments. Recent advances emphasize data-specific and structure-aware models to address limitations in generic matrices for challenging datasets. Custom substitution models estimated directly from individual protein alignments have been shown to significantly improve phylogenetic resolution and bootstrap support compared to standard matrices, particularly for datasets with unique compositional biases, as demonstrated in benchmarks using tools like IQ-TREE and P4. For deep phylogenies, where sequence divergence erodes signal, the 3Di matrix (2025) incorporates tertiary structural interactions from protein folds, derived from universal paralog alignments, enabling robust inference of ancient relationships such as the prokaryotic root.
Current trends reflect a shift toward site-specific rate estimation via mutation-selection frameworks, which model nucleotide-level mutations constrained by fitness landscapes, allowing precise prediction of substitution rates from sequence data alone and outperforming uniform models in capturing rate heterogeneity.
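The PAM-style extrapolation mentioned above (e.g., PAM250 from PAM1) amounts to repeated matrix multiplication of a one-step transition matrix. The sketch below uses a toy 3-state stochastic matrix with invented entries; the 20×20 amino acid case works identically:

```python
import numpy as np

# Toy "PAM1"-like one-step transition matrix (rows sum to 1).
# Entries are invented; a real PAM1 is 20x20 and estimated from alignments.
pam1 = np.array([[0.990, 0.006, 0.004],
                 [0.005, 0.992, 0.003],
                 [0.004, 0.002, 0.994]])
assert np.allclose(pam1.sum(axis=1), 1.0)

# Extrapolate to a "PAM250"-like matrix for larger evolutionary distances.
pam250 = np.linalg.matrix_power(pam1, 250)

assert np.allclose(pam250.sum(axis=1), 1.0)   # still a stochastic matrix
assert pam250[0, 0] < pam1[0, 0]              # more change at larger distance
```

Because a product of stochastic matrices is stochastic, the extrapolated matrix remains a valid set of transition probabilities while shifting mass off the diagonal as distance grows.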

Codon models

Codon substitution models treat the codon triplet as the fundamental unit of evolution, explicitly accounting for the genetic code to distinguish synonymous from nonsynonymous changes while incorporating selection through the ratio of nonsynonymous to synonymous substitution rates, denoted ω (dN/dS). These models enable inference of adaptive evolution by estimating how ω varies across genes, sites, or lineages, with ω > 1 indicating positive selection, ω = 1 neutrality, and ω < 1 purifying selection. One foundational codon model is the 62-parameter model developed by Goldman and Yang in 1994, which parameterizes the instantaneous rate matrix using the transition/transversion rate ratio κ and the nonsynonymous/synonymous rate ratio ω, along with empirical codon frequencies estimated from the data. This model constructs a 61 × 61 rate matrix (excluding stop codons) in which substitution rates between codons differing by one nucleotide are scaled by κ for transitions and by ω for nonsynonymous changes, allowing likelihood-based estimation of evolutionary rates under the genetic code. In parallel, Muse and Gaut introduced a mechanistic codon model in 1994 with a parsimonious set of parameters, including separate synonymous and nonsynonymous rates and equilibrium frequencies for the four nucleotides, derived from an underlying nucleotide substitution process. This approach builds the codon rate matrix mechanistically by considering only single-nucleotide substitutions that respect the genetic code, avoiding the high dimensionality of fully empirical models while enabling phylogenetic inference for protein-coding sequences. Subsequent extensions addressed heterogeneity in selection pressures.
The NY98 model by Nielsen and Yang in 1998 incorporates site-specific variation in ω by discretizing it across amino acid positions using a mixture of distributions, allowing detection of pervasive positive selection at individual sites while maintaining the mechanistic structure of the Goldman-Yang framework. Building on this, branch-site models developed by Zhang et al. in 2005 allow for episodic selection by permitting ω to vary both among sites and along specific lineages on the phylogeny, using likelihood ratio tests to identify foreground branches where positive selection may act on a subset of codons. In these models, the codon rate matrix Q_codon is constructed from a nucleotide-level instantaneous rate matrix Q_nuc, with entries scaled to reflect selection: for a synonymous single-nucleotide substitution from codon i to codon j, the rate is proportional to the corresponding Q_nuc entry; for nonsynonymous changes, it is additionally multiplied by ω. Formally, for codons i and j differing at a single position where nucleotide l is replaced by nucleotide k, the substitution rate is \mu_{ij} = \begin{cases} \pi_k \, r_{lk} & \text{if the change is synonymous} \\ \omega \, \pi_k \, r_{lk} & \text{if the change is nonsynonymous} \end{cases} where π_k denotes the equilibrium frequency of the target nucleotide and r_{lk} the relative rate from Q_nuc; diagonal entries are set so that each row of the matrix sums to zero. Recent advancements in next-generation codon models focus on detecting complex biases such as codon usage and varying selection regimes across genomes, incorporating empirical frequency profiles and hierarchical structures for more accurate inference in diverse taxa. Additionally, integrations with deep learning have emerged to predict substitution rates and ω profiles by training neural networks on large genomic datasets, enhancing parameter estimation in heterogeneous evolutionary scenarios beyond traditional likelihood methods.
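As an illustration of how such a matrix is assembled, the sketch below builds a GY94-style 61 × 61 codon rate matrix from the standard genetic code, using uniform codon frequencies and illustrative values of κ and ω rather than estimates from data:

```python
import numpy as np

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AAS[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}
codons = [c for c in CODE if CODE[c] != "*"]  # 61 sense codons

def is_transition(x, y):
    # A<->G (purines) and C<->T (pyrimidines) are transitions.
    return {x, y} in ({"A", "G"}, {"C", "T"})

def codon_rate_matrix(kappa, omega, pi):
    """GY94-style 61x61 rate matrix: single-nucleotide changes only,
    scaled by kappa for transitions and omega for nonsynonymous changes."""
    n = len(codons)
    Q = np.zeros((n, n))
    for i, ci in enumerate(codons):
        for j, cj in enumerate(codons):
            diffs = [(x, y) for x, y in zip(ci, cj) if x != y]
            if len(diffs) != 1:
                continue  # identical or multiple-hit changes have rate 0
            rate = pi[j]  # target-codon equilibrium frequency
            if is_transition(*diffs[0]):
                rate *= kappa
            if CODE[ci] != CODE[cj]:
                rate *= omega
            Q[i, j] = rate
        Q[i, i] = -Q[i].sum()  # rows sum to zero
    return Q

pi = np.full(61, 1 / 61)  # uniform codon frequencies for the demo
Q = codon_rate_matrix(kappa=2.0, omega=0.3, pi=pi)
```

With κ = 2 and ω = 0.3, a synonymous transition such as TTT → TTC gets rate π κ, while a nonsynonymous transversion such as TTT → TTA is further down-weighted by ω, mirroring purifying selection.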

Advanced models

Time-reversible models

Time-reversible substitution models assume that the evolutionary process is reversible, meaning the probability of observing a sequence is the same forward or backward in time along the phylogenetic tree, which simplifies likelihood computations by satisfying detailed balance: π_i Q_{ij} = π_j Q_{ji} for stationary frequencies π and rate matrix Q. The general time-reversible (GTR) model represents the most flexible framework under this assumption for multi-state data, such as DNA with four nucleotides. For DNA, GTR parameterizes the instantaneous rate matrix Q using 6 relative exchange rates r_{ij} (one symmetric rate for each of the six unique nucleotide pairs) plus 3 free stationary frequencies among π_A, π_C, π_G, π_T, yielding 9 parameters in total, of which 8 remain free once the overall rate is absorbed into branch lengths. This formulation generalizes simpler reversible models like the Hasegawa-Kishino-Yano (HKY) model, which assumes equal transversion rates, and the Kimura 2-parameter (K80) model, which uses a single transition rate and a single transversion rate with all frequencies fixed at 0.25. The rate matrix is constructed as Q_{ij} = r_{ij} π_j for i ≠ j, with diagonal elements Q_{ii} = -∑_{j≠i} Q_{ij} to ensure row sums of zero, and the entire matrix is scaled such that the expected substitution rate ∑_i π_i (-Q_{ii}) = 1. The transition probabilities P(t) = exp(Qt) admit a closed-form expression via eigenvalue decomposition: if Q = U Λ U^{-1} with Λ diagonal containing eigenvalues λ_k (one zero and three negative), then P(t) = U exp(Λt) U^{-1}, enabling efficient likelihood evaluation. GTR's maximal flexibility under reversibility allows it to approximate any time-reversible process without imposing unnecessary equality constraints on rates, making it suitable for diverse datasets where substitution patterns vary across nucleotide pairs. It is widely implemented in phylogenetic software, including MrBayes for Bayesian inference and RAxML and IQ-TREE for maximum likelihood tree search.
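A minimal numerical sketch of this construction, with illustrative (not fitted) exchange rates and frequencies, builds Q, verifies detailed balance, and computes P(t) through the eigendecomposition of the symmetrized matrix diag(√π) Q diag(1/√π):

```python
import numpy as np

# GTR for DNA: 6 symmetric exchange rates and 4 stationary frequencies
# (illustrative values only).
states = "ACGT"
pi = np.array([0.30, 0.20, 0.25, 0.25])  # A, C, G, T
r = {"AC": 1.0, "AG": 4.0, "AT": 0.8, "CG": 1.2, "CT": 4.5, "GT": 1.0}

Q = np.zeros((4, 4))
for i, x in enumerate(states):
    for j, y in enumerate(states):
        if i != j:
            Q[i, j] = r["".join(sorted(x + y))] * pi[j]  # Q_ij = r_ij * pi_j
    Q[i, i] = -Q[i].sum()                                # rows sum to zero
Q /= -(pi * np.diag(Q)).sum()                            # mean rate = 1

# Detailed balance (time reversibility): pi_i Q_ij == pi_j Q_ji.
assert np.allclose(pi[:, None] * Q, (pi[:, None] * Q).T)

# Symmetrize so we can use a stable symmetric eigendecomposition.
s = np.sqrt(pi)
lam, V = np.linalg.eigh(np.diag(s) @ Q @ np.diag(1 / s))

def transition_probs(t):
    """P(t) = exp(Qt) via the eigendecomposition of the symmetrized Q."""
    M = (V * np.exp(lam * t)) @ V.T       # exp of the symmetrized matrix
    return M * (s[None, :] / s[:, None])  # undo the symmetrizing transform
```

Because Q is reversible, the symmetrized matrix has real eigenvalues (one zero, three negative), so P(t) is obtained with a single symmetric eigendecomposition reused for every branch length, which is exactly why this closed form makes likelihood evaluation efficient.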
For amino acid data with 20 states, GTR variants extend this framework, often incorporating mixtures to account for site-specific heterogeneity; the CAT-GTR model, for instance, combines a GTR exchangeability matrix with a Dirichlet process mixture of amino acid profiles (up to 300 categories) to model compositional variation across sites. This approach uses the same eigenvalue-based closed form for P(t) but with site-specific frequency profiles π drawn from the mixture. Despite its advantages, GTR can suffer from overparameterization on small datasets, leading to estimation instability and inflated variance in branch lengths or topologies; this is commonly mitigated by data partitioning, where subsets (e.g., codon positions) are assigned independent GTR parameters while sharing the tree topology.

Mixture and heterogeneous models

Mixture models and heterogeneous models extend basic substitution frameworks, such as the general time-reversible (GTR) model, by incorporating variation in evolutionary processes across sites or lineages to better capture the complexity of molecular evolution. These approaches address limitations of homogeneous models by allowing different substitution patterns or rates at individual sites, improving the accuracy of phylogenetic inference. One foundational method for modeling site-to-site rate variation is the discrete gamma plus invariant sites (Γ + I) model. The discrete gamma distribution approximates continuous rate variation by dividing sites into discrete rate categories, each with equal probability, as proposed by Yang in 1994 for maximum likelihood estimation. This is often combined with an invariant sites component, in which a proportion of sites (p_inv) is assumed completely resistant to substitution due to strong functional constraints, as developed by Gu, Fu, and Li in 1995. The combined Γ + I model improves model fit for diverse sequence data by accounting for both variable and fixed sites. More advanced mixture models, particularly for amino acid sequences, employ site-heterogeneous processes that allow distinct substitution profiles across sites. The CAT model, introduced by Lartillot and Philippe in 2004, uses a Bayesian nonparametric approach with a Dirichlet process prior to define an effectively infinite mixture of amino acid equilibrium-frequency profiles, estimated directly from the alignment. This enables sites to be assigned to classes with distinct equilibrium frequencies, avoiding predefined matrices and better reflecting biochemical heterogeneity. Blanquart and Lartillot extended this in 2008 by integrating the CAT mixture with a nonstationary breakpoint model to handle both site and time heterogeneity in amino acid replacements.
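The discrete gamma approximation can be computed directly. The sketch below follows Yang's mean-per-category scheme, using the identity that the partial mean of a Gamma(α, α) variable (mean 1) over a quantile bin is expressible through the Gamma(α + 1, α) distribution function:

```python
from scipy.stats import gamma

def discrete_gamma_rates(alpha, k=4):
    """Mean rates of k equiprobable categories of a Gamma(alpha, alpha)
    distribution (mean 1), following Yang's (1994) discrete approximation."""
    # Bin boundaries at the 0, 1/k, ..., 1 quantiles of Gamma(alpha, alpha).
    edges = gamma.ppf([i / k for i in range(k + 1)], a=alpha, scale=1 / alpha)
    # Partial means: integral of x*f(x) over a bin equals the CDF increment
    # of Gamma(alpha + 1, alpha); multiply by k so the overall mean stays 1.
    upper = gamma.cdf(edges, a=alpha + 1, scale=1 / alpha)
    return [k * (upper[i + 1] - upper[i]) for i in range(k)]

rates = discrete_gamma_rates(alpha=0.5, k=4)
```

Small α yields strongly skewed rates (most sites nearly invariant, a few fast), while large α makes all four category rates approach 1, recovering the homogeneous model.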
In Bayesian implementations of site-heterogeneous mixtures, class frequencies (π_k) are estimated from the posterior distribution, allowing flexible assignment of sites to evolutionary classes without fixed proportions. These models mitigate artifacts like long-branch attraction (LBA) in phylogenetics, where homogeneous assumptions lead to erroneous clustering of fast-evolving lineages; for instance, the CAT model suppresses LBA in animal phylogeny reconstructions by accommodating compositional and rate variations across sites. The likelihood under a mixture model with K classes is computed as the weighted sum over classes: L(\text{data} \mid \boldsymbol{\pi}, \{\text{model}_k\}) = \sum_{k=1}^K \pi_k L(\text{data} \mid \text{model}_k) where π_k is the frequency of class k, and L(data | model_k) is the site likelihood under the k-th substitution process. Recent advancements in protein evolution models incorporate structural information into mixtures, reflecting trends toward integrating 3D data for more realistic substitution rates among amino acids. A 2025 review in Molecular Phylogenetics and Evolution highlights the growing use of structural mixtures to model site-specific constraints based on protein folding and function, improving phylogenetic resolution for deep divergences. Software like IQ-TREE, updated in 2025, now provides enhanced support for such partitioned and mixture models, enabling efficient inference with complex heterogeneous processes across large phylogenomic datasets.
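The weighted-sum mixture likelihood above is straightforward to evaluate in log space. The following sketch, with toy per-site, per-class log-likelihoods standing in for values computed on a real tree, illustrates the computation:

```python
import numpy as np
from scipy.special import logsumexp

def site_mixture_loglik(site_logliks, weights):
    """Per-site log-likelihood under a K-class mixture:
    log sum_k w_k L_k, computed stably in log space."""
    return logsumexp(np.log(weights)[None, :] + site_logliks, axis=1)

# Toy example: 3 sites, 2 classes (e.g., slow and fast substitution processes).
site_logliks = np.array([[-2.0, -5.0],
                         [-4.0, -3.0],
                         [-6.0, -6.0]])
weights = np.array([0.7, 0.3])

per_site = site_mixture_loglik(site_logliks, weights)
total = per_site.sum()  # alignment log-likelihood = sum over sites
```

Working in log space avoids underflow when per-class likelihoods span hundreds of orders of magnitude, as they routinely do on long alignments.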

NCM and parsimony

The no common mechanism (NCM) model represents a maximally parameterized alternative to standard substitution models, assuming that evolutionary processes vary freely across sites without a shared instantaneous rate matrix Q for the entire tree. Instead, substitution parameters are estimated independently for each site, allowing maximal flexibility in accommodating site-specific heterogeneity in substitution rates and patterns. This approach was formalized by Tuffley and Steel in 1997, who demonstrated its utility in linking probabilistic inference to non-probabilistic methods. A key relation exists between the NCM model and maximum parsimony: the maximum likelihood tree under NCM coincides with the maximum parsimony tree for alignments evolving under simple symmetric models such as the Cavender-Farris-Neyman model. Maximum parsimony can thus be interpreted as maximum likelihood under NCM, effectively minimizing the total number of changes across the tree without estimating explicit rate parameters or branch lengths. This equivalence highlights parsimony's role as a computationally simple method that avoids rate estimation altogether. The NCM model offers advantages in avoiding model misspecification, as it does not presuppose a uniform evolutionary mechanism across sites, making it robust to violations of homogeneity assumptions that affect more restrictive models. It is also computationally lighter than full likelihood-based methods for large phylogenies, since site-specific optimizations reduce to parsimony-like searches that scale efficiently with tree size. Despite these benefits, the NCM model faces criticism for statistical inconsistency under complex evolutionary scenarios, stemming from its extreme over-parameterization, which can lead to poor asymptotic performance even as data volume increases. Simulations have consistently shown that NCM (and parsimony) is outperformed in accuracy by biologically motivated likelihood models that incorporate structured heterogeneity.
In contemporary phylogenetics, the NCM model finds application in analyzing discrete traits, such as morphological characters, where independent evolution per trait aligns with biological realism and avoids forcing a common process. It also serves as a null model for evaluating more sophisticated heterogeneous frameworks in phylogenomic studies.
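The parsimony score that likelihood maximization under NCM recovers can be computed with Fitch's small-parsimony algorithm. A compact sketch on a fixed four-taxon tree, with the tree encoded as a nested tuple (an illustrative representation, not a standard library format):

```python
def fitch(tree, states):
    """Fitch small parsimony: minimum number of state changes on a rooted
    binary tree. `tree` is a nested tuple of leaf names; `states` maps
    leaf name -> observed character state."""
    def walk(node):
        if isinstance(node, str):           # leaf: its own state, zero cost
            return {states[node]}, 0
        (ls, lc), (rs, rc) = walk(node[0]), walk(node[1])
        inter = ls & rs
        if inter:                           # children agree: no extra change
            return inter, lc + rc
        return ls | rs, lc + rc + 1         # disagreement costs one change

    return walk(tree)[1]

# One binary character on the tree ((A,B),(C,D)).
changes = fitch((("A", "B"), ("C", "D")), {"A": 0, "B": 0, "C": 1, "D": 1})
# changes == 1: a single 0 -> 1 change explains this pattern
```

Summing this score over all sites gives the tree length that parsimony minimizes, with no rate parameters or branch lengths involved, which is precisely the quantity the NCM equivalence concerns.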

Applications

Sequence data analysis

Substitution models play a central role in the analysis of molecular sequence data, enabling the inference of evolutionary histories from aligned nucleotide, amino acid, or codon sequences. These models account for the probabilistic nature of substitutions along phylogenetic trees, facilitating methods such as maximum likelihood estimation to reconstruct relationships among taxa. By incorporating rate matrices and transition probabilities, substitution models correct for biases like multiple hits and site-specific rate variation, improving the accuracy of downstream analyses in phylogenetics and evolutionary biology. In phylogenetic inference, substitution models are integrated into maximum likelihood (ML) tree searches, where the likelihood of a given tree topology is optimized by evaluating the probability of observing the sequence data under the model. The process begins with an initial tree, often generated via a fast distance-based method such as neighbor joining, followed by iterative branch length and parameter optimization using algorithms like nearest-neighbor interchanges to maximize the likelihood. Model selection, typically via criteria such as the Akaike information criterion (AIC), ensures the substitution model (e.g., GTR for nucleotides) best fits the data, enhancing tree resolution. Bootstrapping, which generates pseudoreplicates of the alignment by sampling sites with replacement, assesses node support by recomputing ML trees and calculating the proportion of replicates supporting each clade, providing a measure of statistical confidence. Large alignments can be analyzed in ML software such as RAxML or IQ-TREE to perform these searches efficiently. Distance-based methods rely on substitution models to compute corrected evolutionary distances between sequence pairs, which serve as inputs for tree-building algorithms like neighbor joining. Under the Jukes-Cantor model, which assumes equal substitution rates among nucleotides, the uncorrected p-distance (proportion of differing sites) underestimates true divergence due to unobserved multiple substitutions.
The Jukes-Cantor model corrects this via the formula d = -\frac{3}{4} \ln \left(1 - \frac{4}{3} p \right), where d represents the estimated number of substitutions per site. This correction becomes increasingly important as divergence grows, preventing saturation effects in distance matrices and yielding more accurate trees. Ancestral sequence reconstruction uses substitution models to infer the most likely states at internal nodes of a phylogenetic tree, aiding in understanding functional evolution. Joint maximum likelihood approaches compute the probability of all ancestral configurations simultaneously, marginalizing over the tree to obtain site-specific posterior probabilities via the transition probability matrices P(t), which describe state changes over branch lengths t. This method outperforms marginal reconstruction by considering joint dependencies, especially for amino acid sequences under empirical matrices such as JTT or LG. Detection of natural selection in sequence data employs codon substitution models, which partition nonsynonymous (d_N) and synonymous (d_S) substitution rates to estimate the ratio \omega = d_N / d_S. Values of \omega < 1 indicate purifying selection, \omega = 1 neutrality, and \omega > 1 positive selection, with site-specific models (e.g., M7 vs. M8 in PAML) allowing \omega to vary across codons via mixture distributions. Tools like Datamonkey implement methods such as FEL and REL to test for episodic selection, identifying codons under positive pressure in alignments from diverse taxa. Seminal codon models, building on these early frameworks, enable robust inference of selective pressures in protein-coding genes.
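The Jukes-Cantor correction is a one-line computation; a small sketch with its domain check (the logarithm is undefined for p ≥ 0.75, the saturation limit under equal base frequencies):

```python
import math

def jc69_distance(p):
    """Jukes-Cantor corrected distance from an observed proportion of
    differing sites p; valid only for p < 0.75 (saturation beyond that)."""
    if not 0 <= p < 0.75:
        raise ValueError("p-distance must be in [0, 0.75) for JC correction")
    return -0.75 * math.log(1 - 4 * p / 3)

# A p-distance of 0.30 corresponds to about 0.383 substitutions per site.
d = jc69_distance(0.30)
```

For small p the correction is negligible (d ≈ p), but it grows quickly: at p = 0.5 the estimate is roughly 0.82 substitutions per site, illustrating how many multiple hits the raw p-distance hides.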

Non-sequence data

Substitution models have been extended beyond molecular sequences to analyze discrete non-molecular data, such as morphological traits observed in fossils or extant organisms, where characters are coded as discrete states rather than continuous measurements. These models treat morphological character change as a Markov process on a phylogenetic tree, analogous to nucleotide or amino acid substitution, but adapted for traits like bone structure or presence/absence of features. The foundational approach is the Mk model, introduced by Lewis in 2001, which serves as a general framework for multi-state characters and parallels the Jukes-Cantor model by assuming equal rates among k states. For binary traits, such as the presence or absence of a morphological feature, the model simplifies to the Lewis two-state version (Mk with k = 2), which estimates substitution rates between states while accounting for phylogenetic structure. Extensions to the basic model address specific properties of morphological data. Characters can be modeled as unordered, assuming equal evolutionary rates between any pair of states, or ordered, enforcing stepwise transitions (e.g., from state 0 to 1 to 2), which is suitable for traits with implied developmental or functional gradients like size gradations. A key challenge in fossil-inclusive datasets is ascertainment bias, arising because systematists typically exclude invariant characters, which inflates branch lengths; corrections adjust the likelihood by enumerating the unobserved invariant patterns and conditioning on character variability, as implemented for the Mk model in tools like RevBayes. These adaptations enable robust inference from the sparse, incomplete morphological matrices often encountered in paleontology. In applications, these models support cladistic analysis by providing a likelihood-based alternative to traditional parsimony, allowing parameter estimation and statistical comparison of tree topologies from morphological data alone.
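Because the Mk rate matrix is fully symmetric, its transition probabilities have a simple closed form, which the sketch below implements for arbitrary k (with μ denoting the rate between each ordered pair of states):

```python
import math

def mk_transition_prob(i, j, t, k=2, mu=1.0):
    """Transition probability under the Lewis Mk model (k states, equal
    rate mu between every ordered pair of states): closed form of exp(Qt)."""
    decay = math.exp(-k * mu * t)
    if i == j:
        return 1 / k + (k - 1) / k * decay  # probability of no net change
    return 1 / k - 1 / k * decay            # probability of each other state

# Probabilities from any starting state sum to 1 over all k target states.
k = 4
total = sum(mk_transition_prob(0, j, t=0.7, k=k) for j in range(k))
```

As t grows, every entry converges to 1/k, the uniform stationary distribution, which is why invariant-character exclusion (ascertainment bias) distorts branch-length estimates unless the likelihood is conditioned on variability.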
These models also facilitate integration with molecular sequences in total evidence approaches, where morphological and genetic data are combined in a unified Bayesian framework to improve resolution, particularly for groups with limited sequence coverage such as fossil taxa. Despite these advances, limitations persist due to the sparse nature of morphological datasets, which often include far fewer characters than molecular ones, leading to overparameterization of the rate matrix and reduced statistical power. To mitigate this, hybrid approaches incorporating parsimony scores or character weighting have been employed, balancing model complexity with data availability. A recent development bridging morphological and protein data is the 3Di substitution model, introduced in 2025, which encodes three-dimensional protein structures as discrete characters using FoldSeek's structural alphabet to infer deep phylogenies where sequence divergence obscures relationships. This matrix provides a structural complement to sequence-based models, enhancing total evidence analyses by incorporating conserved folds as "morphological" traits at the molecular level.

Computational tools

Several software packages implement substitution models for phylogenetic inference, enabling maximum likelihood (ML), Bayesian, and simulation-based analyses. IQ-TREE, a widely used tool for efficient phylogenomic inference, supports over 200 time-reversible substitution models for DNA, protein, codon, binary, and morphological data, with rate heterogeneity among sites. Its 2025 release (version 3.0.1) extends capabilities for complex mixture models as alternatives to partitioned models, improving the handling of site-specific heterogeneity without predefined partitions, and includes gene and site concordance factors for phylogenomic datasets. MrBayes facilitates Bayesian phylogenetic inference under mixed models, incorporating priors on substitution rates and tree topologies to account for uncertainty in evolutionary processes across heterogeneous datasets. RAxML, optimized for rapid ML-based tree inference on large phylogenies, employs randomized accelerated algorithms to evaluate substitution models like GTR and HKY85, making it suitable for bootstrap analyses and post-analysis of extensive alignments. Key features enhance model applicability and efficiency. ModelFinder, integrated into IQ-TREE, performs ultrafast automatic selection of optimal substitution models using BIC, AIC, or AICc criteria, outperforming traditional tools like jModelTest by 10-100 times in speed while maintaining accuracy for partitioned and mixture schemes. For sequence simulation under substitution models, Seq-Gen generates nucleotide or amino acid sequences along specified phylogenies, implementing common Markov models such as Jukes-Cantor and GTR to test hypotheses or validate methods by mimicking evolutionary processes. Recent advancements address temporal and selective aspects of substitution.
BEAST2 supports relaxed molecular clock models, including uncorrelated lognormal and random local clocks, to estimate divergence times while allowing rate variation across branches, with optimizations like the Optimised Relaxed Clock (ORC) package improving performance for large datasets. HyPhy, a platform for hypothesis testing in molecular evolution, has seen post-2023 enhancements in codon substitution models for detecting adaptive evolution, enabling site-specific selection analyses through flexible likelihood frameworks on protein-coding alignments. Scalability challenges persist for genome-scale analyses, where the computational demands of complex substitution models on massive alignments can exceed traditional CPU limits, prompting 2025 developments in GPU acceleration, with up to 65-fold speedups reported in variant calling and more modest gains (e.g., ~1.5-fold) in phylogenetic simulation. Tools like Apollo leverage GPUs for within-host simulations under Markovian substitution processes, improving scalability for modeling evolutionary dynamics in viral phylogenomics. Best practices emphasize partitioning alignments by gene boundaries to accommodate heterogeneous evolutionary rates, followed by model selection per partition to enhance accuracy without overparameterization. Validation through simulations, such as those generated by Seq-Gen, is recommended to assess model fit and robustness against biases in real data.

References

  1. [1]
    Trends in substitution models of molecular evolution - Frontiers
    Substitution models of evolution describe the process of genetic variation through fixed mutations and constitute the basis of the evolutionary analysis at ...
  2. [2]
    Substitution Model - an overview | ScienceDirect Topics
    A substitution model describes how mutations arise between related samples over time [87]. Common models are 'Hasegawa, Kishino, and Yano' (HKY) and 'General ...
  3. [3]
    [PDF] Taking variation of evolutionary rates between sites into account in ...
    His work with Charles Cantor, setting forth the Jukes-Cantor model of molecular evolution (Jukes and. Cantor 1969) was a very small part of his wider studies on ...
  4. [4]
    Dayhoff Amino Acid Substitution Matrix (PAM Matrix, Percent ...
    Oct 15, 2004 · Dayhoff MO, et al. (1978). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3 ...Missing: original | Show results with:original
  5. [5]
    PDF - BIOINFORMATICS ORIGINAL PAPER - Oxford University Press
    ABSTRACT. Motivation: The general-time-reversible (GTR) model is one of the most popular models of nucleotide substitution because it constitutes.
  6. [6]
    Estimating Amino Acid Substitution Models: A Comparison of ...
    We start with summarizing the original work of Dayhoff, Schwartz, and Orcutt (1978) , develop the formalism of a maximum likelihood estimator, and finally ...
  7. [7]
    Choosing Appropriate Substitution Models for the Phylogenetic ...
    We determined the most appropriate model for alignments of 177 RNA virus genes and 106 yeast genes, using 11 substitution models including one codon model and ...
  8. [8]
    Substitution Models - IQ-TREE
    Jun 10, 2025 · IQ-TREE supports a wide range of substitution models, including advanced partition and mixture models. This guide gives a detailed information of all available ...
  9. [9]
    [PDF] A gentle Introduction to Probabilistic Evolutionary Models - HAL
    Apr 10, 2020 · Here, we explain the fundamental of continuous time Markov models used to describe sequence evolution. We be- gin by describing discrete Markov ...
  10. [10]
    Evolutionary trees from DNA sequences: A maximum likelihood ...
    The application of maximum likelihood techniques to the estimation of evolutionary trees from nucleic acid sequence data is discussed.
  11. [11]
    [PDF] Jukes T H & Cantor C R. Evolution of protein molecules. (Munro H N ...
    Feb 16, 1990 · It was published in Munro's book in 1969, and the article has 110 printed pages. Citations to our long article relate only to the following ...
  12. [12]
    [PDF] Different Versions of the Dayhoff Rate Matrix - EMBL-EBI
    Again, only a probability matrix, mutabilities, frequencies and incomplete counts are provided in the original paper. The JTT probability matrix has also ...
  13. [13]
    [PDF] Evolutionary Trees from DNA- Sequences - a Maximum-Likelihood ...
    Summary. The application of maximum likelihood techniques to the estimation of evolutionary trees from nucleic acid sequence data is discussed. A computa-.
  14. [14]
    Dating of the human-ape splitting by a molecular clock of ...
    A new statistical method for estimating divergence dates of species from DNA sequence data by a molecular clock approach is developed.Missing: original | Show results with:original
  15. [15]
    The rapid generation of mutation data matrices from protein ...
    The general trends shown in the PET matrix are essentially those found in the original Dayhoff matrix: hydrophobicity and size being the most significant ...
  16. [16]
    Data-specific substitution models improve protein-based ... - PeerJ
    Aug 8, 2023 · Calculating amino-acid substitution models that are specific for individual protein data sets is often difficult due to the computational ...
  17. [17]
    General Substitution Matrix for Structural Phylogenetics
    Our 3Di substitution matrix provides a starting point for revisiting many deep phylogenetic problems that have so far been extremely difficult to solve.
  18. [18]
    Testing Substitution Models Within a Phylogenetic Tree
    With the exception of LogDet and paralinear distances, these models require homogeneity, reversibility, and stationarity. The model is usually defined by a ...Models And Matrices · Inference And Testing · DiscussionMissing: key properties
  19. [19]
    Efficient Likelihood Computations with Nonreversible Models of ...
    Noticeably, these algorithms are currently limited to reversible models of evolution, in which Felsenstein's pulley principle applies. In this paper we show ...
  20. [20]
    Simple Methods for Testing the Molecular Evolutionary Clock ... - NIH
    Simple statistical methods for testing the molecular evolutionary clock hypothesis are developed which can be applied to both nucleotide and amino acid ...
  21. [21]
    Simple methods for testing the molecular evolutionary clock ...
    Simple statistical methods for testing the molecular evolutionary clock hypothesis are developed which can be applied to both nucleotide and amino acid ...
  22. [22]
    [PDF] Simple methods for testing the molecular evolutionary clock ...
    Simple statistical methods for testing the molecular evolutionary clock hypothesis are developed which can be applied to both nucleotide and amino acid ...
  23. [23]
    The fossilized birth–death process for coherent calibration of ... - PNAS
    Jul 9, 2014 · Branch lengths, then, are the product of substitution rate and time, and usually measured in units of expected number of substitutions per site.
  24. [24]
    Testing the molecular clock using mechanistic models of fossil ...
    Jun 21, 2017 · Hence, calibration of the molecular clock relies ultimately on information derived from fossil evidence (or other geological events). Fossil ...
  25. [25]
    Relaxed Phylogenetics and Dating with Confidence - PMC - NIH
    Copyright: © 2006 Drummond et al. This is an open-access article ... BEAST with a molecular clock (CLOC); and uncorrelated lognormal relaxed clock (UCLN).
  26. [26]
    Molecular Clocks | BEAST Documentation
    Jul 24, 2017 · The relaxed clock implementation in BEAST works by assigning each branch one rate from a fixed number of discrete rates. Basically, the ...
  27. [27]
    Estimation of Primate Speciation Dates Using Local Molecular Clocks
    In a model of local molecular clock, we assume that each branch in the phylogeny can take one of k possible rates. We let r0 = 1 be the default rate and simply ...
  28. [28]
    Modeling Substitution Rate Evolution across Lineages and Relaxing ...
    Sep 27, 2024 · Relaxing the molecular clock using models of how substitution rates change across lineages has become essential for addressing evolutionary problems.
  29. [29]
  30. [30]
  31. [31]
    New Algorithms and Methods to Estimate Maximum-Likelihood ...
    PhyML is a phylogeny software based on the maximum-likelihood principle. Early PhyML versions used a fast algorithm performing nearest neighbor interchanges ...Strategies To Search The... · New Spr Algorithm · Phylogenetic Methods...Missing: hill- | Show results with:hill-<|control11|><|separator|>
  32. [32]
    MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model ...
    Feb 22, 2012 · MrBayes 3.2 is a software for Bayesian phylogenetic inference using MCMC, with new features like convergence diagnostics, faster likelihood ...
  33. [33]
    Model Selection and Model Averaging in Phylogenetics ...
    Here we discuss some fundamental concepts and techniques of model selection in the context of phylogenetics. We start by reviewing different aspects of the ...
  34. [34]
    Performance of Criteria for Selecting Evolutionary Models
    Aug 9, 2010 · Of the four widely-used model-selection criteria in phylogenetics - the hLRT, AIC, BIC, and DT - the hLRT was once argued to be reasonably ...
  35. [35]
    jModelTest: Phylogenetic Model Averaging - Oxford Academic
    jModelTest is a new program for the statistical selection of models of nucleotide substitution based on “Phyml” (Guindon and Gascuel 2003.Introduction · Model Selection with jModelTest · Model Selection Uncertainty
  36. [36]
    ProtTest: selection of best-fit models of protein evolution
    ProtTest is a java program to find the best model of amino acid replacement for a given protein alignment. It is based on the Phyml program (Guindon and Gascuel ...Abstract · INTRODUCTION · THE PROGRAM: PROTTEST
  37. [37]
    IQ-TREE: Efficient phylogenomic software by maximum likelihood
    ### Summary of IQ-TREE (http://www.iqtree.org/)
  38. [38]
    Site-specific time heterogeneity of the substitution process and its ...
    Jan 14, 2011 · In this model, the substitution process is assumed to be site-independent and is entirely defined by the equilibrium frequencies of amino acids, ...<|separator|>
  39. [39]
  40. [40]
    A simple method for estimating evolutionary rates of base ... - PubMed
    A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980 Dec;16(2):111-20.<|separator|>
  41. [41]
    Dating of the human-ape splitting by a molecular clock of ... - PubMed
    A new statistical method for estimating divergence dates of species from DNA sequence data by a molecular clock approach is developed.Missing: original paper
  42. [42]
    Evolutionary trees from DNA sequences: a maximum likelihood ...
    The application of maximum likelihood techniques to the estimation of evolutionary trees from nucleic acid sequence data is discussed.
  43. [43]
    Estimation of the number of nucleotide substitutions in the control ...
    ... Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993 May;10(3):512-26.<|separator|>
  44. [44]
    Maximum likelihood phylogenetic estimation from DNA sequences ...
    The first, called the "discrete gamma model," uses several categories of rates to approximate the gamma distribution, with equal probability for each category.
  45. [45]
    A Hidden Markov Model approach to variation among sites in rate of ...
    A Hidden Markov Model approach to variation among sites in rate of evolution. Mol Biol Evol. 1996 Jan;13(1):93-104. doi: 10.1093/oxfordjournals.molbev.a025575.
  46. [46]
  47. [47]
    [PDF] Identifiability of Phylogenetic Models - CrossWorks
    May 18, 2023 · The Cavender-Farris-Neyman (CFN) model has the rate matrix Q_CFN ... The Kimura 3-Parameter model on 4 letters has the rate matrix ...
  48. [48]
    Toric geometry of the Cavender-Farris-Neyman model with a ...
    The CFN model describes substitutions at a single site in the gene sequences of the taxa in question. It is a two-state model, where the states are purine ...
  49. [49]
    [PDF] PHASE TRANSITIONS IN PHYLOGENY 1. Introduction Phylogenetic ...
    Cavender, Farris and Neyman [28, 5, 11] introduced a model of evolution of binary characters. In this model, the evolution of characters is governed by the.
  50. [50]
    Inferring ancestral sequences in taxon-rich phylogenies
    For instance, consider any continuous Markov process on two states, with arbitrary transition rates (generally unequal) between the two states.
  51. [51]
    New Acquisition Bias Corrections for Inferring SNP Phylogenies - PMC
    We introduce two new acquisition bias corrections for dealing with alignments composed exclusively of SNPs, a conditional likelihood method and a reconstituted ...
  52. [52]
    [2312.07450] The Pfaffian Structure of CFN Phylogenetic Networks
    In this paper, we study the ideal of phylogenetic invariants of the Cavender-Farris-Neyman (CFN) model on a phylogenetic network.
  53. [53]
    [PDF] dayhoff-1978-apss.pdf
    The 1 PAM matrix can be multiplied by itself N times to yield a matrix that predicts the amino acid replace- ments to be found after N PAMs of evolutionary ...
  54. [54]
    rapid generation of mutation data matrices from protein sequences
    An efficient means for generating mutation data matrices from large numbers of protein sequences is presented here. By means of an approximate peptide-based ...
  55. [55]
    General Empirical Model of Protein Evolution Derived from Multiple ...
    Jones, Taylor, and Thornton (1992) applied a faster, automated procedure based on Dayhoff and colleagues' (Dayhoff, Eck, and Park 1972 ; Dayhoff, Schwartz, and ...
  56. [56]
    Improved General Amino Acid Replacement Matrix - Oxford Academic
    Amino acid replacement matrices are an essential basis of protein phylogenetics. They are used to compute substitution probabilities along phylogeny branches.
  57. [57]
    Accurate prediction of substitution rates at protein sites with ... - Nature
    Oct 2, 2025 · This study demonstrates that a rapid and simple method can be developed from the mutation-selection model to predict substitution rates from ...
  58. [58]
    Codon-substitution models for heterogeneous selection pressure at ...
    We develop models that account for heterogeneous omega ratios among amino acid sites and apply them to phylogenetic analyses of protein-coding DNA sequences.
  59. [59]
    Next-generation development and application of codon model in ...
    Though empirical codon models are highly useful in understanding protein evolution as well as in phylogenetic applications, only a few models have been ...
  60. [60]
    A likelihood approach for comparing synonymous and ... - PubMed
    The model uses the codon, as opposed to the nucleotide, as the unit of evolution, and is parameterized in terms of synonymous and nonsynonymous nucleotide ...
  61. [61]
    Codon-Substitution Models for Detecting Molecular Adaptation at ...
    Jun 1, 2002 · The model of codon substitution of Goldman and Yang (1994) (see also Muse and Gaut 1994 ) provides a framework for studying the mechanism of ...
  62. [62]
    A codon-based model of nucleotide substitution for protein-coding ...
    A codon-based model for the evolution of protein-coding DNA sequences is presented for use in phylogenetic estimation.
  63. [63]
    [PDF] A codon-based model of nucleotide substitution for protein ...
    A codon-based model of nucleotide substitution for protein-coding DNA sequences. · N. Goldman, Ziheng Yang · Published in Molecular biology and… 1 September 1994 ...
  64. [64]
    Evaluation of an improved branch-site likelihood method ... - PubMed
    We describe a modified branch-site model and use it to construct two LRTs, called branch-site tests 1 and 2. We applied the new tests to reanalyze several real ...
  65. [65]
    Superiority of a mechanistic codon substitution model even for ...
    Nov 21, 2013 · It is shown that the mechanistic codon substitution model with the assumption of equal codon usage yields better values of Akaike and Bayesian ...
  66. [66]
    Codon optimization with deep learning to enhance protein expression
    Oct 19, 2020 · The codon box can be regarded as a coding method in machine learning that can simplify deep learning models, and a codon box and an amino acid ...
  67. [67]
    [PDF] divergence times on the basis of DNA sequence data. Suppose that
    The general reversible model seems particularly tractable; a crude estimate of t_2 for the above data is 14.4 (serum albumin) and 32.8 (α-fetoprotein) MY.
  68. [68]
    Is the General Time-Reversible Model Bad for Molecular ...
    The general time-reversible (GTR) model (Tavaré 1986) has been the workhorse of molecular phylogenetics for the last decade. GTR sits at the top of the ...
  69. [69]
    A Bayesian Mixture Model for Across-Site Heterogeneities in the ...
    Nicolas Lartillot, Hervé Philippe, A Bayesian Mixture Model for Across ... (CAT-GTR model). Because they are defined up to a multiplicative constant ...
  70. [70]
    A Bayesian mixture model for across-site heterogeneities ... - PubMed
    Our model, named CAT, assumes the existence of distinct processes (or classes) differing by their equilibrium frequencies over the 20 residues. Through the use ...
  71. [71]
    Suppression of long-branch attraction artefacts in the animal ...
    Feb 8, 2007 · Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. Nicolas Lartillot,; Henner Brinkmann & ...
  72. [72]
    A site- and time-heterogeneous model of amino acid replacement
    We combined the category (CAT) mixture model (Lartillot N, Philippe H. 2004) and the nonstationary break point (BP) model (Blanquart S, Lartillot N. 2006) ...
  73. [73]
    Trends in substitution models of protein evolution for phylogenetic ...
    Sep 27, 2025 · Substitution models of protein evolution are essential for probabilistic approaches to phylogenetic inference. We overview their fundaments and ...
  74. [74]
    IQ-TREE: Efficient phylogenomic software by maximum likelihood
    IQ-TREE. Efficient software for phylogenomic inference. Latest release 3.0.1 (May 5, 2025). Download v3.0.1 for Windows Download v3.0.1 for macOS Universal ...
  75. [75]
    Felsenstein Phylogenetic Likelihood - PMC - PubMed Central - NIH
    Jan 13, 2021 · The main goal of Felsenstein's 1981 JME article was to show how to efficiently calculate the probability of a set of aligned nucleotide ...
  76. [76]
    CONFIDENCE LIMITS ON PHYLOGENIES: AN APPROACH USING ...
    Abstract. The recently‐developed statistical method known as the “bootstrap” can be used to place confidence intervals on phylogenies.
  77. [77]
    a novel method for rapid multiple sequence alignment based on fast ...
    A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods.
  78. [78]
    Selecting Models of Nucleotide Substitution - Oxford Academic
    The first model developed for molecular evolution was that of Jukes and Cantor (1969) (JC), who considered all possible changes among nucleotides to occur ...
  79. [79]
    A Fast Algorithm for Joint Reconstruction of Ancestral Amino Acid ...
    A dynamic programming algorithm is developed for maximum-likelihood reconstruction of the set of all ancestral amino acid sequences in a phylogenetic tree.
  80. [80]
    Discrete morphology - Ancestral State Estimation - RevBayes
    Feb 25, 2022 · This tutorial will provide a discussion of modeling morphological characters and ancestral state estimation, and will demonstrate how to perform such Bayesian ...
  81. [81]
    Discrete morphology - Tree Inference - RevBayes
    Sep 12, 2023 · Ascertainment Bias. When Lewis first introduced the Mk model, he observed that branch lengths on the trees were greatly inflated. The reason for ...
  82. [82]
    Combined-evidence analyses of ultraconserved elements and ...
    Aug 26, 2020 · The GTRCAT approximation was applied to the molecular data, and a MULTICAT Mk model with the asc option (ascertainment bias) and Lewis ...
  83. [83]
    IQ-TREE 2: New Models and Efficient Methods for Phylogenetic ...
    Here, we describe notable features of IQ-TREE version 2 and highlight the key advantages over other software.
  84. [84]
    Phylogenomic Inference Software using Complex Evolutionary Models
    IQ-TREE 3 significantly extends version 2 with new features, including mixture models as an alternative to partitioned models, gene and site concordance factors ...
  85. [85]
    MrBayes | manual
    MRBAYES 3.2: Efficient Bayesian phylogenetic inference and model selection across a large model space. Syst. Biol. 61:539-542. If you use the parallel abilities ...
  86. [86]
    [PDF] MrBayes version 3.2 Manual: Tutorials and Model Summaries
    basic Bayesian MCMC analysis of phylogeny, explaining the most important features of the program. There are two versions of the tutorial. You will first ...
  87. [87]
    RAxML - The Exelixis Lab - Heidelberg Institute for Theoretical Studies
    RAxML is a tool for phylogenetic analysis and post-analysis of large phylogenies, using Randomized Axelerated Maximum Likelihood.
  88. [88]
    RAxML version 8: a tool for phylogenetic analysis and post ... - NIH
    RAxML is a program for phylogenetic analysis of large datasets under maximum likelihood, using a fast tree search algorithm.
  89. [89]
    ModelFinder: Fast Model Selection for Accurate Phylogenetic ...
    May 8, 2017 · ModelFinder is a tool for fast model selection for accurate phylogenetic estimates. It determines amino-acids with high and low evolutionary ...
  90. [90]
    ModelFinder: fast model selection for accurate phylogenetic estimates
    May 8, 2017 · ModelFinder is implemented in IQ-TREE and offers many features, including the choice of comparing models of SE inferred on the same tree ( ...
  91. [91]
    Seq-Gen: Simulation of molecular sequences
    Seq-Gen is a program that will simulate the evolution of nucleotide or amino acid sequences along a phylogeny, using common models of the substitution process.
  92. [92]
    Seq-Gen: an application for the Monte Carlo simulation of DNA ...
    Seq-Gen is a program that will simulate the evolution of nucleotide sequences along a phylogeny, using common models of the substitution process.
  93. [93]
    BEAST 2
    It estimates rooted, time-measured phylogenies using strict or relaxed molecular clock models. It can be used as a method of reconstructing phylogenies but ...
  94. [94]
    Species Trees with Relaxed Molecular Clocks - Taming the BEAST
    StarBEAST2 is a Bayesian method for species tree estimation, using relaxed clocks to estimate substitution rates of extant and ancestral species.
  95. [95]
    jordandouglas/ORC: Optimised relaxed clock (ORC) - GitHub
    ORC - Optimised Relaxed Clock. A BEAST 2 package containing a series of optimisations made to improve the performance of the phylogenetic relaxed clock model.
  96. [96]
    HyPhy 2.5—A Customizable Platform for Evolutionary Hypothesis ...
    Aug 27, 2019 · HyPhy is a scriptable, open-source package for fitting a broad range of evolutionary models to multiple sequence alignments.
  97. [97]
    From GPUs to AI and quantum: three waves of acceleration in ...
    ParaBricks is a state-of-the-art GPU-accelerated software suite for accelerating genomic workflows. It can achieve up to 65 × acceleration with germline variant ...
  98. [98]
    Apollo: a comprehensive GPU-powered within-host simulator for ...
    Jul 1, 2025 · We developed Apollo, a GPU-accelerated, out-of-core tool for within-host simulation of viral evolution and infection dynamics across population, tissue, and ...
  99. [99]
    Automatic selection of partitioning schemes for phylogenetic ...
    Feb 10, 2015 · Start with a partitioning scheme that has all sites assigned to a single subset, and choose the best-fit substitution model for that subset; ...
  100. [100]
    Comparing Partitioned Models to Mixture Models: Do Information ...
    The current typical phylogenomic analysis consists of partitioning the alignment by gene boundaries, running PartitionFinder to merge blocks and select models, ...
  101. [101]
    Biological Sequence Simulation for Testing Complex Evolutionary ...
    Sequence simulation is an important tool in validating biological hypotheses as well as testing various bioinformatics and molecular evolutionary methods.
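The snippet for reference [53] notes that the 1-PAM matrix can be multiplied by itself N times to predict the amino acid replacements expected after N PAMs of evolution. A minimal sketch of that extrapolation, using a toy two-letter alphabet with made-up probabilities (the real Dayhoff matrices are 20x20 and empirically estimated):

```python
import numpy as np

# Hypothetical 1-PAM-style transition matrix on a two-letter alphabet.
# Each row gives the probabilities of staying or being replaced over one
# PAM unit of evolutionary time, so rows sum to 1. Values are illustrative
# assumptions only, not Dayhoff's estimates.
P1 = np.array([[0.99, 0.01],
               [0.02, 0.98]])

# Extrapolate to N PAMs by repeated self-multiplication: P_N = P1 ** N.
# PAM250, the classic Dayhoff distance, corresponds to N = 250.
P250 = np.linalg.matrix_power(P1, 250)

# P250 is still a stochastic matrix (rows sum to 1), but with far more
# probability mass on the off-diagonal (replacement) entries than P1.
print(P250.round(3))
```

As N grows, each row of P_N converges to the stationary frequencies of the chain, which is why matrices built for large PAM distances are dominated by the equilibrium amino acid composition rather than by the identity of the starting residue.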