Point accepted mutation
A point accepted mutation (PAM) is the replacement of one amino acid by another in a protein sequence that has become fixed in a population through natural selection and is observable as a difference between related protein sequences.[1] This concept forms the basis for modeling evolutionary changes in proteins, particularly in bioinformatics for assessing sequence similarity and divergence.[2]
The PAM model was developed by Margaret O. Dayhoff and colleagues in the late 1970s, drawing from empirical observations of amino acid substitutions in closely related protein sequences sharing over 85% identity.[3] By analyzing 71 phylogenetic trees constructed from such sequences, they counted 1,572 accepted mutations (changes inferred by maximum parsimony, with a single substitution assumed per site) and normalized these counts to derive mutability values for each amino acid.[1] The resulting PAM1 matrix represents the substitutions expected after 1% divergence, that is, one accepted mutation per 100 residues; higher-order matrices such as PAM250 are obtained by raising PAM1 to the corresponding power to model greater evolutionary distances.[2]
In their log-odds form, PAM matrices serve as scoring systems in protein sequence alignment methods, such as BLAST and dynamic programming algorithms, to quantify how much more likely an alignment is to reflect true homology than chance.[4] They favor conservative substitutions (e.g., between similar amino acids like leucine and isoleucine) over radical ones, aiding in evolutionary inference and functional prediction.[3] While superseded in some applications by more recent models like BLOSUM, PAM remains foundational for understanding protein evolution and is influential in phylogenetic analyses.[2]
Background and History
Biological basis
Point accepted mutations arise from single nucleotide changes in the DNA sequence of a gene, which can alter a codon and thereby replace one amino acid with another in the encoded protein. These substitutions occur through nonsynonymous mutations, where the nucleotide change results in a different amino acid being incorporated during translation, in contrast to synonymous mutations, which change the codon without altering the amino acid sequence. Although the genetic code is degenerate, most single nucleotide substitutions are nonsynonymous and can potentially disrupt protein structure or function, while a smaller fraction are synonymous and typically neutral.[5]
Natural selection plays a pivotal role in determining whether such nonsynonymous mutations are accepted or rejected in a population, based on their impact on the organism's fitness. Beneficial mutations that enhance protein function, stability, or adaptability to environmental pressures are more likely to become fixed through positive selection, whereas deleterious mutations that impair protein folding, enzymatic activity, or interactions are purged by purifying selection. Neutral mutations, which have minimal effect on fitness, can reach fixation through genetic drift, allowing gradual evolutionary change without immediate selective pressure.[5]
In protein evolution, observed mutations (those that become fixed in lineages) represent only a subset of possible mutations, with a strong bias toward conservative substitutions that replace an amino acid with one of similar physicochemical properties, such as size, charge, or hydrophobicity. These conservative changes are more frequently accepted because they tend to preserve the protein's three-dimensional structure, folding stability, and functional sites, minimizing disruptive effects on overall fitness. In contrast, radical substitutions involving dissimilar amino acids are rarer in observed evolution, as they often lead to significant structural perturbations and reduced viability.[5][6]
Over evolutionary time, these accepted point mutations accumulate in diverging phylogenetic lineages, serving as markers of genetic divergence between species or populations. As lineages branch and adapt independently, the rate of mutation fixation reflects the balance between mutational input and selective constraints, with proteins under strong functional pressure evolving more slowly than those with greater tolerance for change. This accumulation enables the reconstruction of evolutionary histories and highlights how point accepted mutations act as fundamental units of protein sequence divergence.[5][7]
Historical development
The development of point accepted mutation (PAM) matrices originated in the 1960s with the pioneering efforts of Margaret Dayhoff and her collaborators at the National Biomedical Research Foundation, who began systematically compiling and analyzing protein sequences in the Atlas of Protein Sequence and Structure.[1] Initial work in the 1967-1968 edition of the Atlas introduced early models of evolutionary change in proteins, focusing on substitution probabilities derived from observed amino acid replacements in related sequences.[1] These efforts evolved from rudimentary substitution tables, which combined empirical mutation data with estimates of relative mutability for each amino acid, into more formalized quantitative frameworks by the early 1970s.[1]
A key milestone came in the 1978 edition of the Atlas of Protein Sequence and Structure (Volume 5, Supplement 3), where Dayhoff, along with Robert M. Schwartz and Bonnie C. Orcutt, presented the definitive PAM matrices based on an expanded dataset.[1] This analysis drew from 1,572 observed changes across 71 groups of 157 closely related proteins spanning 34 superfamilies, with sequences within each phylogenetic tree differing by less than 15% to ensure the changes primarily reflected single substitutions.[1] The matrices were constructed by inferring ancestral sequences from phylogenetic trees and counting accepted mutations, formalizing the "1 PAM" unit as one accepted mutation per 100 residues, which could then be extrapolated for longer evolutionary distances through matrix powers.[1] This work built directly on Dayhoff's prior refinements in the 1972 volume of the Atlas, where substitution probabilities were first scaled for evolutionary time, transitioning from static tables to dynamic, distance-dependent models.[1]
The PAM framework's emphasis on global alignments of closely related proteins influenced later substitution matrix developments, such as the BLOSUM series introduced by Steven and Jorja Henikoff in 1992, which shifted to local alignments of conserved blocks from more divergent sequences.[8]
Core Concepts and Terminology
Definition of point accepted mutation
A point accepted mutation (PAM), also known as an accepted point mutation, refers to the replacement of a single amino acid in the primary structure of a protein with another amino acid that has become fixed in a lineage through evolutionary processes.[9] This concept was introduced by Margaret O. Dayhoff and colleagues in their seminal work on protein evolution.[10] Unlike a general point mutation, which typically describes a single nucleotide change in DNA that may or may not result in an amino acid substitution, a PAM specifically refers to substitutions at the protein level that are detectable as changes in aligned sequences of related proteins and have persisted over time without reversion.[9] The term "accepted" highlights that these mutations have survived purifying selection, meaning the new amino acid variant is functionally viable and not eliminated by natural selection, often because it maintains similar biochemical properties to the original.[10]
The unit of evolutionary distance in the PAM model is defined such that 1 PAM corresponds to an average of one accepted mutation per 100 amino acid residues (that is, 1% of positions changed), providing a standardized measure of divergence between protein sequences.[9] This quantification allows for the assessment of how closely related two proteins are based on the number of such fixed changes accumulated along their evolutionary path.[10]
Mutation probability matrices
A mutation probability matrix (MPM) in the context of point accepted mutations (PAM) is a 20×20 table that models the probabilities of amino acid substitutions over an evolutionary period defined by one PAM unit. Each entry M_{ij} in the matrix represents the probability that an original amino acid in column j is replaced by the amino acid in row i after one PAM of evolution.[1]
The diagonal elements M_{ii} denote the probability that a given amino acid remains unchanged during this evolutionary interval, while the off-diagonal elements M_{ij} (where i \neq j) specify the probabilities of particular changes to another amino acid. These elements are calculated based on empirical observations of mutations, incorporating the relative mutabilities of amino acids and their background frequencies to reflect biologically realistic substitution patterns.[1]
The matrix is normalized so that the sum of all elements in each column equals 1, which ensures that the probabilities of all possible outcomes for a given original amino acid total 100% and represent conditional probabilities in a probabilistic model. This structure aligns with a Markov chain framework, where the state at one evolutionary step depends only on the immediate prior state.[1]
In relation to evolutionary models, PAM matrices empirically quantify amino acid substitution rates by deriving probabilities from real alignments of closely related proteins, thereby capturing the influence of natural selection, chemical similarities among amino acids, and constraints imposed by the genetic code. These matrices enable the simulation of protein sequence evolution over specified distances and form the basis for scoring systems in sequence alignment algorithms.[1]
Construction of PAM Matrices
Data collection from related sequences
The construction of PAM matrices relies on empirical data gathered from alignments of closely related protein sequences to capture observed amino acid substitutions that have been accepted by natural selection. Selection criteria emphasize protein families with high sequence similarity, typically greater than 85% identity (or less than 15% divergence), to minimize the occurrence of multiple mutations at the same site and ensure that observed changes represent primarily single substitutions. This approach was pioneered by Dayhoff et al., who analyzed 71 groups of closely related proteins drawn from 34 superfamilies, focusing on phylogenetic trees to infer ancestral sequences and derive 1,572 accepted point mutations across these families.[1] The data collection process involves constructing multiple sequence alignments using phylogenetic trees, where sequences are compared not only pairwise but also against inferred ancestral nodes to sharpen the mutation counts and reduce alignment biases. Positions with gaps are generally excluded to focus on conserved, ungapped sites, while any ambiguities in nodal sequences—arising from equally parsimonious alternatives—are handled statistically by distributing potential changes proportionally among the possibilities. This method allows for the tabulation of substitution frequencies, such as how often one amino acid replaces another in evolutionarily recent branches, providing a robust empirical basis for the base mutation probability matrix.[1] A key challenge in the original 1970s data collection was the limited availability of protein sequences, restricting the dataset to just over 1,500 mutations and potentially leading to sparse counts for rare substitutions. 
Later efforts to update PAM-style matrices, such as the Gonnet matrix, address this by leveraging larger databases like Swiss-Prot (version 23, with approximately 27,000 sequences) for exhaustive pairwise alignments, followed by manual curation to exclude artifacts from point mutations, insertions, or deletions. Implementations nevertheless often retain the original Dayhoff data for consistency in comparative bioinformatics applications.[1][11]
Building the base mutation matrix
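As a minimal sketch of how tallied substitutions become a base matrix, the following Python example counts differences in toy alignments and scales the result to 1 PAM. It assumes the column convention (M[i][j] is the probability that original residue j is replaced by i) and Dayhoff-style scaling, M_{ij} = \lambda m_j A_{ij} / \sum_k A_{kj}. The 4-letter alphabet, the sequences, and the helper names `tally` and `build_pam1` are invented for illustration. The sketch also simplifies the original procedure: it compares sequence pairs directly instead of inferring ancestors from a phylogenetic tree, and it assumes every residue occurs at least once.

```python
ALPHABET = "AILV"  # toy alphabet; real PAM matrices cover all 20 amino acids

# Hypothetical ungapped alignments of closely related sequences (made-up data).
PAIRS = [
    ("AILVAILVAILV", "AILVALLVAILV"),
    ("LLIIVVAALLII", "LLIVVVAALLII"),
    ("VAVAILILVAVA", "VAVAILILVAVV"),
]


def tally(pairs, alphabet=ALPHABET):
    """Return symmetric exchange counts A[i][j] and per-residue occurrences."""
    index = {aa: k for k, aa in enumerate(alphabet)}
    size = len(alphabet)
    counts = [[0] * size for _ in range(size)]
    occurrences = [0] * size
    for seq_a, seq_b in pairs:
        for x, y in zip(seq_a, seq_b):
            occurrences[index[x]] += 1
            occurrences[index[y]] += 1
            if x != y:
                # Without an inferred ancestor the direction is unknown,
                # so each difference is counted in both directions.
                counts[index[x]][index[y]] += 1
                counts[index[y]][index[x]] += 1
    return counts, occurrences


def build_pam1(counts, occurrences, target=0.01):
    """Build a column-stochastic matrix scaled to `target` expected change."""
    size = len(counts)
    total = sum(occurrences)
    freqs = [occ / total for occ in occurrences]
    col_totals = [sum(counts[i][j] for i in range(size)) for j in range(size)]
    # Relative mutability: changes involving j per occurrence of j.
    mutability = [col_totals[j] / occurrences[j] for j in range(size)]
    # lambda is chosen so that sum_j f_j * lambda * m_j equals the target.
    lam = target / sum(f * m for f, m in zip(freqs, mutability))
    matrix = [[0.0] * size for _ in range(size)]
    for j in range(size):
        for i in range(size):
            if i != j and col_totals[j] > 0:
                matrix[i][j] = lam * mutability[j] * counts[i][j] / col_totals[j]
        matrix[j][j] = 1.0 - lam * mutability[j]
    return matrix, freqs


counts, occurrences = tally(PAIRS)
pam1, freqs = build_pam1(counts, occurrences)
for aa, row in zip(ALPHABET, pam1):
    print(aa, " ".join(f"{p:.4f}" for p in row))
```

Each printed row lists the probabilities of ending in that residue, so reading down a column gives the fate of one original residue. The columns sum to 1 and the frequency-weighted probability of change equals the 1% target.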
The construction of the base mutation matrix, known as the PAM1 matrix, transforms the observed substitution counts from closely related protein sequences into a probabilistic model of amino acid replacements. The process begins with the relative mutability of each amino acid, which quantifies its propensity to change relative to the others. The relative mutability m_j of amino acid j is defined as the total number of observed changes involving j divided by the total number of occurrences of j in the aligned sequences, averaged across phylogenetic blocks to account for varying sequence lengths and evolutionary distances.
Next, the mutation probabilities are derived by combining the observed substitution counts with these relative mutabilities, so that the matrix describes a Markov chain in which transitions depend only on the current state. Following the convention that M_{ij} is the probability that original amino acid j is replaced by amino acid i, the off-diagonal elements (i \neq j) are given by M_{ij} = \lambda m_j \frac{A_{ij}}{\sum_k A_{kj}}, where A_{ij} is the number of observed exchanges between amino acids i and j and \lambda is a constant of proportionality. The fraction distributes the changes of j among the possible replacements in proportion to the observed counts, while \lambda m_j sets the overall probability that j changes. The diagonal elements are then set so that each column sums to 1: M_{jj} = 1 - \sum_{i \neq j} M_{ij} = 1 - \lambda m_j, the probability that amino acid j remains unchanged.
To define the evolutionary scale, \lambda is chosen such that the matrix corresponds to an average of one accepted mutation per 100 sites, or 1 PAM unit; that is, the expected probability of change across all amino acids, weighted by their frequencies f_j, satisfies \sum_j f_j \lambda m_j = 0.01. This scaling makes the PAM1 matrix suitable as a baseline for extrapolating to greater evolutionary distances while maintaining consistency with observed data from sequences diverged by less than 15% in total replacements.
Extrapolation to PAM-n matrices
To model evolutionary distances beyond a single accepted point mutation, the base PAM1 matrix is extrapolated to PAM-n matrices by raising it to the power n, where n denotes the evolutionary distance in PAM units (one accepted mutation per 100 residues). This process treats amino acid substitutions as a Markov chain, with matrix multiplication representing the probabilities of changes over n successive steps.[10]
The entries of the PAM-n matrix are computed recursively through matrix multiplication: (PAM-n)_{ij} = \sum_k (PAM1)_{ik} \cdot (PAM-(n-1))_{kj}, allowing iterative calculation from PAM1 for small n; for larger n, eigenvalue decomposition of the PAM1 matrix enables more efficient exponentiation by diagonalizing the matrix and raising its eigenvalues to the power n.[12][10]
This extrapolation accounts for the effects of multiple substitutions that occur as sequences diverge over time, which cannot be captured by the PAM1 matrix alone. As n grows, the off-diagonal elements of PAM-n generally increase, reflecting higher substitution probabilities, while the matrices progressively approach an equilibrium state in which every column converges to the stationary distribution of amino acid frequencies.[10]
For practical applications in sequence alignment scoring, the PAM-n probability matrices are transformed into log-odds matrices using the formula S_{ij} = 10 \log_{10} \left( \frac{(PAM-n)_{ij}}{f_i} \right), where (PAM-n)_{ij} is the probability that original amino acid j is replaced by amino acid i and f_i is the background frequency of the replacement amino acid i. Positive scores indicate substitutions more likely than chance. The factor of 10 provides a convenient scaling for entries rounded to integers, in units of roughly one-third of a bit (since 10 \log_{10} x \approx 3 \log_2 x).
Mathematical Properties
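The extrapolation and log-odds conversion from the preceding section can be made concrete with a short sketch, which also previews the properties discussed in this section. It uses a hypothetical 3-state model, not the real 20×20 Dayhoff matrix. The frequencies `FREQS`, the exchangeabilities `EXCH`, and all function names are invented for illustration, and the column convention is assumed (entry [i][j] is the probability that original residue j is replaced by i).

```python
import math

FREQS = [0.5, 0.3, 0.2]  # hypothetical background frequencies (sum to 1)
EXCH = [[0, 2, 1],       # hypothetical symmetric exchangeabilities
        [2, 0, 1],
        [1, 1, 0]]


def toy_pam1(freqs, exch, target=0.01):
    """Column-stochastic toy matrix with `target` expected change per site."""
    size = len(freqs)
    raw = [[freqs[i] * exch[i][j] for j in range(size)] for i in range(size)]
    rate = sum(freqs[j] * raw[i][j] for i in range(size) for j in range(size))
    scale = target / rate
    matrix = [[scale * raw[i][j] for j in range(size)] for i in range(size)]
    for j in range(size):
        matrix[j][j] = 1.0 - sum(matrix[i][j] for i in range(size) if i != j)
    return matrix


def matmul(a, b):
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def matrix_power(matrix, n):
    """Raise `matrix` to a non-negative integer power by repeated squaring."""
    size = len(matrix)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    base = matrix
    while n > 0:
        if n & 1:
            result = matmul(result, base)
        base = matmul(base, base)
        n >>= 1
    return result


def log_odds(matrix, freqs):
    """Unrounded scores 10*log10(M[i][j] / f[i]); i is the replacement."""
    size = len(matrix)
    return [[10.0 * math.log10(matrix[i][j] / freqs[i]) for j in range(size)]
            for i in range(size)]


pam1 = toy_pam1(FREQS, EXCH)
pam250 = matrix_power(pam1, 250)
scores = [[round(s) for s in row] for row in log_odds(pam250, FREQS)]
print(scores)
```

Repeated squaring is used here as a simple stand-in for the eigenvalue approach; both compute the same matrix power. Because the toy matrix is built from symmetric exchangeabilities, the resulting score matrix is symmetric, and for very large n every column of the powered matrix approaches `FREQS`.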
Symmetry and diagonal elements
The mutation probability matrix M in PAM models is asymmetric, with M_{ij} \neq M_{ji} in general for i \neq j. With M_{ij} denoting the probability that original amino acid j is replaced by amino acid i, this asymmetry stems from the construction of M: each off-diagonal entry is proportional to the relative mutability of the original amino acid j (the likelihood of j undergoing change), and replacements by common amino acids occur more often than replacements by rare ones. Rare amino acids, which have a low background frequency f_i, are thus less likely to appear as substitutes, even if the original amino acid is highly mutable.[1]
Diagonal elements M_{ii} represent the probability that amino acid i remains unchanged over the specified evolutionary distance. In the PAM1 matrix, these elements dominate strongly, with values close to 1 (approximately 0.99 on average), as this matrix models minimal divergence where only about 1% of sites experience accepted mutations. For higher PAM-n matrices, diagonal dominance weakens progressively, with M_{ii} decreasing as n increases, since greater evolutionary time allows more substitutions to accumulate and reduces the likelihood of no change.[13]
Off-diagonal elements M_{ij} (for i \neq j) capture substitution probabilities and follow patterns aligned with physicochemical properties. Higher values occur for exchanges between similar amino acids, such as hydrophobic residues (e.g., leucine and isoleucine) or charged ones (e.g., aspartate and glutamate), because such replacements are more readily accepted during evolution without compromising protein stability or function. These patterns emerge from empirical counts of observed mutations in closely related proteins, weighted by mutabilities and frequencies.[14]
The overall structure of M embodies a reversible evolutionary process biased by amino acid frequencies. Reversibility is ensured through detailed balance, where the flux from j to i equals that from i to j at equilibrium (f_j M_{ij} = f_i M_{ji}), allowing the model to maintain stationary frequencies over time.
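The asymmetry and detailed-balance conditions can be checked numerically. The following is a minimal sketch on a hypothetical 3-state matrix, assuming the column convention in which M[i][j] is the probability that original residue j is replaced by i. The values of `M` and `FREQS` and the helper names are invented for illustration and are not taken from Dayhoff's data.

```python
import math

FREQS = [0.5, 0.3, 0.2]  # hypothetical background frequencies

# Hypothetical 1-PAM-like matrix; each column sums to 1.
M = [
    [0.992, 0.010, 0.005],
    [0.006, 0.988, 0.003],
    [0.002, 0.002, 0.992],
]


def is_symmetric(matrix, tol=1e-12):
    """True if M[i][j] equals M[j][i] for every pair of states."""
    size = len(matrix)
    return all(math.isclose(matrix[i][j], matrix[j][i], abs_tol=tol)
               for i in range(size) for j in range(size))


def satisfies_detailed_balance(matrix, freqs, tol=1e-12):
    """True if f_j * M[i][j] equals f_i * M[j][i] for every pair of states."""
    size = len(matrix)
    return all(math.isclose(freqs[j] * matrix[i][j],
                            freqs[i] * matrix[j][i], abs_tol=tol)
               for i in range(size) for j in range(size))


def evolve(matrix, composition):
    """Apply one step of the chain to a composition vector."""
    size = len(matrix)
    return [sum(matrix[i][j] * composition[j] for j in range(size))
            for i in range(size)]


print("symmetric:", is_symmetric(M))
print("detailed balance:", satisfies_detailed_balance(M, FREQS))
print("composition after one step:", evolve(M, FREQS))
```

The matrix itself is not symmetric, yet the frequency-weighted fluxes are, so applying the matrix to the background composition returns that composition unchanged up to floating-point rounding.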
The frequency bias, however, introduces directionality in short-term probabilities, reflecting how natural selection and mutational patterns favor substitutions toward prevalent amino acids while permitting back-mutations at rates consistent with equilibrium.[1]
Relating accepted mutations to evolutionary distance
The evolutionary distance measured in PAM units quantifies the expected number of accepted point mutations per 100 amino acids between two protein sequences. Specifically, 1 PAM unit corresponds to approximately 1% observed amino acid differences for closely related sequences, providing a standardized scale for divergence.[15][3]
As evolutionary distance increases, however, the observed differences d between sequences underestimate the true number of accepted mutations because of the multiple-hits problem. Individual sites can accumulate several substitutions over time, including back-mutations that revert to the original amino acid and parallel mutations that converge on the same alternative amino acid, so some changes are invisible in direct comparisons. PAM-n matrices account for this because they are obtained by raising the base PAM1 matrix to the power n, which probabilistically incorporates the effects of multiple substitutions at the same site.[3]
The connection between observed differences and PAM units can be expressed approximately by the formula
d \approx 100 \left(1 - e^{-n/100}\right),
where d is the percent observed amino acid differences and n is the number of PAM units. For small n, this reduces to d \approx n, in line with the foundational definition of PAM distance. The formula follows from treating substitutions as a Poisson process: with an expected n/100 substitutions per site, the probability that a site receives no substitution is e^{-n/100}, so the probability of at least one substitution, counted here as an observable change, is 1 - e^{-n/100}, and the percent observed difference is d = 100 (1 - e^{-n/100}). The Taylor expansion 1 - e^{-x} \approx x for small x = n/100 gives the low-divergence limit.[3] Because this approximation treats every site hit at least once as different, it ignores back-mutations and therefore tends to overestimate d at large n.
While 1 PAM unit equates to about 1% observed amino acid changes, this masks a higher level of underlying genetic evolution, with estimates indicating roughly 3-5% actual nucleotide changes per site, primarily due to the accumulation of synonymous substitutions that do not affect the protein sequence but occur at a faster neutral rate.[16]
Despite these relations, the PAM framework carries limitations, as it presumes a uniform mutation rate over time and across sites, ignoring heterogeneities such as varying selective constraints or rate accelerations in specific genomic contexts.[3]
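The approximate relation between PAM distance and observed difference is easy to evaluate directly. The sketch below implements only the Poisson approximation given in this section, d = 100 (1 - e^{-n/100}), and its inverse. It does not reproduce Dayhoff's tabulated correspondence, and the function names are illustrative.

```python
import math


def observed_percent_difference(pam_distance):
    """Approximate percent observed difference d for a distance of n PAM."""
    return 100.0 * (1.0 - math.exp(-pam_distance / 100.0))


def pam_from_observed(percent_difference):
    """Invert the approximation: estimate n (in PAM) from d (in percent)."""
    if not 0.0 <= percent_difference < 100.0:
        raise ValueError("percent_difference must be in [0, 100)")
    return -100.0 * math.log(1.0 - percent_difference / 100.0)


for n in (1, 30, 100, 250):
    d = observed_percent_difference(n)
    print(f"{n:>3} PAM -> about {d:.1f}% observed difference")
```

At 1 PAM the approximation gives a value just under 1%, consistent with d \approx n for small n. At larger distances the curve flattens, which is why the inverse mapping becomes increasingly sensitive to small errors in the observed difference.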