
Structural alignment

Structural alignment is a computational technique in bioinformatics that superimposes and compares the three-dimensional structures of biological macromolecules, primarily proteins and nucleic acids, to establish correspondences between their atomic coordinates and identify similarities in spatial arrangements, folds, and functional motifs, irrespective of sequence homology.^[1] This approach is essential for detecting evolutionary relationships and functional conservation in proteins with low sequence identity, where traditional sequence-based methods fail, as it reveals shared structural cores that imply common ancestry or biochemical roles.^[2] For instance, globins like hemoglobin and neuroglobin exhibit structural similarity despite diverging sequences, highlighting alignment's role in uncovering distant homologs.^[1] Key applications include homology modeling, functional annotation, and the classification of protein domains into folds within databases such as SCOP, CATH, and FSSP, which rely on structural alignments to organize the Protein Data Bank (PDB).^[2] Methods vary from rigid-body superimpositions, which assume fixed conformations for closely related structures, to flexible alignments that accommodate domain movements and loops, as in tools like FATCAT and DALI developed since the 1990s.^[2]^[3] Topology-independent approaches further handle permutations, such as circular shifts in chain connectivity, enhancing accuracy for diverse superfamilies.^[1] Overall, structural alignment underpins phylogenetic analyses and structure prediction pipelines, like threading, by quantifying similarity via metrics such as root-mean-square deviation (RMSD) of aligned atoms.^[3]

Fundamentals

Definition and Principles

Structural alignment is a computational method used to compare three-dimensional (3D) structures of biomolecules, primarily proteins, by establishing correspondences between their atomic coordinates to reveal similarities in spatial arrangement despite limited sequence similarity.^[2] This process identifies equivalent residues or atoms, enabling the superposition of structures to assess shared folds or functional motifs that may indicate evolutionary relationships.^[4] While most commonly applied to proteins, the approach extends to other macromolecules such as nucleic acids, and to ligands, where 3D geometry informs homology.^[5]

To contextualize structural alignment, protein structures are hierarchically organized: the primary structure refers to the linear amino acid sequence; secondary structure encompasses local patterns such as α-helices and β-strands stabilized by hydrogen bonds; and tertiary structure describes the global 3D fold resulting from non-covalent interactions.^[6] Structural alignment operates at the tertiary level, focusing on coordinate-based comparisons rather than sequence, as protein evolution conserves 3D folds more robustly than primary sequences, allowing detection of distant homologs with low sequence identity (often below 25%).

The core principles involve establishing a residue-to-residue (or atom-to-atom) correspondence and applying rigid-body transformations (rotations and translations) to overlay the structures optimally.^[7] This superposition minimizes spatial discrepancies, and the resulting similarity is quantified through metrics like root-mean-square deviation (RMSD), which depend on spatial agreement rather than on sequential order.^[8] Unlike sequence alignment, which relies on linear matching, structural alignment accounts for topological equivalences and evolutionary drifts in backbone geometry.^[7]

Historically, structural alignment emerged in the 1970s with pioneering work by Rossmann and Argos, who developed systematic methods to explore homology by comparing backbone conformations across proteins, initially focusing on shared functional sites like enzyme active centers.^[7] Their approach laid the foundation for modern tools by introducing iterative superposition techniques to detect subtle similarities.^[7]

A key aspect is the least-squares fitting to minimize RMSD, formulated as:

\text{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\| \mathbf{r}_i - (R \mathbf{s}_i + \mathbf{t}) \right\|^2},

where N is the number of aligned points, \mathbf{r}_i and \mathbf{s}_i are the coordinates of corresponding atoms in the target and template structures, R is the optimal rotation matrix, and \mathbf{t} is the translation vector; the goal is to find R and \mathbf{t} that minimize this value.

Applications

Structural alignment serves as a cornerstone in evolutionary studies within bioinformatics, enabling the detection of remote homologs—proteins sharing a common ancestor but exhibiting sequence similarity below 20%—by focusing on conserved three-dimensional folds rather than linear sequences. This approach reveals evolutionary relationships that sequence-based methods often miss, particularly for divergent protein families. For instance, databases such as SCOP (Structural Classification of Proteins) and CATH (Class, Architecture, Topology, Homologous superfamily) rely on structural alignments to classify proteins into hierarchical superfamilies, facilitating the systematic organization of the protein universe and inference of evolutionary histories.^[9]^[10]^[11] In the realm of function prediction, structural alignment aids in inferring biochemical roles, especially for enzymes, by mapping conserved structural motifs like active sites across related proteins. Alignments highlight fold conservation that correlates with functional similarity, even amid sequence divergence. A prominent example is the TIM barrel superfamily, a ubiquitous enzyme fold comprising eight α/β units, where structural alignments have elucidated shared catalytic mechanisms in diverse enzymes such as triosephosphate isomerase and pyruvate kinase, allowing predictions of active site residues and substrate specificity for novel members.^[12]^[13]^[14] For protein structure prediction and modeling, structural alignment underpins template-based approaches in homology modeling pipelines, where a target sequence is aligned to experimentally determined templates to generate atomic models. 
Tools like MODELLER exemplify this by using alignments to satisfy spatial restraints derived from template structures, producing reliable models for targets with detectable homologs and supporting downstream analyses in structural biology.^[15] In drug discovery, structural alignment facilitates the identification of analogous binding pockets in related proteins, enabling the comparison of ligand-binding sites to guide inhibitor design or drug repurposing. This is particularly valuable in virtual screening workflows, where aligned structures allow the docking of compound libraries against multiple targets, prioritizing hits based on conserved pharmacophores and accelerating lead optimization.^[16]^[17] Database searching represents another key application, where structural alignment tools query repositories like the Protein Data Bank (PDB) to retrieve proteins with similar folds, aiding in the functional annotation of newly solved structures. Servers such as DALI perform exhaustive 3D comparisons to generate alignments and similarity scores, helping researchers contextualize novel folds within the broader structural landscape.^[18]^[19]

Representations and Data

Structure Representations

Protein structures for alignment are primarily represented by atomic coordinates in the Protein Data Bank's PDBx/mmCIF format, the current standard, or in the legacy PDB format, where each atom's position is specified by Cartesian coordinates x, y, and z in angstroms within ATOM or HETATM records.^[20] These coordinates capture the three-dimensional arrangement of all heavy atoms in the macromolecule, enabling precise spatial comparisons.^[20]

To enhance computational efficiency, especially for large-scale alignments, representations are often simplified to Cα atoms only, focusing on the alpha carbon of each residue to form a polyline backbone.^[21] This reduction preserves the overall fold while drastically lowering the number of points from thousands to hundreds per protein, facilitating faster superposition and similarity calculations.^[22]

Secondary structure encodings provide a higher-level abstraction, with the Dictionary of Secondary Structure of Proteins (DSSP) algorithm assigning states such as α-helices (H), β-strands (E), and coils (C or loop regions) based on hydrogen bonding patterns between backbone atoms.^[23] These assignments reduce the structure to a sequence of categorical labels, aiding in alignment by matching regular secondary elements while accommodating irregular coils.^[24]

Backbone dihedral angles offer a compact vector-based representation, where the conformation of each residue is encoded by the φ (phi) angle around the N-Cα bond and the ψ (psi) angle around the Cα-C bond, typically ranging from -180° to 180°.^[25] These angles form a sequence vector that captures local geometry without relying on absolute positions, useful for comparing flexible regions.^[26]

Further abstractions include contact maps, which are binary matrices indicating pairs of residues in close proximity (e.g., Cα distance < 8 Å), torsion angle sequences beyond just φ and ψ (such as ω for peptide bonds), and intra-molecular distance matrices defined as

D_{ij} = \| \mathbf{pos}_i - \mathbf{pos}_j \|,

where \mathbf{pos}_i and \mathbf{pos}_j are the 3D coordinates of residues i and j.^[27]^[28] Distance matrices, in particular, transform the structure into a rotation- and translation-invariant form, ideal for alignment algorithms like DALI that optimize matrix overlaps.^[29]

Full atomic representations excel in capturing fine details like side-chain interactions and hydrogen bonds but incur high computational costs due to the large number of atoms, making them less suitable for rapid screening of protein databases.^[30] In contrast, Cα-only or abstracted models prioritize speed and scalability, though they may overlook subtle differences in flexible loops, where multi-conformer averaging or ensemble representations are sometimes employed to account for conformational variability.^[22]^[30]
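As an illustration of the Cα-based abstractions described above, the following minimal sketch builds an intra-molecular distance matrix and a binary contact map with NumPy. The function names, the toy coordinates, and the 8 Å cutoff default are assumptions made for this example and are not taken from any particular alignment package.

```python
import numpy as np

def distance_matrix(ca):
    """Pairwise Euclidean distances D[i, j] between C-alpha positions (N x 3 array)."""
    ca = np.asarray(ca, dtype=float)
    diff = ca[:, None, :] - ca[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def contact_map(ca, cutoff=8.0):
    """Boolean matrix that is True where two different residues lie closer than `cutoff` angstroms."""
    contacts = distance_matrix(ca) < cutoff
    np.fill_diagonal(contacts, False)  # a residue is not counted as its own contact
    return contacts

# Toy coordinates (hypothetical, not a real PDB entry): five points 3.8 A apart on a line.
toy_ca = np.array([[3.8 * i, 0.0, 0.0] for i in range(5)])
D = distance_matrix(toy_ca)
C = contact_map(toy_ca)

# Rotating and translating the structure should leave the distance matrix unchanged.
angle = 0.7
rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                [np.sin(angle),  np.cos(angle), 0.0],
                [0.0, 0.0, 1.0]])
D_moved = distance_matrix(toy_ca @ rot.T + np.array([5.0, -2.0, 1.0]))
print(np.allclose(D, D_moved))
```

The final check illustrates why distance matrices are convenient for alignment: two copies of a structure in different orientations produce the same matrix, so no superposition is needed before comparing them.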

Outputs from Alignments

Structural alignment processes generate tangible outputs that capture residue correspondences, quantify similarity, and enable visualization of three-dimensional overlaps between protein structures. These outputs are essential for downstream analyses in bioinformatics, such as evolutionary inference and functional annotation.^[31] Alignment files primarily consist of residue mappings that identify equivalent residues across compared structures, often presented as lists of corresponding amino acids with their positions. These mappings are exported in standardized formats, including the PIR (Protein Information Resource) format, which includes protein identifiers, sequence alignments, and annotations for structural modeling applications, and aligned PDB files that incorporate superimposed atomic coordinates for direct use in visualization software. For instance, tools like CE-MC produce such files to represent multi-structure alignments with preserved spatial relationships.^[32]^[31] Quantitative data derived from alignments include the root-mean-square deviation (RMSD), a metric of atomic fit quality computed for global structures or local subsets after superposition, typically reported in angstroms to indicate average displacement. Additionally, the number of aligned residues provides a count of the overlapping structural elements, reflecting the coverage of similarity. These values are standard in outputs from methods like TM-align, where they benchmark alignment robustness.^[33]^[34] Visual outputs feature superimposed models, where aligned structures are overlaid in three-dimensional space to reveal conserved folds, and difference maps that depict positional variances, often color-coded by deviation magnitude for intuitive interpretation. 
Such representations are generated by servers like SuperPose, facilitating rapid assessment of conformational changes.^[35] Unlike sequence alignments, where gaps denote insertions or deletions (indels), structural alignments treat gaps as regions of non-equivalent topology or flexibility, preserving continuous residue chains without implying evolutionary events. Outputs from the DALI server exemplify this by providing gap-inclusive equivalence lists in searchable datasets, enabling analysis of discontinuous motifs in protein families.^[36]^[37]

Comparison Methods

Superposition Techniques

Superposition techniques form a foundational approach in structural alignment, focusing on geometrically overlaying molecular structures, particularly proteins, to identify spatial correspondences between their atomic coordinates. These methods assume rigid-body transformations—rotations and translations—without deformations, aiming to minimize the distance between equivalent atoms in the two structures. The core objective is to find the optimal transformation that aligns the structures as closely as possible, often serving as an initial step before more sophisticated alignment refinements. Rigid-body superposition typically employs least-squares minimization to determine the best rotation and translation parameters. This involves iteratively adjusting the positions of one structure relative to the other to reduce the sum of squared distances between corresponding points. A seminal iterative method is the Kabsch algorithm, which efficiently computes the optimal rotation matrix using singular value decomposition (SVD). The process begins with centering both sets of atomic coordinates by subtracting their respective centroids, ensuring the structures are translationally aligned at their centers of mass. Next, a correlation matrix is constructed from the centered coordinates, and SVD is applied to derive the rotation matrix that minimizes the root-mean-square deviation (RMSD), a common byproduct measure of alignment quality. This approach is computationally efficient and widely adopted for pairwise alignments of protein backbones, such as Cα atoms.^[38] In cases of multi-domain proteins connected by flexible linkers, global superposition may distort the alignment by forcing rigid overlay across mobile regions, leading to poor fits in individual domains. Local superposition addresses this by treating subdomains as independent rigid bodies, superposing them separately to account for inter-domain flexibility. 
This segmented approach enhances accuracy for proteins with conformational variability due to linker dynamics.^[30] To initiate superposition, especially for distantly related structures, initial seeding guides the selection of corresponding residues. Sequence alignments provide a preliminary mapping based on amino acid similarity, while secondary structure matches—such as aligning α-helices or β-sheets by direction and length—offer geometric priors to identify potential equivalents. Methods like TM-align combine gapless threading with secondary structure similarity to generate an initial set of aligned residues, which then informs the rigid-body transformation. These seeding strategies reduce search space and improve convergence in iterative superposition.^[39] A key challenge in superposition arises from insertions and deletions (indels) that manifest as topological discontinuities in 3D space, complicating the identification of equivalent points. Unlike sequence alignments, where indels are handled linearly, 3D indels require accounting for spatial gaps that can skew rotation estimates if not masked, potentially inflating RMSD in unaffected regions. Techniques often involve excluding indel-affected segments during initial overlay or using dynamic programming to tolerate such variations, though this increases computational demands for large structures.^[40]^[41]
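The centering, correlation-matrix, and SVD steps of the Kabsch procedure described above can be sketched in a few lines of NumPy. This is an illustrative sketch for two equally sized, already-paired coordinate sets; the function name `kabsch` and the synthetic test data are assumptions of this example.

```python
import numpy as np

def kabsch(mobile, target):
    """Return (R, t, rmsd) such that mobile @ R.T + t fits target in the least-squares sense."""
    P = np.asarray(mobile, dtype=float)
    Q = np.asarray(target, dtype=float)
    p_cen, q_cen = P.mean(axis=0), Q.mean(axis=0)   # centroids of both point sets
    H = (P - p_cen).T @ (Q - q_cen)                 # 3 x 3 correlation matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1), not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_cen - R @ p_cen
    fitted = P @ R.T + t
    rmsd = float(np.sqrt(((fitted - Q) ** 2).sum(axis=1).mean()))
    return R, t, rmsd

# Synthetic check (hypothetical data): rotate and shift random points, then recover the motion.
rng = np.random.default_rng(0)
coords = rng.normal(size=(20, 3))
angle = np.pi / 5
true_R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
true_t = np.array([1.0, -2.0, 0.5])
moved = coords @ true_R.T + true_t

R, t, rmsd = kabsch(coords, moved)
print(np.allclose(R, true_R), np.allclose(t, true_t), rmsd < 1e-8)
```

The function assumes the residue correspondence is already known. In a full alignment method, the correspondence itself is what the seeding and refinement steps have to discover.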

Similarity Evaluation

Similarity evaluation in structural alignment quantifies the degree of resemblance between superimposed protein structures, providing a numerical basis for assessing evolutionary relationships, functional analogies, or modeling accuracy. These metrics address limitations of raw superposition by incorporating distance-based deviations, topological features, and statistical significance, often normalized to enable comparisons across proteins of varying sizes. The root-mean-square deviation (RMSD) is a foundational metric, defined as the square root of the average squared distances between corresponding atoms after optimal superposition:

\text{RMSD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d_i^2},

where d_i are the Euclidean distances for n atom pairs. Global RMSD evaluates the entire aligned structure, emphasizing overall fit but sensitive to outliers in flexible regions. Local RMSD, in contrast, focuses on specific subsets like active sites, mitigating the impact of conformational variability elsewhere. Backbone RMSD, typically computed using C\alpha atoms, reduces noise from side-chain orientations, while all-atom RMSD includes heavy atoms for a more comprehensive but noisier assessment.^[42] Advanced scores overcome RMSD's length dependence and sensitivity to alignment gaps. The TM-score, ranging from 0 to 1, measures topological similarity in a size-independent manner:

\text{TM-score} = \max \left[ \frac{1}{L} \sum_{i=1}^{L_{\text{ali}}} \frac{1}{1 + \left( \frac{d_i}{d_0} \right)^2 } \right],

where L is the length of the reference protein used for normalization (often the shorter chain when two structures are compared), L_{\text{ali}} the number of aligned residues, d_i the distance between aligned residue pairs, and d_0 = 1.24 \sqrt[3]{L - 15} - 1.8 (in Å) a length-dependent distance scale; the maximum is taken over superpositions, and values above 0.5 generally indicate the same fold regardless of size.^[43] The Global Distance Test Total Score (GDT-TS), used prominently in structure prediction assessments, averages the percentages of residues within distance cutoffs of 1 Å, 2 Å, 4 Å, and 8 Å from their references, yielding a 0-100 scale robust to local distortions.^[44]

Other metrics provide complementary insights into significance and higher-order features. The Z-score assesses statistical relevance by standardizing a raw similarity score against a distribution of random alignments: Z = (S - \mu_S)/\sigma_S, where S is the score, and \mu_S, \sigma_S are the mean and standard deviation from null models; high Z-scores (e.g., >6) denote non-random similarity. Contact overlap evaluates preservation of intra-molecular interactions by comparing binary contact maps, often using C\beta-C\beta distances within 8 Å, normalized as the fraction of shared contacts to gauge fold conservation.

Normalization is essential for fair comparisons, adjusting for protein length, resolution, and alignment coverage to avoid bias toward larger structures. For instance, RMSD increases with chain length for random pairs, while TM-score and GDT-TS incorporate length scaling; thresholds like TM-score >0.5 or GDT-TS >60 are commonly used to identify homologous folds in benchmarks. These metrics collectively enable robust interpretation of alignments, prioritizing fold-level conservation over atomic precision.^[45]
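To make the behaviour of these scores concrete, the sketch below evaluates RMSD, a TM-score-style sum, GDT-TS, and a Z-score from a list of per-residue distances after a fixed superposition. It uses the published TM-score distance scale d_0 = 1.24 (L - 15)^{1/3} - 1.8 Å. Several simplifications are assumptions of this example: real TM-score and GDT-TS implementations search over many superpositions, whereas here the superposition is taken as given, and the 0.5 Å floor on d_0 for short chains mirrors a common implementation choice rather than the formula itself. All function names are hypothetical.

```python
import numpy as np

def tm_d0(length):
    """Length-dependent distance scale d0 in angstroms, floored at 0.5 (assumed floor)."""
    if length <= 21:
        return 0.5
    return max(0.5, 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8)

def rmsd_from_distances(distances):
    d = np.asarray(distances, dtype=float)
    return float(np.sqrt((d ** 2).mean()))

def tm_score_fixed(distances, norm_length):
    """TM-score-style sum for one fixed superposition (no search over superpositions)."""
    d = np.asarray(distances, dtype=float)
    d0 = tm_d0(norm_length)
    return float((1.0 / (1.0 + (d / d0) ** 2)).sum() / norm_length)

def gdt_ts_fixed(distances, ref_length, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of reference residues within each cutoff, for one superposition."""
    d = np.asarray(distances, dtype=float)
    fractions = [(d <= c).sum() / ref_length for c in cutoffs]
    return 100.0 * float(np.mean(fractions))

def z_score(score, null_scores):
    """Standardize a raw score against scores from a null model."""
    null = np.asarray(null_scores, dtype=float)
    return float((score - null.mean()) / null.std())

# Hypothetical 100-residue alignment: 90 residues fit well, 10 loop residues are far off.
distances = np.array([0.5] * 90 + [15.0] * 10)
rmsd = rmsd_from_distances(distances)
tm = tm_score_fixed(distances, norm_length=100)
gdt = gdt_ts_fixed(distances, ref_length=100)
print(round(rmsd, 2), round(tm, 2), round(gdt, 1))
```

In this toy case the ten outlying residues dominate the RMSD, while the TM-score-style sum and GDT-TS remain high because most residues superimpose closely. This is the outlier sensitivity that motivates the normalized scores.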

Algorithms

Complexity Analysis

Pairwise structural alignment of proteins can be formulated as a graph matching problem, where protein structures are represented as graphs with residues as nodes and spatial interactions (such as contacts or distance constraints) as edges, requiring the identification of a maximum-weight matching between the two graphs.^[46] Alternatively, it is viewed as an optimization over permutations of residue indices to maximize a scoring function that accounts for spatial superposition and sequence similarity.^[46] These formulations capture the core challenge of aligning 3D coordinates while respecting the sequential order of residues, often leading to integer linear programming models with variables representing possible residue correspondences.^[46]

The optimal pairwise structural alignment problem is NP-hard, akin to the subgraph isomorphism problem, due to the combinatorial explosion in evaluating all possible residue mappings under 3D geometric constraints.^[47] Exact solvers therefore have exponential worst-case running times in N, the number of residues, for example through exhaustive enumeration in branch-and-bound frameworks; even their polynomial-time components, such as dynamic programming over distance matrices, cost O(N^3) or more, and intermediate representations can require O(N^4) space for unequal-length proteins.^[46] In contrast to 1D sequence alignment, which is solvable in polynomial time via O(N^2) dynamic programming, the 3D dimensionality introduces non-local geometric dependencies that preclude efficient exact optimization.^[40] Modeling protein flexibility, such as through gap penalties for loops or non-rigid deformations, further exacerbates complexity by expanding the search space with variable-length insertions and tolerance for conformational variations.^[40]

These computational demands have significant practical implications, necessitating approximate methods for large-scale applications like database-wide searches across the Protein Data Bank (PDB), where exact alignment of thousands of structures against a query would be infeasible due to exponential runtimes even for moderate N (e.g., 100-300 residues).^[46] While exact algorithms remain valuable for verifying optimality in small, high-confidence cases, approximations enable scalable analyses but may sacrifice global optimality, as explored in subsequent sections on exact and approximate techniques.^[46]

Exact Algorithms

Exact algorithms for structural alignment seek to compute globally optimal solutions by exhaustively exploring the alignment space, often adapting techniques from sequence alignment to account for three-dimensional geometry. These methods guarantee the best possible alignment under a defined scoring function, such as maximizing the overlap of inter-residue distances or contact maps, but at the cost of high computational complexity. The structural alignment problem is NP-hard, making exact approaches impractical for large proteins but valuable for precise benchmarking of approximate methods.^[46] One prominent class involves extensions of dynamic programming, particularly adaptations of the Needleman-Wunsch algorithm originally developed for sequence alignment. In structural contexts, these employ double dynamic programming to align pairs of residues while incorporating geometric constraints, such as Cα atom distances or secondary structure similarities, to evaluate 3D compatibility. Another key category utilizes integer programming formulations, which model the alignment as an optimization problem minimizing an energy-like function that captures structural similarity through distance matrices or overlap scores. These are typically solved via branch-and-bound or Lagrangian relaxation techniques to find the global optimum. A representative example is PAUL (Protein Alignment Using Lagrangian), which formulates pairwise alignment as an integer linear program based on inter-residue distances, using Lagrangian relaxation with variable splitting to efficiently compute optimal solutions by dualizing constraints and iteratively improving bounds. 
Similarly, DALIX applies integer programming with branch-and-cut methods to optimize the DALI scoring function, which rewards agreement between corresponding intra-molecular distances, achieving exact alignments that outperform heuristic baselines in up to 85% of benchmark cases across protein folds.^[48]^[49]

Despite their precision, exact algorithms are limited to small proteins, typically under 100 residues, because of their worst-case exponential running time and memory demands that often reach O(n²m²) for proteins of lengths n and m. For pairs around 200 residues, runtimes can extend to hours on standard hardware, as seen in benchmarks where DALIX requires up to 30 CPU hours for challenging instances from datasets like SCOP or RIPC. Consequently, these methods see rare practical use beyond benchmarking and validation of faster heuristics, prioritizing conceptual optimality over scalability in routine analyses.^[50]^[46]

Approximate Algorithms

Approximate algorithms for structural alignment address the computational challenges of exact methods by employing heuristics that achieve near-optimal solutions in polynomial time, motivated by the NP-hard nature of the problem for general scoring functions. These approaches prioritize efficiency for large-scale analyses while maintaining sufficient accuracy for biological insights, often running in O(n^2) time where n is the number of residues.^[51] Iterative optimization forms a core heuristic in approximate structural alignment, typically beginning with an initial seed alignment and refining it through repeated adjustments to minimize structural dissimilarity. For instance, algorithms sample rigid-body transformations using quaternion representations to align structures, iteratively optimizing scores like root-mean-square deviation (RMSD) via gradient-based or Monte Carlo methods until convergence. This process starts with a coarse superposition and progressively refines residue correspondences, enabling handling of globular proteins with approximation guarantees within a factor ε of the optimum. Seeding strategies enhance this by initializing matches based on sequence similarity or secondary structure elements, such as identifying conserved helices or sheets to anchor the alignment before extension. These seeds reduce the search space, allowing iterative refinement to explore local optima efficiently without exhaustive enumeration. For example, the SSAP (Sequential Structure Alignment Program) method, introduced by Orengo and colleagues, uses iterated double dynamic programming to generate high-scoring alignments by iteratively refining residue pairings based on vector representations of local geometry. 
This approach effectively handles the additional dimensionality of protein structures compared to linear sequences.^[51]^[52]^[53] Progressive alignment extends pairwise approximations to multiple structures by hierarchically building alignments, starting with highly similar pairs and iteratively incorporating additional structures based on a guide tree derived from initial similarity scores. This method aligns two structures or sub-alignments at each step using dynamic programming on a simplified scoring function, propagating correspondences transitively to form a global multiple alignment. For example, tree-based progressive schemes first compute pairwise alignments and then merge them progressively, balancing the inclusion of distant homologs with computational tractability. Seeding in progressive contexts often leverages fragment-based matches from secondary structures to initiate pair alignments, ensuring robustness to conformational variations.^[54] These approximate strategies involve inherent trade-offs between accuracy and speed, where heuristics like O(n^2) dynamic programming for fixed transformations yield high-quality alignments for proteins up to thousands of residues but may sacrifice optimality for very divergent structures. Iterative and progressive methods typically achieve 90-95% of exact scores in seconds to minutes, compared to hours for exact algorithms, making them suitable for database-scale searches while occasionally missing subtle local similarities. Quantitative benchmarks show that such approximations scale to alignments of 10-20 structures with RMSD errors under 2 Å for homologous pairs, prioritizing practical utility in evolutionary and functional studies.^[51]^[54]
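The seed-then-refine loop described in this section can be illustrated with a small sketch that alternates between rigid-body superposition and O(nm) dynamic programming on a distance-derived score. It is a simplified illustration of the general heuristic, not the procedure of any named tool. The gapless seed, the score 1/(1 + (d/d0)^2), the gap penalty of -0.6, the distance scale d0 = 4 Å, and all function names are assumptions chosen for this example.

```python
import numpy as np

def superpose(P, Q):
    """Least-squares rotation R and translation t mapping rows of P onto rows of Q."""
    pc, qc = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - pc).T @ (Q - qc))
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0   # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, qc - R @ pc

def dp_align(score, gap=-0.6):
    """Global dynamic programming over a residue-pair score matrix; returns matched index pairs."""
    n, m = score.shape
    F = np.zeros((n + 1, m + 1))
    F[:, 0] = gap * np.arange(n + 1)
    F[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i, j] = max(F[i - 1, j - 1] + score[i - 1, j - 1],
                          F[i - 1, j] + gap, F[i, j - 1] + gap)
    pairs, i, j = [], n, m
    while i > 0 and j > 0:                      # trace back the best path
        if F[i, j] == F[i - 1, j - 1] + score[i - 1, j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif F[i, j] == F[i - 1, j] + gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def iterative_align(A, B, d0=4.0, rounds=10):
    """Alternate superposition and DP re-alignment, starting from a gapless seed."""
    pairs = [(i, i) for i in range(min(len(A), len(B)))]
    for _ in range(rounds):
        ia, ib = (list(x) for x in zip(*pairs))
        R, t = superpose(A[ia], B[ib])
        dist = np.linalg.norm((A @ R.T + t)[:, None, :] - B[None, :, :], axis=-1)
        new_pairs = dp_align(1.0 / (1.0 + (dist / d0) ** 2))
        if new_pairs == pairs or len(new_pairs) < 3:
            break                               # converged, or too few pairs to superpose
        pairs = new_pairs
    ia, ib = (list(x) for x in zip(*pairs))
    R, t = superpose(A[ia], B[ib])
    return pairs, R, t

# Hypothetical easy case: B is a rotated, shifted, slightly noisy copy of a random-walk chain A.
rng = np.random.default_rng(3)
steps = rng.normal(size=(30, 3))
steps *= 3.8 / np.linalg.norm(steps, axis=1, keepdims=True)
A = np.cumsum(steps, axis=0)
c, s = np.cos(1.1), np.sin(1.1)
rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
B = A @ rot.T + np.array([10.0, 0.0, -4.0]) + rng.normal(scale=0.1, size=A.shape)

pairs, R, t = iterative_align(A, B)
rmsd = float(np.sqrt((((A @ R.T + t) - B) ** 2).sum(axis=1).mean()))
print(len(pairs), rmsd < 0.5)
```

The demonstration deliberately uses an easy case in which the gapless seed is already correct. For structures with internal insertions or large divergence, a single seed of this kind can converge to a poor local optimum, which is why practical heuristics typically try several seeds.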

Core Methods

Matrix-Based Methods

Matrix-based methods for structural alignment of proteins rely on representing three-dimensional structures as matrices of distances or similarities between residues, enabling the detection of topological equivalences even in the presence of conformational variations. These approaches typically construct intra-molecular distance matrices from Cα atomic coordinates, where each element reflects the Euclidean distance between residue pairs, and then align structures by maximizing a similarity score between these matrices. This paradigm, originating in the 1990s, facilitates robust comparisons by focusing on global structural patterns rather than rigid superpositions, making it particularly effective for proteins with flexible loops or domain insertions.

One seminal matrix-based method is DALI (Distance-matrix ALIgnment), introduced by Holm and Sander in 1993. DALI computes Cα distance matrices for each protein and decomposes them into hexapeptide-hexapeptide contact patterns (small submatrices of intra-molecular distances), identifying similar submatrices through an iterative process that combines Monte Carlo optimization with rigid-body superposition. The alignment score is refined by maximizing the number of matched residue pairs while penalizing gaps and distortions, ultimately yielding a Z-score that quantifies significance relative to the distribution of scores observed for unrelated structures. This Z-score, with values above 2 usually taken to indicate structural similarity, allows DALI to benchmark alignments against statistical expectations, achieving high sensitivity for remote homologs. DALI's iterative refinement enhances accuracy by progressively superimposing aligned segments, with the final superposition minimizing root-mean-square deviation (RMSD) for equivalent residues.

Another influential approach is SSAP (Sequential Structure Alignment Program), developed by Taylor and Orengo in 1989 and refined in subsequent works.
SSAP employs a vector-based representation derived from distance matrices, where inter-residue vectors are encoded to capture local structural environments, such as solvent accessibility and secondary structure propensities. Alignment proceeds via double dynamic programming: an initial scan aligns vectors in three dimensions, followed by a sequence-like optimization that scores pairwise environments using a matrix combining geometric and physicochemical similarities. This scoring function, which weights vector directions and lengths, produces a normalized percentage identity score ranging from 0 to 100, with values above 70 often indicating significant similarity. SSAP's strength lies in its ability to handle non-sequential alignments while enforcing spatial consistency through iterative superposition, making it suitable for comparing domains within multidomain proteins.^[55] Structural alphabet methods extend matrix-based alignment by discretizing continuous structural information into a finite set of "letters" analogous to amino acid codes, allowing sequence alignment algorithms to be applied directly to encoded structures. These methods originated in the 1990s with efforts to identify recurrent local motifs, such as Rooman et al.'s automated detection of structural fragments in proteins using distance-based clustering. Residues or short segments are encoded based on backbone torsion angles, distances, or environmental features into an alphabet of prototypes; for instance, the 3Di alphabet, introduced in 2023, assigns one of 20 states to each residue by considering its closest spatial neighbor, capturing tertiary interactions while reducing dependency on sequential order. Alignment then uses standard dynamic programming on these letter sequences, with substitution matrices derived from observed structural similarities, enabling efficient detection of conserved folds. 
This encoding preserves key topological features from underlying distance matrices, facilitating alignments robust to local distortions like loop variations.^[56] Overall, matrix-based methods excel in robustness to local distortions due to their emphasis on distance-derived similarities rather than coordinate overlays, a characteristic rooted in their 1990s development when computational resources limited exact geometric searches. These techniques have become foundational for database scanning and fold classification, with DALI and SSAP remaining widely adopted for their balance of sensitivity and specificity in identifying evolutionary relationships.^[55]^[56]
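The idea of scoring an alignment by the agreement of two distance matrices can be sketched as follows. The score has the form of the elastic similarity commonly described for DALI, in which each pair of aligned residue pairs contributes (θ - |d^A - d^B| / d̄) · exp(-(d̄/α)²), with d̄ the mean of the two distances. The constants θ = 0.2 and α = 20 Å are the values usually quoted for that score and are treated here as assumptions, as are the function names. The sketch only evaluates a given alignment and does not perform DALI's search.

```python
import numpy as np

def distance_matrix(ca):
    ca = np.asarray(ca, dtype=float)
    return np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)

def elastic_score(dA, dB, pairs, theta=0.2, alpha=20.0):
    """Sum an elastic distance-agreement term over all pairs of aligned residue pairs."""
    ia = np.array([p[0] for p in pairs])
    ib = np.array([p[1] for p in pairs])
    a = dA[np.ix_(ia, ia)]                 # distances among aligned residues of structure A
    b = dB[np.ix_(ib, ib)]                 # distances among their partners in structure B
    mean = 0.5 * (a + b)
    off = ~np.eye(len(pairs), dtype=bool)
    phi = np.full(a.shape, theta)          # diagonal terms (i == j) contribute theta
    rel = np.abs(a[off] - b[off]) / mean[off]          # relative distance deviation
    phi[off] = (theta - rel) * np.exp(-(mean[off] / alpha) ** 2)
    return float(phi.sum())

# Hypothetical test chain: a 20-residue random walk with 3.8 A steps.
rng = np.random.default_rng(1)
steps = rng.normal(size=(20, 3))
steps *= 3.8 / np.linalg.norm(steps, axis=1, keepdims=True)
A = np.cumsum(steps, axis=0)
pairs = [(i, i) for i in range(len(A))]

dA = distance_matrix(A)
same = elastic_score(dA, dA, pairs)                              # identical geometry
stretched = elastic_score(dA, distance_matrix(3.0 * A), pairs)   # all distances tripled
print(same > 0 > stretched)
```

Because the deviation is measured relative to the mean distance, and long distances are down-weighted by the exponential envelope, the score tolerates small proportional changes in long-range distances more readily than the same absolute change at short range.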

Fragment Assembly Methods

Fragment assembly methods in protein structural alignment involve identifying and combining short, local structural motifs or fragments from the proteins being compared to construct a global alignment. These approaches are particularly effective for detecting similarities in proteins with low sequence identity, where global methods may fail, by focusing on compatible local geometries rather than rigid whole-structure overlays.^[57] The Combinatorial Extension (CE) algorithm, introduced by Shindyalov and Bourne in 1998, exemplifies this strategy by incrementally building alignments from aligned fragment pairs (AFPs). CE begins by comparing short segments (typically 8-24 residues) of the two input structures and retaining those pairs whose root-mean-square deviation (RMSD) is below a threshold, such as 2.0 Å, ensuring geometric compatibility. These AFPs are then extended combinatorially by linking non-overlapping pairs that maintain spatial proximity and orientation, guided by a scoring function that accumulates similarity based on residue distances and alignment length. The process optimizes the path through a dynamic programming-like extension to maximize the cumulative score while minimizing gaps and distortions.^[57]^[58] MAMMOTH, developed by Ortiz et al. in 2002, advances fragment assembly through multidimensional embedding of local structural environments. It represents each residue by a vector in a high-dimensional space capturing intra- and inter-residue distances within a local window (e.g., 10-15 residues), then applies principal component analysis (PCA) to reduce dimensionality to 3-6 principal axes that preserve structural variance. Fragments are matched by comparing these reduced embeddings, with alignments grown by assembling compatible segments that maximize a Z-score based on structural similarity, allowing for flexible handling of conformational variations. 
Superposition techniques are used post-assembly to refine the global fit of the aligned fragments.^[59]^[60] In general, these methods enumerate candidate fragment pairs from the two structures being compared and filter them using RMSD thresholds (often 1.5-3.0 Å) to ensure low structural deviation before assembly. This modular approach excels in handling discontinuous alignments, such as those in multi-domain proteins or structures with insertions/deletions, by permitting gaps in the fragment chain without penalizing the overall similarity score as severely as in continuous methods. For instance, CE has demonstrated superior performance in aligning proteins with <20% sequence identity, achieving alignments with RMSD values under 3 Å for homologous folds in benchmark tests.^[57]^[58]^[59]
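
The fragment-pair filtering step can be sketched as follows. This is a minimal illustration of the idea of aligned fragment pairs, assuming a fixed fragment length of 8 and a 2.0 Å RMSD cutoff computed after optimal (Kabsch) superposition; it is not the published CE implementation, which uses its own similarity measures and an extension heuristic. The functions `kabsch_rmsd`, `find_afps`, and `random_walk`, and the toy coordinates, are hypothetical.

```python
import numpy as np

def kabsch_rmsd(p, q):
    """RMSD between two equally sized point sets after optimal superposition."""
    p0 = p - p.mean(axis=0)
    q0 = q - q.mean(axis=0)
    u, s, vt = np.linalg.svd(p0.T @ q0)
    sign = 1.0 if np.linalg.det(u @ vt) >= 0 else -1.0   # exclude reflections
    e0 = (p0 ** 2).sum() + (q0 ** 2).sum()
    msd = (e0 - 2.0 * (s[0] + s[1] + sign * s[2])) / len(p)
    return float(np.sqrt(max(msd, 0.0)))

def find_afps(a, b, length=8, cutoff=2.0):
    """List (i, j, rmsd) for fragment pairs a[i:i+length], b[j:j+length] under the cutoff."""
    afps = []
    for i in range(len(a) - length + 1):
        for j in range(len(b) - length + 1):
            r = kabsch_rmsd(a[i:i + length], b[j:j + length])
            if r < cutoff:
                afps.append((i, j, r))
    return afps

def random_walk(n, rng, step=3.8):
    """Toy chain with a fixed 3.8 Å step, standing in for a C-alpha trace."""
    steps = rng.normal(size=(n, 3))
    steps *= step / np.linalg.norm(steps, axis=1, keepdims=True)
    return np.cumsum(steps, axis=0)

rng = np.random.default_rng(1)
a = random_walk(30, rng)
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
# Structure b contains a rotated, shifted copy of a[5:25] flanked by unrelated segments.
shared = a[5:25] @ rot.T + np.array([20.0, -5.0, 3.0])
b = np.vstack([random_walk(10, rng), shared, random_walk(10, rng)])

afps = find_afps(a, b)
diagonal = [(i, j) for i, j, _ in afps if j - i == 5]
print(f"{len(afps)} AFPs found, {len(diagonal)} on the shared diagonal j = i + 5")
```

An assembly step would then chain compatible AFPs, such as consecutive pairs on the same diagonal, into a longer alignment path.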

Geometric Methods

Geometric methods for structural alignment leverage directional and angular properties of protein backbones to identify similarities, focusing on vector representations and angular encodings rather than direct coordinate overlays. These approaches optimize alignments by minimizing deviations in orientation and topology, often using rotation matrices or pseudo-sequences derived from geometric features. By prioritizing continuous spatial characteristics, such methods enable efficient detection of structural homologs even when sequences diverge significantly.^[61] TM-align exemplifies geometric alignment through optimization of a rotation matrix that superimposes protein structures while maximizing the TM-score, a scale-independent metric of topological similarity defined as \text{TM-score} = \max \left[ \frac{1}{L_{\text{target}}} \sum_{i=1}^{L_{\text{aligned}}} \frac{1}{1 + (d_i / d_0(L_{\text{target}}))^2} \right], where d_i is the distance between the i-th pair of aligned residues after superposition, the maximum is taken over superpositions, and d_0(L) = 1.24 \sqrt[3]{L - 15} - 1.8 (in Å) is a length-dependent scaling factor. The algorithm begins with an initial alignment guided by secondary structure elements, assigning higher scores to matching helices or strands to prioritize conserved geometric motifs, followed by iterative dynamic programming refinements using a distance-based scoring matrix S(i,j) = 1 / (1 + d_{ij}^2 / d_0(L_{\min})^2). This rotation-centric approach searches heuristically for a global optimum of angular and directional alignment without relying on fragment decomposition.^[62] Vector alignment methods represent protein backbones as sequences of direction vectors, typically unit vectors between consecutive Cα atoms, to capture local orientations and enable comparison via metrics like unit-vector root-mean-square deviation (URMS). 
In such approaches, structures are aligned by finding the optimal rotation that minimizes the sum of squared angular distances between corresponding vectors, often using dynamic programming to match vector sequences while accounting for gaps. For instance, consensus shapes for protein families are derived by averaging aligned vectors, preserving directional consistency across homologs with variable backbone spacing up to approximately 4 Å. These methods emphasize geometric invariance under rigid transformations, facilitating the identification of shared folds through vector topology. Dihedral angle comparisons complement vector representations by quantifying torsional similarities, where φ and ψ angles define local backbone conformations for finer angular alignment.^[63]^[64] A prominent 3D-1D encoding strategy projects three-dimensional structures into one-dimensional pseudo-sequences using structural alphabets, such as the 16 Protein Blocks (PBs), which approximate local geometries via pentapeptide-like motifs derived from φ and ψ dihedral angles. Each residue is assigned a PB by minimizing root-mean-square deviation on angular values within a sliding window of five Cα atoms, transforming the 3D backbone into a 1D string suitable for sequence-like processing. Alignment then proceeds via dynamic programming on these PB sequences, employing a substitution matrix built from known structural alignments (e.g., from the PALI database) to score matches between blocks, supporting both local and global optimizations with gap penalties. 
This encoding preserves key angular and directional features while reducing computational complexity.^[65] Geometric methods like TM-align and PB-based encodings demonstrate superior speed and accuracy, particularly for twilight zone proteins with sequence identities below 30%, where TM-align achieves average TM-scores of 0.51 for high-confidence matches and detects remote homologs with RMSD up to 5 Å and coverage over 40%, outperforming coordinate-based tools in sensitivity. Vector and PB approaches similarly enable rapid alignments (under 1 minute per pair) with recognition rates exceeding 85% for distant relatives in large databases like SCOP, balancing efficiency with robust geometric fidelity.^[62]^[65]^[64]
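
As a worked illustration of the TM-score, the sketch below evaluates the sum inside the maximum for one fixed set of aligned-pair distances, using the standard scale d_0(L) = 1.24 (L - 15)^(1/3) - 1.8. It deliberately omits the search over superpositions that TM-align performs, and the lower bound of 0.5 Å on d_0 is a guard chosen for this sketch so that very short chains do not produce a non-positive scale. The function names are hypothetical.

```python
import numpy as np

def d0(length):
    """Length-dependent distance scale in Å, floored at 0.5 for this sketch."""
    return max(1.24 * float(np.cbrt(length - 15.0)) - 1.8, 0.5)

def tm_score(distances, l_target):
    """TM-score term for one fixed superposition (no maximization performed).

    distances: distances (Å) between aligned residue pairs after superposition.
    l_target:  length of the structure used for normalization.
    """
    d = np.asarray(distances, dtype=float)
    return float(np.sum(1.0 / (1.0 + (d / d0(l_target)) ** 2)) / l_target)

# Three toy cases for a 100-residue target.
perfect = tm_score(np.zeros(100), 100)             # every residue aligned at 0 Å
half_covered = tm_score(np.zeros(50), 100)         # only half the residues aligned
at_scale = tm_score(np.full(100, d0(100)), 100)    # every pair exactly d0 apart
print(perfect, half_covered, at_scale)
```

Normalizing by the target length rather than by the number of aligned pairs is what makes unaligned residues lower the score, so coverage and accuracy are captured in a single value.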

Extensions

Multiple Alignment

Multiple structural alignment extends the principles of pairwise structural alignment to simultaneously superimpose and compare more than two protein structures, enabling the identification of conserved cores and variable regions across a set of related proteins. This process typically builds on pairwise alignments as foundational building blocks, iteratively merging them to form a global superposition that captures evolutionary conservation and structural divergence. By aligning multiple structures, researchers can infer functional insights that are obscured in pairwise comparisons alone.^[66] The primary challenges in multiple structural alignment stem from the combinatorial explosion of possible residue equivalences as the number of input structures grows, rendering the problem NP-hard and computationally intractable for exact solutions beyond small sets. Heuristic methods are thus essential to approximate optimal alignments efficiently. Another key difficulty is establishing a consensus superposition that minimizes overall deviations while accommodating conformational flexibility and insertions/deletions across the structures, often requiring iterative refinement to balance global and local similarities.^[67] Prominent methods address these challenges through progressive or graph-based strategies. Progressive approaches, exemplified by MUSTANG (MUltiple STructural AligNment AlGorithm), begin with pairwise alignments scored on residue-residue contacts and local topology, then progressively incorporate additional structures along a guide tree without explicit gap penalties, allowing for flexibility in distant homologs and achieving high accuracy (e.g., 93.4% on benchmark families). 
Graph-based methods, such as POSA (Partial Order Structure Alignment), model protein backbones as partial order graphs to represent flexible alignments, enabling the detection of conserved regions present in subsets of structures and handling internal motions without assuming a linear order. These techniques prioritize structural similarity over sequence, often outperforming sequence-based multiples in cases of low homology.^[66]^[68] Multiple structural alignments find critical applications in analyzing family-wide folds, where they reveal conserved structural motifs across homologous proteins despite sequence divergence, as seen in alignments of cyclin-dependent kinases that enable classification of active and inactive states with over 98% accuracy using derived features. In superfamily analysis, they facilitate the detection of distant evolutionary relationships by highlighting shared cores in diverse proteins, supporting functional predictions and the expansion of structural databases like SCOP or CATH.^[69] Evaluation of multiple alignments relies on metrics that quantify conservation and superposition quality, including core RMSD, which computes the root-mean-square deviation solely over the strict core of positions equivalent in all structures and within 4 Å, providing a measure of structural fidelity in conserved regions (lower values indicate tighter alignments). Alignment length consistency, often expressed as the core size as a percentage of the shortest input structure, assesses the extent of overlap and robustness across the set, with higher percentages signaling more reliable evolutionary inferences.^[70]^[71]
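
The core-based evaluation metrics can be sketched as below. The conventions are assumptions made for this illustration, since the text does not fix them: the input is a set of already superposed coordinates arranged by alignment column with NaN marking gaps, a column belongs to the core when no structure has a gap and every pair of structures lies within 4 Å of each other, and core RMSD is taken over all pairwise deviations in core columns. The function name `core_rmsd` is hypothetical.

```python
import numpy as np
from itertools import combinations

def core_rmsd(coords, cutoff=4.0):
    """Return (core RMSD, boolean core mask) for superposed, column-aligned coordinates.

    coords: array of shape (n_structures, n_columns, 3); NaN rows mark gaps.
    """
    coords = np.asarray(coords, dtype=float)
    pairs = list(combinations(range(coords.shape[0]), 2))
    # pair_d[k, c]: distance between the two structures of pair k at column c
    pair_d = np.array([np.linalg.norm(coords[a] - coords[b], axis=1) for a, b in pairs])
    gapless = ~np.isnan(coords).any(axis=(0, 2))
    within = (np.nan_to_num(pair_d, nan=np.inf) <= cutoff).all(axis=0)
    core = gapless & within
    if not core.any():
        return float("nan"), core
    return float(np.sqrt((pair_d[:, core] ** 2).mean())), core

nan = float("nan")
coords = np.array([
    [[0, 0, 0], [10, 0, 0], [15, 0, 0], [20, 0, 0]],
    [[1, 0, 0], [10, 0, 0], [nan, nan, nan], [20, 0, 0]],   # gap in column 2
    [[0, 1, 0], [10, 0, 3], [15, 0, 0], [26, 0, 0]],        # 6 Å off in column 3
], dtype=float)

rmsd, core = core_rmsd(coords)
lengths = (~np.isnan(coords).any(axis=2)).sum(axis=1)   # residues per structure
core_fraction = core.sum() / lengths.min()               # core size vs shortest input
print(core, round(rmsd, 3), round(core_fraction, 3))
```

Reporting both numbers together matters because a very small core can yield a low RMSD that says little about the set as a whole.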

RNA Alignment

RNA structural alignment focuses on comparing three-dimensional (3D) conformations of ribonucleic acid (RNA) molecules, which exhibit unique topological features such as pseudoknots, diverse base-pairing interactions, and junction loops that distinguish them from protein structures. Pseudoknots arise when nucleotides in a single-stranded loop form base pairs with complementary nucleotides outside the loop, often stabilized by coaxial stacking of stems and non-canonical base triples involving loops.^[72] Base interactions in RNA include canonical Watson-Crick pairs (A-U, G-C) as well as non-canonical ones like Hoogsteen and wobble pairs, which contribute to the flexibility and functional diversity of RNA motifs.^[73] Junction loops, or multi-branched loops, serve as critical connectors between helical stems, organizing the overall 3D architecture and enabling complex folding patterns observed in functional RNAs like ribozymes and riboswitches.^[74] The primary goal of RNA alignment is to identify correspondences that preserve secondary structure motifs, including base-pairing patterns and pseudoknotted regions, while accounting for the inherent flexibility and modular nature of RNA topologies. This conservation highlights evolutionary relationships and functional similarities, such as in non-coding RNAs where structural integrity is key to regulatory roles. 
Unlike rigid superposition used in general structural comparisons, RNA alignments often incorporate dynamic programming or graph-based approaches to handle discontinuous helices and loop-mediated interactions.^[75] Key methods for RNA 3D structural alignment include RMalign, which employs a size-independent scoring function called RMscore to evaluate similarity based on residue-residue distances and secondary structure elements, achieving superior performance in classifying RNA structures compared to earlier tools like SARA.^[76] TOPAS provides a network-based approach for pairwise alignment of RNA sequences, constructing topological networks from predicted secondary structures that incorporate sequential and base-pairing edges, followed by probabilistic alignment to capture structural similarities even in pseudoknotted regions.^[77] For homology detection, rMSA integrates sequence search against databases like NCBI nt and RNAcentral with covariance model-based multiple sequence alignment, enhancing the accuracy of secondary structure prediction by at least 20% through improved alignment of homologous RNAs.^[78] Automated pipelines facilitate comprehensive RNA analysis, with rMSA serving as a multi-stage tool that combines homology search, alignment, and structure modeling to streamline workflows for large-scale RNA datasets. Recent advances incorporate deep learning, such as the 2024 REDalign method, which uses a residual encoder-decoder network to predict and align RNA secondary structures by learning consensus patterns from sequence and structural data, offering high accuracy with reduced computational overhead compared to traditional dynamic programming.^[79]
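
Pseudoknots are easiest to see when base pairs are written as index pairs: two pairs (i, j) and (k, l) form a pseudoknot when they cross, i.e. i < k < j < l. The sketch below assumes the widely used extended dot-bracket notation, in which a second bracket type marks pairs that cross the first; that notation and the helper names are choices made for this illustration and are not taken from any of the tools described above.

```python
def parse_pairs(dot_bracket):
    """Return sorted base pairs (i, j), 0-based, from extended dot-bracket notation."""
    openers = {"(": ")", "[": "]", "{": "}", "<": ">"}
    closers = {close: open_ for open_, close in openers.items()}
    stacks = {open_: [] for open_ in openers}
    pairs = []
    for pos, ch in enumerate(dot_bracket):
        if ch in openers:
            stacks[ch].append(pos)
        elif ch in closers:
            stack = stacks[closers[ch]]
            if not stack:
                raise ValueError(f"unmatched '{ch}' at position {pos}")
            pairs.append((stack.pop(), pos))
    if any(stacks.values()):
        raise ValueError("unclosed bracket in structure string")
    return sorted(pairs)

def crossing_pairs(pairs):
    """Return all pairs of base pairs ((i, j), (k, l)) with i < k < j < l."""
    return [(p, q) for p in pairs for q in pairs if p[0] < q[0] < p[1] < q[1]]

nested = parse_pairs("((..))")                 # simple hairpin, no crossings
knotted = parse_pairs("((..[[..))..]]")        # two stems that interleave
print(crossing_pairs(nested))
print(crossing_pairs(knotted))
```

An alignment that aims to preserve pseudoknotted regions has to keep such crossing pairs matched, which is why purely nested (tree-like) representations are insufficient for them.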

Advances

Integration with Prediction Tools

Structural alignment plays a pivotal role in protein structure prediction pipelines, particularly for template selection in AlphaFold-Multimer, where alignments to known structures generate diverse multiple sequence alignments (MSAs) and structural templates to boost prediction accuracy.^[80] Methods like MULTICOM leverage both sequence and structure alignments to create these inputs, enabling more robust modeling of protein complexes by identifying homologous templates beyond pure sequence similarity.^[81] Furthermore, alignment facilitates the refinement of predicted structures against experimentally validated ones, correcting discrepancies in low-confidence regions and improving overall model quality.^[82] Following prediction, structural alignment tools are essential for validating AlphaFold models, with TM-align commonly used to compare predicted structures to references, incorporating the TM-score for fold similarity assessment.^[83] This process integrates AlphaFold's per-residue confidence score, the predicted local distance difference test (pLDDT), to flag regions with pLDDT values below 70 and guide targeted refinements there.^[84] In large-scale databases like the AlphaFold Protein Structure Database, Foldseek enables efficient structural searches and alignments across millions of predicted models, encoding structures into a 20-state 3Di alphabet for rapid, sensitive comparisons.^[85] This integration supports template discovery and validation by returning alignments with metrics like E-values and overlap percentages, streamlining workflows for researchers querying the database.^[86] The synergy between structural alignment and prediction tools gained momentum after AlphaFold's dominance in CASP14 in 2020, where its high-accuracy predictions highlighted the need for alignment-based validation and refinement to handle novel folds.^[87] By 2025, advancements include multimodal methods that align AlphaFold3 outputs with cryo-EM densities 
for refined atomic models, and Distance-AF, which optimizes predictions using distance matrix alignments to enhance tertiary structure accuracy.^[88]^[89] These developments, alongside October 2025 updates to the AlphaFold Database incorporating isoforms and expanded coverage, have further integrated alignment into iterative refinement pipelines.^[90]
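
Separating residues by pLDDT before alignment can be sketched as follows. AlphaFold model files in PDB format store the per-residue pLDDT in the B-factor field (columns 61-66 of each ATOM record), so the values can be read with fixed-column slicing. The helper names, the toy records produced by `atom_line`, and the way the 70 threshold is applied are assumptions made for this illustration.

```python
def atom_line(serial, name, resname, chain, resseq, x, y, z, bfac):
    """Build a fixed-column PDB ATOM record (toy generator for this example)."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{bfac:6.2f}")

def ca_plddt(pdb_text):
    """Map (chain, residue number) to the B-factor value of each CA atom."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[(line[21], int(line[22:26]))] = float(line[60:66])
    return scores

def split_by_plddt(scores, threshold=70.0):
    """Return (confident, low) residue lists relative to the threshold."""
    confident = sorted(k for k, v in scores.items() if v >= threshold)
    low = sorted(k for k, v in scores.items() if v < threshold)
    return confident, low

# Four toy residues with decreasing confidence; the N atom is skipped by the CA filter.
toy_pdb = "\n".join([
    atom_line(1, "N", "MET", "A", 1, 0.0, 0.0, 0.0, 95.10),
    atom_line(2, "CA", "MET", "A", 1, 1.5, 0.0, 0.0, 95.10),
    atom_line(3, "CA", "GLY", "A", 2, 5.3, 0.0, 0.0, 82.00),
    atom_line(4, "CA", "SER", "A", 3, 9.1, 0.0, 0.0, 65.50),
    atom_line(5, "CA", "LYS", "A", 4, 12.9, 0.0, 0.0, 40.20),
])
confident, low = split_by_plddt(ca_plddt(toy_pdb))
print("confident:", confident)
print("low confidence:", low)
```

The confident set can be passed to an alignment tool as the region to superimpose, while the low-confidence set marks where a comparison against an experimental structure is most likely to disagree.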

Machine Learning Approaches

Machine learning approaches have revolutionized structural alignment by enabling faster, more scalable comparisons of protein and RNA structures, particularly in the era of vast predicted structure databases. These methods leverage neural networks to encode complex three-dimensional features into compact representations, reducing the computational burden of traditional geometric alignments while maintaining high accuracy. Key innovations include learned structural embeddings and deep architectures tailored for high-throughput searches and specific biomolecular types. Learned encodings have been instrumental in enabling rapid structural search. For instance, Foldseek employs a vector-quantized variational autoencoder (VQ-VAE), a type of neural network, to discretize protein structures into a 20-character "3Di" alphabet that captures tertiary residue-residue interactions, transforming 3D alignment into efficient sequence alignment. This approach achieves near-linear scaling, enabling searches across millions of structures in seconds, with sensitivity comparable to traditional tools like TM-align but orders of magnitude faster. Deep learning models have further advanced alignment for both proteins and RNA. SARST2 integrates artificial neural networks (ANNs) and decision trees in a filter-and-refine pipeline, combining sequence, secondary structure, and evolutionary data to align query proteins against massive databases like AlphaFold DB in under 4 minutes on standard hardware, with 96% accuracy in homology detection. For RNA, REDalign uses a residual encoder-decoder network to align secondary structures by learning pattern-based correspondences, outperforming classical methods like RNAforester in accuracy on benchmark datasets while requiring less computation. Recent benchmarking underscores the speed advantages of ML-driven tools. 
A 2025 study evaluated nine alignment algorithms (including ML-based ones like Foldseek, TM-Vec, and DeepAlign) on downstream tasks such as homology detection and function prediction, revealing that ML methods like Foldseek reduced runtime by up to 100-fold compared to geometric tools like TM-align, especially for large-scale datasets, without sacrificing alignment quality. Advances in ML have extended to AI-assisted de novo protein design and post-AlphaFold data handling. In de novo design, neural network embeddings facilitate alignment of generated structures against natural templates to validate novelty and function, as seen in frameworks that invert structure prediction models for binder design. For large datasets, ML clustering via Foldseek has processed over 200 million AlphaFold predictions, identifying structural families and enabling proteome-wide alignments that were previously infeasible.^[91]
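
The discretization step behind a learned structural alphabet can be sketched as nearest-centroid vector quantization: each residue's feature vector is replaced by the label of the closest entry in a codebook. In the sketch below the codebook is random and the 10-dimensional features are synthetic, both assumptions for illustration; Foldseek's actual 3Di states come from a trained VQ-VAE and are not reproduced here. The 20 amino-acid letters are reused only as convenient state labels.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # 20 labels, one per codebook state

def quantize(features, codebook):
    """Map each feature vector to the letter of its nearest codebook centroid."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return "".join(ALPHABET[k] for k in d2.argmin(axis=1))

rng = np.random.default_rng(7)
codebook = rng.normal(size=(20, 10))            # toy centroids, not learned

# Synthetic per-residue features: noisy copies of centroids 3, 3, 7 and 19.
states = [3, 3, 7, 19]
features = codebook[states] + rng.normal(scale=0.01, size=(len(states), 10))

letters = quantize(features, codebook)
print(letters)
```

Once every structure is a string over a 20-letter alphabet, existing sequence search machinery (k-mer prefilters, substitution matrices, dynamic programming) can be applied to structural comparison without modification.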

Tools

Web-Based Tools

Web-based tools for structural alignment provide accessible platforms that enable researchers to perform alignments directly in a browser without requiring software installation, making them ideal for quick analyses or users without advanced computational resources. These services typically accept uploads of Protein Data Bank (PDB) files or entry identifiers and output visualizations, scores, and aligned structures, though they often impose limits on input sizes or concurrent jobs to manage server load. Common scoring metrics, such as TM-score for global similarity or Z-scores for significance, help evaluate alignment quality.^[33]^[18] The RCSB PDB Pairwise Structure Alignment tool, hosted by the Research Collaboratory for Structural Bioinformatics (RCSB), allows users to upload PDB files or specify existing entries for pairwise superposition of protein structures. It computes alignments using methods like TM-align, providing TM-scores to quantify structural similarity, and supports visualization of overlaid models. This free service emphasizes user-friendly interfaces for selecting chains and viewing results, with no installation needed, though large datasets may require batch processing options.^[92]^[33] DaliLite, accessible via the Dali web server, serves as the online version of the DALI algorithm for pairwise and multiple protein structure alignments based on intra-molecular distance matrices. Users submit query structures against the PDB database or other inputs, receiving Z-scores that indicate statistical significance of similarities, along with aligned coordinate files and structural neighborhoods. 
Designed for ease of use, it processes jobs asynchronously and limits inputs to manage computational demands, making it suitable for exploratory comparisons without local setup.^[19]^[18] PDBeFold, provided by the Protein Data Bank in Europe (PDBe), offers a web interface for structural alignments based on the Secondary Structure Matching (SSM) algorithm, supporting both pairwise and multiple comparisons. It enables uploading of structures or searching against the PDB, outputting superposition visuals and similarity scores, with options to refine alignments iteratively. As a free, browser-based resource, it facilitates rapid assessments but restricts large-scale submissions to prevent overload.^[93] For RNA structures, RNAhub (launched in 2025) is a specialized web server that automates the alignment of RNA homologs by integrating sequence searches with secondary structure prediction and covariation analysis. Users input an RNA sequence to generate multiple alignments that incorporate structural constraints, assessing conservation via tools like R-scape, which is particularly useful for sequence-to-structure alignments in non-coding RNAs. This no-install platform is free but caps query lengths and homology searches to ensure efficient processing.^[94]^[95]

Standalone Software

Standalone software for structural alignment provides downloadable programs that enable offline computation, supporting high-throughput analyses, customization via scripting, and integration with local workflows without reliance on internet connectivity. These tools are particularly valuable for researchers handling large datasets or requiring reproducible, resource-controlled environments, often available as open-source options with binaries or source code for various operating systems. TM-align is a widely used command-line tool for fast pairwise protein structure alignment based on a geometric approach employing the TM-score rotation matrix to optimize superposition while prioritizing global topology. Developed in 2005, it excels in speed and accuracy for comparing structures with low sequence similarity, making it suitable for scripting and pipeline integration in structural bioinformatics tasks.^[34] The tool outputs alignment details including TM-score, RMSD, and aligned residues, and is distributed as a precompiled executable for Linux, macOS, and Windows, with minimal dependencies.^[96] Foldseek, introduced in 2023, is an open-source tool designed for rapid and sensitive large-scale protein structure searches and alignments by encoding 3D structures into compact 1D sequences using a 20-state 3Di alphabet derived from inter-residue geometry. 
This encoding allows fast sequence search and alignment techniques, built on the MMseqs2 framework, to perform structural comparisons up to 2-4 orders of magnitude faster than traditional methods, while maintaining high sensitivity for remote homolog detection.^[97] It supports monomer and multimer alignments, clustering, and batch processing on massive databases, with installation via conda or from source on Linux and macOS, requiring dependencies like OpenMP for parallelization.^[98] CE (Combinatorial Extension) is a fragment-based alignment method that identifies and extends short aligned fragments to construct optimal global alignments, emphasizing rigid-body superpositions of protein backbones. Implemented as a standalone program since its inception in 1998, it supports pairwise and multiple alignments through CE-MC extensions, enabling batch processing for comparing models against reference structures. Source code and binaries are available for download, compilable on Unix-like systems with C dependencies, and it integrates well with tools for structural database annotation. MaxCluster complements fragment-based approaches by providing a versatile command-line utility for pairwise structure comparison and clustering, computing metrics like RMSD, GDT-TS, and TM-score across large sets of models. Released around 2008, it facilitates high-throughput evaluation of predicted structures, such as those from folding simulations, with precompiled binaries for Linux, macOS, and Windows, and no external dependencies beyond standard libraries.^[99] A recent advancement is SARST2, released in 2025, which offers high-throughput, resource-efficient structural alignment against massive databases by transforming Ramachandran angles into sequential representations for accelerated similarity searches. 
It achieves superior speed and low memory usage compared to predecessors, making it ideal for aligning predicted structures like those from AlphaFold models to explore evolutionary relationships at scale.^[100] The open-source implementation on GitHub supports Linux and macOS installation via Python dependencies including NumPy and SciPy, with options for GPU acceleration to handle datasets exceeding millions of structures.^[101]

References

[1]
Pairwise Structure Alignment - RCSB PDB
Sep 23, 2024 · Introduction. What is Structure Alignment? Structure alignment attempts to establish residue-residue correspondence between two or more ...
[2]
The meaning of alignment: lessons from structural diversity - PMC
Dec 23, 2008 · Protein structural alignment provides a fundamental basis for deriving principles of functional and evolutionary relationships.
[3]
[PDF] A Novel Approach to Structure Alignment - Stanford University
It enables the study of functional relationship between proteins and is very important for homology and threading methods in structure prediction.
[4]
RAPIDO: a web server for the alignment of protein structures ... - PMC
The RMSD for a flexible superposition (RMSDf) is calculated as the RMSD over all Cα-atoms of the individual rigid bodies superimposed separately. The ...
[5]
Methods of protein structure comparison - PMC - NIH
RMSD can be calculated for any type and subset of atoms; for example, Cα atoms of the entire protein, Cα atoms of all residues in a specific subset (e.g. the ...
[6]
The difficulty of protein structure alignment under the RMSD - PMC
Protein structure alignment is often modeled as the largest common point set (LCP) problem based on the Root Mean Square Deviation (RMSD).
[7]
Protein remote homology detection and structural alignment ... - Nature
Sep 7, 2023 · The challenge of remote homology detection is identifying structurally similar proteins that do not necessarily have high sequence similarity.
[8]
CATH database: an extended protein family resource for structural ...
These approaches used datasets of distant homologues selected from the structural classifications, such as SCOP and CATH, to determine the sensitivity of ...
[9]
CATHe: detection of remote homologues for CATH superfamilies ...
CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation ...
[10]
The TIM Barrel Architecture Facilitated the Early Evolution of Protein ...
Jan 5, 2016 · The triosephosphate isomerase (TIM) barrel protein fold is a structurally repetitive architecture that is present in approximately 10 % of all enzymes.
[11]
Protein function annotation with Structurally Aligned Local Sites of ...
Feb 28, 2013 · The top-ranked residues constitute the active site prediction. Structural alignments are obtained for sets of these local sites. Characteristic ...
[12]
Correlation of fitness landscapes from three orthologous TIM barrels ...
Mar 6, 2017 · The TIM barrel superfamily contains 57 distinct families and encompasses five of the six enzyme commission functional categories catalysing at ...
[13]
Template-Based Protein Structure Modeling - PMC - PubMed Central
Template-based protein structure modeling techniques rely on the study of principles that dictate the 3D structure of natural proteins from the theory of ...
[14]
Structure-Based Virtual Screening for Drug Discovery: Principles ...
In this review, we focus on the principles and applications of Virtual Screening (VS) within the context of SBDD and examine different procedures.
[15]
PoLi: A Virtual Screening Pipeline Based On Template Pocket ... - NIH
In PoLi, ligand binding pockets in the query protein's structure are predicted using two different approaches: first by global structural superposition of holo- ...
[16]
Dali server: structural unification of protein families - Oxford Academic
May 24, 2022 · The Dali server for 3D protein structure comparison is widely used by crystallographers to relate new structures to pre-existing ones.
[17]
Dali server - ekhidna.biocenter.
The Dali server is a network service for comparing protein structures in 3D. You submit the coordinates of a query protein structure and Dali compares them.
[18]
Dealing with Coordinates - PDB-101
In PDB file format, the ATOM record is used to identify proteins or nucleic acid atoms, and the HETATM record is used to identify atoms in small molecules.
[19]
[PDF] PROTEIN DATA BANK ATOMIC COORDINATE AND ...
The file contains records of 80 characters each, including HEADER, COMPND, SOURCE, ATOM, and HETATM records, with ATOM and HETATM records appearing in ...
[20]
GOSSIP: a method for fast and accurate global alignment of protein ...
In this article, we propose a global structural alignment method that has the following properties: it can be applied to any set of protein structures with Cα ...
[21]
Comprehensive Evaluation of Protein Structure Alignment Methods
Aug 7, 2025 · The majority of the current structure alignment and protein classification methods use either Cα-polylines or the 3D ...
[22]
Secondary structure assignment that accurately reflects physical and ...
Secondary structure is used in hierarchical classification of protein structures, identification of protein features, such as helix caps and loops, ...
[23]
Protein secondary structure: category assignment and predictability
The conventional 'H' assignment of α-helices by DSSP contains approximately 20% residues without hydrogen bonds from residues i→i+4 and/or i−4←i. For the β- ...
[24]
Deep learning methods for protein torsion angle prediction
Sep 18, 2017 · The conformation of the backbone of a protein can be largely represented by two torsion angles (phi and psi angles) associated with each Cα atom ...
[25]
Part 1: Protein Structure - Backbone torsion angles
The protein backbone can be described in terms of the phi, psi and omega torsion angles of the bonds: The phi angle is the angle around the -N-CA- bond ...
[26]
Alignments of biomolecular contact maps | Interface Focus - Journals
Jun 11, 2021 · Here, we focus on contact maps, i.e. undirected graphs with an ordered set of vertices. These serve as natural discretizations of RNA and protein structures.
[27]
Protein structure comparison by alignment of distance matrices
Sep 5, 1993 · We have developed a novel algorithm (DALI) for optimal pairwise alignment of protein structures. The three-dimensional co-ordinates of each protein are used to ...
[28]
[PDF] Protein Structure Comparison by Alignment of Distance Matrices
Here, we present a general approach for aligning a pair of proteins represented by two-dimensional matrices. The result is a set of structurally equivalent ...
[29]
Alignment of protein structures in the presence of domain motions
As it is well known, protein molecules are flexible entities with internal movements ranging from the displacement of individual atoms to movements of entire ...
[30]
CE-MC: a multiple protein structure alignment server - PMC - NIH
Output data. The CE-MC program outputs multiple alignments in four different formats, i.e. JOY/html, JOY/PostScript, text and FASTA formats. The JOY/html format ...
[31]
Alignment file (PIR)
The first line of each sequence entry specifies the protein code after the >P1; line identifier. The line identifier must occur at the beginning of the line.
[32]
Pairwise Structure Alignment - RCSB PDB
Sep 23, 2024 · This tool presents options for pairwise structure alignment of proteins. In the case of pairwise alignment, structures are always compared in pairs.
[33]
TM-align: a protein structure alignment algorithm based on the ... - NIH
We have developed TM-align, a new algorithm to identify the best structural alignment between protein pairs that combines the TM-score rotation matrix and ...
[34]
SuperPose
From a superposition of two or more structures, Superpose generates sequence alignments, structure alignments, PDB coordinates, RMSD statistics, Difference ...
[35]
Protein multiple alignments: sequence-based versus structure ...
Concerning gap management, sequence-based programs set less gaps than structure-based programs. Concerning the databases, the alignments of the manually built ...
[36]
Dali server: conservation mapping in 3D - PMC
May 10, 2010 · Outputs. The Dali server, Dali Database and pairwise comparison use a common output format and share interactive analysis tools. The result ...
[37]
Matt: Local Flexibility Aids Protein Multiple Structure Alignment
We introduce an algorithm that allows local flexibility in the structures when it brings them into closer alignment.
[38]
Fr-TM-align: a new protein structural alignment method based on ...
TM-align employs a very simple approach that uses both gapless threading and secondary structure similarity to generate the initial set of equivalent residues.
[39]
The difficulty of protein structure alignment under the RMSD
Jan 4, 2013 · The study identified two important factors in the problem's complexity: (1) The lack of a limit in the distance between the consecutive points ...
[40]
Comparative Analysis of Protein Structure Alignments - PMC
Similar structures might display considerable structural variability and are often related by several insertions and deletions (indels) of considerable size.
[41]
https://pmc.ncbi.nlm.nih.gov/articles/PMC1959231/
[42]
https://doi.org/10.1016/0022-2836(79)90308-5
[43]
https://doi.org/10.1002/prot.20264
[44]
[PDF] Exact algorithms for pairwise protein structure alignment - CORE
Finding the score-optimal residue correspondences is believed to be an NP-hard problem for almost all 3-dimensional scoring functions. 2-dimensional scoring ...
[45]
A comparison of algorithms for the pairwise alignment of biological ...
The problem of global alignment is equivalent to the subgraph isomorphism problem, which is NP-complete (Cook, 1971), so aligners settle for approximate ...
[46]
Protein structure comparison using iterated double dynamic ... - NIH
A protein structure comparison method is described that allows the generation of large populations of high-scoring alternate alignments.
[47]
An integrated approach to the analysis and modeling of protein ...
This is a local alignment algorithm that focuses only on aligning the core structures defined by the threshold, and is similar to the Smith-Waterman ...
[48]
PAUL: protein structural alignment using integer linear programming ...
Oct 19, 2009 · Protein structural alignment determines the three-dimensional superposition of protein structures by means of aligning the protein's residues.
[49]
DALIX: optimal DALI protein structure alignment - PubMed
We present a mathematical model and exact algorithm for optimally aligning protein structures using the DALI scoring model.
[50]
[PDF] DALIX: optimal DALI protein structure alignment - Hal-Inria
Apr 26, 2012 · For SISY, RIPC and SKOLNICK, a maximum running time of 30 CPU hours per instance is applied and for the short SCOPCath instances a time ...
[51]
Approximate protein structural alignment in polynomial time - PNAS
Here, we study the structural alignment problem as a family of optimization problems and develop an approximate polynomial-time algorithm to solve them.
[52]
Algorithms for Multiple Protein Structure Alignment and ... - NIH
The program output format of multiple alignments is ClustalX or PIR. ... Kolodny R and Linial N Approximate protein structural alignment in polynomial time.
[53]
https://www.sciencedirect.com/science/article/pii/S0022283600939731
[54]
Protein structure alignment by incremental combinatorial extension ...
The algorithm uses combinatorial extension (CE) of aligned fragment pairs (AFPs) based on local geometry to build an optimal protein structure alignment.
[55]
A database and tools for 3-D protein structure comparison and ...
The database reported here is derived using the Combinatorial Extension (CE) algorithm which compares pairs of protein polypeptide chains and provides a list of ...
[56]
MAMMOTH (matching molecular models obtained from theory)
Here, a new method for sequence-independent structural alignment is presented that allows comparison of an experimental protein structure with an arbitrary low- ...
[57]
MAMMOTH (Matching molecular models obtained from theory): An ...
Apr 13, 2009 · A new method for sequence-independent structural alignment is presented that allows comparison of an experimental protein structure with an arbitrary low- ...
[58]
Geometric Methods for Protein Structure Comparison - SpringerLink
Protein structural comparison is an important operation in molecular biology and bioinformatics. It plays a central role in protein analysis and design.
[59]
a protein structure alignment algorithm based on the TM-score
We have developed TM-align, a new algorithm to identify the best structural alignment between protein pairs that combines the TM-score rotation matrix and ...
[60]
[PDF] Finding the Consensus Shape for a Protein Family
The compared vectors are those between adjacent α-carbons along the protein backbone. We refer to these direction vectors as unit vectors because, for proteins, ...
[61]
SABERTOOTH: protein structural alignment based on a vectorial ...
We show that protein comparison based on a vectorial representation of protein structure performs comparably to established algorithms based on coordinates.
[62]
Protein Block Expert (PBE): a web-based protein structure analysis ...
Encoding protein 3D structures into 1D string using short structural prototypes or structural alphabets opens a new front for structure comparison and analysis.
[63]
MUSTANG: A multiple structural alignment algorithm
May 30, 2006 · MUSTANG aligns residues on the basis of similarity in patterns of both residue–residue contacts and local structural topology.
[64]
Algorithms, applications, and challenges of protein structure alignment
In this review, we first present a small survey of current methods for protein pairwise and multiple alignment, focusing on those that are publicly available ...
[65]
Multiple flexible structure alignment using partial order graphs
A new method of multiple protein structure alignment, POSA (Partial Order Structure Alignment), was developed using a partial order graph representation of ...
[66]
A multiple protein structure alignment and feature extraction suite - NIH
Apr 6, 2020 · A multiple structure alignment suite meant for homologous but sequentially divergent protein families which consistently returns accurate alignments.
[67]
Multiple structure alignment and consensus identification for proteins
Feb 2, 2010 · This paper presents an algorithm to compute a multiple structure alignment for a set of proteins and to generate a consensus structure. The ...
[68]
Accuracy analysis of multiple structure alignments - PMC - NIH
Protein structure alignment methods are essential for many different challenges in protein science, such as the determination of relations between proteins ...
[69]
Structure, stability and function of RNA pseudoknots involved in ...
An RNA pseudoknot is a structural element of RNA which forms when nucleotides within one of the four types of single-stranded loops in a secondary structure ( ...
[70]
Structure of the Human Telomerase RNA Pseudoknot Reveals ...
The telomerase pseudoknot has 8 nt in loop 1, which allows it to form additional stem 2-loop 1 Hoogsteen base triples. These multiple base triple interactions ...
[71]
RNAloops: a database of RNA multiloops - Oxford Academic
Jul 9, 2022 · RNAloops is a self-updating database that stores multi-branched loops identified in the PDB-deposited RNA structures.
[72]
All-at-once RNA folding with 3D motif prediction framed by ... - Nature
Oct 3, 2025 · Critical loops and junctions connect these helices and arrange them into a 3D structure. These nonhelical linker regions, called RNA 3D motifs ...
[73]
RMalign: an RNA structural alignment tool based on a novel scoring ...
Apr 8, 2019 · In this study, we develop a novel RNA 3D structure alignment approach RMalign, which is based on a size-independent scoring function RMscore.
[74]
TOPAS: network-based structural alignment of RNA sequences
Jan 10, 2019 · We propose a novel network-based scheme for pairwise structural alignment of RNAs. The proposed algorithm, TOPAS, builds on the concept of topological networks.
[75]
rMSA: A Sequence Search and Alignment Algorithm to Improve RNA ...
rMSA is a multi-stage pipeline for automated RNA homology search and MSA construction. rMSA improves covariance-based RNA secondary structure prediction ...
[76]
accurate RNA structural alignment using residual encoder-decoder ...
Nov 5, 2024 · REDalign presents a significant advancement in RNA secondary structural alignment, balancing high alignment accuracy with lower computational demands.
[77]
Enhancing alphafold-multimer-based protein complex structure ...
MULTICOM enhances AlphaFold-Multimer predictions by generating diverse MSAs and structural templates using both sequence and structure alignments for AlphaFold ...
[78]
Enhancing AlphaFold-Multimer-based Protein Complex Structure ...
May 18, 2023 · MULTICOM enhances AlphaFold-Multimer predictions by generating diverse MSAs and structural templates using both sequence and structure ...
[79]
Enhancing cryo-EM structure prediction with DeepTracer and ...
In this study, we present DeepTracer-Refine, an automated method that refines AlphaFold predicted structures by aligning them to DeepTracer's modeled structure.
[80]
Flexible fitting of AlphaFold2-predicted models to cryo-EM density ...
Nov 18, 2024 · AlphaFold2 generally predicts highly accurate structures, but 18 of the 137 models of isolated chains exhibit a TM-score below 0.80. We achieved ...
[81]
AlphaFold two years on: Validation and impact - PNAS
pTM stands for predicted TM-score, and is a measure of AlphaFold's expected global accuracy on a protein or complex. An interface-only version of pTM can be ...
[82]
Fast and accurate protein structure search with Foldseek - PMC - NIH
Foldseek enables fast and sensitive comparison of large structure sets. It encodes structures as sequences over the 20-state 3Di alphabet.
[83]
AlphaFold database empowers researchers with enhanced ...
Sep 12, 2024 · The integration provides a seamless and user-friendly experience, allowing for smooth navigation between sequence and structural data. This ...
[84]
Highly accurate protein structure prediction with AlphaFold - Nature
Jul 15, 2021 · Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure ...
[85]
Multimodal deep learning integration of cryo-EM and AlphaFold3 for ...
Oct 31, 2025 · This model is further refined using AlphaFold3-predicted structures and density maps to build final atomic structures. MICA significantly ...
[86]
Distance-AF improves predicted protein structure models by ... - Nature
Sep 30, 2025 · The use of AF2 models is further facilitated by the AlphaFold Database which now holds structure models of over 200 million proteins. AF2 has ...
[87]
EMBL-EBI and Google DeepMind renew partnership and release ...
Oct 7, 2025 · This AFDB update ensures AlphaFold coverage for newly discovered proteins and incorporates protein isoforms, providing a more complete view of ...
[88]
Clustering predicted structures at the scale of the known protein ...
Sep 13, 2023 · Here, we developed a structural-alignment-based clustering algorithm—Foldseek cluster—that can cluster hundreds of millions of structures ...
[89]
Pairwise Structure Alignment Tool - RCSB PDB
This tool allows the selection of protein 3D structures for alignment. Use an existing PDB or Computed Structure Model entry ID, upload a local file with ...
[90]
PDBe < Fold < EMBL-EBI
PDBeFold is used as a structure search engine in PDBePISA. PDBeFold queries may be launched from any web site (instructions).
[91]
RNAhub—an automated pipeline to search and align RNA ...
An online server created for identifying and aligning homologous sequences is rMSA [10]. The server generates MSA searching sequences from the NCBI Nucleotide ...
[92]
RNAhub
RNAhub is a computational tool developed to automate the retrieval and alignment of homologous RNA sequences, with the aim of facilitating RNA structural and ...
[93]
A protein structure alignment algorithm using TM-score rotation matrix
TM-align is an algorithm for sequence independent protein structure comparisons. For two protein structures of unknown equivalence, TM-align first generates ...
[94]
Fast and accurate protein structure search with Foldseek - Nature
May 8, 2023 · Structural alphabets thus reduce structure comparisons to much faster sequence alignments. Many ways to discretize the local amino acid backbone ...
[95]
Foldseek enables fast and sensitive comparisons of large structure ...
Foldseek enables fast and sensitive comparisons of large protein structure sets, supporting monomer and multimer searches, as well as clustering.
[96]
MaxCluster - A tool for Protein Structure Comparison and Clustering
MaxCluster is a command-line tool for the comparison of protein structures. It provides a simple interface for a large number of common structure comparison ...
[97]
SARST2 high-throughput and resource-efficient protein structure ...
Sep 30, 2025 · Unlike TM-align, which aligns proteins solely based on structural information18, SARST2 and Foldseek incorporate amino acid sequence features ...
[98]
NYCU-10lab/sarst: An efficient protein structural alignment ... - GitHub
SARST2 high-throughput and resource-efficient protein structure alignment against massive databases. Nature Communications (2025). https://doi.org/10.1038/ ...