Protein structure

Protein structure refers to the three-dimensional arrangement of atoms within a protein molecule, which dictates its shape, stability, and biological function. Proteins are linear polymers composed of 20 standard amino acids linked by peptide bonds to form one or more polypeptide chains, with typical lengths ranging from 50 to 2,000 residues. This arrangement enables proteins to perform diverse roles, including enzymatic catalysis, structural support, transport, and signaling, because their three-dimensional conformation allows precise interactions with other molecules. The four levels of protein structure (primary, secondary, tertiary, and quaternary) build upon one another and are stabilized by noncovalent and covalent interactions such as hydrogen bonds, hydrophobic effects, electrostatic forces, van der Waals interactions, and disulfide bridges.

The primary structure is the simplest level, defined as the specific linear sequence of amino acids in a polypeptide chain. It is determined by the genetic code and held together by covalent peptide bonds between the carboxyl group of one amino acid and the amino group of the next. Variations in this sequence, such as single amino acid substitutions, can profoundly alter protein function, as in sickle cell anemia, where a glutamate-to-valine change in β-globin causes hemoglobin to aggregate. The sequence serves as the blueprint for higher-order folding, with the chemical properties of side chains (R groups), ranging from hydrophobic to charged, driving subsequent structural organization.

At the secondary structure level, the polypeptide backbone folds locally into repeating patterns, primarily α-helices and β-sheets, stabilized by hydrogen bonds between the carbonyl oxygen and amide hydrogen of the backbone. These motifs, first elucidated in the mid-20th century through studies of fibrous proteins like keratin and silk fibroin, contribute to the protein's overall rigidity and flexibility, with α-helices forming coiled rods and β-sheets creating pleated structures that can be parallel or antiparallel. Secondary elements often cluster to form compact domains, modular units of 40–350 amino acids that can function independently or combine for complex activities.

The tertiary structure encompasses the global three-dimensional folding of a single polypeptide chain, where secondary elements and side chains pack together to form a compact, globular (or elongated fibrous) shape, driven by the burial of nonpolar residues in the core and the exposure of polar ones on the surface. This level is crucial for active sites in enzymes and for binding interfaces, and folding is often assisted by molecular chaperones that prevent aggregation and promote the correct conformation under physiological conditions. Misfolding at this stage can lead to pathological states, such as the amyloid fibrils found in Alzheimer's disease.

Many proteins exhibit quaternary structure, in which multiple polypeptide subunits (homomers or heteromers) assemble into a functional complex, stabilized by the same noncovalent interactions as tertiary structure, plus potential interchain disulfide bonds. Examples include hemoglobin, a tetramer that enables cooperative oxygen binding, illustrating how quaternary assembly enhances efficiency and regulation. Across known protein structures, only about 2,000 distinct folds have been identified, underscoring the efficiency of these principles in generating functional diversity from a limited set of building blocks.

Levels of protein structure

Primary structure

The primary structure of a protein is the linear sequence of amino acids in a polypeptide chain, connected covalently by peptide bonds between the carboxyl group of one amino acid and the amino group of the next. This sequence forms the foundational backbone of the protein and is uniquely determined for each protein type. The polypeptide chain has directionality: the N-terminus (amino terminus) carries a free amino group (-NH₂) at one end, and the C-terminus (carboxyl terminus) carries a free carboxyl group (-COOH) at the other. By convention, the primary structure is written from the N- to the C-terminus. The N-terminus is the end at which protein synthesis begins, while the C-terminus can influence stability and interactions.

The primary structure arises from the process of translation, in which messenger RNA (mRNA) sequences are decoded into amino acid chains according to the genetic code. During translation, ribosomes read mRNA in triplets called codons, each specifying one of the 20 standard amino acids via transfer RNAs (tRNAs) that match codons to their corresponding amino acids. This code, nearly universal across organisms, ensures that the nucleotide sequence in mRNA directly dictates the precise order of amino acids in the protein.

The 20 standard amino acids are distinguished by their side chains (R groups), which vary in size, shape, charge, and polarity. They range from nonpolar hydrophobic groups to polar uncharged ones in serine and threonine, acidic ones in aspartate and glutamate, and basic ones in lysine and arginine, imparting specific chemical properties that influence the protein's overall behavior.

The primary structure is critical for a protein's identity, folding, and biological function, as even minor alterations can disrupt these processes. For instance, in sickle cell anemia, a point mutation in the β-globin gene substitutes valine (a hydrophobic amino acid) for glutamate (hydrophilic) at the sixth position of the β-chain, leading to abnormal protein aggregation and red blood cell deformation. Such mutations highlight how the exact sequence governs protein stability and activity.

Analytical techniques for determining primary structure include Edman degradation, which uses phenylisothiocyanate to selectively cleave and identify the N-terminal residue in successive cycles, allowing sequencing of up to 50–60 residues. Mass spectrometry complements this by fragmenting peptides and measuring mass-to-charge ratios to infer the full sequence, often via tandem MS/MS for sequencing of complex proteins. The primary structure provides the template that influences the formation of higher-order structures.
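The codon-by-codon decoding described in this section can be sketched in a few lines of Python. This is a minimal illustration, not a bioinformatics tool: the table is the standard genetic code laid out in the usual U, C, A, G order, the function name `translate` is an assumption of this sketch, and the short mRNA fragment is only modeled on the start of a β-globin message, so its exact nucleotides should be treated as illustrative.

```python
# Standard genetic code, enumerated with bases in the order U, C, A, G
# for the first, second, and third codon positions ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string (5' to 3') into one-letter amino acid codes.

    Reading starts at the first base and stops at the first stop codon;
    trailing bases that do not fill a complete codon are ignored.
    """
    protein = []
    for pos in range(0, len(mrna) - len(mrna) % 3, 3):
        residue = CODON_TABLE[mrna[pos:pos + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


# Illustrative fragment: Met-Val-His-Leu-Thr-Pro-Glu-Glu-Lys.
normal = "AUGGUGCAUCUGACUCCUGAGGAGAAG"
# A single A-to-U change turns the codon GAG (Glu) into GUG (Val).
sickle = normal[:19] + "U" + normal[20:]

print(translate(normal))  # MVHLTPEEK
print(translate(sickle))  # MVHLTPVEK
```

Counting from the residue after the initiator methionine, the changed residue is the sixth, which matches the glutamate-to-valine substitution described above.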

Secondary structure

Secondary structure describes the local, regular conformations of the polypeptide backbone in a protein, primarily stabilized by hydrogen bonds between the backbone carbonyl oxygen (C=O) of one residue and the amide hydrogen (N-H) of another. These interactions occur between backbone atoms rather than side chains, and they give rise to repetitive structural motifs that form the building blocks of higher-order protein architecture. Unlike the linear primary sequence, secondary structures impose spatial constraints on the chain, influencing flexibility and overall folding propensity.

The predominant secondary structural elements are alpha-helices and beta-sheets. An alpha-helix is a right-handed coiled structure in which the backbone forms a cylindrical spiral, with approximately 3.6 residues per helical turn and a pitch of 5.4 Å. It is stabilized by intra-chain hydrogen bonds between the carbonyl oxygen of residue i and the amide hydrogen of residue i+4; these bonds run roughly parallel to the helix axis and are about 2.8 Å long. This configuration was first proposed by Linus Pauling and Robert Corey based on stereochemical modeling of polypeptide chains. Alpha-helices are common in proteins, comprising about 30% of residues in globular proteins, and often cluster to form hydrophobic cores.

Beta-sheets consist of extended polypeptide strands, typically 5–10 residues long, that align either in a parallel (strands running in the same N-to-C direction) or antiparallel (opposite directions) fashion to form a pleated, sheet-like array. Hydrogen bonds form between the carbonyl oxygen and amide hydrogen of adjacent strands, creating a network of bonds perpendicular to the strand direction and resulting in a twisted, pleated appearance due to the tetrahedral geometry of the alpha-carbon. Antiparallel beta-sheets are generally more stable than parallel ones because their hydrogen bonds are more linear. Beta-sheets often form the core of beta-rich proteins like silk fibroin and contribute to the rigidity of structures such as immunoglobulin domains. This motif was also theoretically derived by Pauling and Corey shortly after the alpha-helix model.

Other secondary structural elements include beta-turns, loops, and less common variants like pi-helices. Beta-turns are tight, four-residue reversals in the polypeptide chain that connect successive beta-strands or other elements, allowing the chain to fold back on itself; they are classified into types I, II, and III based on dihedral angles, with type I being the most frequent (about 50% of turns). Loops are irregular, non-repetitive segments lacking consistent hydrogen-bonding patterns, often exposed to solvent and varying in length from a few to tens of residues; they provide flexibility and can harbor functional sites. Pi-helices, a wider variant of the alpha-helix with 4.4 residues per turn and hydrogen bonds between residues i and i+5, occur infrequently (less than 1% of helical residues) and are typically found at the ends of alpha-helices or in distorted regions.

The conformational possibilities of the polypeptide backbone are constrained by steric hindrance and visualized in the Ramachandran plot, which maps the backbone dihedral angles phi (φ, rotation around the N-Cα bond) and psi (ψ, rotation around the Cα-C bond) for each residue. Allowed regions correspond to sterically favorable conformations: the alpha-helix occupies φ ≈ -60°, ψ ≈ -45°; beta-sheets cluster around φ ≈ -120°, ψ ≈ +120°; and beta-turns appear in broader areas. Glycine, lacking a beta-carbon, populates more regions due to reduced steric clash, while proline is restricted by its ring structure. Disallowed areas represent high-energy steric clashes, and the peptide bond itself stays nearly planar (ω ≈ 180°). This plot, derived from model-building and energy calculations, underscores the limited flexibility of the backbone despite 20 possible side chains.
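As a rough illustration of how the (φ, ψ) regions quoted above can be used, the following Python sketch labels a backbone dihedral pair by the region it falls in. The rectangular boundaries are assumptions chosen for illustration around the quoted centers; they are not the contoured regions of a published Ramachandran analysis, and the function name `ramachandran_region` is hypothetical.

```python
def ramachandran_region(phi: float, psi: float) -> str:
    """Return a coarse region label for backbone dihedrals given in degrees.

    The boxes below are illustrative assumptions, not validated contours.
    """
    # Right-handed alpha-helix, centered near phi = -60, psi = -45.
    if -100 <= phi <= -30 and -80 <= psi <= -5:
        return "alpha-helix"
    # Beta-strand, centered near phi = -120, psi = +120.
    if -180 <= phi <= -45 and 90 <= psi <= 180:
        return "beta-sheet"
    # Left-handed helical region, populated mostly by glycine.
    if 30 <= phi <= 100 and 0 <= psi <= 80:
        return "left-handed helix"
    return "other"


for phi, psi in [(-60, -45), (-120, 120), (60, 45), (60, -120)]:
    print(phi, psi, ramachandran_region(phi, psi))
```

A real validation tool would use residue-specific density maps (with separate treatment for glycine and proline) rather than fixed boxes.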
Secondary structures can be predicted from amino acid sequences using empirical methods like the Chou-Fasman algorithm, which assigns propensity values (Pα, Pβ) to each residue based on its observed frequency in alpha-helices and beta-sheets of experimentally determined structures. For example, alanine has a high Pα (1.42), favoring helices, while valine favors beta-sheets (Pβ = 1.70). The algorithm scans the sequence for nucleating segments in which four of six residues have P > 1.00, then extends each segment until it is broken by helix-breakers like proline. Though its accuracy is only around 50–60%, it provided early insights into sequence-structure relationships before more advanced approaches became available.

Functionally, secondary elements contribute to active sites and stability. In myoglobin, for instance, eight alpha-helices (labeled A–H) pack around the heme group, forming a hydrophobic pocket that positions the iron for reversible oxygen binding and protects it from oxidation. This helical arrangement enables myoglobin's role as an oxygen-storage protein in muscle tissue. These local motifs ultimately pack via hydrophobic interactions and side-chain packing to form the tertiary structure.
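The nucleation step of the Chou-Fasman scan can be sketched as follows. This is a simplified illustration of the "four of six residues with P > 1.00" rule only; it omits the extension and conflict-resolution steps of the full method. The Pα values are the helix propensities as commonly tabulated from Chou and Fasman's work and should be checked against a primary source before reuse; the function name `helix_nuclei` is an assumption of this sketch.

```python
# Helix propensities (P-alpha) as commonly tabulated; treat as approximate.
P_ALPHA = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16,
    "F": 1.13, "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06,
    "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83, "S": 0.77,
    "C": 0.70, "Y": 0.69, "N": 0.67, "P": 0.57, "G": 0.57,
}


def helix_nuclei(sequence: str, window: int = 6, min_formers: int = 4) -> list[int]:
    """Return start indices of windows that could nucleate a helix.

    A window qualifies when at least `min_formers` of its `window`
    residues have a helix propensity above 1.00.
    """
    hits = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        formers = sum(1 for residue in segment if P_ALPHA[residue] > 1.00)
        if formers >= min_formers:
            hits.append(start)
    return hits


print(helix_nuclei("GGAAAAGG"))  # [0, 1, 2]
print(helix_nuclei("GPGPGPGP"))  # []
```

In the full algorithm, each nucleus would then be extended in both directions while the average propensity of the flanking residues stays high enough, which is where helix-breaking residues terminate the segment.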

Tertiary structure

Tertiary structure refers to the overall three-dimensional arrangement of a single polypeptide chain, resulting from the folding of its secondary structural elements into a compact, functional conformation. This level of organization encompasses the spatial positioning of all atoms in the backbone and side chains (R groups), enabling the protein to perform its biological function. Unlike secondary structure, which involves local hydrogen bonding along the backbone, tertiary structure is stabilized primarily by interactions between distant parts of the chain or between side chains.

The stability of tertiary structure arises from a combination of non-covalent and covalent interactions. Hydrophobic interactions drive nonpolar side chains to cluster away from the aqueous environment, forming a hydrophobic core that minimizes contact with water. Ionic bonds, or salt bridges, form between oppositely charged side chains, such as those of aspartate and lysine, contributing to structural rigidity. Hydrogen bonds occur between polar side chains or between side chains and the backbone, beyond those in secondary structures. Van der Waals forces provide weak attractions between closely packed atoms in the core, while disulfide bridges, covalent bonds between cysteine residues, offer additional stabilization, particularly in extracellular proteins.

In the folded state, the hydrophobic core consists of buried nonpolar residues, shielded from water, while polar and charged residues are typically exposed on the surface, facilitating solubility and interactions with other molecules. This core formation is a key energetic driver of folding, as the burial of hydrophobic surfaces reduces the unfavorable entropy loss in surrounding water molecules. Side-chain interactions, including salt bridges and hydrogen bonds, further fine-tune the packing, ensuring precise alignment of functional groups.

Globular proteins, such as the monomeric subunit of hemoglobin, exemplify compact structures optimized for enzymatic or transport functions, with a hydrophobic interior and hydrophilic exterior. In contrast, fibrous proteins like α-keratin display elongated folds, often dominated by extended secondary elements, providing mechanical strength in tissues. These examples highlight how tertiary architecture adapts to diverse roles, from catalysis and transport in cellular environments to mechanical support in tissues.

Denaturation disrupts tertiary structure through agents like heat, chemical denaturants such as urea, or pH changes, causing the polypeptide to expand and lose its native conformation; in some cases this unfolding is reversible. Classic experiments on ribonuclease A demonstrated this reversibility: upon removal of denaturants and reoxidation of disulfide bonds, the protein refolds spontaneously into its functional form, underscoring that tertiary structure is thermodynamically determined by the amino acid sequence.

Tertiary folds exhibit greater evolutionary conservation than primary sequences, allowing proteins with low sequence identity to maintain similar three-dimensional architectures across species. This structural persistence facilitates functional divergence while preserving the core architecture, as seen in homologous proteins where mutations accumulate in surface loops but spare the buried core. In multi-subunit proteins, the tertiary fold of individual chains serves as the foundation for subsequent assembly.

Quaternary structure

Quaternary structure describes the non-covalent associations, and occasionally covalent linkages, of multiple polypeptide subunits to form a functional protein complex. These interactions typically occur between the folded tertiary structures of individual subunits, enabling the assembly of larger, often symmetric architectures essential for biological function.

Protein complexes exhibit various types, including homodimers composed of two identical subunits, heterodimers with two different subunits, and higher-order oligomers such as tetramers or larger assemblies. A prominent example is hemoglobin, a heterotetramer (α₂β₂) that facilitates oxygen transport in blood, where the subunits assemble via hydrophobic and electrostatic interactions at their interfaces. Homomers, formed by identical subunits, predominate among proteins with known quaternary structures, comprising 50–70% of such cases across diverse proteomes, which underscores their evolutionary prevalence and role in simplifying assembly.

Subunit interfaces in quaternary structures involve specific contacts, such as hydrogen bonds, salt bridges, and van der Waals interactions, that stabilize the complex and often mediate allosteric effects, where binding at one site influences activity at another. These interfaces confer functional advantages, including enhanced stability against denaturation, regulatory control through cooperativity, and specialization of active sites spanning multiple subunits. In aspartate transcarbamoylase, for example, the catalytic sites lie at the boundaries between catalytic subunits, which supports cooperative substrate binding and allosteric regulation.

Quaternary assemblies can also dissociate under environmental cues, such as pH shifts or ligand binding; for instance, hemoglobin tetramers can reversibly dissociate into αβ dimers under some conditions, which modulates oxygen affinity.

Structural domains, motifs, and folds

Structural domains

Structural domains are compact, semi-independent folding units within a protein that typically range from 50 to 200 amino acids in length and are often connected by flexible linker regions, allowing them to function autonomously while contributing to the overall protein architecture. Each domain comprises an arrangement of secondary structural elements, such as alpha helices and beta sheets, organized into a stable fold. Multi-domain proteins, which incorporate two or more such domains, are highly prevalent in nature, comprising more than 80% of proteins in eukaryotic organisms, compared to about 67% in prokaryotes. In some cases, domains can participate in domain swapping, a process where identical or similar protein monomers exchange structural elements to form dimers or higher-order oligomers, thereby modulating protein function and stability.

Protein domains serve diverse roles, including catalysis, binding, and regulation of protein activity. For instance, catalytic domains house the active sites for enzymatic reactions, while binding domains facilitate interactions with substrates, cofactors, or other molecules. Regulatory domains, such as the Src homology 2 (SH2) domain, bind phosphorylated tyrosine residues to propagate signaling cascades in cellular pathways such as receptor tyrosine kinase signaling. These functional specializations enable multi-domain proteins to integrate multiple processes within a single polypeptide chain.

From an evolutionary perspective, structural domains have expanded the functional diversity of proteomes through mechanisms like domain duplication, where a segment of the gene encoding a domain is copied internally, and domain shuffling via genetic recombination, which rearranges domains between different proteins to create novel architectures. These processes, often facilitated by exon shuffling in eukaryotes, have driven the complexity of multi-domain proteins over evolutionary time.

Structural domains can be identified through experimental structure determination, such as X-ray crystallography or cryo-electron microscopy, which reveals compact folding units, or computationally via sequence analysis using databases like Pfam, which catalogs domain families based on hidden Markov models derived from multiple sequence alignments. A prominent example of structural domains is found in antibodies, where immunoglobulin domains adopt a characteristic β-sandwich fold consisting of two antiparallel beta sheets stabilized by a disulfide bond, enabling antigen recognition and immune response modulation.

Sequence and structural motifs

Sequence and structural motifs are short, conserved patterns in protein sequences or three-dimensional structures that often confer specific biochemical functions, such as ligand binding or catalysis. These motifs typically span fewer than 50 residues and are frequently embedded within larger protein domains, distinguishing them from independently folding structural domains. Unlike domains, which represent modular, self-contained units capable of folding autonomously, motifs serve as functional signatures that can occur in diverse structural contexts.

Sequence motifs consist of linear patterns of amino acids that are conserved across evolutionarily related proteins due to their functional importance. A prominent example is the pair of Walker A and B motifs, identified in ATP-binding proteins. The Walker A motif follows the consensus GxxxxGK[T/S] and forms a phosphate-binding loop (P-loop) that interacts with the γ-phosphate of ATP, while the Walker B motif (hhhhDE, where h is a hydrophobic residue, D is aspartate, and E is glutamate) coordinates a magnesium ion essential for hydrolysis. These motifs were first recognized in nucleotide-binding enzymes like ATP synthase subunits and kinases. Sequence motifs are detected using regular expressions or pattern-matching algorithms in databases like PROSITE, which compiles biologically significant patterns for functional annotation of uncharacterized proteins.

Such sequence motifs often define binding sites or catalytic residues critical for enzymatic activity. For instance, the catalytic triad in serine proteases, comprising serine, histidine, and aspartate residues arranged to facilitate nucleophilic attack on peptide bonds, enables efficient peptide bond hydrolysis and is conserved across families like chymotrypsin and subtilisin, despite low overall sequence similarity.

Structural motifs, in contrast, refer to recurring three-dimensional arrangements that may not be evident from sequence alone but are crucial for function. The zinc finger motif, a compact ββα fold stabilized by a zinc ion coordinated to cysteine and histidine residues, enables DNA binding in transcription factors like TFIIIA, where tandem repeats recognize specific sequences. Similarly, the leucine zipper motif features two α-helices with leucine residues at every seventh position forming a coiled-coil dimer interface, facilitating protein-protein interactions in transcription factors such as C/EBP. Another example is the EF-hand motif, a helix-loop-helix structure with a 12-residue loop that binds calcium ions via oxygen-containing side chains, as seen in calmodulin, where it triggers conformational changes for target binding.

Detection of structural motifs relies on geometric searches in protein structure databases like the Protein Data Bank (PDB), using algorithms that match spatial arrangements of secondary elements or atom coordinates, often integrated into tools like those in the CATH or SCOP classifications. These motifs underpin diverse functions, including metal ion coordination, dimerization, and nucleic acid binding, and their conservation highlights evolutionary pressures for functional specificity.
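The regular-expression style of sequence motif detection mentioned above can be illustrated with the Walker A consensus GxxxxGK[T/S]. The sketch below uses Python's standard `re` module; the helper name `find_walker_a` and the test sequence are hypothetical and chosen only to show the mechanics.

```python
import re

# Walker A consensus GxxxxGK[T/S]: glycine, any four residues,
# glycine, lysine, then threonine or serine.
WALKER_A = re.compile(r"G.{4}GK[TS]")


def find_walker_a(sequence: str) -> list[tuple[int, str]]:
    """Return (start index, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in WALKER_A.finditer(sequence)]


# Hypothetical sequence containing one P-loop-like stretch.
print(find_walker_a("MKLIGPPGSGKTTLAR"))  # [(4, 'GPPGSGKT')]
print(find_walker_a("MKLIAPPASAKTTLAR"))  # []
```

A pattern match of this kind only flags a candidate site. Because short patterns can occur by chance, curated databases typically pair such patterns with profiles or additional evidence before assigning a function.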

Supersecondary structures

Supersecondary structures, also known as structural motifs, are recurring combinations of two or more secondary structural elements, such as α-helices and β-strands, connected by short loops or turns. They form compact and stable spatial units that serve as intermediate building blocks between the secondary and tertiary levels of protein organization. These structures are characterized by specific geometric arrangements stabilized by hydrogen bonds, hydrophobic interactions, and van der Waals forces, and they are often more rigid than isolated secondary elements.

Common examples of supersecondary structures include the helix-turn-helix motif, consisting of two α-helices linked by a short turn, which provides a stable scaffold frequently observed in regulatory proteins. Another prevalent motif is the β-α-β unit, where two parallel β-strands are connected by an intervening α-helix in a right-handed crossover, contributing to the core of many enzymatic domains. The β-hairpin, formed by two antiparallel β-strands joined by a tight turn of 2–5 residues, is a simple yet versatile structure that can fold independently and is classified into types based on loop conformation and hydrogen bonding patterns. A more extended example is the Rossmann fold, comprising multiple tandem β-α-β motifs arranged around a central β-sheet, which is widely distributed in proteins binding dinucleotides like NAD⁺ and was first systematically described in comparative analyses of dehydrogenases.

These supersecondary structures play a crucial role in protein architecture by acting as modular building blocks that assemble into larger structural domains and folds, facilitating efficient packing and functional organization. Their stability arises from the close packing of adjacent secondary elements, which minimizes solvent exposure and maximizes non-covalent interactions; for instance, spectroscopic studies have shown that isolated β-hairpins and α-α-corners can maintain native-like conformations in peptide fragments.

Evolutionarily, supersecondary structures exhibit high conservation across diverse protein families, even in non-homologous sequences, indicating their emergence as ancient folding nuclei that have been reused and diversified throughout protein evolution, as evidenced by phylogenetic analyses of motifs like the Rossmann fold in cofactor-binding enzymes. Prediction of these structures relies on sequence-based methods that leverage propensities for secondary element formation and loop flexibility, including early statistical approaches that recognize patterns in sequences and modern machine learning models trained on structural databases like the PDB. Such predictions integrate supersecondary units into models of tertiary folds to guide overall structure determination.

Protein folds

Protein folds are the distinctive three-dimensional topologies formed by the backbone of polypeptide chains, encompassing the overall arrangement of secondary structural elements without regard to the specific sequence. These topologies are recurrent patterns observed across diverse proteins, such as the TIM barrel, which features a central cylinder of eight parallel β-strands encircled by eight α-helices and supports enzymatic activity in numerous metabolic pathways. Another prominent example is the immunoglobulin fold, a β-sandwich structure composed of two Greek key β-sheets packed against each other, commonly found in immune recognition proteins.

Despite the immense diversity of protein sequences (there are 20^100, or roughly 10^130, possible 100-residue polypeptides), the structural fold space is remarkably constrained, with approximately 2,000 distinct folds cataloged in major databases as of 2024. Recent AI-driven predictions, such as those from AlphaFold, have expanded the cataloged folds, revealing nearly 200 new ones in 2024 and further illuminating fold space. This limitation arises from biophysical constraints on stable, functional architectures, allowing unrelated sequences to independently evolve into the same fold through convergent evolution, where selective pressures favor similar structural solutions for analogous roles. In contrast, divergent evolution preserves folds within homologous protein families descended from a common ancestor. Such convergence is evident in cases like the Rossmann fold, a β-α-β motif repeated to form a nucleotide-binding domain, which has been adopted by dehydrogenases and other enzymes handling NAD(P)-dependent reactions across distant lineages.

Protein folds are systematically classified in hierarchical databases like SCOP (Structural Classification of Proteins) and CATH (Class, Architecture, Topology, and Homologous superfamily), which delineate folds based on topological and geometric similarity of secondary elements. SCOP organizes structures into classes, folds, superfamilies, and families, emphasizing evolutionary relationships, while CATH focuses on architectural and topological descriptors to group domains. These resources reveal that specific folds often constrain functional possibilities; for example, the Rossmann fold predominantly supports coenzyme-binding roles, limiting the range of reactions it can accommodate.

Beyond sequence-based methods, fold comparison enables the detection of distant homologs by identifying shared topologies obscured by low sequence identity, thus inferring evolutionary and functional connections in proteins that diverged over billions of years. Tools leveraging structural alignments, such as those in SCOP and CATH, facilitate this by quantifying similarities in geometry, aiding the functional annotation of uncharacterized proteins. This approach underscores how exploring fold space bridges gaps in understanding protein evolution and multifunctionality.

Protein dynamics and conformational changes

Protein dynamics

Protein dynamics refers to the time-dependent fluctuations and movements within protein structures that occur even in their native states, influencing their biological functions. These motions range from small-scale vibrations to large-scale rearrangements, allowing proteins to adapt to environmental changes and interact with other molecules. Understanding these dynamics is essential, as they underpin processes like enzymatic activity and molecular recognition.

Key types of motion in proteins include side-chain rotations, in which amino acid side chains change conformation around their chi angles; loop flexibility, where flexible loops undergo bending or twisting; and hinge bending, characterized by rigid-body rotations between structural domains connected by flexible hinges. Side-chain rotations enable local adjustments for substrate positioning, loop flexibility facilitates access to active sites, and hinge bending allows for overall shape changes in multi-domain proteins.

These motions occur across a broad spectrum of timescales, from femtoseconds for bond vibrations to milliseconds for domain movements and hinge bending. The fastest vibrational motions involve stretching and bending of covalent bonds, whereas slower nanosecond to microsecond timescales capture loop and side-chain dynamics, and millisecond events correspond to larger conformational shifts.

Molecular dynamics (MD) simulations and time-resolved spectroscopy are primary techniques for probing protein dynamics. MD simulations model atomic trajectories over time using force fields to predict motions from femtoseconds to microseconds, providing insight into timescales that are hard to access experimentally. Time-resolved spectroscopy, such as NMR spin relaxation or temperature-jump methods, captures structural changes by monitoring spectroscopic signals, in some methods after a perturbation, revealing dynamics on nanosecond to millisecond scales.

Protein dynamics play critical functional roles, including enabling catalysis by positioning substrates in enzyme active sites, facilitating ligand binding through transient openings, and mediating allostery, where motions in one region propagate signals to distant sites. For instance, breathing motions, which are collective expansions and contractions of the protein core, allow enzymes to accommodate substrates and release products, enhancing catalytic efficiency.

A significant aspect of protein dynamics is intrinsic disorder, where certain regions lack a fixed three-dimensional structure and instead exist as dynamic ensembles under physiological conditions. These intrinsically disordered regions (IDRs) are widespread, particularly in signaling proteins, where about 66% are reported to contain long disordered segments that enable flexible interactions with multiple partners. IDRs contribute to protein dynamics by allowing the rapid conformational sampling essential for regulatory functions. These dynamic processes contribute to the broader conformational ensembles that proteins sample, linking microscopic motions to functional versatility.
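To make the idea of MD "modeling atomic trajectories over time using force fields" more concrete, here is a toy sketch of the velocity Verlet scheme, a time-stepping method widely used in MD codes. Everything here is an assumption for illustration: a single particle in one dimension, reduced (unitless) quantities, and a harmonic spring standing in for a real force field. Production simulations use dedicated packages and many-atom potentials.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Propagate one particle in 1D and return position and velocity histories."""
    xs, vs = [x], [v]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        xs.append(x)
        vs.append(v)
    return xs, vs


# Toy "force field": a harmonic bond with spring constant K (reduced units).
K = 1.0
MASS = 1.0


def spring_force(x):
    return -K * x


def total_energy(x, v):
    return 0.5 * MASS * v * v + 0.5 * K * x * x


xs, vs = velocity_verlet(1.0, 0.0, spring_force, MASS, dt=0.01, n_steps=1000)
drift = abs(total_energy(xs[-1], vs[-1]) - total_energy(xs[0], vs[0]))
print(f"final position {xs[-1]:.4f}, exact {math.cos(10.0):.4f}, energy drift {drift:.2e}")
```

The time step must be small compared with the fastest motion in the system. In atomistic simulations that motion is bond vibration, which is why time steps are typically on the femtosecond scale and why reaching millisecond events is computationally demanding.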

Conformational ensembles

Proteins in their native environments exist not as rigid, single structures but as dynamic ensembles of multiple three-dimensional conformations that interconvert under physiological conditions. This view challenges the traditional static model derived from early X-ray crystallography, emphasizing instead the inherent flexibility essential for biological function. The conformational ensemble represents the Boltzmann-weighted population of states accessible to the protein, where each conformation's occupancy is determined by its free energy relative to the others.

The distribution of conformations within an ensemble follows the Boltzmann distribution:

$$P_i = \frac{e^{-\Delta G_i / RT}}{\sum_j e^{-\Delta G_j / RT}}$$

where P_i is the probability of conformation i, \Delta G_i is its free energy difference from the reference state, R is the gas constant, and T is the absolute temperature. Lower-energy conformations dominate the ensemble, while higher-energy, low-population states can still contribute to function if transiently stabilized.

Experimental sampling of these ensembles relies on techniques that probe structural heterogeneity and populations. Nuclear magnetic resonance (NMR) relaxation measurements, such as ^{15}N relaxation rates, reveal conformational exchange on microsecond-to-millisecond timescales and quantify order parameters and correlation times that reflect motional amplitudes across the ensemble. Similarly, single-molecule Förster resonance energy transfer (smFRET) tracks real-time distance fluctuations between fluorophore-labeled sites, enabling direct observation of conformational subpopulations and their interconversion kinetics without ensemble averaging.

Conformational ensembles play a critical role in protein-ligand interactions, where two mechanisms facilitate specificity and efficiency: conformational selection, in which the ligand binds a rare, pre-existing state and thereby shifts the equilibrium, and induced fit, in which binding actively drives a conformational change. These processes are not mutually exclusive; many systems exhibit hybrid behaviors, with selection dominating for low-affinity initial encounters and induced fit stabilizing the bound state. A representative example is adenylate kinase, an enzyme that catalyzes phosphate transfer and alternates between an open, substrate-accessible conformation (predominant in the apo form) and a closed, catalytically active state upon binding ATP and AMP, with the ensemble allowing the rapid transitions essential for its kinetic cycle. These ensembles arise from underlying dynamic motions that sample the energy landscape, though the focus here is on the equilibrium distribution rather than transient kinetics.

Advances in structural biology have enhanced ensemble characterization, particularly through cryo-electron microscopy (cryo-EM), which captures snapshots of heterogeneous states in near-native conditions and, when combined with computational modeling, resolves multiple conformers and their relative populations from vitrified samples. This integration overcomes limitations of traditional methods by accommodating larger, more complex systems and providing atomic-level insights into sparsely populated states.
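The Boltzmann weighting of conformational states can be worked through numerically. The sketch below is a minimal illustration: the three free energy values are hypothetical, energies are assumed to be in kcal/mol relative to the lowest state, and the function name `boltzmann_populations` is an assumption of this example.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def boltzmann_populations(delta_g, temperature=298.15):
    """Return the equilibrium population of each state.

    delta_g: free energies in kcal/mol relative to a common reference.
    temperature: absolute temperature in kelvin.
    """
    rt = R_KCAL * temperature
    weights = [math.exp(-dg / rt) for dg in delta_g]
    partition = sum(weights)  # the denominator of the Boltzmann expression
    return [w / partition for w in weights]


# Hypothetical three-state ensemble: ground state, +1 and +3 kcal/mol.
populations = boltzmann_populations([0.0, 1.0, 3.0])
for dg, p in zip([0.0, 1.0, 3.0], populations):
    print(f"dG = {dg:.1f} kcal/mol -> population {p:.3f}")
```

At room temperature RT is about 0.59 kcal/mol, so in this example a state only 1 kcal/mol above the ground state still holds roughly 16% of the population, while one at 3 kcal/mol holds about 0.5%. This is why sparsely populated states can remain functionally relevant yet be hard to observe.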

Protein folding

Folding mechanisms

Protein folding mechanisms describe the physical processes by which polypeptide chains transition from disordered, unfolded states to their functional native conformations, navigating an immense conformational space in biologically feasible timescales. The Levinthal paradox highlights the challenge of this process: for a typical protein with 100 residues, each able to adopt approximately 3 conformations, the total number of possible structures exceeds 10^47 (3^100 ≈ 5 × 10^47), yet proteins fold in milliseconds to seconds, implying that random sampling would take longer than the age of the universe. This paradox underscores that folding cannot proceed via exhaustive random search but must follow directed pathways biased by the protein's energy landscape.

The folding funnel model resolves the Levinthal paradox by conceptualizing the protein's free energy landscape as a funnel-shaped surface, where the unfolded ensemble occupies the wide, high-energy rim and the chain progressively loses conformational entropy while decreasing in energy toward the native state at the bottom. In this statistical mechanical framework, evolutionarily optimized sequences minimize energetic frustration, creating smooth funnels that guide folding without deep kinetic traps and enable rapid convergence to the native structure. The funnel's ruggedness reflects local minima, but the overall bias toward the native state ensures efficient folding for minimally frustrated proteins.

Proteins exhibit either two-state or multi-state folding kinetics, depending on their size and topology. In two-state folding, the transition from the unfolded (U) to the native (N) state is cooperative, with no detectably populated intermediates, as seen in small, single-domain proteins like chymotrypsin inhibitor 2 (CI2), where the folding rate is limited by a single high-energy transition state.
Multi-state folding involves obligatory on-pathway intermediates, common in larger proteins, where partial structures form sequentially before reaching the native state, allowing for more complex energy landscapes with multiple barriers. The distinction arises from the protein's foldon units (cooperative substructures that nucleate folding) and is probed by comparing equilibrium and kinetic unfolding data.

A key mechanism in both two-state and multi-state folding is nucleation-condensation, where an initial nucleus of ordered secondary structure forms, followed by rapid condensation of the remaining chain around it to stabilize tertiary interactions. In CI2, for example, a diffuse nucleus formed by residues of the α-helix and the β-sheet initiates folding, with the transition state featuring partial native-like interactions that propagate structure formation. This hybrid mechanism combines elements of the framework (secondary structure first) and hydrophobic collapse models, optimizing folding rates by coupling local and nonlocal contacts early. It predominates in small globular proteins, ensuring cooperative transitions without stable off-pathway species.

Despite directed pathways, folding landscapes contain off-pathway traps where misfolded conformations form kinetic dead-ends and can lead to aggregation. These traps arise from sequence-specific frustrations, such as improper hydrophobic burial or beta-strand mispairing, which slow productive folding. A prominent example is prion proteins, where the cellular PrP^C (alpha-helical) can misfold into the beta-sheet-rich PrP^Sc isoform, seeding self-propagating aggregates that cause transmissible spongiform encephalopathies. Such off-pathway events highlight the role of kinetic partitioning in folding efficiency, with misfolding rates increasing under cellular stress.

Experimental probes elucidate these mechanisms through kinetic and structural analyses.
Phi (Φ)-value analysis quantifies transition-state structure by measuring how mutations change folding/unfolding rates and stabilities, where Φ ≈ 1 indicates native-like interactions at the mutated site and Φ ≈ 0 indicates unfolded-like structure; in barnase, Φ values revealed a polarized transition state with a structured core. Stopped-flow kinetics, which uses rapid mixing to initiate refolding and monitors fluorescence or absorbance changes, resolves millisecond-scale transitions and distinguishes two-state chevron plots, whose arms are linear, from the curved arms that indicate intermediates in multi-state folding.

In vivo, these intrinsic mechanisms are supported by chaperones that prevent aggregation, but the core pathways remain sequence-determined.
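The arithmetic behind the Levinthal estimate earlier in this section can be checked directly. The sketch below uses the figures from the text (100 residues, about 3 conformations each); the sampling rate of 10^13 conformations per second and the value used for the age of the universe are assumptions introduced only for this illustration.

```python
CONFORMATIONS_PER_RESIDUE = 3   # figure used in the text
RESIDUES = 100                  # figure used in the text
SAMPLING_RATE_PER_S = 1e13      # assumption: one conformation per ~0.1 ps
AGE_OF_UNIVERSE_S = 4.35e17     # assumption: roughly 13.8 billion years


def exhaustive_search_time(n_residues, per_residue, rate_per_s):
    """Seconds needed to visit every conformation once at a fixed rate."""
    total_conformations = per_residue ** n_residues
    return total_conformations / rate_per_s


total = CONFORMATIONS_PER_RESIDUE ** RESIDUES
seconds = exhaustive_search_time(
    RESIDUES, CONFORMATIONS_PER_RESIDUE, SAMPLING_RATE_PER_S
)

print(f"conformations:       {total:.2e}")
print(f"search time (s):     {seconds:.2e}")
print(f"in ages of universe: {seconds / AGE_OF_UNIVERSE_S:.2e}")
```

Even under this generous assumed sampling rate, an exhaustive search would take many orders of magnitude longer than the age of the universe, which is why the funnel picture of a biased search is needed.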

Chaperones and folding assistants

Molecular chaperones, particularly the heat shock protein (Hsp) families, play essential roles in assisting protein folding within the crowded cellular environment by preventing misfolding and aggregation of nascent or stress-damaged polypeptides. These proteins do not impart a specific folded structure but instead facilitate correct folding through transient interactions that shield hydrophobic regions exposed in unfolded states. Among the major classes, Hsp70 and Hsp60 (chaperonins) are key types that operate via distinct but complementary mechanisms to promote productive folding pathways.

Hsp70 chaperones, such as the bacterial DnaK and the eukaryotic Hsc70 or inducible Hsp70, bind to unfolded polypeptide chains, stabilizing them in a conformation competent for folding. This binding occurs through an ATP-dependent cycle: in the ATP-bound state, the substrate-binding domain (SBD) adopts an open conformation with low affinity for substrates; ATP hydrolysis, stimulated by co-chaperones, transitions the SBD to a closed, high-affinity state that clamps onto hydrophobic segments of the unfolded chain, effectively isolating it from aggregation-prone interactions. Nucleotide exchange factors then promote ADP release and ATP rebinding, releasing the substrate to allow folding attempts. This iterative cycle prevents premature aggregation and enables repeated binding-release events, increasing the likelihood of reaching the native state.

In contrast, Hsp60 chaperonins, exemplified by the bacterial GroEL, function by encapsulating substrates within a protected cavity to isolate them during folding. GroEL forms a double-ring structure with 14 identical subunits, each containing apical, intermediate, and equatorial domains; the equatorial domain binds ATP, while the apical domain captures unfolded proteins via hydrophobic grooves.
Upon ATP binding, the co-chaperonin GroES caps one ring, enlarging the central cavity into a hydrophilic environment that expels bound water and hydrophobic residues, promoting substrate expansion and folding. This encapsulation mechanism sequesters a single substrate protein per cycle, preventing intermolecular associations that lead to aggregates, and the process repeats for iterative annealing until the native fold is achieved.

Co-chaperones regulate these cycles for specificity and efficiency. Hsp40 proteins (DnaJ homologs), which contain a J domain, target unfolded chains to Hsp70 by first binding substrates themselves and then stimulating Hsp70's ATPase activity up to 1000-fold through interaction with its nucleotide-binding domain. This enhances substrate delivery and clamps the complex during the high-affinity phase. Hop (STIP1), another co-chaperone, bridges Hsp70 and Hsp90 by binding their C-terminal motifs via TPR domains, facilitating transfer of partially folded clients to Hsp90 for further maturation in a coordinated chaperone network. These regulators ensure timely progression through folding stages, preventing kinetic traps.

Chaperones are vital for de novo folding of newly synthesized proteins emerging from ribosomes, where Hsp70 systems capture nascent chains co-translationally to avert aggregation in the cytosol. Under stress conditions, such as heat shock, they also mediate refolding of denatured proteins; for instance, Hsp70 solubilizes aggregates in cooperation with disaggregases, allowing recapture and iterative folding attempts. Chaperonins like GroEL similarly assist refolding by providing an isolated compartment, as demonstrated with substrates like rhodanese, where encapsulation yields up to 90% recovery of native activity after denaturation.

In eukaryotes, the TRiC (or CCT) chaperonin serves as a functional analog to GroEL, folding approximately 10% of the cytosolic proteome, including actin and tubulin.
Composed of eight distinct subunits forming a hetero-oligomeric double ring, TRiC uses a built-in lid mechanism without a separate co-chaperonin like GroES; ATP hydrolysis drives asymmetric conformational changes, sequentially closing the chamber to create a polarized environment that guides folding. It often cooperates with prefoldin for delivery of obligate substrates, highlighting its specialized role in complex eukaryotic folding.

Deficiencies in chaperone function contribute to neurodegenerative diseases characterized by protein aggregation. In conditions like Alzheimer's and Parkinson's, impaired Hsp70 activity leads to accumulation of misfolded tau or α-synuclein, exacerbating neuronal toxicity due to failed refolding and clearance. Similarly, reduced TRiC efficiency disrupts cytoskeletal protein folding, promoting amyloid formation and synaptic loss in Huntington's disease models. These chaperone deficits underscore their protective role against proteotoxic stress in the aging brain.

Protein stability

Thermodynamic principles

The native conformation of a protein represents the thermodynamically most stable state under physiological conditions, corresponding to the global minimum of the free energy landscape. The stability of this native state relative to the unfolded ensemble is quantified by the standard free energy change for unfolding, \Delta G^\circ = G_U - G_N = \Delta H - T\Delta S, where \Delta H is the enthalpy change, T is the absolute temperature, and \Delta S is the entropy change; a positive \Delta G^\circ ensures the native state predominates at equilibrium. This thermodynamic framework underpins Anfinsen's dogma, which posits that the amino acid sequence of a protein encodes the information necessary for it to achieve its thermodynamically favored native structure spontaneously in vitro, as demonstrated by refolding experiments on ribonuclease A.

Proteins exhibit marginal thermodynamic stability, with the native state typically only 5–15 kcal/mol more stable than the unfolded state under ambient conditions, allowing functional flexibility while preventing aggregation. This narrow energy margin arises from a delicate balance of enthalpic and entropic contributions, where unfolding exposes hydrophobic residues to solvent, leading to a characteristic positive heat capacity change, \Delta C_p > 0, typically on the order of 1–3 kcal/(mol·K) for small proteins. The \Delta C_p term influences the temperature dependence of \Delta G^\circ via the Gibbs-Helmholtz relation, \Delta G(T) = \Delta H(T_0) + \int_{T_0}^T \Delta C_p \, dT - T \left[ \Delta S(T_0) + \int_{T_0}^T \frac{\Delta C_p}{T} \, dT \right], resulting in curved stability profiles that peak at an intermediate temperature and enable both heat and cold denaturation.

For many globular proteins, unfolding follows a two-state model, approximating an all-or-nothing transition between native (N) and unfolded (U) states without stable intermediates: N \rightleftharpoons U.
The equilibrium constant is K = \frac{[U]}{[N]} = e^{-\Delta G / RT}, where R is the gas constant, allowing stability parameters to be extrapolated from denaturation experiments that use chemical denaturants or temperature. Differential scanning calorimetry (DSC) directly measures the heat capacity as a function of temperature, yielding unfolding endotherms from which \Delta H, T_m (the midpoint temperature), and \Delta C_p are derived to construct comprehensive stability profiles.
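As a worked illustration of the Gibbs-Helmholtz relation and the two-state equilibrium constant, the sketch below evaluates a stability curve for a temperature-independent \Delta C_p. Choosing T_0 = T_m, where \Delta G = 0 and therefore \Delta S(T_m) = \Delta H(T_m)/T_m, the relation reduces to \Delta G(T) = \Delta H_m (1 - T/T_m) + \Delta C_p [(T - T_m) - T \ln(T/T_m)]. The parameter values (\Delta H_m, T_m, \Delta C_p) are hypothetical and describe no particular protein.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

# Hypothetical parameters for a small two-state protein (assumptions).
DH_M = 100.0   # unfolding enthalpy at T_m, kcal/mol
T_M = 333.15   # midpoint temperature, K (60 degrees C)
DCP = 2.0      # heat capacity change, kcal/(mol*K), assumed constant


def delta_g_unfolding(t, dh_m=DH_M, t_m=T_M, dcp=DCP):
    """Unfolding free energy (kcal/mol) at temperature t, constant dCp."""
    return dh_m * (1.0 - t / t_m) + dcp * ((t - t_m) - t * math.log(t / t_m))


def fraction_unfolded(t):
    """Two-state model: K = [U]/[N] = exp(-dG/RT), f_U = K / (1 + K)."""
    k = math.exp(-delta_g_unfolding(t) / (R_KCAL * t))
    return k / (1.0 + k)


def temperature_of_max_stability(dh_m=DH_M, t_m=T_M, dcp=DCP):
    """Temperature where dS(T) = 0, i.e. where dG(T) is maximal."""
    return t_m * math.exp(-dh_m / (t_m * dcp))


for t in (235.0, 260.0, 286.7, 298.15, 333.15, 345.0):
    print(f"T = {t:7.2f} K  dG = {delta_g_unfolding(t):6.2f} kcal/mol  "
          f"f_U = {fraction_unfolded(t):.3f}")
```

With these assumed parameters the curve crosses zero both above the maximum (heat denaturation at T_m) and well below it (cold denaturation), and the stability at 25 °C falls inside the 5–15 kcal/mol range quoted in this section.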

Factors influencing stability

Protein stability is modulated by a variety of environmental and molecular factors that alter the balance between the folded and unfolded states, primarily by influencing the free energy landscape as described in thermodynamic principles. These factors can either enhance or disrupt stabilizing interactions such as hydrogen bonds, hydrophobic effects, and electrostatic forces within the protein structure.

The pH of the surrounding environment significantly affects protein stability by protonating or deprotonating ionizable residues, which in turn influences electrostatic interactions like salt bridges and charge repulsion. At extreme pH values, such as highly acidic or basic conditions, the net charge on the protein can increase, leading to repulsion between like-charged residues and subsequent unfolding. For instance, many proteins exhibit optimal stability near their isoelectric point, where the net charge is minimized, reducing electrostatic repulsion. Ionic strength, determined by salt concentration, modulates these electrostatic effects by screening charges (Debye-Hückel screening); low ionic strength enhances the charge-charge attractions that stabilize salt bridges, while high ionic strength can weaken them, potentially destabilizing the structure. In monoclonal antibodies, for example, increasing ionic strength from low to moderate levels often stabilizes the folded state by reducing unfavorable repulsions.

Temperature exerts a profound influence on protein stability, with elevated temperatures promoting thermal denaturation by increasing molecular motion and disrupting weak non-covalent interactions, leading to unfolding above a characteristic melting temperature (Tm). Conversely, cold denaturation occurs at low temperatures, where the hydrophobic effect weakens due to a reduced entropy gain upon burial of nonpolar residues, destabilizing the core.
Pressure affects stability through volumetric changes; high hydrostatic pressure favors the unfolded state by compressing voids in the protein structure and promoting water penetration, as seen in pressure-induced denaturation of globular proteins. This is particularly relevant for deep-sea organisms, where pressures can reach roughly 100 MPa in the deepest trenches, yet adapted proteins maintain integrity via compact folding.

Ligands and cofactors play crucial roles in stabilizing proteins by binding to specific sites, often rigidifying the structure and shifting the equilibrium toward the folded state. Small-molecule ligands can form additional hydrogen bonds or hydrophobic contacts, enhancing overall stability, as demonstrated in screening methods that identify stabilizing additives for therapeutic proteins. Metal ions, such as zinc or calcium, serve as cofactors that coordinate with residues in the active site or core, bridging distant parts of the polypeptide chain and preventing unfolding; for example, in zinc-finger proteins, metal binding increases thermal stability by up to 20–30 °C. These bound states are essential for many metal-dependent enzymes, where cofactor absence leads to rapid degradation.

Mutations alter protein stability by modifying intramolecular interactions, with effects ranging from stabilizing to destabilizing depending on their location and nature. Core mutations that improve packing density, such as replacing a smaller residue with a bulkier one, can enhance hydrophobic interactions and increase stability, as observed in engineered variants of T4 lysozyme. Conversely, surface mutations introducing charged mismatches or disrupting hydrogen bonds often destabilize the structure, contributing to diseases like cystic fibrosis via misfolding of the CFTR protein. Single-point mutations typically follow a Gaussian distribution in their stability impact, with most causing modest destabilization due to the marginal stability of wild-type proteins.
Post-translational modifications, particularly glycosylation, contribute to stability by adding carbohydrate moieties that shield hydrophobic regions, promote proper folding, and resist proteolytic degradation. N-linked glycosylation, for instance, stabilizes glycoproteins like immunoglobulins by increasing solubility and reducing aggregation propensity through steric hindrance. In therapeutic monoclonal antibodies, glycosylation at specific sites enhances thermal stability by modulating surface charge and hydrogen bonding networks. This modification is critical in eukaryotic proteins, where its absence often leads to endoplasmic reticulum stress and degradation.

In extremophiles, adaptations enhance protein stability under harsh conditions; proteins from thermophilic organisms often feature additional disulfide bonds, which covalently link distant cysteines to rigidify the structure and resist unfolding. These proteins also exhibit a higher content of charged residues on their surfaces, which strengthens salt bridges, and more compact cores with optimized hydrophobic packing, allowing activity at temperatures above 80°C. Such adaptations, evolved through selection for thermostability, also include features that limit backbone flexibility, as seen in hyperthermophilic archaeal enzymes.

Experimental determination of protein structures

Biophysical techniques

Biophysical techniques for determining protein structures rely on physical principles that probe atomic arrangements through interactions of matter with radiation or fields, enabling the reconstruction of three-dimensional models from experimental data. These methods exploit phenomena such as diffraction, scattering, and magnetic resonance to generate signals that, when analyzed, yield electron density distributions or positional coordinates of atoms within proteins. The foundational goal is to achieve sufficient resolution to distinguish atomic features, typically measured in angstroms (Å), where high-resolution data allows for precise placement of individual atoms, while lower-resolution outputs provide overall shapes and secondary structure elements.

Resolution in protein structure determination refers to the smallest distance between features that can be reliably distinguished. High resolution is generally considered to be better than 3 Å, enabling the visualization of side-chain orientations and hydrogen bonding patterns, whereas resolutions worse than 4 Å are low and reveal only the protein's gross architecture, such as domain arrangements. For instance, structures at 1–2 Å allow unambiguous chain tracing, akin to seeing individual beads on a string, while low-resolution maps at 5–10 Å resemble fuzzy outlines. Sample state is a critical aspect, varying by technique: crystalline states are required for methods like X-ray diffraction to produce ordered lattices for wave interference, solution states for nuclear magnetic resonance (NMR) to maintain native dynamics in liquid environments, and frozen hydrated states for cryo-electron microscopy to preserve biomolecules in near-native conditions without crystals.

The core principles of diffraction and scattering underpin many techniques, where incident waves (e.g., X-rays or electrons) interact with the electrons in protein atoms, producing interference patterns that encode spatial information. These patterns are mathematically transformed via Fourier transforms into electron density maps, which depict regions of high electron concentration corresponding to atomic positions; model building into these maps is guided by the protein's known amino acid sequence.
In scattering approaches, such as small-angle X-ray scattering (SAXS), the overall shape is inferred from low-angle deflections without needing atomic detail. Limitations persist, including the phase problem in crystallography, where diffraction intensities are measured but phase information is lost, requiring indirect methods like isomorphous replacement for reconstruction, and size constraints in NMR, which is typically limited to proteins under 50 kDa because slower tumbling in larger molecules broadens the signals.

To overcome the shortcomings of individual techniques, hybrid approaches integrate data from multiple sources for more complete models, such as combining a low-resolution envelope from one technique with high-resolution fragments from another to assemble full structures of large complexes. These integrative methods use computational frameworks to fit and validate components against complementary datasets, enhancing accuracy for dynamic or heterogeneous systems.

A pivotal historical milestone was the determination of the first protein structure, myoglobin, at 6 Å resolution in 1958 by John Kendrew and colleagues using X-ray crystallography, marking the advent of direct structural insights into globular proteins and earning Kendrew a share of the 1962 Nobel Prize in Chemistry. Subsequent refinement to 2 Å in 1960 confirmed the alpha-helical fold, revolutionizing structural biology.

Key experimental methods

X-ray crystallography remains the most widely used technique for determining high-resolution protein structures, accounting for over 80% of entries in structural databases as of 2023. The process begins with the challenging task of growing well-ordered protein crystals, often requiring extensive optimization of conditions such as pH, temperature, and precipitant concentrations. Once crystals are obtained, they are exposed to a beam of X-rays, which scatter off the atoms to produce diffraction patterns; these patterns are analyzed using mathematical methods like Fourier transforms to reconstruct electron density maps. Atomic models are then built into these maps and refined iteratively, often achieving resolutions better than 2 Å for small to medium-sized proteins, as exemplified by the structure of myoglobin solved at 2 Å in 1960.

Nuclear magnetic resonance (NMR) spectroscopy complements X-ray crystallography by providing structures of proteins in solution, which more closely mimic physiological conditions. It relies on measuring nuclear Overhauser effects (NOEs) to identify spatial proximities between atoms, typically within 5 Å, along with restraints from scalar coupling constants and chemical shift data to define secondary structures. For proteins up to about 50 kDa, multidimensional NMR experiments, such as 3D or 4D heteronuclear methods, enable assignment of resonances and structure calculation using restrained molecular dynamics simulations, yielding ensembles that capture conformational flexibility; NMR relaxation measurements also provide insights into dynamics across a range of timescales.

Cryo-electron microscopy (cryo-EM) has undergone a "resolution revolution" since the 2010s, driven by advances in direct electron detectors, phase plates, and computational image processing, enabling routine determination of structures at near-atomic resolution (better than 3 Å); as of 2025, resolutions better than 2 Å are increasingly common.
In single-particle cryo-EM, purified proteins are flash-frozen in vitreous ice to preserve native states and imaged at cryogenic temperatures to minimize beam damage; thousands of particle projections are then aligned and averaged using software such as RELION or cryoSPARC to reconstruct 3D density maps. This method excels for large macromolecular complexes, such as ribosomes or viral particles exceeding 500 kDa, and has resolved structures like the 3.4 Å map of the human γ-secretase complex in 2015.

Small-angle X-ray scattering (SAXS) offers a lower-resolution (typically 10–50 Å) but versatile approach for probing overall protein shapes, flexibility, and assemblies in solution, particularly for disordered or heterogeneous systems unsuitable for high-resolution methods. SAXS measures the scattering of X-rays at small angles to derive parameters like the radius of gyration (R_g) and maximum dimension (D_max), which inform on global architecture; for example, it has been used to model the elongated shape of intrinsically disordered proteins like α-synuclein. Data analysis often involves ab initio modeling or ensemble optimization to fit scattering profiles, providing complementary information to high-resolution techniques.

Each method has distinct strengths: X-ray crystallography delivers the highest precision for rigid, crystallizable proteins but requires crystals that may trap non-native conformations; NMR uniquely captures solution dynamics and is ideal for small, flexible proteins but struggles with sizes above 50 kDa; cryo-EM is transformative for large, dynamic complexes in near-native states without crystallization, though it demands high sample purity and can suffer from preferred orientations. These techniques are often integrated (for instance, using NMR or SAXS to validate cryo-EM models) to provide a more complete structural picture.

Recent advances in time-resolved methods have enabled visualization of transient protein folding intermediates, bridging static structure with dynamics.
Time-resolved serial femtosecond crystallography (TR-SFX) at X-ray free-electron lasers captures snapshots along reaction pathways by initiating the reaction with triggers such as temperature jumps, as demonstrated in resolving intermediates of a photoreceptor at sub-microsecond timescales. Similarly, time-resolved cryo-EM, using microfluidic mixing devices, has imaged proteins on millisecond scales, revealing compaction and secondary structure formation. These developments have accelerated since 2020.
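One common way the radius of gyration mentioned above is extracted from SAXS data is the Guinier approximation, ln I(q) ≈ ln I(0) − q²R_g²/3, which holds at small scattering angles (a frequently used rule of thumb is q·R_g below about 1.3). The sketch below fits that line by ordinary least squares; the data are synthetic and noise-free, generated from an assumed R_g of 20 Å, and the function name `guinier_fit` is an illustrative choice rather than part of any SAXS package.

```python
import math


def guinier_fit(q_values, intensities):
    """Fit ln I(q) = ln I(0) - (Rg^2 / 3) q^2 by least squares.

    Returns (Rg, I0). Units of Rg are the inverse of the units of q.
    """
    x = [q * q for q in q_values]
    y = [math.log(i) for i in intensities]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope >= 0:
        raise ValueError("Guinier slope must be negative")
    return math.sqrt(-3.0 * slope), math.exp(intercept)


# Synthetic, noise-free data (assumed values, not a measurement).
TRUE_RG = 20.0    # angstroms
TRUE_I0 = 1000.0  # arbitrary units
q = [0.010 + 0.005 * k for k in range(11)]  # 0.010 to 0.060 1/angstrom
intensity = [TRUE_I0 * math.exp(-(qi * TRUE_RG) ** 2 / 3.0) for qi in q]

rg, i0 = guinier_fit(q, intensity)
print(f"Rg = {rg:.2f} angstrom, I(0) = {i0:.1f}, max q*Rg = {q[-1] * rg:.2f}")
```

Real data contain noise and low-angle artifacts such as aggregation, so in practice the fitting range is chosen carefully and the result is cross-checked against other analyses.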

Protein structure resources

Databases

The Protein Data Bank (PDB) serves as the primary global repository for experimentally determined three-dimensional structures of proteins, nucleic acids, and complex assemblies. Established in 1971 at Brookhaven National Laboratory under the leadership of Walter Hamilton, it began with just seven structures and has since grown into a foundational resource for structural biology. The PDB adopts the macromolecular Crystallographic Information File (mmCIF) as its standard format, which supports detailed annotations for atomic coordinates, experimental metadata, and validation reports, enabling interoperability with various software tools. As of 2025, the archive contains over 244,000 entries, reflecting annual releases of around 12,000 to 14,000 structures in recent years.

Complementing the PDB are specialized databases that archive data from specific experimental techniques. The Electron Microscopy Data Bank (EMDB), established in 2002, stores three-dimensional density maps derived from electron microscopy reconstructions, including high-resolution cryo-EM volumes of macromolecular complexes and subcellular structures. Similarly, the Biological Magnetic Resonance Bank (BMRB) collects, annotates, and disseminates spectral and quantitative data from nuclear magnetic resonance (NMR) spectroscopy of biological macromolecules, such as chemical shifts and relaxation parameters for proteins and nucleic acids.

To ensure data reliability, deposited structures undergo rigorous validation using specialized tools. MolProbity, for instance, performs all-atom contact analysis to identify steric clashes, Ramachandran outliers, and side-chain rotamer errors, providing clashscores and percentile rankings for quality assessment. Other validation programs offer comprehensive checks on geometry, hydrogen bonding, and packing density, aiding in the refinement of models before deposition. These tools are integrated into the deposition pipelines of the PDB and its partners, promoting high standards across the archive.

Access to these databases is facilitated through user-friendly interfaces, programmatic APIs, and visualization software.
The RCSB PDB provides RESTful web services and APIs for querying entries by sequence, structure, or experimental method, enabling automated data retrieval for large-scale analyses. Popular visualization tools include PyMOL, an open-source system for rendering atomic models with ray-tracing capabilities, and UCSF Chimera, which supports interactive analysis of structures alongside density maps and trajectories.

The growth of the PDB has accelerated with advances in experimental techniques, yet computational predictions like those from AlphaFold have influenced trends by providing hypotheses that guide and validate new depositions without supplanting experimental efforts. The AlphaFold Protein Structure Database (AFDB), released by EMBL-EBI, complements experimental resources by offering predicted structures for over 200 million proteins from various organisms, aiding in hypothesis generation and filling gaps in experimental data. Despite this expansion, challenges persist, including incomplete coverage of certain protein classes; for example, membrane proteins remain underrepresented due to difficulties in crystallization and stability, comprising less than 5% of PDB entries.
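Programmatic access of the kind described above can be sketched with the Python standard library alone. The two URL patterns below (the RCSB Data API entry endpoint and the mmCIF file download location) reflect the public RCSB services as commonly documented, but they should be treated as assumptions and checked against the current RCSB documentation before use; the helper names are hypothetical, and nothing is assumed here about the fields of the returned JSON.

```python
import json
import urllib.request

# Assumed URL patterns; verify against current RCSB PDB documentation.
ENTRY_API = "https://data.rcsb.org/rest/v1/core/entry/{pdb_id}"
MMCIF_FILE = "https://files.rcsb.org/download/{pdb_id}.cif"


def normalize_pdb_id(pdb_id):
    """Validate and upper-case a classic four-character PDB identifier."""
    cleaned = pdb_id.strip().upper()
    if len(cleaned) != 4 or not cleaned.isalnum():
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return cleaned


def entry_url(pdb_id):
    """URL of the JSON metadata record for one entry."""
    return ENTRY_API.format(pdb_id=normalize_pdb_id(pdb_id))


def mmcif_url(pdb_id):
    """URL of the mmCIF coordinate file for one entry."""
    return MMCIF_FILE.format(pdb_id=normalize_pdb_id(pdb_id))


def fetch_entry_metadata(pdb_id, timeout=30):
    """Download and parse entry metadata (requires network access)."""
    with urllib.request.urlopen(entry_url(pdb_id), timeout=timeout) as resp:
        return json.load(resp)


print(entry_url("1mbn"))
print(mmcif_url("1mbn"))
# Example (network required): metadata = fetch_entry_metadata("1mbn")
```

Keeping URL construction separate from the network call makes the identifier handling easy to test offline and leaves the download step as a thin wrapper.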

Structural classifications

Structural classifications of proteins organize known three-dimensional structures into hierarchical schemes based on similarities in folding patterns and evolutionary relationships, facilitating the understanding of protein architecture across diverse biological contexts. These systems, such as SCOP and CATH, provide frameworks for grouping protein domains or entire proteins, enabling researchers to identify common structural motifs that often correlate with shared functions or ancestry. By categorizing structures at multiple levels, from broad secondary structure composition to specific evolutionary lineages, these classifications reveal patterns in protein evolution and aid in annotating uncharacterized proteins.

The Structural Classification of Proteins (SCOP) database employs a manually curated hierarchy to classify protein domains according to their structural and evolutionary relationships. At the highest level, proteins are divided into classes based on secondary structure content, such as all-alpha proteins (dominated by alpha-helices, exemplified by globins like myoglobin), all-beta proteins (composed mainly of beta-sheets, as seen in immunoglobulin domains), alpha/beta proteins (alternating alpha-helices and beta-strands, like the Rossmann fold in dehydrogenases), and alpha+beta proteins (segregated alpha and beta regions). Subsequent levels include fold (describing the overall topology without implying homology), superfamily (groups sharing a common evolutionary origin with low sequence similarity but structural conservation), and family (closely related proteins with high sequence identity). This four-tiered structure (class, fold, superfamily, family) extends to protein and species levels for finer granularity, encompassing over 100,000 domains in recent releases. SCOP's manual curation, involving expert visual inspection of structures alongside sequence and functional data, ensures high accuracy in delineating evolutionary links.
In contrast, the Class, Architecture, Topology, and Homologous superfamily (CATH) database focuses exclusively on protein domains and uses a semi-automated approach to generate its hierarchy. The class level mirrors SCOP's, grouping by secondary structure predominance (e.g., mainly alpha, mainly beta, alpha-beta), while architecture describes the gross orientation of secondary elements without connectivity details, such as the barrel or sandwich arrangements in beta proteins. Topology (or fold family) specifies the connectivity and packing of these elements, and the homologous superfamily level clusters domains with evidence of shared ancestry, often supported by sequence or structural alignments. CATH classifies hundreds of thousands of domains, emphasizing domain-level granularity over whole proteins. Unlike SCOP's predominantly manual process, CATH integrates automated clustering algorithms with human oversight, allowing for scalable updates and reducing subjectivity in topology assignments.

Key differences between SCOP and CATH arise from their methodologies and scopes: SCOP prioritizes evolutionary inference through manual integration of structural, sequence, and functional evidence across entire proteins, resulting in a more conservative classification, whereas CATH's domain-centric, semi-automated pipeline enables broader coverage and faster incorporation of new structures, though it may introduce minor discrepancies in superfamily assignments due to algorithmic thresholds. Both systems serve complementary roles, with SCOP favored for detailed evolutionary studies and CATH for high-throughput analysis.

These classifications underpin applications in evolutionary inference, where superfamily groupings highlight descent from common ancestors despite sequence divergence, as seen in folds shared across enzymes from bacteria to eukaryotes.
They also enable function prediction by leveraging structural similarity; for instance, assigning a domain to a known superfamily can suggest a catalytic role based on conserved active sites, improving annotation accuracy in structural genomics projects. Post-2020, CATH has significantly expanded by incorporating AI-predicted structures from AlphaFold, adding over 150 million domains from 21 model organisms to enhance coverage of understudied superfamilies and support variant interpretation in disease research. SCOP updates, through its extended version SCOPe, have similarly increased structural coverage to nearly all superfamilies, though with less emphasis on predicted models to maintain reliance on experimental data.
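The four CATH levels described above are conventionally written as a dotted numeric code, one number per level (class, architecture, topology, homologous superfamily). The sketch below shows one way such codes could be represented and compared; the class and function names are hypothetical, and the specific codes are used only as examples of the dotted format rather than as statements about particular superfamilies.

```python
from dataclasses import dataclass

LEVELS = ("class", "architecture", "topology", "homologous superfamily")


@dataclass(frozen=True)
class CathCode:
    """A four-level CATH code such as '3.40.50.300' (illustrative type)."""
    numbers: tuple

    @classmethod
    def parse(cls, text):
        parts = text.strip().split(".")
        if len(parts) != len(LEVELS) or not all(p.isdigit() for p in parts):
            raise ValueError(f"expected four dot-separated integers: {text!r}")
        return cls(tuple(int(p) for p in parts))

    def prefix(self, depth):
        """Code truncated to the first `depth` levels, e.g. '3.40'."""
        return ".".join(str(n) for n in self.numbers[:depth])


def deepest_shared_level(a, b):
    """Name of the most specific level two codes share, or None."""
    shared = None
    for depth, name in enumerate(LEVELS, start=1):
        if a.numbers[:depth] != b.numbers[:depth]:
            break
        shared = name
    return shared


first = CathCode.parse("3.40.50.300")
second = CathCode.parse("3.40.50.720")
third = CathCode.parse("1.10.490.10")

print(deepest_shared_level(first, second))  # same C, A and T numbers
print(deepest_shared_level(first, third))   # different class
```

Comparing code prefixes in this way mirrors how the hierarchy is read: two domains that agree down to the topology level share a fold, while agreement at the final level additionally implies evidence of common ancestry.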

Computational prediction of protein structure

Template-based methods

Template-based methods, also known as comparative or homology modeling, predict the three-dimensional structure of a target protein by leveraging structural templates from evolutionarily related proteins with known structures. This approach assumes that homologous proteins share similar folds, allowing the transfer of structural information from templates to the target sequence. The method is particularly effective when the target shares significant sequence similarity with existing structures in databases like the Protein Data Bank (PDB).

The homology modeling pipeline typically begins with template selection, where the target sequence is searched against structural databases to identify suitable templates. Tools such as PSI-BLAST, which uses position-specific scoring matrices to detect distant homologs, or HHpred, which employs profile hidden Markov model (HMM) comparisons for sensitive homology detection, are commonly used for this step. PSI-BLAST iteratively refines searches to capture weak similarities, while HHpred excels in aligning query and template profiles to identify remote homologs with low sequence identity. Once templates are selected, sequence alignment is performed to map the target residues onto the template backbone, often using tools like Clustal Omega or the alignment modules in modeling software. Model building follows, where atomic coordinates are derived by copying conserved regions from the template and modeling variable loops and side chains, typically via satisfaction of spatial restraints derived from the alignment and statistical potentials. Refinement optimizes the model through energy minimization or molecular dynamics to resolve clashes and improve stereochemistry.

For cases with low sequence similarity (<30%), threading methods extend homology modeling by aligning the target sequence to fold templates without relying on high sequence identity.
Threading evaluates the compatibility of the target sequence with template structures using energy-based potentials that consider residue burial, secondary structure propensity, and pairwise interactions, and it ranks the resulting alignments by score. Seminal work demonstrated that threading can recognize folds by optimizing sequence-structure fitness, even for proteins with sequence identities as low as 10-20%.

The accuracy of template-based models correlates strongly with the sequence identity between target and template. Models built on templates with more than 30% identity typically achieve backbone root-mean-square deviation (RMSD) values on the order of 1 to 2 Å from the native structure, enabling reliable prediction of the core fold. Below 30% identity accuracy declines, with RMSD often exceeding 3 Å because of alignment errors and loop inaccuracies, as established in analyses of homologous protein pairs.

Widely adopted tools for homology modeling include MODELLER, which implements restraint-based modeling to generate and refine structures from alignments, and SWISS-MODEL, a fully automated web server that integrates template search, alignment, and quality assessment for high-throughput predictions. These tools have been benchmarked in community experiments like CASP, where they perform well for targets with detectable templates.

A key limitation of template-based methods is their dependence on available templates: they perform poorly for proteins with novel folds that are not represented in structural databases, because no suitable homolog exists. In contrast to de novo methods, which build structures from physicochemical principles without templates, homology modeling requires evolutionary relatedness to succeed.

Models are validated by comparing the predicted structure with the native one, when available, using RMSD over Cα atoms, where values below 2 Å indicate high fidelity, and by stereochemical checks such as Ramachandran plot analysis, which confirms that backbone dihedral angles fall within allowed regions. Additional metrics such as global distance test (GDT) scores and energy profiles further assess overall quality.
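
The Cα RMSD used in validation can be illustrated with a short sketch. The Python example below assumes NumPy and two arrays of Cα coordinates that are already matched residue by residue; the function name `kabsch_rmsd` is illustrative and does not come from any modeling package. It superimposes the model on the reference with the Kabsch algorithm before measuring the deviation, since an RMSD computed without superposition would also reflect arbitrary differences in position and orientation.

```python
import numpy as np


def kabsch_rmsd(model, reference):
    """RMSD between two (N, 3) coordinate sets after optimal superposition.

    Assumes row i of `model` and row i of `reference` describe the same
    residue (for example, matched C-alpha atoms), with coordinates in angstroms.
    """
    P = np.asarray(model, dtype=float)
    Q = np.asarray(reference, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (N, 3)")

    # Remove translation by centring both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)

    # The SVD of the covariance matrix yields the optimal rotation (Kabsch).
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)

    # Flip one axis if needed so the result is a proper rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Rotate the model onto the reference and measure the residual deviation.
    P_rot = Pc @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Qc) ** 2, axis=1))))


if __name__ == "__main__":
    # Toy data: a random "structure" and a rotated, shifted, slightly noisy copy.
    rng = np.random.default_rng(1)
    native = rng.normal(scale=5.0, size=(50, 3))
    angle = np.pi / 3
    rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                    [np.sin(angle), np.cos(angle), 0.0],
                    [0.0, 0.0, 1.0]])
    model = native @ rot.T + 10.0 + rng.normal(scale=0.5, size=(50, 3))
    print(f"C-alpha RMSD after superposition: {kabsch_rmsd(model, native):.2f} A")
```

In practice the coordinates would be read from structure files and the residue correspondence taken from a sequence or structure alignment; both steps are omitted here.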

De novo and AI-driven prediction

De novo protein structure prediction, also known as ab initio prediction, aims to determine a protein's three-dimensional structure solely from its amino acid sequence, without relying on known homologs or templates. These methods typically assemble short structural fragments selected from sequence patterns and refine them through energy minimization to identify low-energy conformations. A prominent example is Rosetta, which uses fragment assembly followed by Monte Carlo sampling and energy-based optimization to generate plausible folds. Rosetta employs empirical potential functions, such as Lennard-Jones terms for van der Waals interactions and statistical potentials derived from known structures, to score and minimize the energy of assembled models.

Physics-based approaches complement fragment assembly by simulating the folding process with molecular dynamics (MD), which models atomic interactions using classical force fields to trace folding trajectories. These simulations capture thermodynamic principles such as entropy-driven collapse and stabilization by hydrogen bonding, providing insight into the folding pathways of small proteins. For instance, all-atom MD has folded peptides and miniproteins in microsecond-scale simulations, revealing funnel-like energy landscapes that guide the chain toward its native state. Full-scale MD for larger proteins remains computationally intensive because of the timescales involved, and it often requires enhanced sampling techniques such as replica-exchange MD.

The advent of artificial intelligence has revolutionized de novo prediction, with deep learning models using multiple sequence alignments (MSAs) to infer evolutionary constraints and structural propensities. AlphaFold 2, developed by DeepMind, marked a breakthrough in the 2020 CASP14 competition, achieving unprecedented accuracy with attention-based neural networks that reason over residue-residue relationships and predict three-dimensional coordinates directly from sequence data. 
This end-to-end approach bypasses traditional intermediate steps such as fragment assembly and threading; the network is instead trained on large structural databases and outputs atomic models with per-residue confidence scores (pLDDT). Similarly, RoseTTAFold from the Baker lab introduced a three-track neural network architecture that processes sequence, 2D distance maps, and 3D coordinates in parallel, enabling rapid predictions comparable to AlphaFold 2 for single chains and complexes. Building on this, DeepMind released AlphaFold 3 in May 2024; it employs a diffusion-based architecture to predict joint structures of biomolecular complexes, including interactions with DNA, RNA, ligands, and modifications, substantially advancing applications in drug discovery and biology. Demis Hassabis and John Jumper shared half of the 2024 Nobel Prize in Chemistry for protein structure prediction with AlphaFold; the other half went to David Baker for computational protein design.

Accuracy benchmarks highlight the impact of these AI methods. In CASP14, AlphaFold 2 attained a median Global Distance Test-Total Score (GDT-TS) of 92.4, surpassing human expert levels for many targets and enabling near-atomic accuracy (RMSD < 1 Å) for proteins up to 400 residues. These tools have expanded coverage of the "dark proteome" (regions lacking experimental structures), with AlphaFold predicting high-confidence models (pLDDT > 90) for about 37% of residues in structurally uncharacterized domains. In CASP16 (2024), top AI-driven methods, including variants of AlphaFold 3, achieved high median GDT-TS scores (around 85-95 for monomers and multimers) and improved prediction, further demonstrating near-solved status for many protein structure challenges. Hybrid approaches occasionally incorporate sparse template information from databases to refine novel folds, but AI-driven methods dominate for orphan proteins. The AlphaFold Protein Structure Database was updated in 2024 to include predictions for over 200 million proteins across eukaryotes, bacteria, and archaea, enhancing global accessibility.

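
The GDT-TS and pLDDT figures quoted above can be made concrete with a small sketch. The Python example below assumes NumPy and invented toy values; the function names `gdt_ts` and `fraction_high_confidence` are illustrative. GDT-TS is computed here as the mean, over cutoffs of 1, 2, 4, and 8 Å, of the percentage of residues whose Cα atom lies within the cutoff of its native position. As a simplifying assumption, the distances are taken from a single fixed superposition, whereas the full procedure used in assessments searches for the most favorable superposition at each cutoff, so this sketch can underestimate the official score.

```python
import numpy as np


def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS (0-100) from per-residue C-alpha distances in angstroms.

    Assumes the model is already superimposed on the reference structure;
    the full procedure instead optimizes the superposition for each cutoff.
    """
    d = np.asarray(distances, dtype=float)
    if d.size == 0:
        raise ValueError("need at least one residue distance")
    # Fraction of residues within each distance cutoff, then averaged.
    fractions = [np.mean(d <= c) for c in cutoffs]
    return 100.0 * float(np.mean(fractions))


def fraction_high_confidence(plddt, threshold=90.0):
    """Fraction of residues whose pLDDT score exceeds the threshold."""
    p = np.asarray(plddt, dtype=float)
    if p.size == 0:
        raise ValueError("need at least one pLDDT value")
    return float(np.mean(p > threshold))


# Toy example with four residues (values are invented for illustration).
print(gdt_ts([0.5, 1.5, 3.0, 10.0]))                       # 56.25
print(fraction_high_confidence([95.0, 91.2, 88.0, 60.5]))  # 0.5
```

In the toy example, one residue lies within 1 Å, two within 2 Å, and three within both 4 Å and 8 Å, giving (25 + 50 + 75 + 75) / 4 = 56.25.
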
Looking ahead, integrating AI predictions with molecular dynamics simulations promises more realistic models that account for conformational flexibility rather than a single static structure. Emerging frameworks use AlphaFold-like outputs as starting points for simulations that generate Boltzmann-distributed ensembles, aiding drug discovery for flexible targets such as enzymes. This synergy could address current limitations in capturing transient states, and ongoing efforts focus on generative models that sample diverse conformations efficiently.
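
What "Boltzmann-distributed" means for an ensemble can be sketched in a few lines. The Python example below assumes NumPy, energies in kcal/mol, and a hypothetical helper named `boltzmann_weights`; it is not part of any prediction framework. Each conformation i receives the weight exp(-E_i / kT) divided by the sum of those factors over the ensemble, so low-energy states dominate while higher-energy, transient states keep a small but nonzero population.

```python
import numpy as np

# Boltzmann constant in kcal/(mol*K); energies are assumed to be in kcal/mol.
K_B = 0.0019872041


def boltzmann_weights(energies, temperature=300.0):
    """Normalized Boltzmann populations for a set of conformer energies."""
    e = np.asarray(energies, dtype=float)
    if e.size == 0:
        raise ValueError("need at least one energy")
    if temperature <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    beta = 1.0 / (K_B * temperature)
    # Shifting by the minimum energy avoids overflow and leaves the ratios unchanged.
    w = np.exp(-beta * (e - e.min()))
    return w / w.sum()


# Toy ensemble: three conformations with invented relative energies.
populations = boltzmann_weights([0.0, 0.6, 2.0], temperature=300.0)
for energy, p in zip([0.0, 0.6, 2.0], populations):
    print(f"E = {energy:.1f} kcal/mol -> population {p:.3f}")
```

Weights of this kind can be used to average an observable over sampled conformations, provided the energies come from a consistent model; comparing energies across different force fields or scoring functions would not be meaningful.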