Protein design

Protein design is the interdisciplinary field of engineering proteins with novel three-dimensional structures and functions, typically by computationally determining sequences that fold into predefined conformations, often from scratch in a process known as de novo design. This approach inverts the classical protein folding problem: instead of predicting a structure from a sequence, the goal is to invent sequences for targeted structures, drawing on principles of biophysical stability, energy minimization, and evolutionary insight. Protein design enables the creation of proteins that natural evolution has not produced, with applications ranging from therapeutics and vaccines to nanomaterials, enzymes, and biosensors.

The field originated in the late 1980s with pioneering efforts to design simple helical bundles, including the first water-soluble, cooperatively folded four-helix bundle protein (α4) in 1988, which demonstrated that proteins could be rationally engineered using physicochemical principles without natural templates. Early advances focused on metalloproteins and basic motifs, such as the 1990 design of a zinc-binding protein, but computational limitations restricted complexity until the development of fragment-based methods in the 2000s. A landmark achievement came in 2003 with Top7, the first fully de novo protein featuring a novel fold verified by X-ray crystallography at 2.5 Å resolution, marking the transition to designing unprecedented topologies.

Key methods in protein design combine computational modeling with experimental validation, including energy-based optimization via software like Rosetta, which uses rotamer libraries and Monte Carlo sampling to explore sequence-structure space. Recent breakthroughs integrate deep learning, such as the 2023 RFdiffusion model, a diffusion-based generative tool that produces diverse structures with up to 50% experimental success rates, enabling symmetric assemblies and functional motif scaffolding.
The 2024 Nobel Prize in Chemistry highlighted these innovations: it was awarded to David Baker for computational protein design, building on de novo proteins such as Top7, and to Demis Hassabis and John Jumper for AlphaFold2, which in 2020 achieved near-atomic accuracy in structure prediction and has accelerated design cycles when combined with sequence optimization tools like ProteinMPNN. These AI-driven advances have boosted design fidelity, with success rates exceeding 10-20% for complex binders and >50% for stabilized scaffolds in recent studies.

Protein design has transformative applications, including high-affinity binders for therapeutics, such as nanomolar inhibitors of cancer checkpoint proteins, and self-assembling nanomaterials for drug delivery or vaccines. It also facilitates custom enzymes to degrade plastics or pollutants, and it yields programmable switches and sensors for cellular engineering, such as auxin-responsive biosensors. Ongoing challenges include enhancing functional diversity, stability, and scalability, but with AI integration the field promises modular synthetic proteins for precision medicine and sustainable technologies, and work reported as of 2025 continues to expand these capabilities.

Introduction

Definition and principles

Protein design is the computational engineering of amino acid sequences to fold into specified three-dimensional structures or perform targeted functions. It is the inverse of natural protein folding, in which sequences determine structures. Unlike forward folding, which predicts structures from given sequences, protein design starts with a desired backbone or functional motif and generates compatible sequences that minimize free energy while achieving stability and specificity. This approach leverages biophysical principles such as hydrophobic packing, hydrogen bonding, and electrostatic interactions to ensure the designed proteins adopt the intended conformation.

Two strategies are commonly distinguished. Rational design modifies existing natural proteins by optimizing sequences around known scaffolds to enhance properties such as binding affinity; it perturbs sequences incrementally using biophysical models, often guided by evolutionary data or structural databases. De novo design creates entirely novel proteins without relying on natural templates, enumerating unprecedented folds with geometric constraints and energy minimization to explore beyond natural diversity.

The basic workflow involves specifying a target structure, optimizing sequences via scoring functions that evaluate energetic compatibility, and validating designs through simulations or experimental assays.

Protein design's importance lies in its ability to produce custom proteins that surpass natural limitations, enabling applications such as novel therapeutics, biocatalysts, and biomaterials. By transcending evolutionary constraints, it facilitates the creation of proteins with tailored properties, like high-affinity binders or symmetric assemblies.
As of 2025, the field has shifted from purely physics-based methods to hybrid AI-physics approaches, exemplified by AlphaFold's accurate structure prediction, which enables inverse design pipelines, and by RFdiffusion's generative modeling of protein backbones.

Historical overview

The foundations of protein design were laid in the mid-20th century, building on insights into protein folding and structure. In 1973, Christian Anfinsen proposed the thermodynamic hypothesis, often referred to as Anfinsen's dogma, stating that the native structure of a protein is determined by its sequence under physiological conditions, because the sequence encodes the information needed to minimize free energy and reach the lowest-energy conformation. This principle, derived from experiments on ribonuclease A refolding, provided the theoretical basis for designing sequences that could fold into predetermined structures.

Early efforts in the 1970s and 1980s focused on manual, rational design of simple motifs, such as alpha-helical bundles, to test these ideas. A landmark example was William DeGrado's 1988 design of a four-helix bundle protein, synthesized from peptides that self-assembled into a stable, helical structure matching the intended model, demonstrating that de novo sequences could mimic natural folds.

The 1990s marked the transition to computational methods, enabling systematic exploration of sequence space. David Baker's lab developed the Rosetta software suite starting in the mid-1990s, initially for ab initio structure prediction by assembling fragments from known protein structures using Monte Carlo sampling and energy minimization. A key algorithmic advance was the dead-end elimination (DEE) theorem introduced in 1992, which efficiently prunes suboptimal side-chain rotamers during optimization, drastically reducing the combinatorial search space for protein design. Building on this, John Desjarlais and Tracy Handel applied DEE in 1995 to redesign hydrophobic cores of proteins like thioredoxin, generating sequences that maintained stability and structure comparable to wild-type, validating computational core repacking as a viable design strategy.

In the 2000s, computational design achieved novel folds and functions, shifting from motif mimicry to creation.
Brian Kuhlman and colleagues in Baker's lab reported in 2003 the design of Top7, the first protein with a novel fold not observed in nature: a 93-residue sequence that folded into a mixed alpha-beta structure with atomic accuracy (RMSD 1.2 Å to the model), confirmed by X-ray crystallography. Progress accelerated with functional designs; in 2008, the same group engineered de novo enzymes, achieving rate accelerations up to 10^6-fold through active-site optimization in computationally generated scaffolds. These successes highlighted the potential for designing proteins with tailored catalytic properties.

The 2010s saw expansions to complex architectures, particularly symmetric assemblies, while exposing challenges in certain classes like membrane proteins. Baker's lab designed self-assembling protein cages, such as a 120-subunit icosahedral assembly with high thermal stability (melting temperature >100°C). Efforts to design membrane proteins lagged because of difficulties in modeling membrane environments and conformational dynamics, with early successes limited to small helical bundles rather than full transporters.

The 2020s ushered in an AI-driven revolution, leveraging deep learning for unprecedented generative capabilities. DeepMind's AlphaFold2, released in 2020, achieved near-experimental accuracy in structure prediction (median GDT-TS 92.4 on CASP14 targets), which supports design by allowing candidate sequences to be checked against their target structures. The Baker lab's RoseTTAFold in 2021 extended this with a three-track neural network for joint sequence-structure co-design, enabling rapid generation of binder proteins. Generative models proliferated, including RFdiffusion (2023), a diffusion-based method that generates novel backbones conditioned on motifs, yielding designs with 40% experimental success rates for diverse folds.
Concurrently, the hallucination paradigm, refined in 2023, optimized random sequences against structure prediction losses, producing luciferases and repeat proteins with novel topologies validated by cryo-EM. By 2025, these AI tools continued to advance scalable protein design methods, such as relaxed sequence optimization, enabling the creation of larger proteins and high-affinity interactions with structural validation. Developments reported as of 2025 extend AI-powered design to further application areas.

Fundamentals of Protein Structure

Hierarchical structure levels

Proteins exhibit a hierarchy of structure that serves as the foundational framework for computational and rational design, allowing engineers to specify target architectures at multiple scales without preconceived sequence biases. This hierarchy comprises four levels (primary, secondary, tertiary, and quaternary), each building on the previous one to determine how the protein folds and interacts. Understanding these levels is essential for protein design, as it enables the independent manipulation of backbone geometries and subunit arrangements to achieve desired properties, such as enhanced enzymatic activity or novel binding affinities.

The primary structure is the linear sequence of amino acids linked by peptide bonds. It is the fundamental blueprint for all higher-order folding and the primary input variable in protein design. This sequence, determined experimentally by protein sequencing, dictates the chemical properties and potential interactions that drive subsequent structural assembly, as exemplified by Frederick Sanger's sequencing of insulin, which revealed the precise order of its 51 amino acids across two chains connected by disulfide bonds. In design contexts, specifying or optimizing the primary structure allows targeted modifications, such as introducing cysteines for disulfide bridging or polar residues for solubility, while ensuring compatibility with intended folds.

Secondary structure encompasses local, repeating patterns stabilized primarily by hydrogen bonds between backbone atoms, including alpha-helices, beta-sheets, and the connecting loops or turns that contribute to overall rigidity and functional motifs. Alpha-helices feature a right-handed coil with 3.6 residues per turn, while beta-sheets form pleated arrangements of hydrogen-bonded strands, either parallel or antiparallel, as first proposed by Linus Pauling and Robert Corey based on stereochemical constraints.
These elements are critical for design because they provide modular scaffolds for stability; for instance, packing helices into bundles or sheets into barrels enhances thermal resilience, informing the selection of backbones that support catalytic sites or ligand-binding pockets without sequence-dependent biases.

Tertiary structure describes the global three-dimensional folding of a single polypeptide chain, achieved through long-range interactions such as hydrophobic collapse into a core, hydrogen bonds, electrostatic forces, and disulfide bridges that minimize free energy and yield a compact, functional conformation. Christian Anfinsen's experiments on ribonuclease A demonstrated that the native tertiary fold is thermodynamically determined by the primary sequence under physiological conditions, underscoring the principle that design targets must prioritize energetically favorable arrangements, like burying nonpolar residues to form stable cores. In design, tertiary specification involves defining fold architectures, such as all-alpha or mixed alpha-beta topologies, to encode specific functions, enabling the creation of proteins with novel topologies for therapeutic applications.

Quaternary structure arises when multiple polypeptide chains (subunits) assemble into a multi-subunit complex, stabilized by non-covalent interactions and sometimes covalent links, resulting in symmetric or asymmetric oligomers that amplify function. Max Perutz's crystallographic analysis of hemoglobin revealed its tetrameric arrangement of two alpha and two beta chains, with interfaces enabling cooperative oxygen binding, highlighting how quaternary design can introduce regulatory mechanisms. For protein engineers, targeting the quaternary level allows the construction of oligomeric assemblies, like symmetric cages or signaling complexes, by specifying subunit interfaces that promote assembly and enhance stability or specificity.
Visualization of these hierarchical levels is facilitated by resources like the Protein Data Bank (PDB), which archives experimentally determined structures, and software such as PyMOL, which renders atomic models to inspect folds, interfaces, and dynamics at angstrom-scale resolution. This capability is a prerequisite for design workflows, as it permits the abstraction of backbones from natural templates or ideal geometries, decoupling structure specification from evolutionary sequence constraints.

Sequence-to-structure mapping

The sequence-to-structure mapping refers to the biophysical process by which an amino acid sequence determines the three-dimensional structure of a protein through folding. This mapping is central to protein design, as designing novel proteins requires predicting how a proposed sequence will fold into a desired structure.

Levinthal's paradox highlights the computational intractability of this process: for a 100-residue protein with approximately three possible conformations per residue, the total number of possible conformations is on the order of $3^{100} \approx 5 \times 10^{47}$, so an exhaustive search would take far longer than the age of the universe even at very fast sampling rates. The paradox is resolved by the folding funnel concept, in which the energy landscape guides the protein toward the native state along a biased, downhill pathway rather than by random sampling, minimizing free energy and enabling folding on biologically relevant timescales.

Folding mechanisms underpin this mapping, as articulated by Anfinsen's thermodynamic hypothesis, which posits that the native structure is the global free-energy minimum determined solely by the sequence under physiological conditions. In vivo, molecular chaperones assist this process by preventing aggregation and facilitating proper folding pathways, particularly for larger proteins.

The vastness of sequence space further complicates the mapping: for a 100-residue protein there are $20^{100} \approx 10^{130}$ possible sequences, yet natural proteins represent only a minuscule fraction of that space. This sparsity underscores the evolutionary selection for sequences that reliably map to functional structures. The diversity at a sequence position can be quantified using Shannon entropy, $S = -\sum_i p_i \log p_i$, where $p_i$ is the probability of the $i$-th amino acid at that position; low entropy marks positions where specific residues are required for folding.

Advances in structure prediction have revolutionized understanding of sequence-to-structure mapping.
Prior to 2020, methods relied heavily on homology modeling, which aligned query sequences to known structures using templates like those in the Protein Data Bank, achieving moderate accuracy for homologous proteins but struggling with novel folds. After 2020, deep learning approaches such as AlphaFold dramatically improved predictions; AlphaFold2 achieved near-atomic accuracy across diverse structures, while AlphaFold3 extended this to multimers, ligands, and modifications, with median backbone RMSDs below 1 Å for many complexes.

In protein design, success rates on the inverse folding problem (finding sequences that fold to a target structure) have risen from below 10% in earlier work, which was limited by simplistic energy models and computational power, to 10–50% or higher in the 2020s using integrated physics- and machine learning-based methods. These improvements enable the generation of stable, functional proteins, bridging the gap between structure prediction and design.
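
The counting arguments and the entropy formula in this section can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, three states per residue is the same simplifying assumption used above, and entropy is reported in bits (log base 2), which is one common convention rather than one the text specifies.

```python
import math
from collections import Counter

def conformation_count(n_residues, states_per_residue=3):
    """Levinthal-style count of backbone conformations (assumes independent residues)."""
    return states_per_residue ** n_residues

def sequence_count(n_residues, alphabet_size=20):
    """Number of possible sequences over the 20 standard amino acids."""
    return alphabet_size ** n_residues

def shannon_entropy(column):
    """Shannon entropy S = -sum_i p_i log2 p_i of one alignment column, in bits."""
    counts = Counter(column)
    total = len(column)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(f"{conformation_count(100):.2e}")  # about 5.15e+47
print(f"{sequence_count(100):.2e}")      # about 1.27e+130
print(shannon_entropy("ACDE"))           # four equally likely residues: 2.0 bits
```

A fully conserved column such as `"AAAA"` gives zero entropy, while a column with all 20 amino acids equally represented reaches the maximum of log2(20), roughly 4.32 bits.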

Conformational flexibility and dynamics

Proteins are not static structures but exhibit conformational flexibility, which is essential for biological functions such as enzymatic catalysis and ligand binding. In protein design, accounting for this flexibility is crucial to prevent misfolding and enable function, as rigid designs may fail to mimic native behaviors.

Conformational flexibility takes several forms: side-chain rotamers, which allow discrete torsional adjustments for optimizing interactions and adapting to environments; backbone fluctuations, which permit local hinge-like movements and loop adjustments; and allostery, in which perturbations at one site propagate structural changes to distant regions and modulate activity. These arise from thermal motions and are influenced by amino acid composition, so designs need to balance rigidity for folding against flexibility for function.

Normal mode analysis provides a computational framework for modeling protein dynamics by identifying low-frequency vibrational modes that capture large-scale, collective motions such as domain shifts or helix rotations, which are relevant for predicting functional transitions in designed proteins. This approach, often based on elastic network models, efficiently approximates essential dynamics without exhaustive simulations, helping designers incorporate anticipated flexibility into target structures.

Ensemble views of proteins emphasize that conformations follow a Boltzmann distribution, in which states are populated according to their relative energies. Designs therefore need to stabilize desired ensembles rather than single structures to achieve robust function. Deep learning methods can generate such ensembles rapidly, ensuring compatibility across multiple states and avoiding entrapment in suboptimal conformations.
Challenges in incorporating conformational flexibility include the risk of over-stabilization, which can induce rigidity and impair adaptive functions, and underestimation of dynamics, which leads to sequences prone to misfolding or aggregation because alternative states were never explored. These issues highlight the need for multi-state optimization to smooth energy landscapes and promote funnel-like folding pathways.

Experimental validation of designed protein flexibility relies on techniques like nuclear magnetic resonance (NMR) spectroscopy, which resolves multistate structures, complemented by molecular dynamics (MD) simulations, which quantify motional amplitudes. For example, deep learning-designed dynamic proteins have shown conformational equilibria and interaction networks matching predictions, with NMR confirming atomic-level precision in flexible states comparable to those in native proteins.

Recent advances in 2025 integrate structure prediction with molecular simulation for flexible designs. One example is AlphaFold-Metainference, which uses AlphaFold-predicted distances as restraints in replica-exchange simulations to generate Boltzmann-consistent ensembles of disordered and partially structured proteins, improving agreement with experimental data. This approach enables efficient exploration of dynamic landscapes, facilitating the creation of proteins with tailored flexibility for applications in sensing and regulation.
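
The ensemble view can be made concrete by computing Boltzmann populations for a handful of conformational states. The sketch below is illustrative only: the state energies are made-up values in kcal/mol, and the function name and the default temperature of 298.15 K are assumptions, not taken from any specific design tool.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def boltzmann_populations(energies_kcal, temperature_k=298.15):
    """Return the equilibrium population of each state, p_i = exp(-E_i/kT) / Z."""
    beta = 1.0 / (K_B * temperature_k)
    e_min = min(energies_kcal)  # shift energies so the exponentials stay well-scaled
    weights = [math.exp(-beta * (e - e_min)) for e in energies_kcal]
    partition = sum(weights)
    return [w / partition for w in weights]

# Hypothetical three-state protein: target state, near-native alternative, misfolded state.
populations = boltzmann_populations([0.0, 1.0, 3.0])
for label, p in zip(["target", "alternative", "misfolded"], populations):
    print(f"{label}: {p:.3f}")
```

Under these assumed energies, a state only 1 kcal/mol above the target still holds a noticeable share of the population, which is why multi-state design scores the gap to competing states rather than the target energy alone.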

Design Principles and Challenges

Target structure specification

Target structure specification in protein design involves defining the desired three-dimensional backbone or scaffold as a starting point for subsequent sequence optimization, ensuring the geometry supports stability, novelty, and potential function. This step is crucial because the backbone dictates the overall topology, secondary structure elements, and spatial arrangement of residues, which in turn influence foldability and interactions. Designers typically generate or select scaffolds that avoid existing natural folds to enable de novo creation, while incorporating features like binding pockets or active sites for targeted applications.

Several methods exist for specifying target structures. Enumerative approaches systematically assemble idealized secondary structure elements, such as alpha-helices and beta-sheets, from a predefined library of building blocks to enumerate possible topologies exhaustively, as demonstrated in algorithms that generate diverse pocket geometries in NTF2-fold scaffolds. Fragment assembly, pioneered in the Rosetta software suite, stitches together short segments (typically 3-9 residues) derived from known protein structures in the Protein Data Bank (PDB) to build novel backbones, reducing the search space while maintaining physical realism; this method was key to early de novo designs, which iteratively sampled conformations via Monte Carlo optimization. More recently, particularly after 2020, generative models based on diffusion processes have emerged: noise is added to protein coordinates and then removed by a learned denoiser to produce diverse scaffolds conditioned on constraints like symmetry or motifs, enabling rapid generation of unprecedented folds.

Key criteria guide the selection of target structures.
For stability, backbones are evaluated using metrics like the Template Modeling (TM) score, where values above 0.5 indicate a high likelihood of adopting the intended fold upon sequence realization, as this threshold corresponds to sharing the same overall topology. Novelty is assessed by ensuring no close homologs exist in the PDB, often via structural alignment tools such as TM-align, to confirm the design explores untapped sequence-structure space. Functionality requires precise geometry for features such as active sites, where distances and angles must align with catalytic or binding requirements, often verified through simulations.

Prominent tools facilitate backbone generation and functionalization. RFdiffusion, a diffusion model fine-tuned from RoseTTAFold and released in 2023, generates high-quality monomer and multimer backbones, unconditionally or conditioned on partial motifs, achieving experimental success rates over 20% for fold validation in blind tests. Motif grafting integrates functional elements, such as active sites or epitopes, into these scaffolds using protocols that optimize loop connections and interface packing to preserve the motif geometry without steric disruption.

Challenges in this specification phase include ensuring the backbone is foldable with natural amino acids, as many generated structures may lack compatible sequences because of strained geometries or unfavorable energetics. Avoiding steric clashes between non-local residues is another hurdle, requiring iterative refinement to eliminate overlaps that could destabilize the fold.

Seminal examples illustrate these principles. The Top7 protein, designed in 2003, used fragment assembly in Rosetta to specify a novel α/β fold with no natural homologs (TM-score <0.3 to closest PDB entries), resulting in an experimentally validated structure with 1.2 Å RMSD to the computational model.
More recently, RFdiffusion-based design of binders in 2023 produced de novo scaffolds that bound diverse targets such as IL-7Rα and PD-L1 with nanomolar affinities, incorporating specified geometric constraints for interfaces while confirming novelty through PDB searches.
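
The steric-clash criterion mentioned above can be illustrated with a crude check on Cα coordinates. This is a sketch under stated assumptions: the 3.0 Å distance cutoff, the minimum sequence separation of three residues, and the function name are all illustrative choices, and real pipelines check all heavy atoms rather than Cα atoms alone.

```python
import math

def find_clashes(ca_coords, min_distance=3.0, min_separation=3):
    """Return (i, j, distance) for non-local residue pairs whose C-alpha atoms overlap.

    ca_coords: list of (x, y, z) tuples in angstroms, one per residue.
    min_distance and min_separation are assumed, illustrative thresholds.
    """
    clashes = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            d = math.dist(ca_coords[i], ca_coords[j])
            if d < min_distance:
                clashes.append((i, j, d))
    return clashes

# An extended chain with 3.8 A spacing has no clashes.
extended = [(3.8 * i, 0.0, 0.0) for i in range(6)]
print(find_clashes(extended))  # []

# Folding the last residue back onto the first introduces one clash.
folded_back = extended + [(0.5, 0.0, 0.0)]
print([(i, j) for i, j, _ in find_clashes(folded_back)])  # [(0, 6)]
```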

Energy functions and scoring

Energy functions in protein design serve as mathematical models to assess the compatibility of an amino acid sequence with a target structure by estimating the free energy of the system. These functions typically approximate the Gibbs free energy ΔG, guiding the selection of sequences that minimize energetic frustration and stabilize the desired fold.

Energy functions are broadly classified into physics-based and knowledge-based categories. Physics-based functions derive terms from fundamental physical principles, such as atomic interactions, while knowledge-based functions rely on statistical potentials extracted from structural databases like the Protein Data Bank. The Rosetta energy function exemplifies a hybrid approach, combining physics-based terms for short-range interactions with knowledge-based statistical potentials for conformational preferences.

Key components of such energy functions include van der Waals interactions, modeled via the Lennard-Jones potential to capture steric repulsion and attraction; electrostatics, computed using Coulomb's law with a distance-dependent dielectric; solvation effects, often via generalized Born/surface area (GB/SA) models to account for polar and nonpolar desolvation; hydrogen bonding, with orientation-dependent terms for donor-acceptor geometry; and torsion potentials, enforcing backbone Ramachandran and side-chain rotamer preferences. The total energy is expressed as a weighted sum:

$$\Delta E_\text{total} = \sum_i w_i E_i(\theta, \text{aa})$$

where $w_i$ are empirical weights, $E_i$ are individual terms, $\theta$ denotes conformational variables like dihedral angles, and $\text{aa}$ represents amino acid identities.

Statistical potentials in knowledge-based components use reference states derived from alignments, such as Boltzmann-distributed frequencies of residue pairs or backbone angles relative to an unfolded ensemble, to define favorable interactions. These reference states enable the calculation of effective energies that correlate with observed native structures.
Despite their utility, energy functions face challenges, including inaccuracies in non-native contexts, where they may overestimate the stability gained from hydrophobic burial or underpenalize the desolvation of polar groups, leading to suboptimal sequence rankings. Additionally, most functions omit explicit conformational entropy terms to maintain computational tractability, which hinders accurate modeling of backbone and side-chain flexibility. In optimization, partial derivatives such as $\partial E / \partial \theta$ with respect to torsion angles are computed so that the energy can be minimized efficiently.

Validation of energy functions often involves correlating predicted energy changes with experimental ΔΔG values from mutagenesis studies; for instance, the Rosetta function achieves a Pearson correlation coefficient R = 0.994 for ΔΔG upon mutation on its optimization dataset, while performance on independent blind tests is typically lower (Pearson r ≈ 0.3–0.8 depending on the protocol and dataset).

Recent machine learning advances, such as those in trRosetta, have improved potentials by incorporating deep learning predictions of interresidue orientations, enhancing accuracy in structure prediction and design tasks during the 2020s. Related developments include deep learning-derived coarse-grained force fields that predict protein structures and dynamics with high accuracy.
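
The weighted-sum form of the energy can be sketched with two of the terms named above. This is a toy illustration, not the Rosetta implementation: the Lennard-Jones parameters, the dielectric scaling `eps_scale`, the term weights, and all function names are assumptions chosen for readability. The Coulomb constant 332.0637 converts charges in units of e and distances in Å to kcal/mol.

```python
import math

def lennard_jones(r, epsilon=0.2, sigma=3.5):
    """12-6 Lennard-Jones energy; its minimum is -epsilon at r = 2**(1/6) * sigma."""
    s6 = (sigma / r) ** 6
    return 4.0 * epsilon * (s6 * s6 - s6)

def coulomb_distance_dielectric(r, q1, q2, eps_scale=4.0):
    """Coulomb energy with a distance-dependent dielectric eps(r) = eps_scale * r."""
    return 332.0637 * q1 * q2 / (eps_scale * r * r)

def total_energy(terms, weights):
    """Weighted sum E_total = sum_i w_i * E_i over named energy terms."""
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical atom pair: 4.0 A apart with opposite partial charges.
r = 4.0
terms = {
    "vdw": lennard_jones(r),
    "elec": coulomb_distance_dielectric(r, q1=0.4, q2=-0.4),
}
weights = {"vdw": 1.0, "elec": 0.5}  # illustrative weights, not fitted values
print(round(total_energy(terms, weights), 3))
```

In a real energy function the weights are fitted against experimental data, which is one reason performance on the fitting set can exceed performance on independent benchmarks.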

Sequence space exploration

Protein sequence space exploration in design involves navigating the vast combinatorial landscape of possible amino acid sequences (20^N for an N-residue protein) to identify those that stably adopt a target structure, without relying on brute-force search, which is computationally infeasible at this scale.

Traditional approaches discretize this space using rotamer libraries, which represent side-chain conformations observed in protein structures, such as the backbone-dependent Dunbrack library, which contains approximately 10 to 100 rotamers per amino acid type derived by clustering empirical data from the Protein Data Bank. This discretization reduces the per-residue search space from continuous dihedral angles to a manageable discrete set, enabling optimization techniques like dead-end elimination to prune incompatible combinations early. Clustering further refines these libraries by grouping similar rotamers, minimizing redundancy while preserving the conformational diversity essential for realistic packing.

Continuous aspects of the search, particularly backbone sampling and side-chain packing, introduce additional complexity beyond discrete rotamers. Backbone sampling generates low-energy conformational ensembles using methods like fragment assembly, allowing flexibility in phi/psi dihedrals to explore viable folds, while side-chain packing optimizes rotamer assignments conditioned on the backbone to minimize steric clashes and maximize favorable interactions. For small proteins (e.g., <50 residues), exhaustive enumeration of sequence-rotamer combinations is feasible and yields global minima, but for larger systems approximations such as Monte Carlo sampling are used to traverse the space stochastically, iteratively perturbing sequences and conformations to escape local minima.
Success in exploration is gauged by metrics like low-energy sequences, typically those scoring below -2 Rosetta Energy Units (REU) per residue using the Rosetta all-atom energy function, indicating thermodynamic stability comparable to natural proteins. Diversity is enhanced through Monte Carlo methods that use a temperature parameter to sample a broader range of viable sequences, preventing convergence to homogeneous solutions and promoting robustness.

Recent advances leverage machine learning, particularly protein language models like ESM, which use transformer architectures trained on evolutionary sequences to generate embeddings that guide sequence sampling in underrepresented regions of the space. Post-2022 neural network approaches, including generative models, enable direct exploration of novel sequence variants by inverting structure-to-sequence mappings or conditioning on structural motifs, as demonstrated in global generative frameworks that sample across the entire protein universe. By 2025, extensions like retrieval-augmented ESM variants incorporate homologous sequences to refine predictions, accelerating discovery of diverse, functional designs.
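
The Monte Carlo search with a temperature parameter described above can be sketched as a Metropolis sampler over sequences. Everything here is a toy: the energy function simply counts positions whose polarity mismatches a hypothetical buried/exposed pattern, the hydrophobic residue set is one common but arbitrary choice, and `kt` is in the same arbitrary units as the energy. A real application would substitute an all-atom energy function.

```python
import math
import random

HYDROPHOBIC = set("AVILMFWC")  # assumed classification, for illustration only
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def toy_energy(sequence, buried):
    """Hypothetical score: +1 per position whose polarity mismatches its burial state."""
    return sum((aa in HYDROPHOBIC) != is_buried for aa, is_buried in zip(sequence, buried))

def metropolis_design(buried, n_steps=5000, kt=0.5, seed=0):
    """Metropolis Monte Carlo over sequences; returns the best sequence seen and its energy."""
    rng = random.Random(seed)
    seq = [rng.choice(ALPHABET) for _ in buried]
    energy = toy_energy(seq, buried)
    best_seq, best_energy = list(seq), energy
    for _ in range(n_steps):
        pos = rng.randrange(len(seq))
        old = seq[pos]
        seq[pos] = rng.choice(ALPHABET)  # propose a single-residue mutation
        delta = toy_energy(seq, buried) - energy
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / kt):
            energy += delta
            if energy < best_energy:
                best_seq, best_energy = list(seq), energy
        else:
            seq[pos] = old  # reject: restore the previous residue
    return "".join(best_seq), best_energy

pattern = [True, False] * 6  # alternating buried/exposed positions
designed, score = metropolis_design(pattern)
print(designed, score)
```

Raising `kt` makes uphill moves more likely and broadens the set of sequences visited, which is the diversity mechanism the text refers to; lowering it makes the search greedier.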

Computational Methods

Optimization formulations

Protein design is formalized as a mathematical optimization problem that seeks amino acid sequences or structural configurations compatible with a desired three-dimensional fold, typically by minimizing an energy function derived from biophysical models. The core challenge lies in navigating the enormous sequence space (approximately 20 possibilities per residue) while ensuring the designed protein adopts the target conformation with high stability and, if applicable, specific functional properties. This setup contrasts with protein structure prediction, which infers structure from sequence, by inverting the process to engineer sequences for predefined structures.

A primary problem type is sequence design given a fixed target structure, formulated as minimizing the conditional energy $E(\text{sequence} \mid \text{structure})$, where the energy function decomposes into terms for intra-residue interactions, pairwise residue contacts, and solvation effects. For instance, the total energy is often expressed as

$$E = E_0 + \sum_i E_i(r_i) + \sum_{i<j} E_{ij}(r_i, r_j),$$

with $r_i$ denoting the rotamer (discrete side-chain conformation) at residue $i$, $E_i$ the unary term, and $E_{ij}$ the pairwise term. In structure design, joint optimization extends this to optimize sequence and backbone coordinates simultaneously, coupling sequence compatibility with conformational sampling. The objective generally minimizes energy subject to foldability constraints, such as ensuring the target conformation has lower energy than decoy structures; multi-objective variants trade off stability (e.g., via folding free energy) against function (e.g., binding specificity), often yielding Pareto-optimal sets of sequences.

Combinatorial and continuous formulations address the discrete or flexible nature of protein degrees of freedom.
In the combinatorial approach, side chains are discretized into rotamer libraries, leading to integer programming models: binary variables $x_{i,k} = 1$ if rotamer $k$ is selected for residue $i$, with constraints like $\sum_k x_{i,k} = 1$ (one rotamer per residue) and linear inequalities preventing steric clashes (e.g., pairwise exclusion). This yields a 0/1 integer linear or quadratic program. Continuous formulations, by contrast, optimize the torsion angles $\phi, \psi$ for the backbone and the $\chi$ angles for side chains directly, relaxing the discrete search to a differentiable landscape suitable for gradient-based methods, though requiring approximations to cope with non-convexity. The general design equation is

$$\min_x E(x) \quad \text{s.t.} \quad g(x) \leq 0, \quad h(x) = 0,$$

where $x$ is the sequence vector (extended to include angles in joint cases), $E(x)$ is the energy, and the constraints $g$ and $h$ enforce steric feasibility and fold specificity.

The discrete protein design problem is NP-hard, with computational complexity scaling exponentially in the number of residues because of the combinatorial explosion of possible assignments, necessitating approximations or heuristics for practical scales beyond small peptides.

Stochastic formulations incorporate uncertainty from conformational dynamics or noisy energy estimates by optimizing expected values, such as $\min_x \mathbb{E}[E(x)]$ over an ensemble of structures, using probabilistic sampling to model flexibility and robustness. These handle ensemble-averaged properties, like partial unfolding risks, but introduce variability in solutions compared to deterministic setups.
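
For very small instances, the combinatorial formulation can be solved by direct enumeration, which also makes the pairwise-decomposable energy concrete. The sketch below is illustrative: the three-residue, two-rotamer energy tables are invented numbers, and the data layout (a list of unary energies per residue and a dictionary of pairwise tables keyed by `(i, j)` with `i < j`) is an assumption made for this example.

```python
from itertools import product

def conformation_energy(assignment, e0, unary, pairwise):
    """E = E0 + sum_i E_i(r_i) + sum_{i<j} E_ij(r_i, r_j) for one rotamer assignment."""
    n = len(assignment)
    energy = e0 + sum(unary[i][assignment[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy += pairwise[(i, j)][assignment[i]][assignment[j]]
    return energy

def brute_force_gmec(e0, unary, pairwise):
    """Enumerate every assignment (one rotamer per residue) and return (energy, assignment)."""
    choices = [range(len(u)) for u in unary]
    return min(
        ((conformation_energy(a, e0, unary, pairwise), a) for a in product(*choices)),
        key=lambda pair: pair[0],
    )

# Made-up energies: 3 residues, 2 rotamers each.
UNARY = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.2]]
PAIRWISE = {
    (0, 1): [[0.0, 2.0], [-1.0, 0.0]],
    (0, 2): [[0.0, 0.0], [0.0, -0.5]],
    (1, 2): [[0.3, 0.0], [0.0, 1.0]],
}

energy, assignment = brute_force_gmec(0.0, UNARY, PAIRWISE)
print(assignment, round(energy, 3))  # (1, 0, 1) 0.2
```

Enumeration visits every one of the 2^3 = 8 assignments here; the exponential growth of that count with the number of residues is what motivates the pruning and bounding algorithms discussed next.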

Algorithms with mathematical guarantees

Algorithms with mathematical guarantees in protein design are exact optimization techniques that provably identify the global minimum energy conformation (GMEC), the lowest-energy sequence and rotamer assignment for a given backbone structure, or that provide bounds on the optimal solution. These methods address the combinatorial explosion of sequence and rotamer choices through pruning, bounding, or integer programming, ensuring optimality without exhaustive enumeration, though they remain computationally intensive for large proteins. Unlike heuristic approaches, they come with formal proofs of correctness, usually relying on energy functions that decompose into pairwise interactions between residues.

Dead-end elimination (DEE) is a cornerstone algorithm that iteratively prunes rotamers that provably cannot be part of the GMEC. If pruning leaves a single rotamer at every position, that assignment is the GMEC; otherwise the reduced space is handed to an exact search such as branch-and-bound or A*. The core criterion eliminates a rotamer $r_i$ at residue position $k$ if its minimum possible energy over all conformations exceeds the maximum possible energy of some alternative rotamer $r_j$ at the same position (where $j \neq i$):

$$\min_{\text{conf} \ni r_i} E(\text{conf}) > \max_{\text{conf} \ni r_j} E(\text{conf}).$$

In practice the criterion is applied through a sufficient condition built from bounds on pairwise interactions,

$$E(k_{r_i}) + \sum_{l \neq k} \min_{r_l} E(k_{r_i}, l_{r_l}) > E(k_{r_j}) + \sum_{l \neq k} \max_{r_l} E(k_{r_j}, l_{r_l}),$$

which often shrinks the combinatorial search space by many orders of magnitude. Originally introduced for side-chain positioning and later generalized for protein design, DEE has been extended with perturbations (DEEPer) to handle continuous side-chain and backbone flexibility by applying perturbations around discrete conformations and tightening bounds iteratively. Multistate variants, such as type-dependent DEE, further prune by considering multiple target conformations simultaneously.
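The pairwise-bound criterion above can be sketched in a few lines of Python. This is an illustrative implementation of the original elimination condition only (not the later, stronger variants), run on a made-up three-position chain; `dee_prune` and the energy tables are hypothetical names and values.

```python
# Hypothetical toy instance: positions 0-1 and 1-2 interact, 0-2 do not.
DEE_UNARY = [[0.0, 5.0], [0.0, 0.1], [0.1, 4.0]]
DEE_PAIRWISE = {
    (0, 1): [[-0.5, -0.4], [0.2, 0.1]],
    (1, 2): [[0.0, 0.4], [-0.3, 0.2]],
}


def dee_prune(unary, pairwise):
    """Apply the original dead-end elimination criterion until no rotamer
    can be removed. Returns the set of surviving rotamers per position."""
    n = len(unary)
    alive = [set(range(len(u))) for u in unary]

    def pair(i, ri, j, rj):
        # Look up E_ij in either index order; absent pairs contribute 0.
        if (i, j) in pairwise:
            return pairwise[(i, j)][ri][rj]
        if (j, i) in pairwise:
            return pairwise[(j, i)][rj][ri]
        return 0.0

    def best_case(k, r):
        # Lower bound on the energy contribution of rotamer r at position k.
        return unary[k][r] + sum(
            min(pair(k, r, l, s) for s in alive[l]) for l in range(n) if l != k
        )

    def worst_case(k, r):
        # Upper bound on the energy contribution of rotamer r at position k.
        return unary[k][r] + sum(
            max(pair(k, r, l, s) for s in alive[l]) for l in range(n) if l != k
        )

    changed = True
    while changed:
        changed = False
        for k in range(n):
            for r in sorted(alive[k]):
                # r is a dead end if some competitor t is better even when
                # r sees its best environment and t its worst.
                if any(best_case(k, r) > worst_case(k, t)
                       for t in alive[k] if t != r):
                    alive[k].discard(r)
                    changed = True
    return alive


if __name__ == "__main__":
    print(dee_prune(DEE_UNARY, DEE_PAIRWISE))
```

On this instance, the first pass removes only the two high-energy rotamers. Their removal tightens the bounds at position 1, so a second pass eliminates one more rotamer there, which illustrates why the criterion is applied iteratively.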
Branch-and-bound (BnB) algorithms perform an exact tree search over the rotamer space, using upper and lower energy bounds to prune branches that cannot contain the GMEC, thus guaranteeing optimality while avoiding full enumeration. The search proceeds depth-first or best-first, evaluating partial assignments and discarding subtrees whose lower bound exceeds the energy of the best complete assignment found so far. A* variants improve the search by incorporating admissible heuristics, such as relaxations of the energy function, to guide expansion toward low-energy regions; for instance, BroMAP combines BnB with bounds obtained from MAP estimation. BnB formulations can also exploit the structure of residue interaction graphs to decompose the problem, reducing complexity for symmetric or modular proteins. These methods have successfully designed sequences for target folds by exhaustively exploring constrained spaces.

Integer programming reformulates protein design as an optimization over binary variables indicating rotamer selections, with linear constraints ensuring exactly one rotamer per site and compatibility between interacting residues. It is often written as a mixed-integer quadratic program (MIQP), which can be linearized to an integer linear program (ILP) by introducing one binary variable per interacting rotamer pair. The objective minimizes the total energy,

$$\min \sum_{k} \sum_{r_k} c_{k,r_k} x_{k,r_k} + \sum_{k<l} \sum_{r_k,r_l} e_{k,l,r_k,r_l} x_{k,r_k} x_{l,r_l},$$

where $x_{k,r_k}$ are binary indicators and $c$, $e$ are the self and pairwise energies. LP relaxations provide bounds, and branch-and-cut solvers such as Gurobi yield exact solutions. The approach handles continuous dihedral angles via mixed-integer extensions and has been applied to side-chain packing and sequence design, with cluster expansions accelerating large instances by approximating higher-order terms. An ILP solved to optimality guarantees the GMEC for the discrete model; the problem sizes that are tractable in practice depend on solver performance.
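The branch-and-bound search described in this section can be sketched as a depth-first traversal with an admissible lower bound. The bound used here (exact energy of the assigned prefix plus, for each unassigned position, its most favorable rotamer under optimistic pairwise terms) is one common choice and is an assumption of this example; the instance and the name `branch_and_bound_gmec` are hypothetical.

```python
# Hypothetical toy instance: 3 positions, 2 rotamers each.
BB_UNARY = [[1.0, 0.5], [0.2, 0.8], [0.0, 0.3]]
BB_PAIRWISE = {
    (0, 1): [[0.0, 1.5], [0.4, -0.6]],
    (0, 2): [[-0.2, 0.0], [0.0, 0.1]],
    (1, 2): [[0.3, -0.5], [0.0, 0.9]],
}


def branch_and_bound_gmec(unary, pairwise):
    """Depth-first branch-and-bound over rotamer assignments.
    Returns (energy, assignment, number_of_nodes_visited)."""
    n = len(unary)

    def pair(i, ri, j, rj):
        if i > j:
            i, ri, j, rj = j, rj, i, ri
        table = pairwise.get((i, j))
        return 0.0 if table is None else table[ri][rj]

    def lower_bound(partial):
        d = len(partial)
        bound = 0.0
        # Exact energy of the positions assigned so far.
        for i in range(d):
            bound += unary[i][partial[i]]
            for j in range(i + 1, d):
                bound += pair(i, partial[i], j, partial[j])
        # Optimistic estimate for every unassigned position.
        for j in range(d, n):
            bound += min(
                unary[j][r]
                + sum(pair(i, partial[i], j, r) for i in range(d))
                + sum(min(pair(j, r, l, s) for s in range(len(unary[l])))
                      for l in range(j + 1, n))
                for r in range(len(unary[j]))
            )
        return bound

    best = [float("inf"), None]
    visited = [0]

    def search(partial):
        visited[0] += 1
        bound = lower_bound(partial)
        if bound >= best[0]:
            return  # this subtree cannot improve on the incumbent
        if len(partial) == n:
            best[0], best[1] = bound, tuple(partial)
            return
        for r in range(len(unary[len(partial)])):
            search(partial + [r])

    search([])
    return best[0], best[1], visited[0]


if __name__ == "__main__":
    print(branch_and_bound_gmec(BB_UNARY, BB_PAIRWISE))
```

For a complete assignment the bound equals the exact energy, so the incumbent is always a true energy; because the bound never overestimates, pruning cannot discard the optimum.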
Message-passing algorithms, such as loopy belief propagation, max-product message passing, and their tree-reweighted variants, operate on the graphical model of the design problem and are closely tied to the dual of its LP relaxation. They iteratively pass messages between rotamer variables along the residue interaction graph. On tree-structured graphs max-product is exact and recovers the GMEC. On graphs with cycles, plain loopy belief propagation carries no optimality guarantee, whereas tree-reweighted variants yield a valid lower bound on the minimum energy and therefore a provable optimality gap for any candidate solution. Max-sum formulations allow these bounds to be recomputed efficiently for partial assignments, so in protein design they are integrated with BnB to guide pruning, giving guarantees on suboptimality when combined with exact solvers.

These exact methods perform well for proteins under 100 residues, often solving instances with 10-20 mutable sites in seconds to minutes on modern hardware, and they excel in symmetric or low-flexibility designs where the search space is tractable. For larger systems, provable optimality remains challenging because the problem is NP-hard, but successes include symmetric oligomers and enzyme active sites designed with verified low-energy sequences.
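The statement that message passing is exact on tree-structured graphs can be illustrated on the simplest tree, a chain of residues where only sequence neighbors interact. The sketch below is the min-sum (energy-domain) counterpart of max-product, written as a forward pass of messages followed by a backtrace; the chain instance and the name `min_sum_chain` are hypothetical.

```python
# Hypothetical chain instance: position i interacts only with position i + 1.
CHAIN_UNARY = [[0.0, 5.0], [0.0, 0.1], [0.1, 4.0]]
# CHAIN_PAIRS[i][a][b] = energy between rotamer a at i and rotamer b at i + 1
CHAIN_PAIRS = [
    [[-0.5, -0.4], [0.2, 0.1]],
    [[0.0, 0.4], [-0.3, 0.2]],
]


def min_sum_chain(unary, pair_tables):
    """Exact min-sum message passing on a chain 0-1-...-(n-1).
    Returns (minimum_energy, assignment)."""
    n = len(unary)
    # msg[i][b]: lowest energy of positions 0..i given rotamer b at position i.
    msg = [list(unary[0])]
    back = []
    for i in range(1, n):
        cur, ptr = [], []
        for b in range(len(unary[i])):
            costs = [msg[i - 1][a] + pair_tables[i - 1][a][b]
                     for a in range(len(unary[i - 1]))]
            a_best = min(range(len(costs)), key=costs.__getitem__)
            cur.append(costs[a_best] + unary[i][b])
            ptr.append(a_best)
        msg.append(cur)
        back.append(ptr)
    # Backtrace the argmin pointers from the last position to the first.
    last = min(range(len(msg[-1])), key=msg[-1].__getitem__)
    assignment = [last]
    for ptr in reversed(back):
        assignment.append(ptr[assignment[-1]])
    assignment.reverse()
    return min(msg[-1]), tuple(assignment)


if __name__ == "__main__":
    print(min_sum_chain(CHAIN_UNARY, CHAIN_PAIRS))
```

The cost is linear in the number of positions and quadratic in the number of rotamers per position. Real residue interaction graphs contain cycles, which is where the loopy and tree-reweighted variants, and their weaker guarantees, come in.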

Heuristic and AI-driven approaches

Heuristic approaches in protein design prioritize computational efficiency over exact optimality, employing stochastic or approximate inference techniques to navigate the vast sequence and conformation spaces. Monte Carlo methods, integrated into the Rosetta software suite, sample protein conformations and sequences by proposing random perturbations and accepting or rejecting them based on energy changes. Simulated annealing enhances this by incorporating a temperature parameter that decreases over iterations according to a predefined cooling schedule, allowing temporary acceptance of higher-energy states to escape local minima. Acceptance follows the Metropolis criterion: a move that lowers the energy is always accepted, and a move with energy increase ΔE is accepted with probability exp(-ΔE / kT), with k the Boltzmann constant and T the current temperature. This approach has been foundational in Rosetta for both structure prediction and design tasks since the late 1990s.

The FASTER algorithm is an advanced heuristic for side-chain placement and sequence optimization that converges rapidly by iteratively perturbing the current rotamer assignment and relaxing only selected positions, while maintaining near-optimal energy scores. Starting from initial configurations biased toward low-energy states, FASTER delivers up to two orders of magnitude speedup over traditional dead-end elimination or Monte Carlo methods, reducing computation from days to hours for complex designs. This enables practical application to multistate design problems, where sequences must satisfy multiple conformational states.

Belief propagation offers another approximate inference strategy, modeling protein design as a probabilistic graphical model where variables represent amino acid choices and factors encode interaction energies.
The algorithm performs iterative message passing between nodes to estimate marginal probabilities and, when it converges, yields approximate optima for low-energy sequences without exhaustive enumeration. Convergence is not guaranteed on graphs with cycles, but the resulting marginal amino acid probabilities capture dependencies between interacting positions and can guide sequence selection in large systems.

Modern AI-driven methods leverage deep learning for scalable protein design, particularly generative models such as variational autoencoders (VAEs) and diffusion models that learn latent representations of protein structures and sequences from large datasets. ProteinMPNN, a message-passing neural network introduced in 2022, generates sequences conditioned on fixed backbones by autoregressively predicting residues in a randomized decoding order rather than strictly from N- to C-terminus, using structural features such as distances between backbone atoms. Trained on over 19,000 Protein Data Bank structures with added backbone noise for robustness, it achieves 52.4% native sequence recovery, compared with 32.9% for Rosetta, and designs functional monomers, oligomers, and interfaces that have been validated experimentally by crystallography and cryo-EM.

Generative methods extend these AI techniques to de novo backbone generation. Earlier hallucination protocols optimized sequences against a structure-prediction network, whereas RFdiffusion, which uses RoseTTAFold as its denoising network, samples novel folds from noise by iteratively refining random residue frames over up to 200 steps. This enables topology-constrained design of structures such as TIM barrels and symmetric assemblies. It generates 100-residue proteins in seconds on consumer GPUs and outperforms prior hallucination methods in diversity and accuracy, with experimental validation of large oligomers of up to 1,050 residues by negative-stain electron microscopy.
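The native sequence recovery figures quoted above for ProteinMPNN and Rosetta measure the fraction of positions at which a designed sequence reproduces the amino acid of the native protein on the same backbone. A minimal sketch of that metric follows; the function name and the example sequences are hypothetical, and published benchmarks may differ in details such as which positions are counted.

```python
def sequence_recovery(designed, native):
    """Fraction of aligned positions where the designed residue matches
    the native residue. Both sequences must have the same length."""
    if len(designed) != len(native) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


if __name__ == "__main__":
    # Made-up 8-residue example: 6 of 8 positions agree.
    print(sequence_recovery("MKTAYIAK", "MKTVYIGK"))
```

High recovery suggests that a method captures the sequence preferences of natural backbones; it does not by itself show that a designed sequence folds.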
Complementing RFdiffusion, the 2023 Chroma model integrates diffusion with graph neural networks for conditional generation, allowing user-specified constraints such as symmetry, shape, or natural-language prompts, and can produce novel protein complexes exceeding 3,000 residues in minutes on standard hardware.

These heuristic and AI approaches yield speedups of over 1,000-fold relative to exact optimization methods such as branch-and-bound, making designs feasible that are intractable for exhaustive search while still recovering near-native sequences and folds. In 2024, AI frameworks incorporating experimental feedback further improved design efficiency for applications in medicine and catalysis. By 2025, these methods had enabled large engineered protein assemblies, including modular self-assembling nanomaterials and symmetric nanoparticles validated by high-resolution structural biology, accelerating applications in therapeutics and materials.
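Returning to the Monte Carlo and simulated annealing scheme described at the start of this section, the Metropolis criterion and a cooling schedule can be sketched as follows. This is a generic illustration, not Rosetta's implementation: the geometric cooling schedule, the single-position move set, the reduced units with k = 1, and all names and energy values are assumptions of the example.

```python
import math
import random

# Hypothetical toy instance: 3 positions, 2 rotamers each.
SA_UNARY = [[1.0, 0.5], [0.2, 0.8], [0.0, 0.3]]
SA_PAIRWISE = {
    (0, 1): [[0.0, 1.5], [0.4, -0.6]],
    (0, 2): [[-0.2, 0.0], [0.0, 0.1]],
    (1, 2): [[0.3, -0.5], [0.0, 0.9]],
}


def metropolis_accept(delta_e, temperature, rng):
    """Always accept downhill moves; accept uphill moves with
    probability exp(-dE / T), using units in which k = 1."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def simulated_annealing(unary, pairwise, steps=2000,
                        t_start=2.0, t_end=0.05, seed=0):
    """Return (lowest_energy_seen, assignment) after annealing."""
    rng = random.Random(seed)
    n = len(unary)

    def energy(a):
        e = sum(unary[i][a[i]] for i in range(n))
        e += sum(t[a[i]][a[j]] for (i, j), t in pairwise.items())
        return e

    current = [rng.randrange(len(u)) for u in unary]
    e_cur = energy(current)
    best, e_best = tuple(current), e_cur
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        # Propose a new rotamer at one randomly chosen position.
        pos = rng.randrange(n)
        proposal = list(current)
        proposal[pos] = rng.randrange(len(unary[pos]))
        e_new = energy(proposal)
        if metropolis_accept(e_new - e_cur, temp, rng):
            current, e_cur = proposal, e_new
            if e_cur < e_best:
                best, e_best = tuple(current), e_cur
    return e_best, best


if __name__ == "__main__":
    print(simulated_annealing(SA_UNARY, SA_PAIRWISE))
```

Unlike the exact methods of the previous section, the result carries no optimality guarantee: it is simply the lowest-energy state visited, and it depends on the schedule, the number of steps, and the random seed.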

Applications

De novo and novel fold design

De novo protein design involves the computational creation of proteins with entirely novel structures that do not exist in nature, relying on principles of physics and biology to specify backbones and sequences from scratch. This approach contrasts with template-based methods by generating unprecedented folds, enabling the exploration of new topological space. Key strategies include scaffold design, where idealized structural motifs such as beta-barrels are assembled into stable cores, and fold hallucination, which uses deep neural networks to generate diverse backbone conformations without relying on existing templates.

A landmark example is Top7, a 93-residue α/β protein designed in 2003 with a novel fold unrelated to any natural protein, which folds into its intended structure. Scaffold-based designs have produced functional beta-barrels, such as eight-stranded transmembrane variants that insert into lipid membranes and remain folded above 50°C, as confirmed by circular dichroism spectroscopy. More recent advances include de novo metalloproteins, such as an expandable platform incorporating redox-active heme groups into novel folds for electron transfer applications.

Validation of these designs typically involves biophysical characterization, with X-ray crystallography providing atomic-level confirmation; for instance, the Top7 crystal structure matched its computational model with a root-mean-square deviation of 1.2 Å, and designed beta-barrels have shown near-perfect agreement with their predicted backbones. Thermal denaturation experiments often reveal melting temperatures above 50°C, indicating robust folding in aqueous environments. These metrics underscore the fidelity of modern design tools in producing stable, novel architectures.

Despite these successes, challenges persist, including variable experimental success rates for folding into intended structures due to inaccuracies in energy functions and sampling limitations.
Integrating function into novel folds remains difficult, often requiring iterative refinement. By 2025, AI-driven methods have advanced applications, such as de novo mini-proteins designed as potent inhibitors of the MERS-CoV spike protein, achieving nanomolar binding affinities and protection in cell models. Additionally, recent de novo enzymes, like porphyrin-containing catalysts with stereoselective activity for carbon-carbon bond formation, highlight progress in functional novelty.
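The root-mean-square deviation used above to compare a design model with its experimental structure can be computed as below. This sketch assumes the two coordinate sets are already optimally superimposed and list matching atoms in the same order; the superposition step itself (for example with the Kabsch algorithm) is omitted, and the coordinates shown are invented.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally sized lists of
    (x, y, z) coordinates, in the same units as the input (e.g. angstroms).
    No superposition is performed here."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    # Made-up C-alpha positions: every atom is displaced by 1.2 along z.
    model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    crystal = [(0.0, 0.0, 1.2), (3.8, 0.0, 1.2), (7.6, 0.0, 1.2)]
    print(rmsd(model, crystal))
```

Reported values depend on which atoms are included (C-alpha only, backbone, or all heavy atoms), so RMSDs from different studies are comparable only when the atom selection matches.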

Enzyme and catalyst engineering

Enzyme and catalyst engineering involves the computational and experimental creation of proteins that accelerate chemical reactions, often by precisely positioning catalytic residues to stabilize transition states. A key approach is theozyme placement, where an ideal catalytic motif (termed a theozyme) that models the transition-state geometry is docked into protein scaffolds to identify backbones that can support the required interactions. This is followed by scaffold matching, an automated process that scans protein structures for backbone fragments compatible with the theozyme, ensuring geometric and energetic feasibility for catalysis. These methods enable de novo design of active sites in existing or novel folds, prioritizing electrostatic and hydrogen-bonding networks that lower activation barriers.

Early successes demonstrated the viability of this paradigm with the design of Kemp eliminases in 2008, where theozyme-based placement into diverse scaffolds yielded enzymes catalyzing the Kemp elimination, a proton abstraction coupled to bond breaking, with k_cat values up to 700 min⁻¹ for the KE70 variant, a milestone in non-natural catalysis. Similarly, retro-aldolases designed that year used four distinct theozymes to break carbon-carbon bonds in a non-natural substrate, with detectable activity in 32 of 72 tested designs spanning multiple folds and k_cat/K_M efficiencies reaching 10² M⁻¹ s⁻¹. These examples showed how scaffold matching can repurpose protein architectures for xenobiotic reactions, though initial efficiencies were modest compared to natural enzymes.

To enhance performance, semi-rational strategies combine computational design with directed evolution, iteratively refining active sites through mutagenesis and selection.
For instance, cytochrome P450 variants such as CYP102A1 (P450BM3) have been engineered for selective oxidations of pharmaceuticals: initial Rosetta-based designs predict substrate binding, and subsequent evolution yields variants with >100-fold improved activity and k_cat/K_M >10³ M⁻¹ s⁻¹ for specific substrates such as testosterone. This hybrid approach compensates for design inaccuracies by using evolutionary optimization for specificity and stability, as seen in variants achieving >90% enantioselectivity in sulfoxidation.

Recent advances incorporate quantum mechanics/molecular mechanics (QM/MM) hybrids to refine energy functions, providing higher accuracy than classical methods alone; for example, QM/MM simulations have improved predictions of electrostatic contributions in Kemp eliminase active sites by 20-30% in barrier heights. In 2025, luciferases designed via theozyme placement and scaffold generation enabled multiplexed imaging, with neoLux variants exhibiting >10-fold brighter emission than prior designs and orthogonal substrate specificity. Machine learning models further aid reaction prediction by forecasting catalytic motifs and efficiencies, as in generative frameworks that hallucinate sequences for uncharted reactions with reported wet-lab validation success above 80%. These developments underscore ongoing progress toward enzymes rivaling natural catalysts in rate and selectivity.
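The kinetic quantities quoted in this section (k_cat, K_M, k_cat/K_M, and changes in barrier height) are related by standard expressions that can be sketched as follows. The Michaelis-Menten rate law and the transition-state-theory relation between a barrier reduction and a rate ratio are textbook results; the function names and numeric inputs are hypothetical, and the rate-ratio formula assumes equal pre-exponential factors.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def per_minute_to_per_second(rate_per_min):
    """Convert a rate constant from min^-1 to s^-1."""
    return rate_per_min / 60.0


def michaelis_menten_rate(kcat, km, enzyme_conc, substrate_conc):
    """Initial rate v = kcat * [E] * [S] / (K_M + [S])."""
    return kcat * enzyme_conc * substrate_conc / (km + substrate_conc)


def catalytic_efficiency(kcat, km):
    """k_cat / K_M, e.g. in M^-1 s^-1 when kcat is in s^-1 and km in M."""
    return kcat / km


def rate_ratio_from_barrier_drop(barrier_drop_kcal, temperature=298.15):
    """Rate ratio exp(ddG / RT) for a barrier lowered by ddG (kcal/mol),
    assuming identical pre-exponential factors."""
    return math.exp(barrier_drop_kcal / (R_KCAL * temperature))


if __name__ == "__main__":
    print(per_minute_to_per_second(700.0))        # 700 min^-1 in s^-1
    print(rate_ratio_from_barrier_drop(1.364))    # roughly a tenfold speedup
```

At room temperature each reduction of about 1.4 kcal/mol in the activation barrier corresponds to roughly a tenfold rate increase, which is why errors of a few kcal/mol in designed active-site energetics translate into orders of magnitude in activity.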

Therapeutic and binding proteins

Protein design for therapeutic and binding applications focuses on proteins that recognize and interact with specific molecular targets, such as pathogens, cancer cells, or disease-related proteins, to enable neutralization or immune modulation. These designs prioritize high-affinity binding while minimizing off-target effects, often using computational methods to optimize protein-protein interfaces. Key targets include viral receptors, tumor antigens, and signaling molecules, where binders serve as inhibitors, diagnostic tools, or components in immunotherapies.

Interface design in therapeutic proteins emphasizes hotspot residues, the specific amino acids that contribute disproportionately to binding energy in protein-protein interactions, to create stable, high-affinity complexes. By computationally identifying and optimizing these hotspots, designers can sculpt interfaces that mimic natural interactions but with enhanced stability or on novel scaffolds. For antibody engineering, CDR (complementarity-determining region) grafting transfers the antigen-binding loops from a non-human antibody onto a human framework to reduce immunogenicity while preserving specificity; this method has been refined computationally to select framework matches that maintain CDR conformation.

Binding affinity is evaluated using protocols such as Rosetta ΔΔG calculations, which estimate the change in binding free energy (ΔΔG) upon mutation by sampling conformational ensembles and scoring interactions; a ΔΔG below -2 kcal/mol indicates a significant affinity improvement suitable for therapeutic use. A simplified approximation for the binding free energy in these models is $\Delta G_{\text{bind}} \approx \Delta E_{\text{vdw}} + \Delta E_{\text{ele}}$, where the van der Waals ($\Delta E_{\text{vdw}}$) and electrostatic ($\Delta E_{\text{ele}}$) terms dominate interface energetics, though full protocols also incorporate solvation and entropy.
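To make the ΔΔG threshold above concrete, the sketch below applies the simplified two-term approximation and converts a ΔΔG into a fold change in dissociation constant through the standard relation ΔG = RT ln K_d. The energy terms are invented placeholders rather than output from Rosetta or any other scoring program, and the function names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_energy(delta_e_vdw, delta_e_ele):
    """Simplified estimate: dG_bind ~ dE_vdw + dE_ele (kcal/mol)."""
    return delta_e_vdw + delta_e_ele


def ddg(dg_variant, dg_reference):
    """Change in binding free energy; negative means tighter binding."""
    return dg_variant - dg_reference


def kd_ratio(ddg_kcal, temperature=298.15):
    """K_d(variant) / K_d(reference) = exp(ddG / RT)."""
    return math.exp(ddg_kcal / (R_KCAL * temperature))


if __name__ == "__main__":
    # Made-up interface energies for a reference binder and a redesigned one.
    reference = binding_energy(delta_e_vdw=-7.5, delta_e_ele=-2.0)
    variant = binding_energy(delta_e_vdw=-8.8, delta_e_ele=-2.7)
    change = ddg(variant, reference)
    print(change, 1.0 / kd_ratio(change))  # ddG and fold improvement in K_d
```

At 25 °C, a ΔΔG of -2 kcal/mol corresponds to roughly a 30-fold decrease in K_d, which gives a sense of why that threshold is treated as significant.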
To ensure specificity and avoid off-target binding, negative design adds constraints that penalize interactions with non-target proteins, for example by disfavoring homodimerization or cross-reactivity in computational scoring.

Exemplary applications include de novo miniprotein binders to the SARS-CoV-2 spike protein receptor-binding domain, designed in 2020 with picomolar-range affinities (dissociation constants below 1 nM), which block viral entry by competing with the ACE2 receptor. Computationally designed inhibitors, such as those targeting amyloid aggregation in Alzheimer's disease, demonstrate how interface optimization yields stable complexes that halt pathogenic protein misfolding. Recent advances use AI-driven methods such as RFdiffusion to design affimer-like non-antibody scaffolds with tailored specificity for therapeutic targets, expanding beyond traditional antibodies.

Clinically, bispecific antibodies received FDA approvals in 2025, including linvoseltamab (Lynozyfic) for relapsed multiple myeloma, which enhances T-cell redirection with engineered affinities. In CAR-T therapies, designed protein binders boost antitumor activity by improving antigen recognition and reducing exhaustion, as shown in constructs targeting glioblastoma antigens such as EGFR and CD276, where computational optimization yields >100-fold specificity gains. These developments underscore protein design's role in advancing precision medicine, with ongoing refinements addressing stability and manufacturability.

Materials and non-biomedical uses

Protein design has enabled the creation of self-assembling nanostructures for applications in nanotechnology and biomaterials, where precise control over assembly pathways yields materials with tailored geometries and functions. One prominent example is the computational design of icosahedral protein shells, such as those reported in 2021, which use symmetric arrangements of protein subunits to form closed polyhedral cages of up to 120 subunits, exhibiting high stability and potential for encapsulating cargo. These designs rely on hierarchical assembly to minimize off-pathway aggregates, facilitating scalable production for non-biological uses such as nanoscale reactors. Similarly, amyloid-like fibrils have been engineered from combinatorial libraries, forming stable β-sheet structures that mimic natural amyloids but with customizable lengths and mechanical properties for use in composite materials. Such fibrils, derived from food proteins or synthetic peptides, are exceptionally resistant to denaturation, enabling their integration into durable scaffolds for environmental or industrial applications.

Beyond structural designs, specific proteins have been tailored for sensing and material fabrication. De novo luciferases, designed using deep learning in 2023, are compact, stable enzymes that emit bright bioluminescence in response to substrates, serving as components in industrial biosensors for real-time monitoring of chemical processes. These proteins, as small as 117 amino acids, outperform natural counterparts in stability under harsh conditions, making them suitable for non-medical detection systems. In textile applications, silk-inspired proteins engineered via AI-driven methods replicate the hierarchical β-sheet and amorphous domains of natural spider silk, yielding fibers with high tensile strength for sustainable fabrics.
These recombinant silks, produced in microbial hosts, are biocompatible and biodegradable, addressing demand for eco-friendly alternatives in manufacturing.

Designed protein materials often feature tunable mechanical properties, with Young's moduli ranging from 1 to 10 GPa achieved through optimization of secondary structures and interfaces, allowing customization for load-bearing applications such as structural composites. For instance, engineered protein fibers can reach moduli of approximately 4.9 GPa while maintaining elasticity, surpassing many synthetic polymers. Responsiveness to environmental stimuli adds further functionality: pH-sensitive designed helical bundles undergo reversible assembly and disassembly at physiological pH ranges, enabling adaptive materials for sensing, and light-responsive protein hydrogels incorporating photo-switchable domains transition between liquid and solid states upon irradiation, allowing on-demand reshaping.

In industrial contexts, protein design optimizes enzymes for biofuel production, such as cellulases engineered for enhanced hydrolysis of cellulose into fermentable sugars, improving the efficiency of conversion pathways. These modifications, including changes that improve substrate binding, boost activity under the high-temperature conditions typical of biorefineries. Designed proteins also support purification technologies, with computationally optimized helical bundles forming selective channels in lipid bilayers to facilitate solute separation.

Recent advances as of 2025 include scaffolds designed for non-therapeutic applications, such as modular protein assemblies that serve as robust platforms for multivalent display in catalytic or sensing arrays with precise geometric control. Computational approaches have also enabled responsive hydrogels, in which proteins with programmable interactions form networks that swell or stiffen in response to stimuli, for encapsulation of non-biological agents.
In 2025, designed enzymes were also reported for degrading persistent pollutants. These developments underscore the versatility of protein design in creating sustainable, high-performance materials outside biomedical domains.

References

  1. [1]
    Advances in protein structure prediction and design - Nature
    Aug 15, 2019 · Protein design is frequently referred to as the inverse protein-folding problem. Instead of searching for the lowest-energy conformation for a ...
  2. [2]
    De novo protein design, a retrospective - PMC - PubMed Central - NIH
    Proteins are molecular machines whose function depends on their ability to achieve complex folds with precisely defined structural and dynamic properties.
  3. [3]
  4. [4]
  5. [5]
  6. [6]
  7. [7]
    [PDF] Protein design - Stanford University
    Oct 29, 2024 · Protein design is designing a protein to serve a purpose, choosing the appropriate amino acid sequence, and designing an amino acid sequence ...
  8. [8]
    De novo design of protein structure and function with RFdiffusion
    Jul 11, 2023 · De novo protein design seeks to generate proteins with specified structural and/or functional properties, for example, making a binding ...
  9. [9]
    Nobel Prize in Chemistry 2024
    ### Summary of the 2024 Nobel Prize in Chemistry
  10. [10]
    Designing proteins | National Institutes of Health (NIH)
    Proteins perform many different functions in biology. Scientists have been working to custom design proteins that can perform unique functions beneficial to ...
  11. [11]
  12. [12]
    The convergence of AI and synthetic biology: the looming deluge
    Jul 1, 2025 · Protein engineering stands to benefit from AI. One key area of focus has been the de novo design of proteins - creating novel protein sequences ...
  13. [13]
  14. [14]
    Highly accurate protein structure prediction with AlphaFold - Nature
    Jul 15, 2021 · The AlphaFold network directly predicts the 3D coordinates of all heavy atoms for a given protein using the primary amino acid sequence and ...
  15. [15]
    Principles that Govern the Folding of Protein Chains - Science
    This article, 'Principles that Govern the Folding of Protein Chains', is by Christian B. Anfinsen, published in Science on July 20, 1973.
  16. [16]
    Characterization of a Helical Protein Designed from First Principles
    A systematic approach was aimed at the design of a four-helix bundle protein. The gene encoding the designed protein was synthesized and the protein was ...
  17. [17]
    The dead-end elimination theorem and its use in protein side-chain ...
    Apr 9, 1992 · Here we present a theorem, referred to as the 'dead-end elimination' theorem, which imposes a suitable condition to identify rotamers that cannot be members of ...
  18. [18]
    De novo design of the hydrophobic cores of proteins - Desjarlais
    Two of the designs, including one with eight core sequence changes, have thermal stabilities comparable to the native protein, whereas the third design and the ...
  19. [19]
    De novo design of luciferases using deep learning - Nature
    Feb 22, 2023 · Here we describe a deep-learning-based 'family-wide hallucination' approach that generates large numbers of idealized protein structures ...
  20. [20]
    Scalable protein design using optimization in a relaxed sequence ...
    Oct 24, 2024 · We report a “hallucination”-based protein design approach that functions in relaxed sequence space, enabling the efficient design of high-quality protein ...
  21. [21]
    Foundations for the Study of Structure and Function of Proteins - PMC
    Protein Structure Hierarchy. Protein structures are studied at primary, secondary, tertiary, and quaternary levels. There are tight correlations among these ...Missing: seminal | Show results with:seminal
  22. [22]
    The structure of proteins: Two hydrogen-bonded helical ... - PNAS
    The structure of proteins: Two hydrogen-bonded helical configurations of the polypeptide chain. Linus Pauling, Robert B. Corey, and H. R. ...
  23. [23]
    [PDF] Protein Structure Prediction Levinthal's Paradox The Central Dogma ...
    If each amino acid can adopt only 3 possible conformations, the total number of conformations is. 3^100 = 5 x 10^47. • Assuming it would take 10^(-13) seconds ...Missing: 130 | Show results with:130
  24. [24]
    How much of protein sequence space has been explored by life on ...
    Apr 15, 2008 · 10130) for a protein of 100 amino acids in which any of the ... Levinthal paradox' of protein folding rates (Levinthal 1969; Zwanzig et al.
  25. [25]
    Global analysis of protein folding using massively parallel design ...
    Jul 14, 2017 · Iteration between design and experiment increased the design success rate from 6% to 47%, produced stable proteins unlike those found in nature ...<|control11|><|separator|>
  26. [26]
    Deep learning–guided design of dynamic proteins - Science
    May 22, 2025 · Pioneering work to design protein conformational switches has focussed on side-chain rearrangements or large-scale hinge-like domain motions ...
  27. [27]
    Mixed, nonclassical behavior in a classic allosteric protein - PNAS
    Sep 11, 2023 · Thus, individual side-chain rotamer switching is one of the clearest structural indicators—and convenient to observe by NMR—of the allosteric ...Missing: fluctuations | Show results with:fluctuations
  28. [28]
  29. [29]
    A Fresh Look at the Normal Mode Analysis of Proteins
    Apr 2, 2024 · Normal mode anal. (NMA) is a leading method for studying long-time dynamics and elasticity of biomols. The method proceeds from complex ...Introduction · Results and Discussion · Methods · References
  30. [30]
    Direct generation of protein conformational ensembles via machine ...
    Feb 11, 2023 · We demonstrate that machine learning can be trained with simulation data to directly generate physically realistic conformational ensembles of proteins.
  31. [31]
    Protein sequence design by conformational landscape optimization
    Typical problems include lack of soluble expression, aggregation, and folding into unintended structures. Reasoning that many of these problems could be ...
  32. [32]
    Patterns in Protein Flexibility: A Comparison of NMR “Ensembles ...
    Mar 9, 2021 · Crystallographic B-factors and Molecular Dynamics (MD) simulations both provide insights into protein flexibility on an atomic scale. Nuclear ...
  33. [33]
    AlphaFold prediction of structural ensembles of disordered proteins
    Feb 14, 2025 · We introduce the AlphaFold-Metainference method to use AlphaFold-derived distances as structural restraints in molecular dynamics simulations.
  34. [34]
    The Rosetta all-atom energy function for macromolecular modeling ...
    Rosetta traditionally models the solvent surrounding the protein using the Lazaridis-Karplus (LK) model, which assumes a solvent environment made of pure water.
  35. [35]
    Improved protein structure prediction using predicted interresidue ...
    ... Rosetta energy function, we show that still more accurate models can be generated. We also explore applications of the model to the protein design problem.Missing: limitations | Show results with:limitations
  36. [36]
    Sampling of structure and sequence space of small protein folds
    Nov 22, 2022 · We developed and experimentally validated a computational platform that can design a wide variety of small protein folds while sampling shape diversity.
  37. [37]
    Rotamer libraries in the 21st century - PubMed - NIH
    Rotamer libraries are widely used in protein structure prediction, protein design, and structure refinement. As the size of the structure data base has ...
  38. [38]
    [PDF] Rotamer libraries in the 21st century Roland L Dunbrack Jr
    Rotamer libraries are widely used in protein structure prediction, protein design and structure refinement. As the size of the structure database has ...
  39. [39]
    Building protein structure-specific rotamer libraries - Oxford Academic
    Jul 13, 2023 · The smaller standard deviations around the distribution mode peaks could explain the lack of dihedral angles when picking rotamers from Dunbrack ...
  40. [40]
    An end-to-end deep learning method for protein side-chain packing ...
    May 30, 2023 · This work provides a fast and precise machine learning approach that jointly models side-chain interactions and directly predicts physically realistic packings.
  41. [41]
    [PDF] Iterative Monte Carlo Protein Design - UBC Computer Science
    Jan 19, 2005 · Given a target conformation and energy function, an exhaustive enumeration of sequence space is performed to obtain a sequence of minimum ...
  42. [42]
    ProtGPT2 is a deep unsupervised language model for protein design
    Jul 27, 2022 · A broad rule of thumb is that the total score (Rosetta Energy Units, REU) should lie between −1 and −3 per residue. We observe such ...<|separator|>
  43. [43]
    [2510.23786] Relaxed Sequence Sampling for Diverse Protein Design
    Oct 27, 2025 · We introduce Relaxed Sequence Sampling (RSS), a Markov chain Monte Carlo (MCMC) framework that integrates structural and evolutionary ...
  44. [44]
    Evolutionary-scale prediction of atomic-level protein structure with a ...
    Mar 16, 2023 · We find that the ESM-2 language model generates state-of-the-art three-dimensional (3D) structure predictions directly from the primary protein ...
  45. [45]
    [PDF] Exploring the Protein Sequence Space with Global Generative Models
    Jan 19, 2023 · In this chapter, we delve into the potential of models that are capable of generating protein sequences across the entire protein space. We ...
  46. [46]
    Improving pretrained protein language models via sequence retrieval
    Apr 8, 2025 · We introduce RAG-ESM, a retrieval-augmented framework that allows to condition pretrained ESM2 protein language models on homologous sequences.
  47. [47]
    Computational protein design as an optimization problem
    The protein design problem as defined above, with a rigid backbone, a discrete set of rotamers, and pairwise energy functions has been proven to be NP-hard [74] ...
  48. [48]
    Searching for the Pareto frontier in multi-objective protein design
    Aug 10, 2017 · Thermodynamic stability is computed using chemical models of various degrees of resolution from heuristic sequence-based scoring functions ( ...
  49. [49]
    Solving and analyzing side-chain positioning problems using linear ...
    Integer linear programming formulation: We first formulate the SCP problem as an ILP, so that a solution to the ILP gives an optimal solution to the SCP problem ...
  50. [50]
    Protein Design Using Continuous Rotamers - Research journals
    In this work we show that allowing continuous side-chain flexibility (which we call continuous rotamers) greatly improves protein flexibility modeling.
  51. [51]
    Protein design is NP-hard - PubMed
    It turns out that in the language of the computer science community, this discrete optimization problem is NP-hard. The purpose of this paper is to explain ...
  52. [52]
    Algorithms for Protein Design - PMC - NIH
    Then, integer linear programming solvers can be used to efficiently find the optimal sequence in the new model. Since this mapping is only an approximation ...
  53. [53]
    Dead-end Elimination for Multistate Protein Design - PubMed
    In this article we propose a variant of the standard DEE, called type-dependent DEE. Our method reduces the size of the conformational space of the multistate ...
  54. [54]
    Protein Design by Provable Algorithms - PMC - PubMed Central - NIH
    Protein design with provable algorithms has already had success in the design of novel enzymes and proteins with therapeutic applications. As the field matures ...
  55. [55]
  56. [56]
  57. [57]
    Dramatic performance enhancements for the FASTER optimization ...
    FASTER is a combinatorial optimization algorithm useful for finding low-energy side-chain configurations in side-chain placement and protein design calculations ...
  58. [58]
    Robust deep learning–based protein sequence design ... - Science
    Sep 15, 2022 · We describe a deep learning–based protein sequence design method, ProteinMPNN, that has outstanding performance in both in silico and experimental tests.
  59. [59]
    Illuminating protein space with a programmable generative model
    Nov 15, 2023 · Here we introduce Chroma, a generative model for proteins and protein complexes that can directly sample novel protein structures and sequences.
  60. [60]
    Reshaping Protein‐Based Nanoparticles: Innovative Artificial ...
    Jun 5, 2025 · In nanomaterials, ProteinMPNN has successfully designed tetrahedral NP backbones and protein assemblies critical for applications such as ...
  61. [61]
    De novo design of small beta barrel proteins - PNAS
    Mar 10, 2023 · Here, we explore the de novo design of small beta barrel topologies using both Rosetta energy–based methods and deep learning approaches.
  62. [62]
    De novo protein design by deep network hallucination - Nature
    Dec 1, 2021 · Here we investigate whether the information captured by such networks is sufficiently rich to generate new folded proteins with sequences ...
  63. [63]
    Design of a Novel Globular Protein Fold with Atomic-Level Accuracy
    Here, we used a general computational strategy that iterates between sequence design and structure prediction to design a 93-residue α/β protein called Top7 ...
  64. [64]
    De novo design of transmembrane β barrels - Science
    Feb 19, 2021 · De novo–designed eight-stranded transmembrane β barrels fold spontaneously and reversibly into synthetic lipid membranes. The illustration shows ...
  65. [65]
    An expandable, modular de novo protein platform for ... - PNAS
    We report the design of an expandable, modular protein platform for creating well-folded, new-to-nature proteins containing one or more redox-active heme ...
  66. [66]
    Recent advances in de novo protein design: Principles, methods ...
    The resulting scoring functions significantly improved docking success rate. TERM-based scoring. Protein design methods typically seek to find low-energy ...
  67. [67]
    Designed miniproteins potently inhibit and protect against MERS-CoV
    Jun 24, 2025 · We computationally designed monomeric and homo-oligomeric miniproteins that bind with high affinity to the MERS-CoV spike (S) glycoprotein.
  68. [68]
    De novo design of porphyrin-containing proteins as efficient and ...
    May 8, 2025 · De novo design of protein catalysts with high efficiency and stereoselectivity provides an attractive approach toward the design of environmentally benign ...
  69. [69]
    Kemp elimination catalysts by computational enzyme design - Nature
    Mar 19, 2008 · Here we describe the computational design of eight enzymes that use two different catalytic motifs to catalyse the Kemp elimination—a model ...
  70. [70]
    Automated scaffold selection for enzyme design - PubMed
    The method consists of two steps; it first identifies pairs of backbone positions in pocket-like regions. Then, it combines these to complete attachment sites ...
  71. [71]
    Computational redesign of cytochrome P450 CYP102A1 for highly ...
    Over the past 2 decades, a large repertoire of CYP102A1 variants has been identified by many laboratories around the world through directed evolution and/or ...
  72. [72]
    Directed evolution of cytochrome P450 enzymes for biocatalysis
    Mar 20, 2015 · The purpose of the present review is to illustrate the progress that has been made in altering properties of P450s such as substrate range, cofactor preference ...
  73. [73]
    Computation-Aided Engineering of Cytochrome P450 for the ...
    These calculations revealed that the P450pra variant found by directed evolution indeed was present among a set of designs optimized by Rosetta for binding of ...
  74. [74]
    Navigating the landscape of enzyme design: from molecular ...
    Jul 11, 2024 · 3.2 QM/MM method. Hybrid quantum mechanics/molecular mechanics (QM/MM) methods combine accurate QM methods to study the reactions and classical ...
  75. [75]
    De novo luciferases enable multiplexed bioluminescence imaging
    Nov 12, 2024 · We leverage AI-powered de novo protein design to create a new generation of luciferase catalysts, termed the neoLux series, which exhibit ...
  76. [76]
    Review article Generative artificial intelligence for enzyme design
    We review the recent advances in generative AI models for enzyme design, with a particular focus on those that have been validated by experiments.
  77. [77]
    Design of protein-binding proteins from the target structure alone
    Mar 24, 2022 · We demonstrate the broad applicability of this approach through the de novo design of binding proteins to 12 diverse protein targets with different shapes and ...
  78. [78]
    Hot spots in protein–protein interfaces: Towards drug discovery
    Residue based analysis can help revealing protein–protein binding mechanisms. Alterations in native protein–protein interactions may lead to several diseases.
  79. [79]
    Antibody humanization by structure-based computational protein ...
    We initially attempted to humanize the antibody using traditional CDR grafting. The CDR-grafted sequence was designed by selecting separate fully human antibody ...
  80. [80]
    Matrixed CDR grafting: A neoclassical framework for antibody ...
    This study demonstrates that modern throughput systems enable a more thorough, customizable, and systematic analysis of graft-framework combinations.
  81. [81]
    Flex ddG: Rosetta Ensemble-Based Estimation of Changes in ...
    Feb 5, 2018 · Computationally modeling changes in binding free energies upon mutation (interface ΔΔG) allows large-scale prediction and perturbation of ...
  82. [82]
    Flex ddG: Rosetta Ensemble-Based Estimation of Changes in ...
    The current state-of-the-art Rosetta ΔΔG method, ddg_monomer, has proven effective at predicting changes in stability of monomeric proteins after mutation, but ...
  83. [83]
    [PDF] Accurate Estimation of Ligand Binding Affinity Changes upon ... - MPI
    Dec 13, 2018 · We show that both the free energy calculations and Rosetta are able to quantitatively predict changes in ligand binding affinity upon protein ...
  84. [84]
    Computational Design of Affinity and Specificity at Protein ... - NIH
    ... negative design was required to create specific binders. Negative design was especially important for disfavoring homodimer formation by the designed proteins.
  85. [85]
    De novo design of picomolar SARS-CoV-2 miniprotein inhibitors
    Cao et al. designed small, stable proteins that bind tightly to the spike and block it from binding to ACE2. The best designs bind with very high affinity.
  86. [86]
    De novo designed protein inhibitors of amyloid aggregation ... - PNAS
    Aug 15, 2022 · Inhibitors iTau-N (D), i αSyn-F (E), and iAβ-H (F) are shown to maintain stable folds both computationally and experimentally. (G) To assess the ...
  87. [87]
    Development of AI-designed protein binders for detection and ...
    May 15, 2025 · AI-designed protein binders were created using RFdiffusion, validated, and used in CAR-T cells and as tetravalent quattrobinders for cancer ...
  88. [88]
    FDA Approvals in Oncology: July-September 2025 | Blog | AACR
    Oct 2, 2025 · The bispecific T-cell engager linvoseltamab-gcpt (Lynozyfic) received accelerated approval for the treatment of adult patients with relapsed or ...
  89. [89]
    FDA approved bispecific antibodies - evitria
    Approved in 2025 for relapsed or refractory multiple myeloma. Bispecific Antibody Production. Bispecific antibody process. Bispecific antibodies – what makes ...
  90. [90]
    Computational design of protein binders that boost the antitumour ...
    Oct 24, 2024 · We computationally designed protein binders for chimeric antigen receptor (CAR) constructs to target the glioblastoma-associated antigens EGFR and CD276.
  91. [91]
    Structure-based design of novel polyhedral protein nanomaterials
    Here we review the leading approaches to the design of closed polyhedral protein assemblies, highlight the importance of considering the assembly process ...
  92. [92]
    De novo amyloid proteins from designed combinatorial libraries
    Like natural amyloid, the de novo fibrils are composed of β-sheet secondary structure and bind the diagnostic dye, Congo red. Thus, binary patterning of polar ...
  93. [93]
    Amyloid Fibrils and Their Applications: Current Status and Latest ...
    Feb 7, 2025 · Plant protein self-assembly into amyloid-like fibrils is a modification introduced in emerging food and material applications.
  94. [94]
    Silk, Wool, and Beyond: AI-Driven Design of Custom Protein Fibers
    Sep 5, 2023 · Today we report in Nature Chemistry a novel approach to designing protein fibers that takes inspiration from silk, wool, and spider webs.
  95. [95]
    Silk Proteins: Designs from Nature with Multipurpose Utility and ...
    Oct 29, 2024 · Complex evolutionary pressures result in unique protein fiber designs for silks. This includes multi-domain features and novel amino acid ...
  96. [96]
    Bridging between material properties of proteins and the underlying ...
    We find the Young's modulus of proteins can be as high as 10 GPa, while the Young's modulus of protein interface regions is several times smaller. Our work ...
  97. [97]
    De novo design of pH-responsive self-assembling helical protein ...
    Apr 3, 2024 · Designed pH-responsive proteins are a promising new class of such materials with potential applications in fields such as tissue engineering ...
  98. [98]
    Reversible light-responsive protein hydrogel for on-demand cell ...
    We present the engineering of a new protein material that is capable of switching between liquid and solid state reversibly, controlled by lights of different ...
  99. [99]
    Engineering cellulases for conversion of lignocellulosic biomass
    This review addresses relevant engineering targets for cellulases, discusses a few notable cellulase engineering studies of the past decades and provides an ...
  100. [100]
    Computational Design of Membrane Proteins - PMC - PubMed Central
    Membrane associated proteins have been designed to provide a “switch” that can be used to modulate the integrity of a lipid bilayer. In particular, amphiphilic ...
  101. [101]
    Computational design of bifaceted protein nanomaterials - Nature
    Jul 31, 2025 · Computationally designed protein nanoparticles have emerged as a promising class of nanomaterials that have served as robust scaffolds for a ...
  102. [102]
    Computer-designed proteins allow for tunable hydrogels that can ...
    Jan 31, 2024 · New research led by the University of Washington demonstrates a new class of hydrogels that can form not just outside cells, but also inside of them.