Protein structure database

A protein structure database is a specialized bioinformatics resource that archives and provides access to experimentally determined and computationally predicted three-dimensional (3D) atomic coordinates of proteins and other biological macromolecules, enabling researchers to visualize, analyze, and model molecular structures for insights into biological function. These databases primarily compile data from techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM), alongside predicted models from computational tools, and include associated metadata such as experimental conditions, resolution, and functional annotations.

The cornerstone of protein structure databases is the Protein Data Bank (PDB), established in 1971 at Brookhaven National Laboratory and now managed by an international consortium including the RCSB PDB in the United States, PDBe in Europe, and PDBj in Japan. As of November 2025, the PDB archive holds 245,074 experimentally determined structures, reflecting steady growth from just seven entries in its inaugural year. Complementing the PDB, the AlphaFold Protein Structure Database, launched in 2021 by DeepMind and EMBL-EBI, offers over 200 million AI-predicted structures covering nearly all known protein sequences in UniProt, dramatically expanding access to structural information for understudied proteins.

Additional databases focus on classification and annotation, organizing the PDB data hierarchically by structural similarity and evolutionary relationships. For instance, the Structural Classification of Proteins (SCOP) database provides manually and semi-automatically curated classifications of protein domains into classes, folds, superfamilies, and families based on structural and evolutionary criteria. Similarly, the CATH (Class, Architecture, Topology, Homologous superfamily) database, maintained by University College London, uses a semi-automated approach to classify over 500,000 protein domains from the PDB into four hierarchical levels, aiding the identification of novel folds and functional motifs. Together, these resources form the backbone of structural bioinformatics, supporting comparative analyses and integrative studies in both basic and applied research.

Overview

Definition and Scope

A protein structure database is a specialized repository that archives three-dimensional (3D) atomic coordinates of proteins and other biological macromolecules, primarily derived from experimental techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM). These databases organize structural data in standardized formats, enabling visualization, querying, and comparative analysis to support investigations into molecular architecture and function. Increasingly, they also incorporate computationally predicted structures that complement experimental data.

The scope of protein structure databases includes primary data in the form of experimental coordinates, secondary data with classifications and annotations derived from primary sources, and associated metadata covering resolution, experimental parameters, and biological context such as sequence alignments or functional roles. Some repositories cover only isolated protein structures, while others also encompass macromolecular complexes, including interactions with nucleic acids, small-molecule ligands, or other proteins. This breadth gives comprehensive coverage of structural diversity, with data integrity maintained through validation protocols.

Protein structure databases are classified into primary, secondary, and specialized categories based on their purpose and content. Primary databases function as archival stores for unaltered experimental data, such as atomic coordinate files from X-ray crystallography or cryo-EM. Secondary databases add analytical layers, including hierarchical classifications of structures by fold or evolutionary relationship to aid comparative analysis. Specialized databases focus on subsets, such as membrane proteins or pathogen-related structures, and provide annotations tailored to domain-specific research.

Over time, the scope of these databases has expanded dramatically, from a handful of structures archived in the early 1970s to 245,074 experimentally determined entries as of November 2025. The post-2020 integration of AI-predicted models, exemplified by over 241 million predictions in the AlphaFold database, has raised the total to hundreds of millions and enabled proteome-wide structural insights. This evolution marks a shift from limited experimental archives to inclusive resources that blend empirical and predictive data.

Importance in Biology

Protein structure databases play a central role in molecular biology by providing three-dimensional models that elucidate folding patterns, domain architectures, and molecular interactions, which are fundamental to deciphering protein function, evolutionary dynamics, and disease-associated mechanisms. These resources enable researchers to visualize how amino acid sequences translate into functional conformations, and how mutations disrupt folding or binding interfaces in ways that contribute to pathologies such as cancer or neurodegenerative disorders. By archiving atomic-level details, the databases also support the study of evolutionary conservation, where homologous structures across species highlight adaptive changes in protein scaffolds over time.

These databases have been instrumental in key scientific discoveries, including mechanistic insights into enzyme catalysis, the thermodynamics of protein-ligand binding, and phylogenetic relationships inferred from structural comparison. For instance, comparative analyses of deposited structures have shown how enzymes such as proteases or kinases accommodate substrates through precise pocket geometries, informing the rational design of inhibitors. Similarly, modeling based on database entries has accelerated the resolution of complex assemblies, filling gaps in experimental data and revealing evolutionary divergence within protein families.

Beyond structural biology itself, protein structure databases integrate with genomic data to link sequence variants to functional outcomes, improving predictions of structure-function relationships in diverse organisms. This synergy supports virology, where structural data on viral proteins aids the understanding of host-pathogen interactions, and oncology, where it supports the design of targeted therapies against mutated oncoproteins. Open-access policies ensure global accessibility and promote collaboration across disciplines.

As of November 2025, these repositories encompass 245,074 experimentally determined structures alongside more than 241 million predicted models, with confident predictions covering approximately 58% of residues in the human proteome.

History

Early Foundations

The determination of the first three-dimensional protein structure, that of myoglobin by John Kendrew in 1958 using X-ray crystallography, marked a pivotal advance in structural biology. Before dedicated databases existed, such structures were disseminated primarily through scientific publications and physical models, which limited accessibility and hindered comparative analysis as the number of solved structures grew in the 1960s. By the mid-1960s, crystallographers in the United States and the United Kingdom recognized the pressing need for a centralized repository to archive and share atomic coordinate data, driven by the increasing volume of results from X-ray diffraction studies.

The Protein Data Bank (PDB) emerged as the pioneering solution, announced on October 20, 1971, in Nature New Biology as a collaborative initiative between Brookhaven National Laboratory in the United States and the Cambridge Crystallographic Data Centre in the United Kingdom. Founded under the leadership of Walter Hamilton with key contributions from Edgar Meyer and Helen Berman, the PDB began operations at Brookhaven with just seven X-ray crystallographic structures of proteins and nucleic acids, stored on punched cards and magnetic tapes, with manual deposition and distribution by mail. Meyer also developed the SEARCH program in 1971, enabling the first remote access to the database for offline analysis of protein structures. This grassroots effort, led by a small team of US and UK crystallographers, emphasized open access and community-driven contributions.

By 1980, the PDB still held fewer than 100 structures, all derived from X-ray crystallography, the predominant experimental technique of the era. Early growth was supported by informal networks among crystallographers, who deposited data voluntarily in the absence of formal policies. The initiative nonetheless faced significant hurdles: limited computational resources restricted data processing and visualization, manual deposition was prone to errors, and the lack of standardized formats for coordinate files complicated integration and validation. These challenges reflected the nascent state of digital infrastructure in the 1970s and 1980s, yet the PDB's establishment laid the groundwork for systematic archiving in structural biology.

Expansion and Key Milestones

The expansion of protein structure databases from the 1990s onward was marked by rapid growth in deposited structures, driven by advances in experimental techniques and computational tools. By 1993, the Protein Data Bank (PDB) contained 1,000 structures, primarily determined by X-ray crystallography. This number surged to over 10,000 by 1999, reflecting the wider availability of experimental methods and mandatory deposition policies in journals. Growth accelerated further in the 2000s and 2010s, with the archive reaching 100,000 entries by 2014 and exceeding 240,000 experimental structures by 2025.

A significant surge in cryo-electron microscopy (cryo-EM) structures followed the "resolution revolution" of the 2010s, enabled by improvements in detector technology and image processing algorithms that routinely achieved near-atomic resolution. Before 2010, cryo-EM contributions were minimal, but by 2025 over 30,000 cryo-EM-derived structures made up about 12% of the PDB archive, complementing traditional methods such as X-ray crystallography and NMR for studying large macromolecular complexes. This diversification expanded the range of accessible protein architectures, particularly for membrane proteins and dynamic assemblies that had been difficult to crystallize.

Internationalization efforts culminated in the formation of the Worldwide Protein Data Bank (wwPDB) in 2003, uniting the RCSB PDB (USA), Protein Data Bank in Europe (PDBe, UK), and Protein Data Bank Japan (PDBj) to ensure a single, unified global archive. This distributed management model standardized data deposition, validation, and dissemination worldwide, reducing redundancy and improving accessibility for international researchers. The Biological Magnetic Resonance Bank (BMRB) joined as a full partner in 2006, integrating NMR-specific data such as chemical shifts, while the Electron Microscopy Data Bank (EMDB) became an associate member in 2021 to support cryo-EM map archiving.

Technological advances in the 1990s and 2000s transformed database usability and data quality. Web-based interfaces emerged early, with the release of AutoDep in 1996 as the first web tool for PDB deposition, followed by the RCSB PDB's comprehensive portal in 1998, which enabled user-friendly searching, browsing, and downloading. Validation tools such as MolProbity, introduced in 2007, became standard within wwPDB workflows by the early 2010s, providing all-atom clashscore and Ramachandran analyses to improve the accuracy of deposited models. These developments widened access and raised data reliability.

The AI revolution accelerated expansion through initiatives such as the Critical Assessment of Structure Prediction (CASP) competitions, launched in 1994 to benchmark computational prediction methods every two years. CASP fostered iterative improvements in modeling accuracy, culminating in the 2021 release of AlphaFold 2, which achieved unprecedented prediction accuracy for diverse proteins. This led to the AlphaFold Protein Structure Database, which initially released over 360,000 predicted models in 2021 and expanded to more than 200 million by 2022, with predictions made available alongside experimental data through resources managed by EMBL-EBI. This shift augmented traditional databases, providing structural coverage for the majority of known proteomes and enabling hypothesis-driven research where experimental determination remains resource-intensive.

Primary Databases

Protein Data Bank (PDB)

The Protein Data Bank (PDB) serves as the single global archive for three-dimensional structural data of biological macromolecules, established in 1971 as a repository for experimentally determined atomic coordinates. Managed by the Worldwide Protein Data Bank (wwPDB) consortium, it stores atomic coordinates, experimental maps, and associated metadata for proteins, nucleic acids, and their complexes, with free public access for the global scientific community. The archive holds experimentally determined structures derived from techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM), and excludes predicted models to maintain data integrity.

As of November 2025, the PDB contains roughly 245,000 entries, reflecting steady growth driven by advances in experimental methods. Approximately 81% of these structures were determined by X-ray crystallography, 6% by NMR, and 12% by cryo-EM, with the remainder from hybrid or other techniques; entries often include details on bound ligands, site-directed mutations, and experimental conditions to support downstream analyses. Each entry is accompanied by a validation report generated by the wwPDB Validation Pipeline, which assesses geometric quality and consistency with the experimental data to help users judge structural reliability.

The wwPDB oversees deposition through the unified OneDep system, introduced in 2014 to streamline submission, biocuration, and validation across partner sites, replacing earlier tools such as ADIT. New entries are released weekly following annotation by wwPDB partners: the Research Collaboratory for Structural Bioinformatics (RCSB) in the United States, the Protein Data Bank in Europe (PDBe), the Protein Data Bank Japan (PDBj), and the Biological Magnetic Resonance Bank (BMRB). This arrangement ensures consistent global standards and interoperability.

Distinctive aspects of the PDB include its programmatic accessibility through open-source parsing libraries, which enable automated extraction of coordinates and metadata from PDB files in computational workflows, and a strict focus on experimental evidence that distinguishes it from predictive databases. This emphasis on validation and archival stability has made the PDB a foundational resource for structural biology, with OneDep supporting joint submissions that combine atomic models with complementary data such as NMR restraints or EM maps.

AlphaFold Protein Structure Database

The AlphaFold Protein Structure Database (AlphaFold DB) is a comprehensive open repository of computationally predicted three-dimensional protein structures, launched in July 2021 through a collaboration between DeepMind and the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI). It contains 241 million structure predictions, encompassing nearly all protein sequences catalogued in the UniProt database and providing unprecedented structural coverage of the protein universe. These predictions address the longstanding problem that the vast majority of proteins remain experimentally uncharacterized, particularly those that are difficult to crystallize.

The database's predictions are generated using the AlphaFold 2 deep learning system, which achieved top performance in the 2020 Critical Assessment of Structure Prediction (CASP14) competition by using multiple sequence alignments and evolutionary relationships to model protein folds at atomic level with high accuracy. Each model includes a per-residue confidence score, the predicted Local Distance Difference Test (pLDDT), ranging from 0 to 100: scores above 90 indicate very high reliability comparable to experimental structures, scores of 70-90 indicate confident predictions, and lower values flag regions of potential uncertainty such as disordered loops. The database focuses on single polypeptide chain (monomer) predictions, so multi-chain cellular machineries are not modeled as assemblies; protein complexes can be predicted separately using open-source tools.

Since its inception, the database has undergone periodic updates to improve coverage and utility, including expansions in 2022 to incorporate additional proteomes and a September 2025 synchronization with UniProt release 2025_03, which added isoform predictions and made multiple sequence alignments (MSAs) available. Synchronization with sequence databases is further supported through initiatives such as AlphaSync, introduced in 2025, which automatically updates predictions to reflect revisions in UniProt entries, including new isoforms and sequences, so that the resource stays current with ongoing genomic discoveries.

Unlike repositories of experimentally derived structures, such as the Protein Data Bank (PDB), AlphaFold DB focuses exclusively on AI-generated monomer models, complementing empirical data by filling structural gaps for unstudied proteins. Access is fully open, with structures available for interactive viewing in the web interface, bulk download in PDB and mmCIF formats, and programmatic retrieval via APIs. This ongoing maintenance, combined with the scale of the predictions, positions AlphaFold DB as a bridge between protein sequences and their functional three-dimensional architectures.
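
The pLDDT bands described in this section can be turned into a simple per-residue summary. The sketch below is illustrative only: the function names are hypothetical, the "very high" and "confident" cut-offs follow the text, and the extra split at 50 between "low" and "very low" is an assumption borrowed from the colour bands commonly shown for AlphaFold DB entries. AlphaFold model files carry pLDDT in the B-factor field of each atom record, so real scores could be read from there.

```python
from collections import Counter

BANDS = ("very high", "confident", "low", "very low")

def plddt_band(score):
    """Map a per-residue pLDDT score (0-100) to a confidence band."""
    if not 0 <= score <= 100:
        raise ValueError("pLDDT must lie between 0 and 100")
    if score > 90:
        return "very high"
    if score >= 70:
        return "confident"
    if score >= 50:  # assumed cut-off, not stated in the text above
        return "low"
    return "very low"

def band_fractions(scores):
    """Return the fraction of residues that fall in each band."""
    if not scores:
        raise ValueError("at least one score is required")
    counts = Counter(plddt_band(s) for s in scores)
    return {band: counts.get(band, 0) / len(scores) for band in BANDS}

# Hypothetical per-residue scores for a short chain.
example_scores = [96.2, 91.0, 88.5, 73.1, 64.0, 41.7, 35.2, 92.4]
print(band_fractions(example_scores))
```

A summary like this is one way to decide which regions of a predicted model are reliable enough for downstream analysis and which are likely disordered or poorly modeled.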

Secondary and Specialized Databases

Structural Classification Databases

Structural classification databases organize protein structures from the Protein Data Bank (PDB) into hierarchical schemes based on structural similarity, topology, and evolutionary relationships, enabling researchers to identify homologous families, superfamilies, and novel folds for functional and evolutionary analysis. These databases group domains by shared architectural features, such as secondary structure arrangements, while distinguishing structural convergence from evolutionary divergence. By providing a framework for comparing thousands of structures, they support tasks such as fold recognition, protein function prediction, and benchmarking of structure prediction algorithms.

The Structural Classification of Proteins (SCOP) database, first released in 1994, uses a manually curated hierarchy that emphasizes fold-level similarities to delineate evolutionary relationships. Its classification levels are class (based on secondary structure content, e.g., all-alpha or all-beta), fold (overall topology), superfamily (common evolutionary origin inferred from structure and function), and family (close sequence and structural similarity). SCOPe, an extended version developed since 2011, automates much of the process while retaining manual oversight to classify newer PDB entries and correct inconsistencies; as of release 2.08 in 2023, it covers approximately 345,000 domains from over 100,000 PDB structures across about 1,500 folds. Users can browse hierarchies interactively, query by fold or superfamily, and follow direct links to the corresponding PDB entries for detailed visualization. SCOP and SCOPe have been instrumental in establishing benchmarks for fold prediction methods, with their fold definitions serving as gold standards in evaluations of early prediction tools.

The Class, Architecture, Topology, and Homologous superfamily (CATH) database, initiated in 1995, complements SCOP with a semi-automated approach that integrates both structural and sequence information across four main levels: class (secondary structure composition), architecture (gross orientation of secondary structures, independent of connectivity), topology (fold or shape, including connectivity), and homologous superfamily (inferred evolutionary relationships). Recent releases, such as version 4.4 updated in early 2025, classify over 500,000 domains from more than 150,000 experimental PDB structures, spanning around 2,000 folds and 6,500 superfamilies, with significant expansion driven by automated domain parsing tools. CATH features web-based hierarchical browsing, advanced search interfaces, and integration with external databases for functional annotation, alongside links to the PDB for structure downloads. Unlike purely manual systems, CATH's hybrid approach allows rapid updates, and it has been widely used in studies of protein evolution and domain diversity.

Since 2022, these classification schemes have also been extended to predicted structures, improving coverage of uncharted protein space. For instance, CATH's 2024 update via the CATH-AlphaFlow pipeline integrated high-confidence AlphaFold models, adding nearly 200 novel folds and expanding the total structural repertoire by over 180-fold compared to prior experimental-only versions. This inclusion helps identify potential evolutionary links in underrepresented superfamilies and supports applications such as variant interpretation. SCOPe, while primarily focused on experimental data, provides an extensible framework that researchers adapt for classifying predicted models, so these resources remain relevant to structural bioinformatics amid rapid advances in prediction accuracy.
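
The four CATH levels described above map naturally onto a dotted four-number identifier of the form C.A.T.H. The sketch below is a minimal illustration, with hypothetical function names and example codes chosen only for demonstration; it splits such an identifier into its levels and reports the deepest level that two domains share.

```python
LEVELS = ("class", "architecture", "topology", "homologous superfamily")

def parse_cath_code(code):
    """Split a dotted C.A.T.H identifier into its four named levels."""
    parts = code.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated integers, got {code!r}")
    return dict(zip(LEVELS, (int(p) for p in parts)))

def deepest_shared_level(code_a, code_b):
    """Return the most specific level at which two codes agree, or None."""
    a = parse_cath_code(code_a)
    b = parse_cath_code(code_b)
    shared = None
    for level in LEVELS:
        # The hierarchy is nested, so stop at the first mismatch.
        if a[level] != b[level]:
            break
        shared = level
    return shared

# Example identifiers used purely for illustration.
print(parse_cath_code("3.40.50.300"))
print(deepest_shared_level("3.40.50.300", "3.40.50.720"))  # same topology
print(deepest_shared_level("3.40.50.300", "1.10.8.10"))    # nothing shared
```

Because the hierarchy is nested, two domains that share a topology necessarily share class and architecture as well, which is why the comparison can stop at the first differing level.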

Domain and Niche Repositories

Domain-focused repositories emphasize the modular units of proteins, known as domains, which are evolutionarily conserved regions often responsible for specific functions. Pfam, established in 1997, is a foundational database that curates protein families using hidden Markov models (HMMs) derived from multiple sequence alignments incorporating both sequence and structural data. By 2025, Pfam encompasses over 20,000 families, enabling the annotation of functional domains across proteomes and supporting searches for domain architectures in novel sequences. These models help identify distant homologs and highlight evolutionary conservation, with annotations often linking domains to biological roles such as enzymatic activity or binding specificity.

InterPro complements Pfam by integrating signatures from multiple specialized resources, including Pfam and others, to provide a unified view of protein domains, families, and functional sites. Launched in 1999, InterPro merges overlapping predictions from its 13 member databases into hierarchical entries that describe domain boundaries, motifs, and sites, aiding comprehensive functional inference. Tools within InterPro, such as InterProScan, allow users to scan sequences against these integrated signatures, revealing domain combinations that inform protein function and interactions, with recent enhancements incorporating structural predictions for improved accuracy.

Niche repositories target specific biological contexts, curating structures and annotations for underrepresented protein classes. The Membrane Protein Structure Database (mpstruc), initiated in 1999, manually curates transmembrane protein structures from the Protein Data Bank (PDB), classifying over 1,700 unique entries by function, topology, and lipid interactions to support studies in membrane biology. Similarly, Viro3D, released in 2025, compiles AlphaFold2- and ESMFold-predicted structures for more than 85,000 proteins from over 4,400 viruses, focusing on vertebrate and invertebrate hosts to map viral evolution, functional motifs, and therapeutic targets in virology. PDBsum offers pictorial summaries of PDB entries, generating 2D schematic diagrams of 3D interactions between proteins, ligands, DNA, and metals, which visualize binding sites, secondary structures, and domain interfaces for rapid analysis.

These repositories have expanded through the integration of predicted structures from AlphaFold, refining domain boundaries in Pfam and improving signature predictions in InterPro to cover previously uncharacterized regions. Such advances enable domain searching tools that align user queries against conserved sites, supporting applications in specialized fields while drawing on broader classifications from resources such as SCOP and CATH for contextual fold hierarchies.

Data Management and Access

File Formats and Standards

The Protein Data Bank (PDB) format, introduced in the 1970s, remains a foundational text-based standard for storing atomic coordinates of protein structures. It uses fixed-width columns in records such as ATOM for standard residues and HETATM for non-standard atoms or ligands, specifying details such as atom identifiers, residue names, chain IDs, and orthogonal coordinates (x, y, z). This legacy format, formalized in an 80-column layout by 1976, has supported the exchange of structural data across tools, but its rigid, punched-card-era design restricts the handling of large assemblies, complex chemical components, and non-standard characters.

To address these constraints, the macromolecular Crystallographic Information File (mmCIF), developed in 1997, provides a modern, relational alternative based on the Crystallographic Information Framework. As PDBx/mmCIF, it organizes data into hierarchical categories and loops of key-value pairs, enabling relationships between entities (e.g., linking sequences to sources) and accommodating extensive metadata such as experimental conditions, validation statistics, and citations. This extensible, machine-readable format supports structures of any size without fixed-width restrictions. In 2019, the Worldwide Protein Data Bank (wwPDB) mandated PDBx/mmCIF submissions for all new crystallographic depositions to improve data quality, interoperability, and archival efficiency, while continuing best-efforts support for legacy PDB files.

Complementary formats include BinaryCIF, a compressed binary encoding of mmCIF that achieves up to 10-fold size reduction for large structures through techniques such as delta and run-length encoding, thereby improving parsing speed and storage for high-throughput analyses. PDBML, an XML-based representation derived from the PDBx/mmCIF dictionary, supports programmatic access and web services by structuring data in tagged elements, such as <atomSite> for coordinates, and is available in space-efficient variants. Protein files often integrate sequence data in FASTA format, using one-letter codes to represent chains alongside 3D coordinates, which allows direct comparison with reference sequences from databases such as UniProt.

These standards rely on the wwPDB Chemical Component Dictionary for validation. The dictionary defines over 20,000 residues and ligands with standardized nomenclature, idealized coordinates, and SMILES notations, which are used to verify chemical accuracy and consistency in deposited structures and to ensure interoperability across visualization, modeling, and analysis tools.
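
The fixed-width layout of the legacy PDB format means that fields are read by column position rather than by splitting on whitespace. The following sketch shows one way to slice an ATOM or HETATM record; the function name and the sample atom are hypothetical, and the column ranges are those of the published wwPDB coordinate record layout (converted to zero-based Python slices), so they should be checked against the format guide before use in production code.

```python
def parse_atom_record(line):
    """Slice one fixed-width ATOM/HETATM line into named fields."""
    record = line[0:6].strip()
    if record not in ("ATOM", "HETATM"):
        return None  # other record types are ignored in this sketch
    return {
        "record": record,
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],             # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
        "occupancy": float(line[54:60]),  # columns 55-60
        "b_factor": float(line[60:66]),   # columns 61-66
        "element": line[76:78].strip(),   # columns 77-78
    }

# Build a made-up sample record with explicit field widths so the
# column layout is visible and guaranteed to be aligned.
SAMPLE_LINE = (
    f"{'ATOM':<6}{1:>5} {' N':<4}{'':1}{'MET':>3} {'A':1}{1:>4}{'':1}   "
    f"{27.340:>8.3f}{24.430:>8.3f}{2.614:>8.3f}{1.00:>6.2f}{9.67:>6.2f}"
    f"{'':10}{'N':>2}"
)

print(SAMPLE_LINE)
print(parse_atom_record(SAMPLE_LINE))
```

Slicing by column is what makes the format fragile for large structures: a serial number above 99,999 or a chain identifier longer than one character no longer fits its field, which is one reason PDBx/mmCIF replaced fixed columns with named key-value items.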

Querying and Retrieval Tools

Protein structure databases provide a variety of querying interfaces to their extensive archives of experimental and predicted structures. The RCSB Protein Data Bank (RCSB PDB) offers an advanced search system that lets users construct complex queries with a graphical query builder, combining conditions such as ligand presence, resolution thresholds (e.g., structures resolved below 2.0 Å), experimental method, and source organism. Its text-based attribute search supports full-text queries across fields such as structure titles and publication details, allowing rapid retrieval of relevant entries from over 245,000 experimental structures as of November 2025. Similarly, the Protein Data Bank in Europe (PDBe) integrates search capabilities through its knowledge base, enabling federated queries that aggregate data from multiple sources, including annotations on function and interactions.

For sequence and structure similarity searches, tools such as BLAST and Dali are commonly integrated or linked within database portals. The Basic Local Alignment Search Tool (BLAST), developed by NCBI, performs protein sequence similarity searches against database sequences, identifying homologous structures with statistical significance scores (E-values). In contrast, the Dali server specializes in structure comparison, aligning query structures against the PDB archive to detect fold similarities using distance-matrix-based algorithms, often revealing remote homologs that are not evident from sequence alone. These tools accept uploads of custom coordinates or sequences and return ranked lists of matches with alignment visualizations.

Programmatic access is provided through RESTful APIs and query languages tailored to each database. The RCSB PDB provides a search API that accepts HTTP GET/POST requests with JSON-formatted queries on metadata, such as molecule names or sequences, and returns identifiers and summaries for further processing. PDBj Mine 2 offers endpoints for querying its database, allowing semantic searches across RDF representations of PDB data, including chemical components. The PDBe aggregated API provides federated access to integrated structural and functional annotations from diverse resources, enabling complex graph-based queries on protein entities. For the AlphaFold Protein Structure Database, an API provides metadata retrieval for over 200 million predicted models, with download endpoints for structures in mmCIF format keyed by UniProt accession. Scripts, such as those in the open-source AlphaFold repository, automate bulk data setup and retrieval.

Visualization tools are embedded in database interfaces to aid immediate exploration of retrieved structures. The Mol* viewer, a web-based molecular viewer, is integrated into the RCSB PDB, PDBe, and AlphaFold DB portals, supporting interactive rendering of proteins, ligands, and maps in browsers without plugins, including features for superposition and trajectory playback. Batch downloads are also supported: RCSB PDB hosts archives accessible over HTTPS or through direct links, replacing legacy FTP, while AlphaFold DB offers organism-specific tarballs for large-scale retrieval. Recent enhancements, such as RCSB's 2024 text search improvements, have increased query accuracy and speed for natural-language-like inputs.
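
To make the programmatic access described in this section more concrete, the sketch below builds, but does not send, a JSON query for structures below a resolution threshold determined by a given method, and a metadata URL for one predicted model. The endpoint URLs, attribute names, and operators are assumptions based on the public RCSB Search API and AlphaFold DB API documentation at the time of writing, and the helper function names are hypothetical; all of them should be verified against the current documentation before use.

```python
import json

# Assumed endpoints; verify against the current API documentation.
RCSB_SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"
ALPHAFOLD_API_URL = "https://alphafold.ebi.ac.uk/api/prediction/{accession}"

def terminal(attribute, operator, value):
    """One attribute condition in the (assumed) RCSB text-search schema."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {"attribute": attribute, "operator": operator, "value": value},
    }

def build_entry_query(max_resolution, method):
    """Combine a resolution cut-off and an experimental method with AND."""
    return {
        "query": {
            "type": "group",
            "logical_operator": "and",
            "nodes": [
                # Attribute names below are assumptions, not confirmed by the text.
                terminal("rcsb_entry_info.resolution_combined", "less", max_resolution),
                terminal("exptl.method", "exact_match", method),
            ],
        },
        "return_type": "entry",
    }

def alphafold_prediction_url(accession):
    """Metadata URL for a predicted model, keyed by UniProt accession."""
    return ALPHAFOLD_API_URL.format(accession=accession.strip().upper())

payload = build_entry_query(2.0, "X-RAY DIFFRACTION")
print(json.dumps(payload, indent=2))       # body that would be POSTed
print(alphafold_prediction_url("p69905"))  # example accession
```

In practice the payload would be sent as the body of a POST request to the search endpoint, and the returned entry identifiers would then be used to download coordinate files.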

Applications

In Structural Biology and Drug Design

Protein structure databases play a pivotal role in by providing atomic-level models essential for and molecular docking simulations, which allow researchers to predict protein-ligand interactions and infer functions of uncharacterized proteins based on known structures. For instance, the rapid deposition of spike protein structures in the (PDB) in 2020 facilitated the design of mRNA vaccines by revealing the receptor-binding domain's interaction with ACE2, enabling targeted stabilization of the prefusion conformation. These databases support iterative refinement of structural hypotheses, accelerating the understanding of protein dynamics in biological processes. In , protein structure databases enable of compound libraries against known protein-ligand complexes in the PDB, identifying potential hits for further optimization, as demonstrated in the development of kinase inhibitors where structural data guides modifications to enhance binding affinity and selectivity. The Protein Structure Database has further expanded this capability by providing predicted structures for previously undruggable proteins lacking experimental data, aiding target identification through de novo pocket and allosteric exploration. Structure-based lead optimization leverages these resources to refine inhibitors, reducing off-target effects and improving pharmacokinetic properties. Notable case studies underscore the impact of these databases: the 1990s development of inhibitors, such as , relied on early PDB structures to perform structure-based design, leading to the first FDA-approved antiretroviral that transformed treatment from palliative to manageable. Applications in precision medicine have advanced through structure-guided design informed by databases. Integration with docking tools exemplifies practical utility; software like utilizes PDB structures as rigid or flexible receptors to simulate ligand binding, supporting high-throughput in early-stage pipelines. 
This synergy has contributed to structural coverage for 88% of the new drugs approved by the US FDA from 2010 to 2016, with more recent analyses showing 100% coverage for anti-cancer small-molecule drugs approved from 2019 to 2023, validating the databases' foundational role in pharmaceutical innovation.
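
As a rough illustration of how a deposited protein-ligand complex can be inspected before docking or lead optimization, the sketch below reads fixed-column PDB coordinate records and lists the protein residues that lie within a distance cutoff of any ligand atom. The coordinates are made up for the example, the helper names (`format_atom`, `parse_atom`, `pocket_residues`) are hypothetical, and the 4.0 Å cutoff is an assumed convention rather than a value taken from any particular tool.

```python
import math


def format_atom(record, serial, name, resname, chain, resseq, x, y, z):
    """Write one coordinate line in the legacy fixed-column PDB layout."""
    return (f"{record:<6}{serial:>5} {name:^4} {resname:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{20.0:6.2f}")


def parse_atom(line):
    """Read the fields needed here from an ATOM/HETATM line by column position."""
    return {
        "record": line[0:6].strip(),
        "resname": line[17:20].strip(),
        "chain": line[21],
        "resseq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


def pocket_residues(pdb_lines, cutoff=4.0):
    """Return sorted (chain, residue number, residue name) within cutoff of a ligand atom.

    Assumption: every HETATM record is treated as ligand; real files also
    store waters and ions as HETATM, which would need filtering.
    """
    atoms = [parse_atom(l) for l in pdb_lines if l.startswith(("ATOM", "HETATM"))]
    protein = [a for a in atoms if a["record"] == "ATOM"]
    ligand = [a for a in atoms if a["record"] == "HETATM"]
    hits = set()
    for p in protein:
        if any(math.dist(p["xyz"], l["xyz"]) <= cutoff for l in ligand):
            hits.add((p["chain"], p["resseq"], p["resname"]))
    return sorted(hits)


# Toy complex with invented coordinates (not a real PDB entry).
TOY_COMPLEX = [
    format_atom("ATOM", 1, "CA", "GLY", "A", 10, 0.0, 0.0, 0.0),
    format_atom("ATOM", 2, "CA", "SER", "A", 11, 3.8, 0.0, 0.0),
    format_atom("ATOM", 3, "CA", "LEU", "A", 12, 12.0, 0.0, 0.0),
    format_atom("HETATM", 4, "C1", "LIG", "A", 101, 2.0, 2.0, 0.0),
]

print(pocket_residues(TOY_COMPLEX))
```

In this toy case the glycine and serine lie under 3 Å from the ligand atom and are reported, while the leucine at roughly 10 Å is not. A production workflow would normally rely on an established structure parser instead of hand-written column slicing.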

In Protein Engineering and Prediction

Protein structure databases play a pivotal role in protein engineering by providing a repository of experimentally validated scaffolds that guide directed evolution experiments toward stable and functional variants. Researchers mine these databases, such as the Protein Data Bank (PDB), to identify compact, robust frameworks that can tolerate mutations while maintaining fold integrity, thereby accelerating the evolution of novel binding proteins or enzymes. For instance, a systematic analysis of the PDB revealed a 45-amino-acid scaffold capable of evolving high-affinity ligands with nanomolar dissociation constants, demonstrating how database-derived templates enhance the efficiency of library design in directed evolution.

In de novo protein design, tools like Rosetta leverage PDB templates to generate entirely new structures without natural homologs, enabling the creation of custom folds for specific functions. The Rosetta blueprint builder protocol assembles protein backbones from short fragments extracted from PDB entries, followed by sequence optimization to stabilize the designed fold, as exemplified in the development of novel β-barrel proteins. This approach has produced hyperstable proteins that fold independently and serve as modular scaffolds for multi-enzyme assemblies in synthetic pathways.

These databases also enhance protein structure prediction by serving as primary training datasets for artificial intelligence models, with AlphaFold relying on PDB structures released before 2018 to learn evolutionary and structural patterns. Supervised learning on the experimental structures in the PDB allows models to generalize to novel sequences, achieving median backbone RMSDs below 1 Å for many targets during benchmarking. In the Critical Assessment of Structure Prediction (CASP) experiments, predictors are evaluated against experimental structures that have not yet been publicly released, using database folds as references to quantify improvements in accuracy, such as AlphaFold's top performance in CASP14 with GDT_TS scores exceeding 90 for easy targets.
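
The two accuracy measures mentioned above, RMSD and GDT_TS, can be sketched in a few lines. This is a simplified illustration: it assumes the model and reference Cα coordinates are already superposed and matched residue by residue, whereas real GDT_TS calculations search over many superpositions to maximize each distance-threshold count. The function names and the toy coordinates are invented for the example.

```python
import math


def rmsd(model, reference):
    """Root-mean-square deviation between paired, pre-superposed coordinates (in Å)."""
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(model, reference))
    return math.sqrt(squared / len(model))


def gdt_ts(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS: mean percentage of residues within 1, 2, 4 and 8 Å.

    Assumption: a single fixed superposition is used, so this gives a
    lower bound on the value a full GDT search would report.
    """
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    distances = [math.dist(a, b) for a, b in zip(model, reference)]
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy Cα traces: the model deviates by 0.5, 1.5, 3.0 and 9.0 Å along y.
REFERENCE = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
MODEL = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]

print(f"RMSD   = {rmsd(MODEL, REFERENCE):.2f} Å")
print(f"GDT_TS = {gdt_ts(MODEL, REFERENCE):.2f}")
```

The toy example shows why the two scores are reported together: a single residue displaced by 9 Å dominates the RMSD (about 4.8 Å), while GDT_TS (56.25) still credits the three residues that are placed within 4 Å.
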
In biotechnology applications, databases facilitate enzyme redesign for biofuel production by informing mutations that optimize catalytic efficiency and substrate specificity. Directed evolution of hydrocarbon-producing enzymes, guided by PDB-derived structural insights, has yielded variants with up to 10-fold increased activity toward lignocellulosic feedstocks, supporting scalable processes. As of 2025, trends in multimer prediction emphasize modeling protein complexes for engineering multi-subunit assemblies, with advancements like AlphaFold-Multimer enabling accurate quaternary structure predictions that inform designs for metabolic pathways in microbes.

An iterative feedback loop further amplifies these capabilities, as predicted structures from tools such as AlphaFold are deposited into archives such as ModelArchive, enriching the pool of models available for subsequent training and validation cycles. Depositing computational models together with their confidence scores allows the community to assess and refine them, fostering continuous improvement in prediction accuracy for engineered proteins.

Challenges and Future Directions

Data Validation and Quality Control

Data validation and quality control in protein structure databases are essential to maintain the integrity of archived entries, ensuring they accurately represent experimental or predicted macromolecular structures. The Worldwide Protein Data Bank (wwPDB) generates standardized validation reports for deposited structures, which include metrics such as the clashscore (a measure of steric overlaps between atoms, expressed as clashes per 1000 atoms) and Ramachandran plots that assess the distribution of backbone phi-psi dihedral angles against expected values for non-glycine, non-proline residues. These reports help identify outliers and guide refinements before final release.

MolProbity serves as a key tool for in-depth geometry validation, performing all-atom analyses of covalent bond lengths, angles, and torsion angles in proteins and nucleic acids, and is often integrated into wwPDB workflows and refinement software like Phenix. It flags deviations from ideal geometry, such as unusual side-chain rotamers or backbone conformations, enabling depositors to correct local errors.

Several quantitative metrics underpin structure quality assessments. For X-ray crystallography, resolution below 2 Å indicates high-quality data with atomic-level detail, while the R-free value, calculated from a withheld test set of reflections, quantifies model agreement with experimental data, with values under 25% typical for well-refined structures. B-factors (temperature factors) estimate positional uncertainty or thermal motion, with lower values (e.g., <30 Ų) suggesting reliable positioning in rigid regions. For computationally predicted structures, the predicted local distance difference test (pLDDT) score, ranging from 0 to 100, gauges per-residue confidence, where scores above 90 denote very high accuracy.

Deposition processes incorporate pre-submission checks via the wwPDB OneDep system, which automates preliminary validation reports for depositors to review and revise entries, reducing errors in coordinates, ligands, and metadata.
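
Two of the metrics described above reduce to simple arithmetic, sketched here under stated assumptions. The clashscore follows the definition given in the text (clashes per 1000 atoms). The pLDDT band labels and boundaries follow the colour bands commonly shown for AlphaFold models (above 90, 70 to 90, 50 to 70, below 50); only the "above 90" band is taken from this article, and the function names are hypothetical.

```python
from collections import Counter


def clashscore(n_clashes, n_atoms):
    """Number of serious steric clashes per 1000 atoms."""
    if n_atoms <= 0:
        raise ValueError("n_atoms must be positive")
    return 1000.0 * n_clashes / n_atoms


def plddt_band(score):
    """Map a per-residue pLDDT score (0-100) to a confidence band.

    Assumption: band boundaries follow the usual AlphaFold colour scheme.
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError("pLDDT must lie between 0 and 100")
    if score > 90.0:
        return "very high"
    if score > 70.0:
        return "confident"
    if score > 50.0:
        return "low"
    return "very low"


def summarize_plddt(scores):
    """Count residues in each confidence band for one predicted model."""
    return dict(Counter(plddt_band(s) for s in scores))


# Illustrative values only, not taken from any real entry.
print(clashscore(n_clashes=12, n_atoms=4000))
print(summarize_plddt([95.2, 91.0, 82.0, 55.0, 30.0]))
```

A per-band summary of this kind is a quick way to judge whether a predicted model is usable as a whole or only in its well-ordered regions.
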
Community oversight by wwPDB partner sites involves expert biocuration and periodic re-examination of archives to uphold standards. In response to identified issues, such as misfolds or modeling artifacts, affected entries are obsoleted following journal retractions or formal investigations, although errors remain present in a notable fraction of historical deposits.

By 2025, validation standards for cryo-electron microscopy (cryo-EM) structures have been strengthened, incorporating tools like EMRinger, which evaluates map-model fit by sampling map density around side-chain Cγ positions as the χ1 angle is rotated, aiding detection of over- or under-fitting at resolutions of roughly 2–4 Å. Additionally, AI-assisted approaches, such as models trained on structural embeddings, enable automated anomaly detection in protein folds by identifying outliers in residue-level geometries or global topologies within databases.

Integration and Emerging Technologies

Efforts to integrate protein structure databases with other biological data resources have advanced significantly, enabling a more holistic understanding of protein function and evolution. The Protein Data Bank in Europe Knowledge Base (PDBe-KB), launched in 2019, exemplifies this by federating structural data from the Protein Data Bank (PDB) with sequence information from UniProt and functional annotations from the Gene Ontology (GO). This integration facilitates residue-level mappings through initiatives like Structure Integration with Function, Taxonomy and Sequences (SIFTS), allowing researchers to correlate 3D structures with evolutionary, functional, and genomic contexts. Similarly, AlphaFold predictions have been linked to genomic resources, such as Ensembl, where over 200 million predicted structures are accessible alongside variant effect predictors like AlphaMissense, aiding in the interpretation of genetic variants' structural impacts.

Emerging technologies are addressing the challenges of managing vast datasets and enhancing prediction accuracy. Cloud-based platforms, such as CloudProteoAnalyzer, enable scalable processing of large datasets, including structure predictions, by leveraging cloud resources for storage and analysis without local infrastructure demands. In AI-driven refinements, AlphaFold 3, released in 2024, extends capabilities to model protein-ligand complexes with high fidelity, predicting interactions involving small molecules and ions to support drug discovery. Blockchain technologies are being explored for ensuring data provenance, with frameworks proposed to maintain immutable records of depositions and revisions, enhancing trust in collaborative research.

Looking ahead, protein structure databases aim for near-complete proteome coverage, with AlphaFold already providing predictions for over 214 million sequences, and projections indicating that experimental and predicted structures could together cover more than 90% of the human proteome by 2030 through ongoing expansions.
Real-time synchronization tools like AlphaSync, developed in 2025, keep predicted models in step with the latest sequence entries, maintaining a database of 2.6 million structures for timely access. Ethical considerations in AI applications emphasize responsible use, including transparency in model training and mitigation of misuse in designing novel proteins, to align predictions with societal benefits. A key challenge in these advancements is the phylogenetic bias in training data: models perform better on structures from well-represented organisms such as humans, may underperform for underrepresented taxa, and therefore require more diverse training data.
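
The residue-level mapping that SIFTS provides between structure and sequence numbering can be pictured as a lookup over aligned segments. The sketch below is only an illustration of that idea: the `Segment` class, the `pdb_to_uniprot` function, and the segment boundaries are all hypothetical, and it assumes plain integer residue numbers, whereas real mappings must also handle insertion codes and unobserved residues.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Segment:
    """One contiguous stretch where structure and sequence numbering run in parallel."""
    pdb_start: int
    pdb_end: int
    unp_start: int
    unp_end: int


def pdb_to_uniprot(resnum: int, segments: Sequence[Segment]) -> Optional[int]:
    """Translate a PDB residue number to a UniProt position, or None if unmapped."""
    for seg in segments:
        if seg.pdb_start <= resnum <= seg.pdb_end:
            return seg.unp_start + (resnum - seg.pdb_start)
    return None


# Invented mapping for one chain: two segments separated by a gap in the sequence.
SEGMENTS = [
    Segment(pdb_start=1, pdb_end=100, unp_start=25, unp_end=124),
    Segment(pdb_start=101, pdb_end=150, unp_start=130, unp_end=179),
]

print(pdb_to_uniprot(10, SEGMENTS))
print(pdb_to_uniprot(120, SEGMENTS))
print(pdb_to_uniprot(200, SEGMENTS))
```

Once a structural residue is expressed in sequence coordinates, annotations keyed to the sequence, such as variants or functional sites, can be projected onto the 3D model.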
