Protein structure database
A protein structure database is a specialized bioinformatics resource that archives and provides open access to experimentally determined and computationally predicted three-dimensional (3D) atomic coordinates of proteins and other biological macromolecules, enabling researchers to visualize, analyze, and model molecular structures for insights into biological functions.[1] These databases primarily compile data from techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM), alongside predicted models from artificial intelligence tools, and include associated metadata like experimental conditions, resolution quality, and functional annotations.[2] The cornerstone of protein structure databases is the Protein Data Bank (PDB), established in 1971 at Brookhaven National Laboratory and now managed by an international consortium including the RCSB PDB in the United States, PDBe in Europe, and PDBj in Japan.[3] As of November 2025, the PDB archive holds 245,074 structures, the majority experimental, reflecting a steady growth from just seven entries in its inaugural year to supporting breakthroughs in fields like drug discovery and enzymology.[4] Complementing the PDB, the AlphaFold Protein Structure Database, launched in 2021 by DeepMind and EMBL-EBI, offers over 200 million AI-predicted structures covering nearly all known protein sequences in UniProt, dramatically expanding access to structural information for understudied proteins.[5] Additional databases focus on classification and annotation to organize the vast PDB data hierarchically by structural similarity and evolutionary relationships. 
For instance, the SCOPe (Structural Classification of Proteins extended) database, originally developed at the MRC Laboratory of Molecular Biology and now maintained by the University of California, Berkeley, provides manually and semi-automatically curated classifications of protein domains into classes, folds, superfamilies, and families based on structural and evolutionary criteria.[6] Similarly, the CATH (Class, Architecture, Topology, Homologous superfamily) database, maintained by University College London, employs a semi-automated approach to classify over 500,000 protein domains from the PDB into four hierarchical levels, aiding in the identification of novel folds and functional motifs.[7] Together, these resources form the backbone of structural biology, facilitating comparative analyses, homology modeling, and integrative studies that underpin advancements in biomedicine, biotechnology, and personalized medicine.[8]
Overview
Definition and Scope
A protein structure database is a specialized repository that archives three-dimensional (3D) atomic coordinates of proteins and other biological macromolecules, primarily derived from experimental techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM).[8][9] These databases organize structural data in standardized formats, enabling visualization, querying, and comparative analysis to support investigations into molecular architecture and function.[10] Increasingly, they incorporate computationally predicted structures, driven by artificial intelligence advancements that complement experimental data.[11] The scope of protein structure databases includes primary data in the form of raw experimental coordinates, secondary data with classifications and annotations derived from primary sources, and associated metadata detailing aspects like resolution, experimental parameters, and biological context such as sequence alignments or functional roles.[10][9] These repositories distinguish between those centered solely on isolated protein structures and others that encompass macromolecular complexes, including interactions with nucleic acids, small-molecule ligands, or other proteins.[8] This breadth ensures comprehensive coverage of structural diversity while maintaining data integrity through validation protocols. Protein structure databases are classified into primary, secondary, and specialized categories based on their purpose and content. 
Primary databases function as archival stores for unaltered experimental data, such as atomic coordinate files from techniques like X-ray or cryo-EM.[10] Secondary databases offer analytical layers, including hierarchical classifications of structures by fold or evolutionary relationships to aid in pattern recognition.[12] Specialized databases focus on subsets, such as those for membrane proteins or pathogen-related structures, providing tailored annotations for domain-specific research.[13]
Over time, the scope of these databases has expanded dramatically, from seven structures archived in 1971 to 245,074 experimentally determined entries as of November 2025. The post-2020 integration of AI-predicted models, exemplified by over 241 million predictions in the AlphaFold database, has raised the total to hundreds of millions and enabled proteome-wide structural insights.[14][15][16] This evolution underscores the shift from limited experimental archives to inclusive resources blending empirical and predictive data.[11]
Importance in Biology
Protein structure databases play a central role in structural biology by providing three-dimensional models that elucidate protein folding patterns, active site architectures, and molecular interactions, which are fundamental to deciphering protein function, evolutionary dynamics, and disease-associated mechanisms.[17][18] These resources enable researchers to visualize how amino acid sequences translate into functional conformations, revealing how mutations disrupt folding or binding interfaces that contribute to pathologies such as cancer or neurodegenerative disorders.[19] By archiving atomic-level details, the databases facilitate the study of evolutionary conservation, where homologous structures across species highlight adaptive changes in protein scaffolds over time.[20][21]
These databases have been instrumental in enabling key scientific discoveries, including mechanistic insights into enzyme catalysis, the thermodynamics of protein-ligand binding, and phylogenetic relationships inferred from structural homology.[22][23] For instance, comparative analyses of deposited structures have illuminated how enzymes like proteases or kinases accommodate substrates through precise pocket geometries, informing rational design of inhibitors.[24] Similarly, homology modeling based on database entries has accelerated the resolution of complex assemblies, bridging gaps in experimental data to uncover evolutionary divergences in protein families.[25]
Beyond core structural biology, protein structure databases foster broader scientific impacts by integrating with genomics to link sequence variants to functional outcomes, thereby enhancing predictions of structure-function relationships in diverse organisms.[26] This synergy accelerates advancements in virology, where structural data on viral proteins aids in understanding host-pathogen interactions, and in oncology, supporting the design of targeted therapies against mutated oncoproteins.[27][28] Open data policies ensure global accessibility, democratizing research and promoting collaborative efforts across disciplines.[2] As of November 2025, these repositories encompass 245,074 experimentally determined structures alongside more than 241 million predicted models, achieving coverage of approximately 58% of human proteome residues with confident predictions.[15][16][29]
History
Early Foundations
The determination of the first three-dimensional protein structure, that of myoglobin by John Kendrew in 1958 using X-ray crystallography, marked a pivotal advancement in structural biology.[30] Prior to the establishment of dedicated databases, such structures were disseminated primarily through scientific publications and physical models, limiting accessibility and hindering comparative analyses as the number of solved structures grew in the 1960s.[19] By the mid-1960s, crystallographers in the United States and Europe recognized the pressing need for a centralized repository to archive and share atomic coordinate data, driven by the increasing volume of experimental results from X-ray diffraction studies.[19]
The Protein Data Bank (PDB) emerged as the pioneering solution, announced on October 20, 1971, in Nature New Biology as a collaborative initiative between Brookhaven National Laboratory in the United States and the Cambridge Crystallographic Data Centre in the United Kingdom.[31] Founded under the leadership of Walter Hamilton with key contributions from Edgar Meyer and Helen Berman, the PDB began operations at Brookhaven with just seven initial X-ray crystallographic structures of proteins and nucleic acids, stored on punched cards and magnetic tapes for manual deposition and distribution via mail.[3] Meyer also developed the SEARCH program in 1971, which gave researchers the first means of retrieving database entries remotely and analyzing protein structures offline. This grassroots effort, spearheaded by a small team of US and UK crystallographers, emphasized open access and community-driven contributions to foster broader research collaboration.[32] By 1980, the PDB still held fewer than 100 structures, all derived from X-ray crystallography, the predominant experimental technique of the era.
Early growth was supported by informal networks among structural biologists, who deposited data voluntarily despite the absence of formal policies.[32] However, the initiative faced significant hurdles, including limited computational resources that restricted data processing and visualization, reliance on manual deposition methods prone to errors, and the lack of standardized formats for coordinate files, which complicated integration and validation.[14] These challenges underscored the nascent stage of digital infrastructure in the 1970s and 1980s, yet the PDB's establishment laid the groundwork for systematic archiving in structural biology.[19]
Expansion and Key Milestones
The expansion of protein structure databases from the 1990s onward was marked by rapid growth in deposited structures, driven by advances in experimental techniques and computational tools. By 1993, the Protein Data Bank (PDB) contained 1,000 structures, primarily determined by X-ray crystallography. This number surged to over 10,000 by 1999, reflecting increased accessibility of structural biology methods and mandatory deposition policies in journals. Further acceleration occurred in the 2000s and 2010s, with the archive reaching 100,000 entries by 2014 and exceeding 240,000 experimental structures by 2025, underscoring the databases' role as indispensable resources for global research.[32][33][15] A significant surge in cryo-electron microscopy (cryo-EM) structures followed the "resolution revolution" in the 2010s, enabled by improvements in detector technology and image processing algorithms that routinely achieved near-atomic resolution. Prior to 2010, cryo-EM contributions were minimal, but by 2025, over 30,000 cryo-EM-derived structures comprised about 12% of the PDB archive, complementing traditional methods like X-ray and NMR for studying large macromolecular complexes. This diversification expanded the scope of accessible protein architectures, particularly for membrane proteins and dynamic assemblies previously challenging to crystallize.[34][35] Internationalization efforts culminated in the formation of the Worldwide Protein Data Bank (wwPDB) in 2003, uniting the RCSB PDB (USA), Protein Data Bank in Europe (PDBe, UK), and Protein Data Bank Japan (PDBj) to ensure a single, unified global archive. This distributed management model facilitated standardized data deposition, validation, and dissemination worldwide, reducing redundancy and enhancing accessibility for international researchers. 
The Biological Magnetic Resonance Bank (BMRB) joined as a full partner in 2006, integrating NMR-specific data like chemical shifts, while the Electron Microscopy Data Bank (EMDB) became an associate member in 2021 to support cryo-EM map archiving.[3][32][36]
Technological advancements in the 1990s and 2000s transformed database usability and quality control. Web-based interfaces emerged early, with the release of AutoDep in 1996 as the first web tool for PDB deposition, followed by the RCSB PDB's comprehensive portal in 1998, which enabled user-friendly searching, visualization, and downloading. In the 2000s, integration of validation tools like MolProbity, introduced in 2007, became standard within wwPDB workflows by the early 2010s, providing all-atom clashscore and Ramachandran analyses to improve deposited model accuracy. These developments democratized access and elevated data reliability, supporting broader applications in structural biology.[3][37][38]
The AI revolution accelerated expansion through initiatives like the Critical Assessment of Structure Prediction (CASP) competitions, launched in 1994 to benchmark computational prediction methods biennially. CASP fostered iterative improvements in modeling accuracy, culminating in the 2021 release of AlphaFold 2, which achieved unprecedented prediction precision for diverse proteins. This led to the AlphaFold Protein Structure Database, initially releasing over 360,000 predicted models in 2021 and expanding to more than 200 million by 2022, integrated alongside experimental data in hybrid archives like those managed by EMBL-EBI. This shift augmented traditional databases, providing structural coverage for the majority of known proteomes and enabling hypothesis-driven research where experimental determination remains resource-intensive.[39][40]
Primary Databases
Protein Data Bank (PDB)
The Protein Data Bank (PDB) serves as the single global archive for three-dimensional structural data of biological macromolecules, established in 1971 as a repository for experimentally determined atomic coordinates.[41] Managed by the Worldwide Protein Data Bank (wwPDB) consortium, it stores atomic coordinates, electron density maps, and associated metadata for proteins, nucleic acids, and their complexes, ensuring free and public access to the global scientific community.[42] The archive emphasizes experimentally validated structures derived from techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM), excluding predicted models to maintain data integrity.[43] As of November 2025, the PDB contains roughly 245,000 entries, reflecting steady growth driven by advances in structural biology methods.[44][45] Approximately 81% of these structures are determined by X-ray crystallography, 6% by NMR, and 12% by cryo-EM, with the remainder from hybrid or other techniques; entries often include details on bound ligands, site-directed mutations, and experimental conditions to support downstream analyses.[4] Each entry is accompanied by validation reports generated using tools like the wwPDB Validation Pipeline, which assess geometric quality, stereochemistry, and consistency with experimental data to aid users in interpreting structural reliability.
The wwPDB oversees deposition through the unified OneDep system, introduced in 2014 to streamline submission, biocuration, and validation across partner sites, replacing earlier tools like ADIT for more efficient processing of coordinates, maps, and metadata.[46] New entries are released weekly following rigorous annotation by the wwPDB partners (the Research Collaboratory for Structural Bioinformatics (RCSB) in the United States, the Protein Data Bank in Europe (PDBe), the Protein Data Bank Japan (PDBj), and the Biological Magnetic Resonance Bank (BMRB)), ensuring consistent global standards and interoperability.[42] Distinctive aspects of the PDB include its programmatic accessibility via libraries such as Biopython, which enable automated parsing of PDB files for coordinates and metadata in computational workflows, and a strict focus on experimental evidence to distinguish it from predictive databases. This emphasis on validation and archival stability has made the PDB a foundational resource for structural biology, with OneDep facilitating joint submissions that integrate atomic models with complementary data like NMR restraints or EM maps.[47]
AlphaFold Protein Structure Database
The AlphaFold Protein Structure Database (AlphaFold DB) is a comprehensive open repository of computationally predicted three-dimensional protein structures, launched in July 2021 through a collaboration between Google DeepMind and the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI). It contains more than 241 million structure predictions, encompassing nearly all protein sequences catalogued in the UniProt database, thereby providing unprecedented structural coverage for the protein universe.[48][16] These predictions address the longstanding challenge of determining structures for the vast majority of proteins that remain experimentally uncharacterized, particularly those difficult to crystallize. The database's predictions are generated using the AlphaFold 2 deep learning system, which achieved top performance in the 2020 Critical Assessment of Structure Prediction (CASP14) competition by leveraging multiple sequence alignments and evolutionary relationships to model atomic-level protein folds with high accuracy.[11][29] Each model includes a per-residue confidence score, the predicted Local Distance Difference Test (pLDDT), ranging from 0 to 100, where scores above 90 indicate very high reliability comparable to experimental structures, scores of 70-90 suggest confident predictions, and lower values highlight regions of potential uncertainty such as disordered loops.[11] The database focuses on single polypeptide chain (monomer) predictions, so multi-chain complexes and larger cellular machineries fall outside its scope.
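The pLDDT bands described above lend themselves to a simple per-residue summary. The following is a minimal sketch rather than code from AlphaFold DB: the function names are hypothetical, the three bands follow the thresholds quoted in this section, and treating a score of exactly 90 as "confident" rather than "very high" is an assumption made here for definiteness.

```python
from collections import Counter

# Band labels follow the thresholds quoted in the text; the names are illustrative.
BANDS = ("very high", "confident", "low")


def plddt_band(score: float) -> str:
    """Map one per-residue pLDDT score (0-100) to a confidence band."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"pLDDT must lie in [0, 100], got {score}")
    if score > 90:
        return "very high"
    if score >= 70:  # assumption: boundary value 90 falls in this band
        return "confident"
    return "low"


def band_fractions(scores: list[float]) -> dict[str, float]:
    """Return the fraction of residues that fall into each band."""
    if not scores:
        raise ValueError("at least one score is required")
    counts = Counter(plddt_band(s) for s in scores)
    return {band: counts[band] / len(scores) for band in BANDS}
```

A summary of this kind can help decide whether a model is suitable for a given analysis, for example by excluding low-confidence stretches before comparing a predicted structure with an experimental one.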
Since its inception, the database has undergone periodic updates to enhance coverage and utility, including expansions in 2022 to incorporate additional proteomes and a September 2025 synchronization with UniProt release 2025_03, which added isoform predictions and made multiple sequence alignments (MSAs) available.[49][16] Protein complexes are not part of the database, but users can predict them separately with open-source tools such as AlphaFold-Multimer. Synchronization with sequence databases is further supported through initiatives like AlphaSync, introduced in 2025, which automatically updates predictions to reflect revisions in UniProt entries, including new isoforms and sequences, ensuring the resource remains current with ongoing genomic discoveries.[50][51] Unlike repositories of experimentally derived structures, such as the Protein Data Bank (PDB), AlphaFold DB focuses exclusively on AI-generated monomer models, complementing empirical data by filling structural gaps for unstudied proteins. Access is fully open, with structures available for interactive viewing on the web interface, bulk downloads in PDB format, and programmatic retrieval via APIs, fostering widespread use in research. This dynamic maintenance, combined with the scale of predictions, positions AlphaFold DB as a transformative tool that bridges the divide between protein sequences and their functional three-dimensional architectures.
Secondary and Specialized Databases
Structural Classification Databases
Structural classification databases organize protein structures from the Protein Data Bank (PDB) into hierarchical schemes based on structural similarity, topology, and evolutionary relationships, enabling researchers to identify homologous families, superfamilies, and novel folds for functional and evolutionary analysis. These databases facilitate the grouping of domains by shared architectural features, such as secondary structure arrangements, while distinguishing between structural convergence and divergence due to homology. By providing a framework for comparing thousands of structures, they support tasks like fold recognition, protein function prediction, and benchmarking structure prediction algorithms. The Structural Classification of Proteins (SCOP) database, first released in 1994, employs a manually curated hierarchy emphasizing fold-level similarities to delineate evolutionary relationships. Its classification levels include class (based on secondary structure content, e.g., all-alpha or alpha/beta), fold (overall topology), superfamily (common evolutionary origin inferred from structure and function), and family (close sequence and structural similarity). SCOPe, an extended version developed since 2011, automates much of the classification process while maintaining manual oversight to classify newer PDB entries and correct inconsistencies; as of release 2.08 in 2023, it encompasses approximately 345,000 domains from over 100,000 PDB structures across about 1,500 folds. Users can browse hierarchies interactively, query by fold or superfamily, and access direct links to corresponding PDB entries for detailed visualization. SCOP and SCOPe have been instrumental in establishing benchmarks for fold prediction methods, with their fold definitions serving as gold standards in evaluations of early structure prediction tools. 
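The four-level SCOP hierarchy can be illustrated with a short parsing sketch. It assumes the compact dotted identifiers used by SCOP and SCOPe (for example "a.1.1.2" for class a, fold 1, superfamily 1, family 2), a notation this article does not itself describe; the function names are hypothetical and the class table lists only a few illustrative entries.

```python
from typing import NamedTuple, Optional

# A few class letters for illustration; the table is deliberately incomplete.
SCOP_CLASSES = {"a": "all alpha", "b": "all beta", "c": "alpha/beta", "d": "alpha+beta"}

LEVELS = ("class", "fold", "superfamily", "family")


class Sccs(NamedTuple):
    scop_class: str
    fold: int
    superfamily: int
    family: int


def parse_sccs(text: str) -> Sccs:
    """Split a dotted identifier such as 'a.1.1.2' into its four levels."""
    parts = text.strip().split(".")
    if len(parts) != 4 or len(parts[0]) != 1 or not parts[0].isalpha():
        raise ValueError(f"not a four-level identifier: {text!r}")
    return Sccs(parts[0], int(parts[1]), int(parts[2]), int(parts[3]))


def deepest_shared_level(a: str, b: str) -> Optional[str]:
    """Return the most specific level two domains share, or None if even the class differs."""
    shared = None
    for level, x, y in zip(LEVELS, parse_sccs(a), parse_sccs(b)):
        if x != y:
            break
        shared = level
    return shared
```

Two domains that agree down to the superfamily level but differ in family are, under the hierarchy described above, inferred to share a common evolutionary origin without close sequence similarity, which is the kind of relationship fold-recognition benchmarks are designed to test.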
The Class, Architecture, Topology, and Homologous superfamily (CATH) database, initiated in 1995, complements SCOP with a semi-automated classification that integrates both structural and sequence data across four main levels: class (secondary structure composition), architecture (gross orientation of secondary structures, independent of connectivity), topology (fold or shape including connectivity), and homologous superfamily (inferred evolutionary relationships). Recent releases, such as version 4.4 updated in early 2025, classify over 500,000 domains from more than 150,000 experimental PDB structures, spanning around 2,000 folds and 6,500 superfamilies, with significant expansion driven by automated domain parsing tools. CATH features web-based hierarchical browsing, advanced search interfaces for architecture or topology, and integrations with sequence databases for functional annotations, alongside links to PDB for structure downloads. Unlike purely manual systems, CATH's hybrid approach allows rapid updates and has been widely used in studies of protein evolution and domain architecture diversity.
Since 2022, classification resources have begun to accommodate predicted structures, enhancing coverage of uncharted protein space. For instance, CATH's 2024 update via the CATH-AlphaFlow pipeline integrated high-confidence AlphaFold models, adding nearly 200 novel folds and expanding the total structural repertoire by over 180-fold compared to prior experimental-only versions. This inclusion aids in identifying potential evolutionary links in underrepresented superfamilies and supports machine learning applications for variant interpretation. SCOPe, while primarily focused on experimental data, provides extensible frameworks that researchers adapt for predicted model benchmarking, ensuring these resources remain vital for structural biology amid rapid advances in prediction accuracy.
Domain and Niche Repositories
Domain-focused repositories emphasize the modular units of proteins, known as domains, which are evolutionarily conserved regions often responsible for specific functions. Pfam, established in 1997, is a foundational database that curates protein domain families using hidden Markov models (HMMs) derived from multiple sequence alignments incorporating both sequence and structural data. By 2025, Pfam encompasses over 20,000 families, enabling the annotation of functional domains across proteomes and supporting searches for domain architectures in novel sequences.[52] These models facilitate the identification of distant homologs and highlight evolutionary conservation, with annotations often linking domains to biological roles such as enzymatic activity or binding specificity.
InterPro complements Pfam by integrating signatures from multiple specialized resources, including Pfam, PROSITE, SMART, and others, to provide a unified view of protein domains, families, and functional sites.[53] Launched in 1999, InterPro merges overlapping predictions from its 13 member databases into hierarchical entries that describe domain boundaries, motifs, and post-translational modification sites, aiding in comprehensive functional inference. Tools within InterPro, such as InterProScan, allow users to scan sequences against these integrated signatures, revealing domain combinations that inform protein evolution and interactions, with recent enhancements incorporating structural predictions for improved accuracy.[54]
Niche repositories target specific biological contexts, curating structures and annotations for underrepresented protein classes.
The Membrane Protein Structure Database (mpstruc), initiated in 1999, manually curates transmembrane protein structures from the Protein Data Bank (PDB), classifying over 1,700 unique entries by function, topology, and lipid interactions to support studies in membrane biology.[13] Similarly, Viro3D, released in 2025, compiles AlphaFold2- and ESMFold-predicted structures for more than 85,000 proteins from over 4,400 viruses, focusing on vertebrate and invertebrate hosts to map viral evolution, functional motifs, and therapeutic targets in virology.[55] PDBsum offers pictorial summaries of PDB entries, generating 2D schematic diagrams of 3D interactions between proteins, ligands, DNA, and metals, which visualize binding sites, secondary structures, and domain interfaces for rapid analysis.[56]
These repositories have expanded through the integration of predicted structures from AlphaFold, refining domain boundaries in Pfam and enhancing signature predictions in InterPro to cover previously uncharacterized regions.[54] Such advancements enable domain searching tools that align user queries against conserved sites, fostering applications in specialized fields like membrane transport and viral pathogenesis while referencing broader classifications from resources like SCOP and CATH for contextual fold hierarchies.[52]
Data Management and Access
File Formats and Standards
The Protein Data Bank (PDB) format, introduced in the 1970s, remains a foundational text-based standard for storing atomic coordinates of protein structures. It uses fixed-width columns in records such as ATOM for standard residues and HETATM for non-standard atoms or ligands, specifying details like atom identifiers, residue names, chain IDs, and orthogonal coordinates (x, y, z). This legacy format, formalized in an 80-column layout by 1976, has supported the exchange of structural data across tools but imposes limitations due to its rigid, punched-card-era design, which restricts handling of large assemblies, complex metadata, and non-standard characters.[57] To address these constraints, the macromolecular Crystallographic Information File (mmCIF), developed in 1997, provides a modern, relational alternative based on the Crystallographic Information Framework. As PDBx/mmCIF, it organizes data into hierarchical categories and loops with key-value pairs, enabling relationships between entities (e.g., linking polymer sequences to sources) and accommodating extensive metadata such as experimental conditions, validation statistics, and citations. This extensible, machine-readable format supports structures of any size without fixed-width restrictions, making it ideal for contemporary protein data. In 2019, the Worldwide Protein Data Bank (wwPDB) mandated PDBx/mmCIF submissions for all new crystallographic depositions to enhance data quality, interoperability, and archival efficiency, while continuing best-efforts support for legacy PDB files.[58][59] Complementary formats include BinaryCIF, a compressed binary encoding of mmCIF data that achieves up to 10-fold size reduction for large structures through techniques like delta encoding and run-length compression, thereby improving parsing speed and storage for high-throughput analyses. 
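The two compression techniques named above, delta encoding and run-length compression, can be sketched on an integer column such as atom serial numbers. This is an illustration of the general idea only: it does not reproduce the BinaryCIF byte layout or its actual encoder chain, and all function names are hypothetical.

```python
from itertools import accumulate, groupby


def delta_encode(values):
    """Store the first value, then each value as the difference from its predecessor."""
    values = list(values)
    return [cur - prev for prev, cur in zip([0] + values[:-1], values)]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    return list(accumulate(deltas))


def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Invert rle_encode by expanding each pair back into a run."""
    return [value for value, count in pairs for _ in range(count)]


# Atom serial numbers 1..1000 become a long run of +1 steps after delta encoding,
# which run-length encoding then collapses to a single pair.
atom_ids = list(range(1, 1001))
packed = rle_encode(delta_encode(atom_ids))
restored = delta_decode(rle_decode(packed))
```

Columns that increase steadily (serial numbers) or repeat for many rows (residue numbers, chain labels) dominate coordinate tables, which is why chaining these two simple steps can shrink them so effectively before any general-purpose compression is applied.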
PDBML, an XML-based representation derived from the PDBx/mmCIF dictionary, facilitates programmatic access and web services by structuring data in tagged elements, such as <PDBx:atom_site> for coordinates, and is available in space-efficient variants. Polymer sequences from structure entries are also commonly exchanged in FASTA format, which uses one-letter amino acid codes to represent each chain and enables direct comparison with reference sequences from databases like UniProt.[60][61][62]
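The link between fixed-width coordinate records and one-letter sequences can be made concrete with a small sketch. It assumes the standard legacy PDB column positions for ATOM records (residue name in columns 18-20, chain in column 22, residue number in columns 23-26, coordinates in columns 31-54); the helper names, the simplified atom-name alignment in the sample lines, and the ">ID_chain" header style are illustrative choices, not a prescribed convention.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def format_atom(serial, name, resname, chain, resseq, x, y, z):
    """Build a sample ATOM line on the fixed-width grid (atom-name alignment simplified)."""
    return (f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom(line):
    """Slice one ATOM record by column position (0-based slices of 1-based columns)."""
    return {
        "name": line[12:16].strip(),
        "resname": line[17:20].strip(),
        "chain": line[21],
        "resseq": int(line[22:26]),
        "icode": line[26].strip(),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


def chain_sequences(lines):
    """Collect a one-letter sequence per chain, counting each residue once."""
    seen, sequences = set(), {}
    for line in lines:
        if not line.startswith("ATOM  "):
            continue
        atom = parse_atom(line)
        key = (atom["chain"], atom["resseq"], atom["icode"])
        if key in seen:
            continue
        seen.add(key)
        # Unknown or modified residues are written as X.
        sequences.setdefault(atom["chain"], []).append(
            THREE_TO_ONE.get(atom["resname"], "X"))
    return {chain: "".join(seq) for chain, seq in sequences.items()}


def to_fasta(entry_id, sequences):
    """Emit one FASTA record per chain; the header style is an illustrative choice."""
    return "\n".join(f">{entry_id}_{chain}\n{seq}"
                     for chain, seq in sorted(sequences.items()))


SAMPLE = [
    format_atom(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614),
    format_atom(2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842),
    format_atom(3, "N", "LYS", "A", 2, 24.447, 26.021, 4.346),
    format_atom(4, "N", "GLY", "B", 1, 10.000, 11.500, 12.250),
]
```

A sequence recovered this way contains only residues that were actually modeled, so it can be shorter than the full deposited sequence wherever disordered regions were not resolved.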
These standards rely on the wwPDB Chemical Component Dictionary for validation, which defines over 20,000 residues and ligands with standardized nomenclature, stereochemistry, idealized coordinates, and SMILES notations to verify chemical accuracy and consistency in deposited structures, ensuring interoperability across visualization, modeling, and analysis tools.[63]