CATH database

The CATH (Class, Architecture, Topology, Homology) database is a free, publicly available online resource that hierarchically classifies protein domain structures from the Protein Data Bank (PDB) to elucidate their evolutionary and functional relationships. Developed in the mid-1990s by Christine Orengo and colleagues at University College London (UCL), it employs a semi-automated process combining sequence alignments, structural comparisons via the SSAP algorithm, and manual curation to organize domains into four main levels: Class (C), based on secondary structure composition (e.g., mainly alpha or beta); Architecture (A), describing the spatial arrangement of secondary structures; Topology (T) or fold, reflecting connectivity patterns; and Homology (H), grouping evolutionarily related domains supported by sequence and structural evidence. The database's purpose is to facilitate the analysis of protein evolution, function prediction, and genome annotation by identifying structural similarities and divergences across superfamilies and folds.

Initially launched to address the growing influx of PDB entries (then over 5,000 structures, with monthly additions), it has evolved into a comprehensive resource integrating experimental PDB data with predicted structures from the AlphaFold Protein Structure Database, expanding its scope dramatically. As of version 4.4 (released October 2024), CATH encompasses approximately 600,000 PDB domains alongside over 90 million predicted domains from The Encyclopedia of Domains (TED) resource, resulting in 6,573 homologous superfamilies, 2,078 folds, and 77 architectures, including 479 novel hypothetical folds identified through predicted models. Recognized as a Global Core Biodata Resource by the Global Biodata Coalition, CATH also incorporates functional annotations via Functional Families (FunFams) and links to sequence-based resources like Gene3D, enhancing its utility for structural genomics. Annual CATH-Plus releases provide enriched datasets with ligand binding and disease associations, while daily updates ensure alignment with the latest PDB depositions.

Overview

Purpose and Scope

The CATH database, standing for Class (C), Architecture (A), Topology (T), and Homologous superfamily (H), is a hierarchical classification system for protein domain structures derived primarily from the Protein Data Bank (PDB). Its core purpose is to organize these domains into evolutionary groups to elucidate folding patterns, structural similarities, and functional relationships among proteins, enabling researchers to infer ancestry and divergence from common origins. The scope of CATH encompasses domains from experimentally determined structures in the PDB as well as predicted models from resources like the AlphaFold Protein Structure Database, integrated through The Encyclopedia of Domains (TED). In its latest release (v4.4), CATH classifies over 601,000 domains from PDB entries into 6,573 superfamilies, emphasizing structural similarity over sequence similarity alone to capture distant evolutionary connections that sequence-based methods might miss.

This classification supports applications in bioinformatics, including protein function prediction by annotating domains with functional sites and residues, the identification of conserved structural motifs for targeting, and evolutionary studies that trace superfamily expansions across genomes. By grouping domains into superfamilies based on shared ancestry, CATH provides a framework for comparative analysis that reveals how protein folds evolve and adapt.

History and Development

The CATH database was established in 1993 at University College London (UCL) by Christine Orengo and colleagues, including Janet Thornton, as a manual curation effort to classify protein structures from the Protein Data Bank (PDB). This initial work involved applying the Sequential Structure Alignment Program (SSAP) algorithm to compare around 1,400 protein structures and identify structural similarities indicative of evolutionary relationships. The database was publicly launched with web access in 1995, marking the beginning of its role as a comprehensive public resource.

During the 1990s, CATH grew alongside the rapid expansion of the PDB, with early releases emphasizing manual domain assignment and curation to define the hierarchical levels of class, architecture, topology, and homologous superfamily. By the late 1990s, it had classified approximately 3,000 PDB structures into about 1,000 groups and several superfolds, providing insights into protein fold space.

The 2000s introduced automated methods to address curation challenges, such as recognizing domains in complex multi-domain proteins; a key advancement was the 2007 integration of a faster structural comparison algorithm, which performed comparisons up to 1,000 times faster than SSAP while maintaining accuracy. This semi-automated approach enabled handling larger datasets and transitioned CATH from purely manual efforts to a resource capable of scaling with incoming structural data.

A pivotal milestone came in 2024 with the release of version 4.4, which dramatically expanded coverage by incorporating predicted structures from the AlphaFold Protein Structure Database, resulting in a 180-fold increase in classified domains to over 90 million, including 64,844 new experimental domains from the PDB and 90 million predicted ones. This integration, facilitated by tools like the Domain Boundary Parser and semi-automated assignment pipelines, revealed nearly 200 new folds and enhanced superfamily annotations.
CATH is hosted and maintained by UCL's Orengo group within the Department of Structural and Molecular Biology, with funding from UK research councils such as the Biotechnology and Biological Sciences Research Council (BBSRC) and Medical Research Council (MRC), as well as international bodies. The latest update to version 4.4 occurred in January 2025, continuing the database's evolution to manage the influx of predicted structural data while preserving manual oversight for quality.

Classification System

Hierarchical Levels

The CATH (Class, Architecture, Topology, Homologous superfamily) database employs a four-tier system to organize protein domains based on their structural and evolutionary relationships, progressing from broad compositional features to specific evolutionary groupings. This hierarchy enables systematic comparison of protein folds, with domains assigned unique numeric codes reflecting their position at each level, such as 3.30.70.990, where the four dot-separated numbers correspond to Class.Architecture.Topology.Homologous superfamily.

At the Class (C) level, domains are grouped primarily by their secondary structure composition and packing, dividing them into four main categories: mainly alpha (Class 1), mainly beta (Class 2), alpha and beta (Class 3, encompassing both alternating alpha/beta and alpha+beta arrangements), and few secondary structures (Class 4, with low alpha and beta content). This initial classification, inspired by early analyses of globular proteins, captures gross differences in helical and sheet content without considering connectivity. For example, all-alpha proteins like myoglobin fall into Class 1 due to their predominance of alpha helices.

The Architecture (A) level describes the overall three-dimensional shape and relative orientation of secondary structure elements, independent of their strand or helix connectivity. Architectures are identified through approaches such as graph-based representations of secondary structure packing, resulting in descriptive categories such as orthogonal alpha bundles, beta barrels, or three-layer sandwiches. This level highlights gross spatial arrangements, like the beta barrel architecture in porins, which accommodates beta strands in a cylindrical form regardless of specific linkages.

The Topology (T) level, also known as the fold family, focuses on the intramolecular connections between secondary structure elements and the specific fold type, grouping domains with similar core topologies.
Assignments rely on sequence-structure alignments and structural superimpositions to ensure conserved connectivity patterns, exemplified by the Rossmann fold (common in nucleotide-binding domains) or by alpha/beta folds found in many enzymes. This tier emphasizes functional implications of fold variations while maintaining architectural consistency.

Finally, the Homologous Superfamily (H) level clusters domains within a topology that share evidence of common evolutionary ancestry, typically through significant sequence similarity (e.g., ≥35% sequence identity) or structural similarity assessed via metrics like the SSAP score (≥70-80 with ≥60% overlap). These groupings often incorporate functional annotations to link structure with biological roles, such as the homologous superfamilies within Rossmann folds that bind diverse cofactors. The hierarchy thus flows from general structural descriptors to evolutionarily related families, facilitating evolutionary and functional studies.
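The four-part code described above lends itself to a small parsing routine. The following is a minimal sketch, not part of any CATH software: the names `CathCode`, `parse_cath_code`, and `shared_depth` are hypothetical, and the class names are taken only from the four categories listed in this section.

```python
from dataclasses import dataclass

# Class numbers and names as listed in the text of this section.
CLASS_NAMES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "alpha and beta",
    4: "few secondary structures",
}


@dataclass(frozen=True)
class CathCode:
    """Hypothetical container for a four-level C.A.T.H code."""
    class_: int
    architecture: int
    topology: int
    superfamily: int

    @property
    def class_name(self) -> str:
        return CLASS_NAMES.get(self.class_, "unknown")

    def levels(self) -> tuple:
        return (self.class_, self.architecture, self.topology, self.superfamily)

    def prefix(self, depth: int) -> str:
        """Return the code truncated to `depth` levels (1 = C, 4 = H)."""
        if not 1 <= depth <= 4:
            raise ValueError("depth must be between 1 and 4")
        return ".".join(str(part) for part in self.levels()[:depth])


def parse_cath_code(code: str) -> CathCode:
    """Parse a string such as '3.30.70.990' into its four levels."""
    fields = code.strip().split(".")
    if len(fields) != 4 or not all(f.isdigit() for f in fields):
        raise ValueError(f"expected four dot-separated integers, got {code!r}")
    return CathCode(*(int(f) for f in fields))


def shared_depth(a: CathCode, b: CathCode) -> int:
    """Count how many leading levels two codes have in common (0 to 4)."""
    depth = 0
    for x, y in zip(a.levels(), b.levels()):
        if x != y:
            break
        depth += 1
    return depth
```

For instance, `parse_cath_code("3.30.70.990").prefix(3)` gives the topology-level code `3.30.70`, and two domains whose codes share a depth of 3 belong to the same fold but to different homologous superfamilies.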

Domain Assignment Methods

The domain assignment process in the CATH database begins with the identification of individual protein domains from full-length chains deposited in the Protein Data Bank (PDB). This delineation is primarily automated, using a tool that employs graph-based representations of protein structures combined with dynamic programming to predict domain boundaries, achieving approximately 90% accuracy in fold group assignment and 78% boundary precision within ±15 residues for challenging multi-domain proteins. For predicted structures from the AlphaFold Database (AFDB), domain boundaries are assigned using a deep learning-based algorithm benchmarked for high accuracy in segmenting multi-domain proteins.

The assignment workflow integrates sequence and structural analyses to classify domains hierarchically. It starts with sequence similarity searches, such as PSI-BLAST, to detect potential homologs among unclassified chains and existing CATH entries, enabling the identification of relatives with up to 35% sequence identity. These candidates are then subjected to structural superposition using the SSAP (Sequential Structure Alignment Program) algorithm, which aligns protein backbones and computes a normalized score reflecting fold similarity; domains with SSAP scores exceeding 70 are grouped at the topology (T) level, while scores above 80 indicate closer similarity suitable for homologous superfamily (H) assignment. Automated pipelines within the CathDB system handle initial grouping, propagating classifications to new chains with high sequence identity (>80%) to established domains via tools like ChopClose, which applies SSAP thresholds and root-mean-square deviation (RMSD) limits of ≤6.0 Å.

Curation involves expert manual intervention for cases where automated predictions are ambiguous, such as remote homologs or complex multi-domain architectures. Curators review alignments supported by SSAP scores, hidden Markov model (HMM) profiles from tools like HMMscan, and literature evidence, ensuring accurate boundary placement and fold assignment.
For AlphaFold-predicted structures, incorporation relies on confidence metrics like the predicted Local Distance Difference Test (pLDDT) score; domains with pLDDT scores ≥70 (indicating good confidence) receive full four-level CATH classifications, while lower-confidence predictions may undergo additional validation via hidden Markov models and curation.

Validation of assignments emphasizes structural fidelity to prevent over-classification, requiring structural evidence from SSAP alignments rather than sequence alone; multi-domain proteins are split only if boundaries align with distinct folds, benchmarked against manually curated datasets to achieve error rates below 4% in recognition. This process ensures robust classification across both experimental and predicted structures, with SSAP serving as the core metric for quantifying fold similarity and guiding topology-level decisions.
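The thresholds quoted in this section can be combined into a simple decision rule. The sketch below is illustrative only and is not CATH's actual pipeline: the function names are hypothetical, the cut-offs are the ones stated in the text (SSAP above 70 for topology, above 80 for homologous superfamily, at least 60% overlap, pLDDT of at least 70), and real superfamily assignment also draws on sequence, HMM, and literature evidence that is not modelled here.

```python
from typing import Optional

# Cut-offs as quoted in the text; treat them as assumptions of this sketch.
SSAP_TOPOLOGY_MIN = 70.0   # scores exceeding this suggest the same topology (T)
SSAP_HOMOLOGY_MIN = 80.0   # scores above this suggest the same superfamily (H)
OVERLAP_MIN = 60.0         # minimum structural overlap, in percent
PLDDT_MIN = 70.0           # minimum confidence for predicted models


def assignment_level(ssap_score: float, overlap_pct: float) -> Optional[str]:
    """Return 'H', 'T', or None for a pairwise structural comparison."""
    if overlap_pct < OVERLAP_MIN:
        return None
    if ssap_score > SSAP_HOMOLOGY_MIN:
        return "H"
    if ssap_score > SSAP_TOPOLOGY_MIN:
        return "T"
    return None


def needs_manual_review(plddt: Optional[float], level: Optional[str]) -> bool:
    """Flag a domain for curation.

    `plddt` is None for experimental structures. A domain is flagged when
    no level could be assigned automatically, or when it comes from a
    predicted model whose confidence falls below the pLDDT cut-off.
    """
    if level is None:
        return True
    return plddt is not None and plddt < PLDDT_MIN
```

Under these assumptions, a comparison with an SSAP score of 85 and 75% overlap yields `"H"`, while the same score with only 50% overlap yields no automatic assignment and would be routed to a curator.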

Database Content

Structure and Data Sources

The CATH database primarily sources its protein structures from the Protein Data Bank (PDB), which provides experimentally determined structures obtained through methods such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM). These structures form the foundation for domain identification and classification, with domain boundaries delineated using automated and manual curation processes to ensure accuracy at the modular level. Since 2022, CATH has been supplemented by predicted structural models from the AlphaFold Database (AFDB), integrated via The Encyclopedia of Domains (TED) to expand coverage of uncharacterized proteins, particularly in model organisms. This integration has significantly enhanced the database's ability to classify predicted domains alongside experimental ones, without storing raw structural data but instead deriving hierarchical classifications.

Key data types in CATH include domain coordinates derived from PDB entries or AFDB models, corresponding amino acid sequences, and functional annotations such as Gene Ontology (GO) terms and Enzyme Commission (EC) numbers assigned to functional families (FunFams) within superfamilies. Evolutionary relationships are captured through superfamily assignments based on structural similarity and sequence conservation, linking domains that share a common ancestor. Metadata accompanying each domain include structural resolution, experimental method, and source organism, enabling users to assess data quality and biological context. These elements emphasize CATH's domain-level granularity, which facilitates the study of protein modularity by treating domains as independent units rather than whole proteins.

As of version 4.4 (released October 2024), CATH classifies 601,493 experimental domains from the PDB and over 90 million predicted domains from the TED resource, organized across 6,573 superfamilies and 2,078 unique folds, including 479 novel hypothetical folds identified through predicted models.
The database integrates with external resources, such as UniProtKB for comprehensive sequence data and other resources for domain family alignments, providing hyperlinks to these for deeper exploration without duplicating raw sequences or structures. This derived, interconnected approach ensures CATH serves as a dynamic resource rather than a primary data repository.
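The data types and metadata described above can be pictured as a per-domain record. The following is a minimal sketch of such a record, not CATH's actual schema: the class `DomainRecord`, its field names, and the 2.5 Å default cut-off are all hypothetical, chosen only to mirror the items listed in this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DomainRecord:
    """Hypothetical per-domain record mirroring the metadata listed above."""
    domain_id: str                      # placeholder identifier
    superfamily: str                    # four-level C.A.T.H code as a string
    source: str                         # "PDB" (experimental) or "AFDB" (predicted)
    method: Optional[str] = None        # e.g. "X-ray", "NMR", "cryo-EM"
    resolution: Optional[float] = None  # in angstroms; None if not applicable
    organism: Optional[str] = None
    go_terms: List[str] = field(default_factory=list)
    ec_numbers: List[str] = field(default_factory=list)


def high_resolution(records: List[DomainRecord],
                    max_resolution: float = 2.5) -> List[DomainRecord]:
    """Keep records with a known resolution at or below the cut-off."""
    return [
        r for r in records
        if r.resolution is not None and r.resolution <= max_resolution
    ]


def count_by_source(records: List[DomainRecord]) -> dict:
    """Tally records by their source ('PDB' versus 'AFDB')."""
    counts: dict = {}
    for r in records:
        counts[r.source] = counts.get(r.source, 0) + 1
    return counts
```

A filter like `high_resolution` illustrates how resolution metadata lets a user assess data quality before analysis; predicted models carry no resolution and are excluded by it, so a confidence score would be the analogous criterion for them.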

Releases and Updates

The CATH database maintains a release cadence of approximately annual major updates, designated as CATH-Plus versions, which incorporate comprehensive classifications and annotations, supplemented by daily CATH-B snapshots that provide the most current automated assignments synchronized with new Protein Data Bank (PDB) entries.

Key releases include the initial CATH v1.0 in 1997, which established a manual hierarchical classification of early protein domain structures from the PDB. Version 3.5, released in September 2011, introduced enhanced automated protocols for domain assignment, adding 20,616 new domains, 77 new superfamilies, and 31 new folds to reach a total of 173,536 domains. CATH v4.3, issued in July 2019, expanded the database with 65,381 new domains and classifications for 25,311 additional PDB structures, totaling 500,238 domains. The most recent major release, v4.4 in October 2024, added 101,255 new domains and 65,024 PDB entries, bringing the total to 601,493 experimentally derived domains while integrating predicted structures.

Updates in each release typically include the assignment of new domains from recent PDB depositions, reclassification of existing domains based on emerging structural and sequence evidence, and the addition of functional annotations such as Enzyme Commission numbers and evolutionary relationships derived from sequence alignments. With v4.4 in 2024, coverage expanded dramatically through the incorporation of AI-predicted structures from the AlphaFold Database via the TED (The Encyclopedia of Domains) resource, enabling the mapping of approximately 90 million predicted domains to CATH superfamilies and achieving a 180-fold increase in structural information for classified superfamilies, with particular improvements in the representation of eukaryotic and viral proteins. This expansion includes the identification of 479 novel hypothetical folds.
The database is freely accessible for download in text-based formats such as the CATH Domain Description File (CDDF), CATH Names File (CNF), and FASTA sequences, with all versions archived for reproducibility and cited via associated publication DOIs, such as doi:10.1093/nar/gkae1082 for v4.4.
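The release figures quoted in this section can be checked with simple arithmetic. The sketch below uses only numbers stated above; the function names are hypothetical, and the reading of the 180-fold figure as the ratio of predicted domains to the v4.3 total is an assumption of this example, since the text does not say how that figure was derived.

```python
# Figures as quoted in this section.
V4_3_TOTAL = 500_238        # total domains in v4.3
V4_4_NEW = 101_255          # new domains added in v4.4
V4_4_TOTAL = 601_493        # reported v4.4 total of experimental domains
TED_PREDICTED = 90_000_000  # "approximately 90 million" predicted domains


def check_release_total(previous_total: int, newly_added: int,
                        reported_total: int) -> bool:
    """Return True if the previous total plus additions equals the reported total."""
    return previous_total + newly_added == reported_total


def fold_increase(added: int, baseline: int) -> float:
    """Ratio of newly mapped domains to a baseline count."""
    return added / baseline


if __name__ == "__main__":
    print(check_release_total(V4_3_TOTAL, V4_4_NEW, V4_4_TOTAL))
    print(round(fold_increase(TED_PREDICTED, V4_3_TOTAL), 1))
```

The v4.3 total plus the v4.4 additions reproduces the reported 601,493 exactly, and 90 million divided by the v4.3 total comes to roughly 180, which is at least consistent with the stated 180-fold increase.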

Tools and Applications

Open-Source Software

The primary open-source software for interacting with and maintaining the CATH database is the cath-tools repository, hosted on GitHub by the UCL Orengo Group, which provides a suite of command-line tools for protein structure comparison and classification tasks essential to CATH curation. This toolbox includes tools for domain parsing, such as cath-assign-domains, which automates the assignment of structural domains to protein chains, and classification pipelines like cath-cluster for grouping domains based on structural similarity. Additionally, cathpy, a complementary Python-based bioinformatics toolkit also developed by the UCL Orengo Group, offers libraries and utilities for parsing CATH data files, performing alignments, and integrating with broader workflows, facilitating programmatic access to the database's hierarchical classifications.

Key components within these tools emphasize structural comparison and automated processing; for instance, the SSAP (Sequential Structure Alignment Program) implementation in cath-tools enables precise pairwise comparisons of protein domains to detect remote homologs, a core step in CATH's superfamily assignments. For initial domain boundary delineation, tools like cath-refine-align support iterative refinement of domain cuts by analyzing secondary structure and alignment scores, aiding in the semi-automated curation process. These components are designed for backend processing rather than end-user interfaces, allowing developers to build custom pipelines for tasks such as updating domain annotations with new entries. CATH-AlphaFlow, an open-source Nextflow pipeline, processes and classifies protein chains from AlphaFold models or experimental data, enabling the integration of predicted structures into CATH hierarchies.
Both cath-tools and cathpy are released under open-source licenses (GPL-3.0 in the case of cath-tools), enabling free use, modification, and distribution while encouraging community contributions through pull requests for enhancements like new alignment algorithms or compatibility fixes. This licensing supports collaborative curation, where external developers can extend the tools for specialized analyses, such as integrating models for fold prediction.

In practice, these tools are employed by bioinformatics developers for custom structural analyses, often within scripted workflows; for example, cathpy's modules can be imported into Python scripts to query CATH superfamilies, while cath-tools executables are invoked in shell-based workflows for large-scale domain clustering. Such integrations streamline bioinformatics pipelines, from sequence-to-structure mapping to functional annotation transfers within CATH hierarchies.

The software is actively maintained by the Orengo Group, with updates aligned to CATH database releases; recent enhancements ensure compatibility with version 4.4, incorporating support for the expanded set of predicted structures from AlphaFold models since the 2024 release. Ongoing development includes bug fixes and performance optimizations, with contributions welcomed to address evolving needs in classification.

Analysis and Visualization Tools

The CATH database provides a user-friendly web interface at cathdb.info, enabling searches by domain code, protein sequence, or structural identifiers such as PDB codes. Users can query specific domains or browse hierarchical classifications, including superfamily explorers that display evolutionary relationships and fold trees illustrating structural topologies. This interface facilitates exploration of protein domain architectures without requiring software installation, and downloadable datasets support large-scale batch queries for comprehensive genomic analyses.

Specialized tools integrated with CATH enhance structural comparisons and sequence-based investigations. PDBeFold enables pairwise and multiple 3D alignments of protein structures against the CATH archive, aiding in the identification of structural similarities. CATH-BLAST, utilizing PSI-BLAST protocols, performs sequence similarity searches to assign query sequences to CATH superfamilies and detect potential novel domains. For visualization, the platform supports interactive 3D rendering of domain structures directly in the browser, while compatibility with desktop molecular graphics software allows session imports for detailed molecular modeling. CATHe2, released in 2025, enhances CATH superfamily detection using ProstT5 embeddings and structural alphabets for improved identification of remote homologs.

Analysis features within CATH emphasize functional and evolutionary insights. Functional profiling tools, such as FunSite, leverage CATH functional families to predict ligand-binding sites, catalytic residues, and protein-protein interaction interfaces by analyzing conserved structural motifs across superfamilies. Evolutionary trees are generated to map phylogenetic relationships within homologous superfamilies, highlighting divergence in sequence and structure.
Novelty detection identifies unclassified domains by comparing new structures against existing folds, seeding the addition of novel topologies; this has recently expanded CATH with over 250 new folds from PDB entries.

Advanced applications include a RESTful API for programmatic access, allowing developers to retrieve domain classifications, sequence alignments, and functional annotations in structured formats such as XML for integration into custom pipelines. Post-2024 updates have incorporated tools for variant interpretation, such as those in the CATH FunVar resource, which assess the structural and functional impact of disease-associated mutations by mapping them onto CATH domains and predicting their effects, for example on binding, in proteins including those of pathogens. These features support large-scale research in structural genomics and precision medicine, with all tools accessible via the web to promote broad usability.
