
Pfam

Pfam is a comprehensive, open-access database of protein families and domains. It classifies proteins by shared evolutionary origin and functional similarity, using curated multiple sequence alignments and profile hidden Markov models (HMMs) to detect and annotate domains in novel sequences. Founded in 1995 at the Sanger Institute, Pfam has become a cornerstone resource in bioinformatics; it is now maintained by the European Bioinformatics Institute (EMBL-EBI) and integrated within the InterPro consortium. Its method starts from manually curated seed alignments, from which probabilistic HMMs of conserved regions are built, allowing sensitive detection of distant homologs beyond simple sequence similarity searches.

As of release 38.0 in October 2025, Pfam holds 25,545 entries, including families, domains, repeats, and motifs, which together annotate domains in more than 90% of protein sequences from UniProt Reference Proteomes. Entries are organized into clans, groups of related families linked by sequence, structural, or functional evidence, which resolve overlaps and give a hierarchical view of protein relationships. Pfam's models support genome annotation and experimental design by predicting domain architecture, functional sites, and evolutionary relationships, thereby aiding research into protein diversity and disease mechanisms.

Recent enhancements include the use of artificial intelligence and machine learning, such as the deep learning models in Pfam-N, to automate curation, improve model accuracy, and expand coverage of underrepresented protein classes. Pfam is accessible through web interfaces, programmatic access, and downloadable files, and it links to resources such as UniProt, PDB, and GO annotations for broader protein exploration.

Introduction

Definition and Scope

Pfam is a widely used bioinformatics resource that provides a curated collection of protein families and domains. Each entry is represented by multiple sequence alignments (MSAs) and a profile hidden Markov model (HMM), which together support the identification and annotation of protein sequences. These models capture the conserved features of protein architectures, enabling researchers to classify proteins by evolutionary relationship and functional similarity.

As of Pfam version 38.0, released in October 2025, the database contains more than 25,000 manually curated Pfam-A entries, which collectively annotate domains in more than 90% of protein sequences from UniProt Reference Proteomes through integration with UniProtKB. These families are organized into clans that group related families based on sequence, structural, or functional evidence. In contrast, the legacy Pfam-B entries, which consisted of automatically generated clusters from unassigned sequence regions, have been deprecated and are no longer maintained or updated.

The scope of Pfam is primarily proteins from eukaryotic and prokaryotic organisms. It covers domains that are typically longer than short peptides, emphasizing structurally and functionally significant regions, and excludes non-protein entities such as nucleic acids and small molecules. This targeted coverage supports high-quality annotation of the majority of cellular proteins across diverse taxa, and therefore broad applicability in genomic and proteomic studies.

Importance in Bioinformatics

Pfam plays a central role in bioinformatics by classifying proteins into families and domains based on evolutionary relationships, enabling researchers to infer functional, structural, and interaction properties from sequence similarity. Through curated multiple sequence alignments and profile hidden Markov models (HMMs), Pfam identifies conserved regions that reflect shared evolutionary histories, allowing predictions of protein function even for uncharacterized sequences. For instance, domain architectures (combinations of domains within a single protein) provide insights into molecular interactions, as interacting domains often co-occur in protein complexes. This classification supports the annotation of protein roles in biological processes, bridging sequence data to a higher-level understanding of cellular mechanisms.

Pfam contributes significantly to large-scale bioinformatics projects, particularly genome annotation pipelines and metagenomic analyses. It is integrated into resources such as UniProt, where Pfam domains annotate protein entries to support functional inference across proteomes, and Ensembl, which uses Pfam to identify and summarize protein domains in gene models during vertebrate genome annotation. In metagenomics, Pfam enables the functional characterization of proteins from uncultured microbes by detecting domain signatures in environmental sequencing data, aiding the discovery of novel enzymes and metabolic pathways in microbial communities without the need for cultivation.

The database's treatment of multidomain proteins highlights how functional novelty arises through domain combinations: diverse arrangements of known domains can generate proteins with emergent properties, such as specialized signaling or regulatory roles. This matters for evolutionary studies, where domain shuffling drives innovation in protein function.

Pfam is extensively used in bioinformatics, with its core publications cited thousands of times annually, and it supports AI-driven structure prediction tools by providing alignment data for domain-containing proteins.

Methodology

Hidden Markov Models

Profile hidden Markov models (profile HMMs) are the statistical framework used to represent protein families in Pfam. They model the probabilistic patterns observed in multiple sequence alignments of related proteins. The models are position-specific, capturing both conserved regions and variability across a family by accounting for substitutions, insertions, and deletions. Unlike simpler sequence profiles, a profile HMM treats the alignment probabilistically: each position in the model corresponds to a column in the alignment, which allows flexible handling of the gaps and indels that reflect biological divergence.

The architecture of a profile HMM in Pfam is a linear chain of states. A silent begin state initiates the model and leads into the core model, which contains a match (M), insert (I), and delete (D) state for each position in the consensus length of the family. Match states emit amino acids with position-specific probabilities derived from the residue frequencies in the corresponding alignment column. Insert states model additional residues with their own emission probabilities. Delete states are silent, allowing a position to be skipped without emission. Transition probabilities govern the likelihood of moving from one state to another, such as from a match state to an insert or delete state, which lets the model accommodate sequences of variable length. The core model concludes with an end state, and non-homologous flanking regions are handled by additional states: N (N-terminal), J (joiner, for multi-domain proteins), and C (C-terminal). The parameters, namely emission probabilities for the 20 amino acids at match and insert states and transition probabilities across all state types, are estimated from the seed alignment using priors to account for sampling biases.

Profile HMMs offer higher sensitivity than traditional pairwise or profile-based alignments for detecting distant homologs, particularly those with sequence identities below 20%, where simple alignments often fail because of accumulated mutations and gaps. The gain comes from position-dependent gap penalties and substitution probabilities, which better mirror the evolutionary processes shaping protein domains and lead to more accurate identification of family membership even in highly diverged sequences. In Pfam, as of release 38.0 (October 2025), these models are built and searched with the HMMER software suite. The hmmbuild command generates a profile from a seed alignment in Stockholm format, carrying family-specific annotations such as accession numbers, and the hmmsearch command queries sequence databases for significant matches based on probabilistic scores and curated thresholds.
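As a rough illustration of position-specific scoring, the sketch below estimates match-state emission probabilities from a tiny ungapped seed alignment and scores sequences in bits against a uniform background. It is a toy stand-in rather than HMMER: insert and delete states and all transitions are omitted, a flat pseudocount replaces real priors, and the function names, example sequences, and background value are assumptions made for the example.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids
BACKGROUND = 1.0 / len(ALPHABET)    # assumption: uniform background frequencies


def build_match_emissions(seed, pseudocount=1.0):
    """Estimate one emission distribution per column of an ungapped seed alignment.

    The flat pseudocount is a simple stand-in for the priors a real tool would use.
    """
    emissions = []
    for column in zip(*seed):
        counts = {aa: pseudocount for aa in ALPHABET}
        for residue in column:
            counts[residue] += 1
        total = sum(counts.values())
        emissions.append({aa: c / total for aa, c in counts.items()})
    return emissions


def bit_score(sequence, emissions):
    """Sum of per-position log-odds (base 2) of the model versus the background."""
    return sum(
        math.log2(column[residue] / BACKGROUND)
        for residue, column in zip(sequence, emissions)
    )


# Hypothetical three-column seed alignment, for illustration only.
toy_seed = ["ACD", "ACD", "ACE"]
toy_model = build_match_emissions(toy_seed)
```

With this toy model, `bit_score("ACD", toy_model)` is positive while `bit_score("WWW", toy_model)` is negative, which is the basic behaviour a bit-score threshold relies on.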

Family Construction and Alignment

The construction of a Pfam family begins with a seed alignment: a manually curated set of diverse, representative protein sequences, typically 5 to 50 in number. These sequences are selected to capture the core structural and functional features of the family while retaining enough variability to represent its evolutionary diversity. The initial multiple sequence alignment (MSA) for the seed is generated with established tools such as MUSCLE or MAFFT to give a high-quality starting point for model development.

Once the seed alignment is established, an iterative process expands it into a full alignment by searching large databases such as UniProtKB or a reference proteome set. A profile hidden Markov model (HMM) is built from the seed and used to identify potential homologs with tools like jackhmmer, and subsequent iterations incorporate significant matches to refine the model and alignment. Non-homologous sequences are pruned during expansion using statistical significance criteria, such as E-values below 0.001, to maintain alignment quality and exclude false positives. Refinement continues until no additional reliable homologs are found, yielding a full alignment that includes all sequences meeting the family's curated gathering threshold.

As of Pfam 37.0 (2024), family construction also draws on structural predictions from AlphaFold to refine domain boundaries and create new families. For example, 114 new families were generated using ECOD's Domain Parser for AlphaFold Models (DPAM), and 146 additional families came from sequence similarity analysis of AlphaFold models. A further 710 metagenomic families were added using MMseqs2 clustering of sequences from MGnify and UniProtKB. A new automated resource, Pfam-N, employs deep learning models such as ProtENN and Maskformer to generate predicted annotations, achieving an 8.8% increase in coverage over Pfam-A.

Domain boundaries in Pfam alignments are defined using two coordinate systems to accommodate variability in matching. Envelope coordinates delineate a broader region that can include less conserved flanking areas, enabling robust detection of distant homologs during searches. Alignment coordinates, the stricter of the two, cover the core conserved region and give precise positioning within the full alignment. Both are determined after the search from HMMER's output, and the envelope is typically recommended for annotation purposes because of its tolerance for insertions and deletions. Structural data have further improved boundary definitions in recent releases.

Pfam handles repeated domains and multi-domain architectures by tagging instances of repeats within individual families and recording the order of domains in protein sequences. Repeated domains, such as leucine-rich repeats, are identified through sequence similarity searches and manually annotated in the alignment to distinguish multiple occurrences in a single protein. Integrations with RepeatsDB and the REFRACT project added 185 new repeat entries as of Pfam 37.0. Domain architectures are captured by combining multiple Pfam models, allowing users to visualize and query common combinations, such as kinase-phosphatase pairs, without merging families prematurely. This approach supports the modeling of complex proteins while preserving family specificity.
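The iterative expansion described in this section can be sketched as a simple loop that stops at convergence. This is a minimal sketch under stated assumptions: `search` is a hypothetical callable standing in for a real profile search (it is not a HMMER API), and the 0.001 E-value cutoff is the example value quoted in the text rather than a universal setting.

```python
E_VALUE_CUTOFF = 1e-3  # example significance cutoff quoted in the text


def expand_family(seed_ids, search, max_rounds=10):
    """Grow a family from its seed until no new significant hits appear.

    `search` is a hypothetical function: given the current member IDs, it
    returns a dict mapping candidate sequence IDs to E-values, as a profile
    built from those members might.
    """
    members = set(seed_ids)
    for _ in range(max_rounds):
        hits = search(members)
        significant = {sid for sid, evalue in hits.items() if evalue < E_VALUE_CUTOFF}
        new_members = significant - members
        if not new_members:      # convergence: nothing new to add
            break
        members |= new_members   # the next round searches with the larger family
    return members
```

In a real pipeline each round would rebuild the alignment and HMM before searching again; here that step is hidden inside the hypothetical `search` callable so the control flow stays visible.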

Database Content

Protein Families and Domains

Pfam-A is the manually curated core of the Pfam database, comprising 25,545 entries that model protein families and domains as evolutionary and functional units (as of release 38.0 in October 2025). Each entry is built from a seed multiple sequence alignment (MSA) of carefully selected representative sequences. The seed is the basis for a full MSA incorporating all detected members from sequence databases, and for a profile hidden Markov model (HMM) used to detect distant homologs. Entries also include functional annotations, such as mappings to Gene Ontology (GO) terms and links to pertinent literature, enabling researchers to infer biological roles from sequence similarity.

Pfam distinguishes between domains and families. Domains are compact, structurally and functionally independent modules within proteins, such as the SH2 domain involved in phosphotyrosine recognition. Families are broader evolutionary groupings that may include multiple related domains or domain variants sharing common ancestry. This separation allows Pfam to capture both specific modular components and larger phylogenetic clusters, with families often serving as the primary unit of annotation.

Annotations in Pfam-A entries extend beyond basic alignments to site-specific details, such as active-site residues critical for catalysis or binding, predictions of post-translational modifications (PTMs) such as phosphorylation sites, and cross-references to experimentally determined structures in the Protein Data Bank (PDB). These features link sequence patterns to molecular mechanisms. Pfam-A families collectively annotate approximately 80% of sequences in the full UniProt Knowledgebase (UniProtKB) and over 90% in UniProt Reference Proteomes (as of 2025).

For instance, kinase domains are distributed across multiple families, such as the eukaryotic protein kinase (Pkinase) family and related groups like atypical kinases, reflecting evolutionary divergence within this functional class around a shared catalytic core. Such families are in some cases grouped into clans to indicate deeper relationships.

Clans and Relationships

In Pfam, clans are collections of protein families that share a common evolutionary origin but have diverged so far that their relationship cannot be detected by standard profile HMM searches between individual family models. These groupings address the limits of family-level classification by capturing deeper homology, often spanning superfamilies with low sequence identity. As of release 38.0 in October 2025, Pfam includes over 800 clans, reflecting ongoing curation efforts to organize the expanding database of protein domains.

Clan membership is established through manual curation and relies on multiple lines of evidence: significant sequence similarity between the seed alignments of different families, structural correspondences observed in Protein Data Bank (PDB) entries, similarities between profile HMMs, or shared functional characteristics. Structural evidence from the PDB, for instance, often reveals conserved folds that link otherwise disparate families, keeping clans consistent with established classifications such as SCOP and CATH. This criteria-based approach prevents arbitrary groupings and maintains evolutionary consistency across the database.

The primary benefits of clans are that they resolve the apparently polyphyletic families that rapid evolutionary divergence can produce, and that they enable the transfer of annotations across distantly related homologs, which improves overall protein annotation accuracy. A notable example is the Rossmann fold clan (CL0063), which encompasses more than 198 families involved in FAD/NAD(P) binding, allowing researchers to infer cofactor-binding capability from structural and evolutionary context despite minimal sequence conservation. By grouping such families, clans give broader insight into protein evolution and function.

Clan-specific resources include shared multiple alignments derived from sequences across member families, HMMs that model the collective superfamily, and overlap rules that manage potentially redundant matches during annotation. These rules typically designate certain HMMs as "active" for reporting, so that only the most appropriate assignment is made when a protein matches multiple clan members, which avoids conflicts and improves reliability in tools like InterProScan.
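One simple way to picture overlap handling within a clan is a greedy filter that keeps the highest-scoring hit wherever two hits from the same clan overlap. The sketch below is an illustrative assumption about how such a rule could be coded, not Pfam's or InterProScan's actual implementation; the `Hit` class and the family and clan labels in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hit:
    family: str           # family name or accession
    clan: Optional[str]   # clan accession, or None if the family has no clan
    start: int            # 1-based inclusive start on the protein
    end: int              # 1-based inclusive end on the protein
    score: float          # bit score


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two hits share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def resolve_clan_overlaps(hits: List[Hit]) -> List[Hit]:
    """Keep the best-scoring hit among overlapping hits from the same clan."""
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        clash = any(
            hit.clan is not None and hit.clan == other.clan and overlaps(hit, other)
            for other in kept
        )
        if not clash:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)
```

For example, given hits `FamA` (clan `CL_X`, residues 10 to 250, score 180) and `FamB` (clan `CL_X`, residues 12 to 248, score 120) plus an unrelated `FamC` hit at residues 300 to 380, the function returns `FamA` and `FamC` and drops the weaker same-clan overlap.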

Domains of Unknown Function

Domains of Unknown Function (DUFs) make up a substantial portion of Pfam, with over 5,000 such families as of 2025, or about 20% of all Pfam entries. These families are labeled as Domains or Regions of Unknown Function and assigned sequential identifiers starting from DUF1, with current numbering extending beyond DUF7100. DUFs are protein segments for which no functional role has been experimentally or computationally assigned, despite their presence in diverse proteomes.

DUFs typically consist of small, highly conserved motifs, often embedded within essential proteins that are critical for organismal viability, which suggests they are important in core biological processes. To encourage collaborative exploration, Pfam curators link many DUF entries to dedicated Wikipedia pages, inviting crowdsourced hypotheses and community-driven insights into potential roles. This has proven valuable for generating preliminary ideas, though experimental validation remains essential.

The persistent conservation of DUFs across taxa implies significant selective pressure and functional relevance, yet their characterization lags behind other Pfam families because of challenges in structural determination and experimental tractability. Progress is incremental: some DUFs are reassigned to named families once their function is elucidated, contributing to a gradual decline in their relative proportion, from about 22% in early releases to roughly 20% in 2025 versions. Bacterial DUFs such as DUF2726, found in various prokaryotes, illustrate areas where clan groupings with known domains may help in inferring function.

Features and Tools

Search and Annotation Capabilities

Pfam provides search and annotation tools that identify protein domains and families using hidden Markov model (HMM)-based methods, primarily through integration with the HMMER software suite. The core tool for sequence searching is InterProScan, which detects domains in user-submitted protein sequences by scanning them against Pfam HMM profiles. This HMMER-based approach is sensitive and supports both single-sequence queries and batch analysis of entire genomes or proteomes, enabling large-scale annotation efforts.

Annotation outputs include detailed domain architectures, denoted by Pfam identifiers of the form PFxxxxx, which map detected regions of the protein sequence to specific families or domains. Results report statistical measures: bit scores, which quantify the quality of a match, and E-values, which assess its statistical significance and help distinguish true positives from false ones. Visualized alignments show how query sequences align to the HMM profiles, highlighting conserved residues and, where available, secondary structure predictions.

Advanced features support the analysis of complex proteins, particularly multidomain ones. An architecture viewer displays the linear arrangement of domains along the sequence, aiding the interpretation of functional modularity. Users can also run full-text searches by keyword, accession, or related identifiers to retrieve relevant Pfam entries, clans, or associated data.

Access is provided through a web interface hosted by InterPro, where sequences can be submitted directly for analysis without software installation. Programmatic access is available through a RESTful API, allowing automated queries and retrieval of results in formats such as JSON. In addition, HMM libraries can be downloaded from the Pfam FTP site for local searches with HMMER, accommodating offline or customized workflows.
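For local searches, HMMER can write per-domain hits to a whitespace-delimited table (the `--domtblout` option), which includes the bit score, E-values, and both alignment and envelope coordinates. The sketch below parses one such row into a dictionary. The sample row is invented for illustration, including the accession version and all numbers, and the field names in the returned dictionary are this example's own choice; the column order assumed here follows the domain table layout described in the HMMER user guide, with the model as the target, as in an hmmscan run.

```python
def parse_domtblout_line(line):
    """Parse one non-comment row of a HMMER per-domain table (hmmscan layout)."""
    f = line.split(maxsplit=22)  # 22 fixed columns, then a free-text description
    return {
        "model": f[0],
        "accession": f[1],
        "query": f[3],
        "full_evalue": float(f[6]),
        "domain_ievalue": float(f[12]),
        "domain_score": float(f[13]),
        "hmm_from": int(f[15]),
        "hmm_to": int(f[16]),
        "ali_from": int(f[17]),
        "ali_to": int(f[18]),
        "env_from": int(f[19]),
        "env_to": int(f[20]),
        "description": f[22].strip() if len(f) > 22 else "",
    }


# Invented example row; values are not from a real search.
SAMPLE_ROW = (
    "Pkinase PF00069.30 264 query1 - 512 1.2e-60 203.4 0.0 1 1 "
    "3.1e-64 1.2e-60 203.1 0.0 1 264 47 299 45 301 0.97 Protein kinase domain"
)

hit = parse_domtblout_line(SAMPLE_ROW)
envelope_length = hit["env_to"] - hit["env_from"] + 1
```

In this made-up row the envelope (45 to 301) is slightly wider than the alignment region (47 to 299), which is the usual relationship between the two coordinate systems.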

Integration with Other Resources

Pfam is now fully hosted within the InterPro database, which combines its hidden Markov model (HMM)-based protein family signatures with those from other member databases, such as PRINTS and TIGRFAMs, to provide a comprehensive resource for protein classification and functional prediction. This integration lets users access Pfam data alongside complementary signatures, supporting non-redundant domain assignments and wider coverage of protein sequences. InterPro aggregates the more than 25,000 Pfam entries into unified entries that resolve overlaps and hierarchies.

Pfam domains are automatically annotated in UniProt Knowledgebase (UniProtKB) entries. In the TrEMBL section, InterPro's predictions, including Pfam, transfer domain information to uncurated sequences, while Swiss-Prot entries receive manual validation of these annotations. This linkage supports large-scale annotation: researchers can retrieve Pfam-based domain architectures directly from UniProt protein reports, which in turn propagate functional insights across databases.

For structural analysis, Pfam domains are mapped to experimentally determined structures in the Protein Data Bank (PDB), allowing visualization of domain folds within known protein complexes and helping to validate predictions against atomic models. Pfam boundaries are also aligned with AlphaFold-predicted structures to refine domain definitions and identify novel families, and tools such as InterPro's structure viewer overlay Pfam annotations on predicted models for comparative analysis.

Pfam data further connects to functional and pathway resources. Through mappings to the Gene Ontology (GO), Pfam families are associated with GO terms for automated inference of molecular functions, biological processes, and cellular components. Exports to KEGG pathways place Pfam domains in metabolic and signaling network contexts, enabling pathway enrichment analyses that link domain occurrences to biochemical roles.

As part of the ELIXIR infrastructure, Pfam benefits, via InterPro, from standardized data dissemination and computational support across European life sciences platforms, which supports interoperability and long-term sustainability.

Applications

Protein Function and Structure Prediction

Pfam supports protein function inference by identifying domain matches that allow functional annotations to be transferred from well-characterized family members to unannotated query sequences. For instance, if a novel protein sequence aligns significantly with a Pfam family known to carry a specific enzymatic activity or binding function, that role can be assigned to the query protein on the basis of evolutionary conservation within the family. This approach uses the curated annotations in Pfam, including GO terms and enzyme classifications, to propagate knowledge across related proteins and improve the functional characterization of proteomes.

In structure prediction, Pfam domain boundaries guide modeling tools like AlphaFold by delineating modular regions that can be predicted independently or in context. AlphaFold models have been integrated into Pfam resources, providing predicted structures for essentially all UniProtKB sequences and enabling curators to refine domain definitions and validate alignments against structural data. Representative structures are available for approximately 89% of Pfam families, indicating strong agreement between sequence-based domains and predicted folds.

A representative example is the protein kinase domain (PF00069), one of the largest and most conserved Pfam families. A match to this family predicts kinase activity and locates the conserved catalytic motifs in the query protein. This inference aids signaling pathway reconstruction and has applications in drug target identification, where kinase domains are prioritized because of their frequent association with small-molecule inhibitors in therapeutic development.

Limitations arise in multidomain proteins, where individual functions may interact combinatorially, producing ambiguous overall roles that require analysis of the complete domain architecture rather than isolated matches. Domains of unknown function (DUFs) further complicate predictions because they lack transferable annotations.
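The annotation-transfer idea, and the way DUFs fall outside it, can be sketched with a small lookup. The mapping below is a hypothetical, hand-written table for illustration only (the single function label and the `PF_DUF_EXAMPLE` accession are placeholders, not Pfam data), and the function name is likewise an assumption.

```python
# Hypothetical family-to-function table; a real workflow would load curated
# mappings rather than hard-code them.
FAMILY_FUNCTIONS = {
    "PF00069": {"protein kinase activity"},
    "PF_DUF_EXAMPLE": set(),  # placeholder for a family with no known function
}


def transfer_annotations(hit_accessions, family_functions):
    """Collect functions implied by domain hits, and list hits that imply none."""
    inferred = set()
    uninformative = []
    for accession in hit_accessions:
        functions = family_functions.get(accession, set())
        if functions:
            inferred |= functions
        else:
            uninformative.append(accession)
    return inferred, uninformative
```

Calling `transfer_annotations(["PF00069", "PF_DUF_EXAMPLE"], FAMILY_FUNCTIONS)` returns the kinase label together with the placeholder DUF listed as uninformative, mirroring the limitation noted above.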

Genomic and Metagenomic Analysis

Pfam plays a central role in automated genome annotation pipelines, particularly for prokaryotic genomes, where hidden Markov model (HMM)-based searches against the Pfam database assign protein domains to predicted genes. In large-scale analyses of bacterial genomes, Pfam annotations cover over 85% of protein sequences, providing functional insight into metabolic and structural components across diverse taxa. For instance, pipelines like AnnoTree combine Pfam with hmmscan to annotate domains in tens of thousands of bacterial genomes, revealing coverage patterns that differ between lineages, such as lower coverage in phyla like Patescibacteria. This approach supports high-throughput annotation in large genome resources, where Pfam-derived domains contribute to the curation of over 400,000 prokaryotic genomes and the identification of conserved functional elements.

In metagenomic studies, Pfam supports domain profiling of uncultured microbial communities from environmental samples, enabling the discovery of novel enzymes and metabolic pathways. By applying HMM search tools to metagenome-assembled genomes (MAGs), researchers annotate protein families to uncover functional potential, for example in bathypelagic microbiomes, where Pfam identifies transporters and enzymes involved in nutrient cycling. In one case, analysis of 58 deep-sea metagenomes using Pfam (release 31.0) revealed novel enzyme families in 317 high-quality MAGs, highlighting the potential for chemolithoautotrophy in understudied environments. This domain-centric approach complements other functional classification schemes and allows targeted mining of bioactive compounds and enzymes from complex samples such as sediments, where traditional culturing fails to capture more than 99% of microbial diversity.

Pfam supports evolutionary studies by enabling the reconstruction of domain architectures across phylogenetic trees, tracking the gains and losses that drive proteome diversification. Researchers map Pfam domains onto species phylogenies to quantify events such as domain acquisition or loss via deletion, revealing rates that vary by superkingdom, for example higher gains in eukaryotes than in prokaryotes. In large-scale projects such as the 1,000 Plant Genomes Initiative, Pfam annotations of transcriptomes from diverse green plants have illuminated domain rearrangements linked to innovations like the evolution of vascular tissue, with analyses showing frequent losses in simplified genomes and gains preceding key adaptations. These results show Pfam's utility in inferring ancestral states and adaptive trajectories from sequence data.

Quantitative metrics derived from Pfam, such as domain frequency distributions, provide estimates of a genome's functional repertoire by modeling domain co-occurrences as "grammars" of protein architectures. Across 4,794 species, Pfam-based analyses reveal conserved patterns in which domain frequencies follow power-law distributions, with relative entropy measures (about 1.2 bits on average) quantifying architectural complexity and linking it to functional specialization, such as signaling in eukaryotes. This approach estimates functional breadth without exhaustive pathway annotation, and shows how rare domains contribute disproportionately to architectural diversity while common ones underpin core functions.
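To make the idea of an entropy measure over domain frequencies concrete, the sketch below computes plain Shannon entropy, in bits, of a domain count distribution. This is a simpler stand-in for the relative entropy measures mentioned above, not a reproduction of any published analysis; the function name and the example counts are assumptions.

```python
import math


def domain_entropy_bits(domain_counts):
    """Shannon entropy (in bits) of a domain frequency distribution.

    `domain_counts` maps a domain identifier to how often it occurs in a proteome.
    """
    total = sum(domain_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in domain_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical counts: one proteome spreads its domains evenly, the other
# is dominated by a single domain.
even_proteome = {"domA": 25, "domB": 25, "domC": 25, "domD": 25}
skewed_proteome = {"domA": 97, "domB": 1, "domC": 1, "domD": 1}
```

An evenly spread distribution over four domains gives exactly 2 bits, while the skewed one gives far less, which is the sense in which entropy summarizes how a proteome's domain usage is distributed.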

History and Development

Origins and Early Evolution

Pfam was founded in 1995 by Erik Sonnhammer, Sean Eddy, and Richard Durbin at the Sanger Institute in the United Kingdom, building on early versions of the HMMER software for hidden Markov model-based sequence analysis. The project emerged from efforts to classify protein domains amid the rapid growth of sequence data in the early genomic era, with the aim of creating a curated database of protein families using seed alignments and profile hidden Markov models (HMMs). Initial development balanced manual curation of high-quality families (Pfam-A) with automated clustering for broader coverage (Pfam-B), enabling systematic annotation of protein domains in uncharacterized sequences.

The first public release, Pfam 1.0, appeared in 1997 and included 175 manually curated Pfam-A families, derived from alignments of proteins in Swiss-Prot release 33 and covering domains with more than 50 members where possible. Subsequent annual updates drove substantial growth: by 2008 Pfam had reached approximately 10,000 families, reflecting iterative curation to incorporate new sequences from expanding databases like UniProt. Expansion continued, with Pfam 24.0 in 2010 containing 11,912 families, improving coverage of diverse proteomes and supporting genome annotation projects.

Key milestones in Pfam's early evolution included the introduction of clans, which grouped related Pfam-A families by shared evolutionary origin using sequence similarity and structural evidence, enabling higher-level classification of homologous domains. In the late 2000s, Pfam began linking domains of unknown function (DUFs) to Wikipedia articles to crowdsource functional insights, with systematic integration starting around release 26.0 in 2011, which connected over 4,900 families to community-edited pages. By 2013, hosting had moved to the European Bioinformatics Institute (EMBL-EBI), following the migration of the development team from the Sanger Institute at the end of 2012, which improved infrastructure for global access and integration with other resources.

Early challenges centered on scaling manual curation to match the exponential growth of sequence databases like UniProtKB, which outpaced Pfam's linear family expansion and limited coverage of novel or low-abundance domains. Curators prioritized high-impact families, but the influx of metagenomic and eukaryotic sequences demanded refined seed alignments and automated supplements to keep Pfam a reliable tool for protein classification despite resource constraints.

Recent Advances and Integration

In 2021, Pfam underwent a significant infrastructural shift with its full integration into the InterPro consortium, culminating in the decommissioning of the standalone Pfam website by 2022. This migration unified Pfam's data and tools within InterPro's platform, streamlining access to protein family annotations alongside signatures from other member databases. Release cycles have since been synchronized, as exemplified by InterPro 107.0 incorporating Pfam 38.0 updates in October 2025.

Since 2020, Pfam has adopted artificial intelligence and machine learning to strengthen its curation and predictive capabilities, particularly through the development of Pfam-N, a deep learning model that improves sequence alignments and expands coverage of domains of unknown function (DUFs). For instance, Pfam-N has supported the prediction and annotation of previously unclassified DUFs such as DUF6919 and DUF6994, achieving an 8.8% increase in UniProtKB coverage compared with standard Pfam models. Using AlphaFold models, Pfam curators have refined domain boundaries, for example in the Voltage_CLC family (PF00654), and incorporated 114 new families derived from predicted structures, accelerating the identification of novel protein architectures. This AI-assisted approach has supported the addition of over 7,000 families across five major releases, from Pfam 33.1 (18,259 families) to Pfam 38.0 (25,545 families as of October 2025), extending the database's scope for functional annotation.

Looking ahead, Pfam aims to broaden clan coverage by integrating more structural and evolutionary relationships, while moving to more frequent, bi-monthly updates through the InterPro infrastructure to enable near-real-time data dissemination and community feedback. These initiatives, supported by ongoing AI advances, position Pfam to address emerging challenges in protein family classification as genomic datasets grow.

Curation Process

Manual Curation Procedures

The manual curation of Pfam entries is conducted by a dedicated team of professional curators at the European Bioinformatics Institute (EMBL-EBI), following a systematic workflow to ensure high-quality protein family models. The process begins with literature review, in which curators scan primary sources such as peer-reviewed papers to identify novel domains, refine functional annotations, and update existing entries, particularly for domains of unknown function (DUFs). For instance, curators use tools such as Europe PMC to retrieve functional information, enabling the renaming of families such as DUF1709 to Anillin based on structural and functional evidence from the literature. Sequence validation follows, involving the selection of representative sequences from UniProtKB, multiple sequence alignment using tools like MAFFT, and iterative searching with jackhmmer to incorporate true homologs while excluding false positives. Hidden Markov models (HMMs) are then built using HMMER software, starting from the seed alignment and refined through multiple rounds until convergence, that is, until no new significant matches are added. Recent enhancements incorporate artificial intelligence and machine learning to automate initial curation steps, predict domain boundaries from predicted structures, and assist in characterizing DUFs, complementing manual review to improve efficiency and coverage. This workflow is tracked internally under version control to log changes and maintain an audit trail for reproducibility.

Quality control is integral to the curation process, emphasizing threshold setting and overlap resolution to minimize errors in family assignments. Curators manually define family-specific gathering thresholds (GAs), which are bit score cutoffs that distinguish true positives from false positives. These cutoffs are calibrated to achieve high specificity, with a median GA of approximately 22.1 bits across families. The thresholds are adjusted based on searches against UniProtKB and reference proteomes, aiming for maximal coverage without false positives; for example, between Pfam releases 24.0 and 26.0, 91% of the changed thresholds (affecting 13% of families) were raised to resolve issues. Since Pfam 28.0, the strict non-overlap rule for families that do not belong to the same clan has been relaxed to allow small overlaps only (e.g., <5% of sequence length); larger overlaps are resolved by raising GAs or refining boundaries using structural data. This process also involves cross-validation with collaborating databases like ECOD and RepeatsDB to verify annotations and domain extents. Deprecated families, often those with insufficient evidence or redundancy, are handled by "killing" them, meaning they are removed from active releases after review, as with the 8 families deprecated in the transition to Pfam 37.1.

Pfam maintains its database through regular updates, with major annual releases incorporating comprehensive curation efforts, such as the addition of 1,840 new families between releases 24.0 and 26.0, alongside minor interim patches for urgent fixes. Since integration with InterPro, releases have become more frequent, aligning with InterPro's bi-monthly schedule; for example, as of Pfam 37.3 (April 2025), updates include 348 new signatures and continued functional assignments for uncharacterized families, building on efforts like the 80 updates in release 37.1 (December 2024). Reproducibility practices mirror guidelines for comparable resources: Pfam employs documented pipelines with open-source tools (e.g., HMMER, MMseqs2) and provides downloadable seed alignments, HMMs, and flat files via FTP for verification and independent use. Community-suggested edits are occasionally incorporated but undergo rigorous internal validation before inclusion.
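The threshold and overlap checks described above can be illustrated with a minimal Python sketch. It is not Pfam's curation code. The `Hit` record, the family labels (`FAM_A`, `FAM_B`, `FAM_C`), the GA values, and the helper names are hypothetical, and the 5% limit is simply the example figure quoted in the text. For brevity the sketch ignores clan membership, so any two different families are treated as unrelated.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Hit:
    family: str       # hypothetical family label, not a real accession
    start: int        # 1-based inclusive start on the protein
    end: int          # 1-based inclusive end on the protein
    bit_score: float  # bit score of the match against the family HMM


def passes_gathering(hit, ga_by_family):
    """Keep a hit only if its bit score reaches its family's GA cutoff."""
    return hit.bit_score >= ga_by_family[hit.family]


def overlap_fraction(a, b, seq_len):
    """Residues shared by two hits, as a fraction of the sequence length."""
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    return max(shared, 0) / seq_len


def conflicting_pairs(hits, seq_len, max_fraction=0.05):
    """Pairs of hits from different families whose overlap reaches max_fraction."""
    return [
        (a, b)
        for a, b in combinations(hits, 2)
        if a.family != b.family
        and overlap_fraction(a, b, seq_len) >= max_fraction
    ]


# Illustrative values only: per-family gathering thresholds in bits.
ga_by_family = {"FAM_A": 22.1, "FAM_B": 25.0, "FAM_C": 22.1}
hits = [
    Hit("FAM_A", 1, 100, 30.0),
    Hit("FAM_B", 96, 200, 27.5),
    Hit("FAM_C", 50, 150, 10.0),  # scores below its GA, so it is discarded
]

kept = [h for h in hits if passes_gathering(h, ga_by_family)]
conflicts = conflicting_pairs(kept, seq_len=200)
print([h.family for h in kept])
print(conflicts)
```

In this toy case the two retained hits share 5 residues of a 200-residue sequence (2.5%), so the overlap is tolerated. Under the rule described above, a larger overlap would prompt a curator to raise one of the GAs or to trim a domain boundary.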

Community Contributions

The Pfam database actively incorporates input from the broader research community to enhance its curation of protein families, particularly through structured submission channels for new proposals and functional annotations. Researchers can propose new families by submitting data via the online Pfam helpdesk form, including sequence identifiers or sequence alignments (preferably from reference proteomes), suggested names, functional descriptions, supporting publications, and their ORCID identifiers for authorship credit. Contributions are integrated into Pfam releases after validation; small or species-specific families are excluded unless backed by strong evidence. Suggestions for annotating domains of unknown function (DUFs) follow a similar pathway, often via the helpdesk, allowing experts to provide hypotheses or experimental data to resolve uncharacterized entries.

A notable collaborative effort involves linking DUF entries to dedicated Wikipedia pages to crowdsource hypotheses and discussions on their potential functions from the global research community. These links direct users to a Wikipedia article for each DUF, fostering contributions from biologists worldwide and enabling Pfam curators to incorporate verified insights into official annotations. Through this mechanism, numerous DUFs have been functionally characterized and renamed, reducing the proportion of unannotated domains in the database over time.

Pfam also engages domain specialists through targeted partnerships, such as collaborations with structural biologists to define families based on structural data. For instance, a joint effort with researcher Joana Pereira led to the creation of 57 new Pfam families by identifying previously unrecognized folds in protein structures. Additional expert input comes from alliances such as the one with the NCBI, where community-provided alignments and functional annotations accelerate curation of pathogen-related domains.

Community contributions have substantially impacted Pfam's growth, with examples including DUF annotations derived from community insights that resolve long-standing unknowns. These external efforts complement internal curation, ensuring broader coverage of the protein universe while attributing credit to contributors via ORCID integration.

    Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam profile HMM represents a protein family or domain.Missing: iterative | Show results with:iterative