InterPro

InterPro is a freely accessible bioinformatics resource that classifies protein sequences into families, domains, and functional sites by integrating predictive models, known as signatures, from 13 specialized member databases. Launched in 1999 by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) in collaboration with international partners, InterPro was established to consolidate separate protein signature efforts from databases such as PROSITE, PRINTS, and Pfam into a unified platform for identifying shared protein features amid the rapid growth of genomic sequencing data. It has since become one of the most widely used resources for protein annotation. Its latest release (version 107.0, October 2025) covers approximately 85,000 protein families and domains across member databases such as CATH-Gene3D, PANTHER, PIRSF, SMART, and SUPERFAMILY, each contributing its own classification methods, including hidden Markov models, profiles, and patterns. The database annotates more than 200 million protein sequences, covering 84% of UniProtKB (as of April 2025), and maps extensively to external resources such as the Protein Data Bank (PDB). Key features include InterProScan, a tool for scanning user-submitted sequences against InterPro signatures through a web interface, programmatically, or as standalone software; visualization tools such as Nightingale for domain architectures; and recent additions such as AI-generated functional descriptions for thousands of entries and disorder predictions from MobiDB Lite. These capabilities support applications ranging from functional annotation to evolutionary studies.

Overview

Definition and Purpose

InterPro is an integrative bioinformatics database that combines predictive models, known as signatures, from multiple specialized resources to classify proteins into families, domains, and functional sites. These signatures identify features shared among proteins, which makes it possible to infer functional and evolutionary relationships. By amalgamating signatures from its member databases, InterPro serves as a centralized platform for protein analysis, reducing redundancy and improving the reliability of predictions by cross-checking one method against another. The primary purpose of InterPro is to support the functional annotation of proteins, particularly those with unknown or poorly characterized sequences, by detecting motifs and regions that indicate biological roles, structural components, and evolutionary relationships. This annotation helps researchers place proteins within broader biological contexts, such as signaling pathways, enzymatic activities, and molecular interactions. At its core, InterPro employs signature-based classification, using diverse computational models: hidden Markov models (HMMs) for domain detection, position-specific scoring matrices (PSSMs) or profiles for scoring sequence alignments, and patterns for conserved motifs. These models are applied to query sequences to predict features in novel proteins, allowing analysis to scale across large datasets. As of release 107.0 on 16 October 2025, InterPro integrates 113,612 signatures from member databases into 49,674 entries, annotating 163,355,728 protein sequences (81.8% coverage of UniProtKB).
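Of the three signature types mentioned above, patterns are the simplest to illustrate. The following is a minimal sketch, assuming the common PROSITE-style pattern syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repeats, `-` as separator). The function names are hypothetical, and the translation ignores rarer syntax such as anchors placed inside brackets; it is not how InterPro or its member databases implement pattern scanning.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")  # pattern must start at the N-terminus
        anchor_end = element.endswith(">")      # pattern must end at the C-terminus
        element = element.strip("<>")
        # Split off an optional repeat count such as (2) or (2,4).
        repeat = ""
        if element.endswith(")"):
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # any residue except these
        else:
            core = element                      # single residue or [..] set
        parts.append(("^" if anchor_start else "") + core + repeat
                     + ("$" if anchor_end else ""))
    return "".join(parts)


def find_motifs(pattern: str, sequence: str) -> list[int]:
    """Return 1-based start positions of non-overlapping pattern matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [match.start() + 1 for match in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-glycosylation-style motif scanned against a short, made-up peptide.
    print(find_motifs("N-{P}-[ST]-{P}.", "MNGTANVSK"))
```

HMMs and profiles differ from this sketch in that they assign position-specific scores rather than a yes/no match, which is why they can detect more divergent family members.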

History and Development

InterPro was established in 1999 as an integrative resource through a consortium of protein signature databases, initially comprising Pfam, PROSITE, PRINTS, and ProDom, to provide unified functional annotations for protein sequences. The project aimed to address the fragmentation of protein family classification by merging diverse signature methods into a single, curated framework. The beta version launched in October 1999, followed by the full release of version 1.0 in March 2000. During the 2000s, InterPro became a fully web-based platform offering sequence searches and visualization, while the consortium expanded to include additional databases such as TIGRFAMs and SUPERFAMILY. InterPro has used UniProtKB (formerly Swiss-Prot and TrEMBL) as its source of protein sequences from the outset, aligning its annotations with this comprehensive, centralized resource. Development has progressed from manual curation of signatures by consortium members to automated pipelines that match sequences against the integrated models, enabling efficient large-scale analysis. Recent advancements include the AI-driven enhancements introduced in version 105.0 in April 2025, which added machine learning-based annotations to protein classification. Version 107.0, released on 16 October 2025, further expanded subfamily predictions alongside an update to Pfam 38.0 and over 1,000 new entries. These developments are documented in key publications, such as the paper "InterPro: the protein sequence classification resource in 2025", which describes expansions and improvements in data quality and coverage.

Database Content

Consortium Member Databases

InterPro integrates protein signatures from 13 member databases, each specializing in different methods for classifying protein families, domains, and functional sites, to provide a comprehensive resource for protein classification. These databases contribute diverse signature types, such as hidden Markov models (HMMs), profiles, and patterns, which are synchronized with each InterPro release. Core member databases include those hosted by EMBL-EBI, such as Pfam (version 38.0, containing 25,545 families defined using HMMs built from protein domain alignments), and PROSITE patterns for protein families and domains. The Swiss Institute of Bioinformatics (SIB) provides PROSITE profiles and HAMAP, which offers manually curated profiles for conserved protein families in prokaryotes and eukaryotes. The Protein Information Resource (PIR) contributes PIRSF, which focuses on evolutionary classification of protein families. Other key members are PANTHER (version 19.0, with 15,683 entries classifying functionally related protein subfamilies using HMMs and phylogenetic trees), the Conserved Domain Database (CDD) from NCBI (annotated alignment models for conserved domains), CATH-Gene3D (Markov clustering of protein families and domains in genomes), SMART (domain identification and analysis), NCBIfam (HMM-based models for protein families, including TIGRFAM models for microbial protein families), and SUPERFAMILY (HMMs based on structural classifications). Additional contributors include MobiDB Lite (protein disorder annotations), PRINTS (protein fingerprints built from conserved motifs), and SFLD (sequence- and structure-based enzyme classifications). Signatures from these databases undergo a rigorous curation and integration process in InterPro, in which curators inspect and merge overlapping signatures to form non-redundant entries while preserving the strengths of each source.
This curation ensures high-quality classifications; for instance, Pfam signatures are integrated to represent families via HMMs, PANTHER contributes functional subfamily details, and CDD supplies conserved domain models from NCBI resources. As of version 107.0 (October 2025), this process supports over 107,000 protein families and domains across the member databases.

Data Types and Entry Types

InterPro represents protein features through distinct data types: protein families, which group proteins sharing a common evolutionary origin; domains, which are modular functional or structural units within proteins; repeats, which consist of tandemly occurring sequence motifs such as coiled-coil structures; post-translational modification (PTM) sites, where chemical alterations occur on residues; and binding sites involved in molecular interactions. These features are organized into an entry type hierarchy that reflects their biological significance and scope. At the broadest level, homologous superfamilies encompass diverse protein groups with shared tertiary structures but potentially divergent sequences, often identified using profile hidden Markov models. Families represent evolutionary groupings of proteins with related functions and sequence similarity, and frequently form hierarchies with subfamilies. Domains denote compact, independently folding regions that perform specific functions, such as the pleckstrin homology (PH) domain, which binds phosphoinositides. Repeats capture short, recurring motifs, such as pentapeptide or coiled-coil patterns, that contribute to protein architecture. Sites are the most localized type and include active sites comprising the catalytic residues essential for enzymatic activity, binding sites for interactions with ligands or ions, conserved sites marking functionally important residues whose precise role is unknown, and PTM sites for modifications such as phosphorylation. Additionally, regions cover unstructured or flexible segments of proteins that lack a defined fold but may play regulatory roles. Each InterPro entry receives a unique identifier consisting of IPR followed by a numeric code, such as IPR001909 for the KRAB domain, enabling precise referencing across databases.
These entries are further enriched by mappings to Gene Ontology (GO) terms, which provide standardized annotations for molecular function, biological process, and cellular component, enabling automated functional inference for proteins. Over 80% of InterPro entries are classified as domains or families, underscoring their prominence in protein classification. Starting with release 105.0 in 2025, updates have used AI-driven methods to add hundreds of new entries, including entries focused on predicted functional sites derived from AI models, among them large language models; version 107.0 (October 2025) continues these enhancements.
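The identifier format described above lends itself to a quick validity check before querying. The sketch below assumes that the numeric code is always six digits, as in IPR001909; the helper names are hypothetical and chosen for illustration only.

```python
import re

# Assumption: an InterPro accession is "IPR" followed by exactly six digits.
IPR_PATTERN = re.compile(r"IPR\d{6}")


def is_interpro_accession(value: str) -> bool:
    """Return True if the string looks like an InterPro entry accession."""
    return IPR_PATTERN.fullmatch(value) is not None


def format_interpro_accession(number: int) -> str:
    """Build an accession from its numeric part, zero-padded to six digits."""
    if not 0 < number < 1_000_000:
        raise ValueError("numeric part must fit in six digits")
    return f"IPR{number:06d}"


if __name__ == "__main__":
    print(is_interpro_accession("IPR001909"))
    print(format_interpro_accession(1909))
```

A check like this only confirms the shape of the identifier; whether the entry actually exists still has to be confirmed against the database.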

Access Methods

Web Interface

The InterPro web interface is the main online portal for browsing, searching, and analyzing protein family and domain data without programming knowledge, hosted at ebi.ac.uk/interpro since the database's launch in 1999. This graphical platform integrates annotations from the member databases, so that protein features such as domains and sites can be explored through navigation and visualization tools. Key features include text-based searches using keywords, protein sequences in FASTA format, or specific accessions to identify matching entries and proteins. Users can also browse the dataset by taxonomic lineage, domain architecture (the combination of domains within a protein), or structural classification, allowing targeted discovery of evolutionary relationships and functional patterns. The sequence search supports uploading FASTA files for on-the-fly annotation, using integrated scanners to predict matches against the full set of InterPro entries. Protein summary pages give a detailed view of an individual protein, showing matched signatures from the contributing databases, a graphical layout of domains along the sequence, and associated GO terms for inferred biological roles. Entry pages for specific families or domains include hyperlinks to the originating member databases, along with curated evidence, such as experimental validations or predictive methods, supporting the classification. Release 107.0 (October 2025) includes AI-generated annotations, building on the advancements of release 105.0 in April 2025.

Application Programming Interface (API)

The InterPro Application Programming Interface (API) is a RESTful service that enables programmatic retrieval of protein family, domain, and functional site data in JSON format, supporting integration into bioinformatics workflows and third-party applications. It provides structured access to InterPro's curated and predicted annotations without manual web interaction. The API consists of six primary endpoints designed for targeted queries: /entry for detailed information on signatures and entries (e.g., by IPR ID or by member database such as Pfam), /protein for the matches associated with a specific protein (e.g., by UniProt accession), /taxonomy for organism- or taxon-specific data, /structure for structural data, /set for curated sets of related entries, and /proteome for proteome-wide summaries. Queries are constructed by combining endpoint paths with identifiers, such as /entry/interpro/IPR000001 to fetch details of a single entry, including its type (e.g., family or domain), abstract, hierarchy, and cross-references. Responses include metadata such as entry hierarchies, match locations, evidence levels, and counters for associated proteins or structures. As of release 105.0 in April 2025, the API supports access to AI-predicted features, including over 1.8 billion neural network-based annotations from the InterPro-N model. Users are advised to avoid excessive or very large requests, which can cause temporary service unavailability. The API also supports text-based searches across endpoints and large-scale analyses, such as querying many proteins, enabling integration with resources like UniProt for sequence annotation pipelines or Ensembl for genomic context.
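The endpoint-plus-identifier scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not an official client: the base URL `https://www.ebi.ac.uk/interpro/api` is an assumption not stated in the text above, and `build_url` and `fetch_json` are hypothetical helper names. Only the six endpoint names and the `/entry/interpro/IPR000001` path come from the description.

```python
import json
from urllib.request import Request, urlopen

# Assumption: public base URL of the InterPro REST API.
BASE_URL = "https://www.ebi.ac.uk/interpro/api"

# The six primary endpoints named in the text.
ENDPOINTS = {"entry", "protein", "taxonomy", "structure", "set", "proteome"}


def build_url(endpoint: str, *path_parts: str) -> str:
    """Join an endpoint and its identifiers into a request URL."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    return BASE_URL + "/" + "/".join([endpoint, *path_parts]) + "/"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Perform a single GET request and decode the JSON response."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only builds the URL; call fetch_json(url) to actually query the service.
    print(build_url("entry", "interpro", "IPR000001"))
```

Keeping URL construction separate from the network call makes it easy to throttle or cache requests, in line with the advice to avoid excessive request volumes.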

Bulk Downloads

InterPro data is available for bulk download via FTP at ftp.ebi.ac.uk/pub/databases/interpro/, which provides comprehensive files such as protein matches (e.g., in TSV format with signature locations), entry hierarchies, and cross-references to member databases and external resources such as GO. These downloads are synchronized with each release (e.g., version 107.0, October 2025), supporting offline analysis and integration into custom pipelines for large datasets. The files cover over 200 million annotated sequences, with options for subsets filtered by entry type.
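A tab-separated match file of this kind can be streamed line by line rather than loaded whole. The sketch below assumes a six-column layout (protein accession, InterPro accession, entry name, signature accession, start, end); that layout, the `Match` type, and the function names are assumptions for illustration, and the sample rows are made up. Check the release's own documentation for the actual columns before relying on it.

```python
import csv
import io
from collections import defaultdict
from typing import Iterable, Iterator, NamedTuple


class Match(NamedTuple):
    protein: str
    entry: str
    name: str
    signature: str
    start: int
    end: int


def read_matches(handle: Iterable[str]) -> Iterator[Match]:
    """Yield one Match per well-formed row of an assumed six-column TSV."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 6:
            continue  # skip blank or truncated lines
        yield Match(row[0], row[1], row[2], row[3], int(row[4]), int(row[5]))


def entries_by_protein(matches: Iterable[Match]) -> dict[str, set[str]]:
    """Group InterPro entry accessions by protein accession."""
    grouped: dict[str, set[str]] = defaultdict(set)
    for match in matches:
        grouped[match.protein].add(match.entry)
    return dict(grouped)


if __name__ == "__main__":
    # Made-up rows in the assumed layout, standing in for a downloaded file.
    sample = io.StringIO(
        "P00001\tIPR000001\tKringle\tPF00051\t10\t90\n"
        "P00001\tIPR000001\tKringle\tSM00130\t8\t92\n"
        "P00002\tIPR001909\tKRAB\tPF01352\t5\t45\n"
    )
    print(entries_by_protein(read_matches(sample)))
```

Because `read_matches` is a generator, the same code works on a file handle opened over a multi-gigabyte download without holding it in memory.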

Analysis Tools

InterProScan

InterProScan is a standalone command-line tool for functionally annotating user-submitted protein sequences by scanning them against the predictive signatures compiled in the InterPro database. Developed by the European Bioinformatics Institute (EMBL-EBI), it integrates the sequence analysis algorithms of InterPro's member databases and runs them in parallel to detect protein families, domains, repeats, and functionally important sites. For instance, it uses HMMER for hidden Markov model-based searches against Pfam and other HMM-based resources, and RPS-BLAST for searches against the Conserved Domain Database (CDD). This multi-method approach allows comprehensive predictions without relying on web-based services, making it suitable for high-throughput, offline analyses. InterProScan runs only on 64-bit Linux systems because of its dependencies on third-party binaries. The package is distributed as a downloadable tar.gz archive from the EMBL-EBI FTP site, containing the core Java Archive (JAR) file, data files, and configuration scripts. Users unpack the archive, ensure that Java 11 or higher is installed, and run the tool from the command line with the interproscan.sh script. Input is accepted in standard FASTA format for protein sequences (or nucleotide sequences, which are translated internally), and a single file may contain multiple entries. Output is produced in formats such as XML for machine-readable parsing or tab-separated values (TSV) for tabular review, including match locations, aligned regions, bit scores, and e-values for assessing match significance. Optional parameters allow customization, such as selecting specific member databases or adjusting parallelism on multi-core systems. The most recent release, version 5.76-107.0, launched on October 16, 2025, is synchronized with InterPro 107.0 and supports 14 distinct signature search methods derived from consortium member databases.
This version is suited to large-scale tasks: it can process up to 10,000 protein sequences on standard hardware (e.g., a multi-core CPU with 16 GB of RAM) within hours, depending on the selected methods and input size. Optimizations include chunking large inputs into jobs and optional pre-filtering to reduce computational load. At its core, InterProScan unifies disparate signature matches into coherent predictions by mapping them to InterPro entries (IPR identifiers), which represent curated groupings of similar signatures across databases. This integration resolves overlaps and redundancies, assigning a single IPR identifier to equivalent regions while still reporting the underlying member database hits. Confidence in predictions is gauged through statistical thresholds, such as e-value cutoffs (typically < 0.001 for high reliability), alongside score-based filtering to prioritize true positives over false matches. With these consolidated annotations, the tool supports downstream applications such as pathway inference or structural modeling, with evidence traceable to the source signatures.
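The workflow described above, running interproscan.sh and then filtering matches by e-value, can be sketched as follows. The options `-i`, `-f`, `-o`, `-cpu`, and `-appl` are InterProScan 5 command-line options; the function names, the default script path, and the TSV column positions (analysis in column 4, signature accession in column 5, start and stop in columns 7 and 8, score in column 9, InterPro accession in column 12 when present) are assumptions for this sketch and should be checked against the documentation of the installed version.

```python
from typing import Iterable, Iterator, Optional, Sequence


def build_command(fasta_path: str, output_path: str,
                  applications: Optional[Sequence[str]] = None,
                  cpus: int = 4,
                  script: str = "./interproscan.sh") -> list[str]:
    """Assemble an InterProScan command line that writes TSV output."""
    command = [script, "-i", fasta_path, "-f", "tsv",
               "-o", output_path, "-cpu", str(cpus)]
    if applications:
        # Restrict the run to selected member databases, e.g. Pfam and SMART.
        command += ["-appl", ",".join(applications)]
    return command


def significant_matches(tsv_lines: Iterable[str],
                        max_evalue: float = 1e-3) -> Iterator[dict]:
    """Yield matches whose e-value is below the cutoff.

    Rows without a numeric score (some methods report "-") are skipped.
    """
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        try:
            evalue = float(fields[8])
        except ValueError:
            continue
        if evalue < max_evalue:
            yield {
                "protein": fields[0],
                "analysis": fields[3],
                "signature": fields[4],
                "start": int(fields[6]),
                "end": int(fields[7]),
                "evalue": evalue,
                "interpro": fields[11] if len(fields) > 11 else None,
            }


if __name__ == "__main__":
    # Pass the list to subprocess.run(command, check=True) on a machine
    # where InterProScan is installed; here it is only printed.
    print(build_command("proteins.fasta", "proteins.tsv", ["Pfam", "SMART"], cpus=8))
```

Skipping rows without a numeric score is a deliberate simplification: pattern-based methods do not report e-values, so a real pipeline would need a separate rule for keeping or discarding them.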

Visualization and Integration Tools

InterPro provides several visualization and integration tools for displaying its protein classification data and incorporating it into broader bioinformatics workflows. The Nightingale library, a JavaScript-based library of reusable web components, serves as the core toolkit for rendering protein-related visualizations in web applications. It incorporates ProtVista components originally from the UniProt project, enabling efficient browser-based rendering of complex protein features. Introduced in 2020 as part of the InterPro protein viewer update, Nightingale has been adapted to handle large-scale sequence data with improved performance and maintainability. Key rendering capabilities of Nightingale and ProtVista include domain architectures, which depict the modular arrangement of protein domains along a sequence; sequence alignments, which highlight conserved regions and matches to InterPro signatures; and trees, which illustrate hierarchical functional annotations. These components prioritize representative domain selections that maximize coverage while minimizing overlap, helping users interpret protein function and evolution. In the 2025 release (InterPro 107.0), the protein viewer was updated to the latest version of Nightingale, with enhanced integration of predicted structures and their per-residue confidence scores (pLDDT) and new tracks for features such as short linear motifs and intrinsically disordered regions from DisProt. For integration, InterPro links with resources such as UniProt for detailed protein views, Ensembl for genomic context, and PDBe for structural data, allowing users to move between sequence classifications and experimental structures. Export options include compatibility with Cytoscape for network analysis of protein interactions derived from InterPro annotations. These tools draw on InterPro's API for programmatic access, which simplifies their use in custom pipelines.
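The idea of picking representative domains that cover as much of a sequence as possible without overlapping can be illustrated with a simple greedy heuristic. This is only a sketch of the general technique under the assumption that longer matches are preferred; it is not the selection procedure InterPro or Nightingale actually uses, and the `Domain` type and function name are hypothetical.

```python
from typing import Iterable, NamedTuple


class Domain(NamedTuple):
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    name: str


def select_representatives(domains: Iterable[Domain]) -> list[Domain]:
    """Greedily keep the longest domains that do not overlap earlier picks."""
    chosen: list[Domain] = []
    # Longest first; ties broken by start position for a stable result.
    for candidate in sorted(domains, key=lambda d: (-(d.end - d.start + 1), d.start)):
        overlaps = any(candidate.start <= kept.end and kept.start <= candidate.end
                       for kept in chosen)
        if not overlaps:
            chosen.append(candidate)
    return sorted(chosen, key=lambda d: d.start)


if __name__ == "__main__":
    # Made-up matches on a 200-residue protein.
    matches = [Domain(1, 100, "A"), Domain(50, 120, "B"), Domain(110, 200, "C")]
    print([d.name for d in select_representatives(matches)])
```

A greedy choice is not guaranteed to give the maximum possible coverage; weighted interval scheduling would, at the cost of a slightly longer implementation.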
