InterPro
InterPro is a freely accessible bioinformatics resource that classifies protein sequences into families, domains, and functional sites by integrating predictive models, known as signatures, from 13 specialized member databases, enabling comprehensive functional analysis of proteins.[1][2] Launched in 1999 by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) in collaboration with international partners, InterPro was established to consolidate disparate protein signature efforts from databases such as PROSITE, PRINTS, and Pfam, providing a unified platform for identifying shared protein features amid the rapid growth in genomic sequencing data.[3][4]
Over its history, InterPro has evolved into one of the most widely used tools for protein annotation, with its latest release (version 107.0, October 15, 2025) incorporating approximately 85,000 protein families and domains across member databases like CATH/Gene3D, PANTHER, PIRSF, SMART, and SUPERFAMILY, each contributing unique classification methods such as hidden Markov models, profiles, and patterns.[1][5][2] The database annotates more than 200 million protein sequences, achieving 84% coverage of UniProtKB (as of April 2025) and extensive mappings to resources including the Gene Ontology, Protein Data Bank (PDB), and AlphaFold structures for enhanced functional and structural insights.[2][4][6]
Key features include InterProScan, a versatile tool for scanning user-submitted sequences against InterPro signatures via web interface, API, or standalone software; visualization tools like Nightingale for domain architectures; and recent advancements such as AI-generated functional descriptions for thousands of entries and integration of disorder predictions from MobiDB Lite.[1][2] These capabilities make InterPro indispensable for researchers in genomics, proteomics, and structural biology, supporting applications from gene function prediction to evolutionary studies.
Overview
Definition and Purpose
InterPro is an integrative bioinformatics database that combines predictive models, known as signatures, from multiple specialized resources to classify proteins into families, domains, and functional sites. These signatures enable the identification of shared sequence features among proteins, facilitating the inference of functional and evolutionary relationships. By amalgamating data from various member databases, InterPro serves as a centralized platform for protein sequence analysis, reducing redundancy and enhancing the accuracy of predictions through cross-validation of methods.[7]
The primary purpose of InterPro is to support the functional annotation of proteins, particularly those with unknown or poorly characterized sequences, by detecting motifs and regions that indicate biological roles, structural components, and evolutionary conservation. This annotation process aids researchers in understanding protein function within broader biological contexts, such as signaling pathways, enzymatic activities, and molecular interactions. InterPro's approach emphasizes comprehensive coverage, providing insights that inform genomics, proteomics, and structural biology studies.[8]
At its core, InterPro employs signature-based classification, utilizing diverse computational models including hidden Markov models (HMMs) for domain detection, position-specific scoring matrices (PSSMs) or profiles for sequence alignments, and regular expression patterns for conserved motifs. These models are applied to query sequences to predict features in novel proteins, allowing for scalable analysis across large datasets. As of release 107.0 on 16 October 2025, InterPro integrates 113,612 signatures from member databases into 49,674 entries, annotating 163,355,728 protein sequences (81.8% coverage of UniProtKB).[1][8][9]
History and Development
InterPro was established in 1999 as an integrative resource through a consortium of protein signature databases, initially comprising Pfam, PROSITE, PRINTS, and ProDom, to provide unified functional annotations for protein sequences.[10] The project aimed to address the fragmentation in protein family classification by merging diverse signature methods into a single, curated framework. The beta version launched in October 1999, followed by the full release of version 1.0 in March 2000, marking the early integration of Pfam and PROSITE as core components for domain and motif prediction.[11]
During the 2000s, InterPro transitioned to a fully web-based platform, facilitating sequence searches and visualization, while expanding the consortium to include additional databases such as SMART, TIGRFAMs, and SUPERFAMILY.[12] InterPro has utilized UniProtKB (formerly Swiss-Prot and TrEMBL) as the primary source for protein sequences since its inception, aligning annotations with this comprehensive, centralized repository to ensure scalability and accuracy.[3] Development has progressed from manual curation of signatures by consortium members to automated pipelines for matching sequences against integrated models, enabling efficient large-scale analysis.[8]
Recent advancements include AI-driven enhancements introduced in version 105.0 in April 2025, which improved protein classification through machine learning-based annotations.[6] Version 107.0, released on 16 October 2025, further expanded PANTHER integration for subfamily predictions alongside updates to Pfam 38.0 and over 1,000 new entries.[1] These evolutions are documented in key publications, such as the 2025 Nucleic Acids Research paper, which highlights consortium expansions and enhancements in data quality and coverage.[8]
Database Content
Consortium Member Databases
InterPro integrates protein signatures from 13 member databases, each specializing in different methods for classifying protein families, domains, and functional sites, to provide a comprehensive resource for protein sequence analysis.[13][8] These databases contribute diverse signature types, such as hidden Markov models (HMMs), profiles, and patterns, which are synchronized in annual releases to ensure compatibility with InterPro's updates.[8]
Core member databases include those hosted by EMBL-EBI, such as Pfam (version 38.0, containing 25,545 families defined using HMMs for protein domain alignments) and contributions to PROSITE patterns for protein families and domains.[14][13] The Swiss Institute of Bioinformatics (SIB) provides PROSITE profiles and HAMAP, which offers manually curated profiles for conserved protein families in prokaryotes and eukaryotes.[13] The Protein Information Resource (PIR) contributes PIRSF, focusing on evolutionary classifications of protein families.[13]
Other key members encompass PANTHER (version 19.0, with 15,683 entries classifying functionally related protein subfamilies using HMMs and phylogenetic trees), the Conserved Domain Database (CDD) from NCBI (providing annotated alignment models for conserved domains), CATH-Gene3D (using Markov clustering for protein families and domains in genomes), SMART (for domain identification and analysis), NCBIfam (HMM-based models for protein families, including TIGRFAM models for microbial protein families), and SUPERFAMILY (HMMs based on SCOP structural classifications).[1][13] Additional contributors include MobiDB Lite (for protein disorder annotations), PRINTS (protein fingerprints from conserved motifs), and SFLD (sequence-structure-based enzyme classifications).[13]
Signatures from these databases undergo a rigorous manual curation and integration process in InterPro, where curators inspect and merge overlapping signatures to form non-redundant entries, minimizing redundancy while preserving the strengths of each source.[8] This curation ensures high-quality classifications; for instance, Pfam signatures are integrated to represent domain families via HMMs, PANTHER contributes functional subfamily details, and CDD supplies conserved domain models from NCBI resources.[8][15] As of version 107.0 (October 2025), this process supports over 107,000 protein families and domains across the member databases.[1]
Data Types and Entry Types
InterPro represents various protein features through distinct data types, including protein families that group proteins sharing a common evolutionary origin, domains as modular functional or structural units within proteins, repeats consisting of tandemly occurring sequence motifs such as coiled-coil structures, post-translational modification (PTM) sites where chemical alterations occur on amino acid residues, and binding sites involved in ligand or molecule interactions.[16]
These features are organized into an entry type hierarchy that reflects their biological significance and scope. At the broadest level, homologous superfamilies encompass diverse protein groups with shared tertiary structures but potentially divergent sequences, often identified using profile hidden Markov models. Families represent evolutionary groupings of proteins with related functions and sequence similarities, frequently forming hierarchies with subfamilies. Domains denote compact, independently folding regions that perform specific functions, such as the pleckstrin homology (PH) domain involved in phospholipid binding. Repeats capture short, recurring motifs like pentapeptide or coiled-coil patterns that contribute to protein architecture. Sites are the most localized type, including active sites comprising catalytic residues essential for enzymatic activity, binding sites for interactions with ligands or ions, conserved sites highlighting functionally important residues whose precise role is unknown, and PTM sites for modifications like phosphorylation. Additionally, regions address unstructured or flexible segments of proteins that lack a defined fold but may play regulatory roles.[16][17]
Each InterPro entry receives a unique identifier in the format IPR followed by a numeric code, such as IPR001909 for the KRAB domain, enabling precise referencing across databases. These entries are further enriched by mappings to Gene Ontology (GO) terms, which provide standardized annotations for molecular function, biological process, and cellular component, facilitating automated functional inference for proteins.[17][18]
Over 80% of InterPro entries are classified as domains or families, underscoring their prominence in protein annotation. Recent updates, starting from release 105.0 in 2025, have incorporated AI-driven methods to add hundreds of new entries, including those focused on predicted functional sites derived from models like AlphaFold and large language models such as GPT-4, with version 107.0 (October 2025) continuing these enhancements.[17][6][19]
Access Methods
Web Interface
The InterPro web interface provides the main online portal for users to browse, search, and analyze protein family and domain data without requiring programming knowledge, hosted at ebi.ac.uk/interpro since the database's launch in 1999.[1][3] This graphical platform integrates annotations from multiple member databases, enabling straightforward exploration of protein functions, such as domains and sites, through intuitive navigation and visualization tools.[8]
Key features include text-based searches using keywords, protein sequences in FASTA format, or specific accessions like UniProt IDs to identify matching entries and proteins.[20] Users can also browse the dataset organized by taxonomic lineages, domain architectures (combinations of protein domains), or structural classifications, allowing targeted discovery of evolutionary relationships and functional patterns. The sequence search functionality supports uploading FASTA files for on-the-fly annotation, leveraging integrated scanners to predict signatures against the full set of InterPro entries.[21]
Protein summary pages offer detailed views of individual proteins, displaying matched signatures from contributing databases, graphical depictions of domain layouts along the sequence, and associated Gene Ontology (GO) terms for inferred biological roles.[20] Entry pages for specific families or domains include hyperlinks to the originating member databases, along with curated evidence such as experimental validations or predictive methods supporting the classification. Release 107.0, dated October 15, 2025, includes AI-generated annotations for enhanced functional insights, building on advancements from release 105.0 in April 2025.[1][22]
Application Programming Interface (API)
The InterPro Application Programming Interface (API) is a RESTful web service that enables programmatic retrieval of protein family, domain, and functional site data in JSON format, supporting integration into bioinformatics workflows and third-party applications.[23] Introduced in late 2018, the API provides structured access to InterPro's curated and predicted annotations without requiring manual web interaction.[24] The API consists of six primary endpoints designed for targeted queries: /entry for detailed information on signatures and entries (e.g., by IPR ID or member database like Pfam), /protein for retrieving matches associated with a specific protein (e.g., by UniProt accession), /taxonomy for organism- or taxon-specific data, /structure for structural alignments, /set for curated sets of related entries, and /proteome for proteome-wide summaries.[25] Queries can be constructed by combining endpoint paths with identifiers, such as /entry/interpro/IPR000001 to fetch hierarchical details on a domain entry, including its type (e.g., family or domain), abstract, and cross-references.[26] Responses include metadata like entry hierarchies, match locations, evidence levels, and counters for associated proteins or structures, facilitating efficient data extraction.[27]
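The way endpoint paths combine with identifiers can be sketched in a few lines of Python. This is a minimal illustration, not the official client: the base URL is an assumption to be checked against the current API documentation, the helper name build_url is hypothetical, and the "uniprot" path segment and the accession P99999 are illustrative values chosen by analogy with the /entry/interpro/IPR000001 example above. The sketch only builds URLs and sends no requests.

```python
from urllib.parse import quote

# Assumed base URL for the public API; confirm against the current documentation.
BASE_URL = "https://www.ebi.ac.uk/interpro/api"

# The six top-level endpoints named in the text.
ENDPOINTS = {"entry", "protein", "taxonomy", "structure", "set", "proteome"}


def build_url(endpoint, *segments, base=BASE_URL):
    """Join an endpoint name and its path segments into a query URL.

    Hypothetical helper: it validates the endpoint name and percent-encodes
    each segment, but makes no network call.
    """
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    parts = [endpoint] + [quote(str(segment), safe="") for segment in segments]
    return base + "/" + "/".join(parts)


if __name__ == "__main__":
    # Path taken from the text: details for one InterPro entry.
    print(build_url("entry", "interpro", "IPR000001"))
    # Assumed pattern for a protein lookup by UniProt accession.
    print(build_url("protein", "uniprot", "P99999"))
```

Keeping URL construction separate from fetching makes it easy to add pauses between requests later, which fits the advice to avoid excessive request volumes.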
As of release 105.0 in April 2025, the API supports access to AI-predicted features, including over 1.8 billion neural network-based annotations from the InterPro-N model for protein function and structure.[6] Users are advised to avoid excessive or large-scale requests to prevent temporary service unavailability and ensure stability.[8] The API supports text-based searches across endpoints and batch processing for large-scale analyses, such as querying multiple proteins simultaneously, enabling seamless integration with resources like UniProt for sequence annotation pipelines or Ensembl for genomic context.[28]
Bulk Downloads
InterPro data is available for bulk download via FTP at ftp.ebi.ac.uk/pub/databases/interpro/, providing comprehensive files such as protein matches (e.g., TSV format with signature locations), entry hierarchies, and cross-references to member databases and external resources like UniProt and GO.[1][8] These downloads are synchronized with each release (e.g., version 107.0, October 2025), supporting offline analysis and integration into custom pipelines for large datasets. Files include over 200 million annotated sequences, with options for filtered subsets by taxonomy or entry type.
Analysis Tools
InterProScan
InterProScan is a standalone command-line software tool that enables the functional annotation of user-submitted protein sequences by scanning them against the predictive signatures compiled in the InterPro database. Developed by the European Bioinformatics Institute (EMBL-EBI), it integrates multiple sequence analysis algorithms from InterPro's member databases, running them in parallel to detect protein families, domains, repeats, and functionally important sites. For instance, it employs HMMER for hidden Markov model-based searches against Pfam and other profile-based resources, and RPS-BLAST (reverse position-specific BLAST) for searches against the Conserved Domain Database (CDD). This multi-method approach allows for robust, comprehensive predictions without relying on web-based services, making it suitable for high-throughput, offline analyses.[29][30]
Installation of InterProScan is straightforward but restricted to 64-bit Linux systems due to dependencies on third-party binaries; the package is distributed as a downloadable tar.gz archive from the EMBL-EBI FTP site, containing the core Java Archive (JAR) file, data files, and configuration scripts. Users unpack the archive, ensure Java 11 or higher is installed, and execute the tool via the command line with the interproscan.sh script. Input is accepted in standard FASTA format for protein sequences (or nucleotide sequences, which are internally translated), supporting batch processing of multiple entries. Outputs are produced in flexible formats such as XML for machine-readable parsing or tab-separated values (TSV) for tabular review, including details on match locations, aligned regions, bit scores, and e-values for assessing match significance. Optional parameters allow customization, such as selecting specific member databases or adjusting parallelism for multi-core systems.[31][32]
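A typical invocation can be sketched as a small Python helper that assembles the argument list for the interproscan.sh script. The flags used here (-i for input, -d for the output directory, -f for formats, -appl for member database analyses, -cpu for parallelism) reflect common InterProScan 5 usage and should be checked against the help output of the installed version; the function name and the file and directory names are hypothetical. The sketch builds the command without running it.

```python
import shlex


def build_interproscan_command(fasta_path, output_dir, formats=("tsv",),
                               applications=None, cpus=None,
                               script="./interproscan.sh"):
    """Assemble an InterProScan command line as a list of arguments.

    Hypothetical helper: flag names are assumptions based on common
    InterProScan 5 usage; verify them with the installed version's help.
    """
    cmd = [script, "-i", fasta_path, "-d", output_dir, "-f", ",".join(formats)]
    if applications:
        # Restrict the run to selected member database analyses.
        cmd += ["-appl", ",".join(applications)]
    if cpus is not None:
        # Number of cores made available to the run.
        cmd += ["-cpu", str(cpus)]
    return cmd


if __name__ == "__main__":
    command = build_interproscan_command(
        "proteins.fasta",          # hypothetical input file
        "results",                 # hypothetical output directory
        formats=("tsv", "xml"),
        applications=("Pfam", "CDD"),
        cpus=4,
    )
    # Print a shell-safe rendering; pass the list to subprocess.run to execute.
    print(shlex.join(command))
```

Returning a list rather than a single string lets the command be handed to subprocess.run without shell quoting problems.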
The most recent release, version 5.76-107.0, launched on October 16, 2025, incorporates updates synchronized with InterPro 107.0 and supports 14 distinct signature search methods derived from consortium member databases. This version enhances efficiency for large-scale tasks, capable of processing up to 10,000 protein sequences on standard hardware (e.g., a multi-core CPU with 16 GB RAM) within hours, depending on selected methods and input size. Performance optimizations include chunking large inputs into parallel jobs and optional pre-filtering to reduce computational load.[33]
At its core, InterProScan unifies disparate signature matches into coherent predictions by mapping them to InterPro entries (IPR terms), which represent curated groupings of similar signatures across databases. This integration resolves overlaps and redundancies, assigning a single IPR identifier to equivalent regions while reporting underlying member database hits. Confidence in predictions is gauged through statistical thresholds, such as e-value cutoffs (typically < 0.001 for high reliability), alongside score-based filtering to prioritize true positives over false matches. By providing these consolidated annotations, the tool facilitates downstream applications in proteomics, such as pathway inference or structural modeling, with traceable evidence from source signatures.[8][29]
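The consolidation step described above can be approximated on InterProScan's TSV output. The following sketch assumes the usual column layout (protein accession first, analysis fourth, signature accession fifth, start and stop seventh and eighth, score or e-value ninth, InterPro accession twelfth) and treats the ninth column as an e-value, which is a simplification because some analyses report a raw score there. The function name, the 0.001 default taken from the text, and the sample rows are illustrative only, and this is not how InterProScan itself resolves overlaps.

```python
from collections import defaultdict

# 0-based column positions in InterProScan TSV output (assumed layout).
PROTEIN, ANALYSIS, SIGNATURE, START, STOP, SCORE, INTERPRO = 0, 3, 4, 6, 7, 8, 11


def group_matches_by_entry(tsv_text, max_evalue=1e-3):
    """Group member database hits under their InterPro entry.

    Keeps rows that carry an InterPro accession and an e-value below
    max_evalue; rows without either are skipped.
    """
    grouped = defaultdict(list)
    for line in tsv_text.splitlines():
        row = line.split("\t")
        if len(row) <= INTERPRO:
            continue  # row has no InterPro columns
        ipr = row[INTERPRO]
        if not ipr or ipr == "-":
            continue  # signature not integrated into an InterPro entry
        try:
            evalue = float(row[SCORE])
        except ValueError:
            continue  # score column empty or "-"
        if evalue >= max_evalue:
            continue  # fails the significance cutoff
        hit = (row[ANALYSIS], row[SIGNATURE], int(row[START]), int(row[STOP]), evalue)
        grouped[(row[PROTEIN], ipr)].append(hit)
    return dict(grouped)


if __name__ == "__main__":
    # Made-up rows for illustration; not real InterProScan output.
    sample = "\n".join([
        "P1\tmd5\t200\tPfam\tPF00051\tdesc\t10\t90\t1.2E-20\tT\t01-01-2025\tIPR000001\tKringle",
        "P1\tmd5\t200\tSMART\tSM00130\tdesc\t12\t88\t3.0E-15\tT\t01-01-2025\tIPR000001\tKringle",
        "P1\tmd5\t200\tPfam\tPF99999\tdesc\t100\t150\t0.5\tT\t01-01-2025\tIPR999999\tWeak",
        "P1\tmd5\t200\tMobiDBLite\tmobidb-lite\tdesc\t1\t9\t-\tT\t01-01-2025\t-\t-",
    ])
    for key, hits in group_matches_by_entry(sample).items():
        print(key, hits)
```

Keeping the underlying member database hits alongside each IPR accession preserves the traceable evidence that the text describes.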