PROSITE

PROSITE is a database of protein families, domains, and functional sites. It supports the identification and annotation of these elements in protein sequences through biologically meaningful signatures, namely patterns and profiles. It is a key bioinformatics resource for inferring the function of uncharacterized proteins by matching their sequences against curated motifs derived from conserved regions. Developed by the SIB Swiss Institute of Bioinformatics, PROSITE lets researchers group proteins by shared evolutionary ancestry and functional attributes.

Amos Bairoch initiated PROSITE in 1988 at the University of Geneva as a way to catalog biologically significant patterns for protein function prediction, and the first public release followed shortly thereafter. Over the decades it has evolved under the stewardship of the SIB Swiss Institute of Bioinformatics and has been integrated with major protein databases such as UniProtKB. Updates are released every 8 weeks in synchronization with UniProtKB, which keeps PROSITE aligned with the latest protein sequence data and brings in new families, refined signatures, and enhanced documentation.

The database's core content consists of documentation entries that describe protein domains, families, and sites. Each entry is associated with signatures in the form of patterns (regular expressions capturing short, highly specific motifs) or profiles (position-specific scoring matrices for detecting longer, more divergent domains). As of release 2025_04 (dated 15 October 2025), PROSITE includes 1956 documentation entries, 1311 patterns, 1403 profiles, and 1432 ProRules, which are logical rules that improve the specificity and sensitivity of signature matches. These elements are manually curated from peer-reviewed literature and experimental data, prioritizing high-confidence signatures that cover a significant portion of known proteins.

PROSITE is widely used through tools such as ScanProsite for sequence scanning and MyDomains for visualizing domain architectures, which aid functional annotation, evolutionary studies, and hypothesis generation in protein research. It is integrated with resources such as InterPro for comprehensive protein classification, which extends its utility in large-scale genomic projects. By providing reliable, interpretable signatures, PROSITE remains a foundational tool for deciphering protein diversity and function in the post-genomic era.

Overview

Definition and Purpose

PROSITE is a specialized database and method for detecting biologically meaningful signatures in protein sequences, enabling the inference of protein function or evolutionary relationships. It compiles documentation on protein domains, families, and functional sites, along with associated signatures in the form of patterns and profiles that can be used to identify these elements in query sequences.

The primary purpose of PROSITE is to facilitate the annotation of uncharacterized proteins derived from genomic or cDNA sequencing projects by matching their sequences against known motifs, thereby predicting potential functions or classifications. This approach allows researchers to classify proteins automatically into specific families or to detect critical functional sites, such as active sites or binding regions, based on conserved sequence features.

Created in 1988 as a tool for protein sequence analysis, PROSITE has become an integral part of the bioinformatics ecosystem of the SIB Swiss Institute of Bioinformatics, supporting both automated and manual analyses. Its two signature types are patterns (regular expressions capturing short conserved motifs) and profiles (position-specific scoring matrices for more distant similarities). Together they enable reliable detection of evolutionary and functional relationships.

Key Components

PROSITE entries are structured around several core elements that enable the identification and annotation of protein motifs, domains, and functional sites. These include textual documentation providing biological context, patterns as regular expressions for detecting short motifs, profiles as position-specific scoring matrices (PSSMs) for recognizing longer protein domains, and ProRules as logical rules that combine multiple criteria to refine predictions of functional sites.

Patterns in PROSITE represent short, highly conserved sequence motifs, typically 3 to 30 amino acids long, and are expressed in a specialized syntax that accounts for allowed residues, gaps, and exclusions. Elements are separated by hyphens. The syntax uses square brackets for alternative residues (e.g., [AC] for alanine or cysteine), 'x' for any residue with 'x(n)' for n consecutive positions of any residue, and curly braces for exclusions (e.g., {P} to exclude proline). A representative example is the N-glycosylation site pattern, N-{P}-[ST]-{P}: asparagine (N) is the modification site, followed by any residue except proline, then serine or threonine, then any residue except proline. This motif is common in eukaryotic proteins, where it marks sites for attaching N-linked glycans. Patterns are designed for high specificity and sensitivity and are validated against known protein sequences to minimize false positives.

Profiles are more flexible signatures for extended protein regions. They are PSSMs derived from alignments of related sequences and score potential matches using position-specific residue frequencies. Unlike rigid patterns, profiles allow fuzzy matching similar to hidden Markov models (HMMs), accommodating variations in length and sequence divergence through logarithmic scoring and cut-offs (e.g., a stringent cut-off for reliable hits and a lower one for exploratory matches). For instance, a profile for the HSP20 domain uses a scoring matrix to evaluate alignments across heat shock proteins, enabling detection of distant homologs. Profiles are particularly useful for globular domains where sequence conservation is moderate.

ProRules extend the utility of patterns and profiles by adding logical conditions, such as requiring specific residue contexts or structural features, to predict functional sites more accurately. For example, a ProRule might combine a pattern with proximity to a catalytic residue. Written in the UniRule format, these rules are triggered only when the underlying signatures match, which improves annotation precision in automated pipelines.

Supporting these signatures are data structures that provide context and reliability assessments. They include the taxonomic scope, which indicates the applicable organism groups (e.g., restricted to eukaryotes via qualifiers like /TAXO-RANGE=E?); cross-references to databases such as UniProtKB for validated examples and PDB for structural data; and evidence levels denoting the strength of annotations (e.g., experimental versus computational evidence, often quantified by positive hit counts in curated sequences). These elements ensure that PROSITE signatures are biologically grounded and interoperable with other resources.
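The pattern syntax maps closely onto ordinary regular expressions. The following is a minimal sketch of that translation, assuming only the syntax elements described in this section plus the '<' and '>' terminus anchors; the function names `prosite_to_regex` and `find_sites` are illustrative and are not part of any PROSITE tool, and rarer constructs (such as anchors placed inside square brackets) are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a Python regex."""
    pattern = pattern.rstrip(".")  # patterns in the flat file end with a period
    parts = []
    for element in pattern.split("-"):
        # '<' pins the match to the N-terminus, '>' to the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off an optional repeat count, e.g. x(2) or x(2,4).
        core, _, repeat = element.partition("(")
        repeat = "{" + repeat.rstrip(")") + "}" if repeat else ""
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # excluded residues
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_sites(pattern: str, sequence: str) -> list[int]:
    """Return 1-based start positions of all matches, including overlapping ones."""
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [m.start() + 1 for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_sites("N-{P}-[ST]-{P}", "MNGTANPSANASG"))
```

The zero-width lookahead is used so that overlapping occurrences are all reported; a plain `finditer` over the translated expression would skip a site that begins inside a previous match.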

History and Development

Origins and Creation

PROSITE was created in 1988 by Amos Bairoch, a bioinformatician at the Department of Medical Biochemistry, University of Geneva, as an early tool for protein sequence analysis in the emerging field of bioinformatics. Bairoch, who had initiated the Swiss-Prot protein sequence database in 1986, recognized the limitations of basic sequence comparisons and sought to develop a specialized resource for detecting biologically significant features in proteins.

The primary motivation for PROSITE was the rapid accumulation of protein sequence data in the late 1980s, particularly in databases like Swiss-Prot, which by 1988 contained thousands of entries but lacked efficient methods for identifying functional motifs shared across related proteins. Bairoch aimed to address this by compiling patterns derived from Swiss-Prot annotations, enabling the systematic detection of protein families, domains, and functional sites and the discovery of new members among uncharacterized sequences. This approach was essential in an era when manual curation was predominant and automated tools for motif recognition were scarce.

PROSITE's first release occurred in March 1988. It was distributed via the PC/Gene software package from IntelliGenetics and included a modest collection of 58 manually curated patterns extracted from the scientific literature, each accompanied by a descriptive abstract outlining the associated protein family or domain. Early development faced significant challenges, including the limited computational power of personal computers at the time, which restricted the scope and sophistication of pattern searches. Moreover, the initial patterns depended on exact sequence matching, which made them vulnerable to false negatives caused by sequence variations, errors, or evolutionary divergence, a limitation that persisted until the later adoption of profile-based methods.

Evolution and Milestones

In the 1990s, PROSITE was expanded to address the limitations of its initial pattern-based approach, particularly for detecting variable protein domains. In 1994, Philipp Bucher introduced generalized profiles, which represent more flexible motifs through position-specific scoring matrices (PSSMs) and capture sequence conservation and variability more effectively than rigid patterns. This innovation allowed PROSITE to handle diverse family alignments and improved sensitivity for distant homologs. From its inception, integration with the Swiss-Prot database (now UniProtKB/Swiss-Prot) has facilitated automated annotation: PROSITE signatures are cross-referenced during Swiss-Prot curation to annotate protein functions directly, which enhances the database's utility for large-scale sequence analysis.

The 2000s brought further milestones in PROSITE's methodology and scale. In 2005, the ProRule system was added, providing rule-based predictions that generate precise functional annotations from profile or pattern matches, such as post-translational modifications or active sites. By that year, PROSITE had surpassed 1,000 documentation entries. The database marked its 20-year anniversary in 2008, at which point release 20.19 covered 53% of UniProtKB/Swiss-Prot entries.

Around 2008, PROSITE's development was formally placed under the stewardship of the SIB Swiss Institute of Bioinformatics, strengthening its integration with other SIB resources such as UniProtKB. Through the 2010s and 2020s, PROSITE refined its signature detection through methodological advances and broader interoperability. Alignment with InterPro since the late 1990s has enabled hierarchical organization of domains within protein families, so that PROSITE signatures contribute to integrated views of evolutionary relationships across multiple databases. As of release 2025_04 in October 2025, PROSITE comprises over 1,900 documentation entries.

Taken together, these developments represent a shift from a pattern-only system to a hybrid framework combining patterns, profiles, and rules, which has broadened PROSITE's applicability. Open access via the ExPASy server, established in the 1990s, has supported global usage and continuous updates.

Database Content

Entry Formats

PROSITE entries follow a standardized format designed for clarity and machine readability. Each entry begins with an ID line that provides the entry name and type, for example "PROTEIN_KINASE_DOM" with the type MATRIX for a profile (pattern entries use the type PATTERN). This is followed by the accession number (AC), a unique identifier like PS50011; a description line (DE) summarizing the motif or domain, e.g., "Protein kinase domain profile."; and specific signature lines such as PA for patterns (using IUPAC codes and qualifiers like x for any residue) or MA for profile matrices. Additional lines include PR for associated ProRules (logical validation rules, e.g., PRU00159), RU for numerical performance results from scans against UniProtKB/Swiss-Prot, and DR for cross-references to external databases such as UniProtKB entries. Other sections cover comments (CC) for evidence and taxonomic range via /TAXO-RANGE qualifiers, documentation references (DO) linking to PDOC entries, and the terminator "//".

The primary distribution format is a flat-file, text-based representation that is human-readable and structured with fixed two-character line codes followed by content, limited to 78 characters per line except for matrix data. This format, contained in files such as prosite.dat, prosite.doc for documentation, and profile.dat for matrices, enables bulk downloads from the ExPASy FTP server (ftp.expasy.org/databases/prosite/) and supports parsing by bioinformatics tools. Since the 2000s, PROSITE data has also been accessible in XML for structured programmatic querying, particularly through integrations like UniProt's XML exports that embed PROSITE annotations, and in RDF for semantic web applications via projects like Bio2RDF.

A representative example is the entry for the protein kinase domain (PS50011), which illustrates the format's organization (as of release 2025_04):
ID   PROTEIN_KINASE_DOM MATRIX; PRF; 259 aa; matrix.
AC   PS50011;
DE   Protein kinase domain profile.
DO   PDOC00100;
CC   -!- MATRIX_TYPE: protein_domain;
CC   -!- TAXO-RANGE: Archaea; Bacteria; Eukaryota; Eukaryotic viruses.
CC   -!- AUTHOR: P.Bucher
PR   PRU00159;
RU   True positives: 4504 (4438 sequences); False positives: 11; False negatives: 243.
DR   UNI; Q6GZV6; Q197B6; ... (4438 true positive sequences); B9DGY1; Q93Y08; ... (243 false negatives); P58551; Q9KVB9; ... (11 false positives).
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ*'; LENGTH=259;
MA   [Excerpt of position-specific scores, e.g., row 1: A= 0 B=-5 ... Z=-5]
This breakdown shows the profile matrix (via the MA excerpt), the ProRule linkage, and the cross-references, which together facilitate annotation of kinase-related sequences. Because every line starts with a fixed two-character code, search tools can parse signatures and metadata directly.
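Because the flat file is line-oriented, a small reader is enough to group lines by their two-character code. The sketch below assumes only the layout described above (a two-character code, then content, with "//" closing each entry); the function name `parse_prosite_entries` is illustrative, and maintained parsers such as the one in Biopython's Bio.ExPASy.Prosite module are likely a better choice for production use.

```python
from collections import defaultdict


def parse_prosite_entries(text: str) -> list[dict[str, list[str]]]:
    """Group the lines of each flat-file entry by their two-character line code."""
    entries = []
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):          # entry terminator
            if current:
                entries.append(dict(current))
            current = defaultdict(list)
            continue
        if len(line.strip()) < 2:          # skip blank or malformed lines
            continue
        code, content = line[:2], line[2:].strip()
        current[code].append(content)
    if current:                            # tolerate a missing final terminator
        entries.append(dict(current))
    return entries


SAMPLE = """\
ID   PROTEIN_KINASE_DOM MATRIX; PRF; 259 aa; matrix.
AC   PS50011;
DE   Protein kinase domain profile.
DO   PDOC00100;
CC   -!- MATRIX_TYPE: protein_domain;
CC   -!- AUTHOR: P.Bucher
PR   PRU00159;
//
"""

if __name__ == "__main__":
    entry = parse_prosite_entries(SAMPLE)[0]
    print(entry["AC"][0].rstrip(";"), "-", entry["DE"][0])
```

Repeated codes (such as the two CC lines) are kept in file order as a list, which preserves multi-line fields like the matrix without any special handling.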

Pattern and Profile Types

PROSITE signatures are categorized into patterns and profiles, each designed to detect specific biological features in protein sequences. Patterns represent short, conserved motifs using regular-expression-like notation, while profiles employ position-specific scoring matrices (PSSMs) or more advanced generalized profiles to model entire protein domains or families. These signature types target distinct biological entities, such as functional sites, structural domains, evolutionary families, and tandem repeats, enabling the identification of protein functions and relationships.

Patterns in PROSITE are qualitative descriptors that match sequences based on exact or fuzzy criteria, often using IUPAC ambiguity codes and operators like 'x' for any residue or '{' and '}' for exclusions. They are particularly suited to highly conserved, short motifs, such as catalytic sites or sites of post-translational modification, where high specificity is crucial to minimize false positives. For instance, a pattern for the cutinase active site is expressed as P-x-[STA]-x-[LIV]-[IVT]-x-[GS]-G-Y-S-[QL]-G, which detects the precise arrangement of residues essential for enzymatic activity. Patterns for short functional sites, such as phosphorylation motifs (e.g., [ST]-x-[RK] for protein kinase C targets), match many sequences by chance, so their hits need to be interpreted with care, whereas patterns for longer motifs can use more stringent criteria. This approach balances detection of true positives against the risk of over-matching in diverse protein contexts.

Profiles, in contrast, provide quantitative models for longer sequence regions and are constructed from multiple sequence alignments (MSAs) of related proteins using tools such as pftools. Basic PSSMs assign position-specific scores based on observed residue frequencies and substitution matrices to evaluate sequence similarity across an entire domain; for example, the globin family profile spans approximately 140 positions and scores alignments to identify oxygen-binding domains, with a cutoff threshold for significance. Generalized profiles, introduced by Bucher in 1994, extend this by incorporating hidden Markov model (HMM)-like features, including penalties for insertions and deletions, which improve sensitivity for distant homologs in less conserved families. They are built by aligning sequences (e.g., with ClustalW or T-Coffee), extending fragments to include flanking regions, and optimizing for subfamily specificity, as in the profiles for animal peroxidases, which capture evolutionary divergence while maintaining low false-positive rates. Profiles thus excel at annotating structural and functional domains where patterns alone lack sufficient power.

Hybrid signatures combine patterns and profiles to optimize both sensitivity and specificity, often integrating ProRules for contextual validation. For example, a profile may detect a broad domain such as an ATP-binding domain, while an embedded pattern confirms critical catalytic residues, and rules promote weak matches if the pattern aligns. This synergy is valuable for complex annotations because it reduces errors in family assignment.

Biologically, PROSITE signatures classify protein features into four categories: domains, which are compact structural or functional units (e.g., zinc fingers detected by profiles); families, which are evolutionarily related groups detected via generalized profiles; repeats, which are tandemly arrayed motifs (e.g., EF-hand calcium-binding repeats with context-dependent scoring); and sites, which are localized functional elements (e.g., metal-binding sites marked in patterns). These categories facilitate targeted detection, with domains and families relying more on profiles for comprehensive coverage, while sites and repeats favor patterns for precision.
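To make the idea of position-specific scoring concrete, here is a minimal sketch of a basic, ungapped PSSM. It assumes a uniform background distribution and a simple pseudocount, and it deliberately omits the insertion and deletion penalties of generalized profiles; the function names and the four-sequence toy alignment are hypothetical and do not come from PROSITE.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Build log-odds scores per alignment column against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            frequency = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(frequency / background)
        pssm.append(scores)
    return pssm


def best_window(pssm: list[dict[str, float]], sequence: str) -> tuple[float, int]:
    """Slide the matrix along the sequence; return (best score, 1-based start)."""
    width = len(pssm)
    best = (float("-inf"), 0)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[aa] for column, aa in zip(pssm, window))
        best = max(best, (score, start + 1))
    return best


if __name__ == "__main__":
    toy_alignment = ["GKS", "GKT", "GRS", "AKS"]   # hypothetical aligned motif
    matrix = build_pssm(toy_alignment)
    score, start = best_window(matrix, "MMGKSLL")
    print(f"best score {score:.2f} at position {start}")
```

A hit would then be reported only if the best score exceeds a chosen cutoff, which plays the role of the significance threshold mentioned above. The sketch assumes sequences contain only the 20 standard residues.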

Documentation and Rules

Each PROSITE entry includes extensive documentation that describes the biological role, evolutionary context, and structural features of the protein family, domain, or functional site it represents. For instance, the documentation for the Copper/Zinc superoxide dismutase details the enzyme's catalytic mechanism involving metal ion coordination and its evolutionary conservation across eukaryotes and prokaryotes, drawing on biochemical and phylogenetic evidence. These descriptions are curated by experts and contextualize the signature's significance, so that users can understand not only sequence matches but also their functional implications.

Literature references are a core component of the documentation, with each entry citing key primary sources such as seminal papers on the motif's discovery or validation. Examples include Bannister et al. (1987) for superoxide dismutase mechanisms and Smith & Doolittle (1992) for evolutionary analyses, as well as the references in prokaryotic entries like PDOC00013 on lipoproteins. These citations, typically 5–20 per entry, are selected for their impact and direct relevance, ensuring traceability to experimental data or computational validation.

ProRules in PROSITE are manually curated logical rules that increase the precision of motif-based predictions by adding conditional logic and contextual constraints. Written in the UniRule format, these rules use logical conditions of the form "IF the profile matches AND a specific conserved residue is present at position X, THEN assign function Y (e.g., catalytic activity)." For example, a ProRule for zinc finger domains might specify that a profile match combined with a flanking basic residue predicts DNA-binding capability, thereby refining site assignments and reducing ambiguity in automated annotations. This approach improves the discriminatory power of profiles, particularly for complex families, by integrating structural and functional criteria beyond simple sequence similarity.

Evidence coding in PROSITE documentation assigns reliability levels to signatures based on validation against the UniProtKB/Swiss-Prot database, categorizing outcomes as true positives, false positives, or false negatives through metrics like /POSITIVE=20(20) and /FALSE_POS=0(0). Entries distinguish evidence derived from direct experimental data from evidence inferred by similarity to well-characterized homologs, with the former prioritized for high-confidence sites. Taxonomic restrictions further reduce false positives by limiting applicability via /TAXO-RANGE qualifiers, such as restricting a motif to archaea and eukaryotes (A?E??) to exclude prokaryotic sequences where it lacks functional relevance.

Cross-references in PROSITE entries link signatures to external resources for validation and exploration. They include DR lines to UniProtKB accessions, 3D lines to Protein Data Bank (PDB) structures (e.g., 1AGY for a specific domain), and mappings to InterPro domains or Gene Ontology (GO) terms for functional annotation. These interconnections, such as associating a profile with GO:0004672 for protein kinase activity, facilitate integration with structural and ontological databases and support evidence-based interpretation of matches.
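The conditional logic of a ProRule can be illustrated with a small sketch. This is not the UniRule format itself; the `Rule` and `Condition` classes, the accession "PS99999", and the example annotation are all hypothetical and serve only to show the "IF the signature matches AND a residue is present at position X THEN annotate" structure.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    offset: int        # 1-based position inside the matched region
    allowed: str       # residues that satisfy the condition
    annotation: str    # annotation emitted when the condition holds


@dataclass
class Rule:
    signature: str     # accession of the signature that triggers the rule
    conditions: list[Condition]


def apply_rule(rule: Rule, matches: dict[str, tuple[int, int]], sequence: str) -> list[tuple[int, str]]:
    """Return (position, annotation) pairs; matches maps accession -> (start, end), 1-based."""
    if rule.signature not in matches:
        return []                      # rule fires only if the signature matched
    start, end = matches[rule.signature]
    annotations = []
    for condition in rule.conditions:
        position = start + condition.offset - 1
        if position <= end and sequence[position - 1] in condition.allowed:
            annotations.append((position, condition.annotation))
    return annotations


if __name__ == "__main__":
    rule = Rule("PS99999", [Condition(offset=3, allowed="D", annotation="Proton acceptor")])
    print(apply_rule(rule, {"PS99999": (3, 7)}, "AAAKDAAA"))
```

Keeping the residue check separate from the signature match mirrors the behavior described above: the rule adds an annotation only when both the underlying match and the residue condition are satisfied.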

Access and Usage

Search Tools

ScanProsite is the primary web-based tool for querying protein sequences against the PROSITE database, enabling users to identify matches to patterns, profiles, and rules that signify protein domains, families, or functional sites. Users can input sequences in FASTA format, UniProtKB accessions, or PDB identifiers, with support for up to 10 sequences in standard mode or larger batches for advanced scans. The tool scans against the full PROSITE collection or user-defined motifs and applies ProRules for additional validation of sites and structural features. Key options include adjustable sensitivity thresholds, such as a high-sensitivity mode (LEVEL=-1) for detecting weak profile matches, and filters to exclude motifs with a high probability of occurrence or to restrict results by taxonomy and length. Output formats include HTML for graphical views with alignments and scores, XML for programmatic use, and text-based lists of matches, and hit reliability can be judged from normalized scores and e-values. For programmatic access, ScanProsite provides a RESTful interface via GET or POST requests to the PSScan.cgi script, allowing integration into workflows for batch processing of up to 1,000 sequences.

MyDomains complements ScanProsite with a visualization tool that generates graphical representations of domain architectures from scan results or manual input. Users specify domain positions, shapes (1-6 options), colors (1-4), and labels, along with ranges or sites, to produce customizable PNG images showing protein layouts with rulers for scale. This gives a clear depiction of multi-domain arrangements and aids interpretation of overlapping or adjacent motifs identified in scans.

Advanced capabilities include batch scanning for high-throughput analysis, motif extraction using integrated tools like PRATT for deriving patterns from unaligned sequences, and prediction of post-translational modifications via ProRule evaluations embedded in scan outputs. The typical user workflow involves submitting FASTA sequences or identifiers, selecting signature types (patterns, profiles, or rules), applying optional thresholds, and reviewing results for match scores, sequence alignments, and graphical summaries before exporting or visualizing them via MyDomains. These tools are part of the ExPASy bioinformatics suite and are accessible alongside related resources.
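For scripted access, a request to the scanning service is an ordinary URL with query parameters. The sketch below only builds such a URL and performs no network call. The endpoint path and the parameter names `seq` and `output` are assumptions that should be checked against the current ScanProsite documentation before use, and `build_scan_url` is an illustrative helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the ScanProsite programmatic-access documentation.
PSSCAN_URL = "https://prosite.expasy.org/cgi-bin/prosite/PSScan.cgi"


def build_scan_url(seq: str, output: str = "xml", base_url: str = PSSCAN_URL, **options: str) -> str:
    """Build a GET URL for a scan of one sequence or accession.

    `seq` and `output` are assumed parameter names; any extra keyword
    arguments are passed through unchanged as additional query parameters.
    """
    params = {"seq": seq, "output": output, **options}
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # A UniProtKB accession is used here purely as an example query.
    print(build_scan_url("P98073"))
```

The resulting URL could be fetched with `urllib.request.urlopen` or any HTTP client; keeping URL construction separate from the request makes the parameters easy to inspect and test offline.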

Integration with Other Resources

PROSITE is a core component of the InterPro database, where its patterns and profiles are integrated alongside signatures from other member databases to provide comprehensive protein family and domain classifications, reducing redundancy and improving annotation accuracy across diverse protein sequences. This integration enables unified entries, such as IPR000859, which merge PROSITE data with contributions from other member databases to support automated functional predictions. Furthermore, PROSITE signatures are linked directly within UniProtKB entries, and tools like InterProScan provide automated domain and feature annotations for over 81% of UniProtKB sequences, with updates approximately every eight weeks ensuring timely synchronization.

Within the ExPASy ecosystem, PROSITE works together with Swiss-Prot (part of UniProtKB) to annotate protein domains and functional sites, providing curated sequence data enriched with PROSITE-derived motifs for reliable family assignments. It also complements the ENZYME database by identifying catalytic and binding sites through its patterns, which aids enzyme classification and nomenclature within the shared platform. Additionally, PROSITE's domain information supports structure-function predictions in SWISS-MODEL, where motif matches guide homology modeling to infer three-dimensional structures and associated biological roles.

PROSITE also reaches into broader bioinformatics tools, including BLAST extensions such as PHI-BLAST, which uses PROSITE-style patterns to detect protein motifs alongside sequence similarities for more precise functional identification. In genome browsers like Ensembl, PROSITE motifs are displayed alongside domains in transcript views, enabling integrated analysis of genomic context and protein features. It is also used in analysis pipelines similar to PfamScan, notably through InterProScan, which applies PROSITE alongside other signatures for high-throughput domain scanning. Moreover, PROSITE data is exported to resources like the Gene Ontology (GO) via InterPro mappings, which associate motifs with standardized functional terms, and to KEGG's SSDB for precomputing motifs in pathway-related protein sets.

Through its development by the SIB Swiss Institute of Bioinformatics, PROSITE contributes to the ELIXIR infrastructure as a key resource supporting European life sciences. Its updates are synchronized with UniProtKB releases, such as the 2025_04 version of UniProtKB/Swiss-Prot, following the roughly eight-week release cycle so that data flows consistently across interconnected resources.

Applications and Significance

Functional Protein Annotation

PROSITE facilitates functional protein annotation by aligning query sequences against its curated signatures, primarily regular expression-based patterns and position-specific score matrices (profiles), to detect conserved motifs indicative of protein domains, families, or sites. A successful match transfers documented functional information from the signature's entry to the query protein, enabling inference of biological roles such as enzymatic activity or subcellular localization. For example, a match to the protein kinase signatures (PDOC00100) identifies the catalytic domain, implying ATP binding and kinase activity. Similarly, detection of the N-myristoylation site pattern (PS00008) suggests membrane association, as this lipid modification anchors proteins to cellular membranes. This process relies on the signatures being derived from multiple sequence alignments and literature-verified functional regions, which grounds annotations in evolutionary conservation.

The choice between pattern and profile signatures balances specificity and sensitivity. Patterns prioritize specificity by targeting strictly conserved residues, yielding confident functional calls with low false-positive rates but potentially overlooking sequence variants; for instance, they excel at pinpointing precise active sites like those in kinases. Profiles, conversely, use quantitative scoring to accommodate substitutions, improving sensitivity for assigning broader family memberships while maintaining reasonable specificity through calibration against positive and negative sets. This trade-off is critical for accurate inference: over-sensitive matches can propagate errors in downstream analyses, whereas overly strict patterns can under-annotate divergent homologs.

Representative examples illustrate PROSITE's utility in diverse contexts. In viral proteins, matching the VP35 antagonist signature (PDOC51735) annotates immune evasion functions, such as suppression of host antiviral responses in filoviruses like Ebola virus. For bacterial enzymes, the beta-lactamase class B signature (PDOC00606) identifies zinc-dependent hydrolases conferring resistance to beta-lactam antibiotics, aiding the functional classification of resistance mechanisms. In proteomics workflows, PROSITE automates large-scale annotation by scanning vast sequence datasets, such as those from metagenomics projects, to assign functions to uncharacterized proteins derived from environmental microbial communities. This is particularly valuable for orphan genes, which are sequences without clear orthologs, where motif detection can reveal hidden functional similarities and accelerate genome-scale functional mapping in studies of novel enzymes.
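The specificity and sensitivity trade-off can be quantified from the validation counts that accompany each signature. As a worked example, the sketch below uses the sequence counts quoted in the PS50011 example entry earlier in this article (4438 true positives, 11 false positives, 243 false negatives); treating those counts as the inputs for precision and recall is an interpretation made here for illustration, not a metric that PROSITE itself reports in this form.

```python
# Sequence-level counts taken from the PS50011 example entry in this article.
true_positives = 4438
false_positives = 11
false_negatives = 243

# Precision: fraction of reported hits that are genuine family members.
precision = true_positives / (true_positives + false_positives)

# Recall (sensitivity): fraction of genuine family members that are detected.
recall = true_positives / (true_positives + false_negatives)

print(f"precision: {precision:.2%}")
print(f"recall:    {recall:.2%}")
```

Under these assumptions the profile is very precise (few false hits) while still missing roughly one in twenty known family members, which is the kind of balance the paragraph above describes for profiles.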

Impact on Research

PROSITE played an important role in genomics by enabling large-scale functional predictions of protein sequences during the Human Genome Project in the 2000s, when its patterns and profiles were integrated into annotation pipelines to identify domains and motifs in newly sequenced genes. This facilitated the assignment of biological roles to thousands of predicted proteins and accelerated the interpretation of the human proteome. Beyond human genomics, PROSITE has supported functional annotation in non-model organisms, such as bacteria and plants, by detecting conserved signatures in understudied proteomes, thereby bridging gaps in comparative genomics.

A major impact of PROSITE lies in the discovery of novel motifs within disease-related proteins, exemplified by the identification of kinase signatures in cancer, where aberrant motifs have revealed key drivers of oncogenic signaling. Furthermore, PROSITE's family classifications have advanced evolutionary studies by highlighting conserved domains that illuminate protein divergence and phylogenetic relationships across species. PROSITE has been widely referenced in the scientific literature, reflecting its broad adoption as a cornerstone of bioinformatics workflows. While powerful, PROSITE is a complement to experimental validation rather than a substitute for it, and ongoing refinements via ProRules reduce false positives by improving the discriminatory power of signatures against non-homologous sequences.

Current Status

Statistics and Coverage

As of the 2025_04 release on October 15, 2025, PROSITE contains 1,956 documentation entries describing protein families, domains, and functional sites. These include 1,311 patterns, 1,403 profiles, and 1,432 ProRules, which provide rules for automated annotation of matching sequences. PROSITE signatures match approximately 54.5% of the reviewed proteins in UniProtKB/Swiss-Prot, covering 312,414 unique entries out of 573,661 total sequences in the 2025_04 release. This coverage is achieved through 494,627 total annotations, reflecting the database's focus on well-characterized, manually curated motifs with high specificity to minimize false positives. Patterns are designed for high specificity, while profiles offer broader sensitivity, enabling reliable functional predictions for a substantial portion of eukaryotic and prokaryotic proteins.

The database has shown steady but modest growth over two decades. In release 19.11 from September 2005, PROSITE included 1,329 patterns and 552 profiles; by 2025, patterns have remained stable near 1,300, while profiles and ProRules have expanded significantly to over 1,400 each, with annual additions typically ranging from 50 to 100 new or updated entries. This incremental development reflects a conservative expansion that prioritizes quality over volume.

In practical use, the ScanProsite tool, which scans sequences against all signatures, processes queries rapidly, and false positive rates are kept low through rigorous validation and ProRule constraints during curation. These attributes contribute to PROSITE's utility in large-scale analyses.
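The coverage figure follows directly from the counts given above, as this short check shows. The second ratio, the mean number of signature annotations per matched entry, is a derived quantity computed here for illustration and is not a figure quoted by PROSITE.

```python
# Counts quoted above for release 2025_04.
matched_entries = 312_414      # UniProtKB/Swiss-Prot entries with at least one match
total_entries = 573_661        # all reviewed entries in the release
total_annotations = 494_627    # signature annotations across matched entries

coverage = matched_entries / total_entries
annotations_per_matched_entry = total_annotations / matched_entries

print(f"coverage: {coverage:.1%}")
print(f"annotations per matched entry: {annotations_per_matched_entry:.2f}")
```

The first ratio reproduces the stated coverage of about 54.5%, and the second indicates that a matched protein carries between one and two PROSITE annotations on average.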

Updates and Future Directions

PROSITE is updated through regular releases synchronized with the UniProtKB release cycle, typically every eight weeks, to incorporate new sequence data and annotations. These releases, such as version 2025_04 dated 15 October 2025, are curated by experts at the SIB Swiss Institute of Bioinformatics, who draw on peer-reviewed literature and user-submitted feedback to refine patterns, profiles, and ProRules.

In the 2020s, enhancements have centered on the integration of PROSITE signatures into InterPro. Patterns and profiles from PROSITE release 2023_05 were consolidated to minimize redundancy and improve annotation accuracy, covering 97.9% of patterns and 93.6% of profiles in InterPro version 101.0. Through this integration, PROSITE benefits indirectly from deep learning approaches: InterPro uses AI models such as GPT-4 to support annotation and to generate protein family descriptions, alongside functional predictions. Coverage of intrinsically disordered regions has also expanded through InterPro's inclusion of MobiDB-lite predictions, so that such regions can be annotated alongside PROSITE signature matches.

Looking ahead, future directions for PROSITE emphasize deeper integration with structural prediction tools to develop structure-aware signatures that combine sequence motifs with predicted 3D models for more precise functional insights. AI-driven automation of rule creation is anticipated to streamline curation, reducing manual effort while maintaining accuracy in generating ProRules. Efforts are also underway to address microbiome diversity by leveraging PROSITE within InterPro's support for resources like MGnify, enabling better annotation of microbial protein families.

Key challenges include keeping pace with the explosive growth in protein sequences: UniProtKB has expanded by 371% over the past decade, which strains coverage despite ongoing updates. Coverage of UniProtKB sequences is reported at 81.8%, which highlights the need for PROSITE to improve in this area. Improving predictions for multi-domain proteins also remains critical, as current signatures often struggle with overlapping or complex architectures, necessitating more advanced computational methods.
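
To put the 371% growth figure in perspective, the short Python sketch below converts a percentage increase into a fold change and an implied yearly rate. It is only an illustration under stated assumptions: "expanding by 371%" is read as an increase of 371% over the starting size, "the past decade" as exactly ten years, and growth as steady compounding. The function names are hypothetical and do not come from any PROSITE or UniProt tool.

```python
def growth_factor(percent_increase: float) -> float:
    """Convert a percentage increase into a multiplicative fold change."""
    return 1.0 + percent_increase / 100.0


def annualized_rate(percent_increase: float, years: float) -> float:
    """Constant yearly growth rate (as a fraction) that compounds
    to the given total percentage increase over `years` years."""
    return growth_factor(percent_increase) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # Assumed inputs: a 371% increase spread evenly over 10 years.
    factor = growth_factor(371.0)
    rate = annualized_rate(371.0, 10.0)
    print(f"fold change: {factor:.2f}x")
    print(f"implied annual growth: {rate:.1%}")
```

Under these assumptions the database would be about 4.7 times its starting size, which corresponds to roughly 17% growth per year; a curation effort that adds signatures more slowly than that would tend to lose relative coverage over time.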
