Sequence motif

A sequence motif is a short, conserved pattern of nucleotides or amino acids that recurs within biological sequences such as DNA, RNA, or proteins and typically performs a specific structural or functional role, for example facilitating protein binding or enzymatic activity. Motifs are often represented as consensus sequences, as degenerate patterns that allow variability (e.g., "R" denoting adenine or guanine), or as probabilistic models such as position-specific scoring matrices (PSSMs) that capture sequence preferences at each position. In DNA, motifs commonly appear in regulatory regions like promoters or enhancers, where they serve as binding sites for transcription factors that control gene expression. In protein sequences, motifs mark functional sites or structural elements, such as enzyme active sites or ligand-binding regions, which are conserved across evolutionarily related proteins because of their critical roles in cellular processes. Examples include the helix-turn-helix motif in DNA-binding proteins, which enables sequence-specific interactions with DNA, and the zinc finger motif, which stabilizes protein structures for nucleic acid recognition. These patterns are usually 6–20 residues long and can be exact matches or allow gaps and substitutions, reflecting evolutionary pressure to maintain function amid sequence divergence.

The identification of sequence motifs is a cornerstone of bioinformatics, enabling the annotation of genomic and proteomic data to infer biological function, predict interactions, and understand evolutionary relationships. Motif discovery algorithms such as MEME (Multiple EM for Motif Elicitation) analyze unaligned sequences to detect statistically significant patterns, which are often validated against databases like JASPAR for transcription factor binding motifs or PROSITE for protein domains and sites. This computational approach has roots in early work on probabilistic models for signal detection and continues to advance with methods for more accurate functional motif prediction in large-scale sequencing data. Applications span gene regulation studies, drug design targeting motif-mediated interactions, and diagnostics for motif alterations linked to diseases like cancer.

Fundamentals

Definition and Types

A sequence motif is a short, recurring pattern of nucleotides or amino acids in biological sequences, such as DNA, RNA, or proteins, that occurs more frequently than expected by chance and is often associated with a specific biological function. These patterns are typically 6 to 20 residues long and are statistically significant when compared with random sequences, reflecting evolutionary conservation due to their functional importance. In DNA and RNA, motifs consist of nucleotide bases (A, C, G, T/U); in proteins, they are composed of the 20 standard amino acids.

Sequence motifs are classified primarily by the type of biological sequence in which they occur. Nucleotide motifs are patterns in DNA or RNA sequences, commonly found in regulatory regions like promoters or enhancers, where they facilitate processes such as transcription factor binding. Amino acid motifs appear in protein sequences, often marking functional sites within domains, such as those involved in catalysis or ligand binding. Structural motifs, distinct from linear sequence patterns, are short three-dimensional arrangements of secondary structure elements in proteins, like alpha-helices or beta-sheets, that contribute to overall folding; they are covered here only where their spatial context bears on sequence patterns.

The concept of sequence motifs emerged in the early 1980s with the advent of bioinformatics tools for analyzing aligned sequences, which made it possible to identify conserved patterns amid variability. A seminal example was the Walker motif, first described in 1982 as a conserved nucleotide-binding sequence (GXXXXGK[T/S]) in ATP-requiring enzymes such as ATP synthase subunits, highlighting the role of motifs in predicting functional sites across distantly related proteins. This discovery laid the foundation for motif-based annotation in databases like PROSITE, established in the late 1980s to catalog such patterns systematically.

Biological Significance

Sequence motifs play crucial roles in various biological processes by serving as recognition elements for molecular interactions. In DNA, they often function as binding sites for transcription factors, which regulate gene expression by recognizing specific short sequences in promoter and enhancer regions; this sequence-specific binding controls transcriptional activation or repression. In proteins, sequence motifs can delineate catalytic sites in enzymes, where conserved patterns of amino acids facilitate chemical reactions, as exemplified by the conserved motifs in cytosine-5 DNA methyltransferases, such as motif IV (containing the PCQ sequence), that are essential for catalysis. Motifs also act as signals for post-translational modifications, such as phosphorylation or ubiquitination, by providing short sequence patterns that direct modifying enzymes to target residues.

From an evolutionary perspective, sequence motifs exhibit high conservation across species, reflecting their functional importance and the selective pressures that maintain them. This is particularly evident in regions flanking post-translational modification sites, where motifs for modifications such as phosphorylation and ubiquitination show elevated sequence similarity compared with non-modified regions, indicating strong purifying selection against disruptive changes. Such slow evolutionary rates arise because alterations in these motifs could impair essential functions and reduce fitness; for example, motifs in methyl-CpG-binding domains are preserved across mammals because of their role in epigenetic regulation. Selective pressure thus preserves motif integrity over evolutionary time, distinguishing motifs from less constrained sequence elements.

The biological relevance of sequence motifs is often assessed through statistical measures of their over-representation in sets of biological sequences. Information content quantifies motif conservation by comparing the observed nucleotide or amino acid frequencies at each position with the expected background frequencies (formally, the relative entropy between the two distributions), with higher values indicating greater specificity and functional constraint. Complementarily, the E-value estimates the expected number of motifs with an equal or better score occurring by chance in random sequences of the same length, providing a measure of statistical significance; low E-values (e.g., below 0.01) suggest non-random enrichment and biological importance.

Mutations that disrupt sequence motifs can have profound pathological consequences, particularly in diseases like cancer. Alterations in promoter motifs, such as those in the TERT gene, create or abolish transcription factor binding sites, leading to dysregulated gene expression and tumor progression; for example, specific C>T mutations in the TERT promoter generate ETS transcription factor binding motifs that drive TERT activation. Similarly, mutations in cis-regulatory motifs across cancer genomes are under negative selection in normal cells but can confer selective advantages in tumors by altering transcriptional programs. These changes highlight motifs as critical points of vulnerability in gene regulation.
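
The information content measure described above can be made concrete with a short calculation. The following is a minimal sketch, assuming a set of already aligned, equal-length DNA sites and a uniform background unless one is supplied; the function name `information_content` and the toy sites are illustrative and not taken from any particular tool, and no small-sample correction is applied.

```python
import math
from collections import Counter


def information_content(sites, background=None, alphabet="ACGT"):
    """Per-position information content (relative entropy, in bits).

    sites: aligned, equal-length strings over `alphabet` (an assumption).
    background: dict of residue -> probability; uniform if omitted.
    """
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    bits_per_position = []
    for column in zip(*sites):
        counts = Counter(column)
        total = len(column)
        bits = 0.0
        for residue in alphabet:
            f = counts[residue] / total
            if f > 0:  # 0 * log(0) is taken as 0
                bits += f * math.log2(f / background[residue])
        bits_per_position.append(bits)
    return bits_per_position


# Toy alignment (hypothetical sites, for illustration only).
toy_sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
print([round(b, 2) for b in information_content(toy_sites)])
```

With a uniform DNA background, a fully conserved position reaches the maximum of 2 bits, while positions that tolerate substitutions score lower.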

Representation Methods

Consensus Sequences

A consensus sequence represents an idealized average of a sequence motif, in which each position is defined by the most frequent nucleotide or amino acid residue observed across aligned instances of the motif. This approach provides a simplified summary of conserved patterns in biological sequences, such as DNA binding sites for transcription factors or functional sites in proteins.

Consensus sequences are derived by aligning known examples of the motif and then selecting the predominant residue at each aligned position. Where no single residue dominates, typically when two or more options appear with comparable frequency, ambiguity symbols from the IUPAC nomenclature capture the degeneracy; for instance, the code R indicates either adenine (A) or guanine (G) at that site. This construction process, performed manually in early applications and now usually with alignment tools, allows motifs such as the bacterial −10 promoter element (Pribnow box) to be written as TATAAT, accommodating minor variations while highlighting core conservation.

The primary advantages of consensus sequences are simplicity and accessibility: they are straightforward for humans to read and interpret, allowing quick visualization of patterns without computational expertise. This made them a staple in the initial development of motif databases and of predictive tools for identifying regulatory elements in genomic sequences.

Despite these benefits, consensus sequences have notable limitations. They do not record the relative frequencies of residues at each position, which yields an oversimplified depiction that can obscure subtle differences in specificity. The choice of threshold for mismatches or ambiguity is often arbitrary, potentially reducing predictive accuracy compared with frequency-weighted representations.
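
The derivation of a degenerate consensus can be sketched in a few lines. This is an illustrative example only, assuming gap-free aligned DNA sites; the function name `consensus` and the `min_fraction` threshold are hypothetical choices, and the threshold illustrates exactly the kind of arbitrary cutoff noted above.

```python
from collections import Counter

# IUPAC nucleotide ambiguity codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(sites, min_fraction=0.25):
    """Build an IUPAC consensus from aligned, gap-free DNA sites.

    A base is kept at a position when its frequency is at least
    `min_fraction` (a hypothetical, user-chosen cutoff).
    """
    symbols = []
    for column in zip(*sites):
        counts = Counter(column)
        total = len(column)
        kept = frozenset(b for b, n in counts.items() if n / total >= min_fraction)
        if not kept:  # nothing passes the cutoff: fall back to the top base
            kept = frozenset(counts.most_common(1)[0][0])
        symbols.append(IUPAC[kept])
    return "".join(symbols)


toy_sites = ["TATAAT", "TATGAT", "TATAAT", "TACGAT"]
print(consensus(toy_sites))        # TAYRAT
print(consensus(toy_sites, 0.3))   # TATRAT
```

Raising the cutoff from 0.25 to 0.3 drops the minority C at the third position, which shows how sensitive a consensus is to the chosen threshold.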

Pattern Notations

Pattern notations offer a structured syntax for defining sequence motifs, enabling the precise specification of permissible residues at each position along with constraints on length and composition, and thereby accommodating natural variability in biological sequences more flexibly than basic consensus representations. These notations typically employ symbolic rules akin to regular expressions, facilitating computational pattern matching in large sequence databases. By encoding discrete allowances and exclusions, they support targeted searches for functional sites without relying on probabilistic weights.

The PROSITE database, a primary repository for such descriptors, uses a specialized regular expression-like syntax to describe motifs. In this system, standard IUPAC one-letter amino acid codes denote specific residues, while 'x' represents any amino acid, and positions are separated by hyphens. Square brackets enclose lists of permitted alternatives (e.g., [ST] for serine or threonine), and curly braces indicate exclusions (e.g., {A} for any residue except alanine). Repetition is written in parentheses after an element: (n) specifies exactly n occurrences of the preceding element and (n,m) between n and m occurrences, so x(n) stands for n arbitrary residues. Terminal constraints are marked with '<' for a pattern anchored at the N-terminus or '>' for one anchored at the C-terminus. This notation powers tools for scanning sequences against curated patterns.

Beyond PROSITE, other systems adapt similar symbolic approaches for motif representation. The MEME Suite supports a minimal text format that can incorporate IUPAC ambiguity codes for consensus-like patterns, often rendered as sequence logos for visualization, though it primarily emphasizes matrix-based descriptions. Tools like ScanProsite extend PROSITE syntax by converting patterns to standard regular expressions, enhancing compatibility with broader bioinformatics pipelines for motif detection. These notations are integral to databases such as PROSITE, where over 1,300 patterns enable automated annotation of protein functions across genomes.
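
The conversion from PROSITE-style patterns to standard regular expressions can be illustrated with a small translator. This is a hedged sketch, not the converter used by ScanProsite: the function name `prosite_to_regex` is hypothetical, the protein fragment is invented, and the sketch handles only the basic elements (hyphens, x, brackets, braces, repeat counts, and terminal anchors outside brackets).

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    regex = pattern.rstrip(".")                           # drop the trailing period
    regex = regex.replace("-", "")                        # hyphens only separate positions
    regex = regex.replace("x", ".")                       # x      -> any residue
    regex = regex.replace("{", "[^").replace("}", "]")    # {A}    -> [^A]
    regex = regex.replace("(", "{").replace(")", "}")     # (n,m)  -> {n,m}
    regex = regex.replace("<", "^").replace(">", "$")     # terminal anchors
    return regex


# A P-loop-style pattern written in PROSITE notation.
p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST].")
print(p_loop)  # [AG].{4}GK[ST]

# Hypothetical protein fragment, for illustration only.
fragment = "MSDGAPGSGKSTLA"
match = re.search(p_loop, fragment)
print(match.start(), match.group())  # 3 GAPGSGKS
```

Braces are rewritten before parentheses so that the exclusion syntax is not confused with the regex repeat counts produced in the following step.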

Position-Specific Scoring Matrices

A position-specific scoring matrix (PSSM), also referred to as a position weight matrix (PWM), is a quantitative representation of a sequence motif that captures position-dependent variability in residue preferences through a log-odds scoring system. Each column in the matrix corresponds to a specific position within the motif, while each row represents a possible residue from the alphabet (e.g., the four nucleotides A, C, G, T for DNA motifs or the 20 amino acids for protein motifs). The matrix entry at position i for residue j is the log-ratio of the likelihood of observing that residue at that position to its expected background frequency, enabling probabilistic modeling of motif conservation across aligned sequences. This approach was first formalized for locating signals in nucleic acid sequences.

PSSMs are constructed from multiple sequence alignments of known motif instances, from which observed residue frequencies are derived and converted into scores. The frequency of residue j at position i is the count of occurrences divided by the total number of sequences, usually adjusted with a pseudocount k (typically small, e.g., 0.1–1) to smooth estimates and avoid zero probabilities: f'_{i,j} = \frac{n_{i,j} + k\,p_j}{\sum_{j'} n_{i,j'} + k}, where n_{i,j} is the raw count of residue j at position i and p_j is the background probability of residue j. The score is then S(i,j) = \ln \left( \frac{f'_{i,j}}{p_j} \right), yielding a matrix in which positive scores indicate residues more frequent than expected by chance and negative scores indicate underrepresentation. This log-odds formulation, which treats positions as independent, was refined in early applications to transcription factor binding sites. For protein motifs, similar constructions underpin profile-based searches, as in PSI-BLAST, where PSSMs are rebuilt iteratively from alignments to detect remote homologs.

In applications, PSSMs score potential motif occurrences in query sequences by summing the matrix scores of the residues aligned to each position, with higher aggregate scores signifying stronger matches to the motif model. Thresholds for significance are often set using statistical models, such as extreme value distributions, to distinguish true positives from random matches. This scoring enables genome-wide scans for regulatory elements or protein domains and improves sensitivity over simple consensus sequences.

Extensions to standard PSSMs address the limitation of assuming positional independence by incorporating higher-order dependencies, such as dinucleotide frequencies, to model correlations between adjacent residues. In higher-order PWMs, joint probabilities for pairs (or larger groups) of residues replace single-position frequencies, expanding the matrix dimensionality while enhancing predictive accuracy for complex motifs, though at increased computational cost. These models have been applied to refine motif matching in the presence of sequence variations like SNPs.
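
The construction and scanning steps described in this section can be sketched as follows. This is a minimal illustration, assuming gap-free aligned DNA sites, a uniform background, and a pseudocount of 1; the function names `build_pssm` and `scan` and all sequences are hypothetical, and the matrix is stored as a list of per-position dictionaries rather than as rows and columns.

```python
import math


def build_pssm(sites, background, pseudocount=1.0):
    """Log-odds PSSM: ln((n_ij + k*p_j) / (N + k) / p_j) for each position."""
    n_sites = len(sites)
    pssm = []
    for i in range(len(sites[0])):
        column = {}
        for residue, p in background.items():
            count = sum(1 for s in sites if s[i] == residue)
            freq = (count + pseudocount * p) / (n_sites + pseudocount)
            column[residue] = math.log(freq / p)
        pssm.append(column)
    return pssm


def scan(sequence, pssm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pssm)
    return [
        (start, sum(pssm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]


background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]  # illustrative only
pssm = build_pssm(toy_sites, background)

hits = scan("GCGTATAATGCC", pssm)
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

In practice the best-scoring window would be compared against a significance threshold rather than simply taken as a match; the threshold choice is outside the scope of this sketch.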

Discovery Techniques

De Novo Motif Discovery

De novo motif discovery encompasses unsupervised algorithms designed to identify over-represented sequence patterns, known as motifs, directly from sets of unaligned biological sequences without relying on prior knowledge of the motifs or alignments. These methods assume that motifs appear more frequently than expected by chance in the input sequences and often model them as position weight matrices (PWMs) to capture position-specific nucleotide or amino acid preferences. The primary challenge lies in distinguishing genuine motifs from random noise or compositional biases in the sequence data, which can lead to false positives if not addressed through statistical validation.

Among the foundational algorithms, the expectation-maximization (EM) approach implemented in the MEME tool iteratively optimizes a mixture model to fit motifs to the data. In its simplest mode, MEME assumes that each sequence contains zero or one occurrence of the motif and uses EM to alternate between assigning motif sites (expectation step) and updating the PWM parameters (maximization step) until convergence. This method can discover multiple motifs in a dataset and incorporates background models to account for sequence composition. Similarly, the Gibbs Motif Sampler employs a Gibbs sampling strategy to progressively align potential motif sites across sequences. It randomly selects starting positions for motif occurrences, then repeatedly holds one sequence out, builds the motif model from the remaining sites, and samples a new position in the held-out sequence, refining the alignment toward a higher likelihood under the motif model. Both algorithms typically start from seed motifs, derived from random starts or from searches for high-scoring subsequences, and refine them through iterative optimization. Recent approaches, such as MotifAE (2025), further advance discovery by interpreting functional motifs from protein language models.

To assess the reliability of discovered motifs, these methods incorporate significance testing using metrics such as E-values or p-values, which estimate the probability of observing a motif by chance under a null model of random sequences. For instance, discovery tools derive E-values from the motif's likelihood score relative to shuffled-sequence controls, enabling users to discard motifs that fail a chosen threshold. Recent advancements include discriminative algorithms, which extend de novo discovery by contrasting the primary sequences against a negative set, thereby enhancing sensitivity for short or subtle motifs while remaining efficient through a branch-and-bound search strategy. Such tools report p-values adjusted for multiple testing and have demonstrated superior performance on diverse datasets compared with earlier tools.
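
The hold-one-out sampling loop described above can be sketched in a few dozen lines. This is a deliberately simplified site sampler, not the published Gibbs Motif Sampler: it assumes exactly one site per sequence, DNA sequences containing only A, C, G, and T, a fixed motif width, and a uniform background, and the function name `gibbs_motif_search` and the toy sequences are hypothetical.

```python
import random


def gibbs_motif_search(seqs, width, iterations=200, pseudocount=0.5, seed=0):
    """Return one sampled motif start per sequence (simplified site sampler)."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Start from a random site in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for held_out, seq in enumerate(seqs):
            # Build a frequency profile from all sites except the held-out one.
            others = [
                s[p:p + width]
                for k, (s, p) in enumerate(zip(seqs, starts))
                if k != held_out
            ]
            profile = []
            for i in range(width):
                column = {a: pseudocount for a in alphabet}
                for site in others:
                    column[site[i]] += 1
                total = sum(column.values())
                profile.append({a: c / total for a, c in column.items()})
            # Weight each candidate start by its likelihood ratio
            # against a uniform background (0.25 per base).
            weights = []
            for p in range(len(seq) - width + 1):
                w = 1.0
                for i in range(width):
                    w *= profile[i][seq[p + i]] / 0.25
                weights.append(w)
            # Sample (rather than maximize) the new start position.
            starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Toy sequences with a planted TATAAT, for illustration only.
toy_seqs = ["GCGCTATAATGCGC", "TTGTATAATCCAGG", "CCATGCTATAATGG", "GGTATAATCGCGTA"]
for seq, start in zip(toy_seqs, gibbs_motif_search(toy_seqs, width=6)):
    print(start, seq[start:start + 6])
```

Because positions are sampled rather than chosen greedily, different seeds can give different alignments; real tools run several chains and score the resulting motifs for significance.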

Phylogenetic Motif Discovery

Phylogenetic motif discovery relies on the principle that functional motifs, such as transcription factor binding sites, exhibit greater evolutionary conservation across related species than non-functional background sequences. This approach, often termed phylogenetic footprinting, uses multiple alignments of orthologous genomic regions from diverse species to identify conserved patterns indicative of regulatory elements. By incorporating phylogenetic relationships, these methods distinguish true motifs from spurious similarities arising from neutral evolution, thereby enhancing detection accuracy in complex genomes.

The process begins with aligning orthologous sequences from multiple species, typically using tools that account for evolutionary divergence. Conservation is then scored by modeling nucleotide substitutions along phylogenetic branches, often with substitution models such as the Jukes-Cantor model, which assumes equal rates of change among nucleotides and corrects for multiple substitutions at the same site. These models yield likelihoods for the observed alignments under constrained (motif) versus unconstrained (background) scenarios, highlighting regions where conservation exceeds expectation.

Prominent algorithms include PhyloGibbs, which extends Gibbs sampling to incorporate phylogenetic information through a Bayesian treatment of multiple alignments. It uses Markov chains to sample site configurations, modeling the evolution of weight matrix sites with fixed rates and branch-specific probabilities derived from phylogenetic trees. Another key method is MONKEY, which employs Markov models of evolution tailored to binding sites, integrating branch-specific evolutionary rates via the Halpern-Bruno model to compute likelihood ratios for conservation. Unlike de novo discovery methods that analyze unaligned sequences within a single species, these phylogenetic approaches explicitly model cross-species divergence to refine predictions.

These techniques offer significant advantages, including fewer false positives in large eukaryotic genomes because neutrally evolving sequences are filtered out, and they have been particularly effective for identifying cis-regulatory elements. For instance, PhyloGibbs recovers known binding sites with over 50% sensitivity at 50% specificity in intergenic regions, outperforming non-phylogenetic samplers. Similarly, MONKEY improves discrimination of functional sites, with 90% of positive Gal4p binding sites in yeast receiving lower p-values than under simple scoring methods.
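
The correction for multiple substitutions made by the Jukes-Cantor model mentioned above has a simple closed form, d = -(3/4) ln(1 - 4p/3), where p is the observed fraction of differing sites. The sketch below computes it for a pair of aligned sequences; the function name `jukes_cantor_distance` and the sequences are illustrative, and alignment columns containing a gap character are simply skipped, which is an assumption of this example.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance (expected substitutions per site) for an aligned pair."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore alignment columns that contain a gap in either sequence.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


# One mismatch in ten aligned positions: p = 0.1.
print(round(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"), 4))  # 0.1073
```

The corrected distance is slightly larger than the raw mismatch fraction because some sites are expected to have changed more than once.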

Motif Pair Discovery

Motif pair discovery focuses on identifying composite regulatory elements composed of two or more motifs that co-occur within sequences, such as enhancer modules in which the relative spacing, order, and orientation of motifs are critical for functional synergy in gene regulation. These pairs often represent cis-regulatory modules (CRMs) that integrate signals from multiple transcription factors, enabling precise control of gene expression in contexts like development and tissue specificity.

The typical process begins with de novo or database-driven discovery of single motifs to generate a set of candidates, followed by scanning sequences for pairwise co-occurrences and assessing their statistical significance to identify non-random associations. This step quantifies enrichment by comparing observed co-occurrence frequencies with the values expected under a null model of random motif placement, often with constraints on inter-motif distance for biological realism.

Key algorithms include Dyad-analysis, which systematically enumerates and counts pairs of short oligonucleotides (e.g., trinucleotides) separated by variable spacers (typically 0-20 bp) in upstream sequences, ranking them by a significance index based on over-representation relative to background expectations. Introduced by van Helden et al. in 2000, this method excels at detecting spaced dyads in prokaryotic and eukaryotic promoters, such as those bound by zinc-finger transcription factors. Another approach, ModuleMiner, optimizes combinations of position weight matrices (PWMs) for co-occurring motifs across the genome, scoring positional and combinatorial patterns in co-expressed gene sets. Developed by Van Loo et al. in 2008, ModuleMiner builds transcriptional regulatory models from these combinations, revealing tissue-specific CRMs with high specificity.

Applications of motif pair discovery are prominent in elucidating cis-regulatory modules that orchestrate gene regulation, such as distinguishing proximal promoters active in adult tissues from distal enhancers active in embryonic development through enriched motif pairs. These methods have facilitated the annotation of regulatory networks by prioritizing pairs for functional validation, enhancing predictions of combinatorial control in diverse biological systems.
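
The enumeration step of a dyad-style analysis, counting pairs of short words separated by a spacer, can be sketched as below. This covers only the counting; the significance index that ranks dyads against background expectations is not implemented here. The function name `count_dyads` and the toy sequences are hypothetical, and only one strand is scanned, which is an assumption of this sketch.

```python
from collections import Counter


def count_dyads(seqs, k=3, min_spacer=0, max_spacer=20):
    """Count (left word, spacer length, right word) triples on one strand."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - 2 * k + 1):
            left = seq[i:i + k]
            for spacer in range(min_spacer, max_spacer + 1):
                j = i + k + spacer          # start of the right-hand word
                if j + k > len(seq):
                    break                   # longer spacers run off the end
                counts[(left, spacer, seq[j:j + k])] += 1
    return counts


# Toy upstream sequences sharing a CGG...CCG dyad with an 11-nt spacer.
toy_upstream = [
    "CGG" + "A" * 11 + "CCG",
    "TT" + "CGG" + "ACGTACGTACG" + "CCG" + "T",
]
dyads = count_dyads(toy_upstream)
print(dyads[("CGG", 11, "CCG")])  # 2
```

A full analysis would compare each count with the number expected from background word frequencies and rank dyads by the resulting significance.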

Structure-Based Motif Recognition

Structure-based motif recognition involves mapping sequence patterns onto three-dimensional protein structures to identify functional motifs that may not be evident from sequence alone. This approach uses tools such as protein threading, which threads a query sequence onto known structural templates to assess compatibility and detect fold similarities indicative of motifs, as pioneered in early fold-recognition methods. Homology modeling complements this by constructing structural models from sequence alignments to homologous templates, enabling motif annotation in proteins without experimentally determined structures. These techniques reveal motifs by aligning sequences to structural scaffolds, often using energy-based scoring functions to evaluate fit.

Methods for motif recognition from protein structures typically employ representations like graph-based models, in which residues are nodes and spatial relationships are edges, allowing sequence-order-independent detection of motifs. For instance, the SiteMotif algorithm constructs distance matrices from residue coordinates and performs progressive alignments to derive consensus structural motifs across diverse protein families. Integration with sequence data is facilitated by structural alignment tools such as DALI, which compares distance matrices to identify conserved structural elements, and TM-align, which optimizes alignments using the TM-score for sensitive motif detection in remote homologs. These methods excel at capturing geometric patterns, such as ligand-binding configurations, that correlate with sequence motifs.

Recent advances incorporate deep learning algorithms, notably AlphaFold (version 2 released in 2021 and version 3 in 2024), which predicts high-accuracy structures from sequences, including interactions such as protein-ligand binding in AlphaFold 3, thereby uncovering hidden motifs in proteins whose structures were previously unknown. AlphaFold's architecture, trained on evolutionary and physical constraints, has enabled structure-based clustering of protein families to reveal functional motifs, as demonstrated by the discovery of deaminase activities through predicted structural similarities. This has transformed motif recognition by providing proteome-scale structural insights, with applications in grouping distantly related proteins that share catalytic motifs.

A key challenge in structure-based motif recognition is the indirect mapping between sequence and structure caused by conformational flexibility: proteins adopt multiple states that can obscure conserved motifs in static models. Dynamic regions, such as loops or intrinsically disordered segments, can lead to alignment inaccuracies, so ensemble predictions or flexibility-aware algorithms are needed to identify motifs robustly.

Examples and Applications

Protein Sequence Motifs

Protein motifs are short, conserved patterns within sequences that often confer specific functional roles on proteins, such as enzymatic activity, protein-protein interactions, or subcellular targeting. One prominent example is the pair of Walker A and B motifs, commonly found in nucleotide-binding proteins like ATPases and kinases. The Walker A motif, also known as the P-loop, has the consensus GxxxxGK[T/S], where x represents any amino acid; this glycine-rich loop interacts with the phosphate groups of ATP or GTP to facilitate nucleotide binding. The nearby Walker B motif has the consensus hhhhDE, with h denoting hydrophobic residues followed by aspartate and glutamate, which coordinates magnesium ions and enables nucleotide hydrolysis. These motifs are evolutionarily conserved across diverse protein families, underscoring their role in energy-dependent processes.

Another key example is the leucine zipper motif, which promotes protein dimerization, particularly in transcription factors of the bZIP family. This motif consists of a heptad repeat (a-b-c-d-e-f-g)n in which leucine residues occupy the d position every seven residues, forming a coiled-coil structure that stabilizes homodimers or heterodimers. Signal peptides exemplify motifs involved in protein localization; these N-terminal sequences, typically 16-30 residues long, feature a positively charged n-region, a hydrophobic h-region of 7-15 residues, and a c-region with a cleavage site that often follows the A-X-A rule (X being any amino acid). They direct nascent proteins to the secretory pathway or to specific organelles like the endoplasmic reticulum via recognition by the signal recognition particle. Zinc finger motifs, such as the C2H2 type, mediate DNA binding in many transcription factors; their consensus is C-X(2-4)-C-X(12)-H-X(3-5)-H, where X(n-m) denotes n to m arbitrary residues and the two cysteines and two histidines coordinate a zinc ion to stabilize an alpha-helix that inserts into the DNA major groove.

Databases like PROSITE provide comprehensive annotations of these motifs, cataloging over 1,900 documentation entries (including 1,311 patterns and 1,403 profiles) for protein families, domains, and functional sites, with patterns and profiles derived from aligned sequences, as of release 2025_04. In UniProtKB, annotations appear in more than 250 million protein sequences as of 2025, highlighting the widespread prevalence of motifs across proteomes, such as Walker motifs in over 50,000 ATP-binding entries and leucine zippers in hundreds of transcription factors. These resources enable functional prediction by matching query sequences to documented motif instances.

From an evolutionary perspective, protein sequence motifs serve as modular building blocks that facilitate domain shuffling and functional innovation without disrupting overall protein architecture. This modularity allows motifs to be recruited across unrelated proteins, driving the diversification of enzymatic and regulatory functions through gene duplication, fusion, and exon shuffling. For instance, the repeated emergence of zinc finger motifs in eukaryotic genomes reflects their adaptability in expanding transcriptional control networks.

DNA and RNA Regulatory Motifs

DNA and RNA regulatory motifs are short, conserved nucleotide sequences within non-coding regions that play crucial roles in modulating gene expression through interactions with proteins or other nucleic acids. These motifs often function as binding sites for transcription factors in DNA or as recognition elements in RNA, influencing processes such as transcription initiation, translation efficiency, and post-transcriptional regulation. Unlike protein motifs, which primarily affect structural or catalytic functions, nucleic acid regulatory motifs control the timing, location, and level of gene activity, enabling precise cellular responses to environmental cues.

A prominent example of a regulatory DNA motif is the TATA box, typically consisting of the consensus TATAAA located approximately 25-35 base pairs upstream of the transcription start site in eukaryotic promoters. This motif serves as a core promoter element that recruits the TATA-binding protein (TBP), a subunit of the transcription factor TFIID, to facilitate assembly of the pre-initiation complex and accurate transcription initiation. The TATA box is particularly enriched in genes involved in stress responses and development, where inducible expression is required. In prokaryotes, the Shine-Dalgarno sequence, with the consensus AGGAGG, is located 6-10 nucleotides upstream of the start codon in bacterial mRNA and base-pairs with the 3' end of the 16S rRNA to position the ribosome for translation initiation, enhancing translational efficiency.

Transcription factor binding sites (TFBS) represent a diverse class of DNA regulatory motifs, often 6-20 nucleotides long, that exhibit sequence-specific affinity for particular transcription factors to activate or repress gene transcription. These motifs are modeled using position weight matrices (PWMs) to account for positional nucleotide preferences and are cataloged in databases like JASPAR, which provides over 2,300 curated, non-redundant profiles derived from experimental data across taxa, as of the 2024 release. Recent JASPAR updates integrate chromatin immunoprecipitation followed by sequencing (ChIP-seq) data to refine PWM accuracy, linking motifs to in vivo binding events and improving predictions of regulatory landscapes. In RNA, microRNA (miRNA) seed motifs involve 6-8 nucleotides of complementarity between the miRNA's 5' seed region (positions 2-8) and the 3' untranslated regions of target mRNAs, enabling Argonaute protein-mediated silencing through mRNA destabilization or translational repression; this mechanism regulates up to 60% of human genes.

Regulatory motifs vary significantly across cell types and conditions, with tissue-specific enrichment influencing expression patterns; for instance, certain TFBS motifs are overrepresented in liver enhancers compared with other tissues, correlating with organ-specific transcriptional programs. Epigenetic modifications, such as DNA methylation and histone modification, further modulate motif accessibility by altering chromatin structure: hypermethylated CpG islands within motifs can block TF binding, while open chromatin states enhance it, as evidenced in genome-wide studies of epigenetic marks. These variations underscore the dynamic nature of regulatory motifs in adapting to developmental and environmental contexts.
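
The miRNA seed matching described above reduces to finding the reverse complement of miRNA positions 2-8 in a 3' UTR. The following sketch shows that lookup under stated assumptions: only perfect Watson-Crick pairing is considered (no G:U wobble and no distinction between site types), the function name `seed_sites` is hypothetical, the miRNA is a let-7-like sequence used purely for illustration, and the UTR fragment is invented.

```python
# RNA complement table for perfect Watson-Crick pairing only.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_sites(mirna, utr, seed_start=2, seed_end=8):
    """Return the target-site sequence and its 0-based start positions in the UTR."""
    seed = mirna[seed_start - 1:seed_end]        # 1-based positions 2..8
    target = seed.translate(COMPLEMENT)[::-1]    # reverse complement on the mRNA
    positions = []
    pos = utr.find(target)
    while pos != -1:
        positions.append(pos)
        pos = utr.find(target, pos + 1)          # allow overlapping hits
    return target, positions


mirna = "UGAGGUAGUAGGUUGUAUAGUU"       # let-7-like sequence, for illustration
utr = "AAGCUACCUCAUUUCUACCUCGG"        # hypothetical 3' UTR fragment
print(seed_sites(mirna, utr))          # ('CUACCUC', [3, 14])
```

Target prediction tools add further criteria on top of this exact match, which are not modeled here.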

Three-Dimensional Motifs

Three-dimensional motifs, also known as structural motifs, are recurring spatial patterns in the folded structures of biomolecules, distinct from linear sequence arrangements. In proteins, these motifs typically involve combinations of secondary structure elements, such as alpha helices and beta sheets, that form functional units like the helix-turn-helix (HTH) motif, in which two alpha helices are connected by a short turn to facilitate DNA binding. In RNA, 3D motifs appear as modular elements within secondary structures, often in loop regions, comprising non-canonical base pairs that contribute to tertiary folding and molecular recognition. These motifs are conserved across evolutionarily diverse molecules, enabling shared functions despite sequence variability.

3D motifs are often represented with chain codes, which encode the geometric paths of polypeptide or polynucleotide backbones in three-dimensional space and allow compact description of curves and folds for comparison and search. Alternatively, graph-based approaches model motifs as subgraphs in protein networks, where nodes represent residues or domains and edges denote spatial proximities or interactions, facilitating the detection of recurrent topological patterns. Prominent examples include the EF-hand motif in calcium-binding proteins, a helix-loop-helix structure that coordinates Ca²⁺ ions through a 12-residue loop with conserved aspartate and glutamate residues, enabling calcium-dependent signaling.

Recent advances in computational prediction, such as those from AlphaFold, have enabled the identification of novel 3D motifs by generating high-accuracy atomic models from sequences alone, revealing previously undetected structural patterns in uncharacterized proteins; further improvements in AlphaFold3 (2024) extend predictions to complexes that include ligands.
Applications of 3D motifs extend to drug design, where targeting these structural features, for example by disrupting motif-mediated protein-protein interactions (PPIs), allows selective inhibition of disease-related pathways, as seen in the development of small-molecule modulators of PPI interfaces. As of 2025, databases like have expanded to over 85,000 entries, integrating 3D motif data from sources like for broader functional annotation.
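The graph-based representation described in this section, with residues as nodes and spatial proximity as edges, can be sketched as a residue contact graph. The example below is a minimal illustration under stated assumptions: it takes one coordinate per residue (for instance the C-alpha atom), uses an 8 Å distance cutoff, and ignores residues closer than three positions apart in sequence. The function name `contact_graph`, both parameter defaults, and the toy coordinates are illustrative choices rather than values taken from any particular tool.

```python
import math
from itertools import combinations


def contact_graph(coords, cutoff=8.0, min_separation=3):
    """Build an adjacency map {residue index: set of contacting indices}.

    Two residues are linked when they are at least `min_separation`
    positions apart in sequence and within `cutoff` of each other in space.
    """
    graph = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if j - i < min_separation:
            continue  # skip trivially close sequence neighbors
        if math.dist(coords[i], coords[j]) <= cutoff:
            graph[i].add(j)
            graph[j].add(i)
    return graph


# Toy hairpin-like trace: residues 0-2 run along x, residues 3-5 run back.
hairpin = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0),
]
graph = contact_graph(hairpin)
print(sorted(graph[0]))  # prints [4, 5]
```

Recurring motifs could then be sought as repeated subgraphs across such contact graphs, which is a considerably harder search problem than building the graph itself.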
