
Nucleic acid sequence

A nucleic acid sequence is the linear order of nucleotides that forms the primary structure of deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), the fundamental carriers of genetic information in living organisms. Each nucleotide consists of a nitrogenous base, a five-carbon sugar (deoxyribose in DNA or ribose in RNA), and a phosphate group. The bases, adenine (A), guanine (G), cytosine (C), and thymine (T) in DNA, with uracil (U) replacing thymine in RNA, are linked in a specific linear order that encodes instructions for biological processes. By convention, these sequences are written and read from the 5' end to the 3' end, reflecting the directionality of the phosphodiester bonds that connect the nucleotides. In DNA, the sequence typically forms a double-stranded helix in which complementary bases pair (A with T, G with C), enabling stable storage of genetic data across generations. RNA sequences are generally single-stranded and versatile, functioning as messenger RNA (mRNA) for protein synthesis or as regulatory non-coding RNAs. The human genome, for instance, comprises approximately 3 billion base pairs of DNA sequence, underscoring the immense informational capacity of these molecules. Variations in sequences, known as mutations, can lead to diseases, which makes sequence analysis central to many fields of biology and medicine. Nucleic acids were discovered in the late 1860s by Friedrich Miescher and were later recognized as sequence-based carriers of genetic information through landmark experiments in the mid-20th century.

Components and Representation

Nucleotides

Nucleotides are the fundamental monomeric units that compose nucleic acid sequences, each consisting of a nitrogenous base, a five-carbon pentose sugar, and one or more phosphate groups attached to the sugar. The pentose sugar is ribose in ribonucleic acid (RNA) or 2'-deoxyribose in deoxyribonucleic acid (DNA), which differs by the absence of a hydroxyl group at the 2' carbon position. The phosphate group is typically linked to the 5' carbon of the sugar, forming a nucleoside monophosphate, though di- and triphosphate forms occur in metabolic contexts.

The nitrogenous bases in nucleotides are heterocyclic aromatic compounds classified into two main types: purines and pyrimidines. The purines, adenine (A) and guanine (G), feature a double-ring structure (a six-membered ring fused to a five-membered ring) with nitrogen atoms at positions 1, 3, 7, and 9. Adenine has an amino group at position 6, while guanine has a carbonyl group at position 6 and an amino group at position 2. The pyrimidines, cytosine (C), thymine (T), and uracil (U), possess a single six-membered ring with nitrogens at positions 1 and 3. Cytosine has an amino group at position 4 and a carbonyl group at position 2, thymine has a methyl group at position 5 and carbonyl groups at positions 2 and 4, and uracil mirrors thymine without the methyl group. In DNA, the canonical bases are adenine, guanine, cytosine, and thymine, while in RNA, uracil substitutes for thymine.

| Base | Type | DNA | RNA | Key Structural Features |
|---|---|---|---|---|
| Adenine (A) | Purine | Yes | Yes | Fused pyrimidine-imidazole rings; amino group at C6 |
| Guanine (G) | Purine | Yes | Yes | Fused rings; carbonyl at C6, amino at C2 |
| Cytosine (C) | Pyrimidine | Yes | Yes | Single ring; amino at C4, carbonyl at C2 |
| Thymine (T) | Pyrimidine | Yes | No | Single ring; methyl at C5, carbonyls at C2 and C4 |
| Uracil (U) | Pyrimidine | No | Yes | Single ring; carbonyls at C2 and C4 |

Nucleotides polymerize through phosphodiester bonds, in which the 5' phosphate of one nucleotide links to the 3' hydroxyl group of another, forming the sugar-phosphate backbone that provides structural integrity to the chain. This backbone alternates sugar and phosphate units, with the nitrogenous bases projecting inward or outward depending on the molecule's conformation. Beyond the canonical bases, nucleic acids contain non-canonical or modified bases, such as pseudouridine in RNA, which is an isomer of uridine with the base attached via a C-C bond rather than the standard N-glycosidic linkage. Other modified nucleosides, such as dihydrouridine, arise from post-transcriptional modifications to standard bases.

Notation Systems

Nucleic acid sequences are symbolically represented using single-letter abbreviations for the four standard nucleotide bases. In deoxyribonucleic acid (DNA), these are A for adenine, C for cytosine, G for guanine, and T for thymine. For ribonucleic acid (RNA), uracil (U) replaces thymine, resulting in A, C, G, and U. These abbreviations, established as a compact notation for sequence description, facilitate clear communication in scientific literature and databases.

By convention, nucleic acid sequences are written in the 5' to 3' direction, reflecting the polarity of the sugar-phosphate backbone: the 5' end terminates in a phosphate group attached to the 5' carbon of the sugar, and the 3' end has a free hydroxyl group on the 3' carbon. This directionality aligns with the biochemical processes of replication and transcription, in which new strands are synthesized from 5' to 3'. For example, a short DNA sequence might be denoted as 5'-ATCG-3', indicating the order of bases from the 5' end to the 3' end. RNA sequences follow the same convention, such as 5'-AUCG-3'. The prefixes 5' and 3' are often omitted when the direction is unambiguous, but explicit notation is used for clarity, especially in diagrams or when specifying strands.

To handle uncertainty or variability in sequencing data, the International Union of Pure and Applied Chemistry (IUPAC) and the International Union of Biochemistry (IUB), now IUBMB, introduced ambiguity codes in their recommendations. These single-letter symbols represent groups of bases, allowing concise notation for degenerate or polymorphic sites. For instance, N denotes any base (A, C, G, or T/U), R specifies a purine (A or G), and Y indicates a pyrimidine (C or T/U). The full set of IUPAC codes is as follows:
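
| Symbol | Bases Represented | Complementary Bases | Origin of Designation |
|---|---|---|---|
| A | A | T | Adenine |
| C | C | G | Cytosine |
| G | G | C | Guanine |
| T (DNA)/U (RNA) | T/U | A | Thymine/Uracil |
| R | A or G | Y | puRine |
| Y | C or T/U | R | pYrimidine |
| M | A or C | K | aMino |
| K | G or T/U | M | Keto |
| S | C or G | S | Strong (3 H-bonds) |
| W | A or T/U | W | Weak (2 H-bonds) |
| H | A or C or T/U | D | not-G (H) |
| B | C or G or T/U | V | not-A (B) |
| V | A or C or G | B | not-T/U (V) |
| D | A or G or T/U | H | not-C (D) |
| N | A or C or G or T/U | N | aNy |
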
These codes preserve complementary relationships: each symbol maps to the symbol for the complementary set of bases (e.g., R complements Y). The standardized notation evolved from early manual representations in the mid-20th century, which used full names or chemical formulas, to a unified system formalized by the IUPAC-IUB Commission on Biochemical Nomenclature in 1970. This addressed the growing need for consistency as techniques advanced, providing rules for abbreviations and symbols in publications. The 1970 recommendations were refined in 1984 to incorporate expanded ambiguity symbols for incompletely specified bases, reflecting improvements in sequencing accuracy.

For double-stranded sequences, notation distinguishes the two antiparallel strands, which are connected via Watson-Crick base pairing: adenine pairs with thymine (A-T) in DNA or uracil (A-U) in RNA, and guanine pairs with cytosine (G-C). The sense strand (often the coding strand) is typically written 5' to 3' on the top line, with its complement below in the 3' to 5' direction to reflect the antiparallel orientation. For example:

5'-ATCG-3'
3'-TAGC-5'

Ambiguity codes can be applied to either strand, with complementary symbols used for the opposite strand (e.g., an R on one strand corresponds to a Y on the complement). This format highlights base pairing and is essential for representing genomic regions or restriction sites.
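
The complement relationships in the IUPAC table can be expressed as a simple lookup. The following is a minimal Python sketch, assuming a DNA alphabet (T rather than U); the names `IUPAC_COMPLEMENT`, `complement`, and `reverse_complement` are illustrative and not taken from any particular library.

```python
# Complement map for the IUPAC DNA symbols listed in the table above.
# Assumption: DNA alphabet (T, not U); input is converted to upper case.
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "M": "K", "K": "M",
    "S": "S", "W": "W", "H": "D", "D": "H",
    "B": "V", "V": "B", "N": "N",
}


def complement(seq: str) -> str:
    """Base-by-base complement; reads 3'->5' if the input was 5'->3'."""
    return "".join(IUPAC_COMPLEMENT[base] for base in seq.upper())


def reverse_complement(seq: str) -> str:
    """Complementary strand rewritten in the conventional 5'->3' direction."""
    return complement(seq)[::-1]


print(complement("ATCG"))          # TAGC (the lower strand in the example)
print(reverse_complement("ATCG"))  # CGAT
print(reverse_complement("ARN"))   # NYT
```

Writing the lower strand of the two-line notation uses `complement`; producing a standalone 5' to 3' sequence for the opposite strand uses `reverse_complement`.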

Biological Roles

Genetic Information in DNA

DNA serves as the primary genetic material in most organisms, carrying the instructions necessary for development, functioning, growth, and reproduction. These sequences are organized into chromosomes, which are compact structures consisting of DNA wrapped around proteins, enabling efficient storage and transmission during cell division. In eukaryotic cells, the nuclear genome is divided among multiple linear chromosomes, while prokaryotes typically maintain a single circular chromosome.

The central dogma of molecular biology posits that genetic information is stored in DNA and flows unidirectionally to RNA and then to proteins, with DNA acting as the stable repository for hereditary information. First articulated by Francis Crick in 1958, this framework emphasizes DNA's role in information storage and replication, ensuring the faithful transmission of genetic instructions across generations.

Genes within the DNA sequence represent functional units that encode proteins or regulatory RNAs, structured with coding regions known as exons interspersed with non-coding introns, as well as upstream promoters that initiate transcription. For instance, the human β-globin gene consists of three exons separated by two introns. The exons contain the coding sequence (the first exon includes codons for the N-terminal amino acids of the β-globin protein), and the promoter features a TATA box with a consensus sequence like TATAAA approximately 25-30 base pairs upstream of the transcription start site, which helps recruit RNA polymerase.

DNA replication occurs via a semiconservative mechanism, in which each parental strand serves as a template for synthesizing a new complementary strand, resulting in two daughter molecules that each contain one original and one newly synthesized strand. This process, experimentally demonstrated by Meselson and Stahl in 1958 using density-labeled DNA in E. coli, maintains sequence fidelity with an error rate of approximately one per 10^9 base pairs after proofreading and repair.

Mutations, such as point substitutions, insertions, or deletions, introduce variations in DNA sequences that drive evolutionary change by generating the genetic diversity upon which natural selection acts. For example, a point mutation altering a single base can change an amino acid in a protein, while insertions or deletions may shift the reading frame, potentially leading to new traits or adaptations over time. In humans, the haploid genome comprises about 3 billion base pairs, carried on 23 pairs of chromosomes in diploid cells, and encodes roughly 20,000 protein-coding genes that collectively determine an individual's hereditary characteristics.
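
The difference between a point substitution and a frame-shifting insertion or deletion can be seen by translating a short coding sequence. The sketch below assumes the standard genetic code and uses a made-up 18-nucleotide reading frame; `CODON_TABLE` and `translate` are illustrative names, not part of the source.

```python
from itertools import product

# Standard genetic code, built in TCAG order (an assumption of this sketch;
# some organelles and organisms use variant codes).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate from position 0 in frame, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


reference = "ATGGCCGAGTTTAAACGT"                 # hypothetical toy sequence
point = reference[:7] + "T" + reference[8:]     # GAG -> GTG substitution
insertion = reference[:3] + "C" + reference[3:]  # one extra base: frameshift
deletion = reference[:3] + reference[4:]         # one base lost: frameshift

print(translate(reference))  # MAEFKR
print(translate(point))      # MAVFKR (one residue changed)
print(translate(insertion))  # MRRV   (frame shifted, premature stop)
print(translate(deletion))   # MPSLN  (frame shifted)
```

The substitution changes a single residue, whereas both single-base indels alter every downstream codon.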

Functional Roles in RNA

RNA sequences play diverse functional roles in cellular processes, extending beyond their role as transcripts of DNA templates. These roles are often determined by specific sequence motifs that enable base-pairing, structural folding, and interactions with proteins or other nucleic acids. In eukaryotes and prokaryotes, RNA types such as messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), and non-coding RNAs (ncRNAs) each exhibit sequence-dependent functions critical for gene expression and regulation.

mRNA carries coding sequences from genes to ribosomes for protein synthesis, with its untranslated regions (UTRs) containing regulatory sequences that influence stability, localization, and translation efficiency. For instance, in prokaryotes, the Shine-Dalgarno sequence, a purine-rich motif (typically AGGAGG) located 6-10 nucleotides upstream of the start codon, facilitates ribosome binding by base-pairing with the anti-Shine-Dalgarno sequence in 16S rRNA, enabling precise initiation. tRNA molecules feature anticodon sequences that base-pair with mRNA codons during translation, ensuring accurate amino acid incorporation; their cloverleaf secondary structure, formed by intramolecular base-pairing, is essential for recognition and function. rRNA forms the core of ribosomes, where conserved sequence elements drive inter- and intramolecular base-pairing to create complex secondary and tertiary structures that position catalytic sites for peptide bond formation.

Many RNA functions rely on sequence-driven secondary structures, such as hairpins and loops, which arise from complementary base-pairing and modulate activity. Hairpins, consisting of a double-stranded stem and a single-stranded loop, are prevalent in ncRNAs like microRNAs (miRNAs), where the stem-loop structure is processed into mature miRNA. For example, the let-7 miRNA hairpin is recognized by the LIN28A protein, which inhibits its maturation and thus regulates developmental timing. Loops, including apical or internal loops, often serve as binding sites for proteins or ligands, as seen in riboswitches: 5' UTR sequences in bacterial mRNAs that fold into alternative conformations upon metabolite binding, thereby switching between terminator and antiterminator structures to control transcription or translation.

Post-transcriptional modifications like RNA editing further diversify sequences and functions. Adenosine-to-inosine (A-to-I) editing, catalyzed by adenosine deaminase acting on RNA (ADAR) enzymes, targets double-stranded regions. Because inosine is read as guanosine during translation, editing can alter codons, splice sites, or miRNA targets, expanding transcript diversity and regulating innate immunity. In the brain, ADAR-mediated editing of glutamate receptor subunits fine-tunes neuronal signaling, with editing levels varying by tissue and developmental stage.

In viral RNA genomes, sequence features enable efficient replication and host interaction. Single-stranded RNA viruses like SARS-CoV-2 possess a ~30 kb positive-sense genome with open reading frames (ORFs) encoding structural and non-structural proteins. Key features include a slippery sequence and pseudoknot in the ORF1ab region that induce -1 ribosomal frameshifting (15-60% efficiency), which is essential for producing the replicase polyprotein. The virus's codon bias, favoring A/U-ending codons unlike the human host, optimizes viral translation while evading immune detection, and its 5' cap-like structure and 3' poly-A tail mimic host mRNAs for efficient expression.

Regulatory roles of RNA sequences often involve ncRNAs acting as enhancers or silencers of gene expression. miRNAs, short ncRNAs (~22 nt), bind complementary sites in mRNA 3' UTRs via base-pairing, guiding the RNA-induced silencing complex (RISC) to repress translation or promote degradation, thereby fine-tuning gene networks in development and disease. In mRNA 3' UTRs, AU-rich elements (AREs), sequences like AUUUA repeats, act as silencers by accelerating deadenylation and decay, as in tumor necrosis factor-alpha (TNF-α) mRNA, where they limit inflammatory responses; binding proteins like tristetraprolin (TTP) mediate this instability. Conversely, stabilizing elements in UTRs, such as G-quadruplexes or stem-loops, can enhance expression by protecting against nucleases.
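
The idea that a hairpin arises when one segment is the reverse complement of a nearby segment can be illustrated with a naive scan. This is a rough sketch only: it assumes perfect Watson-Crick stems of a fixed length, ignores G-U wobble pairs and thermodynamics, and the function names and default parameters are hypothetical. Real structure prediction uses energy-based folding algorithms.

```python
# Watson-Crick complement for RNA (assumption: no wobble or modified bases).
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def revcomp_rna(seq: str) -> str:
    """Reverse complement of an RNA string."""
    return seq.translate(RNA_COMPLEMENT)[::-1]


def find_hairpins(seq: str, stem: int = 4, min_loop: int = 3, max_loop: int = 8):
    """Return (start, loop_length) for every perfect stem-loop candidate.

    A candidate is a `stem`-long segment followed, after a loop of
    min_loop..max_loop nucleotides, by its exact reverse complement.
    """
    hits = []
    n = len(seq)
    for i in range(n - 2 * stem - min_loop + 1):
        target = revcomp_rna(seq[i:i + stem])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop          # start of the 3' arm of the stem
            if j + stem > n:
                break
            if seq[j:j + stem] == target:
                hits.append((i, loop))
    return hits


# Toy sequence: GGGC pairs with GCCC around a four-nucleotide AAAA loop.
print(find_hairpins("GGGCAAAAGCCC"))  # [(0, 4)]
```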

Sequencing Technologies

Historical Methods

Prior to the development of direct sequencing methods in the 1970s, determining nucleic acid sequences relied on indirect approaches, such as hybridization probes, which allowed partial characterization by detecting complementary base pairing between known oligonucleotide probes and target DNA or RNA under controlled conditions. These techniques, often used in conjunction with restriction enzyme mapping, provided limited insights into sequence motifs or restriction sites but could not yield complete linear orders of nucleotides because they relied on inference rather than direct readout. For instance, early hybridization experiments in the 1960s and early 1970s helped infer short RNA sequences, like those in transfer RNAs, by comparing melting temperatures and specificity of probe binding.

The first widely adopted direct DNA sequencing methods emerged in 1977 with the independent publications of the Maxam-Gilbert chemical cleavage technique and Sanger's chain-termination method, marking the onset of practical nucleic acid sequencing. In the Maxam-Gilbert approach, DNA is labeled at one end and subjected to base-specific chemical treatments (such as dimethyl sulfate for guanine, hydrazine for pyrimidines, or formic acid for adenine and guanine) to induce strand breaks at particular nucleotides. The resulting fragments are resolved by size using polyacrylamide gel electrophoresis, and the sequence is inferred from the band patterns. This method enabled the sequencing of up to several hundred base pairs but required hazardous chemicals and radioactive labeling, limiting its scalability.

Concurrently, Frederick Sanger and colleagues introduced the chain-termination method, also known as dideoxy sequencing, which enzymatically synthesizes DNA strands using DNA polymerase in the presence of normal deoxynucleotides (dNTPs) and chain-terminating dideoxynucleotides (ddNTPs). Because ddNTPs lack a 3'-hydroxyl group, they halt extension at random positions corresponding to each base. The resulting fragments are separated by gel electrophoresis, and the sequence is read from the band pattern; Sanger's earlier plus-minus method had instead relied on differential nucleotide incorporation. The enzymatic method proved more reproducible and safer than chemical cleavage, facilitating the first complete sequence of a DNA genome, the 5,386-nucleotide single-stranded genome of bacteriophage φX174, achieved by Sanger's team in 1977.

Early applications of these techniques included sequencing biologically significant genes, such as the rat insulin gene, which demonstrated their utility in elucidating eukaryotic regulatory elements and coding sequences. However, both methods were labor-intensive, requiring manual gel pouring, radioactive isotope handling, and film autoradiography for detection, often taking days per run. Read lengths were typically limited to 100-500 base pairs, with error rates around 1-2% due to compression artifacts and ambiguous band resolution, particularly in repetitive regions, restricting analyses to small genomes or targeted fragments. The transition to automated sequencing began in the 1980s with the replacement of radioactive labels by fluorescent dyes attached to ddNTPs, enabling four-color detection in a single lane and machine-based readout, which increased throughput and reduced manual effort while paving the way for larger-scale projects.
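
The logic of reading a chain-termination gel can be illustrated with an idealized simulation. The sketch below assumes that every possible termination position is represented in the fragment pool and that fragments resolve perfectly by length, neither of which holds exactly in practice; the function names are hypothetical.

```python
def chain_termination_lanes(new_strand: str) -> dict:
    """For each ddNTP, list the fragment lengths it would terminate.

    new_strand is the synthesized strand written 5'->3'. A fragment of
    length n ends with a dideoxy version of the n-th base.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes: dict) -> str:
    """Read the sequence by ordering all fragments from shortest to longest."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[n] for n in sorted(base_at_length))


lanes = chain_termination_lanes("GATTACA")
print(lanes)            # {'A': [2, 5, 7], 'C': [6], 'G': [1], 'T': [3, 4]}
print(read_gel(lanes))  # GATTACA
```

Each lane (or each dye color in automated instruments) reports which fragment lengths end in a given base, and sorting by length recovers the order of bases.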

Contemporary Techniques

Contemporary techniques in nucleic acid sequencing have revolutionized genomics through high-throughput, massively parallel methods that enable rapid and cost-effective analysis of DNA and RNA sequences. Next-generation sequencing (NGS), also known as second-generation sequencing, relies on amplifying and sequencing millions of DNA fragments simultaneously, achieving throughput in the gigabase range per run. These platforms have democratized sequencing, facilitating applications such as personalized medicine.

Key NGS platforms include Illumina's sequencing by synthesis, which uses reversible terminator nucleotides and detects each incorporated base via fluorescence during stepwise synthesis, producing short reads (typically 50-300 bp) with high accuracy (>99.9%). Ion Torrent employs semiconductor technology to measure pH changes from hydrogen ions released during nucleotide incorporation, offering faster turnaround times with read lengths around 200-400 bp.

Third-generation sequencing advances single-molecule analysis without amplification, reducing biases and enabling real-time data generation. Pacific Biosciences (PacBio) utilizes single-molecule real-time (SMRT) sequencing, in which a DNA polymerase incorporates fluorescently labeled nucleotides in zero-mode waveguides, generating long reads of 20 kb or more that are ideal for resolving structural variants. High-fidelity (HiFi) reads achieve >99% accuracy through circular consensus, though raw reads have higher error rates (around 10-15%). Oxford Nanopore Technologies' platform sequences DNA or RNA by passing molecules through protein nanopores and detecting ionic current disruptions as bases translocate, which allows portable, real-time analysis with read lengths exceeding 1 Mb. Accuracy has reached >99% at single-read and consensus levels as of 2025 through advanced basecalling algorithms and chemistry updates like R10.4.1, addressing early limitations in homopolymer resolution.

These techniques support diverse applications, such as whole-genome sequencing, where the cost of sequencing a human genome fell below $1,000 by 2020 and further declined to approximately $200-$600 as of 2025, enabling population-scale studies. Metagenomics profiles microbial communities directly from environmental samples, while single-cell RNA sequencing (scRNA-seq) captures transcriptomes from individual cells, revealing cellular heterogeneity. Challenges persist in error correction, particularly for repetitive regions that confound short-read assembly; assemblers such as Canu use long-read data to achieve near-complete assemblies.

Recent advances as of 2025 include ultra-high-throughput platforms like the Ultima Genomics UG 100, capable of sequencing at under $100 per genome, and novel spatial methods such as expansion genome sequencing for mapping DNA relative to cellular structures. CRISPR-based nucleic acid detection methods, such as SHERLOCK (using Cas13) and adaptations with CRISPR-Cas12a developed from the late 2010s onward, amplify and detect specific sequences with high sensitivity for diagnostics, complementing sequencing in point-of-care applications. For RNA, direct sequencing without cDNA conversion preserves modifications like m6A, using platforms like Oxford Nanopore to sequence native strands and reveal epitranscriptomic features.

Digital Handling

Data Formats

Nucleic acid sequence data is stored and exchanged using standardized text-based and binary formats that encode sequences, metadata, and quality information for computational processing. These formats facilitate interoperability across sequencing platforms and analysis tools, building on basic notation systems for nucleotides.

The FASTA format is a simple, human-readable text format consisting of a header line beginning with a greater-than symbol (>) followed by a sequence identifier and optional description, and subsequent lines containing the nucleotide or amino acid sequence, typically limited to 60-80 characters per line for readability. It supports basic sequence representation without quality scores, though a variant called QUAL extends it by pairing a FASTA-like sequence file with a corresponding quality file.

The FASTQ format builds on FASTA by incorporating per-base quality scores alongside the sequence, structured as four lines per record: a header starting with @, the sequence, a separator line starting with +, and a quality string equal in length to the sequence. Quality scores are encoded using Phred values, where $Q = -10 \log_{10} P$ and $P$ is the estimated probability of an incorrect base call, allowing quantitative assessment of sequencing accuracy. This format originated at the Sanger Institute and has variants for next-generation platforms like Illumina.

For aligned sequence reads, the Sequence Alignment/Map (SAM) format provides a tab-delimited text structure detailing mappings to a reference genome, including fields for read name, flags (bitwise integers indicating properties like paired-end status), reference sequence name, position, mapping quality, and optional tags for additional metadata. Its binary counterpart, BAM, compresses SAM data losslessly for efficient storage and indexing.

Specialized formats enrich sequences with annotations. GenBank uses a flat-file structure with sections for locus, definition, features (e.g., genes, exons), and the sequence itself, enabling detailed biological context like source organism and publication references. The General Feature Format (GFF), particularly GFF3, employs a nine-column tab-delimited layout per feature line to specify genomic elements such as genes or regulatory regions, with columns for sequence ID, source, type, start and end coordinates, score, strand, phase, and attributes.

To address growing data volumes, formats have evolved toward stronger compression. For instance, CRAM (Compressed Reference-oriented Alignment Map) refines BAM by storing reads as differences from a reference sequence, with lossy or lossless encoding, achieving typical file size reductions of 30-50% over BAM while maintaining compatibility with existing tools. Broader compression techniques, including reference-based approaches, further reduce genomic dataset sizes by 50-90% in specialized implementations, balancing efficiency with accessibility for large-scale analyses.
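
The four-line FASTQ record layout and the Phred relationship can be sketched in a few lines of Python. This example assumes the common ASCII offset of 33 for quality characters (the offset is not stated in the text above, and older Illumina pipelines used 64) and assumes records are not line-wrapped; `parse_fastq`, `phred_scores`, and `error_probability` are illustrative names.

```python
import io


def parse_fastq(handle):
    """Yield (identifier, sequence, quality_string) from unwrapped FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:                      # end of input
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(qual: str, offset: int = 33):
    """Decode a quality string; offset 33 is an assumption of this sketch."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P) to recover P."""
    return 10 ** (-q / 10)


example = io.StringIO("@read1 toy example\nACGT\n+\nII5!\n")
for name, seq, qual in parse_fastq(example):
    scores = phred_scores(qual)
    print(name, seq, scores)  # read1 toy example ACGT [40, 40, 20, 0]
    print([error_probability(q) for q in scores])
```

A score of 20 corresponds to a 1% error probability and a score of 40 to 0.01%.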

Storage and Databases

The primary repositories for nucleic acid sequences are maintained through the International Nucleotide Sequence Database Collaboration (INSDC), a longstanding partnership among GenBank (hosted by the National Center for Biotechnology Information, NCBI, in the United States, established in 1982), the European Nucleotide Archive (ENA, managed by the European Molecular Biology Laboratory's European Bioinformatics Institute, EMBL-EBI), and the DNA Data Bank of Japan (DDBJ). These organizations synchronize their data daily to provide a unified view of global nucleotide sequence submissions, encompassing raw reads, assemblies, and annotations from diverse sources including genomic projects and research submissions.

GenBank sequences are distributed in flat file formats that include detailed annotations such as locus identifiers, features, and bibliographic references alongside the primary sequence data. Comprehensive releases occur bimonthly, with daily incremental updates available via FTP to reflect ongoing submissions and ensure timely access. As of August 2025, GenBank alone holds over 47 trillion base pairs across nearly 6 billion records, reflecting rapid growth driven by advances in sequencing technologies.

Specialized repositories complement these core databases by focusing on niche aspects of nucleic acid data. RNAcentral serves as a centralized hub for non-coding RNA sequences, aggregating data from 52 expert databases to provide unified access to ncRNA types such as miRNAs, lncRNAs, and snoRNAs across organisms. The 1000 Genomes Project, an international effort, maintains a detailed catalog of human genetic variation, including millions of single nucleotide polymorphisms (SNPs) and structural variants derived from sequencing over 2,500 individuals, supporting population genetics and disease association studies.

Managing these repositories poses significant challenges due to the explosive growth of sequence data, projected to require up to 40 exabytes of storage capacity by 2025 for human genomics alone. Privacy concerns are paramount in human-related datasets, with regulations like the European Union's General Data Protection Regulation (GDPR) mandating strict controls on data sharing, consent, and re-identification risks to prevent misuse of sensitive genetic information. Versioning systems are essential to track updates and revisions in sequence records, ensuring reproducibility while handling the complexity of iterative assemblies and annotations.

Access to these databases is facilitated by tools like NCBI's Entrez system, which integrates nucleotide, protein, and genomic data for cross-database searching and retrieval. The Basic Local Alignment Search Tool (BLAST) enables rapid similarity searches against these repositories, supporting tasks from homology detection to functional annotation. In the 2020s, integrations with ontologies, such as the Sequence Alignment Ontology (SALON), have enhanced querying by enabling semantic searches and automated interpretation of alignments and metadata.

Analytical Approaches

Sequence Alignment

Sequence alignment is a fundamental computational technique in bioinformatics used to identify regions of similarity between sequences, which can indicate functional, structural, or evolutionary relationships. By comparing two or more sequences, alignments reveal conserved regions, insertions, deletions (indels), and substitutions, aiding in the inference of biological processes such as function prediction and phylogenetic analysis.

Pairwise sequence alignment compares two sequences to find the optimal arrangement that maximizes similarity. The Needleman-Wunsch algorithm, introduced in 1970, performs global alignment using dynamic programming, ensuring that the full length of both sequences is considered from end to end. This method constructs a scoring matrix in which each cell holds the best alignment score up to that position, then traces back through the matrix to recover the alignment path. For nucleotide sequences, it employs a substitution matrix to score matches and mismatches. A common simple scheme assigns +1 for identical bases and -1 for differences, though other matrices, such as NUC.4.4 (which also scores IUPAC ambiguity codes), or schemes that weight transitions and transversions differently, can be used.

In contrast, the Smith-Waterman algorithm, developed in 1981, focuses on local alignment to detect high-similarity regions within longer sequences, which is particularly useful for identifying conserved domains in divergent nucleic acids. It modifies the Needleman-Wunsch approach by initializing the matrix with zeros and resetting negative scores to zero, preventing penalties from propagating across unrelated regions.

Both algorithms incorporate gap penalties to handle indels. Linear penalties charge a constant cost $d$ per gap position, so a gap of length $k$ contributes $g(k) = -dk$. Affine penalties, for which Gotoh published an efficient algorithm in 1982, distinguish a gap-opening cost $a$ from a smaller per-position extension cost $e$, giving $g(k) = -a - (k-1)e$; this better models biological insertion and deletion events by penalizing the start of a gap more heavily than its continuation. The total alignment score can be expressed as:

$S = \sum_{i} s(x_i, y_i) + \sum_{\text{gaps}} g(k)$

where $s(x_i, y_i)$ is the substitution score for each aligned pair of positions, and $g(k)$ is the (typically negative) penalty for each gap of length $k$.

For comparing more than two sequences, multiple sequence alignment (MSA) extends pairwise methods to reveal patterns across a set. Progressive alignment strategies, a cornerstone of MSA, build alignments iteratively: first, a distance matrix is computed from pairwise scores, then a guide tree is constructed via hierarchical clustering, and finally sequences are aligned following the tree branches, starting with the most similar pairs. Clustal Omega, released in 2011, implements this approach with enhancements like mBed for large-scale alignments, enabling rapid processing of thousands of sequences while maintaining accuracy comparable to slower methods. Similarly, MAFFT, first described in 2002, uses the fast Fourier transform to rapidly identify homologous regions, accelerating progressive alignment and supporting iterative refinement for improved handling of divergent sequences.

These alignments have key applications in detecting homology between sequences, where significant similarity suggests shared ancestry, and in constructing evolutionary trees by using scores or distances to infer phylogenies. They are essential for managing indels in highly divergent sequences, as gap models allow flexible insertions without overly disrupting conserved regions. In the era of next-generation sequencing (NGS), particularly with long-read technologies like PacBio and Oxford Nanopore, alignment methods have evolved to accommodate error-prone, lengthy reads. Tools such as Minimap2 employ seed-and-extend heuristics with affine gaps to map these reads efficiently against reference genomes, addressing challenges like structural variants that short-read aligners struggle with.
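
The dynamic-programming fill and traceback described above can be sketched compactly. This is a minimal illustration of Needleman-Wunsch global alignment using the simple +1/-1 scheme and a linear gap penalty of -1 per position; the parameter defaults and function name are choices made for this example, and production tools use affine gaps and far more optimized code.

```python
def needleman_wunsch(x: str, y: str, match: int = 1, mismatch: int = -1,
                     gap: int = -1):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_x, aligned_y). When several alignments are
    optimal, the traceback prefers diagonal moves, then gaps in y.
    """
    n, m = len(x), len(y)
    # score[i][j] = best score aligning x[:i] with y[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align x[i-1], y[j-1]
                              score[i - 1][j] + gap,     # gap in y
                              score[i][j - 1] + gap)     # gap in x
    # Traceback from the bottom-right corner to the origin.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

Turning this into Smith-Waterman local alignment would involve initializing the borders to zero, clamping each cell at zero, and starting the traceback from the highest-scoring cell instead of the corner.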

Motif Detection

Sequence motifs are short, recurring patterns in DNA or RNA sequences, typically 6-20 base pairs long, that often indicate functional elements due to their conservation across related sequences. For example, the TATA box in eukaryotic promoters, with the consensus sequence TATAAA, serves as a binding site for the TATA-binding protein to initiate transcription. These motifs can vary slightly in sequence but maintain functional significance through evolutionary conservation.

Motifs are classified by their biological roles, including regulatory motifs in enhancers that control gene expression, structural motifs such as ribosome binding sites (RBS) in mRNA, and protein-binding motifs like transcription factor binding sites (TFBS). Regulatory motifs, such as those in enhancers, recruit transcription factors to modulate gene activity in specific cellular contexts. Structural motifs, prominent in RNA, include the Shine-Dalgarno sequence (e.g., AGGAGG) upstream of start codons in prokaryotic mRNA, which facilitates ribosome assembly for translation initiation. Protein-binding motifs encompass TFBS in DNA, where sequence patterns enable specific interactions with regulatory proteins, and analogous sites in RNA for RNA-binding proteins. RNA motifs, often underemphasized, play critical roles in processes like splicing and RNA stability, with examples including internal ribosome entry sites (IRES) that direct cap-independent translation.

Key tools for motif detection include MEME (Multiple EM for Motif Elicitation), which uses expectation maximization to discover ungapped motifs in unaligned sequences by modeling them as position-specific scoring matrices. PROSITE provides a database of documented patterns and profiles for protein sequences, which aids in annotating domains and sites in the products of protein-coding nucleic acid sequences. For motifs with positional variability, position weight matrices (PWMs) represent the likelihood of each nucleotide at every position, derived from aligned sequences. The score for a candidate sequence is calculated as

$S = \sum_{j} \log_2 \left( \frac{f_{j,b}}{b_b} \right)$

where the sum runs over motif positions $j$, $b$ is the base of the candidate at position $j$, $f_{j,b}$ is the observed frequency of base $b$ at position $j$, and $b_b$ is the background frequency of base $b$; higher scores indicate better matches.

Motif detection enables applications in predicting gene regulation, where identified patterns forecast enhancer activity or TFBS occupancy to model expression dynamics. It also supports functional annotation by linking motifs to biological roles, such as classifying non-coding regions as regulatory elements. Advances in machine learning, starting with DeepBind in 2015, employ convolutional neural networks to predict protein-DNA and protein-RNA binding specificities from sequence data, outperforming traditional PWM-based methods on large datasets. Post-2020 integrations of deep learning, including transformer-based models, have enhanced motif discovery in RNA contexts by incorporating structural features and improving accuracy on high-throughput data like CLIP-seq, addressing gaps in earlier approaches.
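
The log-odds scoring just described can be sketched as follows. The example assumes a uniform background (0.25 per base) and adds a pseudocount of 1 to each base count so that unobserved bases do not produce a logarithm of zero; both choices, the toy binding sites, and the function names are assumptions of this sketch rather than details from the source.

```python
import math
from collections import Counter


def build_pwm(sites, background=None, pseudocount: float = 1.0):
    """Return a list of {base: log2(f_jb / b_b)} dicts, one per position."""
    background = background or {b: 0.25 for b in "ACGT"}
    total = len(sites) + 4 * pseudocount
    pwm = []
    for j in range(len(sites[0])):
        counts = Counter(site[j] for site in sites)
        pwm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background[b])
            for b in "ACGT"
        })
    return pwm


def score(pwm, candidate: str) -> float:
    """Sum the per-position log-odds for a candidate of motif length."""
    return sum(column[base] for column, base in zip(pwm, candidate))


def best_hit(pwm, seq: str):
    """Slide the PWM along seq; return (best_score, start_index)."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


# Hypothetical aligned TATA-box-like sites, used only for illustration.
sites = ["TATAAA", "TATAAA", "TATATA", "TATAAA"]
pwm = build_pwm(sites)
print(score(pwm, "TATAAA") > score(pwm, "GCGCGC"))  # True
print(best_hit(pwm, "GGCTATAAAGGC")[1])             # 3
```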

Complexity Measures

Complexity measures in nucleic acid sequences quantify the variability, randomness, and information content inherent in DNA or RNA strings, providing insights into their structural and functional properties independent of relational comparisons like alignments. These metrics assess how unpredictable or repetitive a sequence is, which correlates with biological constraints such as evolutionary pressures or mutational patterns. For instance, highly random sequences approach maximum entropy, indicating minimal redundancy, while repetitive or biased ones exhibit lower complexity, often reflecting functional adaptations.

A primary measure is Shannon entropy, which evaluates the uncertainty or information content per base in a sequence. Defined as H = -\sum_i p_i \log_2 p_i, where p_i is the frequency of each base (A, C, G, T/U), it ranges from 0 bits per base for a fully repetitive sequence to 2 bits per base for a uniformly random one with equal base frequencies. The metric is adapted from information theory; for example, coding regions in genomes often show values around 1.8-1.9 bits due to codon usage biases, while non-coding repeats drop below 1 bit. Shannon entropy can be computed per position from base frequencies in aligned sequences, but it applies equally to the composition of an individual sequence.

Other metrics complement entropy by capturing different aspects of repetitiveness and compositional bias. Lempel-Ziv complexity approximates Kolmogorov complexity by counting the distinct substrings encountered in a sequence during compression-like parsing, yielding a normalized score between 0 (purely repetitive) and 1 (incompressible); it is particularly useful for identifying low-complexity regions in genomes, such as tandem repeats, where scores below 0.3 indicate high repetitiveness. GC-content bias, measured as the deviation from a 50% guanine-cytosine proportion (e.g., via |\text{GC\%} - 50|), influences perceived complexity by skewing base distributions, with extreme biases (e.g., >70% GC in vertebrate CpG islands) reducing entropy and affecting evolutionary analyses.
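The three single-sequence measures above can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, the Lempel-Ziv routine uses a simple exhaustive-history (1976-style) parsing, and the normalization c · log_4(n) / n is one common choice that is assumed here and can slightly exceed 1 for short sequences.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy in bits per base, from the sequence's own composition."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    # sum of p * log2(1/p), which equals -sum(p * log2(p))
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())

def gc_deviation(seq):
    """Absolute deviation of GC percentage from 50, i.e. |GC% - 50|."""
    if not seq:
        raise ValueError("sequence must not be empty")
    gc = sum(1 for base in seq if base in "GC")
    return abs(100.0 * gc / len(seq) - 50.0)

def lz_phrase_count(seq):
    """Number of phrases in an exhaustive-history Lempel-Ziv parsing.

    Each phrase is extended while it can still be copied from earlier text
    (overlap with the phrase itself is allowed), then closed.
    """
    i, count, n = 0, 0, len(seq)
    while i < n:
        length = 1
        while i + length <= n and seq[i:i + length] in seq[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count

def lz_normalized(seq, alphabet_size=4):
    """Phrase count scaled by n / log_k(n); an assumed normalization."""
    n = len(seq)
    if n < 2:
        return 0.0
    return lz_phrase_count(seq) * math.log(n, alphabet_size) / n

repeat = "A" * 64          # fully repetitive example
mixed = "ACGT"             # each base appears exactly once
```

For the fully repetitive example the entropy is 0 bits and the normalized Lempel-Ziv score falls well below the 0.3 level mentioned above, while the four-base example reaches the 2-bit maximum. The quadratic substring search is acceptable for short illustrative inputs but would be too slow for whole genomes.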
k-mer diversity assesses repetitiveness by counting unique substrings of length k (typically 3-6 nucleotides), where lower diversity signals tandem repeats or segmental duplications, as observed in repetitive regions of eukaryotic genomes.

These measures find applications in distinguishing functional genomic elements and analyzing viral quasispecies. In distinguishing coding from non-coding regions, low entropy and Lempel-Ziv scores (e.g., <0.4) mark non-coding areas with repeats, while higher entropy values (~1.9 bits) typify protein-coding exons under selective pressure for diversity. For viral quasispecies (clouds of mutant variants), Shannon entropy quantifies intra-host diversity, with higher values indicating substantial mutation rates in variable regions. Compression efficiency, tied to Lempel-Ziv complexity, optimizes storage of repetitive genomes. Biologically, low entropy often arises in regulatory regions due to physicochemical constraints, such as secondary structure in promoters, where base pairing reduces variability to ~1.2 bits compared to 1.8 in intergenic spacers.

Tools like the Entropy-One calculator from the HIV Databases compute site-specific Shannon entropy for aligned sequences, facilitating variability analysis in viral datasets. In metagenomics, entropy profiles reveal community diversity, with recent studies using energy entropy vectors to encode microbial sequences for efficient analysis, capturing up to 95% of variability in gut microbiomes. Emerging AI models, such as those predicting long-range dependencies, infer complexity from raw sequence data, enhancing detection of cryptic regulatory motifs missed by traditional metrics.
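As a rough illustration of two ideas from this section, the sketch below computes k-mer diversity for a single sequence and site-specific Shannon entropy for a small alignment. The function names, the normalization of k-mer diversity by the number of windows (capped at 4^k), and the choice to ignore gap characters are assumptions made for the example; the code does not reproduce the method of Entropy-One or any other tool.

```python
import math
from collections import Counter

def kmer_diversity(seq, k=3):
    """Return (distinct k-mer count, fraction of the maximum possible).

    The maximum is taken as min(number of windows, 4**k), an assumed
    normalization for a four-letter alphabet.
    """
    windows = len(seq) - k + 1
    if windows <= 0:
        raise ValueError("sequence must be at least k bases long")
    distinct = {seq[i:i + k] for i in range(windows)}
    return len(distinct), len(distinct) / min(windows, 4 ** k)

def site_entropies(alignment, gap="-"):
    """Shannon entropy (bits) of each alignment column, ignoring gaps."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for j in range(length):
        column = [row[j] for row in alignment if row[j] != gap]
        n = len(column)
        if n == 0:
            entropies.append(0.0)  # column of gaps only: treated as invariant
            continue
        entropies.append(
            sum((c / n) * math.log2(n / c) for c in Counter(column).values())
        )
    return entropies

# Made-up inputs for illustration only.
tandem = "ACACACACAC"                           # tandem repeat of AC
varied = "ACGTTGCAAG"                           # no repeated 3-mer
variants = ["ACGT", "ACGA", "ACGC", "ACGG"]     # last site fully variable

tandem_count, tandem_fraction = kmer_diversity(tandem, k=3)
varied_count, varied_fraction = kmer_diversity(varied, k=3)
per_site = site_entropies(variants)
```

In this toy input the tandem repeat contains only two distinct 3-mers across eight windows, whereas every window of the varied sequence is unique, and the per-site entropies are zero at the three conserved columns and 2 bits at the fully variable last column.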
